EMA White Paper - Data Warehouse

THE CONTEXTUAL DATA LAKE
Maximizing Data Lake Value via Hybrid Environments That Provide Completeness, Context, and Accelerated Analytics Capability

The Contextual Data Lake

SUMMARY

Businesses are increasingly turning to data lakes as a means of addressing the challenges associated with managing big data. These organizations face a certain amount of confusion and ambiguity when setting out to implement a data lake solution because there is no single definitive “data lake” model, but rather a variety of options for how this component of the enterprise data fabric can be architected and implemented.

Much of the current discussion about data lakes centers on Hadoop, which is – without question – a core big data technology. However, an exclusive focus on Hadoop is as misdirected as an exclusive focus on traditional data warehousing technology would be. The false dichotomy between these two approaches tends to obscure the fact that a hybrid environment offers many advantages over either exclusive focus and is currently the most likely option for most organizations.

SAP offers solutions for real-time operations, data warehousing, and managing big data to support a wide range of options for implementing and managing a hybrid data lake environment. Properly managed, a hybrid environment enables the implementation of a true contextual data lake: an evolutionary step up from the non-contextual data lake (the data swamp) and a step toward the real-time, virtual data lake environments to come.


Copyright © 2015 SAP, Inc.


INTRODUCTION

Rethinking Data Architecture

In a world that is increasingly information-driven, data management is no longer merely a reflection of your organization’s administrative competency, but rather a unique strategic differentiator that can mean the difference between success and failure in the marketplace. The companies that realize success and growth in this new era will be those that can adapt to this challenge with a core strategy that uses big data to transform their businesses. Such transformation requires recognizing that big data is more than just a few new concepts and technologies; it is a whole new paradigm.¹

With this new paradigm come new challenges. Primary among these is the fact that traditional data architectures are inadequate to deal with the new demands placed upon them by a massive influx of new information. Large social media companies like Facebook deal with hundreds of terabytes of information each day. A fleet of commercial jets can create similar amounts of data from a single day of operations. Traffic cameras, environmental sensors, and cell phones create and use untold billions of pieces of data that range from simple numerical data to voice, text, and video information.

All of that data is moving at speeds that make velocity a critical concern for data management. Whether it is click imprints on an online ad, data exchange between machines, online gaming information, just-in-time inventory updates, or the constant tracking of activity in the world’s stock exchanges, the speed at which data is created and moves is dizzying. Increasingly, the world of big data is also the world of real-time data. Businesses no longer have the luxury of choosing whether they want to go big or go fast. They must do both. Moreover, the variety of data types poses a unique challenge.
Gone are the days when the enterprise data architecture could assume that all needed data would be structured or that it would fit easily into a conventional enterprise data model. Today’s environments encompass more than just text, numbers, and the occasional image, and more than just transactions from a traditional OLTP system. They involve complex audio, video, and 3D image files; telemetry, log files, and other machine data; and a whole host of social media and other user- or customer-generated data. All of these changes add up to the need for a new way of looking at data architecture.

1. Beyer, Mark (June 27, 2011). “Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data.” Gartner. http://www.gartner.com/newsroom/id/1731916


The Enterprise Data Warehouse vs. The Data Lake

Introduced as a means of managing growing volumes of data and consolidating enterprise data from multiple sources, the enterprise data warehouse (EDW) is, in many ways, the original “big data” solution. The EDW supports reporting and analysis on (usually) highly structured data, playing a critical role for many organizations in enabling both internal and external reporting – ensuring, among other things, SLA (service-level agreement) and regulatory compliance – as well as more strategic analysis of business processes, market performance, and the like. The information managed by these environments complies with a defined schema, which optimizes searches, reporting, and analysis. Large subsets of detailed information can be easily searched and analyzed, enabling both data mining and more advanced predictive analysis.

What the EDW cannot do, however, is effectively manage much of the new data variety that makes up a growing share of the typical big data environment. While most EDW architectures are scalable, their reliance on a predefined schema limits their flexibility, and therefore their applicability, as repositories for the full enterprise dataset in the era of big data.

The data lake concept has attracted significant recent attention as an alternative to the EDW. A data lake does not require incoming data to conform to a pre-defined schema. Rather, it stores data in its original format. Most often relying on Hadoop (or, less frequently, a NoSQL database²) as the enterprise repository, and implemented on commodity hardware – or via the cloud – the data lake addresses the challenges stemming from massive growth of data volumes as well as those arising from widespread disparity and incompatibility among data types. For example, many hospitals have discovered that data lakes are an ideal way to manage the millions of patient records they maintain – records that can range in format from x-rays to physicians’ notes to lab results.
With a data lake, the hospital stores all of that disparate data in its original format, calling upon specific types of records when needed and converting the data into uniform structures only when the situation calls for it. Because each record remains in its original format, the data lake supports a variety of contextual possibilities that a standard database structure cannot offer.

Of course, the data lake model is not without associated risks and limitations. Notwithstanding those limitations, some organizations are experimenting with simply leaving data in the data lake while they wait to see how they might eventually use it. But without an organized system for managing all that information, it is easy to lose track of it. Additionally, relying on late binding techniques rather than well-defined metadata protocols can make data less accessible to those who need direct query or analytical access to it. The potential contextual capabilities of the data lake are extremely valuable, but – on its own – the model carries drawbacks that add a level of risk most businesses would find unacceptable.
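The store-raw, convert-on-demand pattern described above can be sketched in a few lines of Python. This is a hypothetical illustration, not any vendor's API; the record formats and helper names are invented:

```python
import json

# Minimal illustration: records are kept exactly as ingested, tagged only
# with their native format. A uniform structure is produced on demand, at
# read time, rather than enforced at write time.
lake = []

def ingest(record_id, fmt, payload):
    """Store the payload untouched; no schema is imposed on ingest."""
    lake.append({"id": record_id, "format": fmt, "raw": payload})

def as_uniform(entry):
    """Convert a record to a uniform dict only when a consumer asks for it."""
    if entry["format"] == "json":
        return json.loads(entry["raw"])
    if entry["format"] == "csv":
        keys, values = (line.split(",") for line in entry["raw"].splitlines())
        return dict(zip(keys, values))
    # Unstructured formats (notes, images) pass through unconverted.
    return {"unstructured": entry["raw"]}

ingest("r1", "json", '{"patient": "A", "test": "CBC"}')
ingest("r2", "csv", "patient,test\nB,X-ray")
ingest("r3", "text", "Physician note: patient recovering well.")

uniform = [as_uniform(e) for e in lake]
```

The key design point is that `as_uniform` can be rewritten later for new analytical needs without ever touching the stored originals, which is precisely the contextual flexibility the hospital example depends on.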

2. Brantner, Matthias (2015). "Filling in the Gaps in NoSQL Document Stores and Data Lakes." PWC.


When viewed as alternatives, both the traditional EDW model and the data lake have their strong points and their weaknesses. In trying to choose between them, organizations often have to make difficult choices. As Figure 1, below, demonstrates, organizational requirements often put the two models at odds with one another. A business that needs to deal with very large volumes of data and keep server and storage costs down while meeting numerous compliance and reporting mandates will often be confused and frustrated by this dichotomy.

Figure 1: Viewed as alternatives, the data lake and EDW models are incomplete when it comes to addressing the kinds of challenges businesses currently face

The answer may lie with a hybrid of the two models. By combining the best features of an enterprise data warehouse with the greater storage flexibility and contextual value provided by a data lake, your company can more efficiently manage growing data volumes and complexity. Executed properly, a hybrid environment can provide the ease of analysis you need today, while maintaining a repository of all data in its original format. Such an approach can address analytical needs while ensuring that context is preserved for both immediate and future needs.


Figure 2: A hybrid solution can fill the gaps

The following sections provide an overview of some of the more typical deployment options for a data lake within a larger data management framework.


DATA LAKE AS A TECHNOLOGY SOLUTION

As data volumes grow, and the variety and complexity of data types increase, traditional data management tools are pushed to their limits. In particular, the EDW – with its extensive data preparation and cleansing requirements – is unable to keep up. A data lake can offer significant assistance to an environment relying solely on an EDW and facing such pressures.

Typically, a data lake is proposed as primarily a technology solution, albeit one that addresses certain organizational concerns. In this scenario, the data lake is portrayed as the ideal way to advance your organization’s big data strategy. Advocates of this approach argue that the data lake is a project that you can simply “turn over” to your IT department, along with a deadline and an appropriate commitment from your budget. Once it is developed, you will have a repository for all the disparate pieces of data that your organization collects.

Some would even argue that a data lake provides the ideal way to break the pattern of tension that routinely exists between IT and the business. IT has traditionally worked to drive information into centralized EDW architecture solutions (including data marts), even as the business tends toward less centralized solutions such as shadow IT efforts³ and “spreadmarts”⁴ – Microsoft Excel-based quasi-data mart solutions that tend to proliferate throughout organizations, providing incomplete and often conflicting sets of analyses. Increasing demand for access to data throughout the organization has only increased that tension in recent years. The need for business analysts, managers, and other decision-makers to have hands-on access to data and analysis is growing. A data lake enables more localized solutions and broader access to the data – while maintaining a consistent, IT-managed repository – all without incurring significant costs.
As a technology solution, the data lake is intended to serve as an effective supplement to your EDW architecture, decreasing reliance on centralized solutions and offering new flexibility. Once built, the idea is that IT can leave the analysis to those who have need of it. However, the business analysts who need access to the data typically lack the broad set of skills required to perform analysis on such a wide variety of data types. Business users naturally tend towards solutions that they find intuitive and easy to use, which is why spreadmarts remain a problem.

Meeting Big Data Challenges

Probably the most important benefit of creating a central repository of an organization’s data is the context that it provides. Context can transform raw data into usable knowledge to better inform both day-to-day decisions and long-term strategies. The data lake provides the whole picture; nothing is left out because it doesn’t fit the schema.

Consider the sensor and other machine information that Internet of Things (IoT) environments must manage. Daily or hourly (or even considerably more frequent) readings on location, temperature, or any of thousands of other variables can factor into major business decisions. But that is a massive amount of data to try to wedge into a conventional EDW schema, and those data points are only relevant for a certain class of queries and analysis. Often, the ETL requirements alone preclude putting this kind of information into a conventional data warehouse.

As an example of how a data lake can provide new opportunities for more powerful analysis of data, consider the data lake system that GE recently implemented. More than two dozen airlines now stream a wide variety of jet engine performance data directly into a data lake. The data is analyzed by service crews at the airlines so that they can more easily detect problems with performance. Even the smallest anomalies can be detected with this level of analysis, by comparing variables such as engine temperature, the engine type, and its overall service records.⁵

3. Guest, Vawns and Bolger, Patrick. "Managing shadow IT." ComputerWeekly.com. http://goo.gl/gyBBvH

4. Eckerson, Wayne (July 2002). "Taming Spreadsheet Jockeys." TDWI Case Studies and Solutions. TDWI.
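To make the "comparing readings against a baseline" idea concrete, here is a toy anomaly check in Python. It is emphatically not GE's analytics; the engine type, temperatures, and 3-sigma threshold are all invented for illustration:

```python
import statistics

# Toy baseline-vs-reading anomaly detection: a new temperature reading is
# compared against the historical mean for the same engine type, and flagged
# if it falls more than `threshold` standard deviations away.
history = {
    "type-A": [602, 598, 601, 600, 599, 603, 597, 600],  # past temps (deg C)
}

def is_anomalous(engine_type, reading, threshold=3.0):
    baseline = history[engine_type]
    mean = statistics.mean(baseline)     # 600.0 for the sample above
    stdev = statistics.stdev(baseline)   # 2.0 for the sample above
    return abs(reading - mean) > threshold * stdev

print(is_anomalous("type-A", 601))  # → False (within normal variation)
print(is_anomalous("type-A", 640))  # → True  (well outside the baseline)
```

A real deployment would maintain baselines per engine, per flight phase, and per service history, but the core comparison is the same shape as this sketch.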

The Data Annex

None of this suggests, however, that data lakes can simply serve as a replacement for an enterprise data warehouse.⁶ They cannot. When deployed as a technology solution, the data lake becomes an annex to the EDW, a supplemental environment that addresses some requirements the EDW cannot address on its own.

Figure 3: A data annex / supplemental data lake is typically introduced as a technology-focused initiative

5. "Angling in the Data Lake" (August 10, 2014). GE Reports. http://goo.gl/YWZg9T

6. Elliott, Timo (April 2014). "No, Hadoop Isn’t Going To Replace Your Data Warehouse." Business Analytics. http://goo.gl/GtXb1V


When implemented via Hadoop or a NoSQL database, data lakes function as repositories where disparate types of information are stored in their native format. The lack of structure is, in one sense, necessary. At present, the demands of big data necessitate such a repository for the storage of the many different varieties of data that are captured. Analyzing this kind of data, as GE is doing with the jet engine data mentioned above, requires manipulating metadata. What analysis can be achieved with this approach is severely limited in scope, although – as noted – it can provide a tremendous business impact.

In addition to a lack of structure, such data has no lineage. It can thus be extremely difficult to determine how and where the data was generated, as well as other factors that would make it easier to classify and categorize. The basic assumptions that standard enterprise data management practices have established around data do not apply. In the appropriate context, and with the metadata tweaked just right, the data lake produces value. In another context (even potentially a closely related one), the same data may be of substantially lower value, or no value at all. As explained by Gartner:

“Data lakes therefore carry substantial risks. The most important is the inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake. By its definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch.”⁷

In a standard data warehouse environment, on the other hand, data tends to be broadly applicable across the widest variety of analytical use cases.
Banks, accounting firms, manufacturers, and other data-intensive organizations can more readily access the information they need and enjoy much greater flexibility in the queries and types of analysis they apply to it. Where a data lake may be implemented as a technology solution, a data warehouse is almost always implemented as a business solution. This above all is why – for all the success that GE has achieved with their jet engine data lake – no one is suggesting that such an environment replaces the data warehouse(s) such a large company needs to manage day-to-day operations, support broad reporting and analysis of the overall operation, and maintain regulatory and SLA compliance.
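One practical way to keep a lake from becoming the swamp Gartner warns about is to maintain descriptive metadata and lineage alongside each stored object. The following is a minimal hypothetical sketch of such a catalog, not any particular product's API; the paths and job names are invented:

```python
from datetime import datetime, timezone

# Hypothetical metadata catalog for a data lake: every object ingested gets
# a descriptive entry recording its source, native format, and lineage, so
# later analysts do not have to start from scratch.
catalog = {}

def register(path, source, fmt, derived_from=None):
    catalog[path] = {
        "source": source,                    # where the data came from
        "format": fmt,                       # native format on ingest
        "derived_from": derived_from or [],  # lineage: upstream objects
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

def lineage(path):
    """Walk the derived_from links back to the original raw objects."""
    entry = catalog[path]
    if not entry["derived_from"]:
        return [path]
    chain = []
    for parent in entry["derived_from"]:
        chain.extend(lineage(parent))
    return chain + [path]

register("raw/engine_telemetry.avro", source="aircraft-sensors", fmt="avro")
register("curated/engine_daily.parquet", source="etl-job-42", fmt="parquet",
         derived_from=["raw/engine_telemetry.avro"])

print(lineage("curated/engine_daily.parquet"))
# → ['raw/engine_telemetry.avro', 'curated/engine_daily.parquet']
```

Even this small amount of discipline restores the two things Gartner says the bare lake lacks: a way to judge data quality at the source and a traceable lineage for any derived finding.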

7. "Gartner Says Beware of the Data Lake Fallacy" (July 28, 2014). Gartner. http://goo.gl/vOKEs3


THE DATA LAKE AS A COMPONENT OF THE DATA ARCHITECTURE

Not long ago, there was a flurry of media discussion about whether big data environments would bring about the end of the data warehouse. While that question is not asked as frequently today, the tendency to view data lakes and EDWs as competing or alternative models persists, leading to ongoing misunderstandings and obscuring what is really happening within many organizations.

The reality for many businesses is that there is no choice to be made. Health care, financial services, utilities, and many other industries are highly regulated. In those settings, the structure and use of the data warehouse becomes a matter of law. Moreover, even in less regulated industries, when companies commit to complex and detailed service-level agreements, compliance often involves a structured approach to managing data that only an EDW can provide. Operating outside of the law or violating core business agreements with vital customers or partners is not an option. Of necessity, such companies will retain an EDW as part of their overall data architecture. If the data lake is most often deployed as a technology solution, the data warehouse is most often retained as a business solution. Whatever value the data lake may bring, it generally cannot deliver core functionality that the business requires.

Figure 4: A business action machine solves specific business problems and is integrated with the overall enterprise data fabric

However, not every “data warehouse” qualifies as a true enterprise data warehouse. In addition to data marts, reporting servers, and related systems, many organizations implement data warehousing solutions that are in fact quite limited in their scope and abilities, and often devoted to a single main purpose – such as managing security data or monitoring product performance. These “business action machines” often work in parallel with the true EDW, interacting with it as required depending on the specific function being carried out.


A business action machine may be an important component of the overall data architecture, but by definition it is not an alternative to an EDW. In the GE example provided above, the data lake provides exactly this kind of limited-scope function. The applicability of a data lake architecture to such a system will depend in large part on what task is being undertaken. As outlined above, many functions can only be reliably performed via a data warehouse. Of course, the limitations cited here primarily stem from the use of Hadoop or a NoSQL database. However, a business can implement what is, effectively, a business-action data lake using any of a number of technologies, including more traditional database technology. In such an environment, there will still be a clear distinction between the data repository and the active, governed environment.


THE DATA LAKE AS AN ENTERPRISE CONTEXT MACHINE

Successfully incorporating a data lake into a hybrid architecture requires balancing expectations of the advantages it can bring with a thorough understanding of its limitations. To understand why, consider some of the more obvious limitations.

A data lake is not a good option for high-performance analysis of structured data. Because the data in a data lake remains in its native format, it lacks the structure that would facilitate versatile querying and high-performance analysis. The big data answer to this problem is the Hadoop principle known as “schema on read,” where the data structure is applied at the time the query is issued. But this approach is not without risks. Some of the data included in big data environments can be highly volatile, creating a disconnect between the assumed structure of the data and what is actually there. This creates a situation where it is easy to overlook errors in the data, and performance inevitably takes a hit. The best way to mitigate these risks involves some level of vetting of incoming data. This requires reintroducing at least some parts of the Extract, Transform, Load (ETL) process for the EDW that a big data solution is supposed to eliminate.⁸

A stand-alone data lake is also not the best option for real-time analysis, for several reasons. The most obvious is that the data lake model is, by definition, a repository. A business looking to do real-time analysis will need to add real-time functionality onto the data lake, via Spark Streaming or some other interface, to support such a use case. Additionally, there are all the potential issues around data quality discussed above. The risk that the schema applied at read time is not going to find all relevant data – that some data is going to slip between the cracks – is a very real one. If important data is not finding its way out of the data lake, high-speed analysis only serves to accelerate the errors.
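The "slip between the cracks" risk of schema on read can be shown in a few lines. This toy Python sketch (invented sensor data, not a real lake or Hadoop API) applies the assumed schema only at query time, so a record whose field was renamed upstream is silently dropped:

```python
import json

# "Schema on read" in miniature: structure is imposed when the query runs,
# not when data lands in the lake, so volatile records that no longer match
# the assumed schema quietly disappear from results.
raw_lake = [
    '{"sensor": "t1", "temp_c": 21.5}',
    '{"sensor": "t2", "temperature": 22.1}',  # field renamed upstream
    '{"sensor": "t3", "temp_c": 19.8}',
]

def read_with_schema(lines):
    """Apply the assumed schema at read time; non-conforming rows are skipped."""
    rows, skipped = [], 0
    for line in lines:
        record = json.loads(line)
        if "sensor" in record and "temp_c" in record:
            rows.append((record["sensor"], record["temp_c"]))
        else:
            skipped += 1  # invisible loss unless explicitly counted
    return rows, skipped

rows, skipped = read_with_schema(raw_lake)
print(len(rows), skipped)  # → 2 1  (one volatile record was silently dropped)
```

Counting and surfacing `skipped`, as here, is exactly the kind of lightweight vetting the text argues must be reintroduced from the ETL world.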
Access and Context

As noted above, the primary advantage of a hybrid environment is that it brings context to enterprise analysis and reporting. In the age of big data, businesses are swimming in a sea of context. When deployed properly, contextual data can add tremendous business value. Businesses can explore machine- and user-generated data to segment customers by behavior as well as demographics, and to make surprising connections between seemingly unrelated factors. A retailer looking at point-of-sale transactions can compare receipts with external data such as searches trending on social media to better understand why customers are deciding to buy (or not buy). A shipping company can dig deeper than just seasonal variations and begin forecasting work volume and likely delays by cross-referencing orders placed with changes in weather. Manufacturing, logistics, hospitality, health care, retail, entertainment – all industries can benefit from digging deeper into data that sheds unexpected light on their operations.

While context from a data lake can provide tremendous insights, it is of little use if there is no system in place to deliver reliable answers to those who need it – and deliver it in a way that helps to facilitate the decision-making process. To accommodate that need, the best option is an architecture that leverages the advantages of both the enterprise data warehouse and the data lake. The warehouse can still provide the rapid analysis of structured data, while the data lake can support the warehouse by providing the context that can better inform decision-making at every level of the enterprise.

8. "Why Hadoop Projects Fail — and How to Make Yours a Success." VentureBeat. http://goo.gl/oGI7yH
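The warehouse-plus-lake division of labor can be sketched concretely. In this hypothetical Python example (all data invented, no vendor API implied), the fast structured query over warehouse data runs first, and raw lake content is pulled in only to explain what the numbers show:

```python
# Hybrid pattern in miniature: structured sales figures live in the
# "warehouse", raw social-media mentions sit in the "lake". The structured
# query detects the spike; the lake supplies the context behind it.
warehouse_sales = [
    {"day": "2015-06-01", "category": "shoes", "units": 120},
    {"day": "2015-06-02", "category": "shoes", "units": 480},  # sudden spike
]

lake_social_mentions = [
    {"day": "2015-06-02", "text": "everyone wants the new marathon trainers"},
    {"day": "2015-06-02", "text": "great race last night!"},
]

def explain_spikes(sales, mentions, factor=2.0):
    """Flag days where units jump, then attach same-day lake context."""
    explained = []
    for prev, cur in zip(sales, sales[1:]):
        if cur["units"] >= factor * prev["units"]:
            context = [m["text"] for m in mentions if m["day"] == cur["day"]]
            explained.append({"day": cur["day"], "context": context})
    return explained

spikes = explain_spikes(warehouse_sales, lake_social_mentions)
print(spikes[0]["day"])           # → 2015-06-02
print(len(spikes[0]["context"]))  # → 2
```

Neither store could answer the "why" question alone: the warehouse sees the spike but not the chatter, and the lake holds the chatter but cannot deliver the fast structured comparison.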

Figure 5: The contextual data lake combines the scalability and flexibility of big data solutions with the reliability and business focus of a traditional EDW

A hybrid environment can serve as a true enterprise context machine, bridging the gap between the power and scalability of big data technologies and the reliability and business focus of an enterprise data warehouse. In the next section, we examine some of the technologies that support an enterprise context machine.


SAP SOLUTIONS SUPPORT THE CONTEXTUAL DATA LAKE

SAP provides a full palette of technologies to support the contextual data lake. Proven EDW technology provides the structure for a true enterprise-grade solution, backed by unmatched knowledge of and integration with end-to-end business processes. The real-time platform directly integrates the EDW with the business, and can also serve as a bridge between the business, the data warehouse, and the data lake. High-end traditional data management technologies round out the ecosystem and provide flexibility in structuring a hybrid environment.

Big Data Platform

SAP’s big data platform is SAP HANA, an in-memory, column-oriented RDBMS (relational database management system) designed to handle both high transaction rates and complex query processing on the same platform. HANA radically simplifies data management architectures by combining in-memory processing of transactional data with the EDW, enabling business processes to run 1,000 to 100,000 times faster than in environments relying on traditional architectures. With an embedded web server and version control repository that can be used for application development, HANA provides a full real-time computing platform that enables businesses to realize maximum value from their big data assets.

eBay’s HANA implementation story provides an example of how the platform supports a true contextual data lake environment. With more than 90 million users worldwide and a system that processes millions of transactions per day, eBay’s online auction service has accumulated some 50 petabytes of data, and is still growing. In order to provide actionable intelligence to their users, eBay requires a system that can support analyzing tens of thousands of variables within that massive collection of data in order to identify shopping patterns and purchasing trends as they emerge. Because the value of the assets traded on eBay can be highly variable, understanding and leveraging these trends as they occur is vitally important to the success of the sellers, who seek to maximize the value of each sale they make.

The 50 petabytes of data is curated in a massive conventional data warehouse. eBay has a team of more than 300 analysts studying the data from the North America marketplace on a full-time basis; these individuals are responsible for understanding online shopping patterns within specific product categories.
With HANA, eBay has implemented an early pattern detection system that uses predictive analytics to enable these analysts to discover trends as they emerge in real time. As noted above, a contextual data lake need not be built on a “big data” platform per se. In this instance, eBay is running real-time predictive analysis on data in a HANA-based data mart, and dipping into the data lake (that is, the conventional data warehouse) as needed for additional context.

Holidays, sporting events, movie releases, and many other news items and emerging patterns on social media can drive rapid and substantial changes to the value of specific items. Analysts can now observe patterns as they emerge and dig deeper to get the full story on what is happening in the market. For example, noting that shoe sales are spiking is helpful, but observing that the real upward trend is among athletic shoes helps cue the right sellers to take advantage of the trend. Moreover, linking the spike to a particular brand and make of shoe that has suddenly become hot because of a sporting event or a tweet from a star athlete enables the specific sellers who have those shoes to offer to leverage the rapidly changing market. Real-time transactional data combined with context from the data lake opens up a whole new view of what is happening within eBay.

EDW

SAP Business Warehouse (BW) is an enterprise data warehouse solution that enables businesses to integrate, transform, and consolidate relevant business information from both SAP applications and external data sources. SAP BW provides businesses with a high-performance infrastructure that enables them to evaluate and interpret data, fully integrated with the overall SAP ecosystem and leveraging SAP’s unique and comprehensive understanding of business processes. It enables the reporting, analysis, and interpretation of business data that is crucial for companies to preserve and enhance their competitive edge by optimizing processes and reacting quickly to market opportunities. Decision makers can make well-founded decisions and identify target-oriented activities on the basis of the analyzed data. BW was traditionally implemented on standard RDBMS technology, which it still supports, and is now frequently deployed on SAP HANA to take advantage of the full integration of operations and analytics that HANA provides, including real-time capability.

Alliander, a Dutch energy distribution company serving 3.5 million customers, provides a good example of how businesses can leverage HANA’s real-time potential with SAP Business Warehouse. The environment supports both real-time analysis and historical/contextual analysis, depending on the use case. A critical process for Alliander is load forecasting. Providing too much power causes waste and has a negative environmental impact; providing too little causes customer dissatisfaction and puts vital services at risk. Striking a balance between the two requires intensive analysis. By using BW on HANA, Alliander was able to cut their load forecasting process from 10 weeks to three days, providing significant savings and greater assurance of forecast accuracy.
The company has also implemented advanced analytics for asset management, using comparative analysis of nearby assets to predict failure of infrastructure before it occurs.

Data Management

SAP also provides a suite of traditional database management systems that can provide critical infrastructure for hybrid data lake environments. These include SAP ASE, a full-featured enterprise OLTP database; SAP IQ, a columnar RDBMS optimized for big data analytics and data warehousing; and SAP SQL Anywhere, an embedded SQL database management system that supports custom mobile database applications.

HANA EDW Deployment Options

For structured data, SAP HANA Smart Data Access enables businesses to merge data in heterogeneous EDW landscapes and to access remote data without having to replicate the data to the SAP HANA database first. In addition to Apache Hadoop, HANA Smart Data Access supports a variety of data sources, including the Teradata database, SAP Sybase ASE, SAP Sybase IQ, and the Intel Distribution for Apache Hadoop. SAP HANA handles the data like local tables on the database. Automatic data type conversion makes it possible to map data types from databases connected via SAP HANA Smart Data Access to SAP HANA data types.

Figure 6: Hybrid data lake environment using structured data and conventional RDBMS / EDW technology

Another option for structured data is SAP HANA Dynamic Tiering, which provides the ability to keep data either in memory or on disk in a columnar format via SAP IQ, allowing users to assign hot (active) data to memory while handling warm or cooler data on disk. From the user’s point of view, HANA tables on disk and in memory are indistinguishable and can be queried and modified using standard SQL statements, like any other SAP HANA tables.

Dynamic Tiering provides a valuable intermediate solution for organizations managing very large sets of structured data. While in some instances it may not be practicable to put the full dataset into HANA, there is very little advantage, other than storage costs, to moving such data into Hadoop. Using SAP IQ to offload less active data enables your business to get the full benefit of a dedicated columnar RDBMS while keeping all administrative and other processes HANA-centric. Dynamic Tiering provides much faster access to the data in SAP IQ than would be possible for the same data in Hadoop, and without the administrative overhead of having to perform SQL queries into Hadoop.
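The hot/warm split that Dynamic Tiering manages can be illustrated conceptually. The sketch below is not SAP's implementation; the 30-day threshold, row data, and tier names are invented purely to show the decision being automated:

```python
from datetime import date

# Conceptual hot/warm tiering: rows touched within HOT_DAYS stay in the
# in-memory tier; older rows move to the disk (columnar) tier. Queries in
# the real product see both tiers as one table; only placement differs.
HOT_DAYS = 30

rows = [
    {"id": 1, "last_access": date(2015, 6, 20)},
    {"id": 2, "last_access": date(2014, 11, 3)},
    {"id": 3, "last_access": date(2015, 6, 1)},
]

def assign_tiers(rows, today):
    tiers = {"in_memory": [], "disk": []}
    for row in rows:
        age_days = (today - row["last_access"]).days
        tier = "in_memory" if age_days <= HOT_DAYS else "disk"
        tiers[tier].append(row["id"])
    return tiers

tiers = assign_tiers(rows, today=date(2015, 6, 30))
print(tiers)  # → {'in_memory': [1, 3], 'disk': [2]}
```

The point of the illustration is that the placement policy is an operational detail, invisible to the SQL a user writes, which is what makes tiering an attractive middle ground between all-in-memory HANA and Hadoop.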


Figure 7: Hybrid Data Lake environment using multi-structured data and Hadoop

HANA Deployments with Hadoop

A hybrid environment enables your business to combine the in-memory processing power of SAP HANA with Hadoop's ability to store and process very large datasets without regard to structure. Such a solution can process massive amounts of data – up to 100 petabytes or more – at relatively low cost by distributing data processing via Hadoop to scale across commodity hardware. Combining HANA and Hadoop leverages the advantages of both technologies, creating a highly dynamic and scalable data ecosystem. Such a solution can shrink data management costs to a fraction of conventional database total cost of ownership. Perhaps more importantly, it opens up capability that the business simply did not have before.

The McLaren Group, known for designing and building winning Formula One cars and deploying winning Grand Prix racing teams, demonstrates such capability with its hybrid HANA and Hadoop environment. The McLaren race car carries over 1,000 sensors, which generate more than 30,000 bytes of data per second. In real time, HANA enables the driver to evaluate his performance against competitors and to avoid collisions, while pit mechanics track component performance and predict failure in order to minimize the number of required pit stops. After the race, the full set of data stored in the Hadoop data lake enables the team to analyze overall vehicle and driver performance; that same contextual data enables the manufacturer to perform sophisticated analysis that drives design changes to subsequent versions of the vehicle.

HANA can connect with Hadoop via Smart Data Access or via SAP Data Services, which provides the option of pushing processing down to Hadoop using HiveQL or Pig scripts. Via an ETL process, Hadoop data can then be bulk loaded into an EDW running on HANA.
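The push-down pattern described above can be sketched in HiveQL. The table names, columns, and HDFS paths below are hypothetical, and in practice the HANA-side load would typically be orchestrated by SAP Data Services rather than written by hand:

```sql
-- Hive table over raw sensor files already sitting in HDFS.
CREATE EXTERNAL TABLE sensor_raw (
  car_id    STRING,
  sensor_id STRING,
  ts        BIGINT,
  reading   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/telemetry/raw';

-- The heavy aggregation runs on the Hadoop cluster, close to the data...
CREATE TABLE sensor_summary AS
SELECT car_id,
       sensor_id,
       AVG(reading) AS avg_reading,
       MAX(reading) AS max_reading
  FROM sensor_raw
 GROUP BY car_id, sensor_id;

-- ...and only the compact sensor_summary result set is then bulk loaded
-- into the HANA-based EDW via the ETL process.
```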


To bring real-time performance and contextual analysis even closer together, SAP has introduced HANA Vora, a solution that combines Hadoop with HANA using Spark as a distribution engine, connecting HANA's in-memory technology with the Hadoop file system. The new solution provides an in-memory, massively distributed data processing engine within Hadoop to enable simple, business-oriented scale-out processing of data. As shown in Figure 8 below, such an approach will prove particularly effective in the growing number of environments in which there is no single data lake, but rather a widely distributed set of big data collections. Early testing indicates that this solution greatly improves query performance.9
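The general pattern can be sketched in generic Spark SQL rather than the Vora-specific API (the view name and path below are hypothetical): Hadoop-resident files are registered as in-memory relational tables that can then be queried with ordinary SQL. Vora layers a HANA-aware, scale-out data source on top of this same model:

```sql
-- Generic Spark SQL sketch (not the Vora API): expose Hadoop-resident
-- files as a relational view for in-memory, distributed SQL processing.
CREATE TEMPORARY VIEW telemetry
USING csv
OPTIONS (path '/data/telemetry', header 'true', inferSchema 'true');

-- Business-oriented SQL runs scale-out across the cluster.
SELECT sensor_id, AVG(reading) AS avg_reading
  FROM telemetry
 GROUP BY sensor_id;
```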

Figure 8: Vora Integrates HANA and Hadoop

Such a solution is an evolutionary step toward the vision of a fully in-memory solution. Advances that move contextual analysis and high-performance, high-integrity data management systems closer together serve an important function. But it is possible that the data lake model, while likely to be with us for a while to come, is not the end game.

9. Leukert, Bernd. "Run Simple: Reimagine the Promise at the Heart of Your Business" (May 6, 2015). SAP Sapphire keynote address. http://goo.gl/r33eu8


Ultimately, SAP HANA has evolved to solve business problems. That is less true for Hadoop, which has evolved primarily to address technical issues. HANA can integrate every component of the enterprise data ecosystem, uniting real-time data access and contextual analysis of data in a way that no other solution can. While hybrid data lake environments are a sound choice for the present and near future, a solution that brings all enterprise data together in a single real-time environment may prove to be the answer in the long run.

Figure 9: Pure HANA real-time big data environment


CONCLUSION

The dichotomy of the data lake and the enterprise data warehouse reflects an older, deeper rift between the need to solve business problems and the attraction that new technologies often represent. The importance of Hadoop to the big data landscape would be difficult to overstate, but it is a mistake to confuse Hadoop with the entire landscape. An exclusive focus on Hadoop, on the data lake model, or on any single technology or implementation model can become a distraction in the face of business challenges. Ultimately, your organization needs an architecture driven by business need rather than by what technologies you have (or what new technologies are available). Existing EDW technologies and practices are vital because of the ongoing problems they address and the integrity they provide for the data ecosystem. Big data technologies and practices are critically important because of the new opportunities they provide for business insight in an environment that demands increasingly expanded and accelerated results. A hybrid approach as outlined in this paper can bridge the gap between the conflicting requirements your business increasingly faces – e.g., doing more with ever greater volumes of data in ever-diminishing time frames – and the disparate technologies that provide such capabilities. A structured data warehouse informed by a data lake can bring transactions and records together with demographic, historical, and other contextual data to provide new and often completely unexpected insights, allowing you to avoid risks and leverage opportunities that would have been invisible before. And your business can realize these benefits while maintaining proper data governance and while meeting all business and external (e.g., regulatory) requirements.
Moreover, when implemented with the right technologies, a hybrid contextual data lake can help make your business future-ready, better prepared for the merging of the real-time and big data paradigms which is on its way.
