Lenovo Configuration Guide for Cloudera Enterprise with Apache Spark

Initial Release: 7 June 2016
Version 1.0

Gary Cudak, Ajay Dholakia
CRN: BDACLDSPK62


Table of Contents

1 Introduction
2 Business problem and business value
  2.1 Business problem
  2.2 Business value
3 Requirements
  3.1 Functional requirements
  3.2 Non-functional requirements
4 Solution overview
  4.1 Software components
  4.2 Hardware components
    4.2.1 Lenovo System x3650 M5 Server
    4.2.2 Additional Hardware Components
5 Deployment considerations
  5.1 Server / Compute Nodes
  5.2 Networking
  5.3 Test and validation
  5.4 Deployment examples
6 Appendix: Bill of materials (optional)
  6.1 BOM for compute servers
Resources
Trademarks and special notices


1 Introduction

This document describes the reference configuration for the Lenovo Solution for Cloudera Enterprise with Apache Spark. It provides a predefined hardware infrastructure for Cloudera Enterprise (version 5.7), a distribution of Hadoop and Spark with value-added capabilities from Cloudera. This reference configuration is an extension of the Lenovo Big Data Reference Architecture for Cloudera Distribution for Hadoop (see Resources).

Cloudera brings the power of Apache Hadoop and Spark to the enterprise. Apache Hadoop and Spark are open source software frameworks used to reliably manage large volumes of structured and unstructured data. Cloudera enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features. The result is a more developer-friendly and user-friendly solution for complex, large-scale analytics.

The predefined configuration provides a baseline for a big data solution, which can be modified based on specific customer requirements, such as lower cost, improved performance, and increased reliability.

This document is intended for IT professionals, technical architects, sales engineers, and consultants who plan, design, and implement the big data solution with Lenovo hardware. It is assumed that you are familiar with Apache Hadoop and Spark components and capabilities. For more information about Hadoop, see Resources.


2 Business problem and business value

This section describes the business problem that is associated with big data environments and the value that is offered by the Cloudera solution that uses Lenovo hardware.

2.1 Business problem

By 2009, the world generated 800 billion GB of data, a level that is expected to increase to 40 trillion GB by 2020. In all, 90% of the data in the world today was created in the last two years alone. This data comes from everywhere, including sensors that are used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone global positioning system (GPS) signals. This data is big data.

Big data spans the following dimensions:

• Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
• Velocity: Often time-sensitive, big data must be used as it is streaming into the enterprise to maximize its value to the business.
• Variety: Big data extends beyond structured data, including unstructured data of all varieties, such as text, audio, video, click streams, and log files.

Big data is more than a challenge; it is an opportunity to find insight into new and emerging types of data to make your business more agile. Big data also is an opportunity to answer questions that, in the past, were beyond reach. Until now, there was no effective way to harvest this opportunity. Today, Cloudera uses the latest big data technologies, such as the massive MapReduce scale-out capabilities of Hadoop, to open the door to a world of possibilities.

As businesses adopt big data technologies and bring them into their IT infrastructure, an additional challenge has emerged: managing the different big data clusters. Typically, data centers have one cluster for off-line batch processing of large datasets, another for real-time stream processing, and yet another for running analytics applications such as machine learning and graph data processing. These mixed environments for big data processing are becoming common, driven by emerging technologies such as the Internet of Things (IoT).

2.2 Business value

Apache Hadoop is the open source software framework that is used to reliably manage large volumes of structured and unstructured data. Cloudera enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features. The result is a more developer-friendly and user-friendly solution for complex, large-scale analytics.

How can businesses process tremendous amounts of raw data in an efficient and timely manner to gain actionable insights? Cloudera allows organizations to run large-scale, distributed analytics jobs on clusters of cost-effective server hardware. This infrastructure can be used to tackle large data sets by breaking up the data into “chunks” and coordinating data processing across a massively parallel environment. After the raw data is stored across the nodes of a distributed cluster, queries and analysis of the data can be handled efficiently, with dynamic interpretation of the data formatted at read time. The bottom line: Businesses can finally get their arms around massive amounts of untapped data and mine that data for valuable insights in a more efficient, optimized, and scalable way.

Apache Spark is gaining popularity as a big data processing framework that can run different types of big data and analytics applications on a common framework. These include batch, streaming, machine learning, and others. The challenge of maintaining different codebases running on different clusters can be effectively addressed by deploying Spark applications on a single cluster.

Cloudera deployed on Lenovo System x servers with Lenovo networking components provides superior performance, reliability, and scalability. The reference architecture supports entry through high-end configurations and the ability to easily scale as the use of big data grows. A choice of infrastructure components provides flexibility in meeting varying big data analytics requirements.


3 Requirements

The functional and non-functional requirements for this reference configuration are described in this section.

3.1 Functional requirements

A big data solution supports the following key functional requirements:

• Various application types, including batch and real-time analytics
• Industry-standard interfaces so that applications can work with Cloudera
• Real-time streaming and processing of data
• Various data types and databases
• Various client interfaces
• Large volumes of data

3.2 Non-functional requirements

Customers require their big data solution to be easy, dependable, fast, and secure. The following non-functional requirements are key:

• Easy:
  o Ease of development
  o Easy management at scale
  o Advanced job management
  o Multi-tenancy
• Dependable:
  o Data protection with snapshot and mirroring
  o Automated self-healing
  o Insight into software/hardware health and issues
  o High availability (HA) and business continuity
• Fast:
  o Superior performance
  o Scalability
• Secure:
  o Strong authentication and authorization
  o Kerberos support
  o Data confidentiality and integrity


4 Solution overview

Figure 1 shows the main features of the Cloudera reference configuration that uses Lenovo hardware. Users can log into the Cloudera client from outside the firewall by using Secure Shell (SSH) on port 22 to access the Cloudera solution in the corporate network. Cloudera provides several interfaces that allow administrators and users to perform administration and data functions, depending on their roles and access level. Hadoop and Spark application programming interfaces (APIs) can be used to access and process data. Cloudera APIs can be used for cluster management and monitoring. Cloudera data services, management services, and other services run on the nodes in the cluster.

Storage is a component of each data node in the cluster. Data can be incorporated into the Cloudera Hadoop storage through the Hadoop APIs or network file system (NFS), depending on the needs of the customer.

A database is required to store the data for Cloudera Manager, the Hive metastore, and so on. Cloudera provides an embedded database for test or proof of concept (POC) environments; an external database is required for production environments.

Figure 1. Cloudera architecture overview

The version of CDH that includes Apache Spark is Cloudera Enterprise.

4.1 Software components

The CDH solution provides features and capabilities that meet the functional and non-functional requirements of customers. It supports mission-critical and real-time big data analytics across different industries, such as financial services, retail, media, healthcare, manufacturing, telecommunications, and government organizations, as well as leading Fortune 100 and Web 2.0 companies.

CDH is the world’s most complete, tested, and popular distribution of Apache Hadoop and related projects. All of the packaging and integration work is done for you, and the entire solution is thoroughly tested and fully documented. By taking the guesswork out of building out your Hadoop deployment, CDH gives you a streamlined path to success in solving real business problems with big data. CDH is 100% Apache-licensed open source and is the only Hadoop solution to offer unified batch processing, real-time stream processing, interactive SQL, interactive search and discovery, and role-based access controls.

The Cloudera platform for big data can be used for various use cases, from batch applications that use MapReduce or Spark with data sources such as click streams, to real-time applications that use sensor data. Figure 2 shows the CDH key capabilities that meet the functional requirements of customers.

Figure 2. CDH key capabilities

The various components of Cloudera Enterprise shown above are described in the Lenovo Big Data Reference Architecture for Cloudera Distribution for Hadoop (see Resources). This section focuses on Apache Spark, giving an overview of its architecture and describing the capabilities of its main components.

Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified big data applications that combine batch, streaming, and interactive analytics on all your data. In collaboration with Databricks (the company that is leading the development of Spark), Cloudera offers commercial support for Spark with Cloudera Enterprise. Spark can be 10 to 100 times faster than MapReduce, depending on the workload, which delivers faster time to insight on more data and results in better business decisions and user outcomes.

The Apache Spark project has recently become very popular and is being adopted as a preferred framework for a variety of big data use cases, ranging from batch applications that use MapReduce or Spark with data sources such as click streams, to real-time applications that use sensor data. The Apache Spark stack is shown in Figure 3. As depicted, the foundational component is the Spark Core. Spark is written in the Scala programming language and offers simple APIs in Python, Java, Scala, and SQL.

Figure 3. The Spark stack

In addition to the Spark Core, the framework allows extensions in the form of libraries. The most common extensions are Spark MLlib for machine learning, Spark SQL for queries on structured data, Spark Streaming for real-time stream processing, and Spark GraphX for graph data processing. Other extensions are also available [Ref].

The Spark architecture shown in Figure 3 enables a single framework to be used for multiple projects. Typical big data usage scenarios to date have deployed the Hadoop stack for batch processing separately from another framework for stream processing, and yet another one for advanced analytics such as machine learning. Apache Spark combines these frameworks in a common architecture, thereby allowing easier management of the big data code stack and also enabling reuse of a common data repository.

The Spark stack shown in Figure 3 can run in a variety of environments. It can run alongside the Hadoop stack, leveraging Hadoop YARN for cluster management. It can also run over Apache Mesos, and it includes a simple built-in cluster manager, the Standalone Scheduler.

Spark applications can run in distributed mode on a cluster using a master/slave architecture with a central coordinator called the “driver” and a potentially large number of “worker” processes, called executors, that execute the individual tasks in a Spark job. The Spark executor processes also provide reliable in-memory storage of data distributed across the various nodes in a cluster. The components of a distributed Spark application are shown in Figure 4.

Figure 4. Distributed Spark application component model (a Spark driver; a cluster manager, e.g. Hadoop YARN; and three cluster workers, each running an executor)

A key distinguishing feature of Spark is its data model, which is based on RDDs (Resilient Distributed Datasets). This model enables a compact and reusable organization of datasets that can reside in main memory and can be accessed by multiple tasks. Iterative processing algorithms benefit from this feature because they do not have to store and retrieve datasets from disk between iterations of computation. These capabilities deliver the significant performance gains compared to Hadoop MapReduce.

RDDs support two types of operations: Transformations and Actions. Transformations are operations that return a new RDD, while Actions return a result to the driver program. Spark groups operations together to reduce the number of passes taken over the data. This so-called lazy evaluation technique enables faster data processing. Spark also allows data to be cached in memory so that the same data can be used multiple times, which is another technique that contributes to faster data processing.
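The difference between transformations and actions can be illustrated without a cluster. The following is a minimal plain-Python sketch and not Spark's API: the class `LazyDataset` and its methods are hypothetical stand-ins that only mimic how transformations are recorded and then applied together in a single pass over the data when an action runs.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional, Tuple


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; this is not Spark's API)."""

    def __init__(self, source: Iterable[Any],
                 steps: Optional[List[Tuple[str, Callable]]] = None,
                 passes: Optional[List[int]] = None) -> None:
        self._source = source
        self._steps = steps or []  # recorded transformations, not yet executed
        # Shared one-element list counting passes over the source data.
        self.passes = passes if passes is not None else [0]

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [("map", fn)], self.passes)

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [("filter", pred)], self.passes)

    def _run(self) -> Iterator[Any]:
        """Apply all recorded steps to each item in one pass over the source."""
        self.passes[0] += 1
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return a result to the caller.
    def collect(self) -> List[Any]:
        return list(self._run())

    def count(self) -> int:
        return sum(1 for _ in self._run())


if __name__ == "__main__":
    squares = LazyDataset(range(10)).map(lambda x: x * x)
    even_squares = squares.filter(lambda x: x % 2 == 0)
    print(even_squares.passes[0])   # 0: nothing has been computed yet
    print(even_squares.collect())   # [0, 4, 16, 36, 64]
    print(even_squares.passes[0])   # 1: map and filter ran in a single pass
```

In this toy model each action recomputes the chain from the source, which is the cost that caching an intermediate dataset in memory is meant to avoid.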


4.2 Hardware components

A Cloudera deployment consists of cluster nodes, networking, power, and racks. The predefined configurations can be implemented as is or modified based on specific customer requirements, such as lower cost, improved performance, and increased reliability. Key workload requirements, such as the data growth rate, sizes of datasets, and data ingest patterns, help in determining the proper configuration for a specific deployment. A best practice when designing a Cloudera cluster infrastructure is to conduct proof of concept testing by using representative data and workloads to ensure that the proposed design works.

The Lenovo Big Data Reference Architecture for Cloudera Distribution for Hadoop uses Lenovo System x3650 M5 and x3550 M5 servers and Lenovo RackSwitch G8052 and G8272 top of rack switches. The reference configuration described in this document extends, in particular, the data node design based on the Lenovo System x3650 M5. In this section, the capabilities of this server are briefly described, followed by a summary of the other hardware components used in the total solution. For additional information on these components, see Resources.

4.2.1 Lenovo System x3650 M5 Server

The Lenovo System x3650 M5 server (as shown in Figure 5) is an enterprise class 2U two-socket versatile server that incorporates outstanding reliability, availability, and serviceability (RAS), security, and high efficiency for business-critical applications and cloud deployments. It offers a flexible, scalable design and a simple upgrade path to 26 2.5-inch hard disk drives (HDDs) or solid-state drives (SSDs), or 14 3.5-inch HDDs, with doubled data transfer rate through 12 Gbps serial-attached SCSI (SAS) internal storage connectivity and up to 1.5 TB of TruDDR4 Memory. On board, it provides four standard embedded Gigabit Ethernet ports and two optional embedded 10 Gigabit Ethernet ports without occupying PCIe slots.

Figure 5. Lenovo System x3650 M5

Combined with the Intel® Xeon® processor E5-2600 v3 product family, the Lenovo x3650 M5 server offers an even higher density of workloads and performance that lowers the total cost of ownership (TCO) per virtual machine. Its pay-as-you-grow flexible design and great expansion capabilities solidify dependability for any kind of virtualized workload with minimal downtime.

The x3650 M5 server provides internal storage density of up to 87.6 TB in a 2U form factor with its impressive array of workload-optimized storage configurations. It also offers easy management and saves floor space and power consumption for the most demanding storage virtualization use cases by consolidating storage and server into one system.


The reference architecture recommends the storage-rich System x3650 M5 model for the following reasons:

• Storage capacity: The nodes are storage-rich. Each of the 14 3.5-inch drives has raw capacity up to 6 TB and each of two 2.5-inch drives has raw capacity of 1.8 TB for a total of 87.6 TB per node and over 1 petabyte per rack.
• Performance: This hardware supports the latest Intel Xeon processors and TruDDR4 Memory.
• Flexibility: Server hardware uses embedded storage, which results in simple scalability (by adding nodes).
• More PCIe slots: Up to 8 PCIe slots are available if rear disks are not used, and up to 2 PCIe slots if both Rear 3.5-inch HDD Kit and Rear 2.5-inch HDD Kit are used. They can be used for network adapter redundancy and increased network throughput.
• Better power efficiency: Innovative power and thermal management provides energy savings.
• Reliability: Lenovo is first in the industry in reliability and has exceptional uptime with reduced costs.
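The storage capacity figure in the list above can be reproduced with simple arithmetic. The sketch below uses the drive counts and sizes from the text; the function names are illustrative, and treating 1 petabyte as 1000 TB is an assumption made here to estimate how many nodes a rack needs to exceed that figure.

```python
import math

# Drive counts and raw sizes per data node, taken from the text above.
LARGE_DRIVES = 14       # 3.5-inch drives
LARGE_DRIVE_TB = 6.0
SMALL_DRIVES = 2        # 2.5-inch drives
SMALL_DRIVE_TB = 1.8

TB_PER_PB = 1000        # assumption: decimal petabyte


def raw_capacity_tb_per_node() -> float:
    """Raw (unformatted, unreplicated) capacity of one data node in TB."""
    return LARGE_DRIVES * LARGE_DRIVE_TB + SMALL_DRIVES * SMALL_DRIVE_TB


def nodes_for_capacity(target_tb: float) -> int:
    """Smallest number of nodes whose combined raw capacity reaches target_tb."""
    return math.ceil(target_tb / raw_capacity_tb_per_node())


if __name__ == "__main__":
    print(f"{raw_capacity_tb_per_node():.1f} TB per node")      # 87.6 TB per node
    print(nodes_for_capacity(1 * TB_PER_PB), "nodes for 1 PB")  # 12 nodes for 1 PB
```

These are raw capacities. Usable capacity in a Hadoop cluster is lower once the HDFS replication factor and space reserved for the operating system and intermediate data are taken into account.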

For more information, see the following Lenovo System x3650 M5 website: Lenovopress.com/tips1193.

4.2.2 Additional Hardware Components

Lenovo System x3550 M5 Server

The Lenovo System x3550 M5 server is a cost- and density-balanced 1U two-socket rack server. The x3550 M5 features a new, innovative, energy-smart design with up to two Intel Xeon processors of the high-performance E5-2600 v3 product family, a large capacity of faster, energy-efficient TruDDR4 Memory, up to twelve 12 Gb/s SAS drives, and up to three PCI Express (PCIe) 3.0 I/O expansion slots in an impressive selection of sizes and types.

Lenovo System Networking RackSwitch G8052

The Lenovo System Networking RackSwitch G8052 is an Ethernet switch that is designed for the data center and provides a virtualized, cooler, and simpler network solution. The Lenovo RackSwitch G8052 offers up to 48 1 GbE ports and up to 4 10 GbE ports in a 1U footprint.

Lenovo RackSwitch G8272

Designed with top performance in mind, the Lenovo RackSwitch G8272 is ideal for today’s big data, cloud, and optimized workloads. The G8272 switch offers up to 72 10 Gb SFP+ ports in a 1U form factor and is expandable with four 40 Gb QSFP+ ports. It is an enterprise-class and full-featured data center switch that delivers line-rate, high-bandwidth switching, filtering, and traffic queuing without delaying data. Large data center grade buffers keep traffic moving. Redundant power and fans and numerous HA features equip the switch for business-sensitive traffic.


5 Deployment considerations

This section describes the validation environment used for the compute configuration for Apache Spark on Cloudera that is described in this document. Additional detail regarding the full configuration can be found in the Lenovo Big Data Reference Architecture for Cloudera Distribution for Hadoop.

5.1 Server / Compute Nodes

The compute environment consisted of eight x3650 M5 systems, configured as two distinct clusters. Four were configured based on the initial Reference Architecture for Cloudera, each containing 128 GB of memory in the form of 8x 16 GB DIMMs, with 2x Intel Xeon E5-2650 v3 10C 2.3 GHz processors. The four additional systems deviated from the initial RA by having additional processor and memory capability; specifically, each system was configured with 512 GB of memory in the form of 16x 32 GB DIMMs and 2x Intel Xeon E5-2699 v3 18C 2.3 GHz processors.

5.2 Networking

The configuration included two switches: 1x Lenovo RackSwitch G8052 and 1x Lenovo RackSwitch G8272. The G8052 was configured flat and was used for xCAT management and IMM access. Clusters with fewer than 12 nodes may also consider the Lenovo RackSwitch G7028 as an option. The G8272 was also configured flat and received two connections from each system, in a bonded configuration. The G8272 was used for Hadoop and Spark traffic.

5.3 Test and validation

Spark batch and streaming workloads were both employed during the validation. The batch workload was the SparkPi Java application, included as an example workload within the Spark distribution. The streaming workload was the NetworkWordCount Python application, also included with the Spark distribution.

The Primary NameNode ran several instances of NetworkWordCount, listening on different ports. Other systems that were not participants in the Spark cluster established raw TCP connections to those ports and transmitted varying text data. The data provided to the NetworkWordCount processes came from two sources: from one system, a redirection of tcpdump output; from the other, a continuous stream of all of the non-binary files within the host's file system, resulting in 5.5 GB of text files. In multiple executions, resource assignments were handled by YARN and by varying static resource assignments.

Such an implementation could be representative of a real-world use case. Consider, for example, real-time sentiment processing. In one case, one process delivers social media entries to a Spark process, similar to NetworkWordCount, which catalogues the appearance rates of words. Such data can be used for sentiment analysis.
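The per-batch word counting that a NetworkWordCount-style process performs can be sketched in plain Python. This is not the Spark Streaming example itself: the function names are illustrative, and grouping lines into fixed-size batches is a simplifying assumption made here (Spark Streaming groups incoming data by time interval instead).

```python
from collections import Counter
from typing import Iterable, List


def count_words(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated words across a batch of text lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def count_in_batches(lines: List[str], batch_size: int) -> List[Counter]:
    """Split the input into consecutive batches and count words in each one."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        count_words(lines[start:start + batch_size])
        for start in range(0, len(lines), batch_size)
    ]


if __name__ == "__main__":
    received = [
        "spark is fast",
        "spark streams data",
        "yarn schedules spark",
    ]
    for number, counts in enumerate(count_in_batches(received, batch_size=2)):
        print(number, counts.most_common(1))
    # 0 [('spark', 2)]
    # 1 [('yarn', 1)]
```

For a sentiment use case, the per-batch counts would typically be filtered against a list of terms of interest before being aggregated over time.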


5.4 Deployment examples

Methods from the Lenovo Big Data Reference Architecture for Cloudera Distribution for Hadoop apply to general Apache Spark deployments as well; however, there are additional considerations. Conceptually, Apache Spark is similar in nature to high performance computing. Memory capacity must be considered carefully: to achieve maximum performance, both the execution and storage needs of a Spark application should fit fully in memory, although there continue to be performance benefits even when an application does not fully fit within memory. Disk access, for storage or caching, is very costly to Spark processing.

The memory capacity considerations are highly dependent on the application. To get an estimate, load an RDD of the desired dataset into cache and evaluate the memory consumption.
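Once a cached size has been observed, it can be compared with the memory available across the cluster. The helper below is a rough sketch under stated assumptions: it assumes the measurement was taken on a sample of the data and that cached size grows linearly with dataset size, and the function names and the per-node memory figures in the example are hypothetical, not values from the validation.

```python
def estimate_full_cache_gb(cached_sample_gb: float, sample_fraction: float) -> float:
    """Extrapolate the cached size of the full dataset from a cached sample.

    Assumes cached size grows linearly with dataset size, which may not
    hold for every data format or serialization setting.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    return cached_sample_gb / sample_fraction


def fits_in_cluster_memory(estimated_gb: float, node_count: int,
                           cache_gb_per_node: float) -> bool:
    """Check whether the estimated cached size fits in the combined memory
    that the nodes can dedicate to caching."""
    return estimated_gb <= node_count * cache_gb_per_node


if __name__ == "__main__":
    # Hypothetical measurement: a 25% sample occupies 30 GB when cached.
    full_gb = estimate_full_cache_gb(cached_sample_gb=30, sample_fraction=0.25)
    print(full_gb)                                    # 120.0
    print(fits_in_cluster_memory(full_gb, 4, 64))     # True
    print(fits_in_cluster_memory(full_gb, 4, 24))     # False
```

The memory that a node can dedicate to caching is smaller than its installed memory, because the operating system, other services, and Spark execution memory also need space.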

Generally, for workloads with high execution and storage requirements, capacity is the primary consideration. Additional considerations for memory configuration include the bandwidth and latency requirements. Applications with high transactional memory usage should focus on DIMM configurations that result in four DIMMs per channel using dual rank DIMMs. The following table provides ideal data node memory configurations for bandwidth/latency sensitive workloads:

Capacity   DIMM Description                                                      Feature   Quantity
128GB      16GB TruDDR4 Memory (2Rx4, 1.2V) PC4-17000 CL15 2133MHz LP RDIMM      A5B7      8
256GB      32GB TruDDR4 Memory (2Rx4, 1.2V) PC4-17000 CL15 2133MHz LP RDIMM      A5UJ      8
384GB      32GB TruDDR4 Memory (2Rx4, 1.2V) PC4-17000 CL15 2133MHz LP RDIMM      A5UJ      12
512GB      32GB TruDDR4 Memory (2Rx4, 1.2V) PC4-17000 CL15 2133MHz LP RDIMM      A5UJ      16
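Each capacity in the table is the DIMM size multiplied by the DIMM quantity. The following sketch encodes the four rows as data and checks that relationship; the variable and function names are illustrative only.

```python
from typing import List, Tuple

# (capacity_gb, feature_code, dimm_size_gb, quantity), taken from the table above.
MEMORY_CONFIGS: List[Tuple[int, str, int, int]] = [
    (128, "A5B7", 16, 8),
    (256, "A5UJ", 32, 8),
    (384, "A5UJ", 32, 12),
    (512, "A5UJ", 32, 16),
]


def total_capacity_gb(dimm_size_gb: int, quantity: int) -> int:
    """Total node memory for a given DIMM size and count."""
    return dimm_size_gb * quantity


def option_for(capacity_gb: int) -> Tuple[str, int]:
    """Return (feature_code, quantity) for a capacity listed in the table."""
    for capacity, feature, _size, quantity in MEMORY_CONFIGS:
        if capacity == capacity_gb:
            return feature, quantity
    raise KeyError(f"no table entry for {capacity_gb} GB")


if __name__ == "__main__":
    for capacity, feature, size, quantity in MEMORY_CONFIGS:
        assert total_capacity_gb(size, quantity) == capacity
        print(f"{capacity} GB = {quantity} x {size} GB ({feature})")
    print(option_for(512))   # ('A5UJ', 16)
```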

Similarly, processor selection may vary based on the desired level of parallelism for the workloads. For example, the Apache Spark documentation recommends 2-3 tasks per CPU core. Large working sets of data can drive memory constraints, which can be alleviated by further increasing parallelism, resulting in smaller input sets per task. In this case, higher core counts can be beneficial. Additionally, the type of operations needs to be considered, as they may be simple evaluations or complex algorithms.
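The 2-3 tasks per core guideline translates into a target task count for a cluster. The sketch below applies it to the two four-node clusters described in section 5.1; the function names are illustrative, only physical cores are counted (hyper-threading is ignored), and the 144,000 MB dataset size is a hypothetical value chosen to show how more tasks reduce the input per task.

```python
from typing import Tuple


def total_cores(nodes: int, sockets_per_node: int, cores_per_socket: int) -> int:
    """Physical cores in the cluster."""
    return nodes * sockets_per_node * cores_per_socket


def recommended_task_range(cores: int, low: int = 2, high: int = 3) -> Tuple[int, int]:
    """Task counts corresponding to `low` and `high` tasks per core."""
    return cores * low, cores * high


def input_mb_per_task(dataset_mb: float, tasks: int) -> float:
    """Average input size per task when the dataset is split evenly."""
    return dataset_mb / tasks


if __name__ == "__main__":
    baseline = total_cores(nodes=4, sockets_per_node=2, cores_per_socket=10)
    enhanced = total_cores(nodes=4, sockets_per_node=2, cores_per_socket=18)
    print(baseline, recommended_task_range(baseline))   # 80 (160, 240)
    print(enhanced, recommended_task_range(enhanced))   # 144 (288, 432)

    # Hypothetical 144,000 MB working set split across the enhanced cluster.
    low_tasks, _high_tasks = recommended_task_range(enhanced)
    print(input_mb_per_task(144_000, low_tasks))        # 500.0
```

With the higher core count, the same working set is divided into more and smaller pieces, which is the effect the paragraph above describes.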


6 Appendix: Bill of materials (optional)

This appendix includes the Bills of Materials (BOMs) for different configurations of hardware for the Big Data Solution for Apache Spark on Cloudera deployments. There is only a section for the data nodes. The BOM includes the part numbers, component descriptions, and quantities.

The BOM lists in this appendix are not meant to be exhaustive and must always be verified with the configuration tools. Any discussion of pricing, support, and maintenance options is outside the scope of this document.

This BOM information is for the United States; part numbers and descriptions can vary in other countries. Other sample configurations are available from your Lenovo sales team. Components are subject to change without notice.

6.1 BOM for compute servers

Quantity   PN        Description
1          5462AC1   Lenovo System x3650 M5
1          A5EY      System Documentation and Software-US English
1          A5GE      x3650 M5 12x 3.5" HS HDD Assembly Kit
1          A3YY      N2215 SAS/SATA HBA
1          A5EA      System x3650 M5 Planar
1          ARYJ      Intel Xeon Processor E5-2699 v3 18C 2.3GHz 45MB Cache 2133MHz 145W
1          ARYT      Intel Xeon Processor E5-2699 v3 18C 2.3GHz 45MB 2133MHz 145W
1          A5FX      System x Enterprise 2U Cable Management Arm (CMA)
1          A5FV      System x Enterprise Slides Kit
1          A5FF      System x3650 M5 12x 3.5" Base without Power Supply
1          ASQE      System x 1500W High Efficiency Platinum AC Power Supply
1          A5FN      System x3650 M5 PCIe Riser 1 (1 x16 FH/FL + 1 x8 FH/HL Slots)
1          A4Z6      Broadcom NetXtreme Dual Port 10GbE SFP+ Adapter
1          A2EC      Intel x520 Dual Port 10GbE SFP+ Adapter
1          ASQE      System x 1500W High Efficiency Platinum AC Power Supply
1          A3PN      Mellanox ConnectX-3 40GbE / FDR IB VPI Adapter
1          5977      Select Storage devices - no configured RAID required
6          A3W9      4TB 7.2K 6Gbps NL SATA 3.5" G2HS HDD
16         A5UJ      32GB TruDDR4 Memory (2Rx4, 1.2V) PC4-17000 CL15 2133MHz LP RDIMM
1          A3WF      3U Bracket for Mellanox ConnectX-3 FDR VPI IB/E Adapter
2          A52A      2U Bracket for Broadcom NetXtreme Dual Port 10GbE SFP+ Adapter
1          A5FC      System x3650 M5 WW Packaging
1          A5FH      System x3650 M5 Agency Label GBM
1          A5G5      System x3650 M5 Riser Bracket
1          A5FZ      System x3650 M5 Riser Filler
1          A5FT      System x3650 M5 Power Paddle Card
1          A5G1      System x3650 M5 EIA Plate
1          A5V5      System x3650 M5 Right EIA for Storage Dense Model
1          A5FM      System x3650 M5 System Level Code


Resources

For more information, see the following resources:

• Lenovo Big Data Reference Architecture for Cloudera Distribution for Hadoop:
  o Lenovo Press: lenovopress.com/tips1329
• Lenovo System x3650 M5 (Cloudera Data Node):
  o Product page: shop.lenovo.com/us/en/systems/servers/racks/systemx/x3650-m5/
  o Lenovo Press product guide: lenovopress.com/tips1193
• Lenovo System x3550 M5 (Cloudera Management Node):
  o Product page: shop.lenovo.com/us/en/systems/servers/racks/systemx/x3550-m5/
  o Lenovo Press product guide: lenovopress.com/tips1194
• Lenovo RackSwitch G8052 (1GbE Switch):
  o Product page: shop.lenovo.com/us/en/systems/browsebuy/%20rackswitch-g8052.html
  o Lenovo Press product guide: lenovopress.com/tips0813
• Lenovo RackSwitch G8272 (10GbE Switch):
  o Product page: shop.lenovo.com/us/en/systems/browsebuy/lenovo-rackswitch-g8272.html
  o Lenovo Press product guide: lenovopress.com/tips1267
• Lenovo XClarity Administrator:
  o Product page: shop.lenovo.com/us/en/servers/thinkserver/system-management/xclarity
  o Lenovo Press product guide: lenovopress.com/tips1200
• Cloudera:
  o Cloudera Distribution for Hadoop (CDH): cloudera.com/content/cloudera/en/products-and-services/cdh.html
  o Cloudera products and services: cloudera.com/content/cloudera/en/products-and-services.html
  o Cloudera solutions: cloudera.com/content/cloudera/en/solutions.html
  o Cloudera resources: cloudera.com/content/cloudera/en/resources.html
• Open source software:
  o Hadoop: hadoop.apache.org
  o Spark: spark.apache.org
  o Flume: flume.apache.org
  o HBase: hbase.apache.org
  o Hive: hive.apache.org
  o Hue: gethue.com
  o Impala: rideimpala.com
  o Oozie: oozie.apache.org
  o Mahout: mahout.apache.org
  o Pig: pig.apache.org
  o Sentry: entry.incubator.apache.org
  o Sqoop: sqoop.apache.org
  o Whirr: whirr.apache.org
  o ZooKeeper: zookeeper.apache.org
  o Parquet: parquet.apache.org
• xCat: xcat.sourceforge.net


Trademarks and special notices

© Copyright Lenovo 2016.

References in this document to Lenovo products or services do not imply that Lenovo intends to make them available in every country.

Lenovo, the Lenovo logo, ThinkCentre, ThinkVision, ThinkVantage, ThinkPlus and Rescue and Recovery are trademarks of Lenovo.

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel Inside (logos), MMX, and Pentium are trademarks of Intel Corporation in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Information is provided "AS IS" without warranty of any kind.

All customer examples described are presented as illustrations of how those customers have used Lenovo products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Information concerning non-Lenovo products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by Lenovo. Sources for non-Lenovo list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. Lenovo has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-Lenovo products. Questions on the capability of non-Lenovo products should be addressed to the supplier of those products.

All statements regarding Lenovo future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local Lenovo office or Lenovo authorized reseller for the full text of the specific Statement of Direction.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in Lenovo product announcements. The information is presented here to communicate Lenovo’s current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard Lenovo benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.

Photographs shown are of engineering prototypes. Changes may be incorporated in production models.

Any references in this information to non-Lenovo websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this Lenovo product and use of those websites is at your own risk.
