Hadoop Installation Guide

Hadoop Installation Guide (for Ubuntu- Trusty) v1.0, 25 Nov 2014

Naveen Subramani


'Hadoop' and the Hadoop logo are registered trademarks of the Apache Software Foundation. Read Hadoop's trademark policy here. 'Ubuntu', the Ubuntu logo and 'Canonical' are registered trademarks of Canonical. Read Canonical's trademark policy here. All other trademarks mentioned in the book belong to their respective owners.

This book is aimed at making it simple for a beginner to build a Hadoop cluster. It will be updated periodically based on suggestions, ideas, corrections, etc., from readers. Mail feedback to: [email protected]

Released under the Creative Commons Attribution-ShareAlike 4.0 International license. A brief description of the license. A more detailed license text.

Preface

About this guide

We have been working on Hadoop for quite some time. To share our knowledge of Hadoop, we wrote this guide to help people install Hadoop easily. This guide is based on a Hadoop installation on Ubuntu 14.04 LTS.

Target Audience

Our aim has been to provide a guide for beginners who are new to Hadoop implementation. Some familiarity with Big Data concepts is assumed.

Add these lines at the end of the .bashrc file:
HIVE_HOME="$HOME/hive-0.14.0"
HBASE_HOME="$HOME/hbase-0.98.8-hadoop2"
PIG_HOME="$HOME/pig-0.14.0"
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin

Press Esc, then type :wq to save and quit the vim editor.
$ source .bashrc


Chapter 2

Apache YARN Pseudo-Distributed Mode

2.1 Supported Modes

An Apache Hadoop cluster can be installed in one of three supported modes:
• Local (Standalone) Mode - Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
• Pseudo-Distributed Mode - each Hadoop daemon runs in a separate Java process on a single machine.
• Fully-Distributed Mode - a master/slave cluster setup where the daemons run on separate machines.

2.2 Pseudo-Distributed Mode

2.2.1 Requirements for Pseudo-Distributed Mode

• Ubuntu Server 14.04
• JDK 1.7
• Apache Hadoop 2.5.1 package
• ssh-server

2.2.2 Installation Notes

2.2.2.1 Get some tools

Before beginning the installation, let us update the Ubuntu packages to the latest contents and get some tools for editing:
$ sudo apt-get update
$ sudo apt-get install vim
$ cd

2.2.2.2 Install JDK 1.7

Running Apache Hadoop requires Java JDK 1.7. Install either OpenJDK 1.7 or Oracle JDK 1.7 on your Ubuntu machine; both installations are described below.

2.2.2.2.1 Installing OpenJDK 1.7

To install OpenJDK 1.7, follow these steps:
$ sudo apt-get install openjdk-7-jdk
$ vim .bashrc

Add these lines at the end of the .bashrc file:
export JAVA_HOME=/usr
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin

Press Esc, then type :wq to save and quit the vim editor.
$ source .bashrc
$ java -version
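The openjdk-7-jdk package installs the JDK under /usr/lib/jvm and symlinks the tools into /usr/bin, which is why JAVA_HOME=/usr works here. If you prefer to point JAVA_HOME at the real JDK directory, one way to find it (the path shown is what the 64-bit Ubuntu package typically uses) is:

$ readlink -f "$(which javac)"
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac
# JAVA_HOME would then be /usr/lib/jvm/java-7-openjdk-amd64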

2.2.2.2.2 Installing Oracle JDK 1.7

To install Oracle JDK 1.7, follow these steps:
$ wget https://dl.dropboxusercontent.com/u/24798834/Hadoop/jdk-7u51-linux-x64.tar.gz
$ tar xzf jdk-7u51-linux-x64.tar.gz
$ sudo mv jdk1.7.0_51 /opt/
$ vim .bashrc

Add these lines at the end of the .bashrc file:
export JAVA_HOME=/opt/jdk1.7.0_51
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin

Press Esc, then type :wq to save and quit the vim editor.
$ source .bashrc
$ java -version

Console output:
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

2.2.2.3 Setup passphraseless ssh

Set up password-less SSH access for the Hadoop daemons:
$ sudo apt-get install ssh
$ ssh-keygen -t rsa -P ""
$ ssh-copy-id -i ~/.ssh/id_rsa.pub localhost
$ ssh localhost
$ exit
$ cd
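To confirm that key-based login really works without falling back to a password prompt, you can force SSH's batch mode, which disables password authentication for that one command; a quick sketch:

$ ssh -o BatchMode=yes localhost 'echo passwordless ssh OK'
passwordless ssh OK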


2.2.2.4 Setting Up the Hadoop Package

Download and install the Hadoop 2.5.1 package in the home directory of the ubuntu user.
$ wget http://apache.cs.utah.edu/hadoop/common/stable/hadoop-2.5.1.tar.gz
$ tar xzf hadoop-2.5.1.tar.gz
$ cd hadoop-2.5.1/etc/hadoop

Ensure that JAVA_HOME is set in hadoop-env.sh and points to the Java installation you intend to use. You can set other environment variables in hadoop-env.sh to suit your requirements. Some of the default settings refer to the variable HADOOP_HOME; its value is automatically inferred from the location of the startup scripts: HADOOP_HOME is the parent directory of the bin directory that holds the Hadoop scripts (in this instance, $HADOOP_INSTALL/hadoop).
Configure JAVA_HOME in the hadoop-env.sh file by uncommenting the export JAVA_HOME= line and replacing it with the content below for OpenJDK:
export JAVA_HOME=/usr

Configure JAVA_HOME in the hadoop-env.sh file by uncommenting the export JAVA_HOME= line and replacing it with the content below for Oracle JDK:
export JAVA_HOME=/opt/jdk1.7.0_51
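If you prefer to make this change non-interactively, a small sed one-liner works too (shown for the Oracle JDK path; swap in /usr for OpenJDK, and it assumes the stock hadoop-env.sh where the line begins with export JAVA_HOME=):

$ cd ~/hadoop-2.5.1
$ sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/opt/jdk1.7.0_51|' etc/hadoop/hadoop-env.sh
$ grep '^export JAVA_HOME' etc/hadoop/hadoop-env.sh    # verify the change
export JAVA_HOME=/opt/jdk1.7.0_51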

2.2.2.5 Configuring core-site.xml

etc/hadoop/core-site.xml: Edit core-site.xml available in ($HADOOP_HOME/etc/hadoop/core-site.xml) with the contents below.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
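After saving core-site.xml you can ask Hadoop which filesystem URI it picked up; run this from the hadoop-2.5.1 directory (it may also print a warning that fs.default.name is deprecated in favour of fs.defaultFS, which is harmless here):

$ bin/hdfs getconf -confKey fs.default.name
hdfs://localhost:9000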

2.2.2.6 Configuring hdfs-site.xml

etc/hadoop/hdfs-site.xml: Edit hdfs-site.xml available in ($HADOOP_HOME/etc/hadoop/hdfs-site.xml) with the contents below.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/ubuntu/yarn/yarn_</value>
  </property>
</configuration>

2.2.2.7 Configuring mapred-site.xml

etc/hadoop/mapred-site.xml: Edit mapred-site.xml available in ($HADOOP_HOME/etc/hadoop/mapred-site.xml) with the contents below.

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Here mapreduce.framework.name specifies the MapReduce version to be used: MR1 (classic) or MR2 (YARN).

2.2.2.8 Configuring yarn-site.xml

etc/hadoop/yarn-site.xml: Edit yarn-site.xml available in ($HADOOP_HOME/etc/hadoop/yarn-site.xml) with the contents below.

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

2.2.3 Execution

Now that all the configuration has been done, the next step is to format the NameNode and start the Hadoop cluster. Format the NameNode using the command below, available in the $HADOOP_HOME directory.
$ bin/hadoop namenode -format


2.2.3.1 Start the Hadoop Cluster

Start the Hadoop cluster with the command below, available in the $HADOOP_HOME directory.
$ sbin/start-all.sh
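start-all.sh still works in Hadoop 2.x but is marked deprecated; it simply calls the two scripts below, which you can also run separately if you prefer to bring up HDFS and YARN one at a time:

$ sbin/start-dfs.sh    # NameNode, DataNode and SecondaryNameNode
$ sbin/start-yarn.sh   # ResourceManager and NodeManager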

2.2.3.2 Verify the Hadoop Cluster

After starting the Hadoop cluster you can verify the five Hadoop daemons using the jps tool, which lists the running Java processes with their PIDs. It should list all five daemons: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.

Chapter 3

Apache YARN Fully-Distributed Mode

3.1.2.2 Configuring core-site.xml

etc/hadoop/core-site.xml: Edit core-site.xml available in ($HADOOP_HOME/etc/hadoop/core-site.xml) with the contents below, replacing <master-ip> with the IP address or hostname of the master node.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<master-ip>:9000</value>
  </property>
</configuration>

3.1.2.3 Configuring hdfs-site.xml

etc/hadoop/hdfs-site.xml: Edit hdfs-site.xml available in ($HADOOP_HOME/etc/hadoop/hdfs-site.xml) with the contents below.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/ubuntu/yarn/yarn_</value>
  </property>
</configuration>

3.1.2.4 Configuring mapred-site.xml

etc/hadoop/mapred-site.xml: Edit mapred-site.xml available in ($HADOOP_HOME/etc/hadoop/mapred-site.xml) with the contents below.

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Here mapreduce.framework.name specifies the MapReduce version to be used: MR1 (classic) or MR2 (YARN).

3.1.2.5 Configuring yarn-site.xml

etc/hadoop/yarn-site.xml: Edit yarn-site.xml available in ($HADOOP_HOME/etc/hadoop/yarn-site.xml) with the contents below, replacing <master-ip> with the IP address or hostname of the master node.

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value><master-ip>:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value><master-ip>:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value><master-ip>:8040</value>
  </property>
</configuration>
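Every node in the cluster needs the same *-site.xml files as the master. A minimal sketch for pushing them out with scp (slave1 and slave2 are hypothetical hostnames; substitute your own, and the same $HADOOP_HOME layout is assumed on every node):

$ scp $HADOOP_HOME/etc/hadoop/*-site.xml ubuntu@slave1:$HADOOP_HOME/etc/hadoop/
$ scp $HADOOP_HOME/etc/hadoop/*-site.xml ubuntu@slave2:$HADOOP_HOME/etc/hadoop/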

3.1.3 Add Slave Node Details (for all machines)

After configuring the Hadoop config files, we have to add the list of slave machines to the slaves file located in the $HADOOP_HOME/etc/hadoop directory. Edit the file, remove the localhost entry and append one line per slave, using the appropriate IP address or hostname of each of your slave machines.

Chapter 6

Apache HBase Fully-Distributed Mode

Edit conf/hbase-site.xml in the HBase directory with the contents below.

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/ubuntu/yarn/hbase_data/zookeeper</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1.sample.com,node2.sample.com,node3.sample.com</value>
  </property>
</configuration>
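In fully-distributed mode HBase also reads the list of RegionServer hosts from conf/regionservers, one hostname per line (much like Hadoop's slaves file). A minimal sketch using the node names from this example, where node2 and node3 run RegionServers:

$ cat > conf/regionservers <<EOF
node2.sample.com
node3.sample.com
EOF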

Configure HBase to use node2 as a backup master

Edit or create the file conf/backup-masters and add a new line to it with the hostname for node2. In this demonstration, the hostname is node2.sample.com.

Note: Everywhere in your configuration that you have referred to node1 as localhost, change the reference to point to the hostname that the other nodes will use to refer to node1. In these examples, the hostname is node1.sample.com.

Prepare node2 and node3

Note: node2 will run a backup master server and a ZooKeeper instance; node3 will run a RegionServer and a ZooKeeper instance.

Download and unpack HBase

Download and unpack HBase on node2 and node3, just as you did for the standalone and pseudo-distributed installations.

Copy the configuration files from node1 to node2 and node3

Each node of your cluster needs to have the same configuration information. Copy the contents of the conf/ directory to the conf/ directory on node2 and node3.

Start HBase Cluster

Important:

Be sure HBase is not running on any node. Start HBase by running the shell script bin/start-hbase.sh. After it has started, the jps command should list the Java processes responsible for HBase (HMaster, HRegionServer, HQuorumPeer) on the various nodes. ZooKeeper starts first, followed by the master, then the RegionServers, and finally the backup masters.
$ bin/start-hbase.sh

Node1: jps output

5605 HMaster
5826 Jps
5545 HQuorumPeer

Node2: jps output

5605 HMaster
5826 Jps
5545 HQuorumPeer
5930 HRegionServer

Node3: jps output

5826 Jps
5545 HQuorumPeer
5930 HRegionServer

Browse to the Web UI

In HBase newer than 0.98.x, the HTTP ports used by the HBase Web UI changed from 60010 for the Master and 60030 for each RegionServer to 16010 for the Master and 16030 for the RegionServer. Since this guide installs HBase 0.98.8, once your installation is working you should be able to access the UI for the Master at http://node1.sample.com:60010/ or for the backup master at http://node2.sample.com:60010/ using a web browser. For debugging, kindly refer to the logs directory.


Get started with HBase Shell

After installing HBase, it is time to get started with the HBase shell. Fire up the HBase shell using the bin/hbase shell command.
$ bin/hbase shell

Create a table

Use the create command to create a table. You must specify the table name and a column family as arguments to the create command. In the command below we create a table called employeedb with the column family finance.
hbase> create 'employeedb', 'finance'
0 row(s) in 1.2200 seconds

List tables

Use the list command to list tables in HBase.
hbase> list 'employeedb'
TABLE
employeedb
1 row(s) in 0.0350 seconds
=> ["employeedb"]

Insert data in to table

Use the put command to insert data into a table in HBase.
hbase> put 'employeedb', 'row1', 'finance:name', 'Naveen'
0 row(s) in 0.1770 seconds
hbase> put 'employeedb', 'row2', 'finance:salary', 20000
0 row(s) in 0.0160 seconds
hbase> put 'employeedb', 'row3', 'finance:empid', 10124
0 row(s) in 0.0260 seconds

Scan the table for all data at once.

Use the scan command to list all contents of a table in HBase.
hbase> scan 'employeedb'
ROW    COLUMN+CELL
row1   column=finance:name, timestamp=1403759475114, value=Naveen
row2   column=finance:salary, timestamp=1403759492807, value=20000
row3   column=finance:empid, timestamp=1403759503155, value=10124
3 row(s) in 0.0440 seconds
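These shell commands can also be driven from bash instead of being typed interactively, which is handy for quick scripted checks; a small sketch, run from the HBase directory:

$ echo "scan 'employeedb'" | bin/hbase shell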

Get Particular row of data

Use the get command to fetch a single row from a table in HBase.
hbase> get 'employeedb', 'row1'
COLUMN          CELL
finance:name    timestamp=1403759475114, value=Naveen
1 row(s) in 0.0230 seconds

Delete a table

To delete a table in HBase you must first disable it; only then can you delete it. Use the disable command to disable the table (and the enable command to re-enable it), then the drop command to drop the table.
hbase> disable 'employeedb'
0 row(s) in 1.6270 seconds
hbase> drop 'employeedb'
0 row(s) in 0.2900 seconds

Exit from HBase Shell

Use the quit command to exit from the HBase shell.
hbase> quit

Stopping HBase

To stop HBase, use the stop-hbase.sh shell script in the bin folder.
$ ./bin/stop-hbase.sh
stopping hbase....................


Chapter 7

Apache Hive Installation

7.1 Hive Installation

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage.

7.1.1 Requirements

Hive requires that a JDK and Hadoop be installed. See the JDK installation section for Oracle JDK or OpenJDK installation, and the Hadoop installation section for Hadoop installation.

7.1.2 Installation Guide

Download the Hive package from the hive.apache.org site and extract it using the following commands. In this installation we use the default Derby database as the metastore.

Extract Hive Package

$ tar xzf apache-hive-0.14.0-bin.tar.gz
$ mv apache-hive-0.14.0-bin hive-0.14.0   # rename to match the HIVE_HOME path used below
$ cd hive-0.14.0

Add entries for HADOOP_HOME and HIVE_HOME in .profile or .bashrc

To add the .bashrc entries, follow these steps:
$ cd
$ vim .bashrc

Add these lines at the end of the .bashrc file:
export JAVA_HOME=/opt/jdk1.7.0_51
export HADOOP_HOME=/home/ubuntu/hadoop-2.5.1
export HIVE_HOME=/home/ubuntu/hive-0.14.0
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin

Press Esc, then type :wq to save and quit the vim editor.

Note:

We assume that Hadoop is installed in the home directory of the ubuntu user (i.e. /home/ubuntu/hadoop-2.5.1) and that Hive is installed in the home directory of the ubuntu user (i.e. /home/ubuntu/hive-0.14.0).
$ source .bashrc
$ java -version
Console output:
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

Start Hadoop Cluster

$ $HADOOP_HOME/sbin/start-all.sh

Jps to verify Hadoop Daemons

$ jps

Get started with Hive Shell

After installing Hive, it is time to get started with the Hive shell. Fire up the Hive shell using the bin/hive command.
$ $HIVE_HOME/bin/hive

Some sample commands

Try out some basic commands listed below; for detailed documentation kindly refer to the Apache documentation at hive.apache.org.
hive> CREATE DATABASE my_hive_db;
hive> DESCRIBE DATABASE my_hive_db;
hive> USE my_hive_db;
hive> DROP DATABASE my_hive_db;
hive> exit;
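HiveQL statements can also be run without entering the shell, using hive -e for an inline statement or hive -f for a script file (my_queries.hql below is just a hypothetical file name):

$ $HIVE_HOME/bin/hive -e 'SHOW DATABASES;'
$ $HIVE_HOME/bin/hive -f my_queries.hql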

Exit from Hive Shell

Use the quit command to exit from the Hive shell.
hive> quit;


Chapter 8

Apache Pig Installation

8.1 Pig Installation

Apache Pig is a platform for analyzing large data sets. Pig can be run in a distributed fashion on a cluster. Pig provides a high-level language called Pig Latin for expressing data analysis programs. Pig is similar to Hive in that it offers data-warehouse-style processing: using Pig Latin we can analyze, filter and extract data sets. Pig internally converts Pig Latin commands into MapReduce jobs that execute against HDFS to retrieve the data sets.

8.1.1 Requirements

Pig requires that a JDK and Hadoop be installed. See the JDK installation section for Oracle JDK or OpenJDK installation, and the Hadoop installation section for Hadoop installation. Ensure the HADOOP_HOME entry is set in .bashrc.

8.1.2 Installation Guide

Download the Pig package from the pig.apache.org site and extract it using the following commands.

Extract Pig Package

$ tar xzf pig-0.14.0.tar.gz
$ cd pig-0.14.0

Add an entry for PIG_HOME in .profile or .bashrc

For adding the .bashrc entry, kindly follow the .bashrc entry section above.

Start Hadoop Cluster

$ $HADOOP_HOME/sbin/start-all.sh

Jps to verify Hadoop Daemons

$ jps

Execution Modes

Local Mode: To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).

Mapreduce Mode: To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
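For example (myscript.pig is just a placeholder for a Pig Latin script of your own):

$ $PIG_HOME/bin/pig -x local               # grunt shell against the local filesystem
$ $PIG_HOME/bin/pig -x mapreduce           # grunt shell against the Hadoop cluster (default)
$ $PIG_HOME/bin/pig -x local myscript.pig  # run a script file instead of the interactive shell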

Get started with Pig Shell

After installing Pig, it is time to get started with the Pig (grunt) shell. Fire up the Pig shell using the bin/pig command.
$ $PIG_HOME/bin/pig

Some Example pig commands

Try out some basic commands listed below; for detailed documentation kindly refer to the Apache documentation at pig.apache.org. Invoke the grunt shell by typing the pig command (in local or hadoop mode). Then enter the Pig Latin statements interactively at the grunt prompt (be sure to include the semicolon after each statement). The DUMP operator displays the results on your terminal screen; the STORE operator stores the results in HDFS.

Note:

To use the commands below, kindly download the employee dataset from this URL and upload it to HDFS as /dataset/employee.csv, and download manager.csv from this URL and upload it to HDFS as /dataset/manager.csv.

Upload datasets to HDFS:

$ $HADOOP_HOME/bin/hadoop dfs -mkdir -p /dataset
$ $HADOOP_HOME/bin/hadoop dfs -copyFromLocal ~/Downloads/employee.csv /dataset/employee.csv
$ $HADOOP_HOME/bin/hadoop dfs -copyFromLocal ~/Downloads/manager.csv /dataset/manager.csv
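You can confirm that both files landed where the examples below expect them:

$ $HADOOP_HOME/bin/hadoop dfs -ls /dataset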

Select employee id and name from employee dataset

grunt> A = load '/dataset/employee.csv' using PigStorage(',');
grunt> E = foreach A generate $0, $1;
grunt> dump E;

select * from employee where country=’China’

grunt> A = load '/dataset/employee.csv' using PigStorage(',') as (eid:int, emp_name:chararray, country:chararray, salary:int);
grunt> F = filter A by country == 'China';
grunt> dump F;


select country, sum(salary) as tot_sal from employee group by country order by tot_sal

grunt> A = load '/dataset/employee.csv' using PigStorage(',') as (eid:int, emp_name:chararray, country:chararray, salary:int);
grunt> D = group A by country;
grunt> X = FOREACH D GENERATE group, SUM(A.salary) as tot_sal;
grunt> decorder = order X by tot_sal desc;
grunt> dump decorder;

Not(SQL) functionality in pig

grunt> A = load '/dataset/employee.csv' using PigStorage(',') as (eid:int, emp_name:chararray, country:chararray, salary:int);
grunt> F = filter A by not emp_name matches 'Anna.*';
grunt> dump F;

LIMIT(SQL) functionality in pig

grunt> A = load '/dataset/employee.csv' using PigStorage(',') as (eid:int, emp_name:chararray, country:chararray, salary:int);
grunt> X = limit A 5;
grunt> dump X;

select count(*) from employee

grunt> A = load '/dataset/employee.csv' using PigStorage(',');
grunt> F = GROUP A ALL;
grunt> tot_rec = foreach F generate COUNT_STAR(A);
grunt> dump tot_rec;

Join two tables by common key

grunt> A = load '/dataset/employee.csv' using PigStorage(',');
grunt> B = load '/dataset/manager.csv' using PigStorage(',');
grunt> C = join A by $4, B by $0;
grunt> dump C;

Group(SQL) functionality in pig

grunt> A = load '/dataset/employee.csv' using PigStorage(',');
grunt> B = foreach A generate $2, $3;
grunt> C = group B by $0;
grunt> D = foreach C generate group as uniquekey;
grunt> dump D;

Distinct(SQL) functionality in pig

grunt> A = load '/dataset/employee.csv' using PigStorage(',');
grunt> B = foreach A generate $2;
grunt> C = distinct B;
grunt> dump C;

Join two tables by common key ’sunil’

grunt> A = load '/dataset/employee.csv' using PigStorage(',');
grunt> B = load '/dataset/manager.csv' using PigStorage(',');
grunt> C = join A by $4, B by $0 using 'sunil';
grunt> dump C;
grunt> quit

To quit from the Pig shell, use the quit command.

Command List:

Load and save:
  LOAD               - Loads data from the filesystem or other storage into a pig relation
  STORE              - Saves a pig relation to the filesystem or other storage, e.g. HDFS
  DUMP               - Prints a pig relation to the console

Filtering:
  FILTER             - Removes unwanted rows from a pig relation
  DISTINCT           - Removes duplicate rows from a pig relation
  FOREACH...GENERATE - Adds or removes fields from a pig relation
  STREAM             - Transforms a pig relation using an external program
  SAMPLE             - Selects a random sample of a pig relation

Grouping and joining:
  JOIN               - Joins two or more relations
  COGROUP            - Groups the data in two or more pig relations
  GROUP               - Groups the data in a single pig relation
  CROSS              - Creates the cross-product of two or more pig relations

Sort, combine, split:
  ORDER              - Sorts a pig relation by one or more fields
  LIMIT              - Limits the size of a pig relation to a maximum number of tuples
  UNION              - Combines two or more pig relations into one
  SPLIT              - Splits a pig relation into two or more pig relations

Table 8.1: Pig Operators


Chapter 9

Apache Spark Installation

9.1 Apache Spark Installation

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

9.1.1 Requirements

Apache Spark requires that a JDK and Hadoop be installed. See the JDK installation section for Oracle JDK or OpenJDK installation, and the Hadoop installation section for Hadoop installation. Ensure the HADOOP_HOME entry is set in .bashrc.

9.1.2 Installation Guide

Download the Spark package built for Hadoop 1 from the spark.apache.org site and extract it using the following commands. Choose the appropriate Hadoop version. Note: this guide sets up Apache Spark using the Hadoop 1 build.

Extract Spark Package

$ tar xzf spark-1.1.0-bin-hadoop1.tgz
$ cd spark-1.1.0-bin-hadoop1

Start Hadoop Cluster

$ $HADOOP_HOME/sbin/start-all.sh

Jps to verify Hadoop Daemons

$ jps

Execution Modes

Spark can be run in two execution modes: Local Mode and Cluster Mode, demonstrated below.

Running word count in Local Mode:

Note: before running these commands, Hadoop should already be running.
$ bin/spark-shell
scala> var file = sc.textFile("hdfs://localhost:9000/input/file")
scala> var count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
scala> count.saveAsTextFile("hdfs://localhost:9000/sparkdata/wcout")
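saveAsTextFile writes the result as a directory of part files in HDFS; assuming the usual part-NNNNN naming, you can inspect the output afterwards from the command line:

$ $HADOOP_HOME/bin/hadoop dfs -ls /sparkdata/wcout
$ $HADOOP_HOME/bin/hadoop dfs -cat /sparkdata/wcout/part-*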

For Web UI

For webHDFS: http://localhost:50070
For SparkUI: http://localhost:8080

Running word count in Cluster Mode:

To launch a Spark standalone cluster with the launch scripts, you need to create a file called conf/slaves in your Spark directory, which should contain the hostnames of all the machines where you would like to start Spark workers, one per line. The master machine must be able to access each of the slave machines via password-less ssh (using a private key). Once you have set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop's deploy scripts and available in SPARK_HOME/sbin:
• sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
• sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
• sbin/start-all.sh - Starts both a master and a number of slaves as described above.
• sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
• sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
• sbin/stop-all.sh - Stops both the master and the slaves as described above.
Note:

these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
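As a rough sketch of the whole sequence (worker1.sample.com and worker2.sample.com are placeholder hostnames, SPARK_HOME is assumed to point at the extracted spark-1.1.0-bin-hadoop1 directory, and 7077 is the standalone master's default port):

$ cd $SPARK_HOME
$ cat > conf/slaves <<EOF
worker1.sample.com
worker2.sample.com
EOF
$ sbin/start-all.sh
$ bin/spark-shell --master spark://$(hostname -f):7077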