INSTALLING HADOOP ON A MULTI-NODE CLUSTER

Hello everyone! In this blog I am going to show how to install Hadoop on a multi-node cluster. Follow the steps below to install Hadoop on Ubuntu systems.

PREREQUISITES FOR INSTALLATION

Install Java

Java is the main prerequisite for Hadoop. You can download Java from the link below:

http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
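Alternatively, on Ubuntu you can install OpenJDK 7 straight from the package manager (the exact package name may vary by Ubuntu release) and then verify it:

$ sudo apt-get update

$ sudo apt-get install openjdk-7-jdk

# Verify the Java installation
$ java -version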

Mapping the nodes

To map the nodes, edit the /etc/hosts file on every node with the following command.

$ sudo vi /etc/hosts


Add the following hostnames and their IPs to the /etc/hosts file:
192.168.2.14 Hadoopmaster
192.168.2.15 Hadoopnode1
192.168.2.16 Hadoopnode2
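You can confirm the mapping works by pinging each node by hostname from every machine, for example:

$ ping -c 2 Hadoopmaster

$ ping -c 2 Hadoopnode1

$ ping -c 2 Hadoopnode2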


Creating User Account

Create a dedicated system user account on both the master and the slave systems for the Hadoop installation. The following commands create the group and the user account:

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser

Add the following line to /etc/sudoers (edit it with sudo visudo to avoid syntax errors):

hduser ALL=(ALL:ALL) ALL

Set up SSH on every node so that the nodes can communicate with one another without any password prompt.

# First log in as hduser (and from now on use only the hduser account for the remaining steps)

$ sudo su hduser

# Generate ssh key for hduser account

$ ssh-keygen -t rsa -P ""


# Configure password-less SSH

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
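For the master to start daemons on the slaves, it must also be able to SSH to them without a password. A minimal sketch, assuming hduser exists on every node and the hostnames were mapped in /etc/hosts (run this on the master):

$ ssh-copy-id hduser@Hadoopnode1

$ ssh-copy-id hduser@Hadoopnode2

# Verify that login works without a password prompt
$ ssh hduser@Hadoopnode1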

Download Hadoop

Download any stable version of Hadoop from the Apache archives and untar the file, then move the Hadoop package to your home directory. The following command downloads Hadoop 2.6.0:

$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
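Then extract the archive and move it into place. The target path below is an assumption, chosen to match the HADOOP_HOME used later in this guide:

$ tar -xzf hadoop-2.6.0.tar.gz

$ mv hadoop-2.6.0 /home/hduser/hadoop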

Now edit the following files in the Hadoop configuration directory (etc/hadoop inside the Hadoop package) to configure the multi-node cluster.

a. Edit yarn-site.xml using the command

$ sudo gedit yarn-site.xml

Add the following entry to the file, then save and close it:

<configuration>

  <!-- Site-specific YARN configuration properties -->

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

</configuration>
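On a multi-node cluster you will typically also want the NodeManagers on the slaves to find the ResourceManager on the master. A hedged example, assuming the ResourceManager runs on Hadoopmaster as mapped in /etc/hosts:

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>Hadoopmaster</value>
  </property>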

b. Edit core-site.xml using the command

$ sudo gedit core-site.xml

Add the following entry to the file, then save and close it:

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://Hadoopmaster:54310</value>
    <description>
      The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's authority is used to determine the host, port, etc. for a filesystem. On a multi-node cluster this must point at the master's hostname, not localhost.
    </description>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>

</configuration>
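The directory named in hadoop.tmp.dir must exist on every node and be writable by hduser; for example:

$ mkdir -p /home/hduser/tmp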

c. edit mapred-site.xml

i. First create mapred-site.xml from its template by copying it:

$ cp mapred-site.xml.template mapred-site.xml

ii. Then edit mapred-site.xml using the command

$ sudo gedit mapred-site.xml

Add the following entry to the file, then save and close it:

<configuration>

  <property>
    <name>mapred.job.tracker</name>
    <value>Hadoopmaster:54311</value>
    <description>
      The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.
    </description>
  </property>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

</configuration>

d. Edit hdfs-site.xml using the command

$ sudo gedit hdfs-site.xml

Add the following entry to the file, then save and close it:

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>
      Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.
    </description>
  </property>

</configuration>
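Two multi-node details worth noting here: with two slave nodes you may prefer dfs.replication set to 2 rather than 1, and the master's slaves file must list the machines that should run the DataNode and NodeManager daemons. A minimal sketch, assuming the hostnames mapped earlier and the configuration directory used above:

$ cd /home/hduser/hadoop/etc/hadoop

$ echo "Hadoopnode1" > slaves

$ echo "Hadoopnode2" >> slaves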

e. Update hduser's .bashrc file using the command

$ gedit ~/.bashrc

Add the following entry to the file, then save and close it:

# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hduser/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME

# Native path
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

# Java path
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

# Add Hadoop bin/ and sbin/ directories to PATH
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
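Reload the file so the changes take effect in the current shell, and do a quick sanity check that the hadoop command resolves:

$ source ~/.bashrc

$ hadoop version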

Formatting the HDFS filesystem via the NameNode

Before we start our new multi-node cluster, we must format Hadoop's distributed filesystem (HDFS) via the NameNode. You only need to do this the first time you set up a Hadoop cluster; reformatting the NameNode of a running cluster erases all data in HDFS. Run the following command on the master to format the NameNode:

$ hdfs namenode -format

Then start the Hadoop multi-node cluster from the master. In Hadoop 2.x the start scripts live in sbin/ (already on the PATH from the .bashrc changes above), and since YARN was configured, start it as well:

$ start-dfs.sh

$ start-yarn.sh

Check whether the services are running using the command

$ jps
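If everything started correctly, jps on the master should list roughly the following daemons (the exact set depends on where you run DataNodes), while each slave should show DataNode and NodeManager:

NameNode
SecondaryNameNode
ResourceManager
Jps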

To stop the running Hadoop cluster, use the commands

$ stop-yarn.sh

$ stop-dfs.sh


Thanks for reading the blog; I hope it helps. Please leave your comments below.
