Multinode cluster setup in Hadoop 2.x

It’s been quite some time since I wanted to join the distributed processing bandwagon, and I finally got my lazy self to do something about it and started investing some time to learn and experiment with a few technologies – some old, some not so old and some new. The first of these was Hadoop.

So naturally, the next step was to set up Hadoop as a cluster… The setup process, contrary to any misgivings I may have had, was quite simple and straightforward – all that needed to be done was to follow a series of steps –

  • First off, choosing the cluster configuration – I decided to use a cluster with one name node/resource manager and three data nodes/node managers. For simplicity, let’s call them hadoopmasternode, hadoopdatanode1, hadoopdatanode2 and hadoopdatanode3
  • Once I had all 4 RHEL systems in place, the second step was to download the latest stable release of Hadoop 2.x – which happened to be 2.6 at the time of writing this post… The downloaded tar.gz archive was extracted to the /opt/hadoop folder
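    • The download and extraction can be done roughly as below (the URL assumes the 2.6.0 tarball from the Apache archive – pick the release and mirror that apply to you):

      wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
      tar -xzf hadoop-2.6.0.tar.gz
      mv hadoop-2.6.0 /opt/hadoop
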
  • Hadoop also needs a JDK to be present, which can easily be downloaded from the Oracle Java download site – which in my case happened to be JDK 8
  • Next, update the /etc/hosts file on all the systems to include all the cluster nodes
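    • The /etc/hosts entries would look something like the below (the IP addresses are placeholders – substitute the actual addresses of your machines):

      192.168.1.10    hadoopmasternode
      192.168.1.11    hadoopdatanode1
      192.168.1.12    hadoopdatanode2
      192.168.1.13    hadoopdatanode3
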
  • It is better to have a separate user for running hadoop – so create a new user using the commands “useradd -U -m hadoopuser” and “usermod -g root hadoopuser”
  • Now that the hadoop user is created, it is time to make this user the owner of hadoop files – “chown -R hadoopuser:hadoopuser /opt/hadoop”
  • Log in as hadoopuser (“su - hadoopuser”) and edit/update the hadoop environment variables for hadoopuser
    • The bash shell needs to be updated with the hadoop variables, for which we need to edit ~/.bashrc (“vi ~/.bashrc”) and append the below to the file

      export JAVA_HOME=<JAVA_HOME> (e.g. /usr/java/jdk1.8.0_40/)
      export HADOOP_HOME=/opt/hadoop
      export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
      export PATH=$PATH:$HADOOP_HOME/bin
      export PATH=$PATH:$HADOOP_HOME/sbin
      export YARN_HOME=$HADOOP_HOME
      export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
      export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
      
    • Next, update the JAVA_HOME variable with the path to the Java install location in the hadoop settings file hadoop-env.sh under the /opt/hadoop/etc/hadoop folder, along these lines –
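
      # in /opt/hadoop/etc/hadoop/hadoop-env.sh – the path below is the JDK 8 example
      # location from earlier; point it to wherever your JDK is actually installed
      export JAVA_HOME=/usr/java/jdk1.8.0_40/
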
  • Once the settings are updated, source the file by running the command “source ~/.bashrc” so that the new variables take effect
  • Now that the hadoop environment settings are in place, the next step is to update the hadoop and yarn configuration for the cluster, which is basically a bunch of XML files present in the $HADOOP_HOME/etc/hadoop folder
    • First, edit core-site.xml and provide the namenode details –

      <property>
         <name>fs.defaultFS</name>
         <value>hdfs://hadoopmasternode:9000</value>
      </property>
      
    • Next, update yarn-site.xml file with Yarn specific configurations –

      <property>
         <name>yarn.nodemanager.aux-services</name>
         <value>mapreduce_shuffle</value>
      </property>
      <property>
         <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
         <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
         <name>yarn.resourcemanager.resource-tracker.address</name>
         <value>hadoopmasternode:9010</value>
      </property>
      <property>
         <name>yarn.resourcemanager.scheduler.address</name>
         <value>hadoopmasternode:9020</value>
      </property>
      <property>
         <name>yarn.resourcemanager.address</name>
         <value>hadoopmasternode:9030</value>
      </property>
      
    • Copy the mapred-site.xml.template file as mapred-site.xml and then mark Yarn as the mapreduce framework by adding the following properties to the mapred-site.xml file –

      <property>
         <name>mapreduce.framework.name</name>
         <value>yarn</value>
      </property>
      <property>
         <name>mapred.job.tracker</name>
         <value>hadoopmasternode:9040</value>
      </property>
      
  • Please note, these steps need to be replicated on all the nodes of the cluster
  • Once all the nodes are ready, create the namenode folder on hadoopmasternode –
    • Run command “mkdir -p /opt/hadoop/hdfs_data/namenode” to create the namenode directory
    • Update the hadoop configuration to indicate the namenode folder and the replication factor (3, to match the three data nodes) by editing the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file and including the below properties –

      <property>
         <name>dfs.replication</name>
         <value>3</value>
      </property>
      <property>
         <name>dfs.namenode.name.dir</name>
         <value>file:/opt/hadoop/hdfs_data/namenode</value>
      </property>
      
  • Next, on hadoopmasternode, list the master and slave node details one per line in the $HADOOP_HOME/etc/hadoop/masters and $HADOOP_HOME/etc/hadoop/slaves files respectively (make sure you create the files if they do not exist…).
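    • With the cluster described above, the two files would contain something like –

      masters:
        hadoopmasternode

      slaves:
        hadoopdatanode1
        hadoopdatanode2
        hadoopdatanode3
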
  • Hadoop needs to be able to communicate from the master node to the data nodes via SSH without being asked for password authentication. In order to achieve this, the data nodes need to have the namenode’s public key added to their authorized keys… This can be done by the following steps –
    • Use the command “ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa” to generate a key pair
    • Add this key to the list of authorized keys by running “cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys”
    • Run the command “ssh-copy-id -i ~/.ssh/id_rsa.pub hadoopuser@hadoopdatanode1” , “ssh-copy-id -i ~/.ssh/id_rsa.pub hadoopuser@hadoopdatanode2” and “ssh-copy-id -i ~/.ssh/id_rsa.pub hadoopuser@hadoopdatanode3” to ensure that communication between hadoopmasternode and all three data nodes is authorized
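    • To verify that the passwordless login works, try SSH-ing into one of the data nodes from hadoopmasternode (for example “ssh hadoopuser@hadoopdatanode1”) – it should log you in without prompting for a password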
  • Next, on each of the data nodes, create the datanode folder – “mkdir -p $HADOOP_HOME/hdfs_data/datanode” – and update the hadoop configuration to point to the created folder by editing $HADOOP_HOME/etc/hadoop/hdfs-site.xml and adding the following properties –
      <property>
         <name>dfs.replication</name>
         <value>3</value>
      </property>
      <property>
         <name>dfs.datanode.data.dir</name>
         <value>file:/opt/hadoop/hdfs_data/datanode</value>
      </property>
      
  • Once all the datanodes are ready, switch back to hadoopmasternode and format the hdfs filesystem by running the command “$HADOOP_HOME/bin/hdfs namenode -format -clusterId HADOOP_CLUSTER”, which will create a cluster with the id HADOOP_CLUSTER
  • Once the cluster is formatted, start the hdfs filesystem and the yarn resource manager by running the commands “$HADOOP_HOME/sbin/start-dfs.sh” and “$HADOOP_HOME/sbin/start-yarn.sh” respectively
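  • A quick sanity check is to run “jps” on each node (jps ships with the JDK) – hadoopmasternode should typically show the NameNode, SecondaryNameNode and ResourceManager processes, while each data node should show DataNode and NodeManager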
  • After the services are started, cluster health can be checked at http://hadoopmasternode:50070/dfshealth.html#tab-overview
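  • The state of the individual data nodes can also be checked from the command line by running “$HADOOP_HOME/bin/hdfs dfsadmin -report” on hadoopmasternode, which prints the capacity and status of each data node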

At any point, in case there is a need to shut down the resource manager and the filesystem, run the scripts $HADOOP_HOME/sbin/stop-yarn.sh and $HADOOP_HOME/sbin/stop-dfs.sh respectively.

Now the hadoop cluster is ready for use…
