Monday, October 24, 2022

Creating a three-node Hadoop cluster using Ubuntu OS (Apr 2020)

Dated: 28 Apr 2020
Note about the setup: We are running the Ubuntu guest OSes on top of Windows via VirtualBox.

1. Setting the hostname in the three guest OSes

$ sudo gedit /etc/hostname

The hostnames for the three machines are master, slave1, and slave2.

ON MASTER (Host OS IP: 192.168.1.12)

$ cat /etc/hosts
192.168.1.12 master
192.168.1.3  slave1
192.168.1.4  slave2

2. ON SLAVE2 (Host OS IP: 192.168.1.4)

$ cat /etc/hostname
slave2

$ cat /etc/hosts
192.168.1.12 master
192.168.1.3  slave1
192.168.1.4  slave2

3. FOLLOW THE STEPS MENTIONED FOR SLAVE2 ON SLAVE1 AS WELL (Host OS IP: 192.168.1.3)
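
To confirm the hostname setup, you can check that each node resolves the other two by name. This is just a quick sanity check, assuming the /etc/hosts entries above are in place; for example, from master:

$ ping -c 1 slave1   # should resolve to 192.168.1.3
$ ping -c 1 slave2   # should resolve to 192.168.1.4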

4. Configuring Key-Based Login

Set up SSH on every node so that the nodes can communicate with one another without being prompted for a password. Check this link for the detailed steps: Steps of Doing SSH Setup
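
The linked article has the full procedure; the sketch below shows one common way to do it, assuming the same user account (for example "ashish", the user used later in the scp step) exists on master, slave1, and slave2:

$ ssh-keygen -t rsa              # generate a key pair on master; accept the defaults

# Copy the public key to every node, including master itself
$ ssh-copy-id ashish@master
$ ssh-copy-id ashish@slave1
$ ssh-copy-id ashish@slave2

$ ssh ashish@slave1 hostname     # should print "slave1" without asking for a password

Repeat the key generation and copying on slave1 and slave2 if those nodes also need passwordless access to the others.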

5. Setting up ".bashrc" on each system (master, slave1, slave2)

$ sudo gedit ~/.bashrc

Add the lines below at the end of the file.

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export YARN_HOME=/usr/local/hadoop
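
After saving the file, reload it in the current shell so the new variables take effect. (JAVA_HOME is assumed to be set already as part of your Java installation; this is only a quick verification.)

$ source ~/.bashrc
$ echo $HADOOP_HOME    # should print /usr/local/hadoop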

6. Follow all the nine steps from the article below to set up Hadoop on the "master" machine

Getting started with Hadoop on Ubuntu in VirtualBox

On "master"

7. Set NameNode Location

Update your $HADOOP_HOME/etc/hadoop/core-site.xml file to set the NameNode location to master on port 9000.

$HADOOP_HOME: /usr/local/hadoop

Code:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

8. Set path for HDFS

Edit the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file to resemble the following configuration:

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

9. Set YARN as Job Scheduler

Edit the mapred-site.xml file, setting YARN as the default framework for MapReduce operations.

$HADOOP_HOME/etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
</configuration>

10. Configure YARN

Edit yarn-site.xml, which contains the configuration options for YARN. In the value field for yarn.resourcemanager.hostname, use the IP address of "master" (192.168.1.12 in this setup).

$HADOOP_HOME/etc/hadoop/yarn-site.xml

<configuration>
  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.1.12</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

11. Configure Workers

The workers file is used by the startup scripts to start the required daemons on all nodes. Edit $HADOOP_HOME/etc/hadoop/workers to include both of the worker nodes:

slave1
slave2
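
A simple way to verify the file is to print it back; it should contain exactly the two worker hostnames:

$ cat $HADOOP_HOME/etc/hadoop/workers
slave1
slave2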

12. Configure Memory Allocation (Two steps)

A) Edit $HADOOP_HOME/etc/hadoop/yarn-site.xml and add the following lines:

$ sudo gedit $HADOOP_HOME/etc/hadoop/yarn-site.xml

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1536</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>1536</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

B) Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml and add the following lines:

$ sudo gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>512</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>256</value>
</property>

13. Duplicate Config Files on Each Node

Copy the Hadoop configuration files to the worker nodes:

$ scp -r /usr/local/hadoop/etc/* ashish@slave1:/usr/local/hadoop/etc/
$ scp -r /usr/local/hadoop/etc/* ashish@slave2:/usr/local/hadoop/etc/

After copying the contents of the Hadoop "etc" directory, modify the following file on each destination node so that it contains the correct JAVA_HOME for that node:

/usr/local/hadoop/etc/hadoop/hadoop-env.sh
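
For example, the JAVA_HOME line in hadoop-env.sh on a worker might look like the line below; the exact JDK path is only an assumption and depends on how Java was installed on that node:

# In /usr/local/hadoop/etc/hadoop/hadoop-env.sh (example path, adjust per node)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64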

14. Format HDFS

HDFS needs to be formatted like any classical file system. On "master", run the following command:

$ hdfs namenode -format

Your Hadoop installation is now configured and ready to run.
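
If the format succeeds, the NameNode metadata directory configured in hdfs-site.xml should now exist on master. A quick check, assuming the dfs.namenode.name.dir value used above:

$ ls /home/hadoop/data/nameNode/current
# expect files such as VERSION, seen_txid, and an fsimage_* file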

15. Start and Stop HDFS

Start the HDFS by running the following script from master:

/usr/local/hadoop/sbin/start-dfs.sh

This will start NameNode and SecondaryNameNode on master, and DataNode on slave1 and slave2, according to the configuration in the workers config file.

Check that every process is running with the jps command on each node. On master, you should see the following (the PID numbers will be different):

21922 Jps
21603 NameNode
21787 SecondaryNameNode

And on slave1 and slave2 you should see the following:

19728 DataNode
19819 Jps

To stop HDFS on the master and worker nodes, run the following command from master:

stop-dfs.sh

16. Monitor your HDFS Cluster

Point your browser to http://master:9870/dfshealth.html, where "master" is the hostname mapped in /etc/hosts to the IP address of your master node, and you'll get a user-friendly monitoring console.
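
The same information is also available from the command line on master; this is an optional extra check, not part of the original steps:

$ hdfs dfsadmin -report
# prints configured capacity and one entry per live DataNode (slave1 and slave2)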

Tags: Technology, Big Data
