Dated: 28 Apr 2020

Note about the setup: We are running the Ubuntu guest OSs on top of Windows via VirtualBox.

1. Setting the hostname in the three Guest OSs
$ sudo gedit /etc/hostname

The hostnames for the three machines are master, slave1, and slave2.

ON MASTER (Host OS IP: 192.168.1.12)
$ cat /etc/hosts
192.168.1.12 master
192.168.1.3  slave1
192.168.1.4  slave2

2. ON SLAVE2 (Host OS IP: 192.168.1.4)
$ cat /etc/hostname
slave2

$ cat /etc/hosts
192.168.1.12 master
192.168.1.3  slave1
192.168.1.4  slave2

3. REPEAT THE STEPS SHOWN FOR SLAVE2 ON SLAVE1 (Host OS IP: 192.168.1.3)
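Once all three machines are set up, a quick sanity check (not part of the original write-up) is to confirm from each machine that the three hostnames resolve to the addresses listed in /etc/hosts, for example:

$ ping -c 1 master
$ ping -c 1 slave1
$ ping -c 1 slave2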
4. Configuring Key-Based Login
Set up SSH on every node so that the nodes can communicate with one another without being prompted for a password. Check this link: Steps of Doing SSH Setup
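The linked article has the full steps; as a rough sketch, one common way (assuming the same user, e.g. ashish as used later for scp, exists on all three machines) is to generate a key pair on master and copy the public key to every node, including master itself:

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
$ ssh-copy-id ashish@master
$ ssh-copy-id ashish@slave1
$ ssh-copy-id ashish@slave2

After this, running ssh slave1 from master should log in without a password prompt.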
5. Setting up ".bashrc" on each system (master, slave1, slave2)

$ sudo gedit ~/.bashrc

Add the lines below at the end of the file:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export YARN_HOME=/usr/local/hadoop
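As an optional check (not in the original steps), you can reload the file in the current shell and confirm that the hadoop binary is on the PATH; this assumes Hadoop has already been extracted to /usr/local/hadoop, as done in the next step:

$ source ~/.bashrc
$ hadoop version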
6. Follow all nine steps from the article below to set up Hadoop on the "master" machine

Getting started with Hadoop on Ubuntu in VirtualBox

On "master":
7. Set NameNode Location
Update your $HADOOP_HOME/etc/hadoop/core-site.xml file to set the NameNode location to master on port 9000 ($HADOOP_HOME is /usr/local/hadoop):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
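As an optional check (an addition, not part of the original article), you can ask Hadoop which filesystem URI it picked up; fs.default.name is the older name of this property, so a deprecation warning here is harmless:

$ hdfs getconf -confKey fs.default.name

This should print hdfs://master:9000.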
8. Set path for HDFS

Edit the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file to resemble the following configuration:

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
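One assumption worth calling out: the directories above must exist and be writable by the user that runs the Hadoop daemons. If they are not created automatically in your setup, something along these lines should do (adjust the owner to your own user; ashish is only an example, taken from the scp commands later on):

$ sudo mkdir -p /home/hadoop/data/nameNode /home/hadoop/data/dataNode
$ sudo chown -R ashish:ashish /home/hadoop/data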
9. Set YARN as Job Scheduler

Edit the mapred-site.xml file, setting YARN as the default framework for MapReduce operations.

$HADOOP_HOME/etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
</configuration>
10. Configure YARN

Edit yarn-site.xml, which contains the configuration options for YARN. In the value field for yarn.resourcemanager.hostname, use the IP address of "master" (192.168.1.12 in this setup).

$HADOOP_HOME/etc/hadoop/yarn-site.xml

<configuration>
  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.1.12</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
11. Configure Workers

The workers file is used by the startup scripts to start the required daemons on all nodes. Edit this file:

$HADOOP_HOME/etc/hadoop/workers

to include both of the worker nodes:

slave1
slave2
12. Configure Memory Allocation (Two steps)

A) Edit $HADOOP_HOME/etc/hadoop/yarn-site.xml and add the following lines:

$ sudo gedit $HADOOP_HOME/etc/hadoop/yarn-site.xml

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1536</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>1536</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

B) Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml and add the following lines:

$ sudo gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>512</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>256</value>
</property>
13. Duplicate Config Files on Each Node

Copy the Hadoop configuration files to the worker nodes:

$ scp -r /usr/local/hadoop/etc/* ashish@slave1:/usr/local/hadoop/etc/
$ scp -r /usr/local/hadoop/etc/* ashish@slave2:/usr/local/hadoop/etc/

After copying, the following file should be modified on each destination node so that it contains the correct JAVA_HOME for that node (see the check below):

/usr/local/hadoop/etc/hadoop/hadoop-env.sh
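One way to check this on each node is to look at the JAVA_HOME line in hadoop-env.sh:

$ grep JAVA_HOME /usr/local/hadoop/etc/hadoop/hadoop-env.sh

and make sure it points at that node's JDK, for example export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 (this path is only illustrative; it depends on how Java was installed on each machine).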
14. Format HDFS

HDFS needs to be formatted like any classical file system. On "master", run the following command:

$ hdfs namenode -format

Your Hadoop installation is now configured and ready to run.
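As an optional check (an addition to the original steps), the format command should populate the NameNode directory configured in hdfs-site.xml; listing it should show a VERSION file and an initial fsimage:

$ ls /home/hadoop/data/nameNode/current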
15. Start and Stop HDFS
Start the HDFS by running the following script from master:
/usr/local/hadoop/sbin/start-dfs.sh
This will start NameNode and SecondaryNameNode on master, and DataNode on slave1 and slave2, according to the configuration in the workers config file.
Check that every process is running with the jps command on each node. On master, you should see the following (the PID numbers will differ):
21922 Jps
21603 NameNode
21787 SecondaryNameNode
And on slave1 and slave2 you should see the following:
19728 DataNode
19819 Jps
To stop HDFS on the master and worker nodes, run the following command from master:
stop-dfs.sh
16. Monitor your HDFS Cluster
Point your browser to http://master:9870/dfshealth.html, where "master" resolves to the IP address of your master node, and you will get a user-friendly monitoring console.
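If you prefer the command line (this check is an addition, not part of the original post), similar information is available from the HDFS admin report on master, which should list both DataNodes as live:

$ hdfs dfsadmin -report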