Hadoop — HDFS and YARN Kerberos-based Configuration
This is the third part of the Apache Hadoop ecosystem setup explained in Apache Hadoop Multi-Node Kerberized Cluster Setup. In the previous stories we went through the overall deployment architecture and set up the initial system with Kerberos. In this story, we will go through the steps to set up Hadoop HDFS and YARN.
Chapter 3. Unpacking Hadoop Distributions
1. Download Hadoop on all nodes
We are using Hadoop version 2.10.0 for setting up the environment. Below are the steps to download and unpack Hadoop on the sicily1 node. Please repeat the same steps on all nodes. Perform these steps using the “hadoop” user.
[hadoop@sicily1 ~]$ pwd
/home/hadoop
[hadoop@sicily1 ~]$ wget http://apachemirror.wuchna.com/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz
…
[hadoop@sicily1 ~]$ tar xvzf hadoop-2.10.0.tar.gz
…
[hadoop@sicily1 ~]$ mv hadoop-2.10.0 hadoop
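The later steps in this story invoke the Hadoop commands (hdfs, start-dfs.sh, start-yarn.sh and so on) by name, so the Hadoop bin and sbin folders need to be on the PATH of the “hadoop” user. The exact profile entries are not shown here; the following is a minimal sketch of what could be added to ~/.bashrc on each node, assuming the unpack location above.
# Sketch of ~/.bashrc entries for the "hadoop" user (paths assume the layout above)
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin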
2. Download Spark on all nodes
We are using Spark version 2.4.6 (the “without Hadoop” distribution) for setting up the environment. Below are the steps to download and unpack Spark on the sicily1 node. Please repeat the same steps on all nodes. Perform these steps using the “hadoop” user.
[root@sicily1 ~]# su - hadoop
[hadoop@sicily1 ~]$ pwd
/home/hadoop
[hadoop@sicily1 ~]$ wget http://apachemirror.wuchna.com/spark/spark-2.4.6/spark-2.4.6-bin-without-hadoop.tgz
…
[hadoop@sicily1 ~]$ tar xvzf spark-2.4.6-bin-without-hadoop.tgz
…
[hadoop@sicily1 ~]$ mv spark-2.4.6-bin-without-hadoop spark
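Since this is the “without Hadoop” build of Spark, Spark has to be told where the Hadoop jars are. The detailed Spark configuration comes in the next story; as a rough sketch, the Hadoop-free builds are usually wired up through SPARK_DIST_CLASSPATH, typically in conf/spark-env.sh:
# Sketch only (details in the Spark story): point the Hadoop-free Spark build at the Hadoop jars,
# typically in /home/hadoop/spark/conf/spark-env.sh
export SPARK_DIST_CLASSPATH=$(/home/hadoop/hadoop/bin/hadoop classpath)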
3. Download Livy in the edge node — florence1
Download Livy only on the edge node, which is the florence1 node. Perform these steps using the “hadoop” user.
[hadoop@florence1 ~]$ pwd
/home/hadoop
[hadoop@florence1 ~]$ wget https://mirrors.estointernet.in/apache/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip
[hadoop@florence1 ~]$ unzip apache-livy-0.7.0-incubating-bin.zip
[hadoop@florence1 ~]$ mv apache-livy-0.7.0-incubating-bin livy
4. Moving Keytab files
Copy the keytab files to the Hadoop configuration folder as below. These steps are shown on the turin1 node; please repeat the same on all nodes. Perform these steps using the “hadoop” user.
[hadoop@turin1 ~]$ pwd
/home/hadoop
[hadoop@turin1 ~]$ cd keytabs/
[hadoop@turin1 keytabs]$ cp hdfs.keytab mapred.keytab yarn.keytab /home/hadoop/hadoop/etc/hadoop/
[hadoop@turin1 keytabs]$
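Optionally, verify that the copied keytabs contain the expected principals using klist; for example, for the hdfs keytab:
# lists the principals and key versions stored in the keytab
[hadoop@turin1 keytabs]$ klist -kt /home/hadoop/hadoop/etc/hadoop/hdfs.keytab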
5. Build JSVC
So, HDFS comprises two components: Name Nodes and Data Nodes. The Name Node stores the namespace, that is, how the files are laid out across Data Nodes in terms of blocks, and the Data Nodes are the ones that actually store the file data. The key thing here is, say we want to store a file of 228 MB; it is partitioned into two blocks of 128 MB (the default block size) and 100 MB. The Name Node decides where to store these blocks and also replicates them based on the replication factor (a value of 3 means each block is replicated across 3 nodes) and the rack awareness property (nodes are grouped into racks; if all replicas of a block land on nodes of the same rack and that rack fails, the whole block is lost, so enabling rack awareness makes sure that at least one replica is stored in a different rack).
Okay, now say a Data Node crashes and a new one comes up. Since there is no security by default in the Hadoop ecosystem, there is a possibility that someone could fake a node as being a Data Node and add it to the HDFS cluster. This is where JSVC comes in. JSVC is the Apache Commons Daemon launcher, which makes the Data Node process bind to privileged ports (port numbers below 1024); since only root can open such ports, a rogue process started by an ordinary user cannot pose as a Data Node.
The below set of commands downloads the Apache Commons Daemon source and builds it to produce the jsvc tool. These steps are performed on the tuscany1 node; please repeat the same on all other HDFS nodes.
[root@tuscany1 ~]# pwd
/root
[root@tuscany1 ~]# cd /home/hadoop/
[root@tuscany1 hadoop]# wget https://mirrors.estointernet.in/apache//commons/daemon/source/commons-daemon-1.2.2-src.tar.gz
[root@tuscany1 hadoop]# tar xvzf commons-daemon-1.2.2-src.tar.gz
…
[root@tuscany1 hadoop]# cd commons-daemon-1.2.2-src/src/native/unix/
[root@tuscany1 unix]# ./configure
…
*** All done ***
Now you can issue “make”
[root@tuscany1 unix]# make
…
make[1]: Leaving directory ‘/home/hadoop/commons-daemon-1.2.2-src/src/native/unix/native’
[root@tuscany1 unix]# cp jsvc /home/hadoop/
[root@tuscany1 unix]#
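The resulting jsvc binary now sits in /home/hadoop, which is what JSVC_HOME will point to later in hadoop-env.sh. As a quick sanity check, the freshly built binary should print its usage text when invoked with -help:
# sanity check: should print the jsvc usage text
[root@tuscany1 unix]# /home/hadoop/jsvc -help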
Chapter 4. Configuring HDFS and YARN
1. core-site.xml
Edit core-site.xml and update the file with the HDFS hostname. Make sure to specify the “hadoop.security.authentication” value as “kerberos” (indicating Kerberos authentication will be used) and “hadoop.security.authorization” as “true”. Then copy the same core-site.xml file to all other nodes of the cluster, so that all 5 nodes have identical content.
fs.defaultFS = hdfs://sicily1.wsdm.ami.com:9000
hadoop.security.authentication = kerberos
hadoop.security.authorization = true
hadoop.security.auth_to_local =
    RULE:[2:$1/$2@$0]([*]/.*@HADOOPCLUSTER\.LOCAL)s/.*/hdfs/
    DEFAULT
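As with the other configuration files in this story, the values above are listed as key = value pairs for readability; in the actual core-site.xml each of them goes inside a property element. For example, the first two entries would look like this:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://sicily1.wsdm.ami.com:9000</value>
</property>
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>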
2. Name Node and Data Node directories
Create the namenode and datanode directories under the hadoop user’s home directory (/home/hadoop) on all 3 HDFS nodes:
Name node: sicily1.wsdm.ami.com
Data Node: turin1.wsdm.ami.com
Data Node: tuscany1.wsdm.ami.com
Like this (shown only on sicily1; repeat the same on turin1 and tuscany1 as well):
[hadoop@sicily1 ~]$ pwd
/home/hadoop
[hadoop@sicily1 ~]$ mkdir -p ~/hdfs/{namenode,datanode}
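The hdfs-site.xml below sets dfs.datanode.data.dir.perm to 700, so it does no harm to tighten the datanode directory up front. This is an optional step not shown in the original flow:
# optional: match dfs.datanode.data.dir.perm = 700 configured later in hdfs-site.xml
[hadoop@sicily1 ~]$ chmod 700 ~/hdfs/datanode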
3. Name Node hdfs-site.xml configuration
Update the /home/hadoop/hadoop/etc/hadoop/hdfs-site.xml file to specify the name node, data node and other Kerberos authentication related information (principals and keytab files). Edit the file on sicily1 (which is the Name Node) with the following content.
(I am pasting the settings as key = value pairs, but in hdfs-site.xml each of them should be written in the property/name/value form, for example:)
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hdfs/namenode</value>
<description>NameNode directory</description>
</property>
dfs.namenode.name.dir = file:///home/hadoop/hdfs/namenode
dfs.datanode.data.dir = file:///home/hadoop/hdfs/datanode
dfs.replication= 2
dfs.permissions.enabled = false
dfs.datanode.use.datanode.hostname = true
dfs.namenode.datanode.registration.ip-hostname-check = false
dfs.block.access.token.enable = true
dfs.namenode.keytab.file = /home/hadoop/hadoop/etc/hadoop/hdfs.keytab
dfs.namenode.kerberos.principal = hdfs/_HOST@HADOOPCLUSTER.LOCAL
dfs.namenode.kerberos.internal.spnego.principal = HTTP/_HOST@HADOOPCLUSTER.LOCAL
dfs.secondary.namenode.keytab.file = /home/hadoop/hadoop/etc/hadoop/hdfs.keytab
dfs.secondary.namenode.kerberos.principal = hdfs/_HOST@HADOOPCLUSTER.LOCAL
dfs.secondary.namenode.kerberos.internal.spnego.principal = HTTP/_HOST@HADOOPCLUSTER.LOCAL
dfs.datanode.data.dir.perm = 700
dfs.datanode.address = 0.0.0.0:1004
dfs.datanode.http.address = 0.0.0.0:1006
dfs.datanode.keytab.file = /home/hadoop/hadoop/etc/hadoop/hdfs.keytab
dfs.datanode.kerberos.principal = hdfs/_HOST@HADOOPCLUSTER.LOCAL
dfs.web.authentication.kerberos.principal = HTTP/_HOST@HADOOPCLUSTER.LOCAL
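As a quick sanity check that the file is being picked up, hdfs getconf can echo individual values back from the effective configuration, for example:
# prints the raw configured value; the _HOST placeholder is substituted only when the daemon logs in
[hadoop@sicily1 ~]$ hdfs getconf -confKey dfs.namenode.kerberos.principal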
4. Data Node hdfs-site.xml configuration
Exactly the same as the configuration content of sicily1, except that for the data nodes you should not specify the “dfs.namenode.name.dir” configuration. In other words, DO NOT include the dfs.namenode.name.dir property and its value in the hdfs-site.xml on turin1 and tuscany1.
5. mapred-site.xml
Edit /home/hadoop/hadoop/etc/hadoop/mapred-site.xml to specify the map-reduce framework to use. We will be using yarn.
The below is performed on sicily1; likewise perform the same on all HDFS nodes.
[hadoop@sicily1 hadoop]$ mv mapred-site.xml.template mapred-site.xml
Content of mapred-site.xml, again shown as key = value pairs (write them in the property/name/value form in the file):
mapreduce.framework.name = yarn
mapreduce.jobhistory.address = 0.0.0.0:10020
mapreduce.jobhistory.keytab = /home/hadoop/hadoop/etc/hadoop/mapred.keytab
mapreduce.jobhistory.principal = mapred/_HOST@HADOOPCLUSTER.LOCAL
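For reference, a sketch of the resulting mapred-site.xml body with the four properties above written out in full:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>0.0.0.0:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.keytab</name>
    <value>/home/hadoop/hadoop/etc/hadoop/mapred.keytab</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.principal</name>
    <value>mapred/_HOST@HADOOPCLUSTER.LOCAL</value>
  </property>
</configuration>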
6. Resource Manager yarn-site.xml — only on sicily1 node.
Edit /home/hadoop/hadoop/etc/hadoop/yarn-site.xml and update yarn.resourcemanager.scheduler.address and the other Kerberos authentication related properties as below. This is the content for sicily1.
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.resourcemanager.scheduler.address = sicily1.wsdm.ami.com:8030
yarn.nodemanager.pmem-check-enabled = false
yarn.nodemanager.vmem-check-enabled = false
yarn.resourcemanager.keytab = /home/hadoop/hadoop/etc/hadoop/yarn.keytab
yarn.resourcemanager.principal = yarn/_HOST@HADOOPCLUSTER.LOCAL
yarn.nodemanager.keytab = /home/hadoop/hadoop/etc/hadoop/yarn.keytab
yarn.nodemanager.principal = yarn/_HOST@HADOOPCLUSTER.LOCAL
7. Node Managers — yarn-site.xml — turin1 and tuscany1
Contents of the /home/hadoop/hadoop/etc/hadoop/yarn-site.xml for the Node Managers — turin1 and tuscany1
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.nodemanager.pmem-check-enabled = false
yarn.nodemanager.vmem-check-enabled = false
yarn.resourcemanager.address = sicily1.wsdm.ami.com:8032
yarn.resourcemanager.scheduler.address = sicily1.wsdm.ami.com:8030
yarn.resourcemanager.resource-tracker.address = sicily1.wsdm.ami.com:8031
yarn.resourcemanager.keytab = /home/hadoop/hadoop/etc/hadoop/yarn.keytab
yarn.resourcemanager.principal = yarn/_HOST@HADOOPCLUSTER.LOCAL
yarn.nodemanager.keytab = /home/hadoop/hadoop/etc/hadoop/yarn.keytab
yarn.nodemanager.principal = yarn/_HOST@HADOOPCLUSTER.LOCAL
8. Logs folder
Create the logs folder in all nodes
[root@sicily1 ~]# su - hadoop
[hadoop@sicily1 ~]$ cd hadoop/
[hadoop@sicily1 hadoop]$ pwd
/home/hadoop/hadoop
[hadoop@sicily1 hadoop]$ mkdir logs
[hadoop@sicily1 hadoop]$
9. Create the slaves file with entries of all the data nodes.
Create the slaves file on the Name Node pointing to the Data Nodes. As part of our setup we are using the Name Node as the Data Node as well.
[hadoop@sicily1 hadoop]$ pwd
/home/hadoop/hadoop
[hadoop@sicily1 hadoop]$ cd etc/hadoop/
[hadoop@sicily1 hadoop]$ cat slaves
localhost
[hadoop@sicily1 hadoop]$ vi slaves
…
[hadoop@sicily1 hadoop]$ cat slaves
sicily1.wsdm.ami.com
turin1.wsdm.ami.com
tuscany1.wsdm.ami.com
10. hadoop-env.sh
Modify /home/hadoop/hadoop/etc/hadoop/hadoop-env.sh with the following entries:
export JAVA_HOME=/usr/lib/jvm/java
export JSVC_HOME=/home/hadoop
export HADOOP_SECURE_DN_USER=hadoop
And copy the same file to all other nodes.
[hadoop@sicily1 hadoop]$ scp hadoop-env.sh hadoop@florence1.wsdm.ami.com:/home/hadoop/hadoop/etc/hadoop
[hadoop@sicily1 hadoop]$ scp hadoop-env.sh hadoop@turin1.wsdm.ami.com:/home/hadoop/hadoop/etc/hadoop
[hadoop@sicily1 hadoop]$ scp hadoop-env.sh hadoop@tuscany1.wsdm.ami.com:/home/hadoop/hadoop/etc/hadoop
[hadoop@sicily1 hadoop]$ scp hadoop-env.sh hadoop@verona1.wsdm.ami.com:/home/hadoop/hadoop/etc/hadoop
11. Format the name node using the following command
So far we have only specified various kinds of configuration for HDFS and YARN. Now it is time to get into action and actually create the file system. This step does exactly that: it formats the Hadoop Distributed File System.
[hadoop@sicily1 hadoop]$ hdfs namenode -format
20/09/05 11:32:02 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = sicily1.wsdm.ami.com/10.41.6.179
…
20/09/05 11:32:03 INFO security.UserGroupInformation: Login successful for user hdfs/sicily1.wsdm.ami.com@HADOOPCLUSTER.LOCAL using keytab file /home/hadoop/hadoop/etc/hadoop/hdfs.keytab
Formatting using clusterid: CID-05d0cb58-d92a-4580-a8c8-c3de839d15e0
…
20/09/05 11:32:03 INFO namenode.FSNamesystem: fsOwner = hdfs/sicily1.wsdm.ami.com@HADOOPCLUSTER.LOCAL (auth:KERBEROS)
…
20/09/05 11:32:04 INFO common.Storage: Storage directory /home/hadoop/hdfs/namenode has been successfully formatted.
…
20/09/05 11:32:04 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at sicily1.wsdm.ami.com/10.41.6.179
************************************************************/
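You can also confirm the format on disk; the clusterID in the VERSION file under the freshly formatted namenode directory should match the one logged above:
# the namenode metadata directory now contains a VERSION file with the clusterID
[hadoop@sicily1 hadoop]$ cat /home/hadoop/hdfs/namenode/current/VERSION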
12. Starting Name Node
Start the Name Node using the “hadoop” user. (You could just as well start it as the “root” user; for the Name Node it does not matter.) The Data Nodes, however, must be started as the “root” user, since they use “jsvc” to bind the privileged Data Node ports.
[hadoop@sicily1 hadoop]$ start-dfs.sh
Starting namenodes on [sicily1.wsdm.ami.com]
sicily1.wsdm.ami.com: starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-sicily1.wsdm.ami.com.out
Attempting to start secure cluster, skipping datanodes. Run start-secure-dns.sh as root to complete startup.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-secondarynamenode-sicily1.wsdm.ami.com.out
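A quick way to confirm the processes are actually up is jps (shipped with the JDK); on sicily1 it should list the NameNode and SecondaryNameNode:
# PIDs will differ; look for NameNode and SecondaryNameNode in the output
[hadoop@sicily1 hadoop]$ jps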
13. Start the secured data nodes using the “root” user
[root@sicily1 hadoop]# start-secure-dns.sh
turin1.wsdm.ami.com: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-turin1.wsdm.ami.com.out
tuscany1.wsdm.ami.com: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-tuscany1.wsdm.ami.com.out
sicily1.wsdm.ami.com: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-sicily1.wsdm.ami.com.out
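Before running HDFS commands against the secured cluster, the calling user needs a valid Kerberos ticket. A sketch of obtaining one from the hdfs keytab and checking that all three Data Nodes have registered (the principal below assumes the sicily1 host; adjust for the node you run it on):
# obtain a ticket for the hdfs service principal (host part shown for illustration)
[root@sicily1 hadoop]# kinit -kt /home/hadoop/hadoop/etc/hadoop/hdfs.keytab hdfs/sicily1.wsdm.ami.com@HADOOPCLUSTER.LOCAL
# the report should show 3 live datanodes once registration completes
[root@sicily1 hadoop]# hdfs dfsadmin -report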
14. Some basic HDFS operations
[root@sicily1 logs]$ hdfs dfs -ls /
[root@sicily1 logs]$ hdfs dfs -mkdir /testing_data
[root@sicily1 confirm]# hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - HTTP supergroup          0 2020-09-05 20:40 /testing_data
[root@sicily1 confirm]$ wget https://raw.githubusercontent.com/ravichamarthy/spark/master/airports.csv
[root@sicily1 confirm]# hdfs dfs -put airports.csv /testing_data
[root@sicily1 confirm]# hdfs dfs -ls /testing_data
Found 1 items
-rw-r--r--   2 HTTP supergroup     850413 2020-09-05 20:42 /testing_data/airports.csv
[root@sicily1 confirm]# hdfs dfs -get /testing_data/airports.csv
get: `airports.csv': File exists
[root@sicily1 confirm]# hdfs dfs -cat /testing_data/airports.csv | head
Airport_ID,Name,City,Country,IATA_FAA,ICAO,Latitude,Longitude,Altitude,Timezone,DST,Tz_db_time_zone
1,"Goroka","Goroka","Papua New Guinea","GKA","AYGA",-6.081689,145.391881,5282,10,"U","Pacific/Port_Moresby"
2,"Madang","Madang","Papua New Guinea","MAG","AYMD",-5.207083,145.7887,20,10,"U","Pacific/Port_Moresby"
…
15. Start YARN
Start YARN as the root user.
[root@sicily1 confirm]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop/logs/yarn-root-resourcemanager-sicily1.wsdm.ami.com.out
turin1.wsdm.ami.com: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-root-nodemanager-turin1.wsdm.ami.com.out
tuscany1.wsdm.ami.com: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-root-nodemanager-tuscany1.wsdm.ami.com.out
sicily1.wsdm.ami.com: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-root-nodemanager-sicily1.wsdm.ami.com.out
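To confirm that all three Node Managers registered with the Resource Manager (again assuming a valid Kerberos ticket in the cache):
# expect RUNNING entries for sicily1, turin1 and tuscany1
[root@sicily1 confirm]# yarn node -list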
- Make sure the YARN Resource Manager, Node Managers, and the HDFS Name Node and Data Nodes have started properly by looking at the respective log files under the “/home/hadoop/hadoop/logs” folder
- Accessing Hadoop Administration
- Namenode: http://sicily1.wsdm.ami.com:50070/dfshealth.html#tab-overview
- ResourceManager: http://sicily1.wsdm.ami.com:8088/cluster/nodes
As a summary, in this story we looked at how to configure HDFS and YARN with Kerberos and ran some basic HDFS commands. In the next story we shall proceed with the Spark configuration.