Apache Hadoop Multi-Node Kerberized Cluster Setup

Ravi Chamarthy
3 min read · Sep 16, 2020
An integrated Apache Hadoop Ecosystem

This is a set of chapters that shows how to submit Spark jobs to a Kerberized Hadoop cluster, where all components of the ecosystem are Kerberized: HDFS, YARN, Spark, Hive, and Livy. Over the course of these chapters, I will explain the steps to set up the following:

Kerberos Admin Server and Key Distribution Center setup
Kerberos principals creation and ticket acquisition
HDFS (1 Name Node x 3 Data Nodes)
YARN (1 Resource Manager x 3 Node Managers)
Hive and MySQL (in a separate node)
Spark on all nodes, with the Spark master configured to use YARN on the Resource Manager node.
Hive integration with Spark
Livy setup on a separate edge node.
And a client node from which we submit Livy Batch API and WebHDFS calls.

Here is the high-level cluster deployment architecture that I am setting up as part of these chapters.

Apache Hadoop Ecosystem — deployment architecture used as part of this series

Following are the related stories (well, I am calling them chapters):

Chapter 1. Users Creation and initial setup
Chapter 2. Kerberos Installation and configuration
Chapter 3. Unpacking Hadoop Distributions
Chapter 4. Configuring HDFS and YARN
Chapter 5. Configure Spark and Run Spark Applications
Chapter 6. Configuring Edge Node and Run Spark Applications
Chapter 7. Hive Configuration
Chapter 8. Integrating Hive with Spark in Resource Manager and in Edge Node
Chapter 9. Running Spark application communicating with Kerberized Hive

We begin with the following two chapters, covered in Kerberos Setup for Apache Hadoop Multi-Node Cluster (link: https://medium.com/@ravi.chamarthy/kerberos-setup-for-apache-hadoop-multi-node-cluster-6bd8a2fbe680):

Chapter 1. Users Creation and initial setup
Chapter 2. Kerberos Installation and configuration
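
To give a flavour of what Chapter 2 boils down to, here is a minimal Python sketch that creates a service principal and obtains a ticket from its keytab. The realm, hostname and keytab path are hypothetical placeholders, and the script simply shells out to kadmin.local and kinit, so it has to run on the KDC/admin server.

```python
import subprocess

REALM = "EXAMPLE.COM"                                   # hypothetical realm
PRINCIPAL = f"hdfs/namenode.example.com@{REALM}"        # hypothetical service principal
KEYTAB = "/etc/security/keytabs/hdfs.service.keytab"    # hypothetical keytab path

# Create the principal with a random key and export it to a keytab
# (run on the KDC / Kerberos admin server).
subprocess.run(["kadmin.local", "-q", f"addprinc -randkey {PRINCIPAL}"], check=True)
subprocess.run(["kadmin.local", "-q", f"ktadd -k {KEYTAB} {PRINCIPAL}"], check=True)

# Obtain a ticket from the keytab and show the ticket cache.
subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)
subprocess.run(["klist"], check=True)
```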

Next, the story Hadoop — HDFS and YARN Kerberos based Configuration (link: https://medium.com/@ravi.chamarthy/hadoop-hdfs-and-yarn-kerberos-based-configuration-d23d286fdbcc) explains how to configure HDFS and YARN through these two chapters:

Chapter 3. Unpacking Hadoop Distributions
Chapter 4. Configuring HDFS and YARN
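
Once HDFS and YARN are Kerberized, a quick smoke test confirms that both daemons answer only after a valid ticket is in place. A small sketch, assuming a hypothetical keytab path and principal:

```python
import subprocess

PRINCIPAL = "hdfs/namenode.example.com@EXAMPLE.COM"     # hypothetical principal
KEYTAB = "/etc/security/keytabs/hdfs.service.keytab"    # hypothetical keytab path

# Without a valid ticket, both commands below fail with a GSS/Kerberos error.
subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)

# HDFS check: the NameNode should answer a listing of the root directory.
subprocess.run(["hdfs", "dfs", "-ls", "/"], check=True)

# YARN check: the ResourceManager should report the three NodeManagers.
subprocess.run(["yarn", "node", "-list"], check=True)
```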

Then we will set up Spark, run Spark applications using spark-submit, and also submit Spark jobs from an edge node, as described in the story Configuring Spark and Running Spark Applications (link: https://medium.com/@ravi.chamarthy/configuring-spark-and-running-spark-applications-983e5fdd6499):

Chapter 5. Configure Spark and Run Spark Applications
Chapter 6. Configuring Edge Node and Run Spark Applications
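
As a taste of Chapter 5, here is a minimal PySpark application that can be pushed to the cluster with spark-submit against YARN; the principal, keytab path and file name in the comment are hypothetical placeholders, not the exact values used later in the series.

```python
from pyspark.sql import SparkSession

# Submitted from the Resource Manager or edge node with something like:
#   spark-submit --master yarn --deploy-mode cluster \
#       --principal spark/edge.example.com@EXAMPLE.COM \
#       --keytab /etc/security/keytabs/spark.keytab \
#       smoke_test.py
# (principal, keytab path and file name above are hypothetical placeholders)
spark = SparkSession.builder.appName("kerberized-smoke-test").getOrCreate()

# Distribute a small range across the YARN executors and count it back.
print("row count =", spark.range(0, 1000).count())
spark.stop()
```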

We shall configure Hive as part of the story Apache Hive Configuration with MySQL metastore (link: https://medium.com/@ravi.chamarthy/apache-hive-configuration-with-mysql-metastore-3ecb9a0df3a1):

Chapter 7. Hive Configuration
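
The crux of the Hive chapter is pointing the metastore at MySQL. The sketch below just prints the relevant hive-site.xml property stanzas; the hostname, database name and credentials are hypothetical placeholders.

```python
# Hostname, database name and credentials below are hypothetical placeholders.
hive_site = {
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://hivenode.example.com:3306/metastore?createDatabaseIfNotExist=true",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.cj.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hiveuser",
    "javax.jdo.option.ConnectionPassword": "hivepassword",
    "hive.metastore.uris": "thrift://hivenode.example.com:9083",
}

# Print the <property> stanzas to paste into hive-site.xml.
for name, value in hive_site.items():
    print(f"<property>\n  <name>{name}</name>\n  <value>{value}</value>\n</property>")
```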

From the edge node, we shall confirm the execution of Spark jobs that use Hive, as detailed in the following chapters of the story Submit Spark Applications using Hive from Edge Node (link: https://medium.com/@ravi.chamarthy/submit-spark-applications-using-hive-from-edge-node-832c9b0f17d):

Chapter 8. Integrating Hive with Spark in Resource Manager and in Edge Node
Chapter 9. Running Spark application communicating with Kerberized Hive
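
For Chapters 8 and 9, the key idea is that Spark can talk to the Kerberized metastore once hive-site.xml is visible to it. A minimal PySpark sketch of that check:

```python
from pyspark.sql import SparkSession

# Needs hive-site.xml visible to Spark (e.g. copied into $SPARK_HOME/conf)
# so the session can reach the Kerberized metastore.
spark = (SparkSession.builder
         .appName("spark-hive-check")
         .enableHiveSupport()
         .getOrCreate())

# Create a small table through the metastore and list what Hive knows about.
spark.sql("CREATE TABLE IF NOT EXISTS smoke_test (id INT) STORED AS PARQUET")
spark.sql("SHOW TABLES").show()
spark.stop()
```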

And finally, in the same story, Submit Spark Applications to Livy Batches API from Client System (link: https://medium.com/@ravi.chamarthy/submit-spark-applications-using-hive-from-edge-node-832c9b0f17d), we shall use a client machine to submit cURL requests to the Livy server and to WebHDFS. And that concludes the setup!
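
The story itself uses cURL; the equivalent calls from a Python client look roughly like the sketch below, assuming hypothetical hostnames, that the requests and requests-kerberos packages are installed, and that a ticket has already been obtained with kinit.

```python
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# SPNEGO authentication re-uses the ticket obtained earlier with kinit.
auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)

# Hypothetical hostnames; 8998 is Livy's default port and 9870 is the
# Hadoop 3 NameNode HTTP port.
LIVY = "http://livynode.example.com:8998"
WEBHDFS = "http://namenode.example.com:9870/webhdfs/v1"

# Livy Batch API: submit a Spark application that is already staged on HDFS.
batch = requests.post(
    f"{LIVY}/batches",
    json={"file": "hdfs:///apps/spark-hive-check.py", "name": "livy-batch-check"},
    auth=auth,
).json()
print("Livy batch id:", batch["id"])

# WebHDFS: list /tmp through the NameNode's REST interface.
listing = requests.get(f"{WEBHDFS}/tmp?op=LISTSTATUS", auth=auth).json()
print(listing["FileStatuses"]["FileStatus"])
```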

List of packages used as part of this setup:

Happy Hadooping!
