An integrated Apache Hadoop Ecosystem

This is a set of chapters showcasing how to submit Spark jobs to a Kerberized Hadoop cluster, where all components of the ecosystem are Kerberized — HDFS, YARN, Spark, Hive, Livy. Over the course of these chapters, I will explain the steps to set up the following:

Kerberos Admin Server and Key Distribution Center setup
Kerberos principal creation and ticket retrieval (see the sketch after this list)
HDFS (1 Name Node x 3 Data Nodes)
YARN (1 Resource Manager x 3 Node Managers)
Hive and MySQL (in a separate node)
Spark on all nodes, with the Spark master configured to use YARN on the Resource Manager node.
Hive integration with Spark
Livy setup on a…
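
A minimal sketch of the principal creation and ticket retrieval step mentioned above (the realm, hostnames, and keytab paths here are illustrative, not from the original chapters):

[root@kdc1 ~]# kadmin.local -q "addprinc -randkey hadoop/turin1@EXAMPLE.COM"                      # create a service principal
[root@kdc1 ~]# kadmin.local -q "ktadd -k /home/hadoop/hadoop.keytab hadoop/turin1@EXAMPLE.COM"    # export its keys to a keytab
[hadoop@turin1 ~]$ kinit -kt /home/hadoop/hadoop.keytab hadoop/turin1@EXAMPLE.COM                 # "get" a ticket using the keytab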


Assessing machine learning model risk

Choose the right machine learning model for your production deployment

A machine learning model, like any other software deliverable, has a lifecycle. The model owner proposes the problem statement for the model's predictions; the model developer designs, develops, and deploys the model; and the model validator tests it. The model approver then reviews the model validation outcomes and decides whether to approve or reject the model for usage. All these steps normally happen in a sandbox, or pre-production, environment. Only after the model is approved is it promoted to production.

In this story, we will take a step-by-step look at how IBM Watson OpenScale can be…


Spark with JDBC communicating with Kerberized Hive

JDBC is a popular data access technology supported by multiple databases, where database vendors provide drivers implementing the JDBC specification; applications place these drivers on the classpath to communicate with the underlying database. For Spark-based application development, on the other hand, the widely used authentication mechanism is Kerberos, in which a Key Distribution Center (KDC), comprising an Authentication Server (AS) and a Ticket Granting Server (TGS), protects the Hadoop cluster and the Hive database. …
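
To make this concrete: a Kerberized Hive JDBC URL embeds the Hive service principal, and the client must already hold a valid Kerberos ticket. A minimal sketch using beeline (hostname, port, realm, and keytab path are illustrative):

[hadoop@client1 ~]$ kinit -kt /home/hadoop/hadoop.keytab hadoop@EXAMPLE.COM
[hadoop@client1 ~]$ beeline -u "jdbc:hive2://verona1:10000/default;principal=hive/verona1@EXAMPLE.COM"

Note that the ;principal=… part of the URL names the Hive server's own service principal, not the connecting user; the user's identity comes from the ticket cache populated by kinit.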


Let’s keep it simple. Your data is stored in a Kerberized Hive which is part of your Kerberized Hadoop cluster. And from your system, you want to connect to this Hive through a Jupyter notebook to, let’s say, run some SQL queries.

If it is a regular Hive, it is pretty straightforward. If it is a Kerberized Hive, it's a bit tricky — hence this post.

Instead of assuming that your system already has particular packages installed, I have captured all the steps starting from a freshly brewed RHEL 8.2 VM, with the end goal of connecting to a Kerberized…
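
As a preview of the client-side prerequisites on such a VM, a minimal sketch (the realm and principal are illustrative):

[root@rhel82 ~]# dnf install -y krb5-workstation        # Kerberos client tools: kinit, klist
[root@rhel82 ~]# vi /etc/krb5.conf                      # point default_realm and the kdc entry at your KDC
[hadoop@rhel82 ~]$ kinit hadoop@EXAMPLE.COM             # obtain a ticket-granting ticket
[hadoop@rhel82 ~]$ klist                                # verify the ticket cache before connecting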


Putting everything together — Kerberos, HDFS, YARN, Spark, Hive, Edge Node, Client Node

This is the sixth and final part of the Apache Hadoop ecosystem setup explained in Apache Hadoop Multi-Node Kerberized Cluster Setup. In the previous stories we went through the following chapters:

Chapter 1. Users Creation and initial setup
Chapter 2. Kerberos Installation and configuration
Chapter 3. Unpacking Hadoop Distributions
Chapter 4. Configuring HDFS and YARN
Chapter 5. Configure Spark and Run Spark Applications
Chapter 6. Configuring Edge Node and Run Spark Applications
Chapter 7. Hive Configuration

In this story we shall submit Spark applications from the Edge node, using Hive as the data source.

Chapter 8. Integrating Hive with Spark in Resource Manager and in Edge Node

Copy hive-site.xml from the Hive node to the Spark conf folder on all nodes, as sketched below.

[root@verona1…
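
A minimal sketch of such a copy (the Hive conf path and target hostname are assumptions based on the earlier chapters):

[root@verona1 ~]# scp /home/hadoop/hive/conf/hive-site.xml hadoop@turin1:/home/hadoop/spark/conf/   # repeat for every node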


This is the fifth part of the Apache Hadoop ecosystem setup explained in Apache Hadoop Multi-Node Kerberized Cluster Setup. In the previous stories we went through the following chapters:

Chapter 1. Users Creation and initial setup
Chapter 2. Kerberos Installation and configuration
Chapter 3. Unpacking Hadoop Distributions
Chapter 4. Configuring HDFS and YARN
Chapter 5. Configure Spark and Run Spark Applications
Chapter 6. Configuring Edge Node and Run Spark Applications

In this story we shall perform Hive configuration.

Chapter 7. Hive Configuration

Select one of the data nodes to install Hive on. We could have the Hive/MySQL in the…
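
As a preview, a minimal sketch of preparing MySQL as the Hive metastore database on that node (the database name, user, and password are illustrative):

[root@verona1 ~]# yum install -y mysql-server
[root@verona1 ~]# systemctl enable --now mysqld
[root@verona1 ~]# mysql -u root -e "CREATE DATABASE metastore;"                                    # database for the Hive metastore
[root@verona1 ~]# mysql -u root -e "CREATE USER 'hive'@'%' IDENTIFIED BY 'HivePass1!'; GRANT ALL ON metastore.* TO 'hive'@'%';"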


This is the fourth part of the Apache Hadoop ecosystem setup explained in Apache Hadoop Multi-Node Kerberized Cluster Setup. In the previous stories we went through the overall deployment architecture, followed by the setup of the initial system with Kerberos, and then the setup of multi-node Hadoop with HDFS and YARN. In this story, we will go through the steps to set up Spark and run applications.

Chapter 5. Configure Spark and Run Spark Applications

1. spark-defaults.conf

Create /home/hadoop/spark/conf/spark-defaults.conf to specify the Kerberos principal and keytab file that Spark uses when communicating with YARN; the relevant properties are sketched after the listing below. Specify the same principal on all nodes.

[hadoop@turin1 logs]$ cd ../../spark/conf/
[hadoop@turin1 conf]$ pwd
/home/hadoop/spark/conf
[hadoop@turin1 conf]$ mv spark-defaults.conf.template spark-defaults.conf
[hadoop@turin1…
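
The two properties at the heart of this file look roughly as follows (these property names apply to Spark 2.x on YARN; the principal and keytab path are illustrative):

spark.yarn.principal   hadoop@EXAMPLE.COM
spark.yarn.keytab      /home/hadoop/hadoop.keytab

With these set, spark-submit can log in from the keytab and renew Kerberos tickets for long-running YARN applications.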


This is the third part of the Apache Hadoop ecosystem setup explained in Apache Hadoop Multi-Node Kerberized Cluster Setup. In the previous stories we went through the overall deployment architecture and set up the initial system with Kerberos. In this story, we will go through the steps to set up Hadoop — HDFS and YARN.

Chapter 3. Unpacking Hadoop Distributions

1. Download Hadoop on all nodes

We are using Hadoop version 2.10.0 to set up the environment. Below are the steps to download and unpack Hadoop on Sicily; please repeat the same steps on all nodes, performing them as the "hadoop" user.

[hadoop@sicily1 ~]$ pwd
/home/hadoop
[hadoop@sicily1 ~]$ wget…
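
The full download-and-unpack sequence looks roughly like this (the Apache archive URL is the customary location for the 2.10.0 release, but verify it against a current mirror; the version-free target path is an assumption):

[hadoop@sicily1 ~]$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz
[hadoop@sicily1 ~]$ tar -xzf hadoop-2.10.0.tar.gz       # unpack the distribution
[hadoop@sicily1 ~]$ mv hadoop-2.10.0 hadoop             # keep a version-free path, e.g. /home/hadoop/hadoop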


Kerberos — gatekeeping with 3 locks — Authentication Server, Database, Ticket Granting Server

As explained in Apache Hadoop Multi-Node Kerberized Cluster Setup, in this story we shall perform the initial setup of the Hadoop ecosystem with the required packages and then set up Kerberos on all cluster nodes.

Here are the various nodes on which we will set up the Hadoop ecosystem.
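
As a preview of the server side, a minimal sketch of standing up the KDC and admin server on one node (the hostname and realm are illustrative):

[root@kdc1 ~]# yum install -y krb5-server krb5-workstation   # KDC, kadmind, and client tools
[root@kdc1 ~]# vi /etc/krb5.conf                             # set default_realm = EXAMPLE.COM and the kdc/admin_server hosts
[root@kdc1 ~]# kdb5_util create -s -r EXAMPLE.COM            # create the Kerberos database with a stash file
[root@kdc1 ~]# systemctl enable --now krb5kdc kadmin         # start the KDC and the admin server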


Measuring model performance metrics

Like any other software development effort, testing and evaluating your machine learning model is essential before the model is used to make actual predictions. Quality metrics such as Accuracy, Precision, Recall, and Sensitivity are often used to measure the model. But a single metric may not give a complete picture, and we need to consider multiple metrics to understand the different dimensions of model quality. In this story, we cover a group of metrics for measuring the quality of a machine learning model.
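
To make a couple of these concrete: precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP, and FN are true positives, false positives, and false negatives. As an illustrative example, a classifier with 40 true positives, 10 false positives, and 20 false negatives has precision 40/50 = 0.8 but recall only 40/60 ≈ 0.67, which is exactly why reporting a single metric can mislead.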

Let’s say we are building a…

Ravi Chamarthy

Software Architect, IBM Watson OpenScale — Trusted AI
