An integrated Apache Hadoop Ecosystem

This is a set of chapters that showcases submitting Spark jobs to a Kerberized Hadoop cluster, where all components of the ecosystem are Kerberized — HDFS, YARN, Spark, Hive, and Livy. Over the course of these chapters, I will explain the steps to set up each of these components; a quick preview of the end state follows below.
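Once everything is in place, submitting a Spark batch to such a cluster through Livy might look roughly like this minimal sketch. The Livy URL, jar path, and class name here are placeholders, not values from the chapters.

```python
# Minimal sketch (hypothetical hostnames and paths): submit a Spark batch to a
# Kerberized cluster through Livy, authenticating over SPNEGO/Kerberos.
import requests
from requests_kerberos import HTTPKerberosAuth, REQUIRED

LIVY_URL = "http://livy-node.example.com:8998"

payload = {
    "file": "hdfs:///apps/spark-examples.jar",       # job jar already on HDFS
    "className": "org.apache.spark.examples.SparkPi",
    "args": ["10"],
}

# requests-kerberos reuses the ticket cache populated by a prior `kinit`
resp = requests.post(
    f"{LIVY_URL}/batches",
    json=payload,
    auth=HTTPKerberosAuth(mutual_authentication=REQUIRED),
)
resp.raise_for_status()
print("Submitted Livy batch:", resp.json()["id"])
```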


Assessing machine learning model risk

Choose the right machine learning model for your production deployment

The machine learning model, like any other software deliverable, has a lifecycle. The model owner proposes the problem statement for the model's predictions, the model developer designs, develops, and deploys the model, and the model validator tests it. The model approver then reviews the model…


Spark with JDBC communicating with Kerberized Hive

JDBC is a popular data access technology supported by multiple databases, where database vendors provide drivers implementing the JDBC specification. Applications add these drivers to their classpath to communicate with the underlying database. On the other hand, for Spark-based application development, the widely…
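To make the idea concrete ahead of the full chapter, a Kerberos-authenticated Spark JDBC read against HiveServer2 might look roughly like this sketch; the JDBC URL, principal, jar path, and table name are hypothetical.

```python
# Minimal sketch (hypothetical URL, principal, jar, and table): a Spark JDBC
# read against a Kerberized HiveServer2.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-jdbc-kerberized-hive")
    # the Hive JDBC driver must be on the application classpath
    .config("spark.jars", "/opt/jars/hive-jdbc-standalone.jar")
    .getOrCreate()
)

df = (
    spark.read.format("jdbc")
    # `principal=` in the URL tells the driver to authenticate via Kerberos
    .option("url",
            "jdbc:hive2://hive-node.example.com:10000/default;"
            "principal=hive/hive-node.example.com@EXAMPLE.COM")
    .option("driver", "org.apache.hive.jdbc.HiveDriver")
    .option("dbtable", "employees")
    .load()
)
df.show()
```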


Let’s keep it simple. Your data is stored in a Kerberized Hive that is part of your Kerberized Hadoop cluster, and from your own system you want to connect to this Hive through a Jupyter notebook to, say, run some SQL queries.
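A minimal sketch of that notebook flow, assuming a ticket has already been cached with `kinit` and using a hypothetical host and table:

```python
# Minimal sketch (hypothetical host and table): query Kerberized HiveServer2
# from a Jupyter notebook, assuming `kinit` has already cached a ticket.
import pandas as pd
from pyhive import hive

conn = hive.Connection(
    host="hive-node.example.com",
    port=10000,
    auth="KERBEROS",
    kerberos_service_name="hive",  # matches the hive/_HOST service principal
    database="default",
)

df = pd.read_sql("SELECT * FROM employees LIMIT 10", conn)
print(df)
```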


Putting everything together — Kerberos, HDFS, YARN, Spark, Hive, Edge Node, Client Node

This is the sixth and final part of the Apache Hadoop ecosystem setup as explained in Apache Hadoop Multi-Node Kerberized Cluster Setup, where in the previous stories we went through the following chapters:


This is the fifth part of the Apache Hadoop ecosystem setup as explained in Apache Hadoop Multi-Node Kerberized Cluster Setup, where in the previous stories we went through the following chapters:


This is the fourth part of the Apache Hadoop ecosystem setup as explained in Apache Hadoop Multi-Node Kerberized Cluster Setup, where in the previous stories we went through the overall deployment architecture, followed by the setup of the initial system with Kerberos, and then the setup of multi-node Hadoop with HDFS…


This is the third part of the Apache Hadoop ecosystem setup as explained in Apache Hadoop Multi-Node Kerberized Cluster Setup, where in the previous stories we went through the overall deployment architecture and set up the initial system with Kerberos. …


Kerberos — gatekeeping with 3 locks — Authentication Server, Database, Ticket Granting Server

As explained in Apache Hadoop Multi-Node Kerberized Cluster Setup, in this story we shall perform the initial setup of the Hadoop ecosystem with the required packages and then set up Kerberos on all cluster nodes.
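As a quick sanity check once the KDC is running, each node can obtain and inspect a ticket. Here is a minimal Python sketch wrapping the standard `kinit` and `klist` commands; the principal and keytab path are hypothetical.

```python
# Minimal sketch (hypothetical principal and keytab): obtain and inspect a
# Kerberos ticket on a node, wrapping the standard kinit/klist commands.
import subprocess

def obtain_ticket(principal: str, keytab: str) -> None:
    # kinit -kt <keytab> <principal> requests a TGT from the Authentication Server
    subprocess.run(["kinit", "-kt", keytab, principal], check=True)
    # klist prints the cached ticket issued via the Ticket Granting Server
    subprocess.run(["klist"], check=True)

if __name__ == "__main__":
    obtain_ticket("hdfs/node1.example.com@EXAMPLE.COM",
                  "/etc/security/keytabs/hdfs.service.keytab")
```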

Here are the various nodes on which we will set up the Hadoop ecosystem.


Measuring model performance metrics

As with any other software development, testing and evaluating your machine learning model is essential before it can be used to make actual predictions. Quality metrics such as Accuracy, Precision, and Recall (also called Sensitivity) are often used to measure a model. But measuring one metric may not give…
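To illustrate, here is a tiny sketch with made-up labels that computes several of these metrics side by side using scikit-learn; the data is purely for demonstration.

```python
# Minimal sketch with made-up labels: several quality metrics computed side by
# side, since no single number tells the whole story.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))   # a.k.a. Sensitivity
print("F1 score :", f1_score(y_true, y_pred))
```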

Ravi Chamarthy

Software Architect, IBM Watson OpenScale — Trusted AI
