Setting up Kerberized Cloudera Data Platform

Ravi Chamarthy
17 min read · Jul 17, 2022


This article is a step-by-step guide to setting up a Kerberized Cloudera Data Platform: installing the base systems, Kerberizing the cluster, configuring the necessary Hadoop packages, and finally validating the system by running sample Spark jobs.

Systems Setup

Systems
For setting up Cloudera Data Platform, four RHEL 7.9-based VMs are provisioned (use the latest RHEL version, as needed):

kcdpmasterl1.good-code.com
kcdpworkerm1.good-code.com
kcdpworkern1.good-code.com
kcdpworkero1.good-code.com
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)

Locale information
Add locale-related attributes to the /etc/environment file on all nodes.

# cat /etc/environment
LANG=en_US.utf-8
LC_ALL=en_US.utf-8

Update all nodes with the latest packages

# yum update -y

Hosts Mapping
Add the IP address and hostname of every cluster node to the /etc/hosts file on all cluster nodes.

# cat /etc/hosts
10.99.77.123 kcdpmasterl1.good-code.com kcdpmasterl1
10.44.99.61 kcdpworkerm1.good-code.com kcdpworkerm1
10.99.1.239 kcdpworkern1.good-code.com kcdpworkern1
10.44.55.209 kcdpworkero1.good-code.com kcdpworkero1

Copy the SSH public key from the master node to all other nodes.

ssh-copy-id -i ~/.ssh/id_rsa.pub root@kcdpworkerm1.good-code.com
ssh-copy-id -i ~/.ssh/id_rsa.pub root@kcdpworkern1.good-code.com
ssh-copy-id -i ~/.ssh/id_rsa.pub root@kcdpworkero1.good-code.com

Verify passwordless login from the master node to itself and to all worker nodes.

[root@kcdpmasterl1 ~]# ssh root@kcdpmasterl1
[root@kcdpmasterl1 ~]# ssh root@kcdpmasterl1.good-code.com
[root@kcdpmasterl1 ~]# ssh root@kcdpworkerm1
[root@kcdpmasterl1 ~]# ssh root@kcdpworkerm1.good-code.com
[root@kcdpmasterl1 ~]# ssh root@kcdpworkern1
[root@kcdpmasterl1 ~]# ssh root@kcdpworkern1.good-code.com
[root@kcdpmasterl1 ~]# ssh root@kcdpworkero1
[root@kcdpmasterl1 ~]# ssh root@kcdpworkero1.good-code.com

Install OpenJDK 8 on all nodes

[root@kcdpmasterl1 ~]# yum install -y http://birepo-build.svl.good-code.com/repos/Cloudera/Cloudera_Manager/RHEL7/x86_64/7.1.1/Beta/RPMS/x86_64/openjdk8-8.0+232_9-cloudera.x86_64.rpm

Install pip and pandas
Note: the default Python that ships with RHEL 7.9 is Python 2.7. Upgrade to 3.9.x, as needed.

# curl https://bootstrap.pypa.io/pip/2.7/get-pip.py -o get-pip.py
# python get-pip.py
# python -m pip install pandas
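
A quick, optional check that the install worked (not part of the original steps):

# python -c "import pandas; print(pandas.__version__)"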

Make sure the default target is multi-user.target
If it is not, set it using the following command, then reboot the VM.

# systemctl get-default
multi-user.target
# systemctl set-default multi-user.target

Make sure the HOSTNAME is set.

# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=kcdpmasterl1.good-code.com
# echo $HOSTNAME
kcdpmasterl1.good-code.com
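
On RHEL 7 the hostname can also be set through systemd; a one-liner such as the following (shown for the master node; repeat with each node's own FQDN) should work:

# hostnamectl set-hostname kcdpmasterl1.good-code.com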

Make sure the firewall is disabled

# systemctl disable firewalld
# systemctl status firewalld
firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)

Change SELINUX from enforcing to disabled

# cat /etc/selinux/config
...
SELINUX=disabled
...
SELINUXTYPE=targeted
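
If the file still shows SELINUX=enforcing, a one-line edit such as the following (a sketch; a reboot is required for it to take effect) will flip it:

# sed -i 's/^SELINUX=enforcing$/SELINUX=disabled/' /etc/selinux/config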

Install NTP, if not already installed, and verify that it is running.

# systemctl status ntpd
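
If the service is missing, a minimal setup from the base RHEL repos would be:

# yum install -y ntp
# systemctl enable ntpd
# systemctl start ntpd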

Disable THP

Note: it is very important to make this change; otherwise, the Cloudera Manager installation pre-check step will fail.

Before the change:
# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never
Change:
# echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
# echo "never" > /sys/kernel/mm/transparent_hugepage/defrag
After the change:
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
# cat /sys/kernel/mm/transparent_hugepage/defrag
always madvise [never]
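
Note that these echo writes do not survive a reboot. One common way to persist them (an assumption here, not from the original setup) is to append them to /etc/rc.d/rc.local:

# cat >> /etc/rc.d/rc.local <<'EOF'
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
EOF
# chmod +x /etc/rc.d/rc.local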

Make sure umask is set to 022

Setting the umask for your current login session:
# umask 0022
Checking your current umask:
# umask
Permanently changing the umask for all interactive users:
# echo umask 0022 >> /etc/profile

Set vm.swappiness to 10

On most systems, vm.swappiness is set to 60 by default. This is not suitable for Hadoop clusters because processes are sometimes swapped even when enough memory is available. This can cause lengthy garbage collection pauses for important system daemons, affecting stability and performance.

Cloudera recommends that you set vm.swappiness to a value between 1 and 10, preferably 1, for minimum swapping on systems where the RHEL kernel is 2.6.32-642.el6 or higher.

[root@kcdpmasterl1 ~]# cat /etc/sysctl.conf | grep "vm.swappiness"
vm.swappiness = 10
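
If the entry is not present yet, append it and reload sysctl (standard usage):

# echo "vm.swappiness = 10" >> /etc/sysctl.conf
# sysctl -p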

Kerberos Setup

Install Kerberos client in all nodes

# yum -y install krb5-workstation krb5-libs

Install Kerberos server on the master node

[root@kcdpmasterl1 ~]# yum -y install krb5-server

Configure Kerberos KDC on the master node

[root@kcdpmasterl1 ~]# cat /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
HADOOPCLUSTER.LOCAL = {
#master_key_type = aes256-cts
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/share/dict/words
admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal
default_principal_flags = +renewable, +forwardable
}

Configure krb5.conf file on the master node

[root@kcdpmasterl1 etc]# cat /etc/krb5.conf
# To opt out of the system crypto-policies configuration of krb5, remove the
# symlink at /etc/krb5.conf.d/crypto-policies which will not be recreated.
includedir /etc/krb5.conf.d/
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 4320h
renew_lifetime = 7d
forwardable = true
default_realm = HADOOPCLUSTER.LOCAL
[realms]
HADOOPCLUSTER.LOCAL = {
kdc = kcdpmasterl1.good-code.com
admin_server = kcdpmasterl1.good-code.com
}
[domain_realm]
.good-code.com = HADOOPCLUSTER.LOCAL
good-code.com = HADOOPCLUSTER.LOCAL
kcdpmasterl1.good-code.com = HADOOPCLUSTER.LOCAL
kcdpworkerm1.good-code.com = HADOOPCLUSTER.LOCAL
kcdpworkern1.good-code.com = HADOOPCLUSTER.LOCAL
kcdpworkero1.good-code.com = HADOOPCLUSTER.LOCAL

Configure the Kerberos ACLs on the master node

[root@kcdpmasterl1 etc]# cat /var/kerberos/krb5kdc/kadm5.acl
*/admin@HADOOPCLUSTER.LOCAL *
*/kcdpmasterl1.good-code.com@HADOOPCLUSTER.LOCAL *

Specify the Kerberos cache name (KRB5CCNAME) in the root user's .bashrc profile file on all nodes.

[root@kcdpmasterl1 etc]# cat ~/.bashrc | grep KRB5CCNAME
export KRB5CCNAME=/tmp/krb5cc
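
To push this to every node in one pass, a loop in the same style used later for krb5.conf should work (a sketch, assuming passwordless SSH is in place):

[root@kcdpmasterl1 ~]# for h in kcdpworkerm1 kcdpworkern1 kcdpworkero1 ; { ssh root@$h 'echo "export KRB5CCNAME=/tmp/krb5cc" >> ~/.bashrc' ; }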

Create the KDC database

[root@kcdpmasterl1 ~]# kdb5_util create -r HADOOPCLUSTER.LOCAL -s
Loading random data
Initializing database '/var/kerberos/krb5kdc/principal' for realm 'HADOOPCLUSTER.LOCAL',
master key name 'K/M@HADOOPCLUSTER.LOCAL'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key: XXXXXXXXX
Re-enter KDC database master key to verify: XXXXXXXXX
[root@kcdpmasterl1 ~]#

Create the root/admin user

[root@kcdpmasterl1 ~]# kadmin.local
Authenticating as principal root/admin@HADOOPCLUSTER.LOCAL with password.
kadmin.local: addprinc root/admin@HADOOPCLUSTER.LOCAL
WARNING: no policy specified for root/admin@HADOOPCLUSTER.LOCAL; defaulting to no policy
Enter password for principal "root/admin@HADOOPCLUSTER.LOCAL": XXXXXXXXX
Re-enter password for principal "root/admin@HADOOPCLUSTER.LOCAL": XXXXXXXXX
Principal "root/admin@HADOOPCLUSTER.LOCAL" created.
kadmin.local: exit
[root@kcdpmasterl1 ~]#

Start the krb5kdc and kadmin services

[root@kcdpmasterl1 ~]# service krb5kdc start
[root@kcdpmasterl1 ~]# service kadmin start
[root@kcdpmasterl1 ~]# service krb5kdc status
Mar 28 03:20:58 kcdpmasterl1.good-code.com systemd[1]: Starting Kerberos 5 KDC...
Mar 28 03:20:58 kcdpmasterl1.good-code.com systemd[1]: Started Kerberos 5 KDC.
[root@kcdpmasterl1 ~]# service kadmin status
..
Mar 28 03:21:02 kcdpmasterl1.good-code.com systemd[1]: Starting Kerberos 5 Password-changing and Administration...
Mar 28 03:21:02 kcdpmasterl1.good-code.com systemd[1]: Started Kerberos 5 Password-changing and Administration.

Create the hadoop/admin user

[root@kcdpmasterl1 ~]# kadmin
...
Password for root/admin@HADOOPCLUSTER.LOCAL:
kadmin: addprinc hadoop/admin@HADOOPCLUSTER.LOCAL
...
Enter password for principal "hadoop/admin@HADOOPCLUSTER.LOCAL": XXXXXXXXX
Re-enter password for principal "hadoop/admin@HADOOPCLUSTER.LOCAL": XXXXXXXXX
Principal "hadoop/admin@HADOOPCLUSTER.LOCAL" created.
kadmin: exit

Getting the first ticket for the root/admin user

[root@kcdpmasterl1 ~]# klist
klist: No credentials cache found (filename: /tmp/krb5cc)
[root@kcdpmasterl1 ~]# klist -A
[root@kcdpmasterl1 ~]# kinit root/admin
Password for root/admin@HADOOPCLUSTER.LOCAL: XXXXXXX
[root@kcdpmasterl1 ~]# klist
Ticket cache: FILE:/tmp/krb5cc
Default principal: root/admin@HADOOPCLUSTER.LOCAL
Valid starting Expires Service principal
03/28/2022 03:22:25 03/29/2022 03:22:25 krbtgt/HADOOPCLUSTER.LOCAL@HADOOPCLUSTER.LOCAL
[root@kcdpmasterl1 ~]# chkconfig krb5kdc on
...
[root@kcdpmasterl1 ~]# chkconfig kadmin on
...

Copy the krb5.conf file from master node to worker nodes.

[root@kcdpmasterl1 ~]# for h in kcdpworkerm1 kcdpworkern1 kcdpworkero1 ; { scp /etc/krb5.conf root@$h:/etc ; }

Reboot all the nodes

# reboot

MySQL Setup — Hive Metastore

Install MySQL on the master node

[root@kcdpmasterl1 ~]# rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2022
[root@kcdpmasterl1 ~]# rpm -Uvh https://repo.mysql.com/mysql80-community-release-el7-3.noarch.rpm
...
[root@kcdpmasterl1 ~]# sed -i 's/enabled=1/enabled=0/' /etc/yum.repos.d/mysql-community.repo
[root@kcdpmasterl1 ~]# yum -y --enablerepo=mysql57-community install mysql-community-server
...
[root@kcdpmasterl1 ~]# service mysqld start
Redirecting to /bin/systemctl start mysqld.service
[root@kcdpmasterl1 ~]# service mysqld status
...
Mar 28 04:28:35 kcdpmasterl1.good-code.com systemd[1]: Starting MySQL Server...
Mar 28 04:28:40 kcdpmasterl1.good-code.com systemd[1]: Started MySQL Server.
[root@kcdpmasterl1 ~]# grep "A temporary password" /var/log/mysqld.log
2022-03-28T11:28:37.397163Z 1 [Note] A temporary password is generated for root@localhost: %a)t/F%k%97q

Secure MySQL

[root@kcdpmasterl1 ~]# mysql_secure_installation
Securing the MySQL server deployment.
Enter password for user root: %a)t/F%k%97q
The existing password for the user account root has expired. Please set a new password.
New password: XXXXXXXXX
Re-enter new password: XXXXXXXXX
...
Change the password for root ? ((Press y|Y for Yes, any other key for No) : n
...
Remove anonymous users? (Press y|Y for Yes, any other key for No) : y
...
Disallow root login remotely? (Press y|Y for Yes, any other key for No) : y
...
Remove test database and access to it? (Press y|Y for Yes, any other key for No) : y
...
Reload privilege tables now? (Press y|Y for Yes, any other key for No) : y
Success.
All done!
[root@kcdpmasterl1 ~]# service mysqld restart
Redirecting to /bin/systemctl restart mysqld.service
[root@kcdpmasterl1 ~]# service mysqld status
...
Mar 28 04:29:41 kcdpmasterl1.good-code.com systemd[1]: Starting MySQL Server...
Mar 28 04:29:41 kcdpmasterl1.good-code.com systemd[1]: Started MySQL Server.
[root@kcdpmasterl1 ~]# chkconfig mysqld on
Note: Forwarding request to 'systemctl enable mysqld.service'.

Create Cloudera related databases

[root@kcdpmasterl1 ~]# mysql -u root -p
Enter password: XXXXXXXX
...
mysql> create database cm;
mysql> create database hive;
mysql> create database amon;
mysql> create database hue;
mysql> create database rman;
mysql> create database sentry;
mysql> create database oozie;
mysql> create database nava DEFAULT CHARACTER SET UTF8;
mysql> create database navm DEFAULT CHARACTER SET UTF8;
mysql> create database kudu DEFAULT CHARACTER SET UTF8;
mysql> GRANT ALL PRIVILEGES ON cm.* TO 'cm'@'%' IDENTIFIED BY 'MySQLPassword1@24';
mysql> GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%' IDENTIFIED BY 'MySQLPassword1@24';
mysql> GRANT ALL PRIVILEGES ON amon.* TO 'amon'@'%' IDENTIFIED BY 'MySQLPassword1@24';
mysql> GRANT ALL PRIVILEGES ON hue.* TO 'hue'@'%' IDENTIFIED BY 'MySQLPassword1@24';
mysql> GRANT ALL PRIVILEGES ON rman.* TO 'rman'@'%' IDENTIFIED BY 'MySQLPassword1@24';
mysql> GRANT ALL PRIVILEGES ON sentry.* TO 'sentry'@'%' IDENTIFIED BY 'MySQLPassword1@24';
mysql> GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'%' IDENTIFIED BY 'MySQLPassword1@24';
mysql> GRANT ALL PRIVILEGES ON nava.* TO 'nava'@'%' IDENTIFIED BY 'MySQLPassword1@24';
mysql> GRANT ALL PRIVILEGES ON navm.* TO 'navm'@'%' IDENTIFIED BY 'XXXXXXXX';
mysql> GRANT ALL PRIVILEGES ON kudu.* TO 'kudu'@'%' IDENTIFIED BY 'XXXXXXXX';
mysql> GRANT ALL PRIVILEGES ON cm.* TO 'cm'@'localhost' IDENTIFIED BY 'XXXXXXXX';
mysql> GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost' IDENTIFIED BY 'XXXXXXXX';
mysql> GRANT ALL PRIVILEGES ON amon.* TO 'amon'@'localhost' IDENTIFIED BY 'XXXXXXXX';
mysql> GRANT ALL PRIVILEGES ON hue.* TO 'hue'@'localhost' IDENTIFIED BY 'XXXXXXXX';
mysql> GRANT ALL PRIVILEGES ON rman.* TO 'rman'@'localhost' IDENTIFIED BY 'XXXXXXXX';
mysql> GRANT ALL PRIVILEGES ON sentry.* TO 'sentry'@'localhost' IDENTIFIED BY 'XXXXXXXX';
mysql> GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'localhost' IDENTIFIED BY 'XXXXXXXX';
mysql> GRANT ALL PRIVILEGES ON nava.* TO 'nava'@'localhost' IDENTIFIED BY 'XXXXXXXX';
mysql> GRANT ALL PRIVILEGES ON navm.* TO 'navm'@'localhost' IDENTIFIED BY 'XXXXXXXX';
mysql> GRANT ALL PRIVILEGES ON kudu.* TO 'kudu'@'localhost' IDENTIFIED BY 'XXXXXXXX';
mysql> commit;
mysql> FLUSH PRIVILEGES;
mysql> exit;
Bye

Download and install the MySQL Java connector

[root@kcdpmasterl1 ~]# wget -P /etc/yum.repos.d/ https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.48.tar.gz
...
[root@kcdpmasterl1 ~]# cd /etc/yum.repos.d/
[root@kcdpmasterl1 yum.repos.d]# tar xvzf mysql-connector-java-5.1.48.tar.gz
...
[root@kcdpmasterl1 yum.repos.d]# cd mysql-connector-java-5.1.48
[root@kcdpmasterl1 mysql-connector-java-5.1.48]# mkdir -p /usr/share/java
[root@kcdpmasterl1 mysql-connector-java-5.1.48]# cp mysql-connector-java-5.1.48-bin.jar /usr/share/java/mysql-connector-java.jar

Cloudera Data Platform Installation

Install Cloudera Manager on Master Node

Download the Cloudera Manager repo and install the Cloudera Manager daemons, agent, and server.

[root@kcdpmasterl1 ~]# wget -P /etc/yum.repos.d http://birepo-build.svl.good-code.com/repos/cm7/7.4.4/redhat7/yum/cloudera-manager.repo
...
[root@kcdpmasterl1 ~]# yum install -y cloudera-manager-daemons cloudera-manager-agent cloudera-manager-server
...

Configure Certificates and Cloudera DB

Note: the certificates may not get created on the first attempt. You may need to run the steps a couple of times with a temporary cert store, as shown below.

[root@kcdpmasterl1 ~]# export JAVA_HOME=/usr/java/jdk1.8.0_232-cloudera
[root@kcdpmasterl1 ~]# cd /opt/cloudera/cm-agent/bin/
[root@kcdpmasterl1 bin]# ./certmanager setup --configure-services
INFO:root:Logging to /var/log/cloudera-scm-agent/certmanager.log
[root@kcdpmasterl1 certmanager]# rm -fr /var/lib/cloudera-scm-server/certmanager/
[root@kcdpmasterl1 certmanager]# cd /root
[root@kcdpmasterl1 ~]# cd /opt/cloudera/cm-agent/bin/
[root@kcdpmasterl1 bin]# ./certmanager --location /opt/cloudera/CMCA2 setup --configure-services
INFO:root:Logging to /var/log/cloudera-scm-agent/certmanager.log
...
[root@kcdpmasterl1 bin]# rm -fr /var/lib/cloudera-scm-server/certmanager/
[root@kcdpmasterl1 bin]# cd /opt/cloudera/cm-agent/bin/
[root@kcdpmasterl1 bin]# rm -rf /var/lib/cloudera-scm-server/certmanager
[root@kcdpmasterl1 bin]# cd /root
[root@kcdpmasterl1 ~]# /opt/cloudera/cm-agent/bin/certmanager setup --configure-services
INFO:root:Logging to /var/log/cloudera-scm-agent/certmanager.log
[root@kcdpmasterl1 ~]# cat /var/log/cloudera-scm-agent/certmanager.log
[28/Mar/2022 05:11:27 -0700] 5904 MainThread cert INFO SCM Certificate Manager
[28/Mar/2022 05:11:27 -0700] 5904 MainThread os_ops INFO Created directory /var/lib/cloudera-scm-server/certmanager None None 0o755
...
[28/Mar/2022 05:13:56 -0700] 6207 MainThread cert INFO Bootstrapping keystore and truststore to: /var/lib/cloudera-scm-agent/agent-cert
[28/Mar/2022 05:13:56 -0700] 6207 MainThread cert INFO Generating key used to sign certificate request tokens

Initialize the Cloudera SCM database

[root@kcdpmasterl1 ~]# /opt/cloudera/cm/schema/scm_prepare_database.sh mysql -h kcdpmasterl1.good-code.com cm cm
Enter SCM password: XXXXXXXX
JAVA_HOME=/usr/java/jdk1.8.0_232-cloudera
Verifying that we can write to /etc/cloudera-scm-server
Creating SCM configuration file in /etc/cloudera-scm-server
...
Successfully connected to database.
All done, your SCM database is configured correctly!

Start Cloudera SCM Server
The Cloudera SCM Server is the central web server that manages the Cloudera cluster. A good blog post on Cloudera terminology: https://blog.cloudera.com/how-does-cloudera-manager-work/

[root@kcdpmasterl1 ~]# systemctl start cloudera-scm-server
...
[root@kcdpmasterl1 ~]# tail -n 100 -f /var/log/cloudera-scm-server/cloudera-scm-server.log
...
2022-03-28 05:16:24,861 INFO WebServerImpl:com.cloudera.server.cmf.WebServerImpl: Started Jetty server.
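
A quick way to confirm the server is listening on the web ports (7180 for HTTP, 7183 for HTTPS once AutoTLS is on):

[root@kcdpmasterl1 ~]# ss -lntp | grep -E ':7180|:7183'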

Install Cloudera Components

Cloudera SCM — Launch Cloudera Manager

Open http://kcdpmasterl1.good-code.com:7180, which will redirect to https://kcdpmasterl1.good-code.com:7183/cmf/login because we set up SSL/TLS as part of the certificate configuration steps above.

Default username/password : admin/admin

Cloudera SCM

Upload the license file

CDP License File

Look for a confirmation that the license uploaded successfully.

Cloudera Manager 7.4.4

AutoTLS confirmation
In the next screen, look for confirmation that AutoTLS has already been enabled; this was done as part of the certificate configuration steps above. Ignore the Kerberos-related warning, as we are yet to configure Kerberos.

AutoTLS Confirmation

Specify a cluster name

Cloudera Cluster Name

Specify cluster nodes
Enter all the cluster nodes and search for them, so that they can be added to the cluster.

Cloudera Cluster Nodes

Cloudera Repo
Select the Cloudera repo as downloaded in the earlier steps. Leave the other software/parcel selections at their defaults.

Cloudera Repo Selection

Select the JDK as the system-provided version of OpenJDK

JDK Selection

Specify the root password for all the nodes.

Root Password for all Cloudera Cluster Nodes

Installation of Agents will begin
Note: if there was any issue with the certificate setup in the previous steps, the Cloudera agent installations will fail.

Agents installation

Wait for the Cloudera agent installation to complete

Agents Installation Confirmation

Download the parcels.
Parcels are the packaged Hadoop components: Hue, HDFS, Yarn, Oozie, Spark, Hive, Zeppelin, etc.

Downloads Parcels

Distribute the parcels.
Once downloaded, distribution of the parcels will start (they are downloaded on the master and copied to all worker nodes).

Parcels Distribution

The next step is unpacking these parcels on all nodes, followed by inspecting the network performance and inspecting the hosts for any issues. Make sure you have run all the steps in the Setup section, otherwise this step will fail. For the services to install on the cluster, choose “Custom Services”.

Select Parcels
Select HDFS, Hive, Hive on Tez, Hue, Spark, Tez, Yarn, and Yarn Queue Manager.

View the parcels selection by Host.

Selected Parcels.

Specify the connection details for database dependent parcels
Once specified, test the connections and make sure no further parameters need to be specified.

Database Connection Details

Parcels installation and configuration will start.

Parcels Installation

Hive table creation might fail with a “flush hosts” issue; run the following commands to fix it. Hue might also fail to start; start it manually from the cluster, then go back to the previous screen and return to this one. Do not abandon the installation at this step: we have executed over 50 steps by now, and rebuilding a cluster from scratch is expensive and time-consuming. Search for the issue (if any), fix it manually, and let the installation complete.

[root@kcdpmasterl1 ~]# mysql -u root -p
Enter password: XXXXXXX
...
mysql> flush hosts;
Query OK, 0 rows affected (0.00 sec)
mysql> SET GLOBAL max_connect_errors=10000;
Query OK, 0 rows affected (0.00 sec)
mysql> set global max_connections = 200;
Query OK, 0 rows affected (0.00 sec)
mysql> show variables like "max_connections";
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 200 |
+-----------------+-------+
1 row in set (0.01 sec)
mysql> show variables like "max_connect_errors";
+--------------------+-------+
| Variable_name | Value |
+--------------------+-------+
| max_connect_errors | 10000 |
+--------------------+-------+
1 row in set (0.00 sec)
mysql> commit;
Query OK, 0 rows affected (0.00 sec)
mysql> exit;
Bye
[root@kcdpmasterl1 ~]# service mysqld restart
Redirecting to /bin/systemctl restart mysqld.service
[root@kcdpmasterl1 ~]# service mysqld status
...
Mar 28 04:29:41 kcdpmasterl1.good-code.com systemd[1]: Starting MySQL Server...
Mar 28 04:29:41 kcdpmasterl1.good-code.com systemd[1]: Started MySQL Server.
[root@kcdpmasterl1 ~]# chkconfig mysqld on
Note: Forwarding request to 'systemctl enable mysqld.service'.

Cluster configuration is complete.

Cloudera Data Platform Cluster Configuration Confirmation

Make sure to resolve any port conflicts for HiveServer2 across nodes by changing the default port numbers to different values. Set the Java home for all cluster nodes, then restart the cluster.

Install Livy on master node

[root@kcdpmasterl1 ~]# cd /opt
[root@kcdpmasterl1 opt]# wget https://mirrors.estointernet.in/apache/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip
...
[root@kcdpmasterl1 opt]# unzip apache-livy-0.7.0-incubating-bin.zip
...
[root@kcdpmasterl1 opt]# mv apache-livy-0.7.0-incubating-bin livy
[root@kcdpmasterl1 opt]# cd livy/conf/
[root@kcdpmasterl1 conf]# mv livy.conf.template livy.conf
[root@kcdpmasterl1 conf]# vi ~/.bashrc
[root@kcdpmasterl1 conf]# source ~/.bashrc
[root@kcdpmasterl1 conf]# livy-server start
starting java -cp /opt/livy/jars/*:/opt/livy/conf:/etc/hadoop/conf: org.apache.livy.server.LivyServer, logging to /opt/livy/logs/livy-root-server.out
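
The ~/.bashrc edit above is elided in the transcript; a minimal sketch of what it would add (an assumption) so that livy-server resolves on the PATH:

export LIVY_HOME=/opt/livy
export PATH=$PATH:$LIVY_HOME/bin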

Enable Kerberos in Cloudera Data Platform

Launch Cloudera SCM

Cloudera SCM

Select MIT KDC
As we have already installed Kerberos and configured the KDC and kadmin, select the “I have completed all the above steps” option.

MIT KDC

KDC Details
Select the following as the Kerberos Encryption Types

aes256-cts
aes128-cts
des3-hmac-sha1
arcfour-hmac
des-hmac-sha1
des-cbc-md5
des-cbc-crc
aes256-cts-hmac-sha1-96

Specify the Kerberos Realm, KDC Server Host and KDC Admin Server Host

Kerberos Configuration

Manage krb5.conf
Specify the location of the krb5.conf file.
DO NOT SELECT the option “Manage krb5.conf through Cloudera Manager”

krb5.conf file location

Specify the root/admin Kerberos Principal and its password

Kerberos root/admin principal password

Confirmation on using the above account details.

Kerberos Configuration Confirmation

Select Default Kerberos Principals
This selection creates the Kerberos principals for the various services, such as HDFS, Hive, Yarn, etc.

Kerberos Principals Creation

Kerberos Setup Progress

Kerberos Setup Progress

Completion of Kerberizing Cloudera.

Kerberos Configuration Confirmation
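
As a sanity check (not part of the wizard), the generated service principals can be listed on the KDC:

[root@kcdpmasterl1 ~]# kadmin.local -q "listprincs" | grep -E 'hdfs|hive|yarn'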

Cluster Validation — Spark Jobs Execution

Set HADOOP_USER_NAME in all nodes

[root@kcdpmasterl1 ~]# cat ~/.bashrc | grep HADOOP_USER_NAME
export HADOOP_USER_NAME=hdfs

HDFS sample commands

# kdestroy
# hdfs dfs -ls /
22/03/29 01:44:22 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
ls: DestHost:destPort kcdpmasterl1.good-code.com:8020 , LocalHost:localPort kcdpmasterl1.good-code.com/10.11.58.123:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
# kinit -kt /var/run/cloudera-scm-agent/process/376-hive-HIVEMETASTORE/hive.keytab hive/kcdpmasterl1.good-code.com@HADOOPCLUSTER.LOCAL
[root@kcdpmasterl1 ~]# hdfs dfs -ls /
Found 7 items
drwxr-xr-x - hdfs supergroup 0 2022-03-28 09:41 /code
drwxr-xr-x - hive hive 0 2022-03-28 07:34 /hive
drwxr-xr-x - hdfs supergroup 0 2022-03-28 09:09 /testing_data
drwxrwxrwt - hdfs supergroup 0 2022-03-28 07:55 /tmp
drwxr-xr-x - hdfs supergroup 0 2022-03-28 10:11 /user
drwxr-xr-x - hdfs supergroup 0 2022-03-28 07:34 /warehouse
drwxr-xr-x - yarn hadoop 0 2022-03-28 07:34 /yarn

Load sample data to HDFS

[root@kcdpmasterl1 ~]# wget https://raw.githubusercontent.com/ravichamarthy/spark/master/airports.csv
...
[root@kcdpmasterl1 ~]# hdfs dfs -mkdir /testing_data
[root@kcdpmasterl1 ~]# hdfs dfs -put -f airports.csv /testing_data
[root@kcdpmasterl1 ~]# hdfs dfs -cat /testing_data/airports.csv | tail -n 10
...
9540,"Deer Harbor Seaplane","Deer Harbor","United States","DHB",\N,48.618397,-123.00596,0,-8,"A","America/Los_Angeles"
9541,"San Diego Old Town Transit Center","San Diego","United States","OLT",\N,32.7552,-117.1995,0,-8,"A","America/Los_Angeles"
...

Sample Spark job — pi.py

[root@kcdpmasterl1 ~]# cd /opt/cloudera/parcels/CDH/lib/spark
[root@kcdpmasterl1 spark]# gunzip python.tar.gz
[root@kcdpmasterl1 spark]# tar xvf python.tar
...
[root@kcdpmasterl1 spark]# spark-submit --deploy-mode cluster /opt/cloudera/parcels/CDH/lib/spark/pi.py 10
[root@kcdpmasterl1 spark]# yarn logs -applicationId application_1648539318804_0010 | grep roughly
...
Pi is roughly 3.141952

Sample Spark job — org.apache.spark.examples.SparkPi

# spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi /opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples_2.11-2.4.7.7.1.7.0-551.jar 10
# yarn logs -applicationId application_1648539318804_0011 | grep roughly
...
Pi is roughly 3.143107143107143

Save Spark Jobs to HDFS — run by reference.

[root@kcdpmasterl1 jars]# hdfs dfs -mkdir /code
[root@kcdpmasterl1 jars]# hdfs dfs -put -f /opt/cloudera/parcels/CDH/lib/spark/pi.py /code
[root@kcdpmasterl1 jars]# pwd
/opt/cloudera/parcels/CDH/lib/spark/examples/jars
[root@kcdpmasterl1 jars]# cd /root
[root@kcdpmasterl1 ~]# vi cdp.hdfs.testing.py
[root@kcdpmasterl1 ~]# hdfs dfs -put -f cdp.hdfs.testing.py /code
[root@kcdpmasterl1 ~]# vi cdp.hive.testing.py
[root@kcdpmasterl1 ~]# cat cdp.hdfs.testing.py
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("secureHadoop").enableHiveSupport().getOrCreate()
    path = "hdfs://kcdpmasterl1.good-code.com:8020/testing_data"
    print("L1:Path accessed in HDFS is : {}".format(path))
    df = spark.read.format("csv").load(path)
    df.show()
    spark.stop()
[root@kcdpmasterl1 ~]# cat cdp.hive.testing.py
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if __name__ == "__main__":
    sparkSession = SparkSession.builder.appName('Spark Hive Kerberized Testing').enableHiveSupport().getOrCreate()
    all_databases_df = sparkSession.sql("show databases")
    all_databases_pd = all_databases_df.toPandas()
    print('all databases')
    print(all_databases_pd)
    print('total number of airports')
    rows_count_df = sparkSession.sql("select count(*) from testing_data.airports")
    rows_count_pd = rows_count_df.toPandas()
    print(rows_count_pd)
    sparkSession.stop()
[root@kcdpmasterl1 ~]# hdfs dfs -put -f cdp.hive.testing.py /code

Sample Spark Job — pi.py from HDFS

# spark-submit --deploy-mode cluster hdfs://kcdpmasterl1.good-code.com/code/pi.py
# yarn logs -applicationId application_1648539318804_0012 | grep roughly -A 10
...
Pi is roughly 3.141960

Create Hive table — Use Hue.
Table creation script:

CREATE external TABLE airports (Airport_ID int, Name string, City string, Country string, IATA_FAA string, ICAO string, Latitude float, Longitude float, Altitude int, Timezone float, DST string, Tz_db_time_zone string) COMMENT "The table [airports]" ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 'hdfs://10.99.44.123/testing_data';

Load data into the Hive table. Since the table is external and its LOCATION points at /testing_data, the airports.csv file loaded earlier into HDFS is immediately queryable.
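
A quick sanity check from Hue (a hypothetical query against the new table):

select count(*) from testing_data.airports;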

Sample Spark Job — cdp.hdfs.testing.py from HDFS

# spark-submit --deploy-mode cluster hdfs://kcdpmasterl1.good-code.com/code/cdp.hdfs.testing.py
# yarn logs -applicationId application_1648539318804_0013 | grep L1 -A 10
...
22/03/29 02:47:36 INFO client.RMProxy: Connecting to ResourceManager at kcdpmasterl1.good-code.com/10.11.58.123:8032
L1:Path accessed in HDFS is : hdfs://kcdpmasterl1.good-code.com:8020/testing_data
+----------+--------------------+--------------+----------------+--------+----+---------+----------+--------+--------+----+--------------------+
| _c0| _c1| _c2| _c3| _c4| _c5| _c6| _c7| _c8| _c9|_c10| _c11|
+----------+--------------------+--------------+----------------+--------+----+---------+----------+--------+--------+----+--------------------+
|Airport_ID| Name| City| Country|IATA_FAA|ICAO| Latitude| Longitude|Altitude|Timezone| DST| Tz_db_time_zone|
...
| 6| Wewak Intl| Wewak|Papua New Guinea| WWK|AYWK|-3.583828|143.669186| 19| 10| U|Pacific/Port_Moresby|

Sample Spark Job — cdp.hive.testing.py from HDFS

# spark-submit --deploy-mode cluster hdfs://kcdpmasterl1.good-code.com/code/cdp.hive.testing.py
# yarn logs -applicationId application_1648539318804_0014 | grep 8108 -B 10
...
22/03/29 02:48:47 INFO client.RMProxy: Connecting to ResourceManager at kcdpmasterl1.good-code.com/10.11.58.123:8032
LogLength:173
LogContents:
all databases
databaseName
0 default
1 information_schema
2 sys
3 testing_data
total number of airports
count(1)
0 8108

Kerberize Livy

Livy — Kerberos Configuration

# livy-server stop
# cat /opt/livy/conf/livy.conf
...
livy.server.auth.type = kerberos
livy.server.auth.kerberos.principal = hive/kcdpmasterl1.good-code.com@HADOOPCLUSTER.LOCAL
livy.server.auth.kerberos.keytab = /var/run/cloudera-scm-agent/process/315-hive-HIVEMETASTORE/hive.keytab
livy.server.launch.kerberos.principal = HTTP/kcdpmasterl1.good-code.com@HADOOPCLUSTER.LOCAL
livy.server.launch.kerberos.keytab = /var/run/cloudera-scm-agent/process/308-yarn-RESOURCEMANAGER/yarn.keytab
livy.server.port = 8998
livy.environment= production
livy.server.session.timeout = 1h
livy.impersonation.enabled = false
livy.server.recovery.mode = off
livy.server.access-control.view-users = *
livy.server.access-control.modify-users = *
livy.server.access-control.allowed-users = *
livy.server.access-control.enabled = true
livy.server.csrf-protection.enabled = false
livy.impersonation.enabled = false
# livy-server start
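
After the restart, a Kerberized GET against the REST API confirms the server is up and SPNEGO authentication works (this assumes a valid ticket from kinit):

# curl --negotiate -u : http://kcdpmasterl1.good-code.com:8998/sessions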

Submit Spark Jobs over Livy

# kinit -kt /var/run/cloudera-scm-agent/process/376-hive-HIVEMETASTORE/hive.keytab hive/kcdpmasterl1.good-code.com@HADOOPCLUSTER.LOCAL
# curl --negotiate -u : -X POST -d '{"file":"hdfs://kcdpmasterl1.good-code.com/code/pi.py"}' -H "Content-Type: application/json" "http://kcdpmasterl1.good-code.com:8998/batches"
# curl --negotiate -u : -X POST -d '{"file":"hdfs://kcdpmasterl1.good-code.com/code/cdp.hdfs.testing.py"}' -H "Content-Type: application/json" "http://kcdpmasterl1.good-code.com:8998/batches"
# curl --negotiate -u : -X POST -d '{"file":"hdfs://kcdpmasterl1.good-code.com/code/cdp.hive.testing.py"}' -H "Content-Type: application/json" "http://kcdpmasterl1.good-code.com:8998/batches"
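
Each POST returns a batch id; the state of a batch can then be polled (batch id 0 shown as an example):

# curl --negotiate -u : http://kcdpmasterl1.good-code.com:8998/batches/0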

Conclusion

This is indeed a lengthy article: it describes the end-to-end configuration and validation of running Spark jobs on a Cloudera Data Platform, covering the following:

  • Preparing VMs for installing Cloudera Data Platform
  • Installation of Python packages, the JDK, and MySQL, and setting up Kerberos.
  • Installation of Cloudera Manager and enabling various Cloudera Parcels.
  • Kerberizing Cloudera
  • Validation of the setup by running Spark Jobs in Cloudera Data Platform.
