Apache Hive Configuration with MySQL metastore

Ravi Chamarthy
8 min read · Sep 17, 2020

This is the fifth part of the Apache Hadoop ecosystem setup described in Apache Hadoop Multi-Node Kerberized Cluster Setup. In the previous stories we went through the following chapters:

Chapter 1. Users Creation and initial setup
Chapter 2. Kerberos Installation and configuration
Chapter 3. Unpacking Hadoop Distributions
Chapter 4. Configuring HDFS and YARN
Chapter 5. Configure Spark and Run Spark Applications
Chapter 6. Configuring Edge Node and Run Spark Applications

In this story we shall perform Hive configuration.

Chapter 7. Hive Configuration

Select one of the data nodes to install Hive on. We could host Hive and MySQL on the name node itself, but the name node becomes far too heavy with so many components on one system (NN, RM, DN, NM, Hive, MySQL, Spark). Instead, we can place Hive and MySQL on one of the data nodes, or on a separate node altogether. Here we use the verona1.wsdm.ami.com system, a node dedicated to Hive and MySQL.

1. Download Hive

[hadoop@verona1 ~]$ hostname
verona1.wsdm.ami.com
[hadoop@verona1 ~]$ whoami
hadoop
[hadoop@verona1 ~]$ pwd
/home/hadoop
[hadoop@verona1 ~]$ wget https://mirrors.estointernet.in/apache/hive/hive-2.3.7/apache-hive-2.3.7-bin.tar.gz
[hadoop@verona1 ~]$ tar xvzf apache-hive-2.3.7-bin.tar.gz

[hadoop@verona1 ~]$ mv apache-hive-2.3.7-bin hive

Remove log4j-slf4j-impl-2.6.2.jar. This library needs to be removed because Hadoop already ships its own SLF4J binding, and a duplicate binding on the classpath causes an unnecessary warning when running Hive.

[hadoop@verona1 ~]$ cd hive/lib/
[hadoop@verona1 lib]$ rm -fr log4j-slf4j-impl-2.6.2.jar
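The same cleanup can be scripted. A self-contained sketch of the pattern, using a throwaway directory as a stand-in for $HIVE_HOME/lib (the jar names below mirror the ones in this install):

```shell
# Throwaway directory standing in for $HIVE_HOME/lib.
LIBDIR=$(mktemp -d)
touch "$LIBDIR/log4j-slf4j-impl-2.6.2.jar" "$LIBDIR/hive-exec-2.3.7.jar"

# List any SLF4J binding jars, then delete them so only one binding
# remains on the runtime classpath.
find "$LIBDIR" -name 'log4j-slf4j-impl-*.jar' -print -delete
ls "$LIBDIR"
```

The `find -name ... -delete` form also catches a different bundled version of the binding, which a hard-coded `rm` would miss.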

2. Hive directories in HDFS

Create the following directories in HDFS, and change their permissions:

[hadoop@verona1 lib]$ hdfs dfs -mkdir -p /user/hive/warehouse
[hadoop@verona1 lib]$ hdfs dfs -mkdir -p /tmp/hive
[hadoop@verona1 lib]$ hdfs dfs -chmod 777 /tmp/
[hadoop@verona1 lib]$ hdfs dfs -chmod 777 /user/hive/warehouse
[hadoop@verona1 lib]$ hdfs dfs -chmod 777 /tmp/hive
[hadoop@verona1 conf]$ hdfs dfs -ls /
Found 4 items
drwxr-xr-x   - HTTP supergroup          0 2020-09-05 23:07 /examples
drwxr-xr-x   - HTTP supergroup          0 2020-09-05 20:42 /testing_data
drwxrwxrwx   - HTTP supergroup          0 2020-09-06 01:11 /tmp
drwxr-xr-x   - yarn supergroup          0 2020-09-06 01:11 /user
[hadoop@verona1 conf]$
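Mode 777 grants read, write, and execute to owner, group, and others, which is why any Hive user can write to the scratch directories above. A local illustration of the same octal semantics, using a throwaway directory:

```shell
# hdfs dfs -chmod uses the same octal modes as local chmod.
D=$(mktemp -d)
chmod 777 "$D"
stat -c '%a' "$D"    # prints 777 (GNU stat)
```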

3. Install MySQL

Download and add the following MySQL Yum repository to the Red Hat repository list, to install the latest version of MySQL.

[root@verona1 tmp]# whoami
root
[root@verona1 tmp]# cd /opt
[root@verona1 opt]# mkdir mysql
[root@verona1 opt]# cd mysql/
[root@verona1 mysql]# wget https://repo.mysql.com/mysql80-community-release-el8-1.noarch.rpm

After downloading, install the package with the following command:

[root@verona1 mysql]# yum localinstall mysql80-community-release-el8-1.noarch.rpm

You can verify that the MySQL Yum repository has been added successfully by using the following command.

[root@verona1 mysql]# yum repolist enabled | grep "mysql.*-community.*"
mysql-connectors-community MySQL Connectors Community
mysql-tools-community MySQL Tools Community
mysql80-community MySQL 8.0 Community Server

Install the latest version of MySQL using the following commands.

Disabling the built-in mysql module is needed so that yum installs from the MySQL community repository we just added, rather than from the default Red Hat module:

[root@verona1 mysql]# yum module disable mysql

Then install the server:

[root@verona1 mysql]# yum -y install mysql-community-server

Starting the MySQL Server

[root@verona1 mysql]# service mysqld start
Redirecting to /bin/systemctl start mysqld.service
[root@verona1 mysql]# service mysqld status
Redirecting to /bin/systemctl status mysqld.service

Sep 06 01:15:28 verona1.wsdm.ami.com systemd[1]: Starting MySQL Server…
Sep 06 01:16:04 verona1.wsdm.ami.com systemd[1]: Started MySQL Server.
[root@verona1 mysql]#

Confirmation

[root@verona1 ~]# mysql --version
mysql Ver 8.0.21 for Linux on x86_64 (MySQL Community Server - GPL)

4. Securing the MySQL Installation

The mysql_secure_installation command lets you secure your MySQL installation through several important settings: setting the root password, removing anonymous users, disallowing remote root login, and so on.

Note: MySQL version 8.0 or higher generates a temporary random password in /var/log/mysqld.log after installation.

Use the below command to see the password before running MySQL secure command.

[root@verona1 ~]# grep 'temporary password' /var/log/mysqld.log
2020-09-06T08:15:51.718523Z 6 [Note] [MY-010454] [Server] A temporary password is generated for root@localhost: U*hx35%gS0s<
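If the password is needed in a script rather than read off the screen, it can be extracted as the last field of that log line. A self-contained sketch; the echoed sample line stands in for /var/log/mysqld.log:

```shell
# Stand-in for /var/log/mysqld.log with the generated-password line.
LOG=$(mktemp)
echo 'A temporary password is generated for root@localhost: U*hx35%gS0s<' > "$LOG"

# The password is the last whitespace-separated field of the line.
TEMP_PW=$(grep 'temporary password' "$LOG" | awk '{print $NF}')
echo "$TEMP_PW"    # prints U*hx35%gS0s<
```

Note this works because MySQL's generated passwords contain no spaces; quoting `$TEMP_PW` matters since characters like `*` and `<` are shell-special.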

Once you know the password, run the following command to secure your MySQL installation.

[root@verona1 mysql]# mysql_secure_installation
Securing the MySQL server deployment.
Enter password for user root: U*hx35%gS0s<
The 'validate_password' component is installed on the server. The subsequent steps will run with the existing configuration of the component.
Using existing password for root.
Estimated strength of the password: 100
Change the password for root ? (Press y|Y for Yes, any other key for No) : y
New password: Xxxxx
Re-enter new password: Xxxxx
Estimated strength of the password: 100
Do you wish to continue with the password provided? (Press y|Y for Yes, any other key for No) : y
By default, a MySQL installation has an anonymous user, allowing anyone to log into MySQL without having to have a user account created for them. This is intended only for testing, and to make the installation go a bit smoother. You should remove them before moving into a production environment.
Remove anonymous users? (Press y|Y for Yes, any other key for No) : y
Success.
Normally, root should only be allowed to connect from 'localhost'. This ensures that someone cannot guess at the root password from the network.
Disallow root login remotely? (Press y|Y for Yes, any other key for No) :
... skipping.
By default, MySQL comes with a database named 'test' that anyone can access. This is also intended only for testing, and should be removed before moving into a production environment.
Remove test database and access to it? (Press y|Y for Yes, any other key for No) : y
- Dropping test database...
Success.
- Removing privileges on test database...
Success.
Reloading the privilege tables will ensure that all changes made so far will take effect immediately.
Reload privilege tables now? (Press y|Y for Yes, any other key for No) : y
Success.
All done!

5. Connecting to MySQL Server and initialising the metastore

[root@verona1 mysql]# mysql -u root -p
Enter password: Xxxxx

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
4 rows in set (0.00 sec)
mysql> CREATE DATABASE metastore;
Query OK, 1 row affected (0.02 sec)
mysql> use metastore;
Database changed
mysql> create user 'hive'@'%' identified by 'Xxxxx';
Query OK, 0 rows affected (0.02 sec)
mysql> grant all PRIVILEGES on *.* to 'hive'@'%' with grant option;
Query OK, 0 rows affected (0.02 sec)
mysql> create user 'hdfs'@'%' identified by 'Xxxxx';
Query OK, 0 rows affected (0.04 sec)
mysql> grant all PRIVILEGES on *.* to 'hdfs'@'%' with grant option;
Query OK, 0 rows affected (0.03 sec)
mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.02 sec)
mysql> exit
Bye

6. Refresh the Kerberos Ticket for Hive

Before making the Hive configuration, make sure to refresh the Kerberos ticket for "hive/verona1.wsdm.ami.com@HADOOPCLUSTER.LOCAL":

[hadoop@verona1 ~]$ kinit -kt /home/hadoop/keytabs/hdfs.keytab hive/verona1.wsdm.ami.com@HADOOPCLUSTER.LOCAL

7. Configure the Metastore Service to Communicate with the MySQL Database

Create the hive-site.xml file in the $HIVE_HOME/conf directory (by renaming the bundled template) and change the properties listed below.

[root@verona1 mysql]# su - hadoop
[hadoop@verona1 ~]$ pwd
/home/hadoop
[hadoop@verona1 ~]$ cd hive/conf
[hadoop@verona1 conf]$ mv hive-default.xml.template hive-site.xml
[hadoop@verona1 conf]$ vi hive-site.xml

Make the following property changes, in the form of property/name/value XML elements.

hive.exec.local.scratchdir = /tmp/hadoop
hive.downloaded.resources.dir = /tmp/${hive.session.id}_resources
hive.metastore.uris = thrift://verona1.wsdm.ami.com:9083
javax.jdo.option.ConnectionPassword = Xxxxx
javax.jdo.option.ConnectionURL = jdbc:mysql://verona1.wsdm.ami.com/metastore?createDatabaseIfNotExist=true
datanucleus.autoCreateSchema = false
datanucleus.fixedDatastore = true
datanucleus.autoCreateTables = true
datanucleus.autoStartMechanism = SchemaTable
javax.jdo.option.ConnectionUserName = hive
hive.metastore.kerberos.principal = hive/_HOST@HADOOPCLUSTER.LOCAL
hive.server2.webui.spnego.principal = HTTP/_HOST@HADOOPCLUSTER.LOCAL
hive.server2.authentication.kerberos.principal = hive/_HOST@HADOOPCLUSTER.LOCAL
hive.server2.authentication.spnego.principal = HTTP/_HOST@HADOOPCLUSTER.LOCAL
hive.llap.task.principal = hive/_HOST@HADOOPCLUSTER.LOCAL
hive.server2.thrift.bind.host = verona1.wsdm.ami.com
hive.server2.authentication = KERBEROS
hive.metastore.kerberos.keytab.file = /home/hadoop/hadoop/etc/hadoop/hdfs.keytab
hive.server2.webui.spnego.keytab = /home/hadoop/hadoop/etc/hadoop/hdfs.keytab
hive.server2.authentication.kerberos.keytab = /home/hadoop/hadoop/etc/hadoop/hdfs.keytab
hive.server2.authentication.spnego.keytab = /home/hadoop/hadoop/etc/hadoop/hdfs.keytab
hive.llap.task.keytab.file = /home/hadoop/hadoop/etc/hadoop/hdfs.keytab
hive.metastore.sasl.enabled = true
hive.security.authorization.enabled = true
hive.security.authenticator.manager = org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator
hive.prewarm.enabled = true
hive.server2.transport.mode = binary
hive.server2.enable.doAs = false
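Each entry above goes into hive-site.xml as a property element. For example, the first few settings take this shape (names and values copied from the list above):

```xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://verona1.wsdm.ami.com:9083</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://verona1.wsdm.ami.com/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/_HOST@HADOOPCLUSTER.LOCAL</value>
</property>
```

Since we renamed hive-default.xml.template, most of these property elements already exist in the file; edit their value elements in place rather than adding duplicates.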

8. Set HADOOP_HOME

[hadoop@verona1 conf]$ mv hive-env.sh.template hive-env.sh
[hadoop@verona1 conf]$ vi hive-env.sh

# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/home/hadoop/hadoop

9. Add HIVE_HOME to .bashrc

[hadoop@verona1 conf]$ vi ~/.bashrc

export HIVE_HOME=/home/hadoop/hive
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$SPARK_HOME/bin

[hadoop@verona1 conf]$ source ~/.bashrc

10. Load the mysql-connector jar to $HIVE_HOME/lib

[root@verona1 mysql]# whoami
root
[root@verona1 mysql]# wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.21.tar.gz
[root@verona1 mysql]# tar xvzf mysql-connector-java-8.0.21.tar.gz
[root@verona1 mysql]# cd mysql-connector-java-8.0.21/
[root@verona1 mysql-connector-java-8.0.21]# chown hadoop:hadoop mysql-connector-java-8.0.21.jar
[root@verona1 mysql-connector-java-8.0.21]# cp mysql-connector-java-8.0.21.jar /home/hadoop/hive/lib/

11. Restart MySQL service

[root@verona1 mysql]# service mysqld stop
Redirecting to /bin/systemctl stop mysqld.service
[root@verona1 mysql]# service mysqld start
Redirecting to /bin/systemctl start mysqld.service
[root@verona1 mysql]# service mysqld status
Redirecting to /bin/systemctl status mysqld.service

Sep 06 11:33:43 verona1.wsdm.ami.com systemd[1]: Starting MySQL Server…
Sep 06 11:33:48 verona1.wsdm.ami.com systemd[1]: Started MySQL Server.
[root@verona1 mysql]#

12. Initialize the metastore

We need to change the system timezone to UTC. By default, the system I provisioned (in Fyre) uses the PDT timezone.

If we do not change the timezone, and run the tool, this is the error that we get:

[hadoop@verona1 ~]$ schematool -initSchema -dbType mysql
Metastore connection URL: jdbc:mysql://verona1.wsdm.ami.com/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.cj.jdbc.Driver
Metastore connection User: hive
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
Underlying cause: java.sql.SQLException : The server time zone value 'PDT' is unrecognized or represents more than one time zone. You must configure either the server or JDBC driver (via the 'serverTimezone' configuration property) to use a more specific time zone value if you want to utilize time zone support.
SQL Error code: 0
Use --verbose for detailed stacktrace.
*** schemaTool failed ***

Run as root

[root@verona1 mysql]# timedatectl
Local time: Mon 2020-08-10 07:26:54 PDT
[root@verona1 mysql]# timedatectl set-timezone UTC
[root@verona1 mysql]# timedatectl
Local time: Mon 2020-08-10 14:28:12 UTC
Universal time: Mon 2020-08-10 14:28:12 UTC
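As the error message suggests, an alternative to changing the system timezone is to tell Connector/J the timezone explicitly via the serverTimezone property on the JDBC URL. A sketch of what the connection URL in hive-site.xml would then look like (inside an XML value, the & must be escaped as &amp;amp;):

```
jdbc:mysql://verona1.wsdm.ami.com/metastore?createDatabaseIfNotExist=true&serverTimezone=UTC
```

Either approach should satisfy the driver; this story takes the system-timezone route below.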

Restart the mysqld service

[root@verona1 mysql]# service mysqld stop
[root@verona1 mysql]# service mysqld start
[root@verona1 mysql]# service mysqld status

Sep 06 11:33:43 verona1.wsdm.ami.com systemd[1]: Starting MySQL Server…
Sep 06 11:33:48 verona1.wsdm.ami.com systemd[1]: Started MySQL Server.

13. Initialize the metastore schema

[root@verona1 mysql]# su - hadoop
[hadoop@verona1 ~]$ schematool -initSchema -dbType mysql
Metastore connection URL: jdbc:mysql://verona1.wsdm.ami.com/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.cj.jdbc.Driver
Metastore connection User: hive
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed
[hadoop@verona1 ~]$

14. Start the hive metastore server

[hadoop@verona1 ~]$ hive --service metastore
2020–08–10 14:37:56: Starting Hive Metastore Server

15. Start the Hive Server 2

[hadoop@verona1 ~]$ hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=INFO,console

2020–09–06 14:07:49: Starting HiveServer2

16. Log in to the Hive prompt; queries are executed as Spark jobs

[hadoop@verona1 conf]$ hive
hive> show databases;
OK
default
Time taken: 1.523 seconds, Fetched: 2 row(s)
hive> create database payload_data;
OK
Time taken: 1.897 seconds
hive> use payload_data;
OK
Time taken: 0.105 seconds
hive> CREATE TABLE airports (Airport_ID int, Name string, City string, Country string, IATA_FAA string, ICAO string, Latitude float, Longitude float, Altitude int, Timezone float, DST string, Tz_db_time_zone string) COMMENT "The table [airports]" ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/payload_data';
OK
Time taken: 0.534 seconds
hive> show tables;
OK
airports
Time taken: 0.09 seconds, Fetched: 1 row(s)
hive> select count(*) from airports;

Starting Spark Job = 2c7f2f60-11c5-4896-804d-e198cac6fa71
Running with YARN Application = application_1599364548601_0010

Status: Finished successfully in 8.16 seconds
OK
8108
Time taken: 41.427 seconds, Fetched: 1 row(s)

17. Log in to the Beeline client using the JDBC URL, and execute some queries

[hadoop@verona1 conf]$ beeline -u "jdbc:hive2://verona1.wsdm.ami.com:10000/default;principal=hive/verona1.wsdm.ami.com@HADOOPCLUSTER.LOCAL"
Connecting to jdbc:hive2://verona1.wsdm.ami.com:10000/default;principal=hive/verona1.wsdm.ami.com@HADOOPCLUSTER.LOCAL
Connected to: Apache Hive (version 2.3.7)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.7 by Apache Hive
0: jdbc:hive2://verona1.wsdm.ami.com:10000/de> show databases;
+----------------+
| database_name  |
+----------------+
| default        |
| payload_data   |
| testing_data   |
+----------------+
3 rows selected (2.001 seconds)
0: jdbc:hive2://verona1.wsdm.ami.com:10000/de> use testing_data;
No rows affected (0.12 seconds)
0: jdbc:hive2://verona1.wsdm.ami.com:10000/de> select distinct airports.country from testing_data.airports;

...
| "Gabon"      |
| "Zambia"     |
| "Kyrgyzstan" |
| "Palau"      |
+--------------+
245 rows selected (44.58 seconds)

To summarize, in this story we configured Kerberized Hive with MySQL as the metastore, and executed sample queries. In the next chapter we shall proceed with running Spark applications from the edge node.

Ravi Chamarthy

Software Architect, watsonx.governance - Monitoring & IBM Master Inventor