Working with the Hive Data Source

Foresight Analytics Platform can use the Hive infrastructure as a data source. Hive is a data management system that uses an SQL-like query language (HiveQL) and runs on top of the Hadoop data storage and processing framework.

To start working with the Hive data source, execute the following steps.

Step 1. Application Installation

To create a connection to the Hive data source, an appropriate ODBC driver is required.

The Hortonworks Hive ODBC Driver is recommended. Download the distribution file and install the driver by executing the command:

sudo dpkg -i ./clouderahiveodbc*.deb

IMPORTANT. The bitness of the installed driver must match the bitness of Foresight Analytics Platform.

Step 2. Setting Up ODBC Driver

Open the /opt/cloudera/hiveodbc/lib/64/cloudera.hiveodbc.ini file for editing and find the parameters:

An example of the cloudera.hiveodbc.ini file.
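As a rough guide, the [Driver] section of cloudera.hiveodbc.ini typically contains the following parameters. This is a sketch only; exact keys and defaults vary between driver versions, so verify them against your installed file:

```ini
; Typical [Driver] section (verify against your installation)
[Driver]
; Use UTF-16 with the unixODBC driver manager (which provides isql)
DriverManagerEncoding=UTF-16
ErrorMessagesPath=/opt/cloudera/hiveodbc/ErrorMessages/
ODBCInstLib=libodbcinst.so
; Set LogLevel to a value from 1 to 6 to enable driver logging
LogLevel=0
LogPath=
SwapFilePath=/tmp
```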

Step 3. Setting Up Connection

For details about connection setup see the guide.

There are two main methods for connection setup:

Full-String Connection Setup

  1. Prepare the /etc/odbcinst.ini file. An example of the file contents:

[ODBC Drivers]
Cloudera ODBC Driver for Apache Hive 32-bit=Installed
Cloudera ODBC Driver for Apache Hive 64-bit=Installed
UsageCount=1

[Cloudera ODBC Driver for Apache Hive 32-bit]
Description=Cloudera ODBC Driver for Apache Hive (32-bit)
Driver=/opt/cloudera/hiveodbc/lib/32/libclouderahiveodbc32.so
UsageCount=1

[Cloudera ODBC Driver for Apache Hive 64-bit]
Description=Cloudera ODBC Driver for Apache Hive (64-bit)
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
UsageCount=1

  2. Check the connection using the command:

isql -v -k "Driver={Cloudera ODBC Driver for Apache Hive 64-bit};Host=10.0.2.118;Port=10000;AuthMech=3;UID=root;PWD=hadoop"
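The same full-string check can be run from Python. This is a sketch only: the pyodbc package is not part of the platform and must be installed separately, and the host and credentials are the example values from this guide:

```python
def hive_conn_str(host, port=10000, uid="root", pwd="hadoop",
                  driver="Cloudera ODBC Driver for Apache Hive 64-bit"):
    """Assemble the same full connection string that isql -k receives."""
    parts = {
        "Driver": "{" + driver + "}",
        "Host": host,
        "Port": port,
        "AuthMech": 3,  # 3 = user name and password authentication
        "UID": uid,
        "PWD": pwd,
    }
    return ";".join("%s=%s" % (k, v) for k, v in parts.items())

def check_connection(host):
    """Open the connection and run a trivial query to verify it works."""
    import pyodbc  # assumed installed separately; not bundled with the platform
    with pyodbc.connect(hive_conn_str(host), autocommit=True) as conn:
        return conn.cursor().execute("SHOW DATABASES").fetchall()
```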

DSN-Based Connection Setup

  1. Change the /etc/odbc.ini file according to the guide. An example of the file contents:

[ODBC Data Sources]
HiveDSN=Cloudera

[HiveDSN]
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
HiveServerType=2
ServiceDiscoveryMode=0
Host=10.0.2.118
Port=10000
UID=root
PWD=root

  2. Check the connection using the command:

isql -v "HiveDSN"
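The DSN-based check can likewise be run from Python. Again a sketch only, assuming the pyodbc package is installed separately and /etc/odbc.ini already defines the HiveDSN data source:

```python
def dsn_conn_str(dsn, uid=None, pwd=None):
    """Build a DSN-based connection string; UID/PWD override odbc.ini values."""
    parts = ["DSN=" + dsn]
    if uid is not None:
        parts.append("UID=" + uid)
    if pwd is not None:
        parts.append("PWD=" + pwd)
    return ";".join(parts)

def check_connection(dsn="HiveDSN"):
    """Open the DSN and run a trivial HiveQL query to verify connectivity."""
    import pyodbc  # imported here so dsn_conn_str works without pyodbc installed
    with pyodbc.connect(dsn_conn_str(dsn), autocommit=True) as conn:
        return conn.cursor().execute("SHOW DATABASES").fetchall()
```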

Step 4. Setting Up ODBC Data Source

Before working with Hive it is necessary to set up a data source:

  1. Open the Control Panel on the client computer and go to the Administration > ODBC Data Sources section.

  2. In the ODBC Data Source Administrator dialog box, go to the User DSN tab and click the Add button.

  3. In the Create New Data Source dialog box select the Hortonworks Hive ODBC Driver:

  4. The Hortonworks Hive ODBC Driver DSN Setup dialog box opens:

Specify the following settings:

Step 5. Setting Up Database in Foresight Analytics Platform

To set up a database in Foresight Analytics Platform:

  1. Create a database in the repository.

  2. Specify the OLE DB (ODBC HIVE) DBMS client type in the database properties.

  3. Specify the name of the created data source (Data Source Name) as the server name.

  4. Select automatic connection.

  5. Fill in the User Name and Password fields: specify the credentials of an account created in your Hadoop/Hive configuration.

After the required parameters are specified, the database creation wizard window looks as follows:

After all settings are configured, the new connection can be used to connect Hive tables to the platform repository and to create queries in the HiveQL language.

Step 6. Setting Up Platform Work with Hive Data Source via JDBC Driver

To connect to the Hive data source correctly, set up the JDBC driver:

  1. Make sure that the BI server is set up and integration with the Python programming language is configured.

  2. Install pip3 with default parameters by executing the command:

sudo apt install python3-pip

  3. Install the jaydebeapi library:

pip3 install jaydebeapi

  4. Set up integration with the Java programming language.

  5. Create the /opt/jdbc directory:

sudo mkdir /opt/jdbc

  6. Copy the HiveJDBC.jar JDBC driver file to the /opt/jdbc folder.

  7. Make the www-data user the owner of the /opt/jdbc directory:

sudo chown -R www-data:www-data /opt/jdbc

  8. Open the update file in the update manager.
    The example of the update file: Update_pefx.zip.

The file contains the objects:

  9. Run the Install Callback for Cube module included in the update file. Set the following parameters in the module:

Name     Description
CUBE_ID  Identifier of the cube to be populated with data.
MOD_ID   Identifier of the module executed when working with the cube.
  10. Set the following parameters in the config.ini file included in the update file:

Name           Description
TABLE_DATA_ID  Identifier of the table from which data should be obtained.
PY_MOD_ID      Python module identifier.
LINUX_PATH     Path to the JDBC driver.

JDBC connection parameters (Hive):

JDBC_CLASS     Driver class.
JAR_NAME       Name of the JDBC driver file in the operating system.
JDBC_URL       Service URL.
DB_USER        User name.
DB_PSWD        User password.

After all these steps are completed, data will be loaded into the cube from the Hive data source each time the cube is opened.
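As an illustration of what a Python module driven by these parameters might do, the sketch below opens a JDBC connection with jaydebeapi and reads a table. It is not the platform's actual module: the table name and credentials are placeholders, and the driver class and URL format are the standard ones for HiveServer2 (org.apache.hive.jdbc.HiveDriver, jdbc:hive2://...), assumed to match the JDBC_CLASS and JDBC_URL settings above:

```python
def hive_jdbc_url(host, port=10000, database="default"):
    """Build a HiveServer2 JDBC URL (the JDBC_URL parameter)."""
    return "jdbc:hive2://%s:%d/%s" % (host, port, database)

def fetch_table(table, host, user, password,
                jdbc_class="org.apache.hive.jdbc.HiveDriver",  # JDBC_CLASS
                jar_path="/opt/jdbc/HiveJDBC.jar"):            # LINUX_PATH + JAR_NAME
    """Read all rows of `table` from Hive over JDBC."""
    import jaydebeapi  # requires Java integration (see step 4 above)
    conn = jaydebeapi.connect(jdbc_class, hive_jdbc_url(host),
                              [user, password], jar_path)
    try:
        cur = conn.cursor()
        cur.execute("SELECT * FROM " + table)
        return cur.fetchall()
    finally:
        conn.close()
```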

See also:

Connecting to Relational Databases