Big Data, mind the gap? Part II: Data processing on Hadoop

Last week, Olivier introduced the Apache Hadoop ecosystem as well as a concrete IoT data ingestion flow with NiFi into a Swisscom Hadoop cluster (Part I). The sensor data is now flowing into HDFS, persisted in raw JSON format and merged following a simple folder logic.

Analyzing raw data is usually not very convenient: it often needs to be transformed, modified and joined, and raw files are not ideal from a performance point of view either. We therefore need to process this raw data, and this is where the Apache frameworks Hive and HBase come into play.

The path: from raw data to valuable information

Our first job will be to create a Hive table for each persisted data folder. But wait a minute, what is Hive? Apache Hive is a data warehousing framework running on top of Hadoop for querying and analyzing data. It provides a SQL-like interface (HiveQL) to query your data and translates HiveQL statements into MapReduce jobs.

We will create external tables: the data itself is not stored by Hive, only the table definition (schema and location) is registered in the Hive metastore.

A simple Hive statement creates a table on top of the raw JSON data from March 10th, 2017. A specific SerDe (serializer/deserializer) clause enables us to map the JSON strings onto a table-like structure:

CREATE EXTERNAL TABLE IoTData (
  `Timestamp` timestamp,
  Machine_ID string,
  a002_ruecklauftemp_we int,
  columnX int,
  columnY string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/Ingestion/10-03-17/';

We are now able to easily query the created Hive table in HiveQL:

A simple SELECT statement on a Hive table

A simple COUNT statement with a WHERE clause in HiveQL
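For illustration, the queries behind these screenshots look roughly like the following sketch (the column name comes from the table definition above; the filter threshold is purely an example value):

-- Simple SELECT on the external table
SELECT * FROM IoTData LIMIT 10;

-- COUNT with a WHERE clause (the threshold is an arbitrary example value)
SELECT COUNT(*) FROM IoTData WHERE a002_ruecklauftemp_we > 40;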

The problem is that these queries are slow, as they are effectively executed as non-optimized MapReduce jobs. The Hadoop ecosystem is full of optimization tricks and we’ll illustrate two of them here.

The first step is to switch Hive’s execution engine from MapReduce (MR) to TEZ.

Switching execution engine to TEZ

This step can also simply be done by setting the following property before querying data in the HiveQL editor:

set hive.execution.engine=tez;

The TEZ engine optimizes the execution plan, reduces the number of jobs and increases the speed dramatically:

Hive execution engine overview

With TEZ as the execution engine, our Hive queries are now much faster and it’s even possible to track and analyze the details of the execution in the TEZ view.

TEZ job overview

Graphical analysis of an executed job

The second step is to store our data in ORC files. The ORC (Optimized Row Columnar) format is well suited to this case and a highly efficient way to store Hive data in Hadoop.

But why? Mainly because of these two features (a short HiveQL sketch follows this list):

  • High level of compression
  • Faster reads and writes thanks to the built-in index, min/max values and other aggregates
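As a sketch (the table name IoTData_orc is hypothetical; the column list follows the example above), converting the raw JSON table into an ORC-backed table is again just a couple of HiveQL statements:

-- Create an ORC-backed table with the same schema (IoTData_orc is a hypothetical name)
CREATE TABLE IoTData_orc (
  `Timestamp` timestamp,
  Machine_ID string,
  a002_ruecklauftemp_we int,
  columnX int,
  columnY string)
STORED AS ORC;

-- Copy the day's raw JSON data into the ORC table
INSERT OVERWRITE TABLE IoTData_orc SELECT * FROM IoTData;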

We now have optimized Hive tables and are able to have a look at another Hadoop framework: HBase.

HBase is a perfect fit if you need random, real-time read and write access to your data, and it provides consistent read/write times. It can serve as an application backend or as a messaging platform, which is how Facebook uses it. Another good fit is heavy analytic processing workloads where large datasets have to be read.

The transformation: from Hive to HBase

Our next step is to load the data into HBase and try to get near real-time responses for a web UI. Fortunately, the data ingestion from Hive to HBase is just a query away, but first we need to create an HBase table and register it in the Hive metastore. This gives us the possibility to manage the HBase table with Hive.

Creating an HBase table with Hive and copying all the data from the daily Hive tables into HBase is pretty straightforward:

CREATE TABLE hbase_pumpdata (
  `Timestamp` timestamp,
  Machine_ID string,
  a002_ruecklauftemp_we int,
  columnX int,
  columnY string,
  …)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' =
  ':key,f:Machine_ID,f:a002_ruecklauftemp_we,f:columnX,f:columnY,…')
TBLPROPERTIES ('hbase.table.name' = 'hbase_pumpdata');

The :key entry maps the first Hive column (here the timestamp) to the HBase row key, while the remaining columns are stored in the column family f.

INSERT OVERWRITE TABLE hbase_pumpdata SELECT * FROM pumpdata_DDMMYYYY;

Everything is now stored in the HBase table and, thanks to the Hive mapping, can still be queried with HiveQL (e.g. a SELECT COUNT(*)) or directly through the HBase shell:

Example of a simple SELECT statement on the data

Example of a SELECT statement with WHERE clause
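In HiveQL, such queries against the HBase-backed table look roughly like this (the filter threshold is again just an illustrative value):

-- Count all rows stored in HBase via the Hive mapping
SELECT COUNT(*) FROM hbase_pumpdata;

-- Filter on a specific sensor field (example threshold)
SELECT * FROM hbase_pumpdata WHERE a002_ruecklauftemp_we > 40;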

Querying all our IoT data and filtering on specific fields took us 8.3 seconds. With indexing, further optimization and possibly more nodes in the cluster, we could significantly improve the query time.

To conclude, we have seen that we can achieve interesting IoT data processing with Hive and HBase (and transform the data in the way we need). But Hive and HBase can do a lot more: Hive offers an open-source data warehouse alternative that easily scales up to petabytes, and HBase can serve as the backend of a messaging service.

Olivier will give some insights on Apache Spark and machine learning next week, so stay tuned!

In case you are interested in investigating Big Data and Advanced Analytics, our Swisscom Analytics Lab can support you. Don’t hesitate to contact us.

Olivier Gwynn, Hadoop – Data Science – Predictive Analytics

Thomas Jeschka, Hadoop – Data Architect – Data Warehouse

Tim Giger, Hadoop – Data Warehouse – Cloud & DevOps