HIVE & HBASE

Hello,

This blog is about how to use HBase from Apache Hive. Not just how to do it, but what works, how well it works, and how to make good use of it.

What is Hive?

Apache Hive is data warehouse software that facilitates querying and managing of large datasets residing in distributed storage. Hive provides SQL-like language called HiveQL for querying the data. Hive is considered friendlier and more familiar to users who are used to using SQL for querying data.

Hive is best suited for Data Warehousing Applications where data is stored, mined and reporting is done based on processing. Hive bridges the gap between data warehouse applications and hadoop as relational database models are the base of most data warehousing applications.

It resides on top of Hadoop to summarize Big Data, and makes querying and analysing easy.

What is HBase?

Apache HBase is an open-source, distributed, versioned, column-oriented store modelled after Google’s big table: A Distributed Storage System for Structured Data. Just as Big table leverages the distributed data storage provided by the Google FileSystem, Apache HBase provides Big table-like capabilities on top of Hadoop and HDFS. HBase can be used when we need random, real-time read/write access to our Big Data. It is a scale-out table store which can support a very high rate of row-level updates over a large amount of data. It solves Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging data intelligently based on data distribution change.

Why do we need to integrate Hive with HBase?

Hive can store information of hundreds of millions of users effortlessly, but faces some difficulties when it comes to keeping the data warehouse up to date with the latest information. Hive uses HDFS as an underlying storage which come with limitations like append-only, block-oriented storage. This makes it impossible to directly apply individual updates to warehouse tables. Up till now the only practical option to overcome this limitation is to pull the snapshots from MySQL databases and dump them to new Hive partitions. This expensive operation of pulling the data from one location to another location is not frequently practiced. (Leading to stale data in the warehouse), and it also does not scale well as the data volume continues to shoot through the roof.

To overcome this problem, Apache HBase is used in place of MySQL, with Hive.

HBase is based on Hadoop, Integrating it with Hive is pretty straightforward as HBase tables can be accessed like native Hive tables. As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Likewise, Hive’s INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within the HBase itself.

How is HBase integrated with Hive?

The hive project includes an optional library for interacting with HBase. This is where the bridge layer between the two systems is implemented. The primary interface you use when accessing HBase from Hive queries is called the HBaseStorageHandler. You can also interact with HBase tables directly via Input and Output formats, but the handler is simpler and works for most uses.

Storage Handlers are a combination of InputFormat, OutputFormat, SerDe, and specific code that Hive uses to identify an external entity as a Hive table. This allows the user to issue SQL queries seamlessly, whether the table represents a text file stored in Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra, and Amazon DynamoDB. Storage Handlers are not only limited to NoSQL databases, a storage handler could be designed for several different kinds of data stores.

Use the HBaseStorageHandler to register HBase tables with the Hive metastore. We can optionally specify the HBase table as EXTERNAL, in which case Hive will not create to drop that table

Directly we’ll have to use the Hbase shell to do so.

CREATE [EXTERNAL] TABLE foo (…)

STORED BY ‘org.apache.hadoop.hive.hbase.HBaseStorageHandler’

TBLPROPERTIES (‘hbase.table.name’ = ‘bar’);

The above statement registers the HBase table named bar in the Hive metastore, accessible from

Hive by the name foo. Under the hood, HBaseStorageHandler is delegating interaction with the HBase table to HiveHBaseTableInputFormat and HiveHBaseTableOutputFormat. We can register our HBase table in Hive using those classes directly if we desire. The above statement is roughly equivalent to:

CREATE TABLE foo (…)

STORED AS

INPUTFORMAT ‘org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat’

OUTPUTFORMAT ‘org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat’

TBLPROPERTIES (‘hbase.table.name’ = ‘bar’);

Also provided is the HiveHFileOutputFormat which means it should be possible to generate HFiles for bulkloading from Hive as well.

Schema Mapping:

Registering the table is only the first step. As part of that registration, we also need to specify a column mapping. This is how we link Hive column names to the HBase table’s rowkey and columns.

Do so using the hbase.columns.mapping SerDe property.

CREATE TABLE foo (rowkey STRING, a STRING, b STRING)

STORED BY ‘org.apache.hadoop.hive.hbase.HBaseStorageHandler’

WITH SERDEPROPERTIES (‘hbase.columns.mapping’ = ‘: key, f: c1, f: c2’)

TBLPROPERTIES (‘hbase.table.name’ = ‘bar’);

The values provided in the mapping property correspond one-for-one with column names of the hive table. HBase column names are fully qualified by column family, and you use the special

Token: key to represent the rowkey. The above example makes rows from the Hbase table bar available via the Hive table foo. The foo column rowkey maps to the HBase’s table’s row

Key, a to c1 in the f column family and b to c2 also in f column family.

We can also associate Hive’s MAP data structures to HBase column families. In this case, only the STRING Hive type is used. The other Hive type currently supported is BINARY.

Interacting with data

With the column mappings defined, we can now access HBase data just like you would any other

Hive data. Only simple query predicates are currently supported.

SELECT * FROM foo WHERE..;

We can also populate and HBase table using Hive. This works with both INTO and OVERWRITE clauses.

Following is the example for integrating Hive and HBase using Storage Handler.

Make sure Hadoop and HBase is started and running successfully, if not use the following commands:

For starting Hadoop: start-all.sh.

For starting HBase: start-hbase.sh.

  1. Create the HBase Table:create ’emp’,’personaldetails’,’deptdetails’.Here personaldetails and deptdetails are two column family of emp table.

    Each column family of an HBase table can be split into any number of attributes which cannot be seen in traditional SQL.

    1

  2. Insert Data into HBase Table:Put ’emp’,’eid01′,’personaldetails: Fname’, ‘Riya’

    Put ’emp’,’eid01′,’personaldetails: Lname’, ‘Kapoor’

    Put ’emp’,’eid01′,’personaldetails: Salary’,’10000′

    Put ’emp’,’eid01′,’deptdetails: name’, ‘R&D’

    Put ’emp’,’eid01′,’deptdetails: location’, ‘Bangalore’

    Personaldetails column family of emp HBase table is split into 3 more attributes: Fname, Lname, salary.

    Deptdetails column family of emp HBase table is also split into two attributes: name and location.

2

3: Create the Hive table pointing to HBase table.

In case of multiple columns family in HBase table, we can create one table for each column families.

In the example we have 2 column families and so we are creating two tables, one for each column families.

In the Hive shell:

Table for personal details column family:

Create external table emp_hbase (Eid string, f_name string, s_name string, salary int)

STORED BY ‘org.apache.hadoop.hive.hbase.HbaseStorageHandler’

with serdeproperties

(“hbase.columns.mapping”=”: key, personaldetails: Fname, personaldetails: Lname, personaldetails: Salary)

Tblproperties (“hbase.table.name”=”emp”);

We created the non-native Hive table using Storage Handler so we should specify the STORED BY clause.

hbase.columns.mapping: It is used to map the Hive columns with the HBase columns. The first column must be the key column which would also be same as the HBase’s row key column.

Tblproperties : We need to specify the name of the HBase table created in HBase shell.

3

Table for department details column family:

Create external table emp_dept (Eid string, name string, location string)

STORED BY ‘org.apache.hadoop.hive.hbase.HbaseStorageHandler’

With serdeproperties

(“hbase.columns.mapping”=”: key, deptdetails: name, deptdetails: location,)

Tblproperties (“hbase.table.name”=”emp”);

4

We can query the HBase table with SQL queries in hive.

Following are few of the queries:

  1. Select * from emp_hbase: This query returns all the personal details of an employee.
  2. Select * from emp_hbase where Salary <20000: This query returns all the details of employees whose salary is less than 20000.
  3. Select * from emp_dept : This query returns all the department details of an employee.
  4. Select * from emp_dept where name= ‘EEE’;:This query returns all the department details of an employee where dept name is ‘EEE’.

5

We have successfully mapped the HBase table with the Hive external table.

Conclusion:

The interface between HBase and Hive is young, but has nice potential. There’s a lot of low-hanging fruit that can be picked up to make things easier and faster. The most glaring issue barring real application development is the impedance mismatch between Hive’s typed, dense schema and HBase’s untyped, sparse schema. This is as much a cognitive problem as technical issue.

Basic operations mostly work, at least in a rudimentary way. We can read data out of and write data back into HBase using Hive. Configuring the environment is an opaque and manual process, one which likely stymies novices from adopting the tools. Hive provides a very usable SQL interface on top of HBase, one which integrates easily into many existing ETL workflows. That interface requires simplifying some of the Big Table semantics HBase provides, but the result will be to open up HBase to a much broader audience of users.

Thank you..!!!

References:

  1. https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
  2. https://mevivs.wordpress.com/2010/11/24/hivehbase-integration/https://acadgild.com/blog/integrating-hive-with-hbase/
  3. http://blog.cloudera.com/blog/2010/06/integrating-hive-and-hbase/
  4. http://www.tutorialspoint.com/hive/
  5. http://www.tutorialspoint.com/hbase/
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s