HCatalog

                                                      

Hello Readers !!!

Everywhere around us we hear conversations about unstructured data, semi-structured data, and structured data, big and small… it is all very interesting. If we want to use any piece of data for computation, there needs to be a layer of metadata and structure to interact with it. This is where HCatalog comes in: it provides a metadata service within Hadoop.

So What is HCatalog …??

It is a key component of Apache Hive. HCatalog is a metadata and table management system for the broader Hadoop platform. It enables the storage of data in any format regardless of structure. Hadoop can then process both structured and unstructured data, and information about the data's structure can be stored and shared through HCatalog.

Hive/HCatalog enables sharing of data structure with external systems including traditional data management tools. It is the glue that enables these systems to interact effectively and efficiently and is a key component in helping Hadoop fit into the enterprise.


Let us now take a deeper look at how HCatalog works.

HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, Parquet, ORCFile, CSV, JSON, and SequenceFile formats. To use a custom format, we must provide the InputFormat, OutputFormat, and SerDe.

HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands. It also presents a REST interface to allow external tools access to Hive DDL (Data Definition Language) operations, such as “create table” and “describe table”.

HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed into databases. Tables can also be partitioned on one or more keys. For a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values).
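For instance, the DDL operations mentioned above can declare a partitioned table that HCatalog then manages. A minimal sketch in HiveQL (the table and column names here are made up purely for illustration):

CREATE TABLE web_logs (userid STRING, url STRING)
PARTITIONED BY (datestamp STRING)
STORED AS ORC;

DESCRIBE web_logs;   -- metadata exploration, as mentioned above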

HCatalog Interfaces for Apache Pig

This concept is understood better if we have sound knowledge of Apache Pig.

There are two interfaces in Pig for loading and storing. No HCatalog-specific setup is required for these interfaces.

HCatLoader: HCatLoader is used with Pig scripts to read data from HCatalog-managed tables. HCatLoader is implemented on top of HCatInputFormat. We can indicate which partitions to scan by immediately following the load statement with a partition filter statement.

Syntax for loading data from an HCatalog-managed table:

A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();

We must specify the table name in single quotes: LOAD 'tablename'. If we are using a non-default database, then we must specify the input as 'dbname.tablename'.

The Hive metastore lets us create tables without specifying a database. If the table is created in this way, then the database name is 'default' and is not required when specifying the table for HCatLoader.
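Putting these pieces together, a hedged sketch of an HCatLoader load followed by a partition filter (the table and partition names are made up for illustration):

A = LOAD 'web_logs' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A BY datestamp == '20160101';  -- the partition filter immediately follows the load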

HCatStorer : HCatStorer is used with Pig scripts to write data to HCatalog-managed tables. HCatStorer accepts a table to write to and optionally a specification of partition keys to create a new partition. We can write to a single partition by specifying the partition key(s) and value(s) in the STORE clause and we can write to multiple partitions if the partition key(s) are columns in the data being stored. HCatStorer is implemented on top of HCatOutputFormat.    

Syntax for the store operation:

A = LOAD…

B = FOREACH A……

my_processed_data =..

STORE my_processed_data INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();

We must specify the table name in single quotes: STORE … INTO 'tablename'. Both the database and the table must be created prior to running your Pig script. If we are using a non-default database, then we must specify the table as 'dbname.tablename'.

The Hive metastore lets us create tables without specifying a database. If we have created tables in this way, then the database name is ‘default’ and we do not need to specify the database name in the store statement.

For the USING clause, we can pass a string argument that represents key/value pairs for partitions. This is a mandatory argument when we are writing to a partitioned table and the partition column is not in the output columns. The values for partition keys should NOT be quoted.
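As a hedged sketch (the table, column, and partition names are illustrative only), writing to a single partition of a partitioned table could look like this:

raw = LOAD 'web_logs' USING org.apache.hcatalog.pig.HCatLoader();
processed = FOREACH raw GENERATE userid, url;
STORE processed INTO 'web_logs_processed' USING org.apache.hcatalog.pig.HCatStorer('datestamp=20160101');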

Uses of HCatalog

  • Enabling the Right Tool for the Right Job: The majority of heavy Hadoop users do not use a single tool for data processing. Often users and teams will begin with a single tool: Hive, Pig, MapReduce, or another tool. As their use of Hadoop deepens, they discover that the tool they chose is not optimal for the new tasks they are taking on. Users who start with analytics queries using Hive discover they would like to use Pig for ETL processing or constructing their data models. Users who start with Pig discover they would like to use Hive for analytics-type queries. While tools such as Pig and MapReduce do not require metadata, they can benefit from it when it is present. Sharing a metadata store also enables users across tools to share data more easily. A workflow where data is loaded and normalized using MapReduce or Pig and then analysed via Hive is very common. When all these tools share one metastore, users of each tool have immediate access to data created with another tool. No loading or transfer steps are required.
  • Capture Processing States to Enable Sharing: When used for analytics, users will discover information using Hadoop. Again, they will often use Hive, Pig and MapReduce to uncover information. The information is valuable but typically only in the context of a larger analysis. With HCatalog you can publish results so they can be accessed by your analytics platform via REST. In this case, the schema defines the discovery. These discoveries are also useful to other data scientists. Often they will want to build on what others have created or use results as input into a subsequent discovery.
  • Integrate Hadoop with Everything: Hadoop as a processing and storage environment opens up a lot of opportunity for the enterprise; however, to fuel adoption it must work with and augment existing tools. Hadoop should serve as input into our analytics platform or integrate with our operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn an entirely new toolset. REST services open up the platform to the enterprise with a familiar API and SQL-like language. Enterprise data management systems use HCatalog to integrate more deeply with the Hadoop platform. By tying in more closely they can hide complexity from users and create a better experience. A great example of this is the SQL-H integration from Teradata Aster. SQL-H queries the structure of data stored in HCatalog and exposes it back to Aster, enabling Aster to access just the relevant data stored within the Hortonworks Data Platform.

Conclusion

HCatalog allows developers to share data and metadata across internal Hadoop tools such as Hive, Pig, and MapReduce. It allows them to create applications without being concerned how or where the data is stored, and insulates users from schema and storage format changes.  It is a repository for schema that can be referred to in these programming models so that we don’t have to explicitly type our structures in each program. It provides a command line tool for users who do not use Hive to operate on the metastore with Hive DDL statements.  It also provides a notification service so that workflow tools, such as Oozie, can be notified when new data becomes available in the warehouse.

References

  1. https://cwiki.apache.org/confluence/display/Hive/HCatalog 
  2. http://www.tutorialspoint.com/hcatalog/ 
  3. http://vijayanyayapathi.com/pig-and-hive-interaction-hcatalog-hcatloaderhcatstorer-tutorial/ 
  4. http://hadooptutorial.info/hcatalog-and-pig-integration

Thank you …!!!

Apache Pig

Hello Everyone..!!

In this blog I'll be taking you through Apache Pig: why it should be used and a few of its basic concepts.

So let’s begin with..

What is Apache Pig?

Apache Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig's simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.

Pig works with data from many sources, including structured and unstructured data, and stores the results into HDFS. Pig scripts are translated into a series of MapReduce jobs that run on the Apache Hadoop cluster. Using the Pig Latin scripting language, operations like ETL (Extract, Transform and Load), ad-hoc data analysis, and iterative processing can be achieved easily.

Pig originated as a Yahoo Research initiative for creating and executing map-reduce jobs on very large data sets. In 2007 Pig became an open source project of the Apache Software Foundation.

Why Apache Pig..?

Programmers who are not well versed in Java often struggled to work with Hadoop, especially when writing MapReduce tasks. Apache Pig became a boon for all such programmers.

  • Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex code in Java.
  • Apache Pig uses a multi-query approach, thereby reducing the length of the code.
  • Pig Latin is an SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
  • Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.

Pig Architecture

The language used to analyse data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.

To perform a particular task, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, script file, or embedded). After execution, the script goes through a series of transformations applied by the Pig framework to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the programmer’s job easy.

[Figure: Pig architecture]

The Pig architecture consists of a Pig Latin interpreter that runs on the client machine. It takes Pig Latin scripts, converts them into a series of MR jobs, executes those jobs, and saves the output into HDFS. In between, it performs different operations such as parsing, compiling, optimizing, and planning the execution of the data that comes into the system.

Parser: Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown and filter pushdown.

Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine: Finally the MapReduce jobs are submitted to Hadoop in a sorted order. These MapReduce jobs are executed on Hadoop producing the desired results.
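To see what these components produce for a given script, Pig's EXPLAIN operator prints the logical, physical, and MapReduce plans. A small sketch in the Grunt shell (the input file and schema here are made up):

grunt> A = LOAD 'data.txt' AS (name:chararray, age:int);
grunt> B = FILTER A BY age > 25;
grunt> EXPLAIN B;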

Job Execution Flow in Apache Pig

The scripts developed by the programmer are stored in the local file system. When we submit a Pig script, it is handled by the Pig Latin compiler, which splits the task and runs a series of MR jobs; meanwhile Pig fetches the input data from HDFS (i.e. the input file present there). After the MR jobs run, the output file is stored in HDFS.

Components of Apache Pig :

Pig is a scripting language for exploring huge data sets (gigabytes or terabytes in size) very easily. Pig provides an engine for executing data flows in parallel on Hadoop.

Pig is made up of two things:

  1. Pig Latin: the language layer that enables SQL-like queries to be performed on distributed datasets within Hadoop applications.
  2. Pig Engine: the execution environment used to run Pig Latin programs. It has two modes:

Local Mode: We can execute a Pig script against the local file system. In this case we don't need to store the data in Hadoop's HDFS; instead we can work with data stored in the local file system itself. Parallel mapper execution is not possible in this mode because earlier versions of Hadoop are not thread safe. Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local file system. To run in local mode, we pass the local option to the -x or -exectype parameter when starting Pig. This starts the interactive shell called Grunt:

          $ pig -x local

          grunt>

MapReduce Mode: MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin statements to process data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in HDFS. Pig translates the queries into MapReduce jobs and runs the jobs on the Hadoop cluster. This can be a pseudo-distributed or fully distributed cluster. First we need to check the compatibility of the Pig and Hadoop versions being used.

          $ pig -x mapreduce

           grunt>

[Figure: Pig execution]

Execution Mechanisms used by Pig Engine

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.

  • Interactive Mode (Grunt shell): We can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump operator).
  • Batch Mode (Script): We can run Apache Pig in batch mode by writing the Pig Latin script in a single file with the .pig extension (a small sketch follows this list).
  • Embedded Mode (UDF): Apache Pig provides the facility to define our own functions (User Defined Functions) in programming languages such as Java, and to use them in our scripts.
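As promised above, a minimal batch-mode sketch, assuming a script file named wordcount.pig and an input file input.txt already in HDFS (both names are made up for illustration):

$ pig -x mapreduce wordcount.pig

-- contents of wordcount.pig
lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';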

[Figure: Pig execution mechanisms]

After knowing the architecture and its components, we will further go ahead with the installation procedure of Pig on Ubuntu.

Setting up Pig

 Before installing Apache Pig, it is essential that we have Hadoop and Java installed on our system.

  • Unpack the tarball in the directory of your choice, using the following command

          $ cd hadoop/apache-pig

          $tar -xzvf pig-0.14.0.tar.gz

  • Set the environment variable PIG_HOME to point to the installation directory for convenience:

          $ export PIG_HOME=/home/hduser/hadoop/apache-pig

                                                or

  • Set PIG_HOME in .bashrc so it will be set every time you login.

          Add the following line to it.

          export PIG_HOME=/home/hduser/hadoop/apache-pig

          export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$PATH

  • Verify the installation

          $ pig -version

Pig Latin Data Model

Pig’s data types make up the data model for how Pig thinks of the structure of the data it is processing. With Pig, the data model gets defined when the data is loaded. Any data we load into Pig is going to have a particular schema and structure. Pig needs to understand that structure, so when we do the loading, the data automatically goes through a mapping.

The Pig data model is rich enough to handle almost anything thrown its way, including table-like structures and nested hierarchical data structures. In general terms, though, Pig data types can be broken into two categories: scalar types and complex types. Scalar types contain a single value, whereas complex types contain other types, such as the Tuple, Bag and Map types.

[Figure: Pig Latin data model]

Atom: Any single value in Pig Latin, irrespective of its data type, is known as an atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig. A piece of data or a simple atomic value is known as a field.

Example − ‘raja’ or ‘30’

Tuple: A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any type. A tuple is similar to a row in a table of RDBMS.

Example − (Raja, 30)

Bag: A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as inner bag.

Example − (Raja, 30, {(9848022338, raja@gmail.com)})

Map: A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by '[]'. Example − [name#Raja, age#30]
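These complex types can be declared directly in a LOAD schema. A hedged sketch, assuming a pipe-delimited file named employees.txt whose last two fields hold a bag of contact tuples and a map of properties:

emp = LOAD 'employees.txt' USING PigStorage('|')
      AS (name:chararray, age:int,
          contacts:bag{t:(phone:chararray, email:chararray)},
          props:map[]);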


The value of all these types can also be null. The semantics for null are similar to those used in SQL. The concept of null in Pig means that the value is unknown. Nulls can show up in the data in cases where values are unreadable or unrecognizable — for example, if you were to use a wrong data type in the LOAD statement.

Null could be used as a placeholder until data is added or as a value for a field that is optional.
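Continuing the sketch above, rows whose age field could not be read (and therefore became null) can be filtered out explicitly:

clean = FILTER emp BY age IS NOT NULL;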

SCALAR TYPES

1. int: Represents a signed 32-bit integer. Example: 8
2. long: Represents a signed 64-bit integer. Example: 5L
3. float: Represents a signed 32-bit floating point. Example: 5.5F
4. double: Represents a 64-bit floating point. Example: 10.5
5. chararray: Represents a character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
6. bytearray: Represents a byte array (blob).
7. boolean: Represents a Boolean value. Example: true/false
8. datetime: Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9. biginteger: Represents a Java BigInteger. Example: 60708090709
10. bigdecimal: Represents a Java BigDecimal. Example: 185.98376256272893883

COMPLEX TYPES

11. tuple: An ordered set of fields. Example: (raja, 30)
12. bag: A collection of tuples. Example: {(raju,30),(Mohammad,45)}
13. map: A set of key-value pairs. Example: ['name'#'Raju', 'age'#30]


Pig Latin has a simple syntax with powerful semantics to carry out two primary operations: access and transform data.

In a Hadoop context, accessing data means allowing developers to load, store, and stream data, whereas transforming data means taking advantage of Pig’s ability to group, join, combine, split, filter, and sort data. The table gives an overview of the operators associated with each operation.

Loading and Storing
LOAD: To load the data from the file system (local/HDFS) into a relation.
STORE: To save a relation to the file system (local/HDFS).

Filtering
FILTER: To remove unwanted rows from a relation.
DISTINCT: To remove duplicate rows from a relation.
FOREACH, GENERATE: To generate data transformations based on columns of data.
STREAM: To transform a relation using an external program.

Grouping and Joining
JOIN: To join two or more relations.
COGROUP: To group the data in two or more relations.
GROUP: To group the data in a single relation.
CROSS: To create the cross product of two or more relations.

Sorting
ORDER: To arrange a relation in a sorted order based on one or more fields (ascending or descending).
LIMIT: To get a limited number of tuples from a relation.

Combining and Splitting
UNION: To combine two or more relations into a single relation.
SPLIT: To split a single relation into two or more relations.

Diagnostic Operators
DUMP: To print the contents of a relation on the console.
DESCRIBE: To describe the schema of a relation.
EXPLAIN: To view the logical, physical, or MapReduce execution plans used to compute a relation.
ILLUSTRATE: To view the step-by-step execution of a series of statements.
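A short sketch that chains several of these operators together (the file name and schema are illustrative only):

emps = LOAD 'employees.txt' USING PigStorage(',') AS (name:chararray, dept:chararray, salary:int);
well_paid = FILTER emps BY salary > 50000;
by_dept = GROUP well_paid BY dept;
dept_counts = FOREACH by_dept GENERATE group AS dept, COUNT(well_paid) AS cnt;
sorted = ORDER dept_counts BY cnt DESC;
top3 = LIMIT sorted 3;
DUMP top3;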

Data Flow in Pig Programming:

A Pig program consists of three parts: loading, transforming, and dumping or storing the data.

Loading: As is the case with all the Hadoop features, the objects that are being worked on by Hadoop are stored in HDFS. In order for a Pig program to access this data, the program must first tell Pig what file (or files) it will use, and that’s done through the LOAD ‘data file’ command (where ‘data file’ specifies either an HDFS file or directory). If a directory is specified, all the files in that directory will be loaded into the program. If the data is stored in a file format that is not natively accessible to Pig, you can optionally add the USING function to the LOAD statement to specify a user-defined function that can read in and interpret the data.

Transforming: The transformation logic is where all the data manipulation happens. Here we can FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more.

Dumping and Storing: If we don’t specify the DUMP or STORE command, the results of a Pig program are not generated. We would typically use the DUMP command, which sends the output to the screen, when we are debugging our Pig programs. When we go into production, we simply change the DUMP call to a STORE call so that any results from running our programs are stored in a file for further processing or analysis. Note that we can use the DUMP command anywhere in our program to dump intermediate result sets to the screen, which is very useful for debugging purposes.

Example

Given below is a Pig Latin statement, which loads data to Apache Pig.

grunt> student_data = LOAD 'student_data.txt' USING PigStorage(',') AS
       (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

The optional USING statement defines how to map the data structure within the file to the Pig data model — in this case, the PigStorage() function, which parses delimited text files. (This part of the USING statement is often referred to as a LOAD Func and works in a fashion similar to a custom deserializer.)

The optional AS clause defines a schema for the data that is being mapped. If we don’t use an AS clause, we’re basically telling the default LOAD Func to expect a plain text file that is tab delimited. With no schema provided, the fields must be referenced by position because no name is defined.

Using an AS clause means that we have a schema in place at read-time for our text files, which allows users to get started quickly and provides agile schema modeling and flexibility so that we can add more data to our analytics.
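For comparison, a hedged sketch of loading the same file without a schema, where fields must be referenced positionally:

raw = LOAD 'student_data.txt' USING PigStorage(',');
names = FOREACH raw GENERATE $0 AS id, $1 AS firstname;   -- fields addressed by position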

Features of Apache Pig:

  • Pig Latin is a procedural data-flow language used for programming data transformations.
  • It can handle all kinds of data, structured as well as unstructured.
  • Using Pig's multi-query approach, many operations can be combined in a single flow, reducing the number of times the data is scanned.
  • It provides a rich set of operators for filtering, joining, sorting, etc.
  • It provides complex data types such as tuples, bags, and maps.
  • It is generally used by researchers and programmers.
  • It operates on the client side of the cluster.
  • It does not have a dedicated metadata database; the schema and data types are defined in the script itself.
  • Through the User Defined Function (UDF) facility in Pig, code written in other languages such as Ruby, Python and Java can be invoked.

Conclusion: In this blog, we have seen that Pig is a very powerful scripting language based on the Hadoop ecosystem and MapReduce programming. It can be used to process large volumes of data in a distributed environment. Pig statements and scripts are similar to SQL statements, so developers can use it without focusing much on the underlying mechanism. Through the User Defined Function (UDF) facility in Pig, Pig can invoke code in many languages like JRuby, Python and Java. We can also embed Pig scripts in other languages. The result is that we can use Pig as a component to build larger and more complex applications that tackle real business problems.

References

  1. http://www.tutorialspoint.com/apache_pig/
  2. http://tech.globant.com/en/pig/ 
  3. https://www.dezyre.com/hadoop-tutorial/pig-tutorial 
  4. http://www.dummies.com/how-to/content/hadoops-pig-data-types-and-syntax.html 
  5. http://www.hadooptpoint.com/apache-pig-introduction/

 

Thank you  ..!!

HIVE & HBASE

Hello,

This blog is about how to use HBase from Apache Hive. Not just how to do it, but what works, how well it works, and how to make good use of it.

What is Hive?

Apache Hive is data warehouse software that facilitates querying and managing of large datasets residing in distributed storage. Hive provides an SQL-like language called HiveQL for querying the data. Hive is considered friendlier and more familiar to users who are used to SQL for querying data.

Hive is best suited for data warehousing applications where data is stored, mined and reported on after processing. Hive bridges the gap between data warehouse applications and Hadoop, as relational database models are the basis of most data warehousing applications.

It resides on top of Hadoop to summarize Big Data, and makes querying and analysing easy.

What is HBase?

Apache HBase is an open-source, distributed, versioned, column-oriented store modelled after Google's Bigtable (A Distributed Storage System for Structured Data). Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. HBase can be used when we need random, real-time read/write access to our Big Data. It is a scale-out table store which can support a very high rate of row-level updates over a large amount of data. It solves Hadoop's append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging data intelligently based on changes in data distribution.

Why do we need to integrate Hive with HBase?

Hive can store information about hundreds of millions of users effortlessly, but faces some difficulties when it comes to keeping the data warehouse up to date with the latest information. Hive uses HDFS as its underlying storage, which comes with limitations such as append-only, block-oriented storage. This makes it impossible to directly apply individual updates to warehouse tables. Until now, the only practical option for overcoming this limitation has been to pull snapshots from MySQL databases and dump them to new Hive partitions. This expensive operation of pulling data from one location to another is not practiced frequently (leading to stale data in the warehouse), and it also does not scale well as the data volume continues to shoot through the roof.

To overcome this problem, Apache HBase is used in place of MySQL, with Hive.

Since HBase is based on Hadoop, integrating it with Hive is pretty straightforward, as HBase tables can be accessed like native Hive tables. As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Likewise, Hive's INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within HBase itself.

How is HBase integrated with Hive?

The Hive project includes an optional library for interacting with HBase. This is where the bridge layer between the two systems is implemented. The primary interface you use when accessing HBase from Hive queries is called the HBaseStorageHandler. You can also interact with HBase tables directly via input and output formats, but the handler is simpler and works for most uses.

Storage handlers are a combination of InputFormat, OutputFormat, SerDe, and specific code that Hive uses to identify an external entity as a Hive table. This allows the user to issue SQL queries seamlessly, whether the table represents a text file stored in Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra, or Amazon DynamoDB. Storage handlers are not limited to NoSQL databases; a storage handler could be designed for several different kinds of data stores.

Use the HBaseStorageHandler to register HBase tables with the Hive metastore. We can optionally specify the HBase table as EXTERNAL, in which case Hive will not create or drop that table directly; we'll have to use the HBase shell to do so.

CREATE [EXTERNAL] TABLE foo (…)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
TBLPROPERTIES ('hbase.table.name' = 'bar');

The above statement registers the HBase table named bar in the Hive metastore, accessible from Hive by the name foo. Under the hood, HBaseStorageHandler is delegating interaction with the HBase table to HiveHBaseTableInputFormat and HiveHBaseTableOutputFormat. We can register our HBase table in Hive using those classes directly if we desire. The above statement is roughly equivalent to:

CREATE TABLE foo (…)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat'
TBLPROPERTIES ('hbase.table.name' = 'bar');

Also provided is the HiveHFileOutputFormat which means it should be possible to generate HFiles for bulkloading from Hive as well.

Schema Mapping:

Registering the table is only the first step. As part of that registration, we also need to specify a column mapping. This is how we link Hive column names to the HBase table’s rowkey and columns.

Do so using the hbase.columns.mapping SerDe property.

CREATE TABLE foo (rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'bar');

The values provided in the mapping property correspond one-for-one with the column names of the Hive table. HBase column names are fully qualified by column family, and you use the special token :key to represent the rowkey. The above example makes rows from the HBase table bar available via the Hive table foo. The foo column rowkey maps to the HBase table's row key, a maps to c1 in the f column family, and b maps to c2, also in the f column family.

We can also associate Hive’s MAP data structures to HBase column families. In this case, only the STRING Hive type is used. The other Hive type currently supported is BINARY.
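A hedged sketch of such a mapping, reusing the illustrative table names from above: the entire column family f is mapped to a Hive MAP column rather than to individual columns.

CREATE TABLE foo_map (rowkey STRING, f MAP<STRING, STRING>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:')
TBLPROPERTIES ('hbase.table.name' = 'bar');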

Interacting with data

With the column mappings defined, we can now access HBase data just like you would any other Hive data. Only simple query predicates are currently supported.

SELECT * FROM foo WHERE..;

We can also populate an HBase table using Hive. This works with both INTO and OVERWRITE clauses.
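For example, a minimal sketch, assuming a native Hive table named staging_table exists with three columns matching foo (the staging table is hypothetical):

INSERT OVERWRITE TABLE foo
SELECT rowkey, a, b FROM staging_table;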

Following is the example for integrating Hive and HBase using Storage Handler.

Make sure Hadoop and HBase are started and running successfully; if not, use the following commands:

For starting Hadoop: start-all.sh

For starting HBase: start-hbase.sh

  1. Create the HBase table:

    create 'emp','personaldetails','deptdetails'

    Here personaldetails and deptdetails are two column families of the emp table. Each column family of an HBase table can be split into any number of attributes, which cannot be done in a traditional SQL table.


  2. Insert data into the HBase table:

    put 'emp','eid01','personaldetails:Fname','Riya'
    put 'emp','eid01','personaldetails:Lname','Kapoor'
    put 'emp','eid01','personaldetails:Salary','10000'
    put 'emp','eid01','deptdetails:name','R&D'
    put 'emp','eid01','deptdetails:location','Bangalore'

    The personaldetails column family of the emp HBase table is split into 3 attributes: Fname, Lname and Salary. The deptdetails column family of the emp HBase table is split into two attributes: name and location.


3. Create the Hive tables pointing to the HBase table.

If the HBase table has multiple column families, we can create one Hive table for each column family.

In this example we have 2 column families, so we are creating two tables, one for each column family.

In the Hive shell:

Table for personal details column family:

CREATE EXTERNAL TABLE emp_hbase (Eid string, f_name string, s_name string, salary int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
('hbase.columns.mapping' = ':key,personaldetails:Fname,personaldetails:Lname,personaldetails:Salary')
TBLPROPERTIES ('hbase.table.name' = 'emp');

We created a non-native Hive table using a storage handler, so we must specify the STORED BY clause.

hbase.columns.mapping: It is used to map the Hive columns to the HBase columns. The first column must be the key column, which is the same as the HBase table's row key.

tblproperties: We need to specify the name of the HBase table created in the HBase shell.


Table for department details column family:

CREATE EXTERNAL TABLE emp_dept (Eid string, name string, location string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
('hbase.columns.mapping' = ':key,deptdetails:name,deptdetails:location')
TBLPROPERTIES ('hbase.table.name' = 'emp');


We can query the HBase table with SQL queries in Hive.

Following are a few example queries:

  1. Select * from emp_hbase: This query returns all the personal details of the employees.
  2. Select * from emp_hbase where Salary < 20000: This query returns the details of employees whose salary is less than 20000.
  3. Select * from emp_dept: This query returns all the department details of the employees.
  4. Select * from emp_dept where name = 'EEE': This query returns the department details of employees whose department name is 'EEE'.
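Since both external tables point at the same HBase table, we can also join them in a single Hive query; a minimal sketch:

SELECT p.Eid, p.f_name, d.name AS dept_name
FROM emp_hbase p JOIN emp_dept d ON (p.Eid = d.Eid);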


We have successfully mapped the HBase table with the Hive external table.

Conclusion:

The interface between HBase and Hive is young, but has nice potential. There's a lot of low-hanging fruit that can be picked up to make things easier and faster. The most glaring issue barring real application development is the impedance mismatch between Hive's typed, dense schema and HBase's untyped, sparse schema. This is as much a cognitive problem as a technical issue.

Basic operations mostly work, at least in a rudimentary way. We can read data out of and write data back into HBase using Hive. Configuring the environment is an opaque and manual process, one which likely stymies novices from adopting the tools. Hive provides a very usable SQL interface on top of HBase, one which integrates easily into many existing ETL workflows. That interface requires simplifying some of the Big Table semantics HBase provides, but the result will be to open up HBase to a much broader audience of users.

Thank you..!!!

References:

  1. https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
  2. https://mevivs.wordpress.com/2010/11/24/hivehbase-integration/
  3. https://acadgild.com/blog/integrating-hive-with-hbase/
  4. http://blog.cloudera.com/blog/2010/06/integrating-hive-and-hbase/
  5. http://www.tutorialspoint.com/hive/
  6. http://www.tutorialspoint.com/hbase/

Working of HIVE

Hi,

In my previous blog we learnt what Hive is and how to install it. In this blog I'll take you through its architecture and how it works.

Hive Architecture

[Figure: Hive architecture]

The diagram represents CLI (Command Line Interface), JDBC/ODBC and Web GUI (Web Graphical User Interface).

When the user connects through the CLI (Hive terminal), it is directly connected to the Hive driver. When the user connects through JDBC/ODBC (a JDBC program), it is connected to the Hive driver via an API (the Thrift server), and when the user connects through the Web GUI (Ambari server), it is directly connected to the Hive driver.

The Hive driver receives the tasks (queries) from the user and sends them to the Hadoop architecture. The Hadoop architecture uses the NameNode, DataNode, JobTracker and TaskTracker for receiving and dividing the work that Hive sends to Hadoop (the MapReduce architecture).

Components & Working of Hive:

[Figure: Components and working of Hive]

Components of Hive and their functionalities:

  • UI (User Interface): The user interface is for users to submit queries and other operations to the system.
  • Driver: The component which receives the queries. This component implements the notion of session handles and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.
  • Compiler: The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the Metastore.
  • Metastore: The component that stores all the structure information of the various tables and partitions in the warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding HDFS files where the data is stored.
  • Execution Engine: The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components.

Now that we are aware of all the components of Hive and their functionalities, let's see how Hive works.

Step 1: The UI calls the execute interface to the Driver.

Step 2: The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.

Step 3 & 4: The compiler gets the necessary metadata from the Metastore.

Step 5: This metadata is used to type check the expressions in the query tree as well as to prune partitions based on query predicates. The plan generated by the compiler is a DAG of stages with each stage being either a map/reduce job, a metadata operation or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducers).

Step 6: The execution engine submits these stages to the appropriate components (steps 6, 6.1, 6.2 and 6.3). In each task (mapper/reducer) the deserializer associated with the table or intermediate outputs is used to read the rows from HDFS files, and these are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer (this happens in the mapper in case the operation does not need a reduce). The temporary files are used to provide data to subsequent map/reduce stages of the plan. For DML operations the final temporary file is moved to the table's location.

Step 7 & 8 & 9: For queries, the contents of the temporary file are read by the execution engine directly from HDFS as part of the fetch call from the Driver.
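To see the DAG of stages described above for a particular query, Hive's EXPLAIN command can be used; a small sketch (the table name is made up for illustration):

hive> EXPLAIN SELECT dept, AVG(salary) FROM employees GROUP BY dept;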

Thank you..!!

References:

  1. http://www.tutorialspoint.com/hive/
  2. http://www.hadooptpoint.com/hadoop-hive-architecture/
  3. http://www.hadoopmaterial.com/2013/10/hive-architecture.html

 

Installation of Apache HIVE

Hi,

In this blog I will be discussing what Hive is and the steps to install it on an Ubuntu machine. So let's begin.

What is Hive..??

Apache Hive is data warehouse software built on top of Hadoop. It facilitates querying, data analysis and data summarization. It supports the analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 file system. It provides an SQL-like language called HiveQL (Hive Query Language). SQL knowledge is widespread, and anyone with decent SQL knowledge would be able to use Hive effectively. Hive translates a query into Java MapReduce code and runs it on the Hadoop cluster. Hive is best suited for data warehousing applications where data is structured, static and formatted. Hive is not a complete database; the Hive processor converts most of its queries into MapReduce jobs which run on the Hadoop cluster. Hive is designed for easy and effective data aggregation, ad-hoc querying and analysis of huge volumes of data.

Hive converts the HiveQL query into a Java MapReduce program and then submits it to the Hadoop cluster. The same outcome can be achieved using HiveQL or Java MapReduce, but using Java MapReduce will require a lot more code to be written and debugged compared to HiveQL. Using Hive therefore increases developer productivity.

Hive does not give SQL-like latency, as it ultimately runs MapReduce programs underneath. The MapReduce framework is built for batch-processing jobs and has high latency; even the fastest Hive query takes several minutes to execute on a relatively small dataset of a few megabytes. We cannot simply compare the performance of traditional SQL systems with Hive. Hive is not an OLTP (On-line Transaction Processing) application and is not meant to be connected to systems that need interactive processing. It is meant to be used to process batch jobs on huge, immutable data.
To summarize, Hive, through the HiveQL language, provides a higher-level abstraction over Java MapReduce programming. As with any other high-level abstraction, there is a bit of performance overhead when using HiveQL compared to Java MapReduce.
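To illustrate the difference in effort, an aggregation that would take dozens of lines of Java MapReduce code is a single HiveQL statement; a hedged sketch with a hypothetical table:

SELECT city, COUNT(*) AS num_customers
FROM customers
GROUP BY city;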
Installation of Hive on Ubuntu:

Before installing Hive there are a few basic prerequisites which are mandatory:

Java and JDK should be installed.

Hadoop must be installed and running.

After the basic prerequisites are met, we can go ahead with installing Hive.

Following are the steps for it:

Step 1: Download Apache Hive & Extract it.
Download from the link: http://apache.claz.org/hive/stable/
Click the apache-hive-1.2.1-bin.tar.gz and Save it.

Enter the Downloads directory, where Hive was downloaded.
$ cd Downloads
Extract the Hive tar file using the following command:
$ tar -xzvf apache-hive-1.2.1-bin.tar.gz


Step 2: Setting Hive environment variable:

Edit the .bashrc file to update the environment variable for user.

Add the following at the end of the file:

export HIVE_HOME=/home/hduser/hadoop/apache-hive

export PATH=$PATH:$HIVE_HOME/bin

export HADOOP_USER_CLASSPATH_FIRST=true



Step 3: Create Hive Directories within HDFS.

The directory warehouse is the location to store the table or data related to hive.

Command:

$ hadoop fs -mkdir /user/hive/warehouse


Set read/write permissions for table.

Command:

$ hadoop fs -chmod g+w /user/hive/warehouse


Step 4: Set the Hadoop path in hive-config.sh


Go to the line where the following statements are written

# Allow alternate conf dir location.

HIVE_CONF_DIR="${HIVE_CONF_DIR:-$HIVE_HOME/conf}"

export HIVE_CONF_DIR=$HIVE_CONF_DIR

export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH

Below these lines write the following

export HADOOP_HOME=/home/hduser/hadoop/apache-hadoop


Step 5: Launch hive

$ hive

Type exit to quit from hive.

Hive is installed successfully.
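As a quick smoke test (the table name here is purely illustrative), we can create a table and list it from the hive prompt:

hive> CREATE TABLE test_tbl (id INT, name STRING);
hive> SHOW TABLES;
hive> exit;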


References:

  1. http://www.tutorialspoint.com/hive
  2. http://www.hadooptpoint.com/hadoop-hive-architecture/
  3. http://www.hadoopmaterial.com/2013/10/hive-architecture.html