Hello, readers!
Everywhere around us we hear conversations about big and small collections of unstructured data, semi-structured data, structured data… it is all very interesting. But if we want to use any piece of data in a computation, there must be some layer of metadata and structure through which to interact with it. This is where HCatalog comes in: it provides a metadata service within Hadoop.
So, what is HCatalog?
HCatalog is a metadata and table management system for the broader Hadoop platform, and a key component of Apache Hive. It enables the storage of data in any format, regardless of structure. Hadoop can then process both structured and unstructured data, while storing and sharing information about the data’s structure in HCatalog.
Hive/HCatalog also enables sharing of data structure with external systems, including traditional data management tools. It is the glue that lets these systems interact effectively and efficiently, and it is a key component in helping Hadoop fit into the enterprise.
Let us now take a deep dive into how HCatalog works.
HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, Parquet, ORCFile, CSV, JSON, and SequenceFile formats. To use a custom format, we must provide the corresponding InputFormat, OutputFormat, and SerDe.
HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands. It also presents a REST interface to allow external tools access to Hive DDL (Data Definition Language) operations, such as “create table” and “describe table”.
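As a rough sketch of that REST interface (known as WebHCat, formerly Templeton), DDL operations can be issued over plain HTTP. The host name, user name, table, and column names below are placeholders, and WebHCat is assumed to be running on its default port 50111:

```shell
# Describe an existing table via WebHCat, HCatalog's REST interface.
curl -s 'http://webhcat-host:50111/templeton/v1/ddl/database/default/table/mytable?user.name=hcatuser'

# Create a table by sending a JSON table description (illustrative columns).
curl -s -X PUT -H 'Content-Type: application/json' \
  -d '{"columns": [{"name": "id", "type": "bigint"}, {"name": "url", "type": "string"}]}' \
  'http://webhcat-host:50111/templeton/v1/ddl/database/default/table/mytable?user.name=hcatuser'
```

This is what lets external tools issue “create table” and “describe table” without a Hadoop client installed.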
HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed into databases. Tables can also be partitioned on one or more keys. For a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values).
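To make the partitioning idea concrete, here is a hedged sketch in Hive DDL; the table, columns, and the `dt` partition key are illustrative, not from the original article:

```sql
-- Illustrative table: one partition per value of the 'dt' key.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
STORED AS RCFILE;

-- All rows with dt = '2013-01-01' land in this single partition.
ALTER TABLE page_views ADD PARTITION (dt = '2013-01-01');
```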
HCatalog Interfaces for Apache Pig
These interfaces are easier to understand with a working knowledge of Apache Pig.
There are two interfaces in Pig for loading and storing. No HCatalog-specific setup is required for these interfaces.
HCatLoader: HCatLoader is used with Pig scripts to read data from HCatalog-managed tables. HCatLoader is implemented on top of HCatInputFormat. We can indicate which partitions to scan by immediately following the load statement with a partition filter statement.
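A hedged sketch of HCatLoader with a partition filter follows; the table name and the `dt` partition column are assumptions (and note that in Hive 0.11+ the package is `org.apache.hive.hcatalog.pig`):

```pig
-- Load an HCatalog-managed table (hypothetical table name).
A = LOAD 'default.page_views' USING org.apache.hcatalog.pig.HCatLoader();

-- A filter immediately after the load lets HCatalog prune partitions,
-- so only partitions matching dt = '2013-01-01' are scanned.
B = FILTER A BY dt == '2013-01-01';

DUMP B;
```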
Syntax: For loading data from an HCatalog-managed table
A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
We must specify the table name in single quotes: LOAD 'tablename'. If we are using a non-default database, then we must specify the input as 'dbname.tablename'.
The Hive metastore lets us create tables without specifying a database. If a table is created in this way, then its database name is 'default' and is not required when specifying the table for HCatLoader.
HCatStorer: HCatStorer is used with Pig scripts to write data to HCatalog-managed tables. HCatStorer accepts a table to write to and, optionally, a specification of partition keys to create a new partition. We can write to a single partition by specifying the partition key(s) and value(s) in the STORE clause, and we can write to multiple partitions if the partition key(s) are columns in the data being stored. HCatStorer is implemented on top of HCatOutputFormat.
Syntax: For a store operation.
A = LOAD …
B = FOREACH A ……
STORE B INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();
We must specify the table name in single quotes: STORE … INTO 'tablename'. Both the database and the table must be created prior to running the Pig script. If we are using a non-default database, then we must specify the output as 'dbname.tablename'.
The Hive metastore lets us create tables without specifying a database. If we have created tables in this way, then the database name is 'default' and we do not need to specify it in the store statement.
For the USING clause, we can pass a string argument that represents key/value pairs for partitions. This argument is mandatory when we are writing to a partitioned table and the partition column is not among the output columns. The values for partition keys should NOT be quoted.
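Both write modes can be sketched as follows; the tables, columns, and the `dt` partition key are assumptions (in Hive 0.11+ the package is `org.apache.hive.hcatalog.pig`):

```pig
-- Hypothetical input table with columns user_id, url, dt.
A = LOAD 'default.raw_events' USING org.apache.hcatalog.pig.HCatLoader();

-- Writing to a single partition: the partition key/value pair goes in the
-- string argument, so 'dt' is left out of the stored columns. Note the
-- partition value is not quoted inside the string.
B = FOREACH A GENERATE user_id, url;
STORE B INTO 'default.page_views' USING org.apache.hcatalog.pig.HCatStorer('dt=2013-01-01');

-- Writing to multiple partitions: keep 'dt' as a column in the data and
-- pass no argument; each row lands in the partition matching its dt value.
-- C = FOREACH A GENERATE user_id, url, dt;
-- STORE C INTO 'default.page_views' USING org.apache.hcatalog.pig.HCatStorer();
```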
Uses of HCatalog
- Enabling the Right Tool for the Right Job: The majority of heavy Hadoop users do not use a single tool for data processing. Often users and teams will begin with a single tool: Hive, Pig, MapReduce, or another tool. As their use of Hadoop deepens, they will discover that the tool they chose is not optimal for the new tasks they are taking on. Users who start with analytics queries in Hive discover they would like to use Pig for ETL processing or constructing their data models. Users who start with Pig discover they would like to use Hive for analytics-type queries. While tools such as Pig and MapReduce do not require metadata, they can benefit from it when it is present. Sharing a metadata store also enables users across tools to share data more easily. A workflow where data is loaded and normalized using MapReduce or Pig and then analysed via Hive is very common. When all these tools share one metastore, users of each tool have immediate access to data created with another tool. No loading or transfer steps are required.
- Capture Processing States to Enable Sharing: When used for analytics, users will discover information using Hadoop. Again, they will often use Hive, Pig, and MapReduce to uncover information. The information is valuable, but typically only in the context of a larger analysis. With HCatalog you can publish results so they can be accessed by your analytics platform via REST. In this case, the schema defines the discovery. These discoveries are also useful to other data scientists. Often they will want to build on what others have created, or use results as input into a subsequent discovery.
- Integrate Hadoop with Everything: Hadoop as a processing and storage environment opens up a lot of opportunity for the enterprise; however, to fuel adoption it must work with and augment existing tools. Hadoop should serve as input into our analytics platform, or integrate with our operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn an entirely new toolset. REST services open up the platform to the enterprise with a familiar API and SQL-like language. Enterprise data management systems use HCatalog to integrate more deeply with the Hadoop platform. By tying in more closely, they can hide complexity from users and create a better experience. A great example of this is the SQL-H integration from Teradata Aster. SQL-H queries the structure of data stored in HCatalog and exposes it back through Aster, enabling Aster to access just the relevant data stored within the Hortonworks Data Platform.
HCatalog allows developers to share data and metadata across internal Hadoop tools such as Hive, Pig, and MapReduce. It allows them to create applications without being concerned with how or where the data is stored, and it insulates users from schema and storage format changes. It is a repository for schemas that can be referenced from these programming models, so that we don’t have to explicitly define our structures in each program. It provides a command line tool for users who do not use Hive to operate on the metastore with Hive DDL statements. It also provides a notification service so that workflow tools, such as Oozie, can be notified when new data becomes available in the warehouse.
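As a small sketch of that command line tool (the table name and file name are placeholders), the `hcat` utility accepts Hive DDL directly, without starting a Hive session:

```shell
# Run a single DDL statement against the metastore.
hcat -e "DESCRIBE default.page_views"

# Or execute a file of DDL statements.
hcat -f create_tables.hcatalog
```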
Thank you!