Introduction to Apache Thrift

Hello readers,

Firstly, let me begin with a brief definition of Apache Thrift: it is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and other languages.

Apache Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages. Instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business.
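As a concrete illustration, here is a minimal Thrift definition file (a hypothetical calculator.thrift; the struct, service, and method names are my own examples, not from any particular project):

```thrift
// A user-defined struct; the compiler turns this into a class
// in each target language.
struct Work {
  1: i32 num1,
  2: i32 num2
}

// The service interface; the compiler generates client stubs
// and server-side processor code from it.
service Calculator {
  i32 add(1: Work w)
}
```

Running, say, `thrift --gen java calculator.thrift` (or `--gen py`, `--gen cpp`, and so on) then produces the RPC plumbing for each language you need.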


Thrift includes a complete stack for creating clients and servers. The top part of the stack is the code generated from the Thrift definition file; from this file, Thrift generates client and processor code. User-defined data structures, like the built-in types, are serialized and sent as results by the generated code. The protocol and transport layers are part of the runtime library, so with Thrift it is possible to define a service and change the protocol and transport without recompiling the code. Besides the client part, Thrift includes server infrastructure to tie protocols and transports together, such as blocking, non-blocking, and multi-threaded servers. The underlying I/O part of the stack is implemented differently for each language.


Installation of Apache Thrift

First install the dependencies, and then you are ready to install Thrift.

Install the languages with which you plan to use Thrift. To use it with Java, for example, install the Java JDK you prefer and check that it is installed on the system.

Install Java:

$ sudo apt-get install openjdk-8-jdk -y

$ java -version


To use Thrift with Java you will also need to install Apache Ant:

$ sudo apt-get install ant



Install the required tools and libraries. On Debian/Ubuntu:


$ sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev


On CentOS/RHEL:

$ sudo yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel


 1) Download Apache Thrift version 0.9.0 rather than 0.9.1 (as explained later, 0.9.1 does not work well for the rhbase installation).



 2) Copy the downloaded file into the desired directory and untar the file:


$ tar -xvf thrift-0.9.0.tar.gz

3) Build Thrift according to the instructions. Update PKG_CONFIG_PATH in .bashrc by typing the following command in a terminal:

$ sudo nano ~/.bashrc

  • Paste the following line at the end:

export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/home/hduser/hadoop/thrift-0.9.0/lib/cpp

  • Then source the bashrc via

source ~/.bashrc

  • Verify the pkg-config path is correct by typing this in a terminal:

$ pkg-config --cflags thrift


  • Copy the Thrift library:

$ sudo cp /usr/local/lib/libthrift-0.9.0 /usr/lib/


4) Install some core development tools that are not direct dependencies:

$ sudo apt-get install git-core make

5) The following pulls in openjdk-6-jre-headless to provide /usr/bin/java:

$ sudo apt-get install maven2


6) Pull in the main Thrift dependencies:


$ sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev

7) Add curl since we'll need it shortly:

$ sudo apt-get install curl


8) Python support is enabled by default (although the build will break if the Python.h header is not present). The following configures the build with C++ and Java support only:

$ cd thrift-0.9.0

$ ./configure --prefix=/usr --without-python JAVA_PREFIX=/usr/share/java

9) Build:


$ make


10) Install:

$ sudo make install

After this command executes, Apache Thrift will be installed on the system. I had to install Thrift in order to install rhbase on my Ubuntu system, as Thrift is the main dependency rhbase uses for building services. But you can install Thrift whenever there is a need to build services between languages such as C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, etc.

Thank you for Reading.



Introduction to RHadoop

Hello readers,

I have been working on how R and Hadoop can be integrated and used together. After a long process of exploration and verification, I finally arrived at the possible ways of using R and Hadoop together for performing Big Data Analytics; RHadoop is one of the four different ways of using Hadoop and R together.

Hadoop is a disruptive Java-based programming framework that supports the processing of large data sets in a distributed computing environment, while R is a programming language and software environment for statistical computing and graphics.


RHadoop is an open source project developed by Revolution Analytics that provides client-side integration of R and Hadoop. RHadoop is a collection of five R packages that allow users to manage and analyse data with Hadoop. The packages have been tested (and always before a release) on recent releases of the Cloudera and Hortonworks Hadoop distributions, and should have broad compatibility with open source Hadoop and MapR’s distribution.

RHadoop consists of the following packages:

  • rhdfs: This package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R. Install this package only on the node that will run the R client.
  • rhbase: This package provides basic connectivity to the HBase distributed database, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase from within R. Install this package only on the node that will run the R client.
  • plyrmr: This package enables the R user to perform common data manipulation operations, as found in popular packages such as plyr and reshape2, on very large data sets stored on Hadoop. Like rmr, it relies on Hadoop MapReduce to perform its tasks, but it provides a familiar plyr-like interface while hiding many of the MapReduce details. Install this package on every node in the cluster.
  • rmr2: A package that allows the R developer to perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster. Install this package on every node in the cluster.
  • ravro: A package that adds the ability to read and write Avro files from the local and HDFS file systems, and adds an Avro input format for rmr2. Install this package only on the node that will run the R client.

Setting up RHadoop is a complicated task, as RHadoop has dependencies on other R packages. Working with RHadoop means installing R and the RHadoop packages, with their dependencies, on each data node of the Hadoop cluster.

Setting up RHadoop on Ubuntu 14.04

Prerequisites for installing RHadoop on Ubuntu 14.04

  1. Make sure the Java and Hadoop binaries are installed on the machine:

$ java -version


$ hadoop version


  2. R

In the terminal, type the commands below to install the necessary R packages:

sudo apt-get install r-base

sudo apt-get install r-base-core

sudo apt-get install r-base-dev

  3. RStudio

Download RStudio Desktop/Server from the RStudio website.

Double-click on the downloaded file to install RStudio on the system.

  4. Thrift 0.9.0

Thrift is needed for installing rhbase. If you do not use HBase, you may skip the Thrift installation.

Install Thrift 0.9.0 instead of 0.9.1. I first installed Thrift 0.9.1 (which was the latest version at that time) and found it did not work well for the rhbase installation. It was then a painful process to figure out the reason, uninstall 0.9.1, and install 0.9.0.



  5. Check that the JAVA_HOME, HADOOP_HOME and HADOOP_CMD variables are set in the $HOME/.bashrc file. If not, add them.
  6. Set the environment variables in R/RStudio.




Installing RHadoop Packages :

a) Using Terminal

Install the RHadoop package dependencies in one go using:

install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "dplyr", "R.methodsS3", "caTools", "Hmisc"), dependencies = TRUE)

Alternatively, if you have downloaded the .tar.gz files, you can install the RHadoop package dependencies separately, one by one, using:

install.packages("~/rJava_0.9-8.tar.gz", repos = NULL, type = "source")

b) Using RStudio: if you need a particular package/library to be installed in RStudio, you can follow the instructions below.

  1. Run RStudio.
  2. Click on the Packages tab in the bottom-right section and then click on Install; the Install Packages dialog box will appear.
  3. In the Install Packages dialog, write the name of the package you want to install in the Packages field and then click Install. This will install the package you searched for, or give you a list of matching packages based on the text you entered.

Other packages such as rhdfs, rhbase, etc. can be installed in a similar way.

After installing the RHadoop packages, you will now be able to access the Hadoop file system via R/RStudio.

Thank you for reading.


Introduction to RHDFS

Hi readers,

Before starting with RHDFS, let’s have a look at what HDFS and R are, and at the connection between the two, i.e., what RHDFS is and how it works.


Hadoop Distributed File System (HDFS) is the file system component of Hadoop. While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favour of improved performance for the applications at hand.

It is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.

HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms.

What is R?

R is a language and environment for statistical computing and graphics. The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has the following features :

  • An effective data handling and storage facility
  • A suite of operators for calculations on arrays, in particular matrices
  • A large, coherent, integrated collection of intermediate tools for data analysis
  • Graphical facilities for data analysis and display either directly at the computer or on hard-copy
  • A well-developed, simple and effective programming language (called ‘S’) which includes conditionals, loops, user-defined recursive functions, and input and output facilities. (Indeed, most of the system-supplied functions are themselves written in the S language.)

I find R to be very interesting, as it is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.

In the areas of interactive data analysis, general purpose statistics and predictive modelling, R has gained massive popularity due to its classification, clustering and ranking capabilities. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis.



What is RHDFS?

RHDFS is an R package that provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS.

The following functions are part of this package:

  • File manipulation: hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
  • File read/write: hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.tell, hdfs.line.reader
  • Directory: hdfs.dircreate, hdfs.mkdir
  • Utility: hdfs.list.files, hdfs.exists
  • Initialization: hdfs.init, hdfs.defaults

Prerequisites of RHDFS

  • This package has a dependency on rJava
  • Access to HDFS via this R package depends upon the HADOOP_CMD environment variable, which must point to the full path of the Hadoop binary. If this variable is not properly set, the package will fail when the hdfs.init() function is invoked.

Below are a few commands which I have used in RStudio for accessing the Hadoop file system.

  • The HADOOP_CMD environment variable should point to the Hadoop binary.
  • Use library(rhdfs) to load the rhdfs package into RStudio.


To initialize the connection to HDFS, type hdfs.init() and press Enter.


Once the above command is executed you will have access to the Hadoop file system, and thus you are able to access HDFS using the R console or RStudio.


Type"/user") to list the files in ‘/user’.


To view the files under ‘/user/hduser’, type"/user/hduser")


hdfs.put() is used to upload files from the local file system to the Hadoop file system, as shown in the screenshot below.

I am uploading a folder named logs from local FileSystem to HDFS.


If it returns TRUE, it means that the folder/file has been successfully uploaded; otherwise you will have to look into the error that occurred and rectify it.


hdfs.get() is used to download files from the Hadoop file system to the local file system, as shown in the screenshot below.


And if you want to delete a folder/file, use the command hdfs.rm().


In the below screenshot you can notice that the logs folder has been removed and no longer exists in HDFS.


Going a bit more into RHDFS, the next topic is reading a CSV file using RHDFS.

I have created a new directory rhdfs under deepika in the Hadoop file system, in which I shall upload all the RHDFS-related work. As you can see in the screenshot below, I have uploaded a new CSV file named sample1.csv into the Hadoop file system.


Here is a small code snippet on how to read the CSV data from HDFS using rhdfs:

f = hdfs.file("/user/hduser/deepika/rhdfs/sample.csv", "r")

m = hdfs.read(f)

c = rawToChar(m)

In the screenshot, you can see that the sample.csv file is read and the raw data is converted to characters so that it is in a format we can understand.


The output is displayed in tabular format, as depicted in the screenshot below:


As I am still new to R and RStudio, these are a few basic commands I have tried in order to get started with RHDFS. I shall share more detailed information in the upcoming blogs.

Thanks for Reading.

Introduction to HCatalog

Hello Readers !!!

Everywhere around us we hear conversations about big and small versions of unstructured data, semi-structured data, structured data… it is all very interesting. If we want to use any piece of data for some computation, there needs to be some layer of metadata and structure to interact with it. Here comes HCatalog, which provides a metadata service within Hadoop.

So what is HCatalog..?

HCatalog is a key component of Apache Hive and a metadata and table management system for the broader Hadoop platform. It enables the storage of data in any format regardless of structure. Hadoop can then process both structured and unstructured data, and it can store and share information about the data’s structure in HCatalog.

Hive/HCatalog enables sharing of data structure with external systems including traditional data management tools. It is the glue that enables these systems to interact effectively and efficiently and is a key component in helping Hadoop fit into the enterprise.


We will now take a deep dive into how HCatalog works.

HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, Parquet, ORCFile, CSV, JSON, and SequenceFile formats. To use a custom format, we must provide the InputFormat, OutputFormat, and SerDe.

HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands. It also presents a REST interface to allow external tools access to Hive DDL (Data Definition Language) operations, such as “create table” and “describe table”.

HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed into databases. Tables can also be partitioned on one or more keys. For a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values).
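To make the partitioning idea concrete, here is a hedged sketch in Hive DDL (the table and column names are hypothetical):

```sql
-- All rows sharing a value of dt land in the same partition.
CREATE TABLE web_logs (
  ip  STRING,
  url STRING
)
PARTITIONED BY (dt STRING)
STORED AS RCFILE;
```

A table defined this way through Hive is then visible to Pig and MapReduce through HCatalog’s read and write interfaces.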

HCatalog Interfaces for Apache Pig

This concept is understood better if we have a sound knowledge in Apache Pig.

There are two interfaces in Pig for loading and storing. No HCatalog-specific setup is required for these interfaces.

HCatLoader: HCatLoader is used with Pig scripts to read data from HCatalog-managed tables. HCatLoader is implemented on top of HCatInputFormat. We can indicate which partitions to scan by immediately following the load statement with a partition filter statement.

Syntax for loading data from an HCatalog-managed table:

A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();

We must specify the table name in single quotes: LOAD ‘tablename’. If we are using a non-default database, then we must specify the input as ‘dbname.tablename’.

The Hive metastore lets us create tables without specifying a database. If the table is created in this way, then the database name is ‘default’ and is not required when specifying the table for HCatLoader.
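Putting the pieces above together, here is a sketch of reading a single partition with HCatLoader (the table name web_logs and partition key dt are hypothetical):

```pig
-- Load an HCatalog-managed table from the 'default' database.
A = LOAD 'web_logs' USING org.apache.hcatalog.pig.HCatLoader();

-- A partition filter immediately after the load lets HCatalog
-- scan only the matching partitions instead of the whole table.
B = FILTER A BY dt == '20140101';
```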

HCatStorer: HCatStorer is used with Pig scripts to write data to HCatalog-managed tables. HCatStorer accepts a table to write to and, optionally, a specification of partition keys to create a new partition. We can write to a single partition by specifying the partition key(s) and value(s) in the STORE clause, and we can write to multiple partitions if the partition key(s) are columns in the data being stored. HCatStorer is implemented on top of HCatOutputFormat.

Syntax for the storing operation:

my_processed_data = ..

STORE my_processed_data INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();

We must specify the table name in single quotes: STORE … INTO ‘tablename’. Both the database and the table must be created prior to running your Pig script. If we are using a non-default database, then we must specify our output as ‘dbname.tablename’.

The Hive metastore lets us create tables without specifying a database. If we have created tables in this way, then the database name is ‘default’ and we do not need to specify the database name in the store statement.

For the USING clause, we can pass a string argument that represents key/value pairs for partitions. This argument is mandatory when we are writing to a partitioned table and the partition column is not in the output columns. The values for partition keys should NOT be quoted.
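For example (table name and partition spec hypothetical), writing into one partition via that string argument, with the partition value left unquoted as noted above:

```pig
-- 'dt=20140101' names the partition to write;
-- the value 20140101 is not quoted inside the argument.
STORE my_processed_data INTO 'web_logs_summary'
    USING org.apache.hcatalog.pig.HCatStorer('dt=20140101');
```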

Uses of HCatalog

  • Enabling the Right Tool for the Right Job: The majority of heavy Hadoop users do not use a single tool for data processing. Often users and teams will begin with a single tool:  Hive, Pig, MapReduce, or another tool.  As their use of Hadoop deepens they will discover that the tool they chose is not optimal for the new tasks they are taking on.  Users who start with analytics queries using Hive discover they would like to use Pig for ETL processing or constructing their data models.  Users who start with Pig discover they would like to use Hive for analytics type queries.  While tools such as Pig and MapReduce do not require metadata, they can benefit from it when it is present.  Sharing a metadata store also enables users across tools to share data more easily.  A workflow where data is loaded and normalized using Map Reduce or Pig and then analysed via Hive is very common. When all these tools share one metastore users of each tool have immediate access to data created with another tool. No loading or transfer steps are required.
  • Capture Processing States to Enable Sharing: When used for analytics, users will discover information using Hadoop. Again, they will often use Hive, Pig and MapReduce to uncover information. The information is valuable, but typically only in the context of a larger analysis. With HCatalog you can publish results so they can be accessed by your analytics platform via REST. In this case, the schema defines the discovery. These discoveries are also useful to other data scientists. Often they will want to build on what others have created or use results as input into a subsequent discovery.
  • Integrate Hadoop with Everything: Hadoop as a processing and storage environment opens up a lot of opportunity for the enterprise; however, to fuel adoption it must work with and augment existing tools.  Hadoop should serve as input into our analytics platform or integrate with our operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn an entirely new toolset. REST services opens up the platform to the enterprise with a familiar API and SQL-like language. Enterprise data management systems use HCatalog to more deeply integrate with the Hadoop platform. By tying in more closely they can hide complexity from users and create a better experience. A great example of this is the SQL-H integration from Teradata Aster. SQL-H queries the structure of data stored in HCatalog and exposes that back through to Aster enabling Aster to access just the relevant data stored within the Hortonworks Data Platform.


HCatalog allows developers to share data and metadata across internal Hadoop tools such as Hive, Pig, and MapReduce. It allows them to create applications without being concerned how or where the data is stored, and insulates users from schema and storage format changes.  It is a repository for schema that can be referred to in these programming models so that we don’t have to explicitly type our structures in each program. It provides a command line tool for users who do not use Hive to operate on the metastore with Hive DDL statements.  It also provides a notification service so that workflow tools, such as Oozie, can be notified when new data becomes available in the warehouse.



Thank you …!!!

Apache Pig

Hello Everyone..!!

In this blog I’ll be taking you through Apache Pig: why it should be used, and a few of its basic concepts.

So let’s begin with..

What is Apache Pig?

Apache Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.

Pig works with data from many sources, including structured and unstructured data, and stores the results in HDFS. Pig scripts are translated into a series of MapReduce jobs that run on the Apache Hadoop cluster. Using the Pig Latin scripting language, operations like ETL (Extract, Transform and Load), ad hoc data analysis and iterative processing can be easily achieved.

Pig originated as a Yahoo Research initiative for creating and executing map-reduce jobs on very large data sets. In 2007 Pig became an open source project of the Apache Software Foundation.

Why Apache Pig..?

Programmers who are not well versed in Java used to struggle working with Hadoop, especially while performing MapReduce tasks. Apache Pig became a boon for all such programmers.

  • Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex codes in Java.
  • Apache Pig uses multi-query approach, thereby reducing the length of codes.
  • Pig Latin is an SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
  • Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.

Pig Architecture

The language used to analyse data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.

To perform a particular task, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, or embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the programmer’s job easy.


The Pig architecture consists of the Pig Latin interpreter, which is executed on the client machine. It takes Pig Latin scripts, converts them into a series of MR (MapReduce) jobs, then executes those MR jobs and saves the output into HDFS. In between, it performs different operations such as parsing, compiling, optimizing, and planning the execution of the data that comes into the system.

Parser: Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.

Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine: Finally the MapReduce jobs are submitted to Hadoop in a sorted order. These MapReduce jobs are executed on Hadoop producing the desired results.

Job Execution Flow in Apache Pig

The scripts developed by the programmer are stored in the local file system. When we submit a Pig script, it comes in contact with the Pig Latin compiler, which splits the task and runs a series of MR jobs; meanwhile, the Pig compiler fetches the data from HDFS (i.e., the input file present there). After the MR jobs have run, the output file is stored in HDFS.

Components of Apache Pig :

Pig is a scripting language for exploring huge data sets, of the order of gigabytes or terabytes, very easily. Pig provides an engine for executing data flows in parallel on Hadoop.

Pig is made up of two things:

  1. Pig Latin: It is the language layer that enables SQL-like queries to be performed on distributed datasets within Hadoop applications.
  2. Pig Engine: It is the execution environment used to run Pig Latin programs. It has two modes:

Local Mode: We can execute a Pig script against the local file system. In this case we do not need to store the data in HDFS; instead we can work with data stored in the local file system itself. Parallel mapper execution is not possible, because earlier versions of Hadoop are not thread safe. Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local file system. To run in local mode, we pass the local option to the -x or -exectype parameter when starting Pig. This starts the interactive shell called Grunt:

          $ pig -x local


MapReduce Mode: MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin statements to process data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS. Pig translates the queries into MapReduce jobs and runs the job on the hadoop cluster. This cluster can be pseudo- or fully distributed cluster. First we need to check the compatibility of the Pig and Hadoop versions being used.

          $ pig -x mapreduce



Execution Mechanisms used by Pig Engine

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.

  • Interactive Mode (Grunt shell): We can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump operator).
  • Batch Mode (Script): We can run Apache Pig in batch mode by writing the Pig Latin script in a single file with the .pig extension.
  • Embedded Mode (UDF): Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, and using them in our script.


After knowing the architecture and its components, we will further go ahead with the installation procedure of pig on Ubuntu.

Setting up Pig

 Before installing Apache Pig, it is essential that we have Hadoop and Java installed on our system.

  • Unpack the tarball in the directory of your choice, using the following command

          $ cd hadoop/apache-pig

          $ tar -xzvf pig-0.14.0.tar.gz

  • Set the environment variable PIG_HOME to point to the installation directory for convenience:

          $ export PIG_HOME=/home/hduser/hadoop/apache-pig


  • Set PIG_HOME in .bashrc so it will be set every time you log in. Add the following lines to it:

          export PIG_HOME=/home/hduser/hadoop/apache-pig

          export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$PATH

  • Verify the installation

          $ pig -version

Pig Latin Data Model

Pig’s data types make up the data model for how Pig thinks of the structure of the data it is processing. With Pig, the data model gets defined when the data is loaded. Any data we load into Pig is going to have a particular schema and structure. Pig needs to understand that structure, so when we do the loading, the data automatically goes through a mapping.

The Pig data model is rich enough to handle almost anything thrown its way, including table-like structures and nested hierarchical data structures. In general terms, though, Pig data types can be broken into two categories: scalar types and complex types. Scalar types contain a single value, whereas complex types contain other types, such as the Tuple, Bag and Map types.


Atom: Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.

Example − ‘raja’ or ‘30’

Tuple: A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.

Example − (Raja, 30)

Bag: A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as inner bag.

Example − {Raja, 30, {9848022338,,}}

Map: A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by ‘[]’. Example − [name#Raja, age#30]
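As a rough analogy (plain Python, not Pig code), these complex types map naturally onto Python's built-in containers; the names and values below are illustrative only:

```python
# Illustrative Python analogy for Pig's complex types (not actual Pig code).

# Tuple: an ordered set of fields, like a row -> Python tuple
person = ("Raja", 30)

# Bag: an unordered collection of tuples, possibly with different
# numbers of fields per tuple -> Python list of tuples
bag = [("Raja", 30), ("Mohammad", 45)]

# Map: key-value pairs with chararray (string) keys -> Python dict
record = {"name": "Raja", "age": 30}

# A bag can itself be a field inside a tuple (an "inner bag"):
with_inner_bag = ("Raja", 30, [("9848022338",)])

print(person[0])       # field access by position
print(record["name"])  # map lookup by key
```

The point of the analogy is the nesting: a tuple can hold a bag, a bag holds tuples, and a map holds values of any type.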


The value of all these types can also be null. The semantics for null are similar to those used in SQL. The concept of null in Pig means that the value is unknown. Nulls can show up in the data in cases where values are unreadable or unrecognizable — for example, if you were to use a wrong data type in the LOAD statement.

Null could be used as a placeholder until data is added or as a value for a field that is optional.

Pig Latin supports the following primitive and complex data types:

  1. int − Represents a signed 32-bit integer. Example: 8
  2. long − Represents a signed 64-bit integer. Example: 5L
  3. float − Represents a signed 32-bit floating point. Example: 5.5F
  4. double − Represents a 64-bit floating point. Example: 10.5
  5. chararray − Represents a character array (string) in Unicode UTF-8 format. Example: ‘tutorials point’
  6. bytearray − Represents a byte array (blob).
  7. boolean − Represents a Boolean value. Example: true/false
  8. datetime − Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
  9. biginteger − Represents a Java BigInteger. Example: 60708090709
  10. bigdecimal − Represents a Java BigDecimal. Example: 185.98376256272893883
  11. tuple − An ordered set of fields. Example: (raja, 30)
  12. bag − A collection of tuples. Example: {(raju,30),(Mohammad,45)}
  13. map − A set of key-value pairs. Example: [‘name’#’Raju’, ‘age’#30]


Pig Latin has a simple syntax with powerful semantics to carry out two primary operations: accessing and transforming data.

In a Hadoop context, accessing data means allowing developers to load, store, and stream data, whereas transforming data means taking advantage of Pig’s ability to group, join, combine, split, filter, and sort data. The table gives an overview of the operators associated with each operation.

Loading and Storing
  • LOAD − To load the data from the file system (local/HDFS) into a relation.
  • STORE − To save a relation to the file system (local/HDFS).

Filtering
  • FILTER − To remove unwanted rows from a relation.
  • DISTINCT − To remove duplicate rows from a relation.
  • FOREACH, GENERATE − To generate data transformations based on columns of data.
  • STREAM − To transform a relation using an external program.

Grouping and Joining
  • JOIN − To join two or more relations.
  • COGROUP − To group the data in two or more relations.
  • GROUP − To group the data in a single relation.
  • CROSS − To create the cross product of two or more relations.

Sorting
  • ORDER − To arrange a relation in a sorted order based on one or more fields (ascending or descending).
  • LIMIT − To get a limited number of tuples from a relation.

Combining and Splitting
  • UNION − To combine two or more relations into a single relation.
  • SPLIT − To split a single relation into two or more relations.

Diagnostic Operators
  • DUMP − To print the contents of a relation on the console.
  • DESCRIBE − To describe the schema of a relation.
  • EXPLAIN − To view the logical, physical, or MapReduce execution plans to compute a relation.
  • ILLUSTRATE − To view the step-by-step execution of a series of statements.

Data Flow in Pig Programming:

A Pig program consists of three parts: loading, transforming, and dumping or storing the data.

Loading: As is the case with all the Hadoop features, the objects that are being worked on by Hadoop are stored in HDFS. In order for a Pig program to access this data, the program must first tell Pig what file (or files) it will use, and that’s done through the LOAD ‘data file’ command (where ‘data file’ specifies either an HDFS file or directory). If a directory is specified, all the files in that directory will be loaded into the program. If the data is stored in a file format that is not natively accessible to Pig, you can optionally add the USING function to the LOAD statement to specify a user-defined function that can read in and interpret the data.

Transforming: The transformation logic is where all the data manipulation happens. Here we can FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more.

Dumping and Storing: If we don’t specify the DUMP or STORE command, the results of a Pig program are not generated. We would typically use the DUMP command, which sends the output to the screen, when we are debugging our Pig programs. When we go into production, we simply change the DUMP call to a STORE call so that any results from running our programs are stored in a file for further processing or analysis. Note that we can use the DUMP command anywhere in our program to dump intermediate result sets to the screen, which is very useful for debugging purposes.
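As a conceptual sketch (plain Python, not Pig), the load/transform/dump flow looks like this; the sample records and field positions are invented for illustration:

```python
# Conceptual sketch of a Pig data flow: LOAD -> FILTER -> GROUP -> DUMP.
from collections import defaultdict

# "LOAD": parse comma-delimited lines into (name, age, city) records
lines = ["Raja,30,Hyderabad", "Mohammad,45,Chennai", "Raju,25,Hyderabad"]
records = [tuple(line.split(",")) for line in lines]

# "FILTER": keep only rows where age > 26
filtered = [r for r in records if int(r[1]) > 26]

# "GROUP": group the remaining records by city
grouped = defaultdict(list)
for r in filtered:
    grouped[r[2]].append(r)

# "DUMP": print the grouped relation to the console
for city, rows in sorted(grouped.items()):
    print(city, rows)
```

In real Pig, each of these stages is a single Pig Latin statement, and nothing executes until a DUMP or STORE forces evaluation.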


Given below is a Pig Latin statement, which loads data to Apache Pig.

grunt> student_data = LOAD 'student_data.txt' USING PigStorage(',') AS
       (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

The optional USING clause defines how to map the data structure within the file to the Pig data model; in this case, the PigStorage() function, which parses delimited text files. (This part of the USING clause is often referred to as a LoadFunc and works in a fashion similar to a custom deserializer.)

The optional AS clause defines a schema for the data that is being mapped. If we don’t use an AS clause, we’re basically telling the default LoadFunc to expect a plain text file that is tab delimited. With no schema provided, fields must be referenced by position, because no names are defined.

Using an AS clause means that we have a schema in place at read time for our text files, which allows users to get started quickly and provides agile schema modeling and flexibility, so that we can add more data to our analytics.
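What PigStorage(',') together with the AS clause does can be approximated in a few lines of Python: split each line on the delimiter and cast the fields according to a declared schema. The sample record and the helper name `load_line` below are invented; the field names and types mirror the statement above:

```python
# Rough Python analogy of PigStorage(',') combined with an AS schema.
# The schema mirrors the example: (id:int, firstname:chararray, ...).
schema = [("id", int), ("firstname", str), ("lastname", str),
          ("phone", str), ("city", str)]

def load_line(line, schema):
    """Split a delimited line and map fields to named, typed values."""
    fields = line.split(",")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}

row = load_line("1,Rajiv,Reddy,9848022337,Hyderabad", schema)
print(row["id"], row["city"])  # fields are now addressable by name
```

Without the schema, the equivalent would be a bare list of strings, addressable only by position, which is exactly the trade-off described above.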

Features of Apache Pig:

  • Pig Latin is a procedural data-flow language used for programming.
  • It can handle all kinds of data, structured as well as unstructured.
  • Pig’s multi-query approach lets you combine many operations in a single flow, reducing the number of times the data is scanned.
  • It provides a rich set of operators for filtering, joining, sorting, etc.
  • It provides complex data types such as tuples, bags, and maps.
  • It is generally used by researchers and programmers.
  • It operates on the client side of any cluster.
  • It does not have a dedicated metadata database; schemas and data types are defined in the script itself.
  • Through Pig’s User Defined Function (UDF) facility, you can execute code written in languages such as JRuby, Python, and Java.

Conclusion: In this blog, we have seen that Pig is a very powerful scripting language based on the Hadoop ecosystem and MapReduce programming. It can be used to process large volumes of data in a distributed environment. Pig statements and scripts are similar to SQL statements, so developers can use it without focusing much on the underlying mechanism. Through the User Defined Functions (UDF) facility in Pig, Pig can invoke code in many languages like JRuby, Python and Java. We can also embed Pig scripts in other languages. The result is that we can use Pig as a component to build larger and more complex applications that tackle real business problems.




Thank you..!!


Hello Everyone,


Big Data has been the buzzword circulating in the IT industry since 2008. You may already know this; if not, let me tell you that the amount of data being generated by social networks, manufacturing, retail, stocks, telecom, insurance, banking, and health care industries is way beyond our imagination.

Before the advent of Hadoop, storage and processing of big data was a big challenge. But now that Hadoop is available, companies have realized the business impact of Big Data and how understanding this data will drive the growth. For example:

  • Banking sectors have a better chance to understand loyal customers, loan defaulters and fraud transactions.
  • Retail sectors now have enough data to forecast demand.
  • Manufacturing sectors need not depend on the costly mechanisms for quality testing. Capturing sensors data and analysing it would reveal many patterns.
  • E-Commerce, social networks can personalize the pages based on customer interests.
  • Stock markets generate a humongous amount of data; correlating it from time to time reveals valuable insights.

Big Data has many useful and insightful applications.

Hadoop is the straight answer for processing Big Data. Hadoop ecosystem is a combination of technologies which have proficient advantage in solving business problems.

In this blog post, I’ll give an overview of that ecosystem. If you’re new to Hadoop, this could be an easy introduction. If you are steeped in some Hadoop technology, this post could give you the opportunity to look around the ecosystem for things you aren’t specialized in.

We all know Hadoop is a framework which deals with Big Data, but unlike any other framework it is not a simple framework; it has its own family of components for processing different things, all tied together under one umbrella called the Hadoop Ecosystem.

Hadoop Ecosystem can be divided into various categories as mentioned below :

  1. Core Hadoop
  2. Data Access
  3. Data Storage
  4. Interaction-Visualization-Execution-Development
  5. Data Intelligence
  6. Data Serialization
  7. Data Integration
  8. Management, Monitoring and Orchestration


Let’s try to understand each component in detail:


1. Core Hadoop

Core Hadoop, the first category, comprises four core components −

Hadoop Common

Hadoop Common is the Apache Foundation’s pre-defined set of utilities and libraries that can be used by other modules within the Hadoop ecosystem. For example, if HBase and Hive want to access HDFS, they need to make use of the Java archives (JAR files) that are stored in Hadoop Common.

Hadoop Distributed File System (HDFS)

The default big data storage layer for Apache Hadoop is HDFS. Hadoop Distributed File System (HDFS) is a distributed file system. It is built to make sure data is local to the machine processing it. HDFS is the “Secret Sauce” of Apache Hadoop components as users can dump huge datasets into HDFS and the data will sit there nicely until the user wants to leverage it for analysis. It is the key tool for managing Big Data and supporting analytic applications in a scalable, cheap and rapid way. Hadoop is usually used on low-cost commodity machines, where server failures are fairly common. To accommodate a high failure environment the file system is designed to distribute data throughout different servers in different server racks making the data highly available.

HDFS operates on a Master-Slave architecture model where the NameNode acts as the master node for keeping a track of the storage cluster and the DataNode acts as a slave node summing up to the various systems within a Hadoop cluster. HDFS component creates several replicas of the data block to be distributed across different clusters for reliable and quick data access.
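The replication idea can be sketched in a few lines of Python; the node names and the simple round-robin placement policy below are made up for illustration (real HDFS placement is rack-aware and more sophisticated):

```python
# Toy sketch of HDFS-style block replication: each block is copied to
# several distinct nodes so a single failure does not lose data.
import itertools

nodes = ["node1", "node2", "node3", "node4"]
REPLICATION = 3  # HDFS's default replication factor
blocks = ["blk_001", "blk_002", "blk_003"]

placement = {}
ring = itertools.cycle(nodes)
for block in blocks:
    targets = set()
    while len(targets) < REPLICATION:   # pick 3 distinct nodes per block
        targets.add(next(ring))
    placement[block] = sorted(targets)

for block, copies in placement.items():
    print(block, "->", copies)
```

The NameNode plays the role of the `placement` map here: it tracks which DataNodes hold each replica and re-replicates blocks when a node dies.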

Nokia deals with more than 500 terabytes of unstructured data and close to 100 terabytes of structured data. Nokia uses HDFS for storing all the structured and unstructured data sets as it allows processing of the stored data at a petabyte scale.


MapReduce

Hadoop MapReduce is a Java-based implementation of the programming model created by Google, in which the actual data from the HDFS store gets processed efficiently. MapReduce is the distributed, parallel computing programming model for Hadoop.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.
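Word count is the canonical example of this decomposition. The sketch below simulates the map, shuffle/sort, and reduce phases in a single Python process, purely to show the model; a real job would distribute these steps across the cluster:

```python
# Minimal single-process simulation of the MapReduce word-count pattern.
from itertools import groupby

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for one key
    return (word, sum(counts))

lines = ["hadoop is fast", "hadoop is scalable"]

# Map, then shuffle/sort by key (the framework does this between phases)
pairs = sorted(kv for line in lines for kv in mapper(line))

# Reduce each group of identical keys
result = dict(reducer(k, (c for _, c in g))
              for k, g in groupby(pairs, key=lambda kv: kv[0]))
print(result)  # {'fast': 1, 'hadoop': 2, 'is': 2, 'scalable': 1}
```

Scaling this to thousands of machines changes the configuration, not the mapper and reducer code, which is exactly the appeal described above.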

Skybox has developed an economical imaging satellite system for capturing videos and images from any location on earth. Skybox uses Hadoop to analyse the large volumes of image data downloaded from the satellites. The image processing algorithms of Skybox are written in C++. Busboy, a proprietary framework of Skybox, makes use of the Java-based MapReduce framework.


YARN

YARN, which stands for Yet Another Resource Negotiator, is the cluster resource-management layer of Hadoop and the successor of the resource-management role in the original Hadoop MapReduce. It is a great enabler for dynamic resource utilization on the Hadoop framework, as users can run various Hadoop applications without having to bother about increasing workloads.

YARN combines a central resource manager that reconciles the way applications use Hadoop system resources with node manager agents that monitor the processing operations of individual cluster nodes.  Running on commodity hardware clusters, Hadoop has attracted particular interest as a staging area and data store for large volumes of structured and unstructured data intended for use in analytics applications. Separating HDFS from MapReduce with YARN makes the Hadoop environment more suitable for operational applications that can’t wait for batch jobs to finish.

Yahoo has close to 40,000 nodes running Apache Hadoop with 500,000 MapReduce jobs per day, taking 230 extra compute-years of processing every day. YARN at Yahoo helped them increase the load on the most heavily used Hadoop cluster to 125,000 jobs a day, compared to 80,000 jobs a day, an increase of over 50%.

The above-listed core components of Apache Hadoop form the basic distributed Hadoop framework. Let me tell you, there are several other Hadoop components that form an integral part of the Hadoop ecosystem, with the intent of enhancing the power of Apache Hadoop in some way or the other, like providing better integration with databases, making Hadoop faster, or developing novel features and functionalities.

Here are some of the eminent Hadoop components used by enterprises extensively –

2. Data Access Components of Hadoop Ecosystem 

Under this category, we have Hive, Pig, HCatalog and Tez, which are explained below:


Hive

Hive is a data warehouse management and analytics system that is built for Hadoop. It was initially developed by Facebook, but soon after became an open-source project and has been used by many other companies ever since.

Apache Hive supports analysis of large datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3 Filesystem. It provides an SQL-like language called HiveQL with schema on read and transparently converts queries to map/reduce, Apache Tez and Spark jobs.

Hive also enables data serialization/deserialization and increases flexibility in schema design by including a system catalog called the Hive Metastore. By default, it stores metadata in an embedded Apache Derby database; other client/server databases like MySQL can optionally be used. It supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a table in a columnar fashion).

Hive simplifies Hadoop at Facebook with the execution of 7500+ Hive jobs daily for Ad-hoc analysis, reporting and machine learning.


Pig

Pig is a convenient tool developed by Yahoo for analysing huge data sets efficiently and easily. It provides a programming language that simplifies the common tasks of working with Hadoop, like loading data, expressing transformations on the data, and storing the final results. Pig’s built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.

Similar to Hive, Pig also deals with structured data, using the Pig Latin language. It is an alternative for programmers who love scripting and don’t want to use Java, Python, or SQL to process data. A Pig Latin program is made up of a series of operations, or transformations, applied to the input data, which run as MapReduce programs in the backend to produce output. The most outstanding feature of Pig programs is that their structure is open to considerable parallelization, making it easy to handle large data sets.

The personal healthcare data of an individual is confidential and should not be exposed to others; this information should be masked to maintain confidentiality. But healthcare data is so huge that identifying and removing personal healthcare data is crucial. Apache Pig can be used under such circumstances to de-identify health information.

Hive & Pig overlap:

You would choose Pig in scenarios where you are importing and transforming data and you would like to be able to see the intermediate results.

Apache Pig is somewhat similar to Apache Hive, though some users say that it is easier to transition to Hive than to Pig if you come from an RDBMS SQL background. However, both platforms have a place in the market. Hive is more optimised to run standard queries and is easier to pick up, whereas Pig is better for tasks that require more customisation. Where Hive is used for structured data, Pig excels in transforming semi-structured and unstructured data.


HCatalog

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Pig, MapReduce) to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored: RCFile format, text files, SequenceFiles, or ORC files.

HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed into databases. Tables can also be partitioned on one or more keys. For a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values).


Tez

Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. It is an application framework that allows a complex directed acyclic graph (DAG) of tasks to process data. It is built on top of YARN and is a substitute for MapReduce in some scenarios.

Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. It is not meant directly for end-users – in fact it enables developers to build end-user applications with much better performance and flexibility.

However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as Machine Learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.

3. Data Storage Component of Hadoop Ecosystem

HBase and Cassandra come under this category.


HBase

HBase (Hadoop Database) is a non-relational (NoSQL) database that runs on top of HDFS. Based on Google’s Bigtable, it was created for large tables that have billions of rows and millions of columns, with fault-tolerance capability and horizontal scalability. Hadoop itself can perform only batch processing, with data accessed in a sequential manner; for random access to huge data, HBase is used.

HBase is a column-oriented database that uses HDFS for underlying storage of data. HBase supports random reads and also batch computations using MapReduce. With the HBase NoSQL database, an enterprise can create large tables with millions of rows and columns on commodity hardware machines. The best practice is to use HBase when there is a requirement for random ‘read or write’ access to big datasets.

Facebook is one of the largest users of HBase, with its messaging platform built on top of HBase in 2010. HBase is also used by Facebook for streaming data analysis, its internal monitoring system, the Nearby Friends feature, search indexing, and scraping data for its internal data warehouses.


Cassandra

Apache Cassandra is a column-oriented NoSQL data store which offers scalability and high availability without compromising on performance, making it the right choice when you need both.

It offers capabilities that relational databases and other NoSQL databases simply cannot match, such as continuous availability, linear-scale performance, operational simplicity, and easy data distribution across multiple data centers.

Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Its support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for de-normalization and materialized views, and powerful built-in caching.

4. Data Interaction-Visualization-Execution-Development


Spark

Apache Spark is actually complementary to Hadoop. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark Core is the foundation of the overall project. Spark combines SQL, streaming, and complex analytics seamlessly in the same application to handle a wide range of data processing scenarios. It runs on top of Hadoop or Mesos, standalone, or in the cloud, and can access diverse data sources such as HDFS, Cassandra, HBase, or S3. It is a lightning-fast cluster computing technology, designed for fast computation. It builds on the ideas of MapReduce and extends the model to efficiently support more types of computations, including interactive queries and stream processing.

The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

5. Data Intelligence Components of Hadoop Ecosystem


Mahout

Apache Mahout is a platform for running scalable machine learning algorithms leveraging Hadoop distributed computing. Mahout is an open-source machine learning library written in Java. It aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. In its current incarnation, these scalable machine learning implementations in Mahout are written in Java, and some portions are built upon Apache’s Hadoop distributed computation project.

Mahout is not simply a collection of pre-existing algorithms. Many machine learning algorithms are intrinsically non-scalable: given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion.

Every organization’s data are diverse and particular to their needs. However, there is much less diversity in the kinds of analyses performed on that data. The Mahout project is a library of Hadoop implementations of common analytical computations. Use cases include user collaborative filtering, user recommendations, clustering and classification.

6. Data Serialization Components of Hadoop Ecosystem

Thrift and Avro, which are meant for serialization, come under this category.

Apache Thrift

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.

It allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages. Instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business.

Thrift includes a complete stack for creating clients and servers. The top part is generated code from the Thrift definition. The services generate from this file client and processor code. In contrast to built-in types, created data structures are sent as result in generated code.


Apache Avro

Avro is a remote procedure call (RPC) and data serialization framework developed within Apache’s Hadoop project: a serialization system for efficient, cross-language RPC and persistent data storage. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.

In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro.

7. Data Integration Components of Hadoop Ecosystem

This category includes Sqoop, Flume and Chukwa.


Sqoop

Apache Sqoop handles Hadoop data movement involving both relational and non-relational data sources. It provides bi-directional data transfer between Hadoop HDFS and your favourite relational database. For example, you might be storing your app data in a relational store such as Oracle; when you want to scale your application with Hadoop, you can migrate the Oracle database data to Hadoop HDFS using Sqoop.

This component is used for importing data from external sources into related Hadoop components like HDFS, HBase or Hive. It can also be used for exporting data from Hadoop to other external structured data stores. Sqoop parallelizes data transfer, mitigates excessive loads, and enables efficient data imports, data analysis, and fast data copies.

When we import structured data from an RDBMS table to HDFS, a file is created in HDFS, which we can process either with a MapReduce program directly or with Hive or Pig. Similarly, after processing data in HDFS, we can store the processed structured data back to another table in the RDBMS by exporting it through Sqoop.

An online marketer uses the Sqoop component of the Hadoop ecosystem to enable transmission of data between Hadoop and the IBM Netezza data warehouse, and pipes the results back into Hadoop using Sqoop.


Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms, and it uses a simple extensible data model that allows for online analytic applications.

Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the Hadoop Distributed File System (HDFS). Typical sources of these streams are application logs, sensor and machine data, geo-location data and social media. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive. Or they can feed business dashboards served ongoing data by Apache HBase.

A Flume Twitter source, for example, connects through the streaming API and continuously downloads tweets (called events). These tweets are converted into JSON format and sent to downstream Flume sinks for further analysis of tweets and retweets, for instance to engage users on Twitter.


Chukwa

Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. It is an open-source data collection system for monitoring large distributed systems. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework, and inherits Hadoop’s scalability and robustness. It also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results, in order to make the best use of the collected data.

It aims to provide a flexible and powerful platform for distributed data collection and rapid data processing. The goal is to produce a system that’s usable today, but that can be modified to take advantage of newer storage technologies (HDFS appends, HBase, etc.) as they mature. In order to maintain this flexibility, Chukwa is structured as a pipeline of collection and processing stages, with clean and narrow interfaces between stages. This will facilitate future innovation without breaking existing code.

8. Management, Monitoring and Orchestration Components of Hadoop Ecosystem


Oozie

Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs. It is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store:

  • Workflow definitions
  • Currently running workflow instances, including instance states and variables

It is a workflow scheduler system to manage Hadoop jobs: a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs. Hadoop deals with Big Data, and a programmer often wants to run many jobs in a sequential manner, such that the output of job A is the input to job B, the output of job B is the input to job C, and the final output is the output of job C. To automate this sequence we need a workflow, and to execute it we need an engine; Oozie is that engine.
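The chaining pattern described above is easy to sketch; the toy Python below simply runs made-up "jobs" strictly in sequence, which is the dependency structure a real Oozie workflow would declare in XML:

```python
# Toy illustration of the sequential workflow Oozie automates:
# job A's output is job B's input, and so on down the chain.
def job_a(data):
    return [x * 2 for x in data]       # pretend transform step

def job_b(data):
    return [x for x in data if x > 4]  # pretend filter step

def job_c(data):
    return sum(data)                   # pretend aggregation step

workflow = [job_a, job_b, job_c]

result = [1, 2, 3, 4]
for job in workflow:                   # run jobs strictly in sequence
    result = job(result)

print(result)  # final output is job C's output: 14
```

Oozie adds what this sketch lacks: persistence of workflow state, scheduling, retries, and the ability to run each "job" as a real MapReduce or Pig action on the cluster.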

The American video game publisher Riot Games uses Hadoop and the open source tool Oozie to understand the player experience.


Zookeeper

Zookeeper provides centralized services for Hadoop cluster configuration management, synchronization, and group services. For example, think about how a global configuration file works in a web application; Zookeeper is like that configuration file, but at a much higher level. It provides primitives, such as distributed locks, that can be used for building highly scalable applications, and it is used to manage synchronization for the cluster.

With a growing family of services running as part of a Hadoop cluster, there’s a need for coordination and naming services. As computing nodes can come and go, members of the cluster need to synchronize with each other, know where to access services, and know how they should be configured. This is the purpose of Zookeeper.


Ambari

Ambari is one of the most useful tools you’ll use if you’re administering a Hadoop cluster; it allows administrators to install, manage, and monitor Hadoop clusters through a simple web interface. It provides an easy-to-follow wizard for setting up a Hadoop cluster of any size.

Ambari makes Hadoop management simpler by providing a consistent, secure platform for operational control. Ambari provides an intuitive Web UI as well as a robust REST API, which is particularly useful for automating cluster operations. With Ambari, Hadoop operators get the following core benefits:

  • Simplified Installation, Configuration and Management
  • Centralized Security Setup
  • Full Visibility into Cluster Health
  • Highly Extensible and Customizable

By now, you must have realized that Hadoop is a very rich ecosystem. Hadoop has gained its popularity through its ability to store, analyze and access large amounts of data quickly and cost-effectively across clusters of commodity hardware. It would not be wrong to say that Apache Hadoop is really a collection of several components rather than a single product.

Thanks for reading.


Poisson regression

Hello everyone, this blog is about a regression technique called Poisson regression. Poisson regression is used to model count variables.

Let me explain count data first..!! It is a type of data in which observations can take only non-negative integer values, and these integers arise from counting rather than ranking. A count variable is an individual piece of count data.

An example will help you understand where Poisson regression can be used: estimating the number of people in line in front of you at the grocery store. Predictors may include the number of items currently offered at a special discounted price and whether a special event (e.g., a holiday, a big sporting event) is a few days away.

The difference between Poisson regression and linear regression is that linear regression assumes the true values of the dependent variable are distributed normally around the expected value and can take any real value: positive, negative or fractional.

Poisson regression, in contrast, models counts of events; typical scenarios are the number of orders placed by a customer in a time frame, or the number of visits to a website by an individual user.
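Formally, Poisson regression assumes each count y is drawn from a Poisson distribution with mean λ, and models log(λ) as a linear function of the predictors. As a quick sketch of the distribution itself, here is the Poisson probability mass in Python (the grocery-queue numbers are made up for illustration):

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean count is lam:
    P(Y = k) = lam^k * exp(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Suppose the line at the store holds 2 people on average (an assumed value).
# Probability the line is empty:
print(poisson_pmf(0, 2.0))   # = exp(-2), about 0.135
```

Note that the distribution is defined only on non-negative integers, which is exactly why it suits count data where linear regression does not.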

Now let’s discuss how to create a Poisson regression model using R.

I will use data on the distribution of 3605 individual trees of Beilschmiedia pendula (a tree species) in a 500 x 1000 m forest plot in Panama. The dataset is freely available as part of R’s spatstat library.

Step 1:  loading the data and plotting it for better understanding.





library(spatstat)  # provides the bei dataset
library(raster)    # used below for gridding the data

plot(bei$x, bei$y, pch = 19, cex = 0.5,
     main = "Spatial distribution of individuals in the 50-ha Barro Colorado plot",
     xlab = "x coordinate [m]", ylab = "y coordinate [m]", frame = FALSE)

abline(h = 0, col = "black")

abline(h = 500, col = "black")

abline(v = 0, col = "black")

abline(v = 1000, col = "black")

The dataset comes with two environmental layers, elevation and slope. My goal is to model the density of tree individuals as a function of elevation; that is, I want to predict the number n of individuals per unit area. Hence, I will resample the data into a grid of 50 x 50 m:

elev <- raster(bei.extra[[1]])

ext <- extent(2.5, 1002.5, 2.5, 1002.5) # cropping the data so that they have exactly 500 x 1000 cells

elev <- crop(elev, ext)

elev50 <- aggregate(elev, fact = 10, fun = mean) # coarsening the 5 x 5 m elevation cells into a 50 x 50 m grid by taking means

xy <- data.frame(x = bei$x, y = bei$y) # fitting the point data into the 50 x 50 m grid

n50 <- rasterize(xy, elev50, fun = "count") # counting trees per grid cell

n50[is.na(n50[])] <- 0        # cells with no trees come back as NA; replace the NA values by 0


Step 2: Initial data visualization.

plot(elev50[], n50[], cex = 1, pch = 19, col = "grey",
     ylab = "# of Individuals", xlab = "Mean Elevation [m]")


Step 3: Centering and standardization.

I find it necessary to center the variables (to mean 0) and standardize them (to variance 1) before fitting. Hence,

scale2 <- function(x) {

sdx <- sqrt(var(x))

meanx <- mean(x)

return((x - meanx)/sdx)

}

elev50 <- scale2(elev50[])

pow.elev50 <- elev50^2 # quadratic term for the model

n50 <- n50[]

Step 4: Fitting the model using glm()

Fitting the model with the glm() function is easy. You just need to specify that the data are drawn from a Poisson distribution and that the mean is modeled in logarithmic space (the log link, which is the default for the Poisson family).

m.glm <- glm(n50 ~ elev50 + pow.elev50, family = "poisson")
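For intuition about what glm() does internally for a Poisson family, here is a minimal Python sketch of iteratively reweighted least squares (IRLS) for a log-link Poisson model. The data are simulated and the coefficients 0.5 and 1.2 are made up for illustration; this is not the bei data.

```python
import numpy as np

# Simulate counts whose log-mean is linear in one predictor (assumed values).
rng = np.random.default_rng(42)
n = 5000
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])      # design matrix: intercept + predictor
beta_true = np.array([0.5, 1.2])          # illustrative "true" coefficients
y = rng.poisson(np.exp(X @ beta_true))    # counts with log(mean) = X @ beta

# IRLS: repeatedly solve a weighted least-squares problem.
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta                        # linear predictor
    mu = np.exp(eta)                      # fitted means (log link)
    z = eta + (y - mu) / mu               # working response
    w = mu                                # working weights for the Poisson family
    XtWX = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (w * z))

print(beta)  # should be close to beta_true
```

Each iteration linearizes the log link around the current fit, which is why the same routine handles any link function in glm().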



Follow this link to understand the summary of m.glm.

This now lets me draw a smooth prediction curve over the data.

elev.seq <- seq(-3, 2, by = 0.05) # generates a sequence of standardized elevations

new.data <- data.frame(elev50 = elev.seq, pow.elev50 = elev.seq^2) # the sequence stored in a data frame (the name new.data is supplied here; it was missing in the original snippet)

new.predict <- predict(m.glm, newdata = new.data, type = "response") # generating the prediction curve

lines(elev.seq, new.predict, col = "red", lwd = 2) # plotting the predicted curve


With the help of this curve, the expected number of trees per 50 x 50 m cell can be predicted for any standardized elevation.!!


Intro to Poisson regression