Apache Pig

Hello everyone!

In this blog I’ll be taking you through Apache Pig: what it is, why it should be used, and a few of its basic concepts.

So let’s begin.

What is Apache Pig?

Apache Pig is a high-level scripting language used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.

Pig works with data from many sources, both structured and unstructured, and stores the results in HDFS. Pig scripts are translated into a series of MapReduce jobs that run on an Apache Hadoop cluster. Using the Pig Latin scripting language, operations like ETL (Extract, Transform and Load), ad hoc data analysis, and iterative processing can be achieved easily.

Pig originated as a Yahoo Research initiative for creating and executing MapReduce jobs on very large data sets. In 2007, Pig became an open source project of the Apache Software Foundation.

Why Apache Pig?

Programmers who are not comfortable with Java have traditionally struggled to work with Hadoop, especially when writing MapReduce tasks. Apache Pig is a boon for all such programmers.

  • Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex Java code (see the short word-count sketch after this list).
  • Apache Pig uses a multi-query approach, thereby reducing the length of code.
  • Pig Latin is a SQL-like language, so it is easy to learn Apache Pig when you are already familiar with SQL.
  • Apache Pig provides many built-in operators to support data operations like joins, filters, and ordering. In addition, it provides nested data types such as tuples, bags, and maps that are missing from MapReduce.
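
To get a feel for how compact this is compared to hand-written Java MapReduce, here is a minimal sketch of the classic word count in Pig Latin. The input file name wordcount_input.txt is a hypothetical placeholder:

          -- Load each line of the input file as a single chararray field
          lines = LOAD 'wordcount_input.txt' AS (line:chararray);
          -- Split every line into words, one word per record
          words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
          -- Group identical words together and count each group
          grouped = GROUP words BY word;
          counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
          -- Print the result to the console
          DUMP counts;

The equivalent hand-written Java MapReduce program would run to dozens of lines of boilerplate.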

Pig Architecture

The language used to analyse data in Hadoop with Pig is known as Pig Latin. It is a high-level data processing language that provides a rich set of data types and operators for performing various operations on the data.

To perform a particular task, programmers write a Pig script in the Pig Latin language and execute it using one of the execution mechanisms (interactive Grunt shell, batch script, or embedded mode). After execution, the script goes through a series of transformations applied by the Pig framework to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, which makes the programmer’s job easy.


The Pig architecture consists of the Pig Latin interpreter, which runs on the client machine. It takes Pig Latin scripts, converts them into a series of MapReduce jobs, executes those jobs, and saves the output in HDFS. In between, it performs several steps on the script as it passes through the system: parsing, compilation, optimization, and execution planning.

Parser: Initially, Pig scripts are handled by the parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph) representing the Pig Latin statements and logical operators: the logical operators of the script are the nodes, and the data flows are the edges.

Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.

Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed there, producing the desired results.
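
We can observe these stages ourselves with the EXPLAIN operator, which prints the logical, physical, and MapReduce plans computed for a relation. A minimal sketch in the Grunt shell, where data.txt is a hypothetical input file:

          grunt> A = LOAD 'data.txt' AS (f1:int, f2:int);
          grunt> B = FILTER A BY f1 > 0;
          grunt> EXPLAIN B;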

Job Execution Flow in Apache Pig

The scripts developed by the programmer are stored in the local file system. When we submit a Pig script, it is handled by the Pig Latin compiler, which splits the task into a series of MapReduce jobs; meanwhile, Pig fetches the input data from HDFS (the input files present there). After the MapReduce jobs have run, the output file is stored back in HDFS.

Components of Apache Pig:

Pig is a scripting language for exploring huge data sets, of the order of gigabytes or terabytes, very easily. Pig provides an engine for executing data flows in parallel on Hadoop.

Pig is made up of two things:

  1. Pig Latin: a language layer that enables SQL-like queries to be performed on distributed datasets within Hadoop applications.
  2. Pig Engine: an execution environment for running Pig Latin programs. It has two modes:

Local Mode: We can execute a Pig script against the local file system. In this case we don’t need to store the data in HDFS; instead, we work with data stored in the local file system itself. Parallel mapper execution is not possible here, because earlier versions of Hadoop are not thread safe. Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local file system. To run in local mode, we pass the value local to the -x (or -exectype) parameter when starting Pig. This starts the interactive shell called Grunt:

          $ pig -x local

          grunt>
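
Once the Grunt shell is up in local mode, a quick smoke test might look like the following. The path /tmp/input.txt is illustrative; any small local text file will do:

          grunt> lines = LOAD '/tmp/input.txt' AS (line:chararray);
          grunt> DUMP lines;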

MapReduce Mode: MapReduce mode is where we load and process data that exists in the Hadoop Distributed File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin statements to process data, a MapReduce job is invoked in the back-end to perform the operation on the data in HDFS. Pig translates the queries into MapReduce jobs and runs them on the Hadoop cluster, which can be a pseudo-distributed or fully distributed cluster. First, we need to check the compatibility of the Pig and Hadoop versions being used.

          $ pig -x mapreduce

           grunt>

Pig Execution

Execution Mechanisms used by Pig Engine

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.

  • Interactive Mode (Grunt shell): We can run Apache Pig in interactive mode using the Grunt shell. In this shell, we can enter Pig Latin statements and get the output (using the DUMP operator).
  • Batch Mode (Script): We can run Apache Pig in batch mode by writing the Pig Latin script in a single file with the .pig extension (a small sketch follows this list).
  • Embedded Mode (UDF): Apache Pig provides the facility of defining our own functions (User Defined Functions) in programming languages such as Java and using them in our scripts.
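
As an example of batch mode, we can put a few statements in a file, say filter_students.pig (the file name and its contents are hypothetical), and hand the whole script to Pig:

          -- filter_students.pig: keep only students from Pune and store the result
          students = LOAD 'students.txt' USING PigStorage(',')
                     AS (id:int, name:chararray, city:chararray);
          pune = FILTER students BY city == 'Pune';
          STORE pune INTO 'pune_students';

          $ pig -x local filter_students.pig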


Now that we know the architecture and its components, let us go ahead with the installation procedure for Pig on Ubuntu.

Setting up Pig

 Before installing Apache Pig, it is essential that we have Hadoop and Java installed on our system.

  • Unpack the downloaded tarball in the directory of your choice, using the following commands:

          $ cd hadoop/apache-pig

          $ tar -xzvf pig-0.14.0.tar.gz

  • Set the environment variable PIG_HOME to point to the installation directory for convenience:

          $ export PIG_HOME=/home/hduser/hadoop/apache-pig

                                                or

  • Set PIG_HOME in .bashrc so it will be set every time you log in. Add the following lines to it:

          $ export PIG_HOME=/home/hduser/hadoop/apache-pig

          $ export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$PATH

  • Verify the installation:

          $ pig -version

Pig Latin Data Model

Pig’s data types make up the data model and define how Pig thinks about the structure of the data it is processing. With Pig, the data model is defined when the data is loaded. Any data we load into Pig has a particular schema and structure; Pig needs to understand that structure, so when we do the loading, the data automatically goes through a mapping.

The Pig data model is rich enough to handle almost anything thrown its way, including table-like structures and nested hierarchical data structures. In general terms, though, Pig data types can be broken into two categories: scalar types and complex types. Scalar types contain a single value, whereas complex types contain other types, such as the tuple, bag, and map types.


Atom: Any single value in Pig Latin, irrespective of its data type, is known as an atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data, or a simple atomic value, is known as a field.

Example − ‘raja’ or ‘30’

Tuple: A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.

Example − (Raja, 30)

Bag: A bag is an unordered set of tuples. In other words, a collection of (non-unique) tuples is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in an RDBMS, but unlike an RDBMS table, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as an inner bag.

Example − (Raja, 30, {(9848022338, raja@gmail.com)})

Map: A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique; the value can be of any type. A map is represented by ‘[]’. Example − [name#Raja, age#30]
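
These complex types can appear directly in a LOAD schema. A small sketch, assuming a hypothetical tab-delimited file students.txt whose last two columns hold a bag of phone numbers and a map of extra details:

          grunt> students = LOAD 'students.txt' AS (name:chararray, age:int,
                 phones:bag{t:(phone:chararray)}, details:map[]);
          grunt> DESCRIBE students;   -- prints the schema declared above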


The value of all these types can also be null. The semantics for null are similar to those used in SQL: null in Pig means the value is unknown. Nulls can show up in the data in cases where values are unreadable or unrecognizable, for example, if you were to use the wrong data type in the LOAD statement.

Null can also be used as a placeholder until data is added, or as a value for an optional field.

SCALAR TYPES

  1. int: Represents a signed 32-bit integer. Example: 8
  2. long: Represents a signed 64-bit integer. Example: 5L
  3. float: Represents a signed 32-bit floating point. Example: 5.5F
  4. double: Represents a 64-bit floating point. Example: 10.5
  5. chararray: Represents a character array (string) in Unicode UTF-8 format. Example: ‘tutorials point’
  6. bytearray: Represents a byte array (blob).
  7. boolean: Represents a Boolean value. Example: true/false
  8. datetime: Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
  9. biginteger: Represents a Java BigInteger. Example: 60708090709
  10. bigdecimal: Represents a Java BigDecimal. Example: 185.98376256272893883

COMPLEX TYPES

  11. tuple: An ordered set of fields. Example: (raja, 30)
  12. bag: A collection of tuples. Example: {(raju,30),(Mohammad,45)}
  13. map: A set of key-value pairs. Example: [‘name’#’Raju’, ‘age’#30]
Pig Latin has a simple syntax with powerful semantics to carry out two primary operations: access and transform data.

In a Hadoop context, accessing data means allowing developers to load, store, and stream data, whereas transforming data means taking advantage of Pig’s ability to group, join, combine, split, filter, and sort data. The table below gives an overview of the operators associated with each operation; a short combined example follows it.

Loading and Storing

  • LOAD: Load data from the file system (local/HDFS) into a relation.
  • STORE: Save a relation to the file system (local/HDFS).

Filtering

  • FILTER: Remove unwanted rows from a relation.
  • DISTINCT: Remove duplicate rows from a relation.
  • FOREACH … GENERATE: Generate data transformations based on columns of data.
  • STREAM: Transform a relation using an external program.

Grouping and Joining

  • JOIN: Join two or more relations.
  • COGROUP: Group the data in two or more relations.
  • GROUP: Group the data in a single relation.
  • CROSS: Create the cross product of two or more relations.

Sorting

  • ORDER: Arrange a relation in sorted order based on one or more fields (ascending or descending).
  • LIMIT: Get a limited number of tuples from a relation.

Combining and Splitting

  • UNION: Combine two or more relations into a single relation.
  • SPLIT: Split a single relation into two or more relations.

Diagnostic Operators

  • DUMP: Print the contents of a relation on the console.
  • DESCRIBE: Describe the schema of a relation.
  • EXPLAIN: View the logical, physical, or MapReduce execution plans used to compute a relation.
  • ILLUSTRATE: View the step-by-step execution of a series of statements.
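
To show how several of these operators chain together, here is a hedged sketch that reuses the hypothetical student_data.txt file from the example later in this post:

          grunt> students = LOAD 'student_data.txt' USING PigStorage(',') AS
                 (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
          grunt> by_city = GROUP students BY city;                -- one group per city
          grunt> counts = FOREACH by_city GENERATE group AS city, COUNT(students) AS total;
          grunt> ordered = ORDER counts BY total DESC;            -- biggest city first
          grunt> top3 = LIMIT ordered 3;                          -- keep three rows
          grunt> DUMP top3;                                       -- print to the console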

Data Flow in Pig Programming:

A Pig program consists of three parts: loading, transforming, and dumping the data.

Loading: As is the case with all Hadoop features, the objects being worked on by Hadoop are stored in HDFS. In order for a Pig program to access this data, the program must first tell Pig which file (or files) it will use, and that is done through the LOAD ‘data file’ statement (where ‘data file’ specifies either an HDFS file or a directory). If a directory is specified, all the files in that directory are loaded into the program. If the data is stored in a file format that is not natively accessible to Pig, you can optionally add the USING clause to the LOAD statement to specify a user-defined function that can read and interpret the data.

Transforming: The transformation logic is where all the data manipulation happens. Here we can FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more.

Dumping and Storing: If we don’t specify the DUMP or STORE command, the results of a Pig program are not generated. We would typically use the DUMP command, which sends the output to the screen, when we are debugging our Pig programs. When we go into production, we simply change the DUMP call to a STORE call so that any results from running our programs are stored in a file for further processing or analysis. Note that we can use the DUMP command anywhere in our program to dump intermediate result sets to the screen, which is very useful for debugging purposes.

Example

Given below is a Pig Latin statement that loads data into Apache Pig.

          grunt> student_data = LOAD 'student_data.txt' USING PigStorage(',') AS
                 (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

The optional USING clause defines how to map the data structure within the file to the Pig data model, in this case the PigStorage() storage function, which parses delimited text files. (This part of the USING clause is often referred to as a LoadFunc and works in a fashion similar to a custom deserializer.)

The optional AS clause defines a schema for the data being mapped. If we don’t use an AS clause, we’re basically telling the default LoadFunc to expect a plain text file that is tab delimited. With no schema provided, the fields must be referenced by position, because no names are defined.
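
For instance, if we load the same hypothetical file without a schema, the second and third columns have to be addressed as $1 and $2:

          grunt> raw = LOAD 'student_data.txt' USING PigStorage(',');
          grunt> names = FOREACH raw GENERATE $1, $2;   -- firstname, lastname by position
          grunt> DUMP names;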

Using an AS clause means that we have a schema in place at read time for our text files, which allows users to get started quickly and provides agile schema modeling and the flexibility to add more data to our analytics.
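
With the schema in place, the loaded relation can be transformed and stored straight away. A sketch continuing the statement above (the city value 'Chennai' is illustrative):

          grunt> chennai = FILTER student_data BY city == 'Chennai';
          grunt> STORE chennai INTO 'chennai_students' USING PigStorage(',');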

Features of Apache Pig:

  • Pig Latin is a procedural data flow language used for programming.
  • It can handle all kinds of data, structured as well as unstructured.
  • Using Pig’s multi-query approach, many operations can be combined in a single flow, reducing the number of times the data is scanned.
  • It provides a rich set of operators for filtering, joining, sorting, etc.
  • It provides complex data types such as tuples, bags, and maps.
  • It is used by researchers and programmers alike.
  • It operates on the client side of the cluster.
  • It does not have a dedicated metadata database; schemas and data types are defined in the script itself.
  • Through the User Defined Functions (UDF) facility, Pig can invoke code written in languages like JRuby, Python, and Java (see the sketch after this list).
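
As noted in the last bullet, a Java UDF packaged in a jar is first registered and then called like a built-in function. A sketch, where myudfs.jar and its UPPER function are hypothetical names:

          grunt> REGISTER myudfs.jar;                   -- make the jar's classes visible to Pig
          grunt> upper = FOREACH student_data GENERATE myudfs.UPPER(firstname);
          grunt> DUMP upper;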

Conclusion: In this blog, we have seen that Pig is a very powerful scripting language built on top of the Hadoop ecosystem and MapReduce programming. It can be used to process large volumes of data in a distributed environment. Pig statements and scripts are similar to SQL statements, so developers can use it without focusing much on the underlying mechanism. Through the User Defined Functions (UDF) facility in Pig, Pig can invoke code in many languages like JRuby, Python, and Java. We can also embed Pig scripts in other languages. The result is that we can use Pig as a component to build larger and more complex applications that tackle real business problems.

Thank you!
