Introduction to Apache Oozie

Hello readers! You might have already read my other blogs on different components of the Hadoop ecosystem. This article helps you get to know Apache Oozie, so let's get started.

What is Oozie?

Apache Oozie is a workflow scheduler for Hadoop: a system that runs workflows of dependent jobs. Users can create Directed Acyclic Graphs of workflows, which can be run in parallel and sequentially in Hadoop.

It consists of two parts:

  • Workflow engine: The responsibility of the workflow engine is to store and run workflows composed of Hadoop jobs (e.g., MapReduce, Pig, Hive).
  • Coordinator engine: It runs workflow jobs based on predefined schedules and availability of data.

Oozie is scalable and can manage timely execution of thousands of workflows (each consisting of dozens of jobs) in a Hadoop cluster.

Figure 1: Oozie Workflow

An Oozie workflow is a collection of actions (e.g., Hadoop Map/Reduce jobs, Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph) that specifies the sequence in which the actions execute. This graph is written in hPDL (an XML Process Definition Language). The term "acyclic" means the graph contains no loops: the action graph has a distinct starting point and end point, and every path built from action nodes and their defined dependencies leads from the start toward the end without ever going back. hPDL is a fairly compact language, using a limited amount of flow-control and action nodes. Control nodes define the flow of execution and include the beginning and end of a workflow (start, end and fail nodes) as well as mechanisms to control the workflow execution path (decision, fork and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task. Oozie provides support for the following types of actions: Hadoop map-reduce, Hadoop file system, Pig, Java and Oozie sub-workflow.
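
To make this concrete, here is a minimal hPDL sketch of a workflow with a single map-reduce action wired between the control nodes described above. The workflow name, node names and directory paths are hypothetical placeholders:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
        <start to="mr-step"/>

        <action name="mr-step">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <!-- hypothetical input/output locations -->
                    <property>
                        <name>mapred.input.dir</name>
                        <value>/user/demo/input</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>/user/demo/output</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>

        <kill name="fail">
            <message>Map/Reduce failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>

        <end name="end"/>
    </workflow-app>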

How does Oozie work?

An Oozie workflow is a DAG of Hadoop computation/processing tasks (referred to as “actions”) and flow “controls” to coordinate the tasks and manage dependencies of actions and their results.

  • Actions:

Oozie workflow actions start jobs on remote nodes; upon completion, the processes executing those jobs call back to Oozie to notify it of completion, in response to which Oozie starts the next action. Actions can be Hadoop fs, ssh, map-reduce, Hive, Pig, Sqoop, distcp, HTTP or email commands, or custom actions.

  • Controls: 

Controls manage the execution path of actions and include start, fork, join, decision and end.

  • Parameterizing actions and decisions:

Actions and decisions can be parameterized with job properties, action output (e.g., Hadoop counters) and file information (whether a file exists, its size, etc.). Formal parameters are expressed in the workflow definition as ${VAR} variables; a parameterized action is shown in the sketch after this list.

  • Workflow application:

A workflow application is a deployable instance of a workflow: essentially a ZIP file containing everything needed to execute the actions within the workflow – the workflow definition (an XML file), JARs for Map/Reduce jobs, shells for streaming Map/Reduce jobs, native libraries, Hive/Pig/Sqoop scripts, files for the distributed cache and other resource files.

  • Workflow definition:

A workflow definition is a DAG with control-flow nodes and action nodes, expressed in the XML-based workflow definition language called hPDL (Hadoop Process Definition Language).

  • Workflow nodes:

Nodes encompassing actions in hPDL are called action nodes, nodes encompassing controls are called control-flow nodes, and together they are referred to as workflow nodes.
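
As a sketch of how parameterization looks in practice (the ${inputDir} and ${outputDir} properties are hypothetical and would be supplied in a job properties file at submission time), a Pig action might reference formal parameters like this:

    <action name="pig-step">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>etl.pig</script>
            <!-- resolved from job properties when the workflow job is submitted -->
            <param>INPUT=${inputDir}</param>
            <param>OUTPUT=${outputDir}</param>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>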

Oozie runs as a service in the cluster, and clients submit workflow definitions for immediate or later processing. An Oozie workflow consists of action nodes and control-flow nodes.

An action node represents a workflow task, e.g., moving files into HDFS, running a MapReduce, Pig or Hive job, importing data using Sqoop, or running a shell script or a program written in Java.

A control-flow node controls the workflow execution between actions by allowing constructs like conditional logic, wherein different branches may be followed depending on the result of an earlier action node.

Start, end and error nodes fall under this category of nodes.

Start Node: The start node is the entry point for a workflow job; it indicates the first workflow node the workflow job must transition to. When a workflow is started, it automatically transitions to the node specified in the start. A workflow definition must have one start node.

End Node: The end node marks the end of a workflow job; it indicates that the workflow job has completed successfully. When a workflow job reaches the end node, it finishes successfully (SUCCEEDED). If one or more actions started by the workflow job are still executing when the end node is reached, those actions will be killed; in this scenario the workflow job is still considered to have run successfully. A workflow definition must have one end node.

Error Node: An error node designates the occurrence of an error and the corresponding error message to be printed. At the end of workflow execution, Oozie uses an HTTP callback to update the client with the workflow status. Entry to or exit from an action node may also trigger a callback.

Kill control node: The kill node allows a workflow job to kill itself. When a workflow job reaches a kill node, it finishes in error (KILLED). If one or more actions started by the workflow job are still executing when the kill node is reached, those actions will be killed. A workflow definition may have zero or more kill nodes.

Decision node: A decision node enables a workflow to make a selection on the execution path to follow. The behavior of a decision node can be seen as a switch-case statement.
A decision node consists of a list of predicate-transition pairs plus a default transition. Predicates are evaluated in order of appearance until one of them evaluates to true, and the corresponding transition is taken; if none of the predicates evaluates to true, the default transition is taken. Predicates are JSP Expression Language (EL) expressions that resolve to a Boolean value, true or false. The default element in the decision node indicates the transition to take if none of the predicates evaluates to true. All decision nodes must have a default element to avoid bringing the workflow into an error state when no predicate evaluates to true.
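
Here is a hedged sketch of a decision node (the node names are hypothetical; the predicate uses Oozie's built-in fs:exists EL function):

    <decision name="route">
        <switch>
            <!-- taken when the output directory already exists -->
            <case to="skip-etl">${fs:exists('/user/demo/output')}</case>
            <!-- mandatory fallback when no predicate evaluates to true -->
            <default to="run-etl"/>
        </switch>
    </decision>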

Fork/join control nodes:
A fork node splits one path of execution into multiple concurrent paths of execution. A join node waits until every concurrent execution path of a previous fork node arrives at it. Fork and join nodes must be used in pairs; the join node assumes the concurrent execution paths are children of the same fork node.
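
For instance, a minimal fork/join pair (node names hypothetical) that runs two actions concurrently and then converges:

    <fork name="split">
        <path start="load-hive"/>
        <path start="load-pig"/>
    </fork>

    <!-- the "load-hive" and "load-pig" actions each transition to the
         join via <ok to="merge"/> -->

    <join name="merge" to="next-step"/>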

Oozie is quite flexible in the types of tasks it can handle: an action node in the program can be a MapReduce job, a Java app, a file system job, or even a Pig application. Flow control in the DAG is represented by node elements, which act on the logic derived from the output of the preceding job in the same graph. Flow-control nodes in Oozie programs can be join nodes, fork nodes and decision nodes. Figure 2 shows an example of a workflow in Oozie.

Figure 2: Example of an Oozie Workflow

All computation/processing tasks triggered by an action node are remote to Oozie – they are executed by the Hadoop Map/Reduce framework. This approach allows Oozie to leverage existing Hadoop machinery for load balancing, failover, etc. The majority of these tasks are executed asynchronously (the exception is the file system action, which is handled synchronously). This means that for most types of computation/processing tasks triggered by a workflow action, the workflow job has to wait until the task completes before transitioning to the following node in the workflow. Oozie can detect completion of these tasks by two different means: callbacks and polling. When a computation/processing task is started by Oozie, Oozie provides a unique callback URL to the task, and the task should invoke the given URL to notify its completion. For cases where the task fails to invoke the callback URL for any reason (e.g., a transient network failure), or when the type of task cannot invoke the callback URL upon completion, Oozie has a mechanism to poll computation/processing tasks for completion.

Oozie workflows can be parameterized (using variables like ${inputDir} within the workflow definition). When submitting a workflow job, values for the parameters must be provided. If properly parameterized (i.e., using different output directories), several identical workflow jobs can run concurrently. Some workflows are invoked on demand, but most of the time it is necessary to run them based on regular time intervals, data availability and/or external events. The Oozie Coordinator system allows the user to define workflow execution schedules based on these parameters: the coordinator models workflow execution triggers in the form of predicates, which can reference data, time and/or external events, and the workflow job is started once the predicate is satisfied. It is also often necessary to connect workflow jobs that run regularly but at different time intervals, so that the outputs of multiple subsequent runs of one workflow become the input to the next workflow. Chaining these workflows together is referred to as a data application pipeline, and the Oozie Coordinator supports the creation of such pipelines.
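
As an illustration, here is a hedged sketch of a coordinator application (the name, schedule, dataset URI template and application path are all hypothetical) that runs a workflow once a day, but only after that day's input dataset has arrived:

    <coordinator-app name="daily-etl" frequency="${coord:days(1)}"
                     start="2016-01-01T00:00Z" end="2016-12-31T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
        <datasets>
            <dataset name="logs" frequency="${coord:days(1)}"
                     initial-instance="2016-01-01T00:00Z" timezone="UTC">
                <uri-template>${nameNode}/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
            </dataset>
        </datasets>
        <input-events>
            <data-in name="input" dataset="logs">
                <instance>${coord:current(0)}</instance>
            </data-in>
        </input-events>
        <action>
            <workflow>
                <app-path>${nameNode}/apps/daily-etl</app-path>
                <configuration>
                    <property>
                        <!-- passed through to the workflow's ${inputDir} parameter -->
                        <name>inputDir</name>
                        <value>${coord:dataIn('input')}</value>
                    </property>
                </configuration>
            </workflow>
        </action>
    </coordinator-app>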

Why Oozie?

The main purpose of Oozie is to manage the different types of jobs processed in a Hadoop system. Dependencies between jobs are specified by the user in the form of Directed Acyclic Graphs. Oozie consumes this information and takes care of executing the jobs in the correct order as specified in the workflow, saving the user the effort of managing the complete workflow. In addition, Oozie lets the user specify the execution frequency of a particular job.

Features:

  • Oozie has a client API and a command-line interface which can be used to launch, control and monitor jobs from a Java application.
  • Using its Web Service APIs, one can control jobs from anywhere.
  • Oozie can execute jobs which are scheduled to run periodically.
  • Oozie can send email notifications upon completion of jobs (see the email action sketch after this list).
  • It executes and monitors workflows in Hadoop.
  • It is responsible for periodic scheduling of workflows.
  • It triggers execution on data availability.
  • It provides HTTP and command-line interfaces and a web console.
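
As referenced in the list above, here is a hedged sketch of an email action (the recipient address and node names are hypothetical; wf:id() is a built-in EL function that returns the workflow job id):

    <action name="notify">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>team@example.com</to>
            <subject>Oozie workflow ${wf:id()} finished</subject>
            <body>The workflow completed successfully.</body>
        </email>
        <ok to="end"/>
        <error to="fail"/>
    </action>

Note that the email action requires an SMTP server to be configured on the Oozie server side.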

Architecture of Oozie

Oozie was designed to be stateless: if the Oozie server goes down, you don't lose anything. Already-running jobs continue running while the Oozie server is down; once the server comes back up, it starts any pending jobs and transitions any workflows with finished actions. This is convenient because you don't have to worry about losing anything when an Oozie server goes down.

Oozie maintains in-memory locks for each job to prevent multiple threads (intra-process) from trying to process the same job at the same time. With multiple Oozie servers, these locks have to be extended to distributed locks that all the Oozie servers can share (inter-process). To that end, Oozie uses Apache ZooKeeper for the distributed locks, and to interact with ZooKeeper it uses Apache Curator, a set of libraries that make working with ZooKeeper much easier.
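
A minimal sketch of what this looks like in oozie-site.xml, assuming a hypothetical three-node ZooKeeper quorum (the hostnames are placeholders; the service class names are Oozie's ZooKeeper-backed extension services):

    <!-- swap in ZooKeeper-backed services for locking, log streaming
         and job-id concurrency -->
    <property>
        <name>oozie.services.ext</name>
        <value>org.apache.oozie.service.ZKLocksService,org.apache.oozie.service.ZKXLogStreamingService,org.apache.oozie.service.ZKJobsConcurrencyService</value>
    </property>
    <property>
        <name>oozie.zookeeper.connection.string</name>
        <!-- hypothetical quorum -->
        <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>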

Figure 3: Basic Architecture of Oozie

Oozie database:  Oozie stores almost all its state (submitted jobs, workflow definitions, and so on) in a database.
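
For example, here is a hedged sketch of pointing Oozie's state store at an external MySQL database in oozie-site.xml (the JDBC URL and credentials are hypothetical; by default Oozie ships with an embedded Derby database):

    <property>
        <name>oozie.service.JPAService.jdbc.driver</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.url</name>
        <value>jdbc:mysql://db.example.com:3306/oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.username</name>
        <value>oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.password</name>
        <value>oozie</value>
    </property>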

Oozie Server: Whenever a job is submitted, the Oozie server consults the Oozie database for the stored workflow definition and then uses Map/Reduce to execute the job. Depending on requirements, there can be multiple masters. Once the job is submitted, the slaves are used to execute it: slaves use Map/Reduce for execution and read the required data from the Hadoop Distributed File System (HDFS). Apart from Map/Reduce, Oozie actions can also invoke shell commands.

Oozie Client: The client must be configured to communicate with the Oozie server. Almost all web interfaces installed on top of Hadoop (like Hue) understand Oozie and provide friendlier interfaces, so that people do not have to struggle to write the XML files by hand.

Advantages of Oozie:

  • Oozie is designed to scale in a Hadoop cluster. Each job will be launched from a different data node. This means that the workflow load will be balanced and no single machine will become overburdened by launching workflows. This also means that the capacity to launch workflows will grow as the cluster grows.
  • Oozie is well integrated with Hadoop security. This is especially important in a kerberized cluster. Oozie knows which user submitted the job and will launch all actions as that user, with the proper privileges. It will handle all the authentication details for the user as well.
  • Oozie is the only workflow manager with built-in Hadoop actions, making workflow development, maintenance and troubleshooting easier.
  • Oozie UI makes it easier to drill down to specific errors in the data nodes. Other systems would require significantly more work to correlate job tracker jobs with the workflow actions.
  • Oozie is proven to scale in some of the world's largest clusters; a white paper describes a deployment at Yahoo! that can handle 1250 job submissions a minute.
  • Oozie gets callbacks from MapReduce jobs so it knows when they finish and whether they hang without expensive polling. No other workflow manager can do this.
  • Oozie Coordinator allows triggering actions when files arrive at HDFS. This will be challenging to implement anywhere else.
  • Oozie is supported by Hadoop vendors. If there is ever an issue with how the workflow manager integrates with Hadoop – you can turn to the people who wrote the code for answers.

Conclusion:

Oozie allows a user to create Directed Acyclic Graphs of workflows, and these can run in parallel and sequentially in Hadoop. Oozie can also run plain Java classes and Pig workflows, and can interact with HDFS (handy if you need to delete or move files before a job runs). Oozie can run jobs sequentially (one after the other) and in parallel (multiple at a time), and it allows you to restart from a failure: one can tell Oozie to restart a job from a specific node in the graph or to skip specific failed nodes. It offers major flexibility to start, stop, suspend and re-run jobs. In short, Oozie is aimed at managing the different types of jobs processed in a Hadoop system, saving the user's time in managing the complete workflow.

Thank you for reading!!

References:

  1. https://www.infoq.com/articles/introductionOozie
  2. http://hortonworks.com/apache/oozie/
  3. https://www-01.ibm.com/software/data/infosphere/hadoop/oozie/
  4. http://bigdatauniversity.com/courses/controlling-hadoop-jobs-with-oozie/
  5. https://www.altiscale.com/blog/oozie-launcher-tips-for-tackling-its-challenges/
  6. https://blogs.oracle.com/datawarehousing/entry/building_simple_workflows_in_oozie
  7. https://www.safaribooksonline.com/library/view/apache-oozie/9781449369910/ch04.html