OOZIE WORKFLOW





[Figure: simple Oozie workflow diagram]

Oozie Workflow:

A workflow in Oozie is a sequence of actions arranged in a Directed Acyclic Graph (DAG). The actions are linked by control dependencies: the next action can run only after the current action has produced its output. Subsequent actions are therefore dependent on the actions before them.
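
The control dependency described above can be sketched as a workflow with two chained actions. This is only an illustration under assumptions: the workflow name, the action names, the use of `fs` actions, and the paths under /tmp/oozie_demo are hypothetical, and `${nameNode}` is assumed to be supplied as a job property.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="dag-sketch">
    <start to="make_staging_dir"/>

    <!-- First action: runs as soon as the workflow starts -->
    <action name="make_staging_dir">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie_demo/staging"/>
        </fs>
        <!-- The second action is reached only if this one succeeds -->
        <ok to="make_output_dir"/>
        <error to="fail"/>
    </action>

    <!-- Second action: depends on the outcome of the first -->
    <action name="make_output_dir">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie_demo/output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Because every transition points forward and no node leads back to an earlier one, the graph stays acyclic.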

An Oozie workflow action can be a Java action, a Hive action, a Shell script action, and so on. A workflow can also contain decision nodes that determine which path is taken and under which condition a job should run.
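
A decision node might look roughly like the sketch below. The node names and the `inputDir` property are hypothetical; `inputDir` is assumed to hold a full HDFS URI passed in as a job property.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="decision-sketch">
    <start to="check_input"/>

    <decision name="check_input">
        <switch>
            <!-- Follow this path only when the input directory exists -->
            <case to="mark_processed">${fs:exists(inputDir)}</case>
            <!-- Otherwise finish without doing any work -->
            <default to="end"/>
        </switch>
    </decision>

    <action name="mark_processed">
        <fs>
            <mkdir path="${inputDir}/_processed"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The cases of a `switch` are evaluated in order, and the `default` transition is taken when none of them is true.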

Cron-style jobs, such as Kafka jobs and script jobs, can be scheduled by Oozie. Oozie detects the completion of tasks through callbacks and polling. When Oozie starts a task, it gives the task a unique callback HTTP URL, and the task notifies that URL when it completes. If the task fails to invoke the callback URL, Oozie can poll the task for completion.

There are three main types of jobs in Oozie:

Oozie Workflow Jobs – Workflow jobs are represented as Directed Acyclic Graphs (DAGs) that specify the actions to be executed.

Oozie Coordinator Jobs – Coordinator jobs consist of workflow jobs triggered by time and data availability (scheduling).

Oozie Bundle – A bundle is a package of multiple coordinator and workflow jobs (cron jobs).
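
As a rough illustration of the third type, a bundle definition that packages two coordinators could look like this. The bundle name, the coordinator names, and the application paths are hypothetical; each `app-path` is assumed to point to an HDFS directory that holds a coordinator definition.

```xml
<bundle-app name="bundle-sketch" xmlns="uri:oozie:bundle:0.2">
    <!-- Each coordinator entry references one coordinator application -->
    <coordinator name="daily_load">
        <app-path>${nameNode}/apps/daily_load</app-path>
    </coordinator>
    <coordinator name="hourly_report">
        <app-path>${nameNode}/apps/hourly_report</app-path>
    </coordinator>
</bundle-app>
```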

What is Apache Oozie? Why Oozie? and Workflow

What is Apache Oozie?






Apache Oozie is an open-source, time-based scheduler system to run and manage Hadoop jobs in a distributed environment. It is one of the components of the Hadoop ecosystem and is meant exclusively for creating workflows and scheduling them.

(Other definition):

Oozie is an open-source, distributed, scalable, and fault-tolerant scheduling component of Hadoop. It is Java based and has a web GUI, which runs at the following address:

http://<<hostnameOfOOZIE>>:11000/oozie

Why is Oozie a DAG (Directed Acyclic Graph) scheduler?

Simply put, it is a DAG scheduler because the execution of one task depends on the completion of the tasks it depends on.

Core Building Blocks of Oozie:

1. Property File (job.properties)

2. Workflow (workflow.xml)

3. Coordinator (coordinator.xml)

1. Property File (job.properties):

This file holds the high-level configuration:

I) Location of the NameNode in the cluster

II) Location of the ResourceManager in the cluster

III) Workflow path on HDFS, etc.

Example:

# properties
nameNode=hdfs://root
jobTracker=hostname.com:8088

2. Workflow (workflow.xml)

It is a collection of action tags, where each action denotes one unit of work. It does not carry job-level scheduling information; it describes only action-level details.

workflow.xml sample template (the ok/error transitions and the end node are left out for brevity):

<workflow-app xmlns="uri:oozie:workflow:0.4" name="First Template">
    <start to="Create_External_Table"/>
    <action name="Create_External_Table">
        <hive xmlns="uri:oozie:hive-action:0.4">
            <job-tracker>hostname.com:8088</job-tracker>
            <name-node>hdfs://rootname</name-node>
            <script>hdfs_path_of_script/external.hive</script>
        </hive>
    </action>
</workflow-app>

3. Coordinator (coordinator.xml)

To schedule the job we use coordinator.xml. Some of the attributes of a coordinator job are:

I) Unique job ID

II) Job name

III) Status (running, killed, succeeded, suspended, resumed, etc.)

IV) User name

V) Group name

VI) Start time and end time

VII) Frequency of the running job

VIII) Next generation
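
A coordinator definition that sets the start time, end time, and frequency listed above could be sketched as follows. The application name and the properties `startTime`, `endTime`, and `workflowAppPath` are hypothetical; the times are assumed to be UTC values such as 2024-01-01T00:00Z, and `workflowAppPath` is assumed to be the HDFS directory that contains workflow.xml.

```xml
<coordinator-app name="coordinator-sketch"
                 frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <!-- The workflow application that is started once per day -->
        <workflow>
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```

The workflow itself stays free of scheduling details; the coordinator only decides when it is launched.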

Example:

Modified Workflow:

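<workflow-app xmlns="uri:oozie:workflow:0.4" name="First Template">
    <start to="Insert_into_Table"/>
    <action name="Insert_into_Table">
        <hive xmlns="uri:oozie:hive-action:0.4">
            <!-- Both values are resolved from job.properties -->
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- A <script> element pointing to the Hive script is also required here -->
        </hive>
        <ok to="end"/>
        <error to="kill_job"/>
    </action>
    <!-- Targets of the two transitions above, added so that they resolve -->
    <kill name="kill_job">
        <message>Hive action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>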