What is Apache Oozie? Why Oozie? and Workflow

What is Apache Oozie?






Apache Oozie is an open-source, time-based scheduler system for running and managing Hadoop jobs in a distributed environment. It is a component of the Hadoop ecosystem dedicated to creating workflows and scheduling them.

(Other Definition):

Oozie is an open-source, distributed, scalable, and fault-tolerant scheduling component for Hadoop. It is a Java-based web application whose web console is available at the following URL (11000 is the default port):

http://<<hostnameOfOOZIE>>:11000/oozie

Why is Oozie a DAG (Directed Acyclic Graph) scheduler?

An Oozie workflow is a DAG because the execution of each task depends on the completion of the task before it, and these dependencies never loop back to an earlier task.
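The dependency between tasks can be sketched as a small workflow in which the second action starts only after the first one succeeds. This is a minimal illustration, not a production workflow: the action names, the workflow name, and the HDFS path are hypothetical, and ${nameNode} is assumed to be defined in job.properties.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="dag-sketch">
    <!-- Entry point of the graph -->
    <start to="first_step"/>

    <!-- first_step: creates a directory (hypothetical path) -->
    <action name="first_step">
        <fs>
            <mkdir path="${nameNode}/tmp/dag_sketch"/>
        </fs>
        <!-- second_step runs only if first_step completes successfully -->
        <ok to="second_step"/>
        <error to="kill_job"/>
    </action>

    <!-- second_step: depends on first_step, removes the directory again -->
    <action name="second_step">
        <fs>
            <delete path="${nameNode}/tmp/dag_sketch"/>
        </fs>
        <ok to="end"/>
        <error to="kill_job"/>
    </action>

    <!-- Any failure ends the workflow here -->
    <kill name="kill_job">
        <message>Failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Every transition points forward (start, first_step, second_step, end), so no node can be reached twice, which is what makes the graph acyclic.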

Core Building Blocks of Oozie:

1.  Property File (job.properties)

2. Workflow (workflow.xml)

3. Coordinator (coordinator.xml)

1. Property File (job.properties):

This file holds the high-level configuration of the job:

I) Location of the NameNode in the cluster

II) Location of the ResourceManager in the cluster

III) Workflow path on HDFS, etc.

Example:

#properties

nameNode=hdfs://root

jobTracker=hostname.com:8088
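The Oozie command-line tool also accepts the job parameters as a Hadoop XML configuration file instead of a .properties file. As a rough sketch, the same settings in that form, together with the workflow path mentioned in point III, might look like the following. The application path used here is a hypothetical placeholder, not a value from the original example.

```xml
<configuration>
    <!-- Same values as the job.properties example -->
    <property>
        <name>nameNode</name>
        <value>hdfs://root</value>
    </property>
    <property>
        <name>jobTracker</name>
        <value>hostname.com:8088</value>
    </property>
    <!-- HDFS directory that contains workflow.xml (hypothetical path) -->
    <property>
        <name>oozie.wf.application.path</name>
        <value>hdfs://root/apps/first_template</value>
    </property>
</configuration>
```

Either file is passed to the oozie job command with the -config option.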

2. Workflow (workflow.xml)

A workflow is a collection of action tags, where each action denotes one unit of work. It does not contain job-level scheduling information; it describes only action-level details.

Workflow.xml sample template:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="First Template">

<start to="Create_External_Table"/>

<action name="Create_External_Table">

<hive xmlns="uri:oozie:hive-action:0.4">

<job-tracker>hostname.com:8088</job-tracker>

<name-node>hdfs://rootname</name-node>

<script>hdfs_path_of_script/external.hive</script>

</hive>

<ok to="end"/>

<error to="kill_job"/>

</action>

<kill name="kill_job"><message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message></kill>

<end name="end"/>

</workflow-app>

3. Coordinator (coordinator.xml)

To schedule the job, we use coordinator.xml. Some of the attributes associated with a coordinator job are:

I) Unique job id

II) Job name

III) Status (for example running, suspended, succeeded, or killed)

IV) username

V) Group name

VI) Start time and End time

VII) Frequency of the running job

VIII) Next materialization (the next time the job is due to run)
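The scheduling attributes in this list (start time, end time, and frequency) are set in the coordinator definition itself. A minimal sketch of a coordinator.xml that runs a workflow once a day could look like the following. The coordinator name and the properties startTime, endTime, and workflowAppPath are hypothetical and assumed to be defined in job.properties.

```xml
<!-- frequency: run once per day; start/end: the window in which runs are scheduled -->
<coordinator-app name="first_coordinator"
                 frequency="${coord:days(1)}"
                 start="${startTime}"
                 end="${endTime}"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS directory containing workflow.xml (hypothetical property) -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```

Start and end times are written in UTC in the form 2024-01-01T00:00Z. When a coordinator is submitted, job.properties points to it with oozie.coord.application.path rather than oozie.wf.application.path.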

Example:

Modified Workflow:

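<workflow-app xmlns="uri:oozie:workflow:0.4" name="First Template">

<start to="Insert_into_Table"/>

<action name="Insert_into_Table">

<hive xmlns="uri:oozie:hive-action:0.4">

<job-tracker>${jobTracker}</job-tracker>

<name-node>${nameNode}</name-node>

<!-- placeholder script path, following the first template -->

<script>hdfs_path_of_script/insert.hive</script>

</hive>

<ok to="end"/>

<error to="kill_job"/>

</action>

<kill name="kill_job"><message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message></kill>

<end name="end"/>

</workflow-app>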