Apache Oozie

What Is Oozie?

Apache Oozie is a workflow scheduler and coordinator for Hadoop jobs. It lets you define multi-step pipelines — chains of MapReduce, Hive, Pig, Spark, HDFS, and shell actions — and schedule them based on time or data availability.

Oozie Server

├── Workflow Jobs ── sequential/parallel DAG of actions
├── Coordinator Jobs ── trigger workflows on schedule or data arrival
└── Bundle Jobs ── manage groups of coordinator jobs

All job definitions are written in XML and stored on HDFS.

Core Job Types

Type          Description
Workflow      A DAG of actions; runs once when triggered
Coordinator   Schedules workflow runs based on time or input data readiness
Bundle        Groups multiple coordinator jobs for lifecycle management
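
Bundle jobs get their own XML definition. A minimal bundle.xml sketch that wraps the coordinator defined later on this page (the kick-off time and bundle name are illustrative assumptions):

```xml
<!-- Hypothetical bundle.xml: manages the daily-etl coordinator as part of a group -->
<bundle-app name="etl-bundle" xmlns="uri:oozie:bundle:0.2">
    <controls>
        <!-- assumption: earliest time the bundle's coordinators may start -->
        <kick-off-time>2026-01-01T00:00Z</kick-off-time>
    </controls>
    <coordinator name="daily-etl-coord">
        <app-path>${nameNode}/user/oozie/workflows/daily-etl</app-path>
    </coordinator>
</bundle-app>
```

A bundle is submitted like any other job, with oozie.bundle.application.path set in its properties file.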

Workflow Definition (workflow.xml)

A simple ETL workflow: ingest → transform with Hive → verify output.

<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">

    <start to="ingest-data"/>

    <!-- Action 1: Shell script to pull data -->
    <action name="ingest-data">
        <shell xmlns="uri:oozie:shell-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>ingest.sh</exec>
            <argument>${dt}</argument>
            <file>ingest.sh#ingest.sh</file>
        </shell>
        <ok to="transform-hive"/>
        <error to="send-failure-email"/>
    </action>

    <!-- Action 2: Hive query -->
    <action name="transform-hive">
        <hive xmlns="uri:oozie:hive-action:0.6">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>transform.hql</script>
            <param>dt=${dt}</param>
            <file>transform.hql#transform.hql</file>
        </hive>
        <ok to="verify-output"/>
        <error to="send-failure-email"/>
    </action>

    <!-- Action 3: verify the output exists (chmod fails if the path is missing) -->
    <action name="verify-output">
        <fs>
            <chmod path="${outputDir}" permissions="755" dir-files="false"/>
        </fs>
        <ok to="end"/>
        <error to="send-failure-email"/>
    </action>

    <!-- Email on failure -->
    <action name="send-failure-email">
        <email xmlns="uri:oozie:email-action:0.2">
            <to>ops@example.com</to>
            <subject>Oozie workflow FAILED: daily-etl ${dt}</subject>
            <body>Check the Oozie console for details: http://oozie-host:11000/oozie</body>
        </email>
        <ok to="fail"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>

    <end name="end"/>

</workflow-app>

Job Properties (job.properties)

nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.wf.application.path=${nameNode}/user/oozie/workflows/daily-etl
outputDir=/data/output/daily
dt=2026-04-29
queueName=default
oozie.use.system.libpath=true

Coordinator Definition (coordinator.xml)

Run the workflow daily at 2 AM UTC and only when input data for that day exists:

<coordinator-app name="daily-etl-coord"
                 frequency="${coord:days(1)}"
                 start="2026-01-01T02:00Z"
                 end="2030-12-31T02:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">

    <datasets>
        <dataset name="input-data"
                 frequency="${coord:days(1)}"
                 initial-instance="2026-01-01T00:00Z"
                 timezone="UTC">
            <uri-template>/data/input/${YEAR}/${MONTH}/${DAY}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>

    <input-events>
        <data-in name="input" dataset="input-data">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>

    <action>
        <workflow>
            <app-path>${nameNode}/user/oozie/workflows/daily-etl</app-path>
            <configuration>
                <property>
                    <name>dt</name>
                    <value>${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}</value>
                </property>
            </configuration>
        </workflow>
    </action>

</coordinator-app>
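
The coordinator is submitted with its own properties file (referenced as coordinator-job.properties in the commands below). A sketch, reusing the values above; oozie.coord.application.path must point at the directory containing coordinator.xml:

```properties
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
# Points at the directory holding coordinator.xml (same dir as workflow.xml here)
oozie.coord.application.path=${nameNode}/user/oozie/workflows/daily-etl
oozie.use.system.libpath=true
outputDir=/data/output/daily
```

Note that dt is not set here: the coordinator computes it from the nominal time and passes it to each workflow run.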

Submitting and Managing Jobs

# Submit and start a workflow
oozie job -oozie http://oozie-host:11000/oozie \
    -config job.properties -run

# Submit a coordinator
oozie job -oozie http://oozie-host:11000/oozie \
    -config coordinator-job.properties -run

# Check job status
oozie job -oozie http://oozie-host:11000/oozie \
    -info 0000001-260429120000000-oozie-oozi-W

# List running jobs
oozie jobs -oozie http://oozie-host:11000/oozie \
    -status RUNNING

# Kill a job
oozie job -oozie http://oozie-host:11000/oozie \
    -kill 0000001-260429120000000-oozie-oozi-W

# Rerun a failed workflow, restarting only the nodes that failed
oozie job -oozie http://oozie-host:11000/oozie \
    -rerun 0000001-260429120000000-oozie-oozi-W \
    -Doozie.wf.rerun.failnodes=true
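
Two conveniences worth knowing, assuming a standard Oozie CLI install: the OOZIE_URL environment variable removes the need to repeat -oozie on every command, and oozie validate checks a workflow file against the Oozie XML schemas before you upload it:

```shell
# Set once per shell session; every oozie command then omits -oozie
export OOZIE_URL=http://oozie-host:11000/oozie

# Validate the workflow definition against the Oozie schemas
oozie validate workflow.xml

# Example: list running jobs without repeating the server URL
oozie jobs -status RUNNING
```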

HDFS Layout for a Workflow

/user/oozie/workflows/daily-etl/
├── workflow.xml       # workflow definition
├── coordinator.xml    # (optional) coordinator definition
├── job.properties     # kept locally; submitted with -config, not read from HDFS
├── lib/               # JARs automatically added to the Hadoop classpath
│   └── my-udf.jar
├── transform.hql      # Hive script
└── ingest.sh          # shell script
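
Getting this layout onto HDFS is ordinary hdfs dfs work. A sketch, assuming the same local filenames and target path:

```shell
# Create the application directory (and lib/) on HDFS
hdfs dfs -mkdir -p /user/oozie/workflows/daily-etl/lib

# Upload the definitions and scripts; -f overwrites earlier versions
hdfs dfs -put -f workflow.xml coordinator.xml transform.hql ingest.sh \
    /user/oozie/workflows/daily-etl/
hdfs dfs -put -f lib/my-udf.jar /user/oozie/workflows/daily-etl/lib/
```

job.properties stays on the local machine and travels with -config at submit time.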

Supported Action Types

Action            Description
<map-reduce>      Run a MapReduce job
<hive>            Execute a HiveQL script
<pig>             Run a Pig Latin script
<spark>           Submit a Spark application
<shell>           Run a shell script
<fs>              HDFS operations (mkdir, move, delete, chmod)
<java>            Run a Java main class
<email>           Send notification emails
<sub-workflow>    Invoke another workflow
<fork> / <join>   Run actions in parallel

Parallel Execution with Fork/Join

<fork name="parallel-tasks">
    <path start="task-a"/>
    <path start="task-b"/>
</fork>

<action name="task-a">
    <hive ...> ... </hive>
    <ok to="wait-for-both"/>
    <error to="fail"/>
</action>

<action name="task-b">
    <pig ...> ... </pig>
    <ok to="wait-for-both"/>
    <error to="fail"/>
</action>

<join name="wait-for-both" to="next-step"/>
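
Fork/join covers unconditional parallelism; conditional branching uses a <decision> node instead. A sketch with hypothetical node names, using the standard wf:conf EL function:

```xml
<!-- Hypothetical: route on a 'mode' property passed in job.properties -->
<decision name="choose-mode">
    <switch>
        <case to="full-rebuild">${wf:conf('mode') eq 'full'}</case>
        <default to="incremental-load"/>
    </switch>
</decision>
```

Cases are evaluated top to bottom; the first predicate that evaluates to true wins, and <default> is the required fallback.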