Apache Oozie
What Is Oozie?
Apache Oozie is a workflow scheduler and coordinator for Hadoop jobs. It lets you define multi-step pipelines — chains of MapReduce, Hive, Pig, Spark, HDFS, and shell actions — and schedule them based on time or data availability.
Oozie Server
│
├── Workflow Jobs ── sequential/parallel DAG of actions
├── Coordinator Jobs ── trigger workflows on schedule or data arrival
└── Bundle Jobs ── manage groups of coordinator jobs
All job definitions are written in XML and stored on HDFS.
Core Job Types
| Type | Description |
|---|---|
| Workflow | A DAG of actions; runs once when triggered |
| Coordinator | Schedules workflow runs based on time or input data readiness |
| Bundle | Groups multiple coordinator jobs for lifecycle management |
Workflow Definition (workflow.xml)
A simple ETL workflow: ingest → transform with Hive → verify output.
<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
    <start to="ingest-data"/>

    <!-- Action 1: Shell script to pull data -->
    <action name="ingest-data">
        <shell xmlns="uri:oozie:shell-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>ingest.sh</exec>
            <argument>${dt}</argument>
            <file>ingest.sh#ingest.sh</file>
        </shell>
        <ok to="transform-hive"/>
        <error to="send-failure-email"/>
    </action>

    <!-- Action 2: Hive query -->
    <action name="transform-hive">
        <hive xmlns="uri:oozie:hive-action:0.6">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>transform.hql</script>
            <param>dt=${dt}</param>
            <file>transform.hql#transform.hql</file>
        </hive>
        <ok to="verify-output"/>
        <error to="send-failure-email"/>
    </action>

    <!-- Action 3: Set output permissions; chmod fails (and routes to the error path) if the output directory is missing -->
    <action name="verify-output">
        <fs>
            <chmod path="${outputDir}" permissions="755" dir-files="false"/>
        </fs>
        <ok to="end"/>
        <error to="send-failure-email"/>
    </action>

    <!-- Email on failure -->
    <action name="send-failure-email">
        <email xmlns="uri:oozie:email-action:0.2">
            <to>ops@example.com</to>
            <subject>Oozie workflow FAILED: daily-etl ${dt}</subject>
            <body>Check the Oozie console for details: http://oozie-host:11000/oozie</body>
        </email>
        <ok to="fail"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>

    <end name="end"/>
</workflow-app>
Job Properties (job.properties)
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.wf.application.path=${nameNode}/user/oozie/workflows/daily-etl
outputDir=/data/output/daily
dt=2026-04-29
queueName=default
oozie.use.system.libpath=true
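Oozie expands `${...}` references between properties at submission time, so `oozie.wf.application.path` above resolves against `nameNode`. A plain-shell analogue of that substitution, using the values from the file:

```shell
# Plain-shell analogue of Oozie's ${nameNode} substitution in job.properties.
nameNode="hdfs://namenode:8020"
appPath="${nameNode}/user/oozie/workflows/daily-etl"
echo "$appPath"   # -> hdfs://namenode:8020/user/oozie/workflows/daily-etl
```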
Coordinator Definition (coordinator.xml)
Run the workflow daily at 2 AM UTC and only when input data for that day exists:
<coordinator-app name="daily-etl-coord"
                 frequency="${coord:days(1)}"
                 start="2026-01-01T02:00Z"
                 end="2030-12-31T02:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="input-data"
                 frequency="${coord:days(1)}"
                 initial-instance="2026-01-01T00:00Z"
                 timezone="UTC">
            <uri-template>/data/input/${YEAR}/${MONTH}/${DAY}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="input-data">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${nameNode}/user/oozie/workflows/daily-etl</app-path>
            <configuration>
                <property>
                    <name>dt</name>
                    <value>${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
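The `dt` property above is the coordinator's nominal time rendered as `yyyy-MM-dd`, and the dataset's `uri-template` resolves against the same time. A rough plain-shell analogue for one materialized action (GNU `date` assumed; the variable names are illustrative):

```shell
# Approximate what ${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}
# and the /data/input/${YEAR}/${MONTH}/${DAY} uri-template yield
# for one materialized coordinator action.
nominal="2026-04-29T02:00Z"                        # the action's nominal time
dt=$(date -u -d "$nominal" +%Y-%m-%d)              # value passed as dt
inputDir=$(date -u -d "$nominal" +/data/input/%Y/%m/%d)
echo "$dt"        # -> 2026-04-29
echo "$inputDir"  # -> /data/input/2026/04/29
```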
Submitting and Managing Jobs
# Submit and start a workflow
oozie job -oozie http://oozie-host:11000/oozie \
    -config job.properties -run

# Submit a coordinator
oozie job -oozie http://oozie-host:11000/oozie \
    -config coordinator-job.properties -run

# Check job status
oozie job -oozie http://oozie-host:11000/oozie \
    -info 0000001-260429120000000-oozie-oozi-W

# List running jobs
oozie jobs -oozie http://oozie-host:11000/oozie \
    -status RUNNING

# Kill a job
oozie job -oozie http://oozie-host:11000/oozie \
    -kill 0000001-260429120000000-oozie-oozi-W

# Rerun a failed workflow, restarting only its failed nodes
oozie job -oozie http://oozie-host:11000/oozie \
    -rerun 0000001-260429120000000-oozie-oozi-W \
    -Doozie.wf.rerun.failnodes=true
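The coordinator submission above reads a `coordinator-job.properties` file that is not shown in this document. A minimal sketch, assuming the same cluster endpoints as `job.properties`; note the coordinator uses `oozie.coord.application.path` rather than `oozie.wf.application.path`:

```properties
# coordinator-job.properties (illustrative sketch)
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.coord.application.path=${nameNode}/user/oozie/workflows/daily-etl
oozie.use.system.libpath=true
```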
HDFS Layout for a Workflow
/user/oozie/workflows/daily-etl/
├── workflow.xml # workflow definition
├── coordinator.xml # (optional) coordinator
├── job.properties # (local) submitted with -config
├── lib/ # JARs auto-added to Hadoop classpath
│ └── my-udf.jar
├── transform.hql # Hive script
└── ingest.sh # Shell script
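Staging that layout and pushing it to HDFS might look like the following sketch; the upload steps are shown as comments because they need a configured Hadoop client, and the file names are the ones from the tree above:

```shell
# Assemble the application directory locally, then upload it to HDFS.
mkdir -p daily-etl/lib
touch daily-etl/workflow.xml daily-etl/coordinator.xml \
      daily-etl/transform.hql daily-etl/ingest.sh       # real files in practice
# cp my-udf.jar daily-etl/lib/                          # JARs land on the Hadoop classpath
# hdfs dfs -put -f daily-etl /user/oozie/workflows/     # upload (Hadoop client needed)
ls daily-etl
```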
Supported Action Types
| Action | Description |
|---|---|
| `<map-reduce>` | Run a MapReduce job |
| `<hive>` | Execute a HiveQL script |
| `<pig>` | Run a Pig Latin script |
| `<spark>` | Submit a Spark application |
| `<shell>` | Run a shell script |
| `<fs>` | HDFS operations (mkdir, move, delete, chmod) |
| `<java>` | Run a Java main class |
| `<email>` | Send notification emails |
| `<sub-workflow>` | Invoke another workflow |
| `<fork>` / `<join>` | Run actions in parallel |
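The `<sub-workflow>` action from the table can be sketched as follows; the child application path is illustrative, not from this document, and `<propagate-configuration/>` passes the parent workflow's configuration down to the child:

```xml
<!-- Illustrative sketch of a <sub-workflow> action -->
<action name="run-child">
    <sub-workflow>
        <app-path>${nameNode}/user/oozie/workflows/child-wf</app-path>
        <propagate-configuration/>
    </sub-workflow>
    <ok to="end"/>
    <error to="fail"/>
</action>
```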
Parallel Execution with Fork/Join
<fork name="parallel-tasks">
    <path start="task-a"/>
    <path start="task-b"/>
</fork>

<action name="task-a">
    <hive ...> ... </hive>
    <ok to="wait-for-both"/>
    <error to="fail"/>
</action>

<action name="task-b">
    <pig ...> ... </pig>
    <ok to="wait-for-both"/>
    <error to="fail"/>
</action>

<join name="wait-for-both" to="next-step"/>