Apache Oozie

What Is Oozie?

Apache Oozie is a workflow scheduler and coordinator for Hadoop jobs. It lets you define multi-step pipelines — chains of MapReduce, Hive, Pig, Spark, HDFS, and shell actions — and schedule them based on time or data availability.

Oozie Server

├── Workflow Jobs ── sequential/parallel DAG of actions
├── Coordinator Jobs ── trigger workflows on schedule or data arrival
└── Bundle Jobs ── manage groups of coordinator jobs

All job definitions are written in XML and stored on HDFS.

Core Job Types

Type          Description
Workflow      A DAG of actions; runs once when triggered
Coordinator   Schedules workflow runs based on time or input data readiness
Bundle        Groups multiple coordinator jobs for lifecycle management
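
Bundle jobs get their own XML definition. A minimal bundle.xml sketch that wraps the coordinator defined later on this page (the kick-off time and bundle name are illustrative assumptions):

```xml
<!-- Hypothetical bundle.xml: manages the daily-etl coordinator as part of a group -->
<bundle-app name="etl-bundle" xmlns="uri:oozie:bundle:0.2">
    <controls>
        <!-- assumption: earliest time the bundle's coordinators may start -->
        <kick-off-time>2026-01-01T00:00Z</kick-off-time>
    </controls>
    <coordinator name="daily-etl-coord">
        <app-path>${nameNode}/user/oozie/workflows/daily-etl</app-path>
    </coordinator>
</bundle-app>
```

A bundle is submitted like any other job, with oozie.bundle.application.path set in its properties file.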

Workflow Definition (workflow.xml)

A simple ETL workflow: ingest → transform with Hive → verify output.

<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">

    <start to="ingest-data"/>

    <!-- Action 1: Shell script to pull data -->
    <action name="ingest-data">
        <shell xmlns="uri:oozie:shell-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>ingest.sh</exec>
            <argument>${dt}</argument>
            <file>ingest.sh#ingest.sh</file>
        </shell>
        <ok to="transform-hive"/>
        <error to="send-failure-email"/>
    </action>

    <!-- Action 2: Hive query -->
    <action name="transform-hive">
        <hive xmlns="uri:oozie:hive-action:0.6">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>transform.hql</script>
            <param>dt=${dt}</param>
            <file>transform.hql#transform.hql</file>
        </hive>
        <ok to="verify-output"/>
        <error to="send-failure-email"/>
    </action>

    <!-- Action 3: verify the output exists (chmod fails if the path is missing) -->
    <action name="verify-output">
        <fs>
            <chmod path="${outputDir}" permissions="755" dir-files="false"/>
        </fs>
        <ok to="end"/>
        <error to="send-failure-email"/>
    </action>

    <!-- Email on failure -->
    <action name="send-failure-email">
        <email xmlns="uri:oozie:email-action:0.2">
            <to>ops@example.com</to>
            <subject>Oozie workflow FAILED: daily-etl ${dt}</subject>
            <body>Check the Oozie console for details: http://oozie-host:11000/oozie</body>
        </email>
        <ok to="fail"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>

    <end name="end"/>

</workflow-app>

Job Properties (job.properties)

nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.wf.application.path=${nameNode}/user/oozie/workflows/daily-etl
outputDir=/data/output/daily
dt=2026-04-29
queueName=default
oozie.use.system.libpath=true

Coordinator Definition (coordinator.xml)

Run the workflow daily at 2 AM UTC and only when input data for that day exists:

<coordinator-app name="daily-etl-coord"
                 frequency="${coord:days(1)}"
                 start="2026-01-01T02:00Z"
                 end="2030-12-31T02:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">

    <datasets>
        <dataset name="input-data"
                 frequency="${coord:days(1)}"
                 initial-instance="2026-01-01T00:00Z"
                 timezone="UTC">
            <uri-template>/data/input/${YEAR}/${MONTH}/${DAY}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>

    <input-events>
        <data-in name="input" dataset="input-data">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>

    <action>
        <workflow>
            <app-path>${nameNode}/user/oozie/workflows/daily-etl</app-path>
            <configuration>
                <property>
                    <name>dt</name>
                    <value>${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}</value>
                </property>
            </configuration>
        </workflow>
    </action>

</coordinator-app>
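
The coordinator is submitted with its own properties file (referenced as coordinator-job.properties in the commands below). A sketch, reusing the values above; oozie.coord.application.path must point at the directory containing coordinator.xml:

```properties
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
# Points at the directory holding coordinator.xml (same dir as workflow.xml here)
oozie.coord.application.path=${nameNode}/user/oozie/workflows/daily-etl
oozie.use.system.libpath=true
outputDir=/data/output/daily
```

Note that dt is not set here: the coordinator computes it from the nominal time and passes it to each workflow run.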

Submitting and Managing Jobs

# Submit and start a workflow
oozie job -oozie http://oozie-host:11000/oozie \
    -config job.properties -run

# Submit a coordinator
oozie job -oozie http://oozie-host:11000/oozie \
    -config coordinator-job.properties -run

# Check job status
oozie job -oozie http://oozie-host:11000/oozie \
    -info 0000001-260429120000000-oozie-oozi-W

# List running jobs
oozie jobs -oozie http://oozie-host:11000/oozie \
    -status RUNNING

# Kill a job
oozie job -oozie http://oozie-host:11000/oozie \
    -kill 0000001-260429120000000-oozie-oozi-W

# Rerun a failed workflow, restarting only the nodes that failed
oozie job -oozie http://oozie-host:11000/oozie \
    -rerun 0000001-260429120000000-oozie-oozi-W \
    -Doozie.wf.rerun.failnodes=true
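
Two conveniences worth knowing, assuming a standard Oozie CLI install: the OOZIE_URL environment variable removes the need to repeat -oozie on every command, and oozie validate checks a workflow file against the Oozie XML schemas before you upload it:

```shell
# Set once per shell session; every oozie command then omits -oozie
export OOZIE_URL=http://oozie-host:11000/oozie

# Validate the workflow definition against the Oozie schemas
oozie validate workflow.xml

# Example: list running jobs without repeating the server URL
oozie jobs -status RUNNING
```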

HDFS Layout for a Workflow

/user/oozie/workflows/daily-etl/
├── workflow.xml       # workflow definition
├── coordinator.xml    # (optional) coordinator definition
├── job.properties     # kept locally; submitted with -config, not read from HDFS
├── lib/               # JARs automatically added to the Hadoop classpath
│   └── my-udf.jar
├── transform.hql      # Hive script
└── ingest.sh          # shell script
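
Getting this layout onto HDFS is ordinary hdfs dfs work. A sketch, assuming the same local filenames and target path:

```shell
# Create the application directory (and lib/) on HDFS
hdfs dfs -mkdir -p /user/oozie/workflows/daily-etl/lib

# Upload the definitions and scripts; -f overwrites earlier versions
hdfs dfs -put -f workflow.xml coordinator.xml transform.hql ingest.sh \
    /user/oozie/workflows/daily-etl/
hdfs dfs -put -f lib/my-udf.jar /user/oozie/workflows/daily-etl/lib/
```

job.properties stays on the local machine and travels with -config at submit time.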

Supported Action Types

Action            Description
<map-reduce>      Run a MapReduce job
<hive>            Execute a HiveQL script
<pig>             Run a Pig Latin script
<spark>           Submit a Spark application
<shell>           Run a shell script
<fs>              HDFS operations (mkdir, move, delete, chmod)
<java>            Run a Java main class
<email>           Send notification emails
<sub-workflow>    Invoke another workflow
<fork> / <join>   Run actions in parallel

Parallel Execution with Fork/Join

<fork name="parallel-tasks">
    <path start="task-a"/>
    <path start="task-b"/>
</fork>

<action name="task-a">
    <hive ...> ... </hive>
    <ok to="wait-for-both"/>
    <error to="fail"/>
</action>

<action name="task-b">
    <pig ...> ... </pig>
    <ok to="wait-for-both"/>
    <error to="fail"/>
</action>

<join name="wait-for-both" to="next-step"/>
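
Fork/join covers unconditional parallelism; conditional branching uses a <decision> node instead. A sketch with hypothetical node names, using the standard wf:conf EL function:

```xml
<!-- Hypothetical: route on a 'mode' property passed in job.properties -->
<decision name="choose-mode">
    <switch>
        <case to="full-rebuild">${wf:conf('mode') eq 'full'}</case>
        <default to="incremental-load"/>
    </switch>
</decision>
```

Cases are evaluated top to bottom; the first predicate that evaluates to true wins, and <default> is the required fallback.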