Apache Flink vs MapReduce: Batch Processing Has Evolved
MapReduce was the original distributed computing model that made Hadoop famous. Apache Flink is its modern successor — a unified stream and batch processing engine that runs up to 100x faster on certain workloads. But MapReduce isn't dead. Understanding when each is appropriate is a valuable engineering skill.
The Problem MapReduce Solved (and Created)
MapReduce was revolutionary in 2004. It gave engineers a simple mental model — map inputs to key-value pairs, reduce groups to aggregates — and automatically handled parallelism, fault tolerance, and data movement across thousands of machines.
The problem is the execution model. Every MapReduce job:
- Reads from disk (HDFS)
- Maps (in memory)
- Writes intermediate results to disk
- Shuffles data across the network
- Reads shuffle data from disk
- Reduces (in memory)
- Writes final output to disk
Disk I/O at every stage. For iterative algorithms (machine learning, graph processing) that run hundreds of passes over data, this means hundreds of disk reads/writes. It's catastrophically slow for anything beyond single-pass aggregations.
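The stage list above can be sketched in plain Java (a single-JVM illustration, not Hadoop code). The comments mark where a real MapReduce job would touch disk or the network:

```java
import java.util.*;
import java.util.stream.*;

// Illustration only: the MapReduce stages, simulated in one JVM.
public class MapReduceStages {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase (real MR: input read from HDFS, output spilled to local disk)
        List<Map.Entry<String, Integer>> mapped = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());

        // Shuffle phase (real MR: data crosses the network, then hits disk again)
        Map<String, List<Integer>> shuffled = mapped.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase (real MR: final output written back to HDFS)
        return shuffled.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> e.getValue().stream().mapToInt(Integer::intValue).sum()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not to be")));
    }
}
```

Even in this toy form, the three distinct phases are visible; in a real job, each phase boundary is a disk barrier.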
What Flink Does Differently
Apache Flink uses a pipelined, in-memory execution model based on data streams. Data flows through operators without touching disk (unless explicitly checkpointed or memory is exhausted):
Source → Map → Filter → KeyBy → Window → Aggregate → Sink
(records flow through pipeline; no disk writes between operators)
For iterative workloads, Flink supports native iteration — data cycles back through the pipeline without being written to and read back from HDFS:
Source → Iteration Start → [processing] → Iteration End → Sink
↑________________________↑
(in-memory feedback loop)
This makes Flink dramatically faster for:
- Machine learning training loops
- Graph algorithms (PageRank, shortest path)
- Stream processing with event-time windows
- Multi-stage ETL with joins
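To see why the feedback loop matters, here is a toy PageRank in plain Java (an illustration, not Flink's iteration API; it assumes every page has at least one out-link). All thirty iterations reuse the rank array in memory, whereas on MapReduce each iteration would be a separate job writing ranks to HDFS and reading them back:

```java
import java.util.Arrays;

// Illustration only: a tiny PageRank loop on a 3-page graph, all state in memory.
public class InMemoryPageRank {
    public static double[] pageRank(int[][] outLinks, int iterations, double damping) {
        int n = outLinks.length;
        double[] ranks = new double[n];
        Arrays.fill(ranks, 1.0 / n);
        for (int iter = 0; iter < iterations; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - damping) / n);    // teleport term
            for (int page = 0; page < n; page++) {
                double share = damping * ranks[page] / outLinks[page].length;
                for (int target : outLinks[page]) {
                    next[target] += share;           // distribute rank along out-links
                }
            }
            ranks = next;                            // feedback edge: no disk round-trip
        }
        return ranks;
    }

    public static void main(String[] args) {
        // page 0 -> 1, page 1 -> 2, page 2 -> 0 (a simple cycle)
        int[][] graph = {{1}, {2}, {0}};
        System.out.println(Arrays.toString(pageRank(graph, 30, 0.85)));
    }
}
```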
Execution Model Comparison
| Aspect | MapReduce | Apache Flink |
|---|---|---|
| Processing model | Batch only | Unified stream + batch |
| Execution style | Stage-by-stage, disk between stages | Pipelined, in-memory |
| Latency | Minutes (seconds minimum) | Milliseconds (streaming) to seconds (batch) |
| State management | External (HDFS, HBase) | Built-in keyed state with checkpointing |
| Fault tolerance | Task re-execution from HDFS | Checkpoint-based recovery |
| Iterative processing | Manual re-submission of jobs | Native iteration |
| Windowing | Manual time bucketing | First-class event-time windows |
| Exactly-once semantics | At-least-once (re-execution) | Yes (with checkpointing) |
| Memory management | JVM heap | Custom off-heap memory manager |
| API | Java (Mapper/Reducer interface) | Java, Scala, Python, SQL |
API Comparison: Word Count
The classic example shows how much simpler modern APIs are.
MapReduce Word Count
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

// Reducer
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Flink Word Count (DataStream API)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> text = env.readTextFile("hdfs:///input/");

DataStream<Tuple2<String, Integer>> counts = text
    .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
        for (String word : line.split("\\s+")) {
            out.collect(Tuple2.of(word, 1));
        }
    })
    .returns(Types.TUPLE(Types.STRING, Types.INT))  // lambdas erase generics; declare the output type
    .keyBy(t -> t.f0)
    .sum(1);

counts.writeAsText("hdfs:///output/");
env.execute("Word Count");
Flink Word Count (Table API / SQL)
EnvironmentSettings settings = EnvironmentSettings.inBatchMode();
TableEnvironment tenv = TableEnvironment.create(settings);
tenv.executeSql("""
CREATE TABLE words (word STRING)
WITH ('connector' = 'filesystem', 'path' = 'hdfs:///input/', 'format' = 'csv')
""");
tenv.executeSql("""
SELECT word, COUNT(*) as cnt
FROM words
GROUP BY word
""").print();
Streaming: Where MapReduce Simply Doesn't Compete
MapReduce has no native streaming support. The workarounds of its era, such as micro-batching with frequently scheduled MapReduce jobs or bolting a separate system like Apache Storm onto the stack, were painful. Flink was built streaming-first.
Flink Streaming Example: Real-Time Fraud Detection
DataStream<Transaction> transactions = env
    .fromSource(
        KafkaSource.<Transaction>builder()
            // brokers, topic, deserializer elided
            .build(),
        WatermarkStrategy.forMonotonousTimestamps(),  // event-time windows need watermarks
        "transactions");

DataStream<Alert> alerts = transactions
    .keyBy(Transaction::getUserId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .process(new FraudDetectionFunction());  // stateful processing

alerts.sinkTo(KafkaSink.<Alert>builder() /* brokers, topic elided */ .build());
This processes millions of transactions per second with sub-second latency and maintains per-user state across the 5-minute window. MapReduce cannot do this — it requires a complete dataset to be available before processing begins.
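The windowing at the heart of that pipeline reduces to per-key bucketing by floor(eventTime / windowSize). Here is a plain-Java sketch of that core idea (illustration only; Flink layers watermarks, state backends, and fault tolerance on top, and the `Txn` record is invented for this sketch):

```java
import java.util.*;

// Illustration only: tumbling event-time windows keyed by user ID.
public class TumblingWindowSketch {
    record Txn(String userId, long eventTimeMs, double amount) {}

    static final long WINDOW_MS = 5 * 60 * 1000;  // 5-minute tumbling window

    // Returns (userId + "@" + windowStart) -> transaction count in that window
    public static Map<String, Long> countPerUserWindow(List<Txn> txns) {
        Map<String, Long> counts = new HashMap<>();
        for (Txn t : txns) {
            long windowStart = (t.eventTimeMs() / WINDOW_MS) * WINDOW_MS;
            String key = t.userId() + "@" + windowStart;
            counts.merge(key, 1L, Long::sum);  // per-key, per-window state
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Txn> txns = List.of(
                new Txn("alice", 0, 10.0),
                new Txn("alice", 60_000, 20.0),   // same 5-minute window as the first
                new Txn("alice", 300_000, 5.0),   // next window
                new Txn("bob", 10_000, 99.0));
        System.out.println(countPerUserWindow(txns));
    }
}
```

The hard parts Flink solves are everything this sketch omits: deciding when a window is complete despite out-of-order events (watermarks), and keeping the per-key counts safe across failures (checkpointed state).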
Fault Tolerance: Different Philosophies
MapReduce Fault Tolerance
If a task fails, the framework (the JobTracker in classic MapReduce, the MapReduce ApplicationMaster on YARN) re-runs it from its original HDFS input split. No completed work is lost, because job input lives in HDFS and each stage's output is materialized to disk. Recovery is simple but expensive: re-reading the input and re-running the task takes time proportional to the task's runtime.
Flink Fault Tolerance: Checkpointing
Flink takes periodic snapshots (checkpoints) of all operator state to durable storage (HDFS or S3). On failure, Flink restores from the last checkpoint and replays input events:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Checkpoint every 60 seconds
env.enableCheckpointing(60_000);
// Use RocksDB for large state (doesn't fit in JVM heap)
env.setStateBackend(new EmbeddedRocksDBStateBackend());
// Store checkpoints in HDFS
env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink-checkpoints/");
Recovery time is bounded by checkpoint interval + replay time (not full re-computation). For long-running streaming jobs, this is far superior to MapReduce's restart-from-scratch model.
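The recovery model can be sketched in plain Java (illustration only; the class and method names are invented for this sketch): snapshot the state together with the input offset, and on failure restore the snapshot and replay only the events after it:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: checkpoint-based recovery for a keyed counter.
public class CheckpointSketch {
    private Map<String, Long> state = new HashMap<>();
    private Map<String, Long> lastCheckpoint = new HashMap<>();
    private long lastCheckpointedOffset = 0;

    public void process(String key) { state.merge(key, 1L, Long::sum); }

    public void checkpoint(long offset) {   // snapshot state + input position together
        lastCheckpoint = new HashMap<>(state);
        lastCheckpointedOffset = offset;
    }

    public long recover() {                 // restore snapshot, return replay point
        state = new HashMap<>(lastCheckpoint);
        return lastCheckpointedOffset;
    }

    public Map<String, Long> getState() { return state; }

    public static void main(String[] args) {
        String[] events = {"a", "b", "a", "b", "a"};
        CheckpointSketch job = new CheckpointSketch();
        for (int i = 0; i < 3; i++) job.process(events[i]);
        job.checkpoint(3);                  // checkpoint after 3 events
        job.process(events[3]);             // ...then the job "crashes"
        long replayFrom = job.recover();    // restore; replay only from offset 3
        for (long i = replayFrom; i < events.length; i++) job.process(events[(int) i]);
        System.out.println(job.getState()); // state holds a=3, b=2: each event counted exactly once
    }
}
```

The key property is that the state snapshot and the input offset are captured atomically; replaying from the saved offset against the restored state counts nothing twice, which is the essence of Flink's exactly-once guarantee.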
Running Flink on Hadoop
Flink integrates with the Hadoop ecosystem as a first-class citizen:
# Submit a Flink job to YARN
# -yjm: JobManager memory (MB); -ytm: TaskManager memory (MB); -ys: slots per TaskManager
# (inline comments after a trailing backslash would break the line continuation)
flink run -m yarn-cluster \
  -yjm 2048 \
  -ytm 4096 \
  -ys 4 \
  my-flink-app.jar
// Read from HDFS: Flink uses Hadoop's filesystem abstraction natively
env.readTextFile("hdfs://namenode:9000/input/data.txt");

// Write to HDFS
stream.writeAsText("hdfs://namenode:9000/output/");

// Read/write Parquet via Flink's filesystem connector
tenv.executeSql("""
CREATE TABLE sales (
id BIGINT, amount DECIMAL(10,2), dt STRING
) WITH (
'connector' = 'filesystem',
'path' = 'hdfs:///warehouse/sales/',
'format' = 'parquet'
)
""");
Flink also reads from HBase, Hive tables (via Hive connector), Kafka, JDBC, and anything reachable via a Hadoop filesystem.
When MapReduce Is Still the Right Choice
MapReduce hasn't disappeared, and it's still reasonable in specific scenarios:
- Legacy job stability — a MapReduce ETL that's run reliably for years with no issues. Rewriting it in Flink adds risk with limited benefit.
- Very simple batch jobs — a single-pass aggregation with no iterative logic. MapReduce's simplicity is actually a virtue here.
- Hive on MapReduce — existing Hive jobs backed by MapReduce work fine for overnight ETL. Migrating to Tez or Spark is a better target than Flink for HiveQL workloads.
- Team expertise — if your team knows MapReduce deeply and the jobs work, the operational cost of a Flink cluster may not be justified.
Performance Benchmark: Representative Numbers
These are indicative figures — actual results vary by hardware, data, and job complexity:
| Workload | MapReduce | Flink |
|---|---|---|
| Word count (1TB text) | ~15 min | ~2 min |
| Iterative PageRank (30 iters) | ~10 hrs | ~20 min |
| Sort (1TB) | ~8 min | ~3 min |
| Streaming (events/sec per node) | N/A | 500K–2M |
| Complex ETL with joins | ~45 min | ~5 min |
The performance gap widens significantly for iterative algorithms and multi-stage jobs with joins.
Summary
| Criteria | MapReduce | Apache Flink |
|---|---|---|
| Primary use case | Simple batch ETL | Unified stream + batch |
| Latency | Minutes | Milliseconds to seconds |
| Iterative algorithms | Poor | Excellent (native iteration) |
| Streaming | No | Yes (event-time, exactly-once) |
| State management | External only | Built-in, checkpointed |
| API complexity | High (boilerplate) | Moderate (DataStream/SQL) |
| Hadoop integration | Native | Full (HDFS, YARN, HMS) |
| Production maturity | Very high (20 years) | High (10+ years) |
| Best migration path | Hive → Tez, or Spark | New streaming/batch workloads |
For new workloads in 2025, Flink is the stronger technical choice across almost every dimension. MapReduce's legacy is in the ecosystem it spawned — Hive, Pig, and the architectural patterns that Flink improved upon. Understanding MapReduce makes you a better Flink engineer; building new pipelines on raw MapReduce is now rarely justified.
