
Apache Flink vs MapReduce: Batch Processing Has Evolved

· 7 min read
Hadoop.so Editorial Team
Big Data Engineers

MapReduce was the original distributed computing model that made Hadoop famous. Apache Flink is its modern successor — a unified stream and batch processing engine that runs up to 100x faster on certain workloads. But MapReduce isn't dead. Understanding when each is appropriate is a valuable engineering skill.

The Problem MapReduce Solved (and Created)

MapReduce was revolutionary in 2004. It gave engineers a simple mental model — map inputs to key-value pairs, reduce groups to aggregates — and automatically handled parallelism, fault tolerance, and data movement across thousands of machines.

The problem is the execution model. Every MapReduce job:

  1. Reads from disk (HDFS)
  2. Maps (in memory)
  3. Writes intermediate results to disk
  4. Shuffles data across the network
  5. Reads shuffle data from disk
  6. Reduces (in memory)
  7. Writes final output to disk

Disk I/O at every stage. For iterative algorithms (machine learning, graph processing) that run hundreds of passes over data, this means hundreds of disk reads/writes. It's catastrophically slow for anything beyond single-pass aggregations.
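To make that concrete, here is a back-of-envelope model in plain Java. The constants are illustrative assumptions, not measurements: we charge each MapReduce iteration roughly four full passes of the working set through disk (steps 1, 3, 5, and 7 above), while a pipelined engine pays disk I/O only at the source and the sink.

```java
// Back-of-envelope model of disk traffic (illustrative, not a benchmark).
public class IoPassModel {
    // MapReduce: ~4 disk passes of the working set per iteration
    // (read input, write intermediate, read shuffle, write output).
    static long mapReduceDiskBytes(long workingSetBytes, int iterations) {
        return workingSetBytes * 4L * iterations;
    }

    // Pipelined engine: one read at the source, one write at the sink,
    // regardless of how many times the data cycles through operators.
    static long pipelinedDiskBytes(long workingSetBytes, int iterations) {
        return workingSetBytes * 2L;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        long workingSet = 100 * gb;                         // assumed 100 GB
        System.out.println(mapReduceDiskBytes(workingSet, 30) / gb + " GB through disk (MapReduce, 30 iterations)");
        System.out.println(pipelinedDiskBytes(workingSet, 30) / gb + " GB through disk (pipelined)");
    }
}
```

With 30 iterations the model already shows a 60x gap in disk traffic, which is the mechanism behind the headline speedups.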


Apache Flink uses a pipelined, in-memory execution model based on data streams. Data flows through operators without touching disk (unless explicitly checkpointed or memory is exhausted):

Source → Map → Filter → KeyBy → Window → Aggregate → Sink
(records flow through pipeline; no disk writes between operators)

For iterative workloads, Flink supports native iteration — data cycles back through the pipeline without being written to and read back from HDFS:

Source → Iteration Start → [processing] → Iteration End → Sink
↑________________________↑
(in-memory feedback loop)

This makes Flink dramatically faster for:

  • Machine learning training loops
  • Graph algorithms (PageRank, shortest path)
  • Stream processing with event-time windows
  • Multi-stage ETL with joins
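To see why resident state matters for the first two bullets, here is a self-contained PageRank on a hypothetical four-node graph in plain Java. The rank array survives in memory between iterations, which is exactly what Flink's native iteration preserves and what MapReduce's write-to-HDFS-and-read-back cycle destroys.

```java
import java.util.Arrays;

// Minimal in-memory PageRank on a tiny, made-up graph.
public class TinyPageRank {
    static double[] pageRank(int[][] outLinks, int iterations, double damping) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                      // uniform start
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);      // teleport term
            for (int page = 0; page < n; page++) {
                double share = damping * rank[page] / outLinks[page].length;
                for (int target : outLinks[page]) next[target] += share;
            }
            rank = next;   // state stays in memory between iterations
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] graph = { {1, 2}, {2}, {0}, {0, 2} };    // page -> out-links
        System.out.println(Arrays.toString(pageRank(graph, 30, 0.85)));
    }
}
```

In a MapReduce implementation, each of the 30 loop bodies would be a separate job, with the rank vector round-tripped through HDFS every time.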

Execution Model Comparison

AspectMapReduceApache Flink
Processing modelBatch onlyUnified stream + batch
Execution styleStage-by-stage, disk between stagesPipelined, in-memory
LatencyMinutes (seconds minimum)Milliseconds (streaming) to seconds (batch)
State managementExternal (HDFS, HBase)Built-in keyed state with checkpointing
Fault toleranceTask re-execution from HDFSCheckpoint-based recovery
Iterative processingManual re-submission of jobsNative iteration
WindowingManual time bucketingFirst-class event-time windows
Exactly-once semanticsAt-least-once (re-execution)Yes (with checkpointing)
Memory managementJVM heapCustom off-heap memory manager
APIJava (Mapper/Reducer interface)Java, Scala, Python, SQL
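The "built-in keyed state with checkpointing" row deserves a concrete picture. This is a toy sketch in plain Java, not Flink's actual state API: state is a per-key map, a checkpoint is a snapshot copy, and recovery rolls the state back to the last snapshot (the lost events would then be replayed from the source).

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of keyed state + checkpointing (not Flink code).
public class KeyedStateSketch {
    private Map<String, Long> countsByKey = new HashMap<>();
    private Map<String, Long> lastCheckpoint = new HashMap<>();

    void process(String key)  { countsByKey.merge(key, 1L, Long::sum); }
    void checkpoint()         { lastCheckpoint = new HashMap<>(countsByKey); }
    void recover()            { countsByKey = new HashMap<>(lastCheckpoint); }
    long count(String key)    { return countsByKey.getOrDefault(key, 0L); }

    public static void main(String[] args) {
        KeyedStateSketch op = new KeyedStateSketch();
        op.process("alice");
        op.process("alice");
        op.checkpoint();               // snapshot: alice=2
        op.process("alice");           // alice=3, not yet checkpointed
        op.recover();                  // simulated failure: roll back to snapshot
        System.out.println(op.count("alice"));   // 2
    }
}
```

In MapReduce, any equivalent state has to live in an external system such as HBase, and the application owns its consistency.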

API Comparison: Word Count

The classic example shows how much simpler modern APIs are.

MapReduce Word Count

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

// Reducer
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Flink Word Count (DataStream API)

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> text = env.readTextFile("hdfs:///input/");

DataStream<Tuple2<String, Integer>> counts = text
    .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
        for (String word : line.split("\\s+")) {
            out.collect(Tuple2.of(word, 1));
        }
    })
    .returns(Types.TUPLE(Types.STRING, Types.INT)) // lambdas need an explicit result type (Java type erasure)
    .keyBy(t -> t.f0)
    .sum(1);

counts.writeAsText("hdfs:///output/");
env.execute("Word Count");
Flink Word Count (SQL)

TableEnvironment tenv = TableEnvironment.create(settings);

tenv.executeSql("""
    CREATE TABLE words (word STRING)
    WITH ('connector' = 'filesystem', 'path' = 'hdfs:///input/', 'format' = 'csv')
    """);

tenv.executeSql("""
    SELECT word, COUNT(*) AS cnt
    FROM words
    GROUP BY word
    """).print();
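For a sense of how far APIs have come from the Mapper/Reducer boilerplate, the same logic fits in a few lines of plain Java streams. This runs locally rather than on a cluster, but the shape is close to what Flink's DataStream API gives you in distributed form.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Word count in plain Java streams: flatMap is the "map" phase,
// groupingBy + counting is the "shuffle + reduce" phase.
public class LocalWordCount {
    static Map<String, Long> wordCount(Stream<String> lines) {
        return lines
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .filter(word -> !word.isEmpty())
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(Stream.of("to be or", "not to be"));
        System.out.println(counts.get("to"));   // 2
        System.out.println(counts.get("be"));   // 2
    }
}
```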

Streaming: Where MapReduce Simply Doesn't Compete

MapReduce has no native streaming support. Workarounds existed (micro-batching with frequent MapReduce jobs, or pairing Hadoop with Apache Storm), but they were painful. Flink was built streaming-first.

// KafkaSource/KafkaSink instances are built with their builder APIs
// (KafkaSource.builder() / KafkaSink.builder()); configuration elided here.
DataStream<Transaction> transactions = env
    .fromSource(kafkaSource, watermarkStrategy, "transactions");

DataStream<Alert> alerts = transactions
    .keyBy(Transaction::getUserId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .process(new FraudDetectionFunction()); // stateful processing

alerts.sinkTo(kafkaSink);

This processes millions of transactions per second with sub-second latency and maintains per-user state across the 5-minute window. MapReduce cannot do this — it requires a complete dataset to be available before processing begins.
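What the tumbling window does can be sketched in plain Java with no Flink dependency. Each event is assigned to the fixed 5-minute window containing its timestamp, and state is aggregated per (user, window) pair; the event fields and values below are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of 5-minute tumbling event-time windows.
public class TumblingWindowSketch {
    static final long WINDOW_MS = 5 * 60 * 1000;

    // Window start for an event timestamp: floor to the window size.
    static long windowStart(long eventTimeMs) {
        return eventTimeMs - (eventTimeMs % WINDOW_MS);
    }

    public static void main(String[] args) {
        long[][] events = {              // {userId, eventTimeMs, amount}
            {1, 0,       50},
            {1, 60_000,  70},            // same window as the first event
            {1, 320_000, 10},            // next window (starts at 300_000)
        };
        Map<String, Long> totals = new HashMap<>();   // "user@windowStart" -> sum
        for (long[] e : events) {
            String key = e[0] + "@" + windowStart(e[1]);
            totals.merge(key, e[2], Long::sum);
        }
        System.out.println(totals.get("1@0"));        // 120
        System.out.println(totals.get("1@300000"));   // 10
    }
}
```

Flink adds what this sketch omits: distribution across nodes, watermarks for late data, and fault-tolerant per-key state.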


Fault Tolerance: Different Philosophies

MapReduce Fault Tolerance

If a task fails, the framework (the TaskTracker in Hadoop 1, the MRAppMaster under YARN) re-runs it from its input split. Nothing durable is lost, because job input and output live in HDFS and intermediate map output is materialized to disk. Recovery is simple but expensive: re-reading the input and re-running the task takes time proportional to the task's runtime.

Flink Fault Tolerance

Flink takes periodic snapshots (checkpoints) of all operator state to durable storage (HDFS or S3). On failure, Flink restores from the last checkpoint and replays input events:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Checkpoint every 60 seconds
env.enableCheckpointing(60_000);

// Use RocksDB for large state (doesn't fit in JVM heap)
env.setStateBackend(new EmbeddedRocksDBStateBackend());

// Store checkpoints in HDFS
env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink-checkpoints/");

Recovery time is bounded by checkpoint interval + replay time (not full re-computation). For long-running streaming jobs, this is far superior to MapReduce's restart-from-scratch model.
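That bound can be sanity-checked with simple arithmetic. The rates and restore time below are illustrative assumptions, not measurements:

```java
// Worst-case recovery arithmetic for the checkpoint model described above.
public class RecoveryBound {
    // At most one checkpoint interval of input must be replayed.
    static long maxReplayEvents(long checkpointIntervalMs, long eventsPerSecond) {
        return checkpointIntervalMs / 1000 * eventsPerSecond;
    }

    // Rough worst case: restore the snapshot, then replay one interval.
    static long worstCaseRecoveryMs(long restoreMs, long checkpointIntervalMs) {
        return restoreMs + checkpointIntervalMs;
    }

    public static void main(String[] args) {
        // 60 s interval at an assumed 500K events/s per node:
        System.out.println(maxReplayEvents(60_000, 500_000) + " events to replay, worst case");
        System.out.println(worstCaseRecoveryMs(5_000, 60_000) + " ms worst-case recovery");
    }
}
```

The key property is that neither term depends on how long the job has been running, whereas re-executing a MapReduce task costs time proportional to the task itself.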


Hadoop Ecosystem Integration

Flink integrates with the Hadoop ecosystem as a first-class citizen:

# Submit a Flink job to YARN
# -yjm: JobManager memory (MB), -ytm: TaskManager memory (MB), -ys: slots per TaskManager
flink run -m yarn-cluster \
  -yjm 2048 \
  -ytm 4096 \
  -ys 4 \
  my-flink-app.jar

// Read from HDFS: Flink uses Hadoop's filesystem abstraction natively
env.readTextFile("hdfs://namenode:9000/input/data.txt");

// Write to HDFS
stream.writeAsText("hdfs://namenode:9000/output/");

// Read/write Parquet via Flink's filesystem connector
tenv.executeSql("""
    CREATE TABLE sales (
      id BIGINT, amount DECIMAL(10,2), dt STRING
    ) WITH (
      'connector' = 'filesystem',
      'path' = 'hdfs:///warehouse/sales/',
      'format' = 'parquet'
    )
    """);

Flink also reads from HBase, Hive tables (via Hive connector), Kafka, JDBC, and anything reachable via a Hadoop filesystem.


When MapReduce Is Still the Right Choice

MapReduce hasn't disappeared, and it's still reasonable in specific scenarios:

  • Legacy job stability — a MapReduce ETL that's run reliably for years with no issues. Rewriting it in Flink adds risk with limited benefit.
  • Very simple batch jobs — a single-pass aggregation with no iterative logic. MapReduce's simplicity is actually a virtue here.
  • Hive on MapReduce — existing Hive jobs backed by MapReduce work fine for overnight ETL. Migrating to Tez or Spark is a better target than Flink for HiveQL workloads.
  • Team expertise — if your team knows MapReduce deeply and the jobs work, the operational cost of a Flink cluster may not be justified.

Performance Benchmark: Representative Numbers

These are indicative figures — actual results vary by hardware, data, and job complexity:

| Workload | MapReduce | Flink |
| --- | --- | --- |
| Word count (1 TB text) | ~15 min | ~2 min |
| Iterative PageRank (30 iterations) | ~10 hrs | ~20 min |
| Sort (1 TB) | ~8 min | ~3 min |
| Streaming (events/sec per node) | N/A | 500K–2M |
| Complex ETL with joins | ~45 min | ~5 min |

The performance gap widens significantly for iterative algorithms and multi-stage jobs with joins.


Summary

| Criteria | MapReduce | Apache Flink |
| --- | --- | --- |
| Primary use case | Simple batch ETL | Unified stream + batch |
| Latency | Minutes | Milliseconds to seconds |
| Iterative algorithms | Poor | Excellent (native iteration) |
| Streaming | No | Yes (event-time, exactly-once) |
| State management | External only | Built-in, checkpointed |
| API complexity | High (boilerplate) | Moderate (DataStream/SQL) |
| Hadoop integration | Native | Full (HDFS, YARN, HMS) |
| Production maturity | Very high (20 years) | High (10+ years) |
| Best migration path | Hive → Tez or Spark | New streaming/batch workloads |

For new workloads in 2025, Flink is the stronger technical choice across almost every dimension. MapReduce's legacy is in the ecosystem it spawned — Hive, Pig, and the architectural patterns that Flink improved upon. Understanding MapReduce makes you a better Flink engineer; building new pipelines on raw MapReduce is now rarely justified.