Apache Flink vs MapReduce: Batch Processing Has Evolved
MapReduce was the original distributed computing model that made Hadoop famous. Apache Flink is its modern successor — a unified stream and batch processing engine that runs up to 100x faster on certain workloads. But MapReduce isn't dead. Understanding when each is appropriate is a valuable engineering skill.
The Problem MapReduce Solved (and Created)
MapReduce was revolutionary in 2004. It gave engineers a simple mental model — map inputs to key-value pairs, reduce groups to aggregates — and automatically handled parallelism, fault tolerance, and data movement across thousands of machines.
The problem is the execution model. Every MapReduce job:
- Reads from disk (HDFS)
- Maps (in memory)
- Writes intermediate results to disk
- Shuffles data across the network
- Reads shuffle data from disk
- Reduces (in memory)
- Writes final output to disk
Disk I/O at every stage. For iterative algorithms (machine learning, graph processing) that run hundreds of passes over data, this means hundreds of disk reads/writes. It's catastrophically slow for anything beyond single-pass aggregations.
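The stage list above can be sketched in plain Java (a single-JVM illustration, not Hadoop code). The comments mark where a real MapReduce job would touch disk or the network:

```java
import java.util.*;
import java.util.stream.*;

// Illustration only: the MapReduce stages, simulated in one JVM.
public class MapReduceStages {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase (real MR: input read from HDFS, output spilled to local disk)
        List<Map.Entry<String, Integer>> mapped = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());

        // Shuffle phase (real MR: data crosses the network, then hits disk again)
        Map<String, List<Integer>> shuffled = mapped.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase (real MR: final output written back to HDFS)
        return shuffled.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> e.getValue().stream().mapToInt(Integer::intValue).sum()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not to be")));
    }
}
```

Even in this toy form, the three distinct phases are visible; in a real job, each phase boundary is a disk barrier.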
What Flink Does Differently
Apache Flink uses a pipelined, in-memory execution model based on data streams. Data flows through operators without touching disk (unless explicitly checkpointed or memory is exhausted):
Source → Map → Filter → KeyBy → Window → Aggregate → Sink
(records flow through pipeline; no disk writes between operators)
For iterative workloads, Flink supports native iteration — data cycles back through the pipeline without being written to and read back from HDFS:
Source → Iteration Start → [processing] → Iteration End → Sink
↑________________________↑
(in-memory feedback loop)
This makes Flink dramatically faster for:
- Machine learning training loops
- Graph algorithms (PageRank, shortest path)
- Stream processing with event-time windows
- Multi-stage ETL with joins
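To see why the feedback loop matters, here is a toy PageRank in plain Java (an illustration, not Flink's iteration API; it assumes every page has at least one out-link). All thirty iterations reuse the rank array in memory, whereas on MapReduce each iteration would be a separate job writing ranks to HDFS and reading them back:

```java
import java.util.Arrays;

// Illustration only: a tiny PageRank loop on a 3-page graph, all state in memory.
public class InMemoryPageRank {
    public static double[] pageRank(int[][] outLinks, int iterations, double damping) {
        int n = outLinks.length;
        double[] ranks = new double[n];
        Arrays.fill(ranks, 1.0 / n);
        for (int iter = 0; iter < iterations; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - damping) / n);    // teleport term
            for (int page = 0; page < n; page++) {
                double share = damping * ranks[page] / outLinks[page].length;
                for (int target : outLinks[page]) {
                    next[target] += share;           // distribute rank along out-links
                }
            }
            ranks = next;                            // feedback edge: no disk round-trip
        }
        return ranks;
    }

    public static void main(String[] args) {
        // page 0 -> 1, page 1 -> 2, page 2 -> 0 (a simple cycle)
        int[][] graph = {{1}, {2}, {0}};
        System.out.println(Arrays.toString(pageRank(graph, 30, 0.85)));
    }
}
```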
Execution Model Comparison
| Aspect | MapReduce | Apache Flink |
|---|---|---|
| Processing model | Batch only | Unified stream + batch |
| Execution style | Stage-by-stage, disk between stages | Pipelined, in-memory |
| Latency | Minutes (seconds minimum) | Milliseconds (streaming) to seconds (batch) |
| State management | External (HDFS, HBase) | Built-in keyed state with checkpointing |
| Fault tolerance | Task re-execution from HDFS | Checkpoint-based recovery |
| Iterative processing | Manual re-submission of jobs | Native iteration |
| Windowing | Manual time bucketing | First-class event-time windows |
| Exactly-once semantics | At-least-once (re-execution) | Yes (with checkpointing) |
| Memory management | JVM heap | Custom off-heap memory manager |
| API | Java (Mapper/Reducer interface) | Java, Scala, Python, SQL |
API Comparison: Word Count
The classic example shows how much simpler modern APIs are.
MapReduce Word Count
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

// Reducer
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Flink Word Count (DataStream API)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> text = env.readTextFile("hdfs:///input/");

DataStream<Tuple2<String, Integer>> counts = text
    .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
        for (String word : line.split("\\s+")) {
            out.collect(Tuple2.of(word, 1));
        }
    })
    .returns(Types.TUPLE(Types.STRING, Types.INT))  // lambdas erase generics; declare the output type
    .keyBy(t -> t.f0)
    .sum(1);

counts.writeAsText("hdfs:///output/");
env.execute("Word Count");
Flink Word Count (Table API / SQL)
EnvironmentSettings settings = EnvironmentSettings.inBatchMode();
TableEnvironment tenv = TableEnvironment.create(settings);
tenv.executeSql("""
CREATE TABLE words (word STRING)
WITH ('connector' = 'filesystem', 'path' = 'hdfs:///input/', 'format' = 'csv')
""");
tenv.executeSql("""
SELECT word, COUNT(*) as cnt
FROM words
GROUP BY word
""").print();
Streaming: Where MapReduce Simply Doesn't Compete
MapReduce has no native streaming support. The workarounds of its era, such as micro-batching with frequently scheduled MapReduce jobs or bolting a separate system like Apache Storm onto the stack, were painful. Flink was built streaming-first.
Flink Streaming Example: Real-Time Fraud Detection
DataStream<Transaction> transactions = env
    .fromSource(
        KafkaSource.<Transaction>builder()
            // brokers, topic, deserializer elided
            .build(),
        WatermarkStrategy.forMonotonousTimestamps(),  // event-time windows need watermarks
        "transactions");

DataStream<Alert> alerts = transactions
    .keyBy(Transaction::getUserId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .process(new FraudDetectionFunction());  // stateful processing

alerts.sinkTo(KafkaSink.<Alert>builder() /* brokers, topic elided */ .build());
This processes millions of transactions per second with sub-second latency and maintains per-user state across the 5-minute window. MapReduce cannot do this — it requires a complete dataset to be available before processing begins.
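The windowing at the heart of that pipeline reduces to per-key bucketing by floor(eventTime / windowSize). Here is a plain-Java sketch of that core idea (illustration only; Flink layers watermarks, state backends, and fault tolerance on top, and the `Txn` record is invented for this sketch):

```java
import java.util.*;

// Illustration only: tumbling event-time windows keyed by user ID.
public class TumblingWindowSketch {
    record Txn(String userId, long eventTimeMs, double amount) {}

    static final long WINDOW_MS = 5 * 60 * 1000;  // 5-minute tumbling window

    // Returns (userId + "@" + windowStart) -> transaction count in that window
    public static Map<String, Long> countPerUserWindow(List<Txn> txns) {
        Map<String, Long> counts = new HashMap<>();
        for (Txn t : txns) {
            long windowStart = (t.eventTimeMs() / WINDOW_MS) * WINDOW_MS;
            String key = t.userId() + "@" + windowStart;
            counts.merge(key, 1L, Long::sum);  // per-key, per-window state
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Txn> txns = List.of(
                new Txn("alice", 0, 10.0),
                new Txn("alice", 60_000, 20.0),   // same 5-minute window as the first
                new Txn("alice", 300_000, 5.0),   // next window
                new Txn("bob", 10_000, 99.0));
        System.out.println(countPerUserWindow(txns));
    }
}
```

The hard parts Flink solves are everything this sketch omits: deciding when a window is complete despite out-of-order events (watermarks), and keeping the per-key counts safe across failures (checkpointed state).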
Fault Tolerance: Different Philosophies
MapReduce Fault Tolerance
If a task fails, the framework (the JobTracker in classic MapReduce, the MapReduce ApplicationMaster on YARN) re-runs it from its original HDFS input split. No completed work is lost, because job input lives in HDFS and each stage's output is materialized to disk. Recovery is simple but expensive: re-reading the input and re-running the task takes time proportional to the task's runtime.
Flink Fault Tolerance: Checkpointing
Flink takes periodic snapshots (checkpoints) of all operator state to durable storage (HDFS or S3). On failure, Flink restores from the last checkpoint and replays input events:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Checkpoint every 60 seconds
env.enableCheckpointing(60_000);
// Use RocksDB for large state (doesn't fit in JVM heap)
env.setStateBackend(new EmbeddedRocksDBStateBackend());
// Store checkpoints in HDFS
env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink-checkpoints/");
Recovery time is bounded by checkpoint interval + replay time (not full re-computation). For long-running streaming jobs, this is far superior to MapReduce's restart-from-scratch model.
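The recovery model can be sketched in plain Java (illustration only; the class and method names are invented for this sketch): snapshot the state together with the input offset, and on failure restore the snapshot and replay only the events after it:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: checkpoint-based recovery for a keyed counter.
public class CheckpointSketch {
    private Map<String, Long> state = new HashMap<>();
    private Map<String, Long> lastCheckpoint = new HashMap<>();
    private long lastCheckpointedOffset = 0;

    public void process(String key) { state.merge(key, 1L, Long::sum); }

    public void checkpoint(long offset) {   // snapshot state + input position together
        lastCheckpoint = new HashMap<>(state);
        lastCheckpointedOffset = offset;
    }

    public long recover() {                 // restore snapshot, return replay point
        state = new HashMap<>(lastCheckpoint);
        return lastCheckpointedOffset;
    }

    public Map<String, Long> getState() { return state; }

    public static void main(String[] args) {
        String[] events = {"a", "b", "a", "b", "a"};
        CheckpointSketch job = new CheckpointSketch();
        for (int i = 0; i < 3; i++) job.process(events[i]);
        job.checkpoint(3);                  // checkpoint after 3 events
        job.process(events[3]);             // ...then the job "crashes"
        long replayFrom = job.recover();    // restore; replay only from offset 3
        for (long i = replayFrom; i < events.length; i++) job.process(events[(int) i]);
        System.out.println(job.getState()); // state holds a=3, b=2: each event counted exactly once
    }
}
```

The key property is that the state snapshot and the input offset are captured atomically; replaying from the saved offset against the restored state counts nothing twice, which is the essence of Flink's exactly-once guarantee.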
Running Flink on Hadoop
Flink integrates with the Hadoop ecosystem as a first-class citizen:
# Submit a Flink job to YARN
# -yjm: JobManager memory (MB); -ytm: TaskManager memory (MB); -ys: slots per TaskManager
# (inline comments after a trailing backslash would break the line continuation)
flink run -m yarn-cluster \
  -yjm 2048 \
  -ytm 4096 \
  -ys 4 \
  my-flink-app.jar
// Read from HDFS: Flink uses Hadoop's filesystem abstraction natively
env.readTextFile("hdfs://namenode:9000/input/data.txt");

// Write to HDFS
stream.writeAsText("hdfs://namenode:9000/output/");

// Read/write Parquet via Flink's filesystem connector
tenv.executeSql("""
CREATE TABLE sales (
id BIGINT, amount DECIMAL(10,2), dt STRING
) WITH (
'connector' = 'filesystem',
'path' = 'hdfs:///warehouse/sales/',
'format' = 'parquet'
)
""");
Flink also reads from HBase, Hive tables (via Hive connector), Kafka, JDBC, and anything reachable via a Hadoop filesystem.
When MapReduce Is Still the Right Choice
MapReduce hasn't disappeared, and it's still reasonable in specific scenarios:
- Legacy job stability — a MapReduce ETL that's run reliably for years with no issues. Rewriting it in Flink adds risk with limited benefit.
- Very simple batch jobs — a single-pass aggregation with no iterative logic. MapReduce's simplicity is actually a virtue here.
- Hive on MapReduce — existing Hive jobs backed by MapReduce work fine for overnight ETL. Migrating to Tez or Spark is a better target than Flink for HiveQL workloads.
- Team expertise — if your team knows MapReduce deeply and the jobs work, the operational cost of a Flink cluster may not be justified.
Performance Benchmark: Representative Numbers
These are indicative figures — actual results vary by hardware, data, and job complexity:
| Workload | MapReduce | Flink |
|---|---|---|
| Word count (1TB text) | ~15 min | ~2 min |
| Iterative PageRank (30 iters) | ~10 hrs | ~20 min |
| Sort (1TB) | ~8 min | ~3 min |
| Streaming (events/sec per node) | N/A | 500K–2M |
| Complex ETL with joins | ~45 min | ~5 min |
The performance gap widens significantly for iterative algorithms and multi-stage jobs with joins.
Summary
| Criteria | MapReduce | Apache Flink |
|---|---|---|
| Primary use case | Simple batch ETL | Unified stream + batch |
| Latency | Minutes | Milliseconds to seconds |
| Iterative algorithms | Poor | Excellent (native iteration) |
| Streaming | No | Yes (event-time, exactly-once) |
| State management | External only | Built-in, checkpointed |
| API complexity | High (boilerplate) | Moderate (DataStream/SQL) |
| Hadoop integration | Native | Full (HDFS, YARN, HMS) |
| Production maturity | Very high (20 years) | High (10+ years) |
| Best migration path | Hive → Tez, or Spark | New streaming/batch workloads |
For new workloads in 2025, Flink is the stronger technical choice across almost every dimension. MapReduce's legacy is in the ecosystem it spawned — Hive, Pig, and the architectural patterns that Flink improved upon. Understanding MapReduce makes you a better Flink engineer; building new pipelines on raw MapReduce is now rarely justified.
