What's New in Apache Hadoop 3

Hadoop.so Editorial Team
Big Data Engineers

The Apache Hadoop 3.x line is a major release series that improves storage efficiency, availability, and scalability. Here's a quick tour of the most important changes.

Erasure Coding — Slash Storage Costs by 50%

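The biggest storage improvement in Hadoop 3 is Erasure Coding (EC) for HDFS. Previously, the default 3x replication carried a 200% storage overhead, because every block was stored three times. With EC using a Reed-Solomon policy such as RS-6-3, HDFS stores three parity cells for every six data cells. That is a 50% overhead, and a stripe can still survive the loss of any three cells (3x replication survives the loss of two replicas). The trade-off is extra CPU and network work for encoding and reconstruction, so EC is best suited to cold or infrequently accessed data.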

# Enable erasure coding on a directory (applies to files written after the policy is set)
hdfs ec -setPolicy -policy RS-6-3-1024k -path /data/cold-storage
# Confirm which policy is in effect
hdfs ec -getPolicy -path /data/cold-storage
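
The overhead figures quoted for replication and Reed-Solomon coding follow from simple arithmetic. The sketch below works it through in plain shell; the function names and the 100 TB dataset size are illustrative assumptions, not part of any Hadoop tool.

```shell
#!/bin/sh
# Storage overhead as a percentage of the logical data size.

# Replication with factor r keeps (r - 1) extra copies of every block.
replication_overhead_pct() {
  echo $(( ($1 - 1) * 100 ))
}

# A Reed-Solomon policy with d data cells and p parity cells per stripe
# stores p extra cells for every d cells of data.
ec_overhead_pct() {
  echo $(( $2 * 100 / $1 ))
}

# Raw capacity (TB) needed for a logical size (TB) at a given overhead percentage.
raw_tb() {
  echo $(( $1 * (100 + $2) / 100 ))
}

logical_tb=100   # hypothetical dataset size
echo "3x replication: $(replication_overhead_pct 3)% overhead, $(raw_tb "$logical_tb" "$(replication_overhead_pct 3)") TB raw"
echo "RS-6-3: $(ec_overhead_pct 6 3)% overhead, $(raw_tb "$logical_tb" "$(ec_overhead_pct 6 3)") TB raw"
```

For 100 TB of logical data this reports 300 TB of raw capacity under 3x replication and 150 TB under RS-6-3, which is where the 50% saving in the heading comes from.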

Support for More Than 2 NameNodes in HA

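Hadoop 2 supported exactly two NameNodes in HA mode: one active and one standby. Hadoop 3 allows additional standby NameNodes, so the cluster can survive the loss of more than one NameNode. There is no hard cap, but the HDFS documentation suggests not exceeding five, with three recommended, because of the communication overhead.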
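
The NameNodes in a nameservice are listed in the `dfs.ha.namenodes.<nameservice>` property as comma-separated IDs. As a rough sketch, a pre-flight check on that list might look like the following. The nameservice name `mycluster`, the IDs `nn1,nn2,nn3`, and both helper functions are hypothetical.

```shell
#!/bin/sh
# Count the NameNode IDs in a comma-separated list such as the value of
# dfs.ha.namenodes.<nameservice>.
count_namenodes() {
  echo "$1" | tr ',' '\n' | grep -c .
}

# Succeed when the count is between 2 and 5; otherwise explain why not.
check_namenode_count() {
  n=$(count_namenodes "$1")
  if [ "$n" -lt 2 ]; then
    echo "only $n NameNode(s): HA needs at least 2"
    return 1
  elif [ "$n" -gt 5 ]; then
    echo "$n NameNodes: more than the suggested maximum of 5"
    return 1
  fi
  echo "$n NameNodes: ok"
}

# Hypothetical IDs. On a live cluster the list could instead be read with:
#   hdfs getconf -confKey dfs.ha.namenodes.mycluster
check_namenode_count "nn1,nn2,nn3"
```

Once the cluster is running, `hdfs haadmin -getAllServiceState` shows which NameNode is active and which are standbys.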

YARN Timeline Service v2

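The redesigned YARN Timeline Service v2 addresses the scalability limits of v1, which ran as a single writer/reader instance with its own storage. v2 separates writes from reads, collects data through distributed per-application collectors, and stores it in Apache HBase. This lets job history and metrics be collected and queried on large clusters without funnelling everything through one process.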

Intra-DataNode Balancer

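A new intra-DataNode disk balancer (`hdfs diskbalancer`) spreads data evenly across the disks of a single DataNode, for example after a disk has been added or replaced. The existing HDFS Balancer only evens out data between DataNodes, so a node could still end up with one nearly full disk that degrades performance.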

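# Generate a plan; the command prints where the plan file was written in HDFS
hdfs diskbalancer -plan datanode.example.com
# Execute the plan, passing the full path printed by the previous step
hdfs diskbalancer -execute <plan-output-dir>/datanode.example.com.plan.json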

Java 8 Minimum + Dropped Support for Older Versions

Hadoop 3 dropped support for Java 7 and requires Java 8 as a minimum, allowing the codebase to use Java 8 language features and APIs. Runtime support for Java 11 arrived later, in Hadoop 3.3.
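
Before upgrading, it can help to confirm that every node runs a new enough JVM. The sketch below parses a Java version string and compares the major version against 8; the helper names and sample version strings are illustrative assumptions.

```shell
#!/bin/sh
# Extract the major Java version from a version string.
# Handles the legacy "1.x" scheme (1.8.0_292 -> 8) and the newer one (11.0.2 -> 11).
java_major() {
  v=${1%%[-+_]*}        # drop suffixes such as _292 or -ea
  first=${v%%.*}
  if [ "$first" = "1" ]; then
    rest=${v#1.}
    echo "${rest%%.*}"
  else
    echo "$first"
  fi
}

# Succeed when the version string denotes Java 8 or newer.
meets_hadoop3_minimum() {
  [ "$(java_major "$1")" -ge 8 ]
}

# Sample version strings. On a real host the string could be taken from:
#   java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}'
for v in 1.7.0_80 1.8.0_292 11.0.2; do
  if meets_hadoop3_minimum "$v"; then
    echo "$v: ok"
  else
    echo "$v: too old for Hadoop 3"
  fi
done
```

The check only covers the minimum. Whether a newer JVM is supported at runtime depends on the specific Hadoop 3.x release.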