What's New in Apache Hadoop 3
Apache Hadoop 3.x is a major release line that improves storage efficiency, availability, and scalability over Hadoop 2. Here's a quick tour of the most important changes.
Erasure Coding — Slash Storage Costs by 50%
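The biggest storage improvement in Hadoop 3 is Erasure Coding (EC) for HDFS. Previously, the default 3x replication meant 200% storage overhead, because every block was stored three times. With EC using Reed-Solomon codes, the RS-6-3 policy stores 3 parity cells for every 6 data cells, so overhead drops to 50% while tolerating the loss of any 3 blocks in a group (3x replication tolerates the loss of 2 replicas). The trade-off is extra CPU and network cost for encoding and reconstruction, which is why EC fits best for cold or infrequently accessed data.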
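# Apply the RS-6-3-1024k erasure coding policy to a directory
# (only files written afterwards are erasure coded; existing files keep their layout)
hdfs ec -setPolicy -path /data/cold-storage -policy RS-6-3-1024k
# Confirm which policy is in effect on the directory
hdfs ec -getPolicy -path /data/cold-storage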
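To make the overhead figures concrete, here is a small arithmetic sketch in plain bash. The function names and the 100 TB data set are hypothetical and chosen only for illustration; nothing here calls Hadoop.

```shell
#!/usr/bin/env bash
# Storage overhead as a percentage of the logical data size.
# Replication with factor r keeps (r - 1) extra copies of every block.
replication_overhead_pct() {
  local factor=$1
  echo $(( (factor - 1) * 100 ))
}

# Reed-Solomon RS(d, p) stores p parity cells for every d data cells.
ec_overhead_pct() {
  local data_units=$1 parity_units=$2
  echo $(( parity_units * 100 / data_units ))
}

# Raw capacity (in TB) needed to hold a given logical size (in TB).
raw_tb() {
  local logical_tb=$1 overhead_pct=$2
  echo $(( logical_tb * (100 + overhead_pct) / 100 ))
}

rep=$(replication_overhead_pct 3)   # 200
ec=$(ec_overhead_pct 6 3)           # 50
echo "3x replication: ${rep}% overhead, $(raw_tb 100 "$rep") TB raw for 100 TB of data"
echo "RS-6-3:         ${ec}% overhead, $(raw_tb 100 "$ec") TB raw for 100 TB of data"
```

Under these assumptions, 100 TB of data needs 300 TB of raw capacity with 3x replication and 150 TB with RS-6-3, which is where the 50% saving comes from.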
Support for More Than 2 NameNodes in HA
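Hadoop 2 supported exactly two NameNodes in HA mode: one active and one standby. Hadoop 3 allows more than two, with one active NameNode and multiple standbys, so the cluster can lose more than one NameNode and keep serving. The Hadoop documentation suggests running 3 and not exceeding 5, because each additional NameNode adds communication overhead.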
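As a rough sketch of how you might check an HA layout, the helper below counts the NameNode IDs configured for a nameservice and flags counts outside the 2 to 5 range. The function names and the nameservice ID "mycluster" are hypothetical; the hdfs commands only run if the hdfs CLI is on the PATH.

```shell
#!/usr/bin/env bash
# Count the NameNode IDs in a comma-separated list such as "nn1,nn2,nn3".
count_namenodes() {
  local ids=$1
  local -a parts
  IFS=',' read -r -a parts <<< "$ids"
  echo "${#parts[@]}"
}

# Classify an HA layout by its NameNode count.
check_ha_layout() {
  local n
  n=$(count_namenodes "$1")
  if (( n < 2 )); then
    echo "not-ha"
  elif (( n > 5 )); then
    echo "above-5"
  else
    echo "ok"
  fi
}

# "mycluster" is a hypothetical nameservice ID; replace it with your own.
if command -v hdfs >/dev/null 2>&1; then
  ids=$(hdfs getconf -confKey dfs.ha.namenodes.mycluster)
  echo "NameNodes: $ids -> $(check_ha_layout "$ids")"
  # Show which NameNode is active and which are standby.
  hdfs haadmin -getAllServiceState
fi
```

For example, `check_ha_layout "nn1,nn2,nn3"` prints `ok`, while a single ID prints `not-ha`.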
YARN Timeline Service v2
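YARN Timeline Service v2 is a redesign aimed at scalability. Version 1 relied on a single writer and reader instance, which became a bottleneck on large clusters. Version 2 distributes writes across per-application collectors, separates the write path from the read path, and uses Apache HBase as its backing store, so collecting and querying application history and metrics scales with the cluster.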
Intra-DataNode Balancer
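The existing HDFS balancer evens out data across DataNodes but not across the disks inside one node, so adding or replacing a disk could leave a DataNode badly skewed. The new intra-DataNode disk balancer moves blocks between the disks of a single DataNode, preventing single-disk hotspots that could degrade performance.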
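# Generate a plan for one DataNode; the command prints where the plan file was written
hdfs diskbalancer -plan datanode.example.com
# Execute the plan, passing the plan file path reported by the previous step
hdfs diskbalancer -execute datanode.example.com.plan.json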
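To illustrate the kind of skew the disk balancer is meant to remove, here is a toy calculation in plain bash. It is not the balancer's own algorithm; the function names and the disk sizes are hypothetical, picked to mimic a node that just received a new, nearly empty disk.

```shell
#!/usr/bin/env bash
# Percentage of a disk that is used, given used and capacity in GB.
used_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Spread between the fullest and emptiest disk, in percentage points.
# Arguments are "used:capacity" pairs, one per disk.
usage_spread() {
  local min=100 max=0 pair pct
  for pair in "$@"; do
    pct=$(used_pct "${pair%%:*}" "${pair##*:}")
    if (( pct < min )); then min=$pct; fi
    if (( pct > max )); then max=$pct; fi
  done
  echo $(( max - min ))
}

# Hypothetical DataNode: two old disks nearly full, one new disk almost empty.
echo "spread before balancing: $(usage_spread 900:1000 850:1000 50:1000) points"
# After an ideal rebalance, the same 1800 GB would sit as 600 GB on each disk.
echo "spread after balancing:  $(usage_spread 600:1000 600:1000 600:1000) points"
```

In this made-up case the spread falls from 85 percentage points to 0, which is the outcome a disk balancer plan aims for on a real node.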
Java 8 Minimum, Java 7 Support Dropped
Hadoop 3 dropped support for Java 7. Its JARs are compiled to target Java 8, so Java 8 is the minimum runtime, and the codebase can take advantage of Java 8 language and library features.
