HDFS Deep Dive

HDFS (Hadoop Distributed File System) is the primary storage layer of Hadoop. It is designed to run on commodity hardware and reliably store very large files — from gigabytes to petabytes.

Architecture

HDFS uses a master/worker architecture:

  • NameNode — Manages the filesystem namespace (file tree and metadata). There is one active NameNode (plus an optional Standby for HA).
  • DataNode — Stores actual data blocks. There are many DataNodes spread across the cluster.
  • Secondary NameNode — Periodically merges the NameNode's edit log with the filesystem image (not a hot standby).
A client first asks the NameNode where a file's blocks live, then reads or writes the data directly with the DataNodes:

Client
├─► NameNode (metadata: where is block X?)
└─► DataNode 1, DataNode 2, DataNode 3 (actual data blocks)

How Replication Works

By default, each block is stored as three replicas on different DataNodes. With rack awareness enabled, the default placement policy puts one replica on the writer's node (or a random node), and the other two on two different nodes in a single remote rack, balancing write cost against rack-failure tolerance. If a DataNode fails, the NameNode notices the missing heartbeats, marks its blocks under-replicated, and instructs surviving DataNodes to copy them until the target replication factor is restored.
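Replication multiplies raw disk usage, which is worth keeping in mind when sizing a cluster. The sketch below is plain shell arithmetic (no cluster needed) for a hypothetical 10 GB file at the default factor of 3; the `hdfs dfs -setrep` command shown in the comment changes the factor for existing paths.

```shell
#!/bin/sh
# Raw bytes consumed by a 10 GB file at replication factor 3.
FILE_GB=10
REPLICATION=3
RAW_GB=$((FILE_GB * REPLICATION))
echo "Raw storage used: ${RAW_GB} GB"

# To change the replication factor of an existing path
# (-w waits until re-replication finishes):
#   hdfs dfs -setrep -w 2 /user/hadoop/data/localfile.txt
```

The same arithmetic works in reverse for capacity planning: a cluster with 30 TB of raw disk holds roughly 10 TB of user data at replication 3.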

Basic HDFS Commands

# List root directory
hdfs dfs -ls /

# Create a directory
hdfs dfs -mkdir -p /user/hadoop/data

# Upload a local file
hdfs dfs -put localfile.txt /user/hadoop/data/

# Download a file from HDFS
hdfs dfs -get /user/hadoop/data/localfile.txt ./output.txt

# View file contents
hdfs dfs -cat /user/hadoop/data/localfile.txt

# Check disk usage
hdfs dfs -du -h /user/hadoop/

# Delete a file
hdfs dfs -rm /user/hadoop/data/localfile.txt

# Check filesystem health
hdfs fsck / -files -blocks

Block Size

The default HDFS block size is 128 MB (configurable per file or cluster-wide). Large blocks reduce NameNode memory usage — every file, directory, and block occupies an object in NameNode heap (roughly 150 bytes each, as a rule of thumb) — and cut seek overhead for large sequential reads.

# Upload with a 256 MB block size (applies only to this write)
hdfs dfs -D dfs.blocksize=256m -put bigfile.csv /data/
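The metadata effect of block size is easy to quantify. The following is plain shell arithmetic (ceiling division, runnable anywhere) counting the blocks a 1 GiB file occupies at 128 MB versus 256 MB blocks:

```shell
#!/bin/sh
# Blocks needed for a 1 GiB file = ceil(file_size / block_size).
FILE_BYTES=$((1024 * 1024 * 1024))
BLOCK_128=$((128 * 1024 * 1024))
BLOCK_256=$((256 * 1024 * 1024))
BLOCKS_AT_128=$(( (FILE_BYTES + BLOCK_128 - 1) / BLOCK_128 ))
BLOCKS_AT_256=$(( (FILE_BYTES + BLOCK_256 - 1) / BLOCK_256 ))
echo "128 MB blocks: ${BLOCKS_AT_128}"
echo "256 MB blocks: ${BLOCKS_AT_256}"
```

Doubling the block size halves the number of block objects the NameNode must track for the same data, which is why clusters storing mostly large files often raise `dfs.blocksize`.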

Safe Mode

After startup, HDFS enters safe mode while DataNodes report their blocks to the NameNode. In safe mode the namespace is read-only: writes, deletes, and replication changes are blocked until a configurable fraction of blocks (dfs.namenode.safemode.threshold-pct, default 0.999) satisfies minimum replication.

# Check status / wait for automatic exit / force exit (use with care)
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode wait
hdfs dfsadmin -safemode leave

Next Steps

Move on to MapReduce Fundamentals to learn how to process the data you store.