Rack Awareness

What Is Rack Awareness?

In a data center, servers are organized into racks — physical enclosures sharing a top-of-rack (ToR) switch. Network bandwidth within a rack is much higher than bandwidth between racks.

Without rack awareness, HDFS places all nodes in a single default rack and spreads replicas randomly across them. With rack awareness configured, HDFS places replicas to balance data locality against fault tolerance:

Default 3-replica placement:
Replica 1 → Same node as the writer, if the writer runs on a DataNode (maximum locality)
Replica 2 → Different rack (fault tolerance)
Replica 3 → Same rack as Replica 2, different node (bandwidth efficiency)

This means the cluster survives an entire rack failure without data loss, while keeping at least one replica close to clients for fast reads.
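
You can observe the policy on a real file with fsck, which prints each replica's rack when given the -racks flag (the path /data/events.log here is just a placeholder):

# Show each block's replica locations together with their rack paths
hdfs fsck /data/events.log -files -blocks -racks

With the default policy and three replicas, each block should list replicas spread across exactly two distinct racks.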

How It Works

HDFS determines rack location by calling an external topology script. The NameNode invokes this script with one or more DataNode IP addresses as arguments and expects one rack path string per address in return:

/datacenter1/rack01
/datacenter1/rack02
/datacenter2/rack01

If no script is configured, all nodes are placed in the default rack (/default-rack), which disables rack-aware placement.

Writing a Topology Script

Create /etc/hadoop/topology.sh:

#!/bin/bash
# Maps IP addresses to rack paths.
# The NameNode invokes this script with one or more IPs as arguments
# and expects one rack path per argument, printed in order.

RACK_MAP=(
  "10.0.1.0/24=/dc1/rack01"
  "10.0.2.0/24=/dc1/rack02"
  "10.0.3.0/24=/dc2/rack01"
  "10.0.4.0/24=/dc2/rack02"
)

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

for ip in "$@"; do
  rack="/default-rack"
  # Hostnames and malformed input fall through to the default rack.
  if [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
    ip_int=$(ip_to_int "$ip")
    for entry in "${RACK_MAP[@]}"; do
      cidr="${entry%%=*}"
      rack_path="${entry##*=}"
      prefix="${cidr#*/}"
      net_int=$(ip_to_int "${cidr%/*}")
      mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
      # Replica of the kernel's subnet test: compare masked network addresses.
      if (( (ip_int & mask) == (net_int & mask) )); then
        rack="$rack_path"
        break
      fi
    done
  fi
  echo "$rack"
done

Make it executable:

chmod +x /etc/hadoop/topology.sh

Test it before deploying:

/etc/hadoop/topology.sh 10.0.1.11 10.0.2.45 10.0.3.22
# Expected output:
# /dc1/rack01
# /dc1/rack02
# /dc2/rack01

Configuration

core-site.xml

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/topology.sh</value>
</property>

<!-- Maximum args passed to the script per invocation (default: 100) -->
<property>
  <name>net.topology.script.number.args</name>
  <value>100</value>
</property>

Restart the NameNode after changing topology configuration:

hdfs --daemon stop namenode
hdfs --daemon start namenode
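
If the script is picked up correctly, the NameNode logs each DataNode registration with its resolved rack path. A quick check (the log path below is an assumption; it varies by distribution):

# NetworkTopology logs topology resolution at DataNode registration time
grep "Adding a new node" /var/log/hadoop/hadoop-hdfs-namenode-*.log
# e.g. INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack01/10.0.1.11:9866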

Verifying Rack Assignment

# Show rack topology for all DataNodes
hdfs dfsadmin -report | grep -E "Name:|Rack:"

# Display topology tree
hdfs dfsadmin -printTopology

Example output:

Rack: /dc1/rack01
   10.0.1.11:9866 (dn01.example.com)
   10.0.1.12:9866 (dn02.example.com)

Rack: /dc1/rack02
   10.0.2.21:9866 (dn03.example.com)
   10.0.2.22:9866 (dn04.example.com)

Rack: /dc2/rack01
   10.0.3.31:9866 (dn05.example.com)
   10.0.3.32:9866 (dn06.example.com)

Using a Static Topology File

For clusters where IPs are stable, a Python script reading a static map file is simpler than subnet matching:

/etc/hadoop/topology.data:

10.0.1.11  /dc1/rack01
10.0.1.12  /dc1/rack01
10.0.2.21  /dc1/rack02
10.0.2.22  /dc1/rack02
10.0.3.31  /dc2/rack01

/etc/hadoop/topology.py:

#!/usr/bin/env python3
import sys

# Load the static IP -> rack map once per invocation.
topology = {}
with open("/etc/hadoop/topology.data") as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) == 2:
            topology[parts[0]] = parts[1]

# Print one rack path per argument; unknown IPs fall back to the default rack.
for ip in sys.argv[1:]:
    print(topology.get(ip, "/default-rack"))

Make it executable:

chmod +x /etc/hadoop/topology.py

Then set net.topology.script.file.name to /etc/hadoop/topology.py.
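
As with the shell version, test the mapping before pointing HDFS at it; any IP missing from the data file should fall back to the default rack:

/etc/hadoop/topology.py 10.0.1.11 10.0.9.99
# Expected output:
# /dc1/rack01
# /default-rack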

Impact on YARN and MapReduce

Rack awareness also benefits YARN scheduling:

  • YARN attempts to launch Map tasks on the node storing the input split (data-local).
  • If unavailable, it tries a node in the same rack (rack-local).
  • Only as a last resort does it schedule on a remote rack.

This locality preference dramatically reduces cross-rack network traffic for MapReduce and Spark jobs.
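
For MapReduce jobs, the locality split shows up in the job counters printed when a job completes. A quick check, assuming the client output was saved to a hypothetical file job-output.log:

# Counter display names from the standard MapReduce JobCounter group
grep -E "Data-local map tasks|Rack-local map tasks" job-output.log
# e.g.:
#   Data-local map tasks=142
#   Rack-local map tasks=6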

Multi-Datacenter Topology

For clusters spanning datacenters, use a three-level path:

/datacenter1/rack01
/datacenter1/rack02
/datacenter2/rack01

HDFS will prefer nearby placement for performance, but note that the default block placement policy only distinguishes the node and rack levels: each distinct path is treated as a rack, so a multi-level path does not by itself guarantee a replica in the remote datacenter. If you need cross-DC disaster recovery, verify where replicas actually land (see the check below) and keep enough replicas to spread (typically dfs.replication=3 or higher); strict guarantees require a custom block placement policy.
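
To verify that a file's replicas actually span both datacenters, aggregate fsck's rack output per DC (again, /data/events.log is a placeholder, and the grep pattern assumes the /dcN/rackNN naming used above):

# Count replica locations per datacenter across all blocks of the file
hdfs fsck /data/events.log -files -blocks -racks | grep -oE '/dc[0-9]+' | sort | uniq -c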

Summary

Scenario                     Rack Path Format    Benefit
Single DC, multiple racks    /rack01             Rack failure tolerance
Multiple DCs                 /dc1/rack01         DC failure tolerance
All on one rack              /default-rack       No benefit (avoid this in production)