Rack Awareness

What Is Rack Awareness?

In a data center, servers are organized into racks — physical enclosures sharing a top-of-rack (ToR) switch. Network bandwidth within a rack is much higher than bandwidth between racks.

Without rack awareness, HDFS places all nodes in a single default rack and spreads replicas randomly across them. With rack awareness configured, HDFS places replicas to balance data locality against fault tolerance:

Default 3-replica placement:
Replica 1 → Same node as the writer, if the writer runs on a DataNode (maximum locality)
Replica 2 → Different rack (fault tolerance)
Replica 3 → Same rack as Replica 2, different node (bandwidth efficiency)

This means the cluster survives an entire rack failure without data loss, while keeping at least one replica close to clients for fast reads.
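
You can observe the policy on a real file with fsck, which prints each replica's rack when given the -racks flag (the path /data/events.log here is just a placeholder):

# Show each block's replica locations together with their rack paths
hdfs fsck /data/events.log -files -blocks -racks

With the default policy and three replicas, each block should list replicas spread across exactly two distinct racks.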

How It Works

HDFS determines rack location by calling an external topology script. The NameNode invokes this script with one or more DataNode IP addresses as arguments and expects one rack path string per address in return:

/datacenter1/rack01
/datacenter1/rack02
/datacenter2/rack01

If no script is configured, all nodes are placed in the default rack (/default-rack), which disables rack-aware placement.

Writing a Topology Script

Create /etc/hadoop/topology.sh:

#!/bin/bash
# Maps IP addresses to rack paths.
# The NameNode invokes this script with one or more IPs as arguments
# and expects one rack path per argument, printed in order.

RACK_MAP=(
  "10.0.1.0/24=/dc1/rack01"
  "10.0.2.0/24=/dc1/rack02"
  "10.0.3.0/24=/dc2/rack01"
  "10.0.4.0/24=/dc2/rack02"
)

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

for ip in "$@"; do
  rack="/default-rack"
  # Hostnames and malformed input fall through to the default rack.
  if [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
    ip_int=$(ip_to_int "$ip")
    for entry in "${RACK_MAP[@]}"; do
      cidr="${entry%%=*}"
      rack_path="${entry##*=}"
      prefix="${cidr#*/}"
      net_int=$(ip_to_int "${cidr%/*}")
      mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
      # Replica of the kernel's subnet test: compare masked network addresses.
      if (( (ip_int & mask) == (net_int & mask) )); then
        rack="$rack_path"
        break
      fi
    done
  fi
  echo "$rack"
done

Make it executable:

chmod +x /etc/hadoop/topology.sh

Test it before deploying:

/etc/hadoop/topology.sh 10.0.1.11 10.0.2.45 10.0.3.22
# Expected output:
# /dc1/rack01
# /dc1/rack02
# /dc2/rack01

Configuration

core-site.xml

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/topology.sh</value>
</property>

<!-- Maximum args passed to the script per invocation (default: 100) -->
<property>
  <name>net.topology.script.number.args</name>
  <value>100</value>
</property>

Restart the NameNode after changing topology configuration:

hdfs --daemon stop namenode
hdfs --daemon start namenode
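
If the script is picked up correctly, the NameNode logs each DataNode registration with its resolved rack path. A quick check (the log path below is an assumption; it varies by distribution):

# NetworkTopology logs topology resolution at DataNode registration time
grep "Adding a new node" /var/log/hadoop/hadoop-hdfs-namenode-*.log
# e.g. INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack01/10.0.1.11:9866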

Verifying Rack Assignment

# Show rack topology for all DataNodes
hdfs dfsadmin -report | grep -E "Name:|Rack:"

# Display topology tree
hdfs dfsadmin -printTopology

Example output:

Rack: /dc1/rack01
   10.0.1.11:9866 (dn01.example.com)
   10.0.1.12:9866 (dn02.example.com)

Rack: /dc1/rack02
   10.0.2.21:9866 (dn03.example.com)
   10.0.2.22:9866 (dn04.example.com)

Rack: /dc2/rack01
   10.0.3.31:9866 (dn05.example.com)
   10.0.3.32:9866 (dn06.example.com)

Using a Static Topology File

For clusters where IPs are stable, a Python script reading a static map file is simpler than subnet matching:

/etc/hadoop/topology.data:

10.0.1.11  /dc1/rack01
10.0.1.12  /dc1/rack01
10.0.2.21  /dc1/rack02
10.0.2.22  /dc1/rack02
10.0.3.31  /dc2/rack01

/etc/hadoop/topology.py:

#!/usr/bin/env python3
import sys

# Load the static IP -> rack map once per invocation.
topology = {}
with open("/etc/hadoop/topology.data") as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) == 2:
            topology[parts[0]] = parts[1]

# Print one rack path per argument; unknown IPs fall back to the default rack.
for ip in sys.argv[1:]:
    print(topology.get(ip, "/default-rack"))

Make it executable:

chmod +x /etc/hadoop/topology.py

Then set net.topology.script.file.name to /etc/hadoop/topology.py.
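
As with the shell version, test the mapping before pointing HDFS at it; any IP missing from the data file should fall back to the default rack:

/etc/hadoop/topology.py 10.0.1.11 10.0.9.99
# Expected output:
# /dc1/rack01
# /default-rack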

Impact on YARN and MapReduce

Rack awareness also benefits YARN scheduling:

  • YARN attempts to launch Map tasks on the node storing the input split (data-local).
  • If unavailable, it tries a node in the same rack (rack-local).
  • Only as a last resort does it schedule on a remote rack.

This locality preference dramatically reduces cross-rack network traffic for MapReduce and Spark jobs.
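
For MapReduce jobs, the locality split shows up in the job counters printed when a job completes. A quick check, assuming the client output was saved to a hypothetical file job-output.log:

# Counter display names from the standard MapReduce JobCounter group
grep -E "Data-local map tasks|Rack-local map tasks" job-output.log
# e.g.:
#   Data-local map tasks=142
#   Rack-local map tasks=6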

Multi-Datacenter Topology

For clusters spanning datacenters, use a three-level path:

/datacenter1/rack01
/datacenter1/rack02
/datacenter2/rack01

HDFS will prefer nearby placement for performance, but note that the default block placement policy only distinguishes the node and rack levels: each distinct path is treated as a rack, so a multi-level path does not by itself guarantee a replica in the remote datacenter. If you need cross-DC disaster recovery, verify where replicas actually land (see the check below) and keep enough replicas to spread (typically dfs.replication=3 or higher); strict guarantees require a custom block placement policy.
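
To verify that a file's replicas actually span both datacenters, aggregate fsck's rack output per DC (again, /data/events.log is a placeholder, and the grep pattern assumes the /dcN/rackNN naming used above):

# Count replica locations per datacenter across all blocks of the file
hdfs fsck /data/events.log -files -blocks -racks | grep -oE '/dc[0-9]+' | sort | uniq -c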

Summary

Scenario                     Rack Path Format    Benefit
Single DC, multiple racks    /rack01             Rack failure tolerance
Multiple DCs                 /dc1/rack01         DC failure tolerance
All on one rack              /default-rack       No benefit (avoid this in production)