Hadoop Cluster Monitoring
Effective monitoring of a Hadoop cluster requires visibility at four levels: JVM internals, HDFS health, YARN resource usage, and OS-level metrics. This page covers the built-in tools and how to integrate with modern monitoring stacks.
Built-in Web UIs
Every Hadoop service exposes an HTTP dashboard — no extra software required:
| Service | Default URL | Key Information |
|---|---|---|
| NameNode | http://namenode:9870 | HDFS capacity, live/dead DataNodes, block health |
| DataNode | http://datanode:9864 | Block counts, disk usage per volume |
| ResourceManager | http://resourcemanager:8088 | Running/pending/completed applications, cluster utilization |
| NodeManager | http://nodemanager:8042 | Container logs, CPU/memory per node |
| HistoryServer | http://historyserver:19888 | Completed MapReduce job details, counters |
| HBase Master | http://hbasemaster:16010 | Region distribution, compaction status |
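For a quick liveness sweep across these endpoints, a minimal shell loop works; the hostnames below are the placeholder names from the table and should be replaced with your own:
# Probe each web UI and print the HTTP status code (200 = up)
for url in http://namenode:9870 http://datanode:9864 \
           http://resourcemanager:8088 http://nodemanager:8042 \
           http://historyserver:19888; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  echo "$url -> $code"
done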
JMX (Java Management Extensions)
All Hadoop daemons expose their JMX MBeans as JSON over HTTP via the /jmx endpoint on their web port:
# NameNode JMX — all beans
curl http://namenode:9870/jmx
# Filter to a specific MBean
curl "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"
# ResourceManager JMX
curl "http://resourcemanager:8088/jmx?qry=Hadoop:service=ResourceManager,name=RMNMInfo"
# DataNode JMX — disk and block metrics
curl "http://datanode:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeInfo"
Key JMX Metrics to Watch
NameNode (FSNamesystemState):
| Metric | Healthy Value |
|---|---|
| CapacityRemainingGB | > 15% of total |
| UnderReplicatedBlocks | 0 (or trending to 0) |
| CorruptBlocks | 0 |
| MissingBlocks | 0 |
| PendingDeletionBlocks | Should not grow unboundedly |
| NumLiveDataNodes | Equals expected cluster size |
| NumDeadDataNodes | 0 |
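These counters lend themselves to a scripted health check. A minimal sketch, assuming jq; it reads the FSNamesystem bean, which carries the MissingBlocks and CorruptBlocks counters on current Hadoop 3 releases (verify the exact bean and attribute names against your own /jmx output):
# Exit nonzero if any blocks are missing or corrupt
state=$(curl -s "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem")
missing=$(echo "$state" | jq '.beans[0].MissingBlocks')
corrupt=$(echo "$state" | jq '.beans[0].CorruptBlocks')
if [ "$missing" -gt 0 ] || [ "$corrupt" -gt 0 ]; then
  echo "CRITICAL: MissingBlocks=$missing CorruptBlocks=$corrupt"
  exit 1
fi
echo "OK: no missing or corrupt blocks"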
NameNode RPC (RpcActivityForPort8020):
| Metric | Alert Threshold |
|---|---|
| RpcQueueTimeAvgTime | > 100 ms |
| RpcProcessingTimeAvgTime | > 100 ms |
| CallQueueLength | > 100 |
DataNode (DataNodeActivity):
| Metric | Description |
|---|---|
| BytesWritten | Total bytes written (rate) |
| BytesRead | Total bytes read (rate) |
| BlocksWritten / BlocksRead | I/O throughput |
| VolumeFailures | Must be 0 — indicates disk failure |
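A failed disk on a single node is easy to overlook, so it is worth sweeping VolumeFailures across the fleet. A sketch assuming jq and the dn01-dn03 hostnames used in the Prometheus examples below; note the bean name embeds each node's host and port, hence the wildcard:
# Report volume failures on every DataNode
for dn in dn01 dn02 dn03; do
  failures=$(curl -s "http://${dn}:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeActivity-*" \
    | jq '.beans[0].VolumeFailures')
  echo "$dn: VolumeFailures=$failures"
done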
Hadoop Metrics2 Framework
Hadoop has a built-in metrics export framework (Metrics2) that can push metrics to external systems without JMX polling.
hadoop-metrics2.properties
# Push NameNode metrics to Graphite every 10 seconds
namenode.sink.graphite.class=org.apache.hadoop.metrics2.sink.GraphiteSink
namenode.sink.graphite.server_host=graphite.example.com
namenode.sink.graphite.server_port=2003
namenode.sink.graphite.metrics_prefix=hadoop.namenode
*.period=10
# Push all service metrics to a file (for debugging)
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
*.sink.file.filename=/var/log/hadoop/metrics.log
Built-in sinks include Graphite, Ganglia, StatsD, and file/stdout output; third-party plugins add Prometheus and InfluxDB support.
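Metrics2 configuration is read at daemon startup, so changes require a restart. To confirm the FileSink above is flowing (Hadoop 3 --daemon syntax assumed):
# Restart the NameNode so it rereads hadoop-metrics2.properties
hdfs --daemon stop namenode
hdfs --daemon start namenode
# Watch metrics stream into the file sink
tail -f /var/log/hadoop/metrics.log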
Prometheus + Grafana Stack
A common modern setup is to export JMX metrics with the Prometheus jmx_exporter Java agent and visualize them in Grafana.
Step 1 — Add JMX Exporter Java agent
Download jmx_prometheus_javaagent.jar and a config:
# jmx-namenode.yaml
lowercaseOutputName: true
rules:
- pattern: "Hadoop<service=NameNode, name=FSNamesystemState><>(\\w+)"
name: hadoop_namenode_$1
- pattern: "Hadoop<service=NameNode, name=RpcActivityForPort8020><>(\\w+)"
name: hadoop_namenode_rpc_$1
Step 2 — Add to daemon JVM options (hadoop-env.sh)
# Hadoop 3 variable name; Hadoop 2 used HADOOP_NAMENODE_OPTS
export HDFS_NAMENODE_OPTS="$HDFS_NAMENODE_OPTS \
  -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7000:/opt/jmx_exporter/jmx-namenode.yaml"
Repeat for the DataNode (HDFS_DATANODE_OPTS, port 7001), ResourceManager (YARN_RESOURCEMANAGER_OPTS, port 7002), and NodeManager (YARN_NODEMANAGER_OPTS, port 7003), each with its own rules file.
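After a daemon restart, each agent serves the Prometheus text format on its port. A quick sanity check that the NameNode rules matched:
# The exporter should list hadoop_namenode_* series
curl -s http://namenode:7000/metrics | grep hadoop_namenode_ | head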
Step 3 — prometheus.yml scrape config
scrape_configs:
- job_name: hadoop_namenode
static_configs:
- targets: ['namenode:7000']
- job_name: hadoop_datanodes
static_configs:
- targets:
- 'dn01:7001'
- 'dn02:7001'
- 'dn03:7001'
- job_name: hadoop_resourcemanager
static_configs:
- targets: ['resourcemanager:7002']
- job_name: hadoop_nodemanagers
static_configs:
- targets:
- 'dn01:7003'
- 'dn02:7003'
- 'dn03:7003'
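Before reloading Prometheus, validate the file with promtool, which ships in the Prometheus distribution:
# Check syntax and referenced files
promtool check config prometheus.yml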
Step 4 — Grafana dashboards
Import community dashboards from grafana.com/dashboards — search for "Hadoop" or "HDFS" to find pre-built panels for NameNode, YARN, and DataNode metrics.
YARN CLI Monitoring
# Interactive cluster utilization summary (similar to Unix top)
yarn top
# List configured node labels
yarn cluster -list-node-labels
# All running applications with resource usage
yarn application -list -appStates RUNNING
# Detailed info on a specific application
yarn application -status application_1714000000000_0042
# Kill a stuck or runaway application
yarn application -kill application_1714000000000_0042
# Node health report
yarn node -list -all
# Queue capacities and usage
yarn queue -status default
HDFS CLI Health Checks
# Overall cluster report: capacity, live nodes, block counts
hdfs dfsadmin -report
# Run filesystem check (reports corrupt/missing blocks — read-only)
hdfs fsck / -files -blocks -locations 2>&1 | tail -20
# Check a specific directory
hdfs fsck /data/warehouse -files -blocks
# List under-replicated blocks
hdfs fsck / | grep "Under replicated"
# Ensure safe mode is off (block re-replication is suspended while in safe mode)
hdfs dfsadmin -safemode leave
# View safe mode status
hdfs dfsadmin -safemode get
Key Alerts to Configure
| Alert | Condition | Severity |
|---|---|---|
| Missing blocks | MissingBlocks > 0 | Critical |
| Corrupt blocks | CorruptBlocks > 0 | Critical |
| Dead DataNodes | NumDeadDataNodes > 0 for > 5min | Warning |
| HDFS capacity | CapacityRemaining < 15% | Warning |
| HDFS capacity | CapacityRemaining < 5% | Critical |
| NameNode RPC queue | CallQueueLength > 100 | Warning |
| YARN memory | availableMB < 10% of totalMB | Warning |
| Volume failures | VolumeFailures > 0 | Critical |
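With the jmx_exporter pipeline above, the block-health alerts translate directly into Prometheus alerting rules. A sketch of the two critical ones; the metric names assume the Step 1 rewrite rules (plus lowercaseOutputName) and will differ if your rules differ:
# alert-rules.yml (reference it from rule_files in prometheus.yml)
groups:
  - name: hdfs-block-health
    rules:
      - alert: HdfsMissingBlocks
        expr: hadoop_namenode_missingblocks > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HDFS reports missing blocks"
      - alert: HdfsCorruptBlocks
        expr: hadoop_namenode_corruptblocks > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HDFS reports corrupt blocks"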