Hadoop Cluster Monitoring

Effective monitoring of a Hadoop cluster requires visibility at four levels: JVM internals, HDFS health, YARN resource usage, and OS-level metrics. This page covers the built-in tools and how to integrate with modern monitoring stacks.

Built-in Web UIs

Every Hadoop service exposes an HTTP dashboard — no extra software required:

Service         | Default URL                  | Key Information
----------------|------------------------------|-------------------------------------------------------------
NameNode        | http://namenode:9870         | HDFS capacity, live/dead DataNodes, block health
DataNode        | http://datanode:9864         | Block counts, disk usage per volume
ResourceManager | http://resourcemanager:8088  | Running/pending/completed applications, cluster utilization
NodeManager     | http://nodemanager:8042      | Container logs, CPU/memory per node
HistoryServer   | http://historyserver:19888   | Completed MapReduce job details, counters
HBase Master    | http://hbasemaster:16010     | Region distribution, compaction status
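
For a quick liveness sweep, the same endpoints can be probed from a shell. A minimal sketch, assuming the placeholder hostnames from the table resolve in your environment:

#!/usr/bin/env bash
# Probe each Hadoop web UI and report its HTTP status code.
for url in \
    http://namenode:9870 \
    http://datanode:9864 \
    http://resourcemanager:8088 \
    http://nodemanager:8042 \
    http://historyserver:19888; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  echo "$url -> HTTP $code"
done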

JMX (Java Management Extensions)

All Hadoop daemons expose metrics over JMX. Access them via HTTP:

# NameNode JMX — all beans
curl http://namenode:9870/jmx

# Filter to a specific MBean
curl "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"

# ResourceManager JMX
curl "http://resourcemanager:8088/jmx?qry=Hadoop:service=ResourceManager,name=RMNMInfo"

# DataNode JMX — disk and block metrics
curl "http://datanode:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeInfo"

Key JMX Metrics to Watch

NameNode (FSNamesystemState):

Metric                | Healthy Value
----------------------|--------------------------------
CapacityRemainingGB   | > 15% of total capacity
UnderReplicatedBlocks | 0 (or trending to 0)
CorruptBlocks         | 0
MissingBlocks         | 0
PendingDeletionBlocks | should not grow without bound
NumLiveDataNodes      | equals expected cluster size
NumDeadDataNodes      | 0

NameNode RPC (RpcActivityForPort8020):

Metric                   | Alert Threshold
-------------------------|----------------
RpcQueueTimeAvgTime      | > 100 ms
RpcProcessingTimeAvgTime | > 100 ms
CallQueueLength          | > 100

DataNode (DataNodeActivity):

Metric                     | Description
---------------------------|------------------------------------------------
BytesWritten               | total bytes written (monitor as a rate)
BytesRead                  | total bytes read (monitor as a rate)
BlocksWritten / BlocksRead | block I/O throughput
VolumeFailures             | must be 0; any increase indicates a failed disk
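
Before a full monitoring stack is in place, these thresholds can be checked from cron. A minimal sketch against the NameNode JMX endpoint, assuming jq and the placeholder hostname used throughout this page:

#!/usr/bin/env bash
# Flag non-zero block-health counters from the FSNamesystemState bean.
json=$(curl -s "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState")
for metric in UnderReplicatedBlocks NumDeadDataNodes; do
  val=$(echo "$json" | jq ".beans[0].$metric")
  if [ "$val" != "0" ]; then
    echo "WARNING: $metric = $val"
  fi
done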

Hadoop Metrics2 Framework

Hadoop has a built-in metrics export framework (Metrics2) that can push metrics to external systems without JMX polling.

hadoop-metrics2.properties

# Push NameNode metrics to Graphite every 10 seconds
namenode.sink.graphite.class=org.apache.hadoop.metrics2.sink.GraphiteSink
namenode.sink.graphite.server_host=graphite.example.com
namenode.sink.graphite.server_port=2003
namenode.sink.graphite.metrics_prefix=hadoop.namenode
*.period=10

# Push all service metrics to a file (for debugging)
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
*.sink.file.filename=/var/log/hadoop/metrics.log

Hadoop ships sinks for Graphite, Ganglia, StatsD, and plain files; third-party plugins add Prometheus and InfluxDB support.

Prometheus + Grafana Stack

The most common modern monitoring setup: expose JMX metrics with jmx_exporter, scrape them with Prometheus, and visualize them in Grafana.

Step 1 — Add JMX Exporter Java agent

Download jmx_prometheus_javaagent.jar and create a rules config:

# jmx-namenode.yaml
lowercaseOutputName: true
rules:
  - pattern: "Hadoop<service=NameNode, name=FSNamesystemState><>(\\w+)"
    name: hadoop_namenode_$1
  - pattern: "Hadoop<service=NameNode, name=RpcActivityForPort8020><>(\\w+)"
    name: hadoop_namenode_rpc_$1

Step 2 — Add to daemon JVM options (hadoop-env.sh)

# Hadoop 3.x variable name; on Hadoop 2.x use HADOOP_NAMENODE_OPTS instead
export HDFS_NAMENODE_OPTS="$HDFS_NAMENODE_OPTS \
  -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7000:/opt/jmx_exporter/jmx-namenode.yaml"

Repeat for DataNode (port 7001), ResourceManager (7002), NodeManager (7003).
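
For reference, the matching lines for the other daemons might look like the following. In Hadoop 3.x the DataNode reads hadoop-env.sh while the YARN daemons read yarn-env.sh; the per-daemon YAML rule files (jmx-datanode.yaml, etc.) are assumed to exist alongside the NameNode one:

# hadoop-env.sh
export HDFS_DATANODE_OPTS="$HDFS_DATANODE_OPTS \
  -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7001:/opt/jmx_exporter/jmx-datanode.yaml"

# yarn-env.sh
export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS \
  -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7002:/opt/jmx_exporter/jmx-resourcemanager.yaml"
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS \
  -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7003:/opt/jmx_exporter/jmx-nodemanager.yaml"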

Step 3 — prometheus.yml scrape config

scrape_configs:
  - job_name: hadoop_namenode
    static_configs:
      - targets: ['namenode:7000']

  - job_name: hadoop_datanodes
    static_configs:
      - targets:
          - 'dn01:7001'
          - 'dn02:7001'
          - 'dn03:7001'

  - job_name: hadoop_resourcemanager
    static_configs:
      - targets: ['resourcemanager:7002']

  - job_name: hadoop_nodemanagers
    static_configs:
      - targets:
          - 'dn01:7003'
          - 'dn02:7003'
          - 'dn03:7003'
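
Before wiring up dashboards, confirm each exporter endpoint actually serves metrics:

# Each agent port should return Prometheus-format metrics
curl -s http://namenode:7000/metrics | grep '^hadoop_namenode' | head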

Step 4 — Grafana dashboards

Import community dashboards from grafana.com/dashboards — search for "Hadoop" or "HDFS" to find pre-built panels for NameNode, YARN, and DataNode metrics.

YARN CLI Monitoring

# Live cluster utilization summary (interactive, like Unix top)
yarn top

# List node labels configured on the cluster
yarn cluster -list-node-labels

# All running applications with resource usage
yarn application -list -appStates RUNNING

# Detailed info on a specific application
yarn application -status application_1714000000000_0042

# Kill a stuck or runaway application
yarn application -kill application_1714000000000_0042

# Node health report
yarn node -list -all

# Queue capacities and usage
yarn queue -status default
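
These commands script well for lightweight alerting. A minimal sketch that flags unhealthy NodeManagers, assuming it runs on a host with a configured yarn client:

#!/usr/bin/env bash
# Count NodeManagers reported as UNHEALTHY or LOST and print them.
bad=$(yarn node -list -all 2>/dev/null | grep -cE 'UNHEALTHY|LOST')
if [ "$bad" -gt 0 ]; then
  echo "WARNING: $bad NodeManager(s) unhealthy or lost"
  yarn node -list -all 2>/dev/null | grep -E 'UNHEALTHY|LOST'
fi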

HDFS CLI Health Checks

# Overall cluster report: capacity, live nodes, block counts
hdfs dfsadmin -report

# Run filesystem check (reports corrupt/missing blocks — read-only)
hdfs fsck / -files -blocks -locations 2>&1 | tail -20

# Check a specific directory
hdfs fsck /data/warehouse -files -blocks

# List under-replicated blocks
hdfs fsck / | grep "Under replicated"

# Leave safe mode (block re-replication only proceeds once safe mode is off)
hdfs dfsadmin -safemode leave

# View safe mode status
hdfs dfsadmin -safemode get
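
Even without a metrics stack, a nightly cron wrapper around fsck catches block problems early. A minimal sketch (the mail recipient is a placeholder):

#!/usr/bin/env bash
# Alert if fsck reports anything other than a healthy filesystem.
report=$(hdfs fsck / 2>/dev/null | tail -40)
if ! echo "$report" | grep -q "is HEALTHY"; then
  echo "$report" | mail -s "HDFS fsck ALERT" ops@example.com
fi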

Key Alerts to Configure

Alert              | Condition                        | Severity
-------------------|----------------------------------|---------
Missing blocks     | MissingBlocks > 0                | Critical
Corrupt blocks     | CorruptBlocks > 0                | Critical
Dead DataNodes     | NumDeadDataNodes > 0 for > 5 min | Warning
HDFS capacity      | CapacityRemaining < 15%          | Warning
HDFS capacity      | CapacityRemaining < 5%           | Critical
NameNode RPC queue | CallQueueLength > 100            | Warning
YARN memory        | availableMB < 10% of totalMB     | Warning
Volume failures    | VolumeFailures > 0               | Critical
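
With the jmx_exporter rules shown earlier, these conditions translate almost directly into Prometheus alerting rules. A sketch assuming the hadoop_namenode_* metric names produced by the example config above (verify the exact names at the exporter's /metrics endpoint):

# alerts.yml: load via rule_files in prometheus.yml
groups:
  - name: hadoop
    rules:
      - alert: HdfsMissingBlocks
        expr: hadoop_namenode_missingblocks > 0
        for: 1m
        labels:
          severity: critical
      - alert: HdfsDeadDataNodes
        expr: hadoop_namenode_numdeaddatanodes > 0
        for: 5m
        labels:
          severity: warning
      - alert: NameNodeRpcQueueBacklog
        expr: hadoop_namenode_rpc_callqueuelength > 100
        for: 5m
        labels:
          severity: warning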