Hadoop Cluster Monitoring

Effective monitoring of a Hadoop cluster requires visibility at four levels: JVM internals, HDFS health, YARN resource usage, and OS-level metrics. This page covers the built-in tools and how to integrate with modern monitoring stacks.

Built-in Web UIs

Every Hadoop service exposes an HTTP dashboard — no extra software required:

Service         | Default URL                  | Key Information
----------------|------------------------------|-------------------------------------------------------------
NameNode        | http://namenode:9870         | HDFS capacity, live/dead DataNodes, block health
DataNode        | http://datanode:9864         | Block counts, disk usage per volume
ResourceManager | http://resourcemanager:8088  | Running/pending/completed applications, cluster utilization
NodeManager     | http://nodemanager:8042      | Container logs, CPU/memory per node
HistoryServer   | http://historyserver:19888   | Completed MapReduce job details, counters
HBase Master    | http://hbasemaster:16010     | Region distribution, compaction status
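
For a quick liveness sweep, the same endpoints can be probed from a shell. A minimal sketch, assuming the placeholder hostnames from the table resolve in your environment:

#!/usr/bin/env bash
# Probe each Hadoop web UI and report its HTTP status code.
for url in \
    http://namenode:9870 \
    http://datanode:9864 \
    http://resourcemanager:8088 \
    http://nodemanager:8042 \
    http://historyserver:19888; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  echo "$url -> HTTP $code"
done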

JMX (Java Management Extensions)

All Hadoop daemons expose metrics over JMX. Access them via HTTP:

# NameNode JMX — all beans
curl http://namenode:9870/jmx

# Filter to a specific MBean
curl "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"

# ResourceManager JMX
curl "http://resourcemanager:8088/jmx?qry=Hadoop:service=ResourceManager,name=RMNMInfo"

# DataNode JMX — disk and block metrics
curl "http://datanode:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeInfo"

Key JMX Metrics to Watch

NameNode (FSNamesystemState):

Metric                | Healthy Value
----------------------|--------------------------------
CapacityRemainingGB   | > 15% of total capacity
UnderReplicatedBlocks | 0 (or trending to 0)
CorruptBlocks         | 0
MissingBlocks         | 0
PendingDeletionBlocks | should not grow without bound
NumLiveDataNodes      | equals expected cluster size
NumDeadDataNodes      | 0

NameNode RPC (RpcActivityForPort8020):

Metric                   | Alert Threshold
-------------------------|----------------
RpcQueueTimeAvgTime      | > 100 ms
RpcProcessingTimeAvgTime | > 100 ms
CallQueueLength          | > 100

DataNode (DataNodeActivity):

Metric                     | Description
---------------------------|------------------------------------------------
BytesWritten               | total bytes written (monitor as a rate)
BytesRead                  | total bytes read (monitor as a rate)
BlocksWritten / BlocksRead | block I/O throughput
VolumeFailures             | must be 0; any increase indicates a failed disk
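
Before a full monitoring stack is in place, these thresholds can be checked from cron. A minimal sketch against the NameNode JMX endpoint, assuming jq and the placeholder hostname used throughout this page:

#!/usr/bin/env bash
# Flag non-zero block-health counters from the FSNamesystemState bean.
json=$(curl -s "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState")
for metric in UnderReplicatedBlocks NumDeadDataNodes; do
  val=$(echo "$json" | jq ".beans[0].$metric")
  if [ "$val" != "0" ]; then
    echo "WARNING: $metric = $val"
  fi
done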

Hadoop Metrics2 Framework

Hadoop has a built-in metrics export framework (Metrics2) that can push metrics to external systems without JMX polling.

hadoop-metrics2.properties

# Push NameNode metrics to Graphite every 10 seconds
namenode.sink.graphite.class=org.apache.hadoop.metrics2.sink.GraphiteSink
namenode.sink.graphite.server_host=graphite.example.com
namenode.sink.graphite.server_port=2003
namenode.sink.graphite.metrics_prefix=hadoop.namenode
*.period=10

# Push all service metrics to a file (for debugging)
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
*.sink.file.filename=/var/log/hadoop/metrics.log

Hadoop ships sinks for Graphite, Ganglia, StatsD, and plain files; third-party plugins add Prometheus and InfluxDB support.

Prometheus + Grafana Stack

The most common modern monitoring setup: expose JMX metrics with jmx_exporter, scrape them with Prometheus, and visualize them in Grafana.

Step 1 — Add JMX Exporter Java agent

Download jmx_prometheus_javaagent.jar and create a rules config:

# jmx-namenode.yaml
lowercaseOutputName: true
rules:
  - pattern: "Hadoop<service=NameNode, name=FSNamesystemState><>(\\w+)"
    name: hadoop_namenode_$1
  - pattern: "Hadoop<service=NameNode, name=RpcActivityForPort8020><>(\\w+)"
    name: hadoop_namenode_rpc_$1

Step 2 — Add to daemon JVM options (hadoop-env.sh)

# Hadoop 3.x variable name; on Hadoop 2.x use HADOOP_NAMENODE_OPTS instead
export HDFS_NAMENODE_OPTS="$HDFS_NAMENODE_OPTS \
  -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7000:/opt/jmx_exporter/jmx-namenode.yaml"

Repeat for DataNode (port 7001), ResourceManager (7002), NodeManager (7003).
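
For reference, the matching lines for the other daemons might look like the following. In Hadoop 3.x the DataNode reads hadoop-env.sh while the YARN daemons read yarn-env.sh; the per-daemon YAML rule files (jmx-datanode.yaml, etc.) are assumed to exist alongside the NameNode one:

# hadoop-env.sh
export HDFS_DATANODE_OPTS="$HDFS_DATANODE_OPTS \
  -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7001:/opt/jmx_exporter/jmx-datanode.yaml"

# yarn-env.sh
export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS \
  -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7002:/opt/jmx_exporter/jmx-resourcemanager.yaml"
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS \
  -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7003:/opt/jmx_exporter/jmx-nodemanager.yaml"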

Step 3 — prometheus.yml scrape config

scrape_configs:
  - job_name: hadoop_namenode
    static_configs:
      - targets: ['namenode:7000']

  - job_name: hadoop_datanodes
    static_configs:
      - targets:
          - 'dn01:7001'
          - 'dn02:7001'
          - 'dn03:7001'

  - job_name: hadoop_resourcemanager
    static_configs:
      - targets: ['resourcemanager:7002']

  - job_name: hadoop_nodemanagers
    static_configs:
      - targets:
          - 'dn01:7003'
          - 'dn02:7003'
          - 'dn03:7003'
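
Before wiring up dashboards, confirm each exporter endpoint actually serves metrics:

# Each agent port should return Prometheus-format metrics
curl -s http://namenode:7000/metrics | grep '^hadoop_namenode' | head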

Step 4 — Grafana dashboards

Import community dashboards from grafana.com/dashboards — search for "Hadoop" or "HDFS" to find pre-built panels for NameNode, YARN, and DataNode metrics.

YARN CLI Monitoring

# Live cluster utilization summary (interactive, like Unix top)
yarn top

# List node labels configured on the cluster
yarn cluster -list-node-labels

# All running applications with resource usage
yarn application -list -appStates RUNNING

# Detailed info on a specific application
yarn application -status application_1714000000000_0042

# Kill a stuck or runaway application
yarn application -kill application_1714000000000_0042

# Node health report
yarn node -list -all

# Queue capacities and usage
yarn queue -status default
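
These commands script well for lightweight alerting. A minimal sketch that flags unhealthy NodeManagers, assuming it runs on a host with a configured yarn client:

#!/usr/bin/env bash
# Count NodeManagers reported as UNHEALTHY or LOST and print them.
bad=$(yarn node -list -all 2>/dev/null | grep -cE 'UNHEALTHY|LOST')
if [ "$bad" -gt 0 ]; then
  echo "WARNING: $bad NodeManager(s) unhealthy or lost"
  yarn node -list -all 2>/dev/null | grep -E 'UNHEALTHY|LOST'
fi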

HDFS CLI Health Checks

# Overall cluster report: capacity, live nodes, block counts
hdfs dfsadmin -report

# Run filesystem check (reports corrupt/missing blocks — read-only)
hdfs fsck / -files -blocks -locations 2>&1 | tail -20

# Check a specific directory
hdfs fsck /data/warehouse -files -blocks

# List under-replicated blocks
hdfs fsck / | grep "Under replicated"

# Leave safe mode (block re-replication only proceeds once safe mode is off)
hdfs dfsadmin -safemode leave

# View safe mode status
hdfs dfsadmin -safemode get
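
Even without a metrics stack, a nightly cron wrapper around fsck catches block problems early. A minimal sketch (the mail recipient is a placeholder):

#!/usr/bin/env bash
# Alert if fsck reports anything other than a healthy filesystem.
report=$(hdfs fsck / 2>/dev/null | tail -40)
if ! echo "$report" | grep -q "is HEALTHY"; then
  echo "$report" | mail -s "HDFS fsck ALERT" ops@example.com
fi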

Key Alerts to Configure

Alert              | Condition                        | Severity
-------------------|----------------------------------|---------
Missing blocks     | MissingBlocks > 0                | Critical
Corrupt blocks     | CorruptBlocks > 0                | Critical
Dead DataNodes     | NumDeadDataNodes > 0 for > 5 min | Warning
HDFS capacity      | CapacityRemaining < 15%          | Warning
HDFS capacity      | CapacityRemaining < 5%           | Critical
NameNode RPC queue | CallQueueLength > 100            | Warning
YARN memory        | availableMB < 10% of totalMB     | Warning
Volume failures    | VolumeFailures > 0               | Critical
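
With the jmx_exporter rules shown earlier, these conditions translate almost directly into Prometheus alerting rules. A sketch assuming the hadoop_namenode_* metric names produced by the example config above (verify the exact names at the exporter's /metrics endpoint):

# alerts.yml: load via rule_files in prometheus.yml
groups:
  - name: hadoop
    rules:
      - alert: HdfsMissingBlocks
        expr: hadoop_namenode_missingblocks > 0
        for: 1m
        labels:
          severity: critical
      - alert: HdfsDeadDataNodes
        expr: hadoop_namenode_numdeaddatanodes > 0
        for: 5m
        labels:
          severity: warning
      - alert: NameNodeRpcQueueBacklog
        expr: hadoop_namenode_rpc_callqueuelength > 100
        for: 5m
        labels:
          severity: warning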