Apache ZooKeeper

What Is ZooKeeper?

Apache ZooKeeper is a distributed coordination service — a highly available system that provides primitives for building distributed applications: configuration management, distributed locking, leader election, and service discovery.

In the Hadoop ecosystem, ZooKeeper is a foundational dependency used by:

System    | ZooKeeper Role
----------|----------------------------------------------------------------
HDFS HA   | NameNode automatic failover (ZKFC)
YARN HA   | ResourceManager leader election
HBase     | Region Server coordination, Master election
Kafka     | Broker registration, topic/partition metadata (pre-Kafka 3.x)
Oozie     | Job state coordination
Hive      | HiveServer2 instance discovery

ZooKeeper Data Model

ZooKeeper stores data in a hierarchical namespace (like a filesystem), where each node is called a znode:

/
├── hadoop-ha/
│   └── nameservice1/
│       ├── ActiveBreadCrumb           ← which NameNode is active
│       └── ActiveStandbyElectorLock
├── hbase/
│   ├── master                         ← HBase active master
│   ├── rs/                            ← region server registrations
│   └── table/
└── kafka/
    ├── brokers/
    └── topics/

Znode types:

  • Persistent — survives client disconnection; must be deleted explicitly
  • Ephemeral — deleted automatically when the creating client's session ends (used for leader election and service registration)
  • Sequential — appends a unique, monotonically increasing counter to the node name (a flag that combines with persistent or ephemeral)
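The ephemeral and sequential flags combine in ZooKeeper's standard leader-election recipe: each candidate creates an ephemeral sequential znode under an election path, and whoever holds the lowest sequence number leads. The sketch below is a pure in-memory simulation of that naming logic — not a real ZooKeeper client, and the `/election/n-` prefix and class names are illustrative:

```python
# Simulation of the ephemeral-sequential leader election recipe.
# No ZooKeeper connection is made; this only models the sequence logic.

class ElectionSim:
    def __init__(self):
        self.counter = 0   # monotonic counter ZooKeeper appends to names
        self.nodes = {}    # znode name -> owning client session

    def create_sequential(self, prefix, client):
        """Mimic 'create -e -s': append a zero-padded monotonic counter."""
        name = f"{prefix}{self.counter:010d}"
        self.counter += 1
        self.nodes[name] = client
        return name

    def leader(self):
        """The client owning the lowest sequence number is the leader."""
        return self.nodes[min(self.nodes)]

    def disconnect(self, client):
        """Ephemeral znodes vanish with their session; a new leader emerges."""
        self.nodes = {n: c for n, c in self.nodes.items() if c != client}

sim = ElectionSim()
sim.create_sequential("/election/n-", "server-a")
sim.create_sequential("/election/n-", "server-b")
print(sim.leader())         # server-a holds the lowest sequence number
sim.disconnect("server-a")  # session loss deletes the ephemeral node
print(sim.leader())         # server-b takes over
```

Real implementations watch only the next-lower node (rather than the whole directory) to avoid a thundering herd when the leader dies.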

Architecture

ZooKeeper Ensemble (3 or 5 nodes recommended)

Leader ──► Follower 1
       ──► Follower 2
       ──► Follower 3

Quorum requirement: majority must be alive
3 nodes → tolerates 1 failure
5 nodes → tolerates 2 failures

Never run 2 or 4 nodes: an even-sized ensemble tolerates no more failures than the next-smaller odd one (4 nodes still tolerate only 1 failure), so the extra server buys nothing while raising the quorum size.
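The quorum arithmetic above can be checked directly: a quorum is a strict majority of the ensemble, so an ensemble of n voters tolerates n minus quorum(n) failures. A quick sketch:

```python
# Quorum size and fault tolerance for a ZooKeeper ensemble of n voters.

def quorum(n: int) -> int:
    """Strict majority: more than half of the ensemble."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority stays alive."""
    return n - quorum(n)

for n in (3, 4, 5):
    print(n, "nodes ->", tolerated_failures(n), "failures tolerated")
# 3 nodes -> 1, 4 nodes -> 1 (no gain over 3), 5 nodes -> 2
```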

Configuration (zoo.cfg)

# Data directory for snapshots
dataDir=/var/zookeeper/data

# Transaction log directory (put on separate disk from dataDir)
dataLogDir=/var/zookeeper/logs

# Client port
clientPort=2181

# Tick time (ms) — base unit for heartbeats and timeouts
tickTime=2000

# Follower init sync timeout (10 × tickTime)
initLimit=10

# Follower-leader sync timeout (5 × tickTime)
syncLimit=5

# Session timeout bounds (ms)
minSessionTimeout=4000
maxSessionTimeout=40000

# Ensemble members: server.id=host:leader-port:election-port
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
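All of these timeouts are multiples of tickTime, so changing tickTime rescales everything. The sketch below derives the effective millisecond values from the settings above; the 2× and 20× session-bound defaults match ZooKeeper's documented behaviour when min/maxSessionTimeout are left unset:

```python
# Effective timeouts derived from the zoo.cfg values above.

tick_time = 2000   # ms, base unit for all other timers
init_limit = 10    # ticks a follower may take to connect and sync initially
sync_limit = 5     # ticks a follower may lag behind the leader

init_timeout = init_limit * tick_time   # follower initial-sync deadline
sync_timeout = sync_limit * tick_time   # follower-leader sync deadline

# When unset, ZooKeeper defaults the session bounds to 2x / 20x tickTime.
default_min_session = 2 * tick_time
default_max_session = 20 * tick_time

print(init_timeout, sync_timeout, default_min_session, default_max_session)
# 20000 10000 4000 40000
```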

Each ZooKeeper node needs a myid file containing its server ID:

# On zk1:
echo 1 > /var/zookeeper/data/myid

# On zk2:
echo 2 > /var/zookeeper/data/myid

# On zk3:
echo 3 > /var/zookeeper/data/myid
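A common misconfiguration is a myid that matches no server.N entry in zoo.cfg (or the wrong one). A small, hypothetical checker — `parse_servers` is not a real ZooKeeper tool — that extracts the server.N lines and validates a myid value against them:

```python
import re

def parse_servers(cfg_text: str) -> dict:
    """Extract {server_id: host} from server.N=host:peer:election lines."""
    servers = {}
    for line in cfg_text.splitlines():
        m = re.match(r"server\.(\d+)=([^:]+):\d+:\d+", line.strip())
        if m:
            servers[int(m.group(1))] = m.group(2)
    return servers

cfg = """
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
"""

servers = parse_servers(cfg)
myid = 2   # would be read from /var/zookeeper/data/myid on the host
assert myid in servers, "myid has no matching server.N entry in zoo.cfg"
print(servers[myid])   # zk2.example.com
```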

Starting ZooKeeper

# Start
zkServer.sh start

# Stop
zkServer.sh stop

# Check status (shows if node is Leader or Follower)
zkServer.sh status

ZooKeeper CLI (zkCli)

# Connect to local ZooKeeper
zkCli.sh -server localhost:2181

# Connect to a specific ensemble member
zkCli.sh -server zk1.example.com:2181

Common commands inside the CLI:

# List children of a znode
ls /

# Get data stored in a znode
get /hadoop-ha/nameservice1/ActiveBreadCrumb

# Create a persistent znode with data (the parent /myapp must exist first)
create /myapp ""
create /myapp/config "version=1.2"

# Create an ephemeral znode
create -e /myapp/locks/worker-01 "locked"

# Set data on an existing znode
set /myapp/config "version=1.3"

# Watch a znode for changes (triggers once)
get -w /myapp/config

# Delete a znode
delete /myapp/config

# Delete recursively
deleteall /myapp
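The one-shot nature of watches (get -w fires a single notification and must then be re-registered) is easy to get wrong in client code. Below is a tiny in-memory model of that behaviour — purely illustrative, not the real client API:

```python
# In-memory model of ZooKeeper's one-shot watch semantics.

class WatchedNode:
    def __init__(self, data: str):
        self.data = data
        self.watchers = []   # callbacks, each fired at most once

    def get(self, watch=None):
        """Like 'get -w': optionally register a one-shot watch."""
        if watch is not None:
            self.watchers.append(watch)
        return self.data

    def set(self, data: str):
        """Like 'set': fire and clear all pending watches."""
        self.data = data
        pending, self.watchers = self.watchers, []
        for callback in pending:
            callback(data)

events = []
node = WatchedNode("version=1.2")
node.get(watch=events.append)   # register the watch
node.set("version=1.3")         # watch fires once
node.set("version=1.4")         # no watch registered -> no event
print(events)                   # ['version=1.3']
```

A client that wants continuous notifications must re-register the watch inside its callback, accepting that updates between the trigger and re-registration can be missed.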

Four-Letter Commands (Monitoring)

ZooKeeper responds to short text commands over TCP for health monitoring:

# Check if node is OK
echo ruok | nc zk1.example.com 2181
# Returns: imok

# Show server statistics
echo stat | nc zk1.example.com 2181

# List sessions and ephemeral znodes (run against the leader)
echo dump | nc zk1.example.com 2181

# Show environment info
echo envi | nc zk1.example.com 2181

# Show a summary of watches (connection, path, and watch counts)
echo wchs | nc zk1.example.com 2181

# Dump monitoring stats as key/value pairs (ideal for scraping)
echo mntr | nc zk1.example.com 2181

Newer ZooKeeper versions (3.4.10+ / 3.5.3+) require four-letter commands to be whitelisted in zoo.cfg:

4lw.commands.whitelist=ruok,stat,dump,envi,wchs,mntr
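The mntr command (included in the whitelist above) emits line-oriented, tab-separated key/value pairs, which makes it easy to scrape into monitoring systems. A hedged sketch of a parser; the sample text is illustrative and the exact keys and version string vary by release:

```python
# Parse ZooKeeper 'mntr' output (tab-separated key/value lines) into a dict.
# The sample below is illustrative; real output contains many more keys.

sample = (
    "zk_version\t3.8.4\n"
    "zk_avg_latency\t0\n"
    "zk_outstanding_requests\t0\n"
    "zk_server_state\tfollower\n"
)

def parse_mntr(text: str) -> dict:
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("\t")
        stats[key] = value
    return stats

stats = parse_mntr(sample)
print(stats["zk_server_state"])               # follower
print(int(stats["zk_outstanding_requests"]))  # 0
```

In practice the same text would come from something like `echo mntr | nc zk1.example.com 2181`, with a socket read replacing the sample string.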

Key Tuning Parameters

Parameter                  | Default          | Recommendation
---------------------------|------------------|------------------------------------------------
tickTime                   | 2000 ms          | Keep at 2000 ms for most clusters
maxSessionTimeout          | 20× tickTime     | Increase to 60000 ms for slow HBase clients
dataLogDir                 | same as dataDir  | Always set to a separate disk (reduces latency)
autopurge.snapRetainCount  | 3                | Set to 5–10 to keep enough snapshots
autopurge.purgeInterval    | 0 (off)          | Set to 1 (hourly purge) to prevent disk fill
JVM heap                   | 500 MB           | 1–2 GB for busy ensembles

Enable automatic log purge

autopurge.snapRetainCount=5
autopurge.purgeInterval=1

JVM heap (zkEnv.sh or zookeeper-env.sh)

export ZK_SERVER_HEAP=1024   # megabytes, i.e. 1 GB

ZooKeeper and HDFS HA

When configuring HDFS NameNode HA with automatic failover, ZooKeeper holds the active NameNode lock. The ZooKeeper Failover Controller (ZKFC) runs on each NameNode host:

# Format the ZooKeeper znode for HDFS HA (run once; creates /hadoop-ha)
hdfs zkfc -formatZK

# Start ZKFC on each NameNode host
hdfs --daemon start zkfc

If the active NameNode fails, the ZKFC detects it and triggers a failover — the standby NameNode acquires the ZooKeeper lock and transitions to active automatically.
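The failover mechanics reduce to an ephemeral lock: the active NameNode's ZKFC holds an ephemeral znode, and when its ZooKeeper session expires the standby's ZKFC (which has been watching the lock) acquires it. A minimal model of that handoff — purely illustrative, not actual ZKFC code, with made-up session names:

```python
# Model of the ZKFC ephemeral-lock handoff during NameNode failover.

class FailoverLock:
    def __init__(self):
        self.holder = None   # session currently holding the lock znode
        self.waiters = []    # standby sessions watching the lock

    def try_acquire(self, session: str) -> bool:
        """Create the ephemeral lock znode if it does not already exist."""
        if self.holder is None:
            self.holder = session
            return True
        self.waiters.append(session)
        return False

    def session_expired(self, session: str):
        """Ephemeral znode deleted -> the first watcher acquires the lock."""
        if self.holder == session:
            self.holder = self.waiters.pop(0) if self.waiters else None

lock = FailoverLock()
lock.try_acquire("nn1-zkfc")      # nn1 becomes active
lock.try_acquire("nn2-zkfc")      # nn2 stays standby, watching the lock
lock.session_expired("nn1-zkfc")  # nn1 crashes; its session expires
print(lock.holder)                # nn2-zkfc
```

The real ZKFC also fences the old active (e.g. via sshfence or a fencing script) before the standby transitions, to prevent split-brain writes.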