Apache ZooKeeper

What Is ZooKeeper?

Apache ZooKeeper is a distributed coordination service — a highly available system that provides primitives for building distributed applications: configuration management, distributed locking, leader election, and service discovery.

In the Hadoop ecosystem, ZooKeeper is a foundational dependency used by:

System    | ZooKeeper Role
----------|----------------------------------------------------------------
HDFS HA   | NameNode automatic failover (ZKFC)
YARN HA   | ResourceManager leader election
HBase     | Region Server coordination, Master election
Kafka     | Broker registration, topic/partition metadata (pre-Kafka 3.x)
Oozie     | Job state coordination
Hive      | HiveServer2 instance discovery

ZooKeeper Data Model

ZooKeeper stores data in a hierarchical namespace (like a filesystem), where each node is called a znode:

/
├── hadoop-ha/
│   └── nameservice1/
│       ├── ActiveBreadCrumb           ← which NameNode is active
│       └── ActiveStandbyElectorLock
├── hbase/
│   ├── master                         ← HBase active master
│   ├── rs/                            ← region server registrations
│   └── table/
└── kafka/
    ├── brokers/
    └── topics/

Znode types:

  • Persistent — survives client disconnection; must be deleted explicitly
  • Ephemeral — deleted automatically when the creating client's session ends (used for leader election and service registration)
  • Sequential — appends a unique, monotonically increasing counter to the node name (a flag that combines with persistent or ephemeral)
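The ephemeral and sequential flags combine in ZooKeeper's standard leader-election recipe: each candidate creates an ephemeral sequential znode under an election path, and whoever holds the lowest sequence number leads. The sketch below is a pure in-memory simulation of that naming logic — not a real ZooKeeper client, and the `/election/n-` prefix and class names are illustrative:

```python
# Simulation of the ephemeral-sequential leader election recipe.
# No ZooKeeper connection is made; this only models the sequence logic.

class ElectionSim:
    def __init__(self):
        self.counter = 0   # monotonic counter ZooKeeper appends to names
        self.nodes = {}    # znode name -> owning client session

    def create_sequential(self, prefix, client):
        """Mimic 'create -e -s': append a zero-padded monotonic counter."""
        name = f"{prefix}{self.counter:010d}"
        self.counter += 1
        self.nodes[name] = client
        return name

    def leader(self):
        """The client owning the lowest sequence number is the leader."""
        return self.nodes[min(self.nodes)]

    def disconnect(self, client):
        """Ephemeral znodes vanish with their session; a new leader emerges."""
        self.nodes = {n: c for n, c in self.nodes.items() if c != client}

sim = ElectionSim()
sim.create_sequential("/election/n-", "server-a")
sim.create_sequential("/election/n-", "server-b")
print(sim.leader())         # server-a holds the lowest sequence number
sim.disconnect("server-a")  # session loss deletes the ephemeral node
print(sim.leader())         # server-b takes over
```

Real implementations watch only the next-lower node (rather than the whole directory) to avoid a thundering herd when the leader dies.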

Architecture

ZooKeeper Ensemble (3 or 5 nodes recommended)

Leader ──► Follower 1
       ──► Follower 2
       ──► Follower 3

Quorum requirement: majority must be alive
3 nodes → tolerates 1 failure
5 nodes → tolerates 2 failures

Never run 2 or 4 nodes: an even-sized ensemble tolerates no more failures than the next-smaller odd one (4 nodes still tolerate only 1 failure), so the extra server buys nothing while raising the quorum size.
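The quorum arithmetic above can be checked directly: a quorum is a strict majority of the ensemble, so an ensemble of n voters tolerates n minus quorum(n) failures. A quick sketch:

```python
# Quorum size and fault tolerance for a ZooKeeper ensemble of n voters.

def quorum(n: int) -> int:
    """Strict majority: more than half of the ensemble."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority stays alive."""
    return n - quorum(n)

for n in (3, 4, 5):
    print(n, "nodes ->", tolerated_failures(n), "failures tolerated")
# 3 nodes -> 1, 4 nodes -> 1 (no gain over 3), 5 nodes -> 2
```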

Configuration (zoo.cfg)

# Data directory for snapshots
dataDir=/var/zookeeper/data

# Transaction log directory (put on separate disk from dataDir)
dataLogDir=/var/zookeeper/logs

# Client port
clientPort=2181

# Tick time (ms) — base unit for heartbeats and timeouts
tickTime=2000

# Follower init sync timeout (10 × tickTime)
initLimit=10

# Follower-leader sync timeout (5 × tickTime)
syncLimit=5

# Session timeout bounds (ms)
minSessionTimeout=4000
maxSessionTimeout=40000

# Ensemble members: server.id=host:leader-port:election-port
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
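All of these timeouts are multiples of tickTime, so changing tickTime rescales everything. The sketch below derives the effective millisecond values from the settings above; the 2× and 20× session-bound defaults match ZooKeeper's documented behaviour when min/maxSessionTimeout are left unset:

```python
# Effective timeouts derived from the zoo.cfg values above.

tick_time = 2000   # ms, base unit for all other timers
init_limit = 10    # ticks a follower may take to connect and sync initially
sync_limit = 5     # ticks a follower may lag behind the leader

init_timeout = init_limit * tick_time   # follower initial-sync deadline
sync_timeout = sync_limit * tick_time   # follower-leader sync deadline

# When unset, ZooKeeper defaults the session bounds to 2x / 20x tickTime.
default_min_session = 2 * tick_time
default_max_session = 20 * tick_time

print(init_timeout, sync_timeout, default_min_session, default_max_session)
# 20000 10000 4000 40000
```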

Each ZooKeeper node needs a myid file containing its server ID:

# On zk1:
echo 1 > /var/zookeeper/data/myid

# On zk2:
echo 2 > /var/zookeeper/data/myid

# On zk3:
echo 3 > /var/zookeeper/data/myid
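A common misconfiguration is a myid that matches no server.N entry in zoo.cfg (or the wrong one). A small, hypothetical checker — `parse_servers` is not a real ZooKeeper tool — that extracts the server.N lines and validates a myid value against them:

```python
import re

def parse_servers(cfg_text: str) -> dict:
    """Extract {server_id: host} from server.N=host:peer:election lines."""
    servers = {}
    for line in cfg_text.splitlines():
        m = re.match(r"server\.(\d+)=([^:]+):\d+:\d+", line.strip())
        if m:
            servers[int(m.group(1))] = m.group(2)
    return servers

cfg = """
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
"""

servers = parse_servers(cfg)
myid = 2   # would be read from /var/zookeeper/data/myid on the host
assert myid in servers, "myid has no matching server.N entry in zoo.cfg"
print(servers[myid])   # zk2.example.com
```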

Starting ZooKeeper

# Start
zkServer.sh start

# Stop
zkServer.sh stop

# Check status (shows if node is Leader or Follower)
zkServer.sh status

ZooKeeper CLI (zkCli)

# Connect to local ZooKeeper
zkCli.sh -server localhost:2181

# Connect to a specific ensemble member
zkCli.sh -server zk1.example.com:2181

Common commands inside the CLI:

# List children of a znode
ls /

# Get data stored in a znode
get /hadoop-ha/nameservice1/ActiveBreadCrumb

# Create a persistent znode with data (the parent /myapp must exist first)
create /myapp ""
create /myapp/config "version=1.2"

# Create an ephemeral znode
create -e /myapp/locks/worker-01 "locked"

# Set data on an existing znode
set /myapp/config "version=1.3"

# Watch a znode for changes (triggers once)
get -w /myapp/config

# Delete a znode
delete /myapp/config

# Delete recursively
deleteall /myapp
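The one-shot nature of watches (get -w fires a single notification and must then be re-registered) is easy to get wrong in client code. Below is a tiny in-memory model of that behaviour — purely illustrative, not the real client API:

```python
# In-memory model of ZooKeeper's one-shot watch semantics.

class WatchedNode:
    def __init__(self, data: str):
        self.data = data
        self.watchers = []   # callbacks, each fired at most once

    def get(self, watch=None):
        """Like 'get -w': optionally register a one-shot watch."""
        if watch is not None:
            self.watchers.append(watch)
        return self.data

    def set(self, data: str):
        """Like 'set': fire and clear all pending watches."""
        self.data = data
        pending, self.watchers = self.watchers, []
        for callback in pending:
            callback(data)

events = []
node = WatchedNode("version=1.2")
node.get(watch=events.append)   # register the watch
node.set("version=1.3")         # watch fires once
node.set("version=1.4")         # no watch registered -> no event
print(events)                   # ['version=1.3']
```

A client that wants continuous notifications must re-register the watch inside its callback, accepting that updates between the trigger and re-registration can be missed.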

Four-Letter Commands (Monitoring)

ZooKeeper responds to short text commands over TCP for health monitoring:

# Check if node is OK
echo ruok | nc zk1.example.com 2181
# Returns: imok

# Show server statistics
echo stat | nc zk1.example.com 2181

# List sessions and ephemeral znodes (run against the leader)
echo dump | nc zk1.example.com 2181

# Show environment info
echo envi | nc zk1.example.com 2181

# Show a summary of watches (connection, path, and watch counts)
echo wchs | nc zk1.example.com 2181

# Dump monitoring stats as key/value pairs (ideal for scraping)
echo mntr | nc zk1.example.com 2181

Newer ZooKeeper versions (3.4.10+ / 3.5.3+) require four-letter commands to be whitelisted in zoo.cfg:

4lw.commands.whitelist=ruok,stat,dump,envi,wchs,mntr
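The mntr command (included in the whitelist above) emits line-oriented, tab-separated key/value pairs, which makes it easy to scrape into monitoring systems. A hedged sketch of a parser; the sample text is illustrative and the exact keys and version string vary by release:

```python
# Parse ZooKeeper 'mntr' output (tab-separated key/value lines) into a dict.
# The sample below is illustrative; real output contains many more keys.

sample = (
    "zk_version\t3.8.4\n"
    "zk_avg_latency\t0\n"
    "zk_outstanding_requests\t0\n"
    "zk_server_state\tfollower\n"
)

def parse_mntr(text: str) -> dict:
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("\t")
        stats[key] = value
    return stats

stats = parse_mntr(sample)
print(stats["zk_server_state"])               # follower
print(int(stats["zk_outstanding_requests"]))  # 0
```

In practice the same text would come from something like `echo mntr | nc zk1.example.com 2181`, with a socket read replacing the sample string.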

Key Tuning Parameters

Parameter                  | Default          | Recommendation
---------------------------|------------------|------------------------------------------------
tickTime                   | 2000 ms          | Keep at 2000 ms for most clusters
maxSessionTimeout          | 20× tickTime     | Increase to 60000 ms for slow HBase clients
dataLogDir                 | same as dataDir  | Always set to a separate disk (reduces latency)
autopurge.snapRetainCount  | 3                | Set to 5–10 to keep enough snapshots
autopurge.purgeInterval    | 0 (off)          | Set to 1 (hourly purge) to prevent disk fill
JVM heap                   | 500 MB           | 1–2 GB for busy ensembles

Enable automatic log purge

autopurge.snapRetainCount=5
autopurge.purgeInterval=1

JVM heap (zkEnv.sh or zookeeper-env.sh)

export ZK_SERVER_HEAP=1024   # megabytes, i.e. 1 GB

ZooKeeper and HDFS HA

When configuring HDFS NameNode HA with automatic failover, ZooKeeper holds the active NameNode lock. The ZooKeeper Failover Controller (ZKFC) runs on each NameNode host:

# Format the ZooKeeper znode for HDFS HA (run once; creates /hadoop-ha)
hdfs zkfc -formatZK

# Start ZKFC on each NameNode host
hdfs --daemon start zkfc

If the active NameNode fails, the ZKFC detects it and triggers a failover — the standby NameNode acquires the ZooKeeper lock and transitions to active automatically.
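The failover mechanics reduce to an ephemeral lock: the active NameNode's ZKFC holds an ephemeral znode, and when its ZooKeeper session expires the standby's ZKFC (which has been watching the lock) acquires it. A minimal model of that handoff — purely illustrative, not actual ZKFC code, with made-up session names:

```python
# Model of the ZKFC ephemeral-lock handoff during NameNode failover.

class FailoverLock:
    def __init__(self):
        self.holder = None   # session currently holding the lock znode
        self.waiters = []    # standby sessions watching the lock

    def try_acquire(self, session: str) -> bool:
        """Create the ephemeral lock znode if it does not already exist."""
        if self.holder is None:
            self.holder = session
            return True
        self.waiters.append(session)
        return False

    def session_expired(self, session: str):
        """Ephemeral znode deleted -> the first watcher acquires the lock."""
        if self.holder == session:
            self.holder = self.waiters.pop(0) if self.waiters else None

lock = FailoverLock()
lock.try_acquire("nn1-zkfc")      # nn1 becomes active
lock.try_acquire("nn2-zkfc")      # nn2 stays standby, watching the lock
lock.session_expired("nn1-zkfc")  # nn1 crashes; its session expires
print(lock.holder)                # nn2-zkfc
```

The real ZKFC also fences the old active (e.g. via sshfence or a fencing script) before the standby transitions, to prevent split-brain writes.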