HBase vs Cassandra: Choosing a NoSQL Database for Big Data

Hadoop.so Editorial Team
Big Data Engineers

Apache HBase and Apache Cassandra are the two most widely deployed NoSQL databases in the Hadoop ecosystem. Both handle massive datasets across distributed clusters, but they have fundamentally different architectures that make each excel in different scenarios. This post cuts through the marketing and gives you a practical comparison.

Background

Apache HBase (2008) was modeled after Google's Bigtable paper and built on top of HDFS. It's a wide-column store tightly integrated with the Hadoop ecosystem — it uses HDFS for storage, YARN for resource management (optionally), and ZooKeeper for coordination.

Apache Cassandra (2008, open-sourced by Facebook) was inspired by both Amazon Dynamo and Google Bigtable. It's a fully distributed, peer-to-peer wide-column store designed for high availability with no single point of failure.


Architecture: The Fundamental Difference

This is the most important distinction:

HBase: Master/Worker Architecture

ZooKeeper (coordination)
        │
        ▼
HMaster (assigns regions, handles DDL)
        │
        ├──► RegionServer 1 (serves regions A–M)
        │        └── Stores data in HDFS
        ├──► RegionServer 2 (serves regions N–T)
        │        └── Stores data in HDFS
        └──► RegionServer 3 (serves regions U–Z)
                 └── Stores data in HDFS

HBase has a master node (HMaster) that coordinates region assignment and cluster state. RegionServers handle read/write for assigned row key ranges. Data is stored in HDFS — the underlying distributed filesystem handles replication.

Cassandra: Peer-to-Peer Ring

Client
  │
  ▼
Coordinator Node (any node can serve this role)
  │
  ├──► Node A ──► Node B ──► Node C
  │              (Replication Factor = 3)
  └──► (data replicated across RF nodes on the ring)

Cassandra has no master. Every node is equal — any node can coordinate any request. Data is distributed using consistent hashing across the ring, and replicas are written to multiple nodes based on the Replication Factor and placement strategy.
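The placement rule can be sketched in a few lines of Python: hash the partition key onto a ring of node tokens, then walk clockwise until RF distinct nodes are collected. This is a toy model; the node names and the MD5 stand-in partitioner are illustrative (real clusters use the Murmur3 partitioner and many vnodes per node).

```python
import hashlib
from bisect import bisect_right

# Illustrative node names; a real cluster discovers these via gossip.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def token(key: str) -> int:
    """Map a key (or node name) onto the ring; stands in for the partitioner."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# One token per node here; real Cassandra assigns many vnodes per node.
ring = sorted((token(n), n) for n in NODES)

def replicas(partition_key: str, rf: int = 3) -> list[str]:
    """Walk clockwise from the key's token, taking the next RF distinct nodes."""
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token(partition_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

print(replicas("user_42"))  # three distinct replica nodes for this key
```

Any coordinator can run this computation locally, which is why no master is needed to route requests.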


Data Model Comparison

Both are wide-column stores, but with different modeling philosophies.

HBase Data Model

Table: user_events

Row Key          | Column Family: cf
-----------------+------------------------------------------
user_1|ts_001    | cf:action="click"  cf:page="/home"
user_1|ts_002    | cf:action="login"  cf:ip="10.0.0.1"
user_2|ts_001    | cf:action="view"   cf:item="SKU-123"
  • Rows are sorted by row key lexicographically
  • Efficient range scans across contiguous row keys
  • Sparse: columns don't need to be consistent across rows
  • Versioning: each cell can store multiple timestamped versions
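A toy model of the sorted layout shows why contiguous range scans are cheap. The in-memory `rows` list stands in for HFiles, and the keys and `scan` helper are illustrative, not HBase API:

```python
from bisect import bisect_left

# Rows kept sorted lexicographically by row key, as HBase stores them.
rows = sorted({
    "user_1|ts_001": {"cf:action": "click", "cf:page": "/home"},
    "user_1|ts_002": {"cf:action": "login", "cf:ip": "10.0.0.1"},
    "user_2|ts_001": {"cf:action": "view", "cf:item": "SKU-123"},
}.items())

def scan(start: str, stop: str) -> list:
    """Range scan over contiguous row keys: a single slice of sorted data."""
    keys = [k for k, _ in rows]
    return rows[bisect_left(keys, start):bisect_left(keys, stop)]

# All events for user_1: a half-open range over the composite key prefix.
print(scan("user_1|", "user_1}"))
```

Because all of one user's events share a key prefix, they sit next to each other on disk, so the scan touches only sequential blocks.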

Cassandra Data Model

-- Cassandra uses CQL (Cassandra Query Language)
CREATE TABLE user_events (
    user_id    UUID,
    event_time TIMESTAMP,
    action     TEXT,
    page       TEXT,
    PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
  • Partition key (user_id) determines which node holds the data
  • Clustering columns (event_time) determine order within a partition
  • Queries must include the partition key for efficient access
  • Schema is enforced (unlike HBase's schema-less model)
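A few lines of Python illustrate the access pattern this schema buys. The dict-of-lists model and function names are ours, not Cassandra internals:

```python
from collections import defaultdict

# Toy model of the table above: rows grouped by partition key (user_id),
# kept sorted by the clustering column (event_time) in DESC order.
table = defaultdict(list)

def insert(user_id, event_time, action, page):
    part = table[user_id]
    part.append({"event_time": event_time, "action": action, "page": page})
    part.sort(key=lambda r: r["event_time"], reverse=True)  # CLUSTERING ORDER DESC

def latest_events(user_id, limit=10):
    """Efficient query: one partition, rows already in the requested order."""
    return table[user_id][:limit]

insert("u1", 100, "click", "/home")
insert("u1", 200, "login", "/login")
print(latest_events("u1", 1))  # newest event first
```

A query without the partition key would have to inspect every entry in `table`, which is exactly why cross-partition scans are an anti-pattern.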

Consistency Models

Aspect               | HBase                                      | Cassandra
---------------------|--------------------------------------------|---------------------------------------------
Consistency model    | Strong (single-region)                     | Tunable (eventual by default)
Write path           | WAL → MemStore → HFile flush               | Commit log → Memtable → SSTable flush
Read path            | Block cache → MemStore → HFiles            | Row cache → Memtable → SSTables
Replication          | HDFS (3 replicas by default)               | Configurable RF (typically 3)
Cross-DC replication | Limited (requires extra tooling)           | Built-in multi-datacenter support
Failover             | RegionServer failure → region reassignment | Any node can fail; ring heals automatically

Cassandra's tunable consistency lets you choose per-query:

QUORUM       = majority of replicas must ack (strong, slower)
ONE          = first replica acks (fast, eventually consistent)
ALL          = all replicas must ack (strongest, least available)
LOCAL_QUORUM = quorum within the local datacenter (best for multi-DC)

HBase provides strong consistency within a single region — a row's data is always read from one RegionServer, so there's no stale read risk within a datacenter.
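The rule behind Cassandra's levels is simple arithmetic: a read is guaranteed to see the latest acknowledged write whenever the read and write replica counts overlap, i.e. R + W > RF. A minimal sketch (function names are ours, not Cassandra's):

```python
def quorum(rf: int) -> int:
    """Replicas needed for QUORUM: a strict majority of RF."""
    return rf // 2 + 1

def is_strong(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """Strong consistency requires read and write sets to overlap: R + W > RF."""
    return read_replicas + write_replicas > rf

rf = 3
print(quorum(rf))                             # 2
print(is_strong(quorum(rf), quorum(rf), rf))  # True: QUORUM reads + QUORUM writes
print(is_strong(1, 1, rf))                    # False: ONE + ONE can read stale data
```

This is why QUORUM/QUORUM is the usual recipe for strong consistency at RF = 3, while ONE/ONE trades staleness for latency.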


Read/Write Performance

HBase Read Performance

Point lookup (single row key):
- Cache hit: < 1ms
- Cache miss (HDFS read): 5–20ms typical

Range scan (contiguous row keys):
- Highly efficient — sequential HDFS reads
- Best use case for HBase

Random row reads (non-sequential keys):
- Moderate — multiple block cache lookups

Cassandra Read Performance

Point lookup (partition key + clustering):
- Single partition: 1–5ms typical
- Cassandra is optimized for this pattern

Range scan (across partitions):
- Requires full cluster scan (ALLOW FILTERING)
- Anti-pattern — avoid in production

Secondary indexes:
- Available but add overhead on every write
- Denormalized query tables are the preferred alternative
  (materialized views exist but remain flagged experimental)

Write Performance

Both are optimized for writes — they use in-memory buffers (MemStore/Memtable) and sequential disk writes (WAL/commit log + SSTable/HFile). Write throughput of 100,000–1,000,000+ ops/sec per node is achievable for both under the right conditions.
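The shared write path can be sketched as a toy log-structured store. The class and field names are illustrative, not either system's actual code; they map onto WAL/MemStore/HFile in HBase and commit log/Memtable/SSTable in Cassandra:

```python
class LSMStore:
    """Toy LSM write path: sequential log + in-memory buffer + sorted flushes."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # WAL / commit log: durable sequential appends
        self.memtable = {}     # MemStore / Memtable: in-memory buffer
        self.sstables = []     # HFiles / SSTables: immutable sorted runs
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # 1. append to the log (durability)
        self.memtable[key] = value      # 2. update the in-memory buffer
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                # 3. flush when the buffer fills

    def flush(self):
        """Write the buffer as one sorted run; no random I/O anywhere."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.wal.clear()  # flushed data no longer needs the log

store = LSMStore()
for i in range(4):
    store.put(f"k{i}", i)
print(len(store.sstables), len(store.memtable))  # one flushed run, one buffered key
```

Every disk write here is sequential, which is the structural reason both systems sustain such high write throughput.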


Operations Complexity

HBase

The HMaster is a single point of coordination failure (mitigated by ZooKeeper-based failover). Region splitting, compaction, and balancing require operational attention. HDFS also adds operational complexity:

# HBase common operations
hbase shell
> list                               # list tables
> describe 'user_events'             # show schema
> scan 'user_events', {LIMIT => 10}  # sample rows

# Region management (hbck is read-only on HBase 2.x; use HBCK2 for repairs)
hbase hbck -details                  # cluster health check
hbase hbck -fixAssignments           # fix region assignment issues (HBase 1.x)

Cassandra

No master simplifies operations — nodes can be added or removed without downtime. But Cassandra's own complexity comes from compaction strategies, tombstone accumulation, and repair:

# Cassandra common operations
nodetool status                      # cluster ring view
nodetool repair <keyspace>           # anti-entropy repair (run weekly)
nodetool compactionstats             # compaction progress
cqlsh -e "DESCRIBE KEYSPACE ks;"     # show schema

Cassandra's nodetool repair is a critical, often-neglected operational task. Skip it for longer than gc_grace_seconds and a replica that missed a delete can resurrect the data once its tombstone is purged by compaction, while divergent replicas drift apart without ever being re-synchronized.
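Why neglected repair causes trouble can be shown with a toy model of tombstone garbage collection. The cell dicts and `compact` function are illustrative, not Cassandra internals; `GC_GRACE_SECONDS` mirrors Cassandra's 10-day `gc_grace_seconds` default:

```python
GC_GRACE_SECONDS = 864_000  # Cassandra's 10-day default

def compact(cells, now):
    """Keep live cells; purge tombstones older than gc_grace_seconds."""
    return [c for c in cells
            if not (c["tombstone"] and now - c["written_at"] > GC_GRACE_SECONDS)]

# Replica A saw the delete of key "x"; replica B was down and missed it.
replica_a = [{"key": "x", "tombstone": True, "written_at": 0}]
replica_b = [{"key": "x", "tombstone": False, "written_at": 0}]

early = compact(replica_a, now=100)       # tombstone kept: repair can still fix B
late = compact(replica_a, now=1_000_000)  # tombstone purged: no record of the delete
print(early, late)
```

Once `late` is empty, nothing on replica A says "x was deleted", so B's stale live cell can win a later repair or read and "x" comes back from the dead. Running repair before gc_grace_seconds expires is what closes this window.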


Integration with Hadoop

Integration | HBase                                | Cassandra
------------|--------------------------------------|--------------------------------------
HDFS        | Native (stores data in HDFS)         | Optional (Cassandra HDFS connector)
MapReduce   | TableInputFormat / TableOutputFormat | Spark connector preferred
Spark       | HBase-Spark connector                | Cassandra Spark connector (DataStax)
Hive        | HBaseStorageHandler                  | External table via SerDe
Sqoop       | HBaseImportJob                       | Cassandra connector
Phoenix     | Yes (SQL layer over HBase)           | No equivalent

Apache Phoenix is a major HBase advantage for SQL workloads: it provides a full JDBC/SQL interface over HBase with secondary indexes, making HBase queryable by BI tools without custom code.


When to Choose HBase

HBase is the right choice when:

  • You need tight HDFS integration — your existing Hadoop pipeline writes to HBase as a sink
  • Row key range scans are your primary access pattern (time-series, sensor data ordered by device+timestamp)
  • You need Apache Phoenix for SQL access to NoSQL data
  • Strong consistency per row is a hard requirement
  • Your team already operates a Hadoop cluster (shared operational overhead)

Example use cases: Web analytics event storage (keyed by user+timestamp), genome sequence storage, message storage for large-scale messaging systems.


When to Choose Cassandra

Cassandra is the right choice when:

  • Multi-datacenter active-active replication is required (Cassandra's strongest differentiator)
  • No single point of failure is a hard requirement — you can't afford HMaster failover delay
  • Writes vastly outnumber reads (IoT telemetry, click streams at millions of events/second)
  • Your data access is primarily partition key lookups (user profile by user_id, session by session_id)
  • The workload is independent of Hadoop — Cassandra doesn't need HDFS or YARN

Example use cases: Global user session management, IoT telemetry ingestion across regions, product catalog with global replication, real-time fraud scoring feature store.


Summary Decision Guide

Criteria             | Choose HBase                    | Choose Cassandra
---------------------|---------------------------------|--------------------------------
Architecture         | Hadoop ecosystem, HDFS storage  | Standalone, cloud-native
Consistency          | Strong consistency required     | Tunable / eventual OK
Multi-DC replication | One datacenter primary          | Multi-DC active-active
Query pattern        | Range scans on ordered keys     | Point lookups by partition key
SQL access           | Apache Phoenix available        | Limited (CQL, not SQL)
High availability    | Adequate (master failover ~30s) | Excellent (no master)
Operational overlap  | Shares ops with Hadoop cluster  | Separate ops team
Write throughput     | Very high                       | Extremely high

Both are proven at internet scale. The decision almost always comes down to: Do you already have Hadoop? If yes, HBase is the natural fit for random-access storage alongside HDFS. If you're operating independently or need multi-region active-active, Cassandra is the stronger choice.