Apache HBase
What Is HBase?
Apache HBase is a distributed, column-oriented NoSQL database built on top of HDFS. Unlike Hive, which is optimized for batch analytics, HBase provides millisecond-latency random read/write access to billions of rows and millions of columns.
HBase is modeled after Google Bigtable and is the right choice when you need:
- Single-row lookups by key in milliseconds
- High-volume writes (millions per second across a cluster)
- Sparse data with variable columns per row
- Time-series or versioned data
Client
│
▼
Region Servers ── serve reads and writes for key ranges
│
▼
HDFS (persistent storage of HFiles)

HBase Master ── manages region assignments and DDL (not on the read/write path)
ZooKeeper ── coordination: master election, region location bootstrap for clients
Data Model
| Concept | Description |
|---|---|
| Table | Collection of rows, identified by a row key |
| Row Key | Unique binary key — rows are stored sorted by key |
| Column Family | Group of columns stored together on disk (defined at table creation) |
| Column Qualifier | Individual column within a family (dynamic, defined at write time) |
| Cell | Intersection of row + column family + qualifier |
| Timestamp/Version | Each cell stores multiple versions by timestamp |
A cell address: Table → RowKey → ColumnFamily:Qualifier → Timestamp → Value
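The cell addressing hierarchy above can be sketched in plain Java with nested sorted maps (no HBase dependency). This is an illustrative model, not HBase's actual storage format: rows sort by key, each row holds `family:qualifier` columns, and each cell keeps versions ordered newest-first.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative model of HBase's logical layout:
// rowKey -> "family:qualifier" -> timestamp -> value.
// TreeMap keeps rows sorted by key; a reverse-ordered inner map keeps
// the newest cell version first, as HBase does.
public class CellModel {
    static final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
            new TreeMap<>();

    static void put(String rowKey, String column, long ts, String value) {
        table.computeIfAbsent(rowKey, k -> new TreeMap<>())
             .computeIfAbsent(column, k -> new TreeMap<>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    // A default read returns the newest version, mirroring an HBase get
    static String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = table.get(rowKey);
        if (row == null || row.get(column) == null) return null;
        return row.get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        put("user_001", "profile:name", 1L, "Alice");
        put("user_001", "profile:name", 2L, "Alicia"); // newer version shadows older
        System.out.println(get("user_001", "profile:name")); // prints "Alicia"
    }
}
```

Note how a write never overwrites: it adds a new version at a higher timestamp, and reads simply pick the newest one.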
Installation Quick Start
export HBASE_HOME=/opt/hbase
export PATH=$PATH:$HBASE_HOME/bin
# Configure ZooKeeper quorum in hbase-site.xml first, then:
start-hbase.sh
# Open HBase Shell
hbase shell
HBase Shell Examples
# Create a table with two column families
create 'user_events', 'profile', 'activity'
# Insert data (put rowkey, 'family:qualifier', 'value')
put 'user_events', 'user_001', 'profile:name', 'Alice'
put 'user_events', 'user_001', 'profile:email', 'alice@example.com'
put 'user_events', 'user_001', 'activity:login', '2026-04-29T08:00:00'
put 'user_events', 'user_001', 'activity:page', '/dashboard'
# Get a single row
get 'user_events', 'user_001'
# Get specific columns
get 'user_events', 'user_001', {COLUMN => 'profile:name'}
# Scan a range of rows
scan 'user_events', {STARTROW => 'user_000', STOPROW => 'user_010'}
# Scan with filter
scan 'user_events', {FILTER => "ValueFilter(=, 'binary:Alice')"}
# Delete a specific cell
delete 'user_events', 'user_001', 'activity:page'
# Delete an entire row
deleteall 'user_events', 'user_001'
# View table schema
describe 'user_events'
# Disable and drop a table
disable 'user_events'
drop 'user_events'
Java API Example
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("user_events"))) {

    // Write
    Put put = new Put(Bytes.toBytes("user_001"));
    put.addColumn(
        Bytes.toBytes("profile"),
        Bytes.toBytes("name"),
        Bytes.toBytes("Alice")
    );
    table.put(put);

    // Read
    Get get = new Get(Bytes.toBytes("user_001"));
    Result result = table.get(get);
    byte[] name = result.getValue(
        Bytes.toBytes("profile"),
        Bytes.toBytes("name")
    );
    System.out.println(Bytes.toString(name)); // Alice
}
hbase-site.xml Configuration
<property>
<name>hbase.rootdir</name>
<value>hdfs://namenode:8020/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<!-- Region size before automatic split -->
<property>
<name>hbase.hregion.max.filesize</name>
<value>10737418240</value> <!-- 10 GB -->
</property>
Row Key Design
Row key design is critical — HBase stores rows sorted by key, so poor design causes hotspotting (all writes hit one Region Server).
| Pattern | Good For | Avoid When |
|---|---|---|
| Reverse timestamp (Long.MAX_VALUE - ts) | Latest-first time-series | Random access by entity |
| Salted prefix (hash prefix + entity ID) | High write throughput | Range scans needed |
| Composite key (region#userId#ts) | Multi-dimensional lookups | Key becomes too long |
Example — avoid sequential user IDs (hotspot):
BAD: user_0001, user_0002, user_0003 ... (sequential → all to one region)
GOOD: a3f2_user_0001, b7c1_user_0002 ... (hash prefix → spread across regions)
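The hash-prefix pattern above can be sketched in plain Java. This is one common scheme, not a fixed HBase convention: the choice of MD5 and a 4-hex-character prefix here are illustrative assumptions, as is the reverse-timestamp helper for latest-first keys.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative row key helpers: a deterministic hash prefix spreads
// sequential IDs across regions; a reversed timestamp makes the newest
// event sort first in a scan.
public class RowKeySalt {
    static String saltedKey(String userId) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(userId.getBytes(StandardCharsets.UTF_8));
            // First two digest bytes as a 4-hex-char prefix, e.g. "a3f2_user_0001"
            return String.format("%02x%02x_%s",
                    digest[0] & 0xff, digest[1] & 0xff, userId);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always present in the JDK
        }
    }

    // Reverse timestamp so a scan returns the newest rows first
    static long reverseTimestamp(long epochMillis) {
        return Long.MAX_VALUE - epochMillis;
    }

    public static void main(String[] args) {
        System.out.println(saltedKey("user_0001"));
        System.out.println(saltedKey("user_0002"));
    }
}
```

Because the prefix is derived deterministically from the entity ID, point lookups can still recompute the full key; the cost is that range scans over raw IDs are no longer contiguous.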
HBase vs Hive
| Aspect | HBase | Hive |
|---|---|---|
| Access pattern | Random reads/writes (ms latency) | Full table scans (batch) |
| Query language | Java API / Shell | HiveQL (SQL) |
| Use case | OLTP-style, event stores | OLAP analytics |
| Schema | Flexible (add column qualifiers at write time) | Declared up front (schema-on-read) |
| Storage | HDFS (HFile format) | HDFS (ORC/Parquet/Text) |