Apache HBase

What Is HBase?

Apache HBase is a distributed, column-oriented NoSQL database built on top of HDFS. Unlike Hive, which is optimized for batch analytics, HBase provides millisecond-latency random read/write access to billions of rows and millions of columns.

HBase is modeled after Google Bigtable and is the right choice when you need:

  • Single-row lookups by key in milliseconds
  • High-volume writes (millions per second across a cluster)
  • Sparse data with variable columns per row
  • Time-series or versioned data
Architecture

  Client
    │
  HBase Master    ── manages region assignments, DDL
    │
  Region Servers  ── serve reads and writes for key ranges
    │
  HDFS            ── persistent storage of HFiles

  ZooKeeper       ── coordination: master election, region locations
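Each Region Server owns a contiguous key range: a region covers the keys from its start key up to (but not including) the next region's start key. Routing a row key to its region is, in essence, a floor lookup in a sorted map. A minimal JDK-only sketch (the region names and boundaries here are made up for illustration):

```java
import java.util.TreeMap;

// Illustration of key-range routing: region start key -> region name.
// Real HBase keeps this mapping in the hbase:meta table; this sketch
// only demonstrates the floor-lookup idea with hypothetical regions.
public class RegionLookup {
    static final TreeMap<String, String> regions = new TreeMap<>();
    static {
        regions.put("", "region-1");          // first region starts at the empty key
        regions.put("user_100", "region-2");  // covers [user_100, user_200)
        regions.put("user_200", "region-3");  // covers [user_200, end)
    }

    // The serving region is the one with the greatest start key <= rowKey.
    static String lookup(String rowKey) {
        return regions.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        System.out.println(lookup("user_150")); // prints region-2
    }
}
```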

Data Model

| Concept | Description |
|---|---|
| Table | Collection of rows, identified by a row key |
| Row Key | Unique binary key — rows are stored sorted by key |
| Column Family | Group of columns stored together on disk (defined at table creation) |
| Column Qualifier | Individual column within a family (dynamic, defined at write time) |
| Cell | Intersection of row + column family + qualifier |
| Timestamp/Version | Each cell stores multiple versions by timestamp |

A cell address: Table → RowKey → ColumnFamily:Qualifier → Timestamp → Value
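This addressing hierarchy can be modeled as nested sorted maps, purely as an illustration of the logical model (HBase actually stores cells in HFiles, not in-memory maps). Row keys sort lexicographically and versions sort newest-first, which is why an unqualified read returns the latest version:

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Logical model only: RowKey -> "family:qualifier" -> timestamp -> value.
public class CellModel {
    static final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
        new TreeMap<>();

    static void put(String rowKey, String column, long ts, String value) {
        table.computeIfAbsent(rowKey, k -> new TreeMap<>())
             // versions kept newest-first via a reversed comparator
             .computeIfAbsent(column, k -> new TreeMap<>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    // Mirrors a default get: the newest version wins.
    // (Throws NullPointerException if the cell is absent; fine for a sketch.)
    static String get(String rowKey, String column) {
        return table.get(rowKey).get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        put("user_001", "profile:name", 100L, "Alice");
        put("user_001", "profile:name", 200L, "Alicia"); // newer version
        System.out.println(get("user_001", "profile:name")); // prints Alicia
    }
}
```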

Installation Quick Start

export HBASE_HOME=/opt/hbase
export PATH=$PATH:$HBASE_HOME/bin

# Configure ZooKeeper quorum in hbase-site.xml first, then:
start-hbase.sh

# Open HBase Shell
hbase shell

HBase Shell Examples

# Create a table with two column families
create 'user_events', 'profile', 'activity'

# Insert data (put rowkey, 'family:qualifier', 'value')
put 'user_events', 'user_001', 'profile:name', 'Alice'
put 'user_events', 'user_001', 'profile:email', 'alice@example.com'
put 'user_events', 'user_001', 'activity:login', '2026-04-29T08:00:00'
put 'user_events', 'user_001', 'activity:page', '/dashboard'

# Get a single row
get 'user_events', 'user_001'

# Get specific columns
get 'user_events', 'user_001', {COLUMN => 'profile:name'}

# Scan a range of rows
scan 'user_events', {STARTROW => 'user_000', STOPROW => 'user_010'}

# Scan with filter
scan 'user_events', {FILTER => "ValueFilter(=, 'binary:Alice')"}

# Delete a specific cell
delete 'user_events', 'user_001', 'activity:page'

# Delete an entire row
deleteall 'user_events', 'user_001'

# View table schema
describe 'user_events'

# Disable and drop a table
disable 'user_events'
drop 'user_events'

Java API Example

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("user_events"))) {

    // Write
    Put put = new Put(Bytes.toBytes("user_001"));
    put.addColumn(
        Bytes.toBytes("profile"),
        Bytes.toBytes("name"),
        Bytes.toBytes("Alice")
    );
    table.put(put);

    // Read
    Get get = new Get(Bytes.toBytes("user_001"));
    Result result = table.get(get);
    byte[] name = result.getValue(
        Bytes.toBytes("profile"),
        Bytes.toBytes("name")
    );
    System.out.println(Bytes.toString(name)); // Alice
}

hbase-site.xml Configuration

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode:8020/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<!-- Region size before automatic split -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value> <!-- 10 GB -->
</property>

Row Key Design

Row key design is critical — HBase stores rows sorted by key, so poor design causes hotspotting (all writes hit one Region Server).

| Pattern | Good For | Avoid When |
|---|---|---|
| Reverse timestamp (Long.MAX_VALUE - ts) | Latest-first time-series | Random access by entity |
| Salted prefix (hash prefix + entity ID) | High write throughput | Range scans needed |
| Composite key (region#userId#ts) | Multi-dimensional lookups | Key becomes too long |

Example — avoid sequential user IDs (hotspot):

BAD:  user_0001, user_0002, user_0003 ...  (sequential → all to one region)
GOOD: a3f2_user_0001, b7c1_user_0002 ... (hash prefix → spread across regions)
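The reverse-timestamp and salted-prefix patterns can be sketched with only the JDK. Note the specifics here are arbitrary choices for illustration, not HBase conventions: the 4-hex-character salt width, the '_' separator, and MD5 as the salt hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeys {
    // Reverse timestamp: because keys sort lexicographically and the
    // number is zero-padded to a fixed width, newer rows sort first.
    static String reverseTs(String entity, long ts) {
        return String.format("%s_%019d", entity, Long.MAX_VALUE - ts);
    }

    // Salted prefix: a stable hash of the entity ID spreads sequential
    // IDs across regions, at the cost of ordered range scans.
    static String salted(String entityId) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                .digest(entityId.getBytes(StandardCharsets.UTF_8));
        // First two digest bytes as a 4-hex-char prefix (arbitrary width).
        return String.format("%02x%02x_%s", d[0] & 0xff, d[1] & 0xff, entityId);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(reverseTs("user_001", 1_700_000_000_000L));
        System.out.println(salted("user_0001"));
        System.out.println(salted("user_0002")); // different prefix, different region
    }
}
```

Because the salt is a hash of the ID rather than a random value, the key is reproducible: the same entity can still be looked up by recomputing its prefix.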

HBase vs Hive

| | HBase | Hive |
|---|---|---|
| Access pattern | Random reads/writes (ms latency) | Full table scans (batch) |
| Query language | Java API / Shell | HiveQL (SQL) |
| Use case | OLTP-style, event stores | OLAP analytics |
| Schema | Flexible (add columns anytime) | Fixed at table creation |
| Storage | HDFS (HFile format) | HDFS (ORC/Parquet/Text) |