Apache HBase

What Is HBase?

Apache HBase is a distributed, column-oriented NoSQL database built on top of HDFS. Unlike Hive, which is optimized for batch analytics, HBase provides millisecond-latency random read/write access to billions of rows and millions of columns.

HBase is modeled after Google Bigtable and is the right choice when you need:

  • Single-row lookups by key in milliseconds
  • High-volume writes (millions per second across a cluster)
  • Sparse data with variable columns per row
  • Time-series or versioned data
Architecture

  Client
    │
  HBase Master    ── manages region assignments, DDL
    │
  Region Servers  ── serve reads and writes for key ranges
    │
  HDFS            ── persistent storage of HFiles

  ZooKeeper       ── coordination: master election, region locations
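Each Region Server owns a contiguous key range: a region covers the keys from its start key up to (but not including) the next region's start key. Routing a row key to its region is, in essence, a floor lookup in a sorted map. A minimal JDK-only sketch (the region names and boundaries here are made up for illustration):

```java
import java.util.TreeMap;

// Illustration of key-range routing: region start key -> region name.
// Real HBase keeps this mapping in the hbase:meta table; this sketch
// only demonstrates the floor-lookup idea with hypothetical regions.
public class RegionLookup {
    static final TreeMap<String, String> regions = new TreeMap<>();
    static {
        regions.put("", "region-1");          // first region starts at the empty key
        regions.put("user_100", "region-2");  // covers [user_100, user_200)
        regions.put("user_200", "region-3");  // covers [user_200, end)
    }

    // The serving region is the one with the greatest start key <= rowKey.
    static String lookup(String rowKey) {
        return regions.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        System.out.println(lookup("user_150")); // prints region-2
    }
}
```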

Data Model

| Concept | Description |
|---|---|
| Table | Collection of rows, identified by a row key |
| Row Key | Unique binary key — rows are stored sorted by key |
| Column Family | Group of columns stored together on disk (defined at table creation) |
| Column Qualifier | Individual column within a family (dynamic, defined at write time) |
| Cell | Intersection of row + column family + qualifier |
| Timestamp/Version | Each cell stores multiple versions by timestamp |

A cell address: Table → RowKey → ColumnFamily:Qualifier → Timestamp → Value
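This addressing hierarchy can be modeled as nested sorted maps, purely as an illustration of the logical model (HBase actually stores cells in HFiles, not in-memory maps). Row keys sort lexicographically and versions sort newest-first, which is why an unqualified read returns the latest version:

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Logical model only: RowKey -> "family:qualifier" -> timestamp -> value.
public class CellModel {
    static final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
        new TreeMap<>();

    static void put(String rowKey, String column, long ts, String value) {
        table.computeIfAbsent(rowKey, k -> new TreeMap<>())
             // versions kept newest-first via a reversed comparator
             .computeIfAbsent(column, k -> new TreeMap<>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    // Mirrors a default get: the newest version wins.
    // (Throws NullPointerException if the cell is absent; fine for a sketch.)
    static String get(String rowKey, String column) {
        return table.get(rowKey).get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        put("user_001", "profile:name", 100L, "Alice");
        put("user_001", "profile:name", 200L, "Alicia"); // newer version
        System.out.println(get("user_001", "profile:name")); // prints Alicia
    }
}
```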

Installation Quick Start

export HBASE_HOME=/opt/hbase
export PATH=$PATH:$HBASE_HOME/bin

# Configure ZooKeeper quorum in hbase-site.xml first, then:
start-hbase.sh

# Open HBase Shell
hbase shell

HBase Shell Examples

# Create a table with two column families
create 'user_events', 'profile', 'activity'

# Insert data (put rowkey, 'family:qualifier', 'value')
put 'user_events', 'user_001', 'profile:name', 'Alice'
put 'user_events', 'user_001', 'profile:email', 'alice@example.com'
put 'user_events', 'user_001', 'activity:login', '2026-04-29T08:00:00'
put 'user_events', 'user_001', 'activity:page', '/dashboard'

# Get a single row
get 'user_events', 'user_001'

# Get specific columns
get 'user_events', 'user_001', {COLUMN => 'profile:name'}

# Scan a range of rows
scan 'user_events', {STARTROW => 'user_000', STOPROW => 'user_010'}

# Scan with filter
scan 'user_events', {FILTER => "ValueFilter(=, 'binary:Alice')"}

# Delete a specific cell
delete 'user_events', 'user_001', 'activity:page'

# Delete an entire row
deleteall 'user_events', 'user_001'

# View table schema
describe 'user_events'

# Disable and drop a table
disable 'user_events'
drop 'user_events'

Java API Example

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("user_events"))) {

    // Write
    Put put = new Put(Bytes.toBytes("user_001"));
    put.addColumn(
        Bytes.toBytes("profile"),
        Bytes.toBytes("name"),
        Bytes.toBytes("Alice")
    );
    table.put(put);

    // Read
    Get get = new Get(Bytes.toBytes("user_001"));
    Result result = table.get(get);
    byte[] name = result.getValue(
        Bytes.toBytes("profile"),
        Bytes.toBytes("name")
    );
    System.out.println(Bytes.toString(name)); // Alice
}

hbase-site.xml Configuration

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode:8020/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<!-- Region size before automatic split -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value> <!-- 10 GB -->
</property>

Row Key Design

Row key design is critical — HBase stores rows sorted by key, so poor design causes hotspotting (all writes hit one Region Server).

| Pattern | Good For | Avoid When |
|---|---|---|
| Reverse timestamp (Long.MAX_VALUE - ts) | Latest-first time-series | Random access by entity |
| Salted prefix (hash prefix + entity ID) | High write throughput | Range scans needed |
| Composite key (region#userId#ts) | Multi-dimensional lookups | Key becomes too long |

Example — avoid sequential user IDs (hotspot):

BAD:  user_0001, user_0002, user_0003 ...  (sequential → all to one region)
GOOD: a3f2_user_0001, b7c1_user_0002 ... (hash prefix → spread across regions)
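The reverse-timestamp and salted-prefix patterns can be sketched with only the JDK. Note the specifics here are arbitrary choices for illustration, not HBase conventions: the 4-hex-character salt width, the '_' separator, and MD5 as the salt hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeys {
    // Reverse timestamp: because keys sort lexicographically and the
    // number is zero-padded to a fixed width, newer rows sort first.
    static String reverseTs(String entity, long ts) {
        return String.format("%s_%019d", entity, Long.MAX_VALUE - ts);
    }

    // Salted prefix: a stable hash of the entity ID spreads sequential
    // IDs across regions, at the cost of ordered range scans.
    static String salted(String entityId) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                .digest(entityId.getBytes(StandardCharsets.UTF_8));
        // First two digest bytes as a 4-hex-char prefix (arbitrary width).
        return String.format("%02x%02x_%s", d[0] & 0xff, d[1] & 0xff, entityId);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(reverseTs("user_001", 1_700_000_000_000L));
        System.out.println(salted("user_0001"));
        System.out.println(salted("user_0002")); // different prefix, different region
    }
}
```

Because the salt is a hash of the ID rather than a random value, the key is reproducible: the same entity can still be looked up by recomputing its prefix.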

HBase vs Hive

| | HBase | Hive |
|---|---|---|
| Access pattern | Random reads/writes (ms latency) | Full table scans (batch) |
| Query language | Java API / Shell | HiveQL (SQL) |
| Use case | OLTP-style, event stores | OLAP analytics |
| Schema | Flexible (add columns anytime) | Fixed at table creation |
| Storage | HDFS (HFile format) | HDFS (ORC/Parquet/Text) |