# Apache Hive

## What Is Hive?
Apache Hive is a data warehouse layer built on top of Hadoop that lets you query HDFS data using a SQL-like language called HiveQL. Instead of writing MapReduce jobs in Java, analysts can write familiar SQL and Hive compiles it into MapReduce, Tez, or Spark execution plans automatically.
```
Client (HiveQL)
      │
      ▼
Hive Metastore ──── stores table schemas, partition info
      │
      ▼
Execution Engine (MapReduce / Tez / Spark)
      │
      ▼
HDFS (reads/writes ORC, Parquet, Text files)
```
## Key Concepts
| Concept | Description |
|---|---|
| Database | Namespace for tables (maps to an HDFS directory) |
| Table | Schema definition over files in HDFS |
| Partition | Subdirectory split by column value (e.g., dt=2026-04-29) |
| Bucket | Hash-partitioned files for faster sampling and joins |
| Metastore | Relational DB (MySQL/Postgres) storing all schema metadata |
| SerDe | Serializer/Deserializer — tells Hive how to parse file formats |
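Buckets, for instance, are declared at table-creation time with `CLUSTERED BY`; a minimal sketch, assuming a hypothetical `users` table:

```sql
-- Hash-distribute rows by id into 32 bucket files;
-- equi-joins and TABLESAMPLE on id can then skip whole buckets
CREATE TABLE users (
  id   BIGINT,
  name STRING
)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC;
```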
## Installation Quick Start
```bash
# Set Hive environment
export HIVE_HOME=/opt/hive
export PATH=$PATH:$HIVE_HOME/bin

# Initialize the Metastore schema (first time only)
schematool -initSchema -dbType mysql

# Start the Hive Metastore service
hive --service metastore &

# Start HiveServer2 (JDBC/ODBC endpoint)
hiveserver2 &

# Open Hive CLI
hive
```
## HiveQL Examples

### Create an external table over existing HDFS data
```sql
-- External table: Hive manages the schema but NOT the data files;
-- DROP TABLE removes only the metadata, the files stay in HDFS
CREATE EXTERNAL TABLE web_logs (
  ip     STRING,
  ts     TIMESTAMP,
  method STRING,
  url    STRING,
  status INT,
  bytes  BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs/web/';
```
### Partitioned table for efficient queries
```sql
CREATE TABLE sales (
  order_id BIGINT,
  product  STRING,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (dt STRING, region STRING)
STORED AS ORC;

-- Add a partition
ALTER TABLE sales ADD PARTITION (dt='2026-04-29', region='us-west');

-- Query only the relevant partition (avoids a full table scan)
SELECT product, SUM(amount)
FROM sales
WHERE dt = '2026-04-29' AND region = 'us-west'
GROUP BY product;
```
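Adding partitions one `ALTER TABLE` at a time gets tedious for bulk loads; with dynamic partitioning enabled, Hive derives the partition values from the data itself. A sketch, assuming a hypothetical `staging_sales` table with matching columns:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns (dt, region) must come LAST in the SELECT list;
-- Hive creates one partition per distinct (dt, region) pair
INSERT OVERWRITE TABLE sales PARTITION (dt, region)
SELECT order_id, product, amount, dt, region
FROM staging_sales;
```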
### Join a large table to a smaller one

```sql
-- Hive converts this to a map-side join when the smaller table
-- fits under the size threshold (here 25 MB), avoiding a shuffle
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT o.order_id, c.name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.dt = '2026-04-29';
```
### Export results back to HDFS
```sql
INSERT OVERWRITE DIRECTORY '/output/daily-summary'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT region, SUM(amount) AS total
FROM sales
WHERE dt = '2026-04-29'
GROUP BY region;
```
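HiveQL can also write results to the client machine's local filesystem instead of HDFS via the `LOCAL` keyword; a variant of the same export:

```sql
-- LOCAL targets the filesystem of the machine running the query client
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/daily-summary'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT region, SUM(amount) AS total
FROM sales
WHERE dt = '2026-04-29'
GROUP BY region;
```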
## File Formats
| Format | Best For | Compression |
|---|---|---|
| ORC | Hive-native analytics, ACID tables | Zlib, Snappy, LZ4 |
| Parquet | Read-heavy analytics, multi-engine | Snappy, Gzip |
| Avro | Schema evolution, streaming ingestion | Snappy, Deflate |
| TextFile | Human-readable, debugging | Gzip (not splittable) |
In production, prefer ORC or Parquet: they typically deliver 3–10× compression over plain text and support columnar predicate pushdown, so queries read only the columns and row groups they need.
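Migrating an existing text table to a columnar format is a single `CREATE TABLE AS SELECT`; a sketch reusing the `web_logs` table defined earlier:

```sql
-- Rewrite the text-format web_logs into ORC with Snappy compression
CREATE TABLE web_logs_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY')
AS SELECT * FROM web_logs;
```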
## Performance Tips
```sql
-- Enable the cost-based optimizer
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;

-- Recompute table statistics after bulk loads
ANALYZE TABLE sales PARTITION (dt='2026-04-29') COMPUTE STATISTICS;
ANALYZE TABLE sales PARTITION (dt='2026-04-29') COMPUTE STATISTICS FOR COLUMNS;

-- Enable vectorized query execution
SET hive.vectorized.execution.enabled=true;

-- Use Tez instead of MapReduce (much faster for interactive queries)
SET hive.execution.engine=tez;
```
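To see which plan and engine a query will use before paying for the run, `EXPLAIN` prints the execution plan without executing it; for example:

```sql
-- Shows the operator tree (and the Tez DAG when Tez is the engine);
-- check that partition pruning and map joins actually kicked in
EXPLAIN
SELECT product, SUM(amount)
FROM sales
WHERE dt = '2026-04-29'
GROUP BY product;
```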
## `hive-site.xml` Key Settings
```xml
<!-- Metastore connection -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>

<!-- Enable dynamic partitioning -->
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
```
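A session-level `SET hive.execution.engine=tez;` can likewise be made the cluster-wide default here; a sketch, assuming Tez is installed on the cluster:

```xml
<!-- Default execution engine (requires Tez to be installed) -->
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
```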