Apache Hive

What Is Hive?

Apache Hive is a data warehouse layer built on top of Hadoop that lets you query HDFS data using a SQL-like language called HiveQL. Instead of writing MapReduce jobs in Java, analysts can write familiar SQL and Hive compiles it into MapReduce, Tez, or Spark execution plans automatically.
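For example, a familiar aggregate query like the one below is compiled into a distributed job with no Java required (it runs against the web_logs table defined later in this guide):

```sql
-- Top 10 URLs by page views; Hive turns this into a
-- MapReduce/Tez/Spark job automatically.
SELECT url, COUNT(*) AS views
FROM web_logs
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```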

Client (HiveQL)
      │
      ▼
Hive Metastore ──── stores table schemas, partition info
      │
      ▼
Execution Engine (MapReduce / Tez / Spark)
      │
      ▼
HDFS (reads/writes ORC, Parquet, Text files)

Key Concepts

Concept    Description
---------  ------------------------------------------------------------
Database   Namespace for tables (maps to an HDFS directory)
Table      Schema definition over files in HDFS
Partition  Subdirectory split by column value (e.g., dt=2026-04-29)
Bucket     Hash-partitioned files for faster sampling and joins
Metastore  Relational DB (MySQL/Postgres) storing all schema metadata
SerDe      Serializer/Deserializer; tells Hive how to parse file formats
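Bucketing from the table above can be sketched in HiveQL (the user_events table here is illustrative):

```sql
-- Bucketed table: rows are hash-distributed into 32 files by user_id,
-- which speeds up sampling and bucket map joins on that column.
CREATE TABLE user_events (
  user_id BIGINT,
  event   STRING,
  ts      TIMESTAMP
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Sample roughly 1/32 of the data using the bucket layout
SELECT * FROM user_events TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id);
```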

Installation Quick Start

# Set Hive environment
export HIVE_HOME=/opt/hive
export PATH=$PATH:$HIVE_HOME/bin

# Initialize the Metastore schema (first time only)
schematool -initSchema -dbType mysql

# Start the Hive Metastore service
hive --service metastore &

# Start HiveServer2 (JDBC/ODBC endpoint)
hiveserver2 &

# Open the legacy Hive CLI (deprecated since Hive 2.x; Beeline is the
# recommended client for HiveServer2)
hive

# Or connect with Beeline
beeline -u jdbc:hive2://localhost:10000

HiveQL Examples

Create an external table over existing HDFS data

-- External table: Hive manages schema but NOT the data files
CREATE EXTERNAL TABLE web_logs (
  ip     STRING,
  ts     TIMESTAMP,
  method STRING,
  url    STRING,
  status INT,
  bytes  BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs/web/';

Partitioned table for efficient queries

CREATE TABLE sales (
  order_id BIGINT,
  product  STRING,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (dt STRING, region STRING)
STORED AS ORC;

-- Add a partition
ALTER TABLE sales ADD PARTITION (dt='2026-04-29', region='us-west');

-- Query only the relevant partition (avoids full table scan)
SELECT product, SUM(amount)
FROM sales
WHERE dt = '2026-04-29' AND region = 'us-west'
GROUP BY product;

Join two large tables

-- With auto-conversion enabled, Hive rewrites this as a map-side join
-- when the smaller table is under the size threshold (25 MB here)
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT o.order_id, c.name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.dt = '2026-04-29';

Export results back to HDFS

INSERT OVERWRITE DIRECTORY '/output/daily-summary'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT region, SUM(amount) AS total
FROM sales
WHERE dt = '2026-04-29'
GROUP BY region;

File Formats

Format    Best For                               Compression
--------  -------------------------------------  ---------------------
ORC       Write-heavy analytics, Hive-native     Zlib, Snappy, LZ4
Parquet   Read-heavy analytics, multi-engine     Snappy, Gzip
Avro      Schema evolution, streaming ingestion  Snappy, Deflate
TextFile  Human-readable, debugging              Gzip (not splittable)

Always use ORC or Parquet in production: they typically offer 3–10× better compression than text and support columnar predicate pushdown, so queries read only the columns and stripes they need.
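Migrating existing text data to a columnar format is a single CREATE TABLE ... AS SELECT (using the web_logs table defined above):

```sql
-- Copy a text-backed table into an ORC table with Snappy compression
CREATE TABLE web_logs_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY')
AS SELECT * FROM web_logs;
```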

Performance Tips

-- Enable cost-based optimizer
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;

-- Analyze table statistics after bulk loads
ANALYZE TABLE sales PARTITION (dt='2026-04-29') COMPUTE STATISTICS;
ANALYZE TABLE sales PARTITION (dt='2026-04-29') COMPUTE STATISTICS FOR COLUMNS;

-- Enable vectorized query execution
SET hive.vectorized.execution.enabled=true;

-- Use Tez instead of MapReduce (much faster for interactive queries)
SET hive.execution.engine=tez;
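After tuning, EXPLAIN shows the chosen plan without running the query, so you can confirm that partition pruning and map joins actually kick in (shown here against the sales table from earlier):

```sql
-- Inspect the execution plan without executing the query
EXPLAIN
SELECT product, SUM(amount)
FROM sales
WHERE dt = '2026-04-29'
GROUP BY product;
```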

hive-site.xml Key Settings

<!-- Metastore connection -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>

<!-- Enable dynamic partitioning -->
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>