# Apache Hive

## What Is Hive?
Apache Hive is a data warehouse layer built on top of Hadoop that lets you query HDFS data using a SQL-like language called HiveQL. Instead of writing MapReduce jobs in Java, analysts can write familiar SQL and Hive compiles it into MapReduce, Tez, or Spark execution plans automatically.
```
Client (HiveQL)
      │
      ▼
Hive Metastore ──── stores table schemas, partition info
      │
      ▼
Execution Engine (MapReduce / Tez / Spark)
      │
      ▼
HDFS (reads/writes ORC, Parquet, Text files)
```
## Key Concepts
| Concept | Description |
|---|---|
| Database | Namespace for tables (maps to an HDFS directory) |
| Table | Schema definition over files in HDFS |
| Partition | Subdirectory split by column value (e.g., dt=2026-04-29) |
| Bucket | Hash-partitioned files for faster sampling and joins |
| Metastore | Relational DB (MySQL/Postgres) storing all schema metadata |
| SerDe | Serializer/Deserializer — tells Hive how to parse file formats |
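Buckets, for instance, are declared at table-creation time with `CLUSTERED BY`; a minimal sketch, assuming a hypothetical `users` table:

```sql
-- Hash-distribute rows by id into 32 bucket files;
-- equi-joins and TABLESAMPLE on id can then skip whole buckets
CREATE TABLE users (
  id   BIGINT,
  name STRING
)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC;
```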
## Installation Quick Start
```bash
# Set Hive environment
export HIVE_HOME=/opt/hive
export PATH=$PATH:$HIVE_HOME/bin

# Initialize the Metastore schema (first time only)
schematool -initSchema -dbType mysql

# Start the Hive Metastore service
hive --service metastore &

# Start HiveServer2 (JDBC/ODBC endpoint)
hiveserver2 &

# Open Hive CLI
hive
```
## HiveQL Examples

### Create an external table over existing HDFS data
```sql
-- External table: Hive manages the schema but NOT the data files;
-- DROP TABLE removes only the metadata, the files stay in HDFS
CREATE EXTERNAL TABLE web_logs (
  ip     STRING,
  ts     TIMESTAMP,
  method STRING,
  url    STRING,
  status INT,
  bytes  BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs/web/';
```
### Partitioned table for efficient queries
```sql
CREATE TABLE sales (
  order_id BIGINT,
  product  STRING,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (dt STRING, region STRING)
STORED AS ORC;

-- Add a partition
ALTER TABLE sales ADD PARTITION (dt='2026-04-29', region='us-west');

-- Query only the relevant partition (avoids a full table scan)
SELECT product, SUM(amount)
FROM sales
WHERE dt = '2026-04-29' AND region = 'us-west'
GROUP BY product;
```
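Adding partitions one `ALTER TABLE` at a time gets tedious for bulk loads; with dynamic partitioning enabled, Hive derives the partition values from the data itself. A sketch, assuming a hypothetical `staging_sales` table with matching columns:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns (dt, region) must come LAST in the SELECT list;
-- Hive creates one partition per distinct (dt, region) pair
INSERT OVERWRITE TABLE sales PARTITION (dt, region)
SELECT order_id, product, amount, dt, region
FROM staging_sales;
```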
### Join a large table to a smaller one

```sql
-- Hive converts this to a map-side join when the smaller table
-- fits under the size threshold (here 25 MB), avoiding a shuffle
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT o.order_id, c.name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.dt = '2026-04-29';
```
### Export results back to HDFS
```sql
INSERT OVERWRITE DIRECTORY '/output/daily-summary'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT region, SUM(amount) AS total
FROM sales
WHERE dt = '2026-04-29'
GROUP BY region;
```
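HiveQL can also write results to the client machine's local filesystem instead of HDFS via the `LOCAL` keyword; a variant of the same export:

```sql
-- LOCAL targets the filesystem of the machine running the query client
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/daily-summary'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT region, SUM(amount) AS total
FROM sales
WHERE dt = '2026-04-29'
GROUP BY region;
```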
## File Formats
| Format | Best For | Compression |
|---|---|---|
| ORC | Hive-native analytics, ACID tables | Zlib, Snappy, LZ4 |
| Parquet | Read-heavy analytics, multi-engine | Snappy, Gzip |
| Avro | Schema evolution, streaming ingestion | Snappy, Deflate |
| TextFile | Human-readable, debugging | Gzip (not splittable) |
In production, prefer ORC or Parquet: they typically deliver 3–10× compression over plain text and support columnar predicate pushdown, so queries read only the columns and row groups they need.
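Migrating an existing text table to a columnar format is a single `CREATE TABLE AS SELECT`; a sketch reusing the `web_logs` table defined earlier:

```sql
-- Rewrite the text-format web_logs into ORC with Snappy compression
CREATE TABLE web_logs_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY')
AS SELECT * FROM web_logs;
```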
## Performance Tips
```sql
-- Enable the cost-based optimizer
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;

-- Recompute table statistics after bulk loads
ANALYZE TABLE sales PARTITION (dt='2026-04-29') COMPUTE STATISTICS;
ANALYZE TABLE sales PARTITION (dt='2026-04-29') COMPUTE STATISTICS FOR COLUMNS;

-- Enable vectorized query execution
SET hive.vectorized.execution.enabled=true;

-- Use Tez instead of MapReduce (much faster for interactive queries)
SET hive.execution.engine=tez;
```
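To see which plan and engine a query will use before paying for the run, `EXPLAIN` prints the execution plan without executing it; for example:

```sql
-- Shows the operator tree (and the Tez DAG when Tez is the engine);
-- check that partition pruning and map joins actually kicked in
EXPLAIN
SELECT product, SUM(amount)
FROM sales
WHERE dt = '2026-04-29'
GROUP BY product;
```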
## `hive-site.xml` Key Settings
```xml
<!-- Metastore connection -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>

<!-- Enable dynamic partitioning -->
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
```
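A session-level `SET hive.execution.engine=tez;` can likewise be made the cluster-wide default here; a sketch, assuming Tez is installed on the cluster:

```xml
<!-- Default execution engine (requires Tez to be installed) -->
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
```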