
Hive vs Presto vs Trino: Choosing a SQL Engine for Your Data Lake

Hadoop.so Editorial Team (Big Data Engineers) · 6 min read

Three SQL engines dominate the Hadoop data lake landscape: Apache Hive, Presto, and Trino (the community fork of Presto). Each evolved to solve a different problem, and picking the wrong one leads either to unbearably slow interactive queries or to over-engineered infrastructure for simple batch ETL. Here's how they compare.

Origin Stories

Apache Hive was created at Facebook in 2008 as a way to run SQL-like queries over HDFS data without writing raw MapReduce jobs. It translates HiveQL into MapReduce (or later Tez/Spark) execution plans.

Presto was also created at Facebook, in 2012, specifically because Hive was too slow for interactive analysis. It uses an in-memory, pipelined execution model that originally had no disk spill at all; spill-to-disk support was added later as an escape hatch for memory-heavy queries.

Trino (formerly PrestoSQL) forked from Presto in 2019 after the original Presto creators left Facebook. It is now the community-driven successor with the larger contributor base. The two remain largely compatible, but they have diverged in features, connectors, and release cadence.


Architecture Comparison

Hive

HiveQL Query
     │
     ▼
Hive Metastore (schema + table definitions)
     │
     ▼
Query Planner (HiveQL → Tez/Spark DAG)
     │
     ▼
YARN execution (reads HDFS blocks from DataNodes)
     │
     ▼
Results (written to HDFS or streamed to client)

Hive is a batch SQL engine. Queries produce YARN jobs that read from HDFS, shuffle through reducers, and write results — the same data path as MapReduce. It's reliable and handles arbitrarily large datasets, but startup overhead and disk-based shuffle mean latency is measured in minutes, not seconds.
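Which engine Hive compiles to is a per-session setting, and EXPLAIN lets you inspect the resulting plan without paying for a run. A quick sketch (assumes a cluster with Tez installed; `sales_raw` is an illustrative table):

```sql
-- Choose the execution engine for this session ('mr' is deprecated; 'tez' is the common default)
SET hive.execution.engine=tez;

-- EXPLAIN shows the DAG Hive will submit to YARN, without executing it
EXPLAIN
SELECT region, COUNT(*) AS cnt
FROM sales_raw
GROUP BY region;
```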

Presto / Trino

SQL Query
     │
     ▼
Coordinator Node (plan + split generation)
     │
     ▼
Worker Nodes (in-memory pipelined execution)
  │        │        │
  ▼        ▼        ▼
Data Sources: HDFS, S3, HBase, MySQL, Kafka, ...
     │
     ▼
Results streamed back to client

Presto/Trino uses a coordinator/worker model with fully pipelined, in-memory execution. No intermediate disk writes between stages (unless memory pressure forces spill). Latency is measured in seconds for most interactive queries.
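That spill escape hatch is opt-in. A minimal sketch of the relevant Trino worker settings (the spill path is an example; size thresholds are left at their defaults):

```properties
# etc/config.properties (add on workers to allow spilling under memory pressure)
spill-enabled=true
spiller-spill-path=/mnt/fast-local-disk/trino-spill
```

Spilling trades latency for completion: queries that would otherwise fail with an out-of-memory error finish slowly instead.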


Feature Comparison

| Feature | Hive | Presto | Trino |
|---|---|---|---|
| Query model | Batch (MapReduce/Tez/Spark) | Interactive, in-memory | Interactive, in-memory |
| Latency | Minutes (startup + shuffle overhead) | Seconds | Seconds |
| Data volume | Unlimited (disk-backed) | Limited by cluster RAM (spill available) | Limited by cluster RAM (spill available) |
| ACID transactions | Yes (Hive 3+ with ORC) | No on Hive tables (read + INSERT only) | No on Hive tables; full DML via Iceberg/Delta |
| Data formats | ORC, Parquet, Avro, Text, JSON | ORC, Parquet, Avro, Text, JSON | ORC, Parquet, Avro, Text, JSON |
| Metastore | Built-in (HMS) | Uses Hive Metastore | Uses Hive Metastore |
| Federated queries | No | Yes (connectors) | Yes (connectors) |
| Streaming | Limited | No | No |
| UDFs | Java, Python (via TRANSFORM) | Java (via plugin) | Java (via plugin) |
| YARN integration | Native | Optional (via YARN service) | Optional |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |

Performance: When Does Each Win?

Hive is faster when...

  • Query touches the entire dataset — full table scans with heavy aggregations where disk I/O is the bottleneck, not startup time
  • Data doesn't fit in cluster RAM — Hive's disk-backed shuffle handles any data size gracefully
  • ACID updates/deletes are required — Hive 3+ with ORC and compaction supports MERGE, UPDATE, DELETE
  • Long-running ETL pipelines — startup overhead amortizes over hour-long jobs
-- Hive: Efficient for full-table ETL
INSERT OVERWRITE TABLE sales_summary
SELECT region, product_category, SUM(revenue) as total_revenue
FROM sales_raw
WHERE dt BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY region, product_category;
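The ACID bullet deserves its own example: row-level DML works only on transactional tables. A sketch with illustrative table and column names (Hive 3 makes managed ORC tables transactional by default; the property is shown explicitly here):

```sql
-- Transactional table: ORC storage plus the transactional flag
CREATE TABLE customer_dim (
  customer_id BIGINT,
  email       STRING,
  status      STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level DML that Presto/Trino cannot run against plain Hive tables
UPDATE customer_dim SET status = 'churned' WHERE customer_id = 42;
DELETE FROM customer_dim WHERE status = 'test_account';
```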

Presto/Trino is faster when...

  • Query touches a small partition — partition pruning + in-memory execution means sub-second results
  • Interactive BI dashboards — analysts run dozens of queries per session; Hive's per-query startup overhead (5–30s) kills interactivity
  • Federated queries across systems — join HDFS data with MySQL, Kafka, or a REST API in one query
  • Ad hoc exploration — analysts iterating on queries don't want to wait 2 minutes per attempt
-- Trino: Sub-second interactive query with partition pruning
SELECT customer_id, COUNT(*) as order_count, SUM(amount) as total
FROM hive.sales.orders
WHERE order_date = CURRENT_DATE - INTERVAL '7' DAY
AND region = 'US'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 100;

Federated Queries: Trino's Superpower

Trino's connector architecture lets you query multiple data sources in a single SQL statement:

-- Join HDFS data with live MySQL data
SELECT
h.user_id,
h.total_purchases,
m.email,
m.subscription_tier
FROM hive.warehouse.user_purchases h
JOIN mysql.crm.users m ON h.user_id = m.id
WHERE h.purchase_date > DATE '2024-01-01'
AND m.subscription_tier = 'premium';

Available Trino connectors include Hive/HDFS, Iceberg, Delta Lake, Hudi, MySQL, PostgreSQL, Kafka, Cassandra, Elasticsearch, MongoDB, and dozens more. Hive has no equivalent capability.
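Each source in a federated query is registered as a catalog through a single properties file. A sketch for the mysql catalog referenced in the join above (host and credentials are placeholders):

```properties
# etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://crm-db:3306
connection-user=trino_reader
connection-password=changeme
```

After a restart, the source's tables appear as mysql.&lt;schema&gt;.&lt;table&gt;, e.g. mysql.crm.users.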


Configuration Quick Start

Hive Metastore (shared by all three engines)

<!-- hive-site.xml -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-db:3306/metastore</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>

Trino Configuration

# etc/config.properties (coordinator)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery.uri=http://coordinator:8080

# etc/catalog/hive.properties
connector.name=hive
hive.metastore.uri=thrift://metastore-host:9083
hive.config.resources=/etc/hadoop/core-site.xml,/etc/hadoop/hdfs-site.xml

# Query via Trino CLI
trino --server coordinator:8080 --catalog hive --schema warehouse
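The config shown above is for the coordinator; each worker runs a near-identical file. A minimal sketch:

```properties
# etc/config.properties (worker)
coordinator=false
http-server.http.port=8080
discovery.uri=http://coordinator:8080
```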

Table Format: The Modern Choice

All three engines support modern open table formats that add ACID capabilities to object stores:

| Format | Hive | Presto | Trino |
|---|---|---|---|
| Apache Iceberg | Yes (3.x+) | Yes | Yes (best support) |
| Delta Lake | Limited | Via delta connector | Yes |
| Apache Hudi | Yes | Limited | Yes |

Apache Iceberg is the recommended format for new data lakes — it provides ACID transactions, schema evolution, time travel, and hidden partitioning, and it's supported well across all three engines.
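Trying Iceberg from Trino takes only one more catalog file pointed at the same metastore; a sketch (catalog name and host are illustrative):

```properties
# etc/catalog/iceberg.properties
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://metastore-host:9083
```

Tables created through this catalog get full INSERT/UPDATE/DELETE/MERGE support from Trino.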


When to Use Each

| Use case | Best choice |
|---|---|
| Overnight ETL loading petabytes into the warehouse | Hive (or Spark SQL) |
| Interactive BI dashboards with SLA < 5 s | Trino |
| Ad hoc analyst queries on partitioned data | Trino or Presto |
| ACID UPDATE/DELETE/MERGE on HDFS tables | Hive 3+ |
| Federated queries across HDFS + RDBMS + Kafka | Trino |
| Legacy HiveQL compatibility (UDFs, SerDes) | Hive |
| Mixed batch + interactive on the same cluster | Hive (batch) + Trino (interactive) |

The Practical Answer

In 2025, most data-mature organizations run both Hive and Trino:

  • Hive (on Tez or Spark) for overnight batch ETL, data loading, and ACID operations
  • Trino for all analyst-facing interactive queries and BI tool connectivity

They share the same Hive Metastore and read the same HDFS/S3 files in ORC or Parquet format. This gives you the best of both worlds: reliable batch processing and fast interactive queries, without migrating data or rewriting pipelines.

Presto vs Trino: if you're starting fresh, choose Trino — it has a more active open-source community, more frequent releases, and broader connector support.