Apache Sqoop
What Is Sqoop?
Apache Sqoop (SQL-to-Hadoop) is a tool for bulk transfer of data between Hadoop and relational databases. It generates MapReduce jobs that run imports and exports in parallel, achieving far higher throughput than a single-threaded JDBC client.
Relational DB (MySQL, PostgreSQL, Oracle, SQL Server)
│
│ Sqoop Import ──────────────────────────► HDFS / Hive / HBase
│
│◄────────────────────────── Sqoop Export ── HDFS
Note: Apache Sqoop was retired to the Apache Attic in 2021 and receives no further development. Sqoop 1 (the original CLI tool) is maintenance-only, and Sqoop 2's server/client redesign never reached feature parity with it. For new projects, consider Apache Spark with JDBC or Apache NiFi for richer pipeline features.
Installation
export SQOOP_HOME=/opt/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
# Copy your JDBC driver into Sqoop's lib directory
cp mysql-connector-j-8.x.jar $SQOOP_HOME/lib/
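A common gotcha with --password-file (used throughout the examples below) is that Sqoop reads the file verbatim, so a trailing newline becomes part of the password, and the file must be readable only by its owner. A minimal Python sketch of creating one correctly (the path and secret here are placeholders):

```python
import os
import stat


def write_password_file(path, secret):
    """Create a Sqoop-style password file.

    The file contents are read verbatim, so write the secret exactly
    (no trailing newline, like `echo -n`) and restrict permissions
    to owner-read-only.
    """
    with open(path, "w") as f:
        f.write(secret)            # no trailing newline
    os.chmod(path, stat.S_IRUSR)   # permissions 0400
```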
Importing from a Database
Import a full table to HDFS
sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table orders \
--target-dir /data/retail/orders \
--num-mappers 4 \
--fields-terminated-by ',' \
--lines-terminated-by '\n'
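The --num-mappers setting works because Sqoop queries MIN and MAX of the split column (the primary key by default) and hands each mapper a contiguous range of values. A simplified Python sketch of that split arithmetic — not Sqoop's exact splitter, which also handles non-integer types and skew:

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the closed interval [lo, hi] into num_mappers contiguous
    ranges, roughly what Sqoop does with MIN/MAX of the split column."""
    step = (hi - lo) / num_mappers
    bounds = [lo + round(i * step) for i in range(num_mappers)] + [hi]
    return list(zip(bounds, bounds[1:]))  # one (start, end) pair per mapper
```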
Import with a free-form query (partial import)
sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--query "SELECT * FROM orders WHERE dt = '2026-04-29' AND \$CONDITIONS" \
--split-by order_id \
--target-dir /data/retail/orders-daily \
--num-mappers 8
$CONDITIONS is a placeholder that Sqoop replaces with each mapper's split predicate; it is required in the WHERE clause whenever --query is used. The backslash in the example keeps the shell from expanding it inside double quotes.
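To see what each mapper actually runs, here is a simplified sketch of the substitution: every mapper receives the same query with the token replaced by its own range predicate on the --split-by column (Sqoop also runs the query once with the token replaced by 1 = 0 just to probe column metadata):

```python
def mapper_queries(query, split_col, ranges):
    """Expand the $CONDITIONS token into one split predicate per mapper.

    `ranges` is a list of (start, end) pairs covering the split column;
    the last range is closed so the maximum value is included.
    """
    queries = []
    for i, (lo, hi) in enumerate(ranges):
        op = "<=" if i == len(ranges) - 1 else "<"
        pred = f"{split_col} >= {lo} AND {split_col} {op} {hi}"
        queries.append(query.replace("$CONDITIONS", pred))
    return queries
```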
Incremental import (append mode)
sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table orders \
--incremental append \
--check-column order_id \
--last-value 100000 \
--target-dir /data/retail/orders
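Append mode is essentially high-water-mark bookkeeping: only rows whose check column exceeds the stored last-value are imported, and the new maximum is recorded for the next run. A minimal sketch of that logic (rows are modeled as plain dicts):

```python
def incremental_append(rows, check_col, last_value):
    """Return the rows newer than last_value plus the new high-water
    mark, mirroring the bookkeeping behind --incremental append."""
    fresh = [r for r in rows if r[check_col] > last_value]
    new_last = max((r[check_col] for r in fresh), default=last_value)
    return fresh, new_last
```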
Incremental import (lastmodified mode)
sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table orders \
--incremental lastmodified \
--check-column updated_at \
--last-value "2026-04-28 00:00:00" \
--target-dir /data/retail/orders \
--merge-key order_id
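In lastmodified mode, re-imported rows overlap previously imported ones, which is why --merge-key is needed: after the import, Sqoop runs a merge job that keeps one record per key, preferring the newer timestamp. A simplified sketch of that merge (rows again modeled as dicts):

```python
def merge_on_key(existing, updates, key, ts_col):
    """Keep one record per key, preferring the row with the newer
    timestamp — the effect of Sqoop's --merge-key post-processing."""
    merged = {r[key]: r for r in existing}
    for r in updates:
        old = merged.get(r[key])
        if old is None or r[ts_col] >= old[ts_col]:
            merged[r[key]] = r
    return sorted(merged.values(), key=lambda r: r[key])
```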
Import directly into Hive
sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table customers \
--hive-import \
--hive-overwrite \
--hive-database retail \
--hive-table customers \
--num-mappers 4
Omit --create-hive-table for repeatable jobs: it makes the import fail if the Hive table already exists, while --hive-overwrite simply replaces the table's data on each run.
Import into HBase
sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table products \
--hbase-table products \
--column-family info \
--hbase-row-key product_id \
--num-mappers 4
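Sqoop's HBase import maps each source row to one HBase Put: the --hbase-row-key column becomes the row key, and every remaining column becomes a cell under the chosen column family. A rough sketch of that mapping (values are stored as strings, as Sqoop does by default):

```python
def row_to_put(row, row_key_col, column_family):
    """Convert one relational row into (row_key, {cf:qualifier: value}),
    roughly how Sqoop shapes rows for an HBase import."""
    key = str(row[row_key_col])
    cells = {f"{column_family}:{col}": str(val)
             for col, val in row.items() if col != row_key_col}
    return key, cells
```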
Exporting from HDFS to a Database
sqoop export \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table daily_summary \
--export-dir /output/daily-summary \
--input-fields-terminated-by ',' \
--num-mappers 4 \
--update-mode allowinsert \
--update-key summary_id
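With --update-mode allowinsert, Sqoop generates upsert statements where the database supports them; on MySQL this is roughly an INSERT ... ON DUPLICATE KEY UPDATE. A hedged sketch of the generated statement — not Sqoop's exact codegen; the table and column names are taken from the example above:

```python
def mysql_upsert_sql(table, columns, update_key):
    """Build a MySQL-style upsert similar to what Sqoop's export runs
    for --update-mode allowinsert (requires a unique key on update_key)."""
    placeholders = ", ".join(["?"] * len(columns))
    assignments = ", ".join(f"{c} = VALUES({c})"
                            for c in columns if c != update_key)
    return (f"INSERT INTO {table} ({', '.join(columns)}) "
            f"VALUES ({placeholders}) "
            f"ON DUPLICATE KEY UPDATE {assignments}")
```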
Saving Jobs for Reuse
Sqoop can save job definitions for scheduled execution:
# Save an import job
sqoop job --create daily-orders-import \
-- import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table orders \
--incremental append \
--check-column order_id \
--last-value 0 \
--target-dir /data/retail/orders
# List saved jobs
sqoop job --list
# Execute a saved job (last-value is updated automatically)
sqoop job --exec daily-orders-import
Performance Tips
| Setting | Recommendation |
|---|---|
| --num-mappers | Match to the DB connection pool limit; 4–16 is typical |
| --split-by | Use an indexed, uniformly distributed column (an integer primary key is ideal) |
| File format | Add --as-parquetfile or --as-avrodatafile; for ORC, import through --hcatalog-table with --hcatalog-storage-stanza "stored as orcfile" |
| Compression | Add --compress --compression-codec snappy |
| Fetch size | --fetch-size 10000 to tune how many rows each JDBC round trip fetches |
Supported Databases
| Database | JDBC Driver |
|---|---|
| MySQL | mysql-connector-j |
| PostgreSQL | postgresql JDBC |
| Oracle | ojdbc8 |
| SQL Server | mssql-jdbc |
| DB2 | db2jcc |
| Generic | Any JDBC-compliant driver |