Apache Sqoop

What Is Sqoop?

Apache Sqoop (SQL-to-Hadoop) is a command-line tool for bulk transfer of data between Apache Hadoop and relational databases. It generates MapReduce jobs that move the data with several parallel connections, achieving much higher throughput than a single JDBC connection.

Relational DB (MySQL, PostgreSQL, Oracle, SQL Server)

│ Sqoop Import ──────────────────────────► HDFS / Hive / HBase

│◄────────────────────────── Sqoop Export ── HDFS
Note: Sqoop 1 (the original CLI tool) is in maintenance-only mode, and the Apache Sqoop project was retired to the Apache Attic in 2021. Sqoop 2 redesigned the architecture with a server/client model but was never widely adopted. For new projects, consider Apache Spark with JDBC or Apache NiFi for richer pipeline features.

Installation

export SQOOP_HOME=/opt/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

# Copy your JDBC driver into Sqoop's lib directory
cp mysql-connector-j-8.x.jar $SQOOP_HOME/lib/

Importing from a Database

Import a full table to HDFS

sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table orders \
--target-dir /data/retail/orders \
--num-mappers 4 \
--fields-terminated-by ',' \
--lines-terminated-by '\n'
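With --num-mappers 4 and no --split-by, Sqoop queries the primary key's MIN and MAX and divides that range into four slices, one per mapper. A minimal shell sketch of the split arithmetic (the min/max values here are made up; real Sqoop reads them from the database):

```shell
#!/bin/sh
# Illustrative only: how Sqoop divides a key range among mappers.
MIN=1          # assumed SELECT MIN(order_id) result
MAX=100000     # assumed SELECT MAX(order_id) result
MAPPERS=4

RANGE=$(( (MAX - MIN + 1) / MAPPERS ))
lo=$MIN
i=1
while [ "$i" -le "$MAPPERS" ]; do
  if [ "$i" -eq "$MAPPERS" ]; then hi=$MAX; else hi=$(( lo + RANGE - 1 )); fi
  echo "mapper $i: WHERE order_id >= $lo AND order_id <= $hi"
  lo=$(( hi + 1 ))
  i=$(( i + 1 ))
done
# → mapper 1: WHERE order_id >= 1 AND order_id <= 25000
#   ...
#   mapper 4: WHERE order_id >= 75001 AND order_id <= 100000
```

This is why a skewed or sparse key gives unbalanced mappers: the slices are equal in key range, not in row count.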

Import with a free-form query (partial import)

sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--query "SELECT * FROM orders WHERE dt = '2026-04-29' AND \$CONDITIONS" \
--split-by order_id \
--target-dir /data/retail/orders-daily \
--num-mappers 8

$CONDITIONS is a placeholder that Sqoop replaces with each mapper's split predicate — it is required whenever --query is used. The backslash keeps the shell from expanding it inside the double-quoted query.
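To see the substitution concretely: every mapper runs the same query with $CONDITIONS swapped for its own slice of the --split-by range. A rough illustration with sed (the split predicate is invented):

```shell
# Illustrative only: how each mapper's predicate replaces $CONDITIONS.
QUERY="SELECT * FROM orders WHERE dt = '2026-04-29' AND \$CONDITIONS"
SPLIT="order_id >= 1 AND order_id < 12500"   # assumed mapper-0 slice
echo "$QUERY" | sed "s/\\\$CONDITIONS/$SPLIT/"
# → SELECT * FROM orders WHERE dt = '2026-04-29' AND order_id >= 1 AND order_id < 12500
```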

Incremental import (append mode)

sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table orders \
--incremental append \
--check-column order_id \
--last-value 100000 \
--target-dir /data/retail/orders
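When you run incremental imports by hand like this, you must persist the high-water mark between runs yourself (a saved sqoop job, shown later, does this bookkeeping automatically). A minimal sketch of manual tracking — the state-file path and values are assumptions:

```shell
# Illustrative only: persisting the incremental high-water mark by hand.
STATE=/tmp/orders.last-value            # assumed state-file location
LAST=$(cat "$STATE" 2>/dev/null || echo 0)
echo "next run would pass: --incremental append --check-column order_id --last-value $LAST"
# After a successful import, record the new maximum order_id (assumed value):
echo 100000 > "$STATE"
```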

Incremental import (lastmodified mode)

sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table orders \
--incremental lastmodified \
--check-column updated_at \
--last-value "2026-04-28 00:00:00" \
--target-dir /data/retail/orders \
--merge-key order_id
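In lastmodified mode, --merge-key tells Sqoop to merge the new extract into the existing dataset, keeping one row per key with the newer version winning. The row-level effect can be sketched with awk over CSV (the order data is invented):

```shell
# Illustrative only: the row-level effect of --merge-key order_id.
# Rows are order_id,updated_at,status (invented data).
printf '%s\n' '1,2026-04-27,shipped' '2,2026-04-27,pending' > /tmp/old.csv
printf '%s\n' '2,2026-04-28,shipped' '3,2026-04-28,pending' > /tmp/new.csv
# Newer extract wins on key collision: read new.csv first, keep first row per key.
awk -F, '!seen[$1]++' /tmp/new.csv /tmp/old.csv | sort -t, -k1,1n
# → 1,2026-04-27,shipped
#   2,2026-04-28,shipped
#   3,2026-04-28,pending
```

Order 2 appears once, with its updated status; unchanged order 1 and new order 3 pass through.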

Import directly into Hive

sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table customers \
--hive-import \
--hive-overwrite \
--hive-database retail \
--hive-table customers \
--num-mappers 4

--hive-import creates the Hive table automatically if it does not exist. Adding --create-hive-table would make the job fail on reruns once the table exists, which defeats --hive-overwrite, so it is omitted here.

Import into HBase

sqoop import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table products \
--hbase-table products \
--column-family info \
--hbase-row-key product_id \
--num-mappers 4

Exporting from HDFS to a Database

sqoop export \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table daily_summary \
--export-dir /output/daily-summary \
--input-fields-terminated-by ',' \
--num-mappers 4 \
--update-mode allowinsert \
--update-key summary_id

Saving Jobs for Reuse

Sqoop can save job definitions for scheduled execution:

# Save an import job
sqoop job --create daily-orders-import \
-- import \
--connect jdbc:mysql://db-host:3306/retail_db \
--username sqoop_user \
--password-file /secure/sqoop.pwd \
--table orders \
--incremental append \
--check-column order_id \
--last-value 0 \
--target-dir /data/retail/orders

# List saved jobs
sqoop job --list

# Execute a saved job (last-value is updated automatically)
sqoop job --exec daily-orders-import
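A saved job pairs naturally with cron, since the job metastore carries --last-value forward between runs. A sketch of a crontab entry — the install path matches the SQOOP_HOME used above, but the log location is an assumption:

```shell
# Illustrative only: run the saved incremental job nightly at 02:00.
cat <<'EOF'
0 2 * * * /opt/sqoop/bin/sqoop job --exec daily-orders-import >> /var/log/sqoop/daily-orders.log 2>&1
EOF
```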

Performance Tips

Setting          Recommendation
--num-mappers    Match to the database's connection limit; 4–16 is typical
--split-by       Use an indexed, uniformly distributed column (e.g. an integer primary key)
File format      Add --as-parquetfile (or --as-avrodatafile) for Hive imports; Sqoop 1 has no --as-orcfile (ORC requires the HCatalog integration)
Compression      Add --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec
Fetch size       --fetch-size 10000 sets how many rows each JDBC round trip fetches

Supported Databases

Database      JDBC Driver
MySQL         mysql-connector-j
PostgreSQL    postgresql JDBC driver
Oracle        ojdbc8
SQL Server    mssql-jdbc
DB2           db2jcc
Generic       any JDBC-compliant driver