Apache Pig
What Is Pig?
Apache Pig is a high-level data flow scripting platform for Hadoop. Instead of writing MapReduce jobs in Java, you write scripts in Pig Latin — a dataflow language where you describe transformations step by step. Pig compiles scripts into sequences of MapReduce jobs automatically.
Pig is particularly well-suited for:
- ETL pipelines — filtering, transforming, joining large datasets
- Exploratory data analysis — quick iteration without compiling Java
- Complex multi-step transformations that would require many MapReduce jobs
Pig Latin Script
│
▼
Pig Compiler
│
▼
Logical Plan (optimized)
│
▼
MapReduce Jobs → HDFS
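You can inspect the plans Pig builds at each of these stages with the `EXPLAIN` operator. A minimal sketch (the relation name `totals` is illustrative and assumes a relation defined earlier in the script):

```pig
-- Print the logical, physical, and MapReduce plans for a relation
EXPLAIN totals;
```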
Pig vs Hive
| | Pig | Hive |
|---|---|---|
| Language | Pig Latin (procedural) | HiveQL (declarative SQL) |
| Best for | Data pipelines, ETL | Ad-hoc queries, reporting |
| Schema | Optional (schema-on-read) | Required at table creation |
| Users | Engineers/developers | Analysts familiar with SQL |
| Output | Any HDFS path | Tables in Metastore |
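To make the procedural-vs-declarative distinction concrete, here is a sketch of the same aggregation in both styles (table and field names are illustrative):

```pig
-- Pig Latin: each intermediate step is named and built on explicitly
sales   = LOAD '/data/sales' AS (region:chararray, amount:double);
grouped = GROUP sales BY region;
totals  = FOREACH grouped GENERATE group AS region, SUM(sales.amount);

-- Equivalent HiveQL, as a single declarative statement:
--   SELECT region, SUM(amount) FROM sales GROUP BY region;
```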
Running Pig
# Interactive Grunt shell (local mode — no Hadoop needed)
pig -x local
# Interactive Grunt shell (MapReduce mode)
pig
# Run a script in MapReduce mode
pig -f my_script.pig
# Run a script in local mode
pig -x local -f my_script.pig
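Scripts are commonly parameterized rather than hard-coded; Pig supports parameter substitution via `-param`. A sketch (path and parameter name are illustrative):

```pig
-- Invoke with: pig -param DATE=2024-01-01 -f daily_report.pig
logs = LOAD '/data/logs/$DATE' USING PigStorage('\t')
       AS (ip:chararray, url:chararray, status:int);
```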
Pig Latin Basics
Load and inspect data
-- Load tab-separated web log data
logs = LOAD '/data/logs/web/' USING PigStorage('\t')
AS (ip:chararray, ts:chararray, method:chararray,
url:chararray, status:int, bytes:long);
-- Show schema
DESCRIBE logs;
-- Preview first 10 records
raw = LIMIT logs 10;
DUMP raw;
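For exploring a large input, `SAMPLE` often gives a more representative preview than `LIMIT`, which just takes whatever records come first. A sketch building on the `logs` relation above:

```pig
-- Take roughly a 1% random sample of the loaded logs
sampled = SAMPLE logs 0.01;
DUMP sampled;
```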
Filter, transform, group
-- Filter only HTTP 200 responses
ok_logs = FILTER logs BY status == 200;
-- Extract just the URL and bytes
urls = FOREACH ok_logs GENERATE url, bytes;
-- Group by URL and compute total bytes
grouped = GROUP urls BY url;
totals = FOREACH grouped GENERATE
group AS url,
SUM(urls.bytes) AS total_bytes,
COUNT(urls) AS hit_count;
-- Sort by total bytes descending
sorted = ORDER totals BY total_bytes DESC;
-- Store result back to HDFS
STORE sorted INTO '/output/url-totals' USING PigStorage(',');
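Because every relation is named, the equivalent of SQL's `HAVING` is simply another `FILTER` applied after aggregation. A sketch building on the `totals` relation above (threshold and output path are illustrative):

```pig
-- Keep only URLs with a meaningful amount of traffic
popular = FILTER totals BY hit_count > 100;
STORE popular INTO '/output/popular-urls' USING PigStorage(',');
```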
Join two datasets
orders = LOAD '/data/orders/' USING PigStorage(',')
AS (order_id:long, customer_id:long, amount:double, dt:chararray);
customers = LOAD '/data/customers/' USING PigStorage(',')
AS (customer_id:long, name:chararray, region:chararray);
-- Inner join
joined = JOIN orders BY customer_id, customers BY customer_id;
result = FOREACH joined GENERATE
orders::order_id AS order_id,
customers::name AS customer_name,
orders::amount AS amount,
orders::dt AS date;
STORE result INTO '/output/orders-with-names';
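When one side of a join is small enough to fit in memory, Pig's fragment-replicate join avoids the reduce phase entirely by broadcasting the small relation to every mapper. A sketch (note the small relation must be listed last):

```pig
-- Map-side join: customers is loaded into memory on each mapper
joined_fast = JOIN orders BY customer_id, customers BY customer_id USING 'replicated';
```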
Flatten bags and tuples
-- Group all orders per customer
by_customer = GROUP orders BY customer_id;
-- FLATTEN expands bags into individual rows
flat = FOREACH by_customer {
sorted_orders = ORDER orders BY amount DESC;
top3 = LIMIT sorted_orders 3;
GENERATE group AS customer_id, FLATTEN(top3);
};
DUMP flat;
User-Defined Functions (UDFs)
When built-in functions aren't enough, write a Java UDF:
// src/main/java/com/example/pig/UpperCase.java
package com.example.pig;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Guard against empty tuples and null fields to avoid NPEs
        if (input == null || input.size() == 0 || input.get(0) == null) return null;
        return ((String) input.get(0)).toUpperCase();
    }
}
Register and use in your Pig script:
REGISTER /opt/udfs/myudfs.jar;
DEFINE UpperCase com.example.pig.UpperCase();
data = LOAD '/data/names' AS (name:chararray);
upper = FOREACH data GENERATE UpperCase(name);
DUMP upper;
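UDFs don't have to be Java: Pig can also register Python functions executed via Jython. A minimal sketch (file name, function, and schema are illustrative):

```pig
-- myfuncs.py contains:
--   @outputSchema("upper:chararray")
--   def to_upper(s):
--       return s.upper() if s else None
REGISTER 'myfuncs.py' USING jython AS myfuncs;
upper = FOREACH data GENERATE myfuncs.to_upper(name);
```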
Useful Built-in Functions
-- String functions
FOREACH data GENERATE LOWER(name), TRIM(email), SUBSTRING(url, 0, 50);
-- Math
FOREACH sales GENERATE ROUND(amount), ABS(delta), MAX(values);
-- Date/time (pattern strings use Joda-Time format syntax)
FOREACH logs GENERATE ToDate(ts, 'yyyy-MM-dd HH:mm:ss') AS dt;
FOREACH logs GENERATE GetYear(dt), GetMonth(dt);
-- Bag/Tuple
FOREACH grouped GENERATE COUNT(items), SUM(items.amount), AVG(items.score);
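The classic word count shows how `TOKENIZE` and `FLATTEN` combine with the aggregates above; a sketch with an illustrative input path:

```pig
lines  = LOAD '/data/text' AS (line:chararray);
-- TOKENIZE splits each line into a bag of words; FLATTEN turns the bag into rows
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT(words) AS n;
```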
pig.properties Key Settings
# Default execution mode
exectype=mapreduce
# Log level
pig.logfile=/var/log/pig/pig.log
# Parallelism hint (number of reducers)
default_parallel=50
# Combiner optimization (false = combiner enabled; true disables it)
pig.exec.nocombiner=false
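`default_parallel` sets a script-wide default reducer count; individual operators can override it with the `PARALLEL` clause. A sketch:

```pig
-- Use 20 reducers for this GROUP, regardless of default_parallel
grouped = GROUP logs BY url PARALLEL 20;
```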
When to Use Pig Today
Pig adoption has declined as Spark and Hive-on-Tez became dominant. However, Pig remains useful for:
- Legacy ETL pipelines already written in Pig Latin
- Teams comfortable with procedural data flow style
- Quick transformations without setting up a Hive Metastore
For new projects, consider Apache Spark (Python/Scala API) or Hive on Tez as modern alternatives.