
Apache Pig

What Is Pig?

Apache Pig is a high-level dataflow scripting platform for Hadoop. Instead of writing MapReduce jobs in Java, you write scripts in Pig Latin, a dataflow language where you describe transformations step by step. Pig compiles scripts into sequences of MapReduce jobs automatically.

Pig is particularly well-suited for:

  • ETL pipelines — filtering, transforming, joining large datasets
  • Exploratory data analysis — quick iteration without compiling Java
  • Complex multi-step transformations that would require many MapReduce jobs
Pig Latin Script
       │
       ▼
Pig Compiler
  ├── Logical Plan (optimized)
  ├── Physical Plan
  └── MapReduce Plan
       │
       ▼
MapReduce Jobs → HDFS

Pig vs Hive

              Pig                         Hive
Language      Pig Latin (procedural)      HiveQL (declarative SQL)
Best for      Data pipelines, ETL         Ad-hoc queries, reporting
Schema        Optional (schema-on-read)   Required at table creation
Users         Engineers/developers        Analysts familiar with SQL
Output        Any HDFS path               Tables in Metastore
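
To make the procedural vs. declarative contrast concrete, here is the same aggregation in both styles (a sketch; the relation and column names are hypothetical):

-- HiveQL (declarative): one statement; the engine picks the execution order
--   SELECT region, SUM(amount) FROM orders GROUP BY region;

-- Pig Latin (procedural): each step of the dataflow is spelled out
orders  = LOAD '/data/orders' USING PigStorage(',')
          AS (order_id:long, region:chararray, amount:double);
grouped = GROUP orders BY region;
totals  = FOREACH grouped GENERATE group AS region, SUM(orders.amount) AS total;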

Running Pig

# Interactive Grunt shell (local mode — no Hadoop needed)
pig -x local

# Interactive Grunt shell (MapReduce mode)
pig

# Run a script in MapReduce mode
pig -f my_script.pig

# Run a script in local mode
pig -x local -f my_script.pig
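
Scripts are usually parameterized rather than hard-coded. Pig's preprocessor substitutes $NAME placeholders supplied with -param on the command line; a minimal sketch, assuming a hypothetical daily.pig that loads a dated directory:

-- daily.pig: run with  pig -param RUN_DATE=2024-01-01 -f daily.pig
%default RUN_DATE '1970-01-01';   -- fallback when -param is omitted
logs = LOAD '/data/logs/$RUN_DATE' USING PigStorage('\t')
       AS (ip:chararray, url:chararray);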

Pig Latin Basics

Load and inspect data

-- Load tab-separated web log data
logs = LOAD '/data/logs/web/' USING PigStorage('\t')
       AS (ip:chararray, ts:chararray, method:chararray,
           url:chararray, status:int, bytes:long);

-- Show schema
DESCRIBE logs;

-- Preview the first 10 records
preview = LIMIT logs 10;
DUMP preview;
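
For a quick sanity check without running the whole pipeline, ILLUSTRATE traces a small sample of records through every statement feeding an alias and shows how the data changes shape at each step:

-- Show sample input/output for each statement leading to this alias
ILLUSTRATE logs;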

Filter, transform, group

-- Filter only HTTP 200 responses
ok_logs = FILTER logs BY status == 200;

-- Extract just the URL and bytes
urls = FOREACH ok_logs GENERATE url, bytes;

-- Group by URL and compute total bytes
grouped = GROUP urls BY url;
totals = FOREACH grouped GENERATE
         group AS url,
         SUM(urls.bytes) AS total_bytes,
         COUNT(urls) AS hit_count;

-- Sort by total bytes descending
sorted = ORDER totals BY total_bytes DESC;

-- Store result back to HDFS
STORE sorted INTO '/output/url-totals' USING PigStorage(',');
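
To see what the compiler diagram above produces for this pipeline, ask Pig to print the plans for any alias; nothing is executed:

-- Print the logical, physical, and MapReduce plans without running a job
EXPLAIN sorted;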

Join two datasets

orders    = LOAD '/data/orders/'    USING PigStorage(',')
            AS (order_id:long, customer_id:long, amount:double, dt:chararray);

customers = LOAD '/data/customers/' USING PigStorage(',')
            AS (customer_id:long, name:chararray, region:chararray);

-- Inner join
joined = JOIN orders BY customer_id, customers BY customer_id;

result = FOREACH joined GENERATE
         orders::order_id AS order_id,
         customers::name  AS customer_name,
         orders::amount   AS amount,
         orders::dt       AS date;

STORE result INTO '/output/orders-with-names';
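
JOIN defaults to an inner join. Pig also supports outer joins, and a 'replicated' hint that performs a map-side join by loading the smaller relation into memory on every task (a sketch reusing the relations above; 'replicated' requires the second input to fit in memory):

-- Keep orders with no matching customer (customer fields become null)
with_missing = JOIN orders BY customer_id LEFT OUTER, customers BY customer_id;

-- Fragment-replicate join: customers must be small enough to fit in memory
fast_join = JOIN orders BY customer_id, customers BY customer_id USING 'replicated';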

Flatten bags and tuples

-- Group all orders per customer
by_customer = GROUP orders BY customer_id;

-- FLATTEN expands bags into individual rows
flat = FOREACH by_customer {
    sorted_orders = ORDER orders BY amount DESC;
    top3 = LIMIT sorted_orders 3;
    GENERATE group AS customer_id, FLATTEN(top3);
};
DUMP flat;
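
FLATTEN is also the standard idiom for turning one field into many rows, as in the canonical word count (a sketch; the input path is hypothetical):

lines   = LOAD '/data/text' AS (line:chararray);
-- TOKENIZE yields a bag of words; FLATTEN emits one row per word
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
by_word = GROUP words BY word;
counts  = FOREACH by_word GENERATE group AS word, COUNT(words) AS n;
DUMP counts;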

User-Defined Functions (UDFs)

When built-in functions aren't enough, write a Java UDF:

// src/main/java/com/example/pig/UpperCase.java
package com.example.pig;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        Object field = input.get(0);
        return field == null ? null : ((String) field).toUpperCase();
    }
}

Register and use in your Pig script:

REGISTER /opt/udfs/myudfs.jar;
DEFINE UpperCase com.example.pig.UpperCase();

data = LOAD '/data/names' AS (name:chararray);
upper = FOREACH data GENERATE UpperCase(name);
DUMP upper;
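
Java is not the only option: Pig can also register scripting UDFs, for example Jython, directly from a script file (a sketch; the file name and namespace are hypothetical):

-- Functions defined in myudfs.py become available as pyudfs.<name>
REGISTER '/opt/udfs/myudfs.py' USING jython AS pyudfs;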

Useful Built-in Functions

-- String functions
cleaned = FOREACH data GENERATE LOWER(name), TRIM(email), SUBSTRING(url, 0, 50);

-- Math (per-row functions; MAX and MIN are aggregates, see bags below)
rounded = FOREACH sales GENERATE ROUND(amount), ABS(delta);

-- Date/time (format strings use Joda-Time patterns)
dated = FOREACH logs GENERATE ToDate(ts, 'yyyy-MM-dd HH:mm:ss') AS dt;
parts = FOREACH dated GENERATE GetYear(dt), GetMonth(dt);

-- Bag/Tuple (aggregates over grouped data)
stats = FOREACH grouped GENERATE COUNT(items), SUM(items.amount), AVG(items.score), MAX(items.score);

pig.properties Key Settings

# Default execution mode
exectype=mapreduce

# Log file location
pig.logfile=/var/log/pig/pig.log

# Parallelism hint (number of reducers)
default_parallel=50

# Keep the combiner optimization enabled (true disables it)
pig.exec.nocombiner=false
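
The same knobs can also be set per script with SET, which overrides pig.properties for that run:

-- Override properties for this script only
SET default_parallel 20;
SET job.name 'url-totals-daily';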

When to Use Pig Today

Pig adoption has declined as Apache Spark and Hive on Tez have become dominant. However, Pig remains useful for:

  • Legacy ETL pipelines already written in Pig Latin
  • Teams comfortable with procedural data flow style
  • Quick transformations without setting up a Hive Metastore

For new projects, consider Apache Spark (Python/Scala API) or Hive on Tez as modern alternatives.