Apache Pig
What Is Pig?
Apache Pig is a high-level data flow scripting platform for Hadoop. Instead of writing MapReduce jobs in Java, you write scripts in Pig Latin — a dataflow language where you describe transformations step by step. Pig compiles scripts into sequences of MapReduce jobs automatically.
Pig is particularly well-suited for:
- ETL pipelines — filtering, transforming, joining large datasets
- Exploratory data analysis — quick iteration without compiling Java
- Complex multi-step transformations that would require many MapReduce jobs
Pig Latin Script
│
▼
Pig Compiler
│
▼
Logical Plan (optimized)
│
▼
MapReduce Jobs → HDFS
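You can inspect the plans Pig builds at each of these stages with the `EXPLAIN` operator. A minimal sketch (the relation name `totals` is illustrative and assumes a relation defined earlier in the script):

```pig
-- Print the logical, physical, and MapReduce plans for a relation
EXPLAIN totals;
```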
Pig vs Hive
| | Pig | Hive |
|---|---|---|
| Language | Pig Latin (procedural) | HiveQL (declarative SQL) |
| Best for | Data pipelines, ETL | Ad-hoc queries, reporting |
| Schema | Optional (schema-on-read) | Required at table creation |
| Users | Engineers/developers | Analysts familiar with SQL |
| Output | Any HDFS path | Tables in Metastore |
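To make the procedural-vs-declarative distinction concrete, here is a sketch of the same aggregation in both styles (table and field names are illustrative):

```pig
-- Pig Latin: each intermediate step is named and built on explicitly
sales   = LOAD '/data/sales' AS (region:chararray, amount:double);
grouped = GROUP sales BY region;
totals  = FOREACH grouped GENERATE group AS region, SUM(sales.amount);

-- Equivalent HiveQL, as a single declarative statement:
--   SELECT region, SUM(amount) FROM sales GROUP BY region;
```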
Running Pig
# Interactive Grunt shell (local mode — no Hadoop needed)
pig -x local
# Interactive Grunt shell (MapReduce mode)
pig
# Run a script in MapReduce mode
pig -f my_script.pig
# Run a script in local mode
pig -x local -f my_script.pig
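Scripts are commonly parameterized rather than hard-coded; Pig supports parameter substitution via `-param`. A sketch (path and parameter name are illustrative):

```pig
-- Invoke with: pig -param DATE=2024-01-01 -f daily_report.pig
logs = LOAD '/data/logs/$DATE' USING PigStorage('\t')
       AS (ip:chararray, url:chararray, status:int);
```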
Pig Latin Basics
Load and inspect data
-- Load tab-separated web log data
logs = LOAD '/data/logs/web/' USING PigStorage('\t')
AS (ip:chararray, ts:chararray, method:chararray,
url:chararray, status:int, bytes:long);
-- Show schema
DESCRIBE logs;
-- Preview first 10 records
raw = LIMIT logs 10;
DUMP raw;
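For exploring a large input, `SAMPLE` often gives a more representative preview than `LIMIT`, which just takes whatever records come first. A sketch building on the `logs` relation above:

```pig
-- Take roughly a 1% random sample of the loaded logs
sampled = SAMPLE logs 0.01;
DUMP sampled;
```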
Filter, transform, group
-- Filter only HTTP 200 responses
ok_logs = FILTER logs BY status == 200;
-- Extract just the URL and bytes
urls = FOREACH ok_logs GENERATE url, bytes;
-- Group by URL and compute total bytes
grouped = GROUP urls BY url;
totals = FOREACH grouped GENERATE
group AS url,
SUM(urls.bytes) AS total_bytes,
COUNT(urls) AS hit_count;
-- Sort by total bytes descending
sorted = ORDER totals BY total_bytes DESC;
-- Store result back to HDFS
STORE sorted INTO '/output/url-totals' USING PigStorage(',');
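Because every relation is named, the equivalent of SQL's `HAVING` is simply another `FILTER` applied after aggregation. A sketch building on the `totals` relation above (threshold and output path are illustrative):

```pig
-- Keep only URLs with a meaningful amount of traffic
popular = FILTER totals BY hit_count > 100;
STORE popular INTO '/output/popular-urls' USING PigStorage(',');
```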
Join two datasets
orders = LOAD '/data/orders/' USING PigStorage(',')
AS (order_id:long, customer_id:long, amount:double, dt:chararray);
customers = LOAD '/data/customers/' USING PigStorage(',')
AS (customer_id:long, name:chararray, region:chararray);
-- Inner join
joined = JOIN orders BY customer_id, customers BY customer_id;
result = FOREACH joined GENERATE
orders::order_id AS order_id,
customers::name AS customer_name,
orders::amount AS amount,
orders::dt AS date;
STORE result INTO '/output/orders-with-names';
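When one side of a join is small enough to fit in memory, Pig's fragment-replicate join avoids the reduce phase entirely by broadcasting the small relation to every mapper. A sketch (note the small relation must be listed last):

```pig
-- Map-side join: customers is loaded into memory on each mapper
joined_fast = JOIN orders BY customer_id, customers BY customer_id USING 'replicated';
```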
Flatten bags and tuples
-- Group all orders per customer
by_customer = GROUP orders BY customer_id;
-- FLATTEN expands bags into individual rows
flat = FOREACH by_customer {
sorted_orders = ORDER orders BY amount DESC;
top3 = LIMIT sorted_orders 3;
GENERATE group AS customer_id, FLATTEN(top3);
};
DUMP flat;
User-Defined Functions (UDFs)
When built-in functions aren't enough, write a Java UDF:
// src/main/java/com/example/pig/UpperCase.java
package com.example.pig;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Guard against empty tuples and null fields to avoid NPEs
        if (input == null || input.size() == 0 || input.get(0) == null) return null;
        return ((String) input.get(0)).toUpperCase();
    }
}
Register and use in your Pig script:
REGISTER /opt/udfs/myudfs.jar;
DEFINE UpperCase com.example.pig.UpperCase();
data = LOAD '/data/names' AS (name:chararray);
upper = FOREACH data GENERATE UpperCase(name);
DUMP upper;
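UDFs don't have to be Java: Pig can also register Python functions executed via Jython. A minimal sketch (file name, function, and schema are illustrative):

```pig
-- myfuncs.py contains:
--   @outputSchema("upper:chararray")
--   def to_upper(s):
--       return s.upper() if s else None
REGISTER 'myfuncs.py' USING jython AS myfuncs;
upper = FOREACH data GENERATE myfuncs.to_upper(name);
```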
Useful Built-in Functions
-- String functions
FOREACH data GENERATE LOWER(name), TRIM(email), SUBSTRING(url, 0, 50);
-- Math
FOREACH sales GENERATE ROUND(amount), ABS(delta), MAX(values);
-- Date/time (pattern strings use Joda-Time format syntax)
FOREACH logs GENERATE ToDate(ts, 'yyyy-MM-dd HH:mm:ss') AS dt;
FOREACH logs GENERATE GetYear(dt), GetMonth(dt);
-- Bag/Tuple
FOREACH grouped GENERATE COUNT(items), SUM(items.amount), AVG(items.score);
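The classic word count shows how `TOKENIZE` and `FLATTEN` combine with the aggregates above; a sketch with an illustrative input path:

```pig
lines  = LOAD '/data/text' AS (line:chararray);
-- TOKENIZE splits each line into a bag of words; FLATTEN turns the bag into rows
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT(words) AS n;
```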
pig.properties Key Settings
# Default execution mode
exectype=mapreduce
# Log level
pig.logfile=/var/log/pig/pig.log
# Parallelism hint (number of reducers)
default_parallel=50
# Combiner optimization (false = combiner enabled; true disables it)
pig.exec.nocombiner=false
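`default_parallel` sets a script-wide default reducer count; individual operators can override it with the `PARALLEL` clause. A sketch:

```pig
-- Use 20 reducers for this GROUP, regardless of default_parallel
grouped = GROUP logs BY url PARALLEL 20;
```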
When to Use Pig Today
Pig adoption has declined as Spark and Hive-on-Tez became dominant. However, Pig remains useful for:
- Legacy ETL pipelines already written in Pig Latin
- Teams comfortable with procedural data flow style
- Quick transformations without setting up a Hive Metastore
For new projects, consider Apache Spark (Python/Scala API) or Hive on Tez as modern alternatives.