Components of Pig

There are 3 main components of Pig:

  1. Pig Latin

  2. Execution

  3. Compiler

1. Pig Latin

Pig Latin is Pig's SQL-like high-level data flow language.

A Pig Latin program can be viewed as a directed acyclic graph where each node represents an operation.

Data Types in Pig Latin

Pig Latin supports the following data types:

  • Simple types: int, long, float, double, chararray, bytearray, boolean, datetime, biginteger, bigdecimal

  • Complex types: tuple, bag, map

Pig Latin statements work with relations. A relation can be defined as follows:

  • A relation is a bag (more specifically, an outer bag)

  • A bag is a collection of tuples

  • A tuple is an ordered set of fields

  • A field is a piece of data

Quick Notes on Pig Latin Data Types:

  • A single tuple can hold multiple types of data

  • We can nest bags inside tuples, tuples inside bags, tuples inside tuples, and bags inside bags (see the example after this list)

  • In a map, keys must be of type chararray, while values can be of any data type
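For instance, a single (hypothetical) tuple can mix simple and complex types; the one below holds a chararray, a float, a bag of tuples, and a map:

(John, 3.8, {(CS101, A), (CS102, B)}, [city#Boston])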

Operations in Pig Latin

  • Arithmetic: +, -, *, /, %, etc., plus the FLATTEN operator (covered below)

  • Relational: LOAD, GROUP, FOREACH, JOIN, …

  • Diagnostic: DESCRIBE, DUMP, EXPLAIN, ILLUSTRATE

  • Eval: AVG, TOP, CONCAT, COUNT, …

  • Load/Store: TextLoader, PigStorage, …

  • System: cat, cd, ls, exec, …

  • UDF: User Defined Functions

Note: The largest use case of Pig is data pipelines. A common example is web companies bringing in logs from their web servers, cleansing the data, and precomputing common aggregates before loading it into their data warehouse.

The following are some commonly used operations in Pig Latin:

LOAD

Loads data from the file system.

Syntax: LOAD 'data' [USING function] [AS schema];
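For example, a minimal sketch that assumes a tab-delimited file students.txt with name and gpa columns (both the file and the field names are hypothetical):

students = LOAD 'students.txt' USING PigStorage('\t') AS (name:chararray, gpa:float);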

FOREACH

Generates data transformations based on columns of data.

Syntax: alias = FOREACH { block | nested_block };
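For example, assuming the hypothetical students relation loaded above (UPPER is a built-in eval function):

upper_names = FOREACH students GENERATE UPPER(name) AS name, gpa;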

FILTER

Selects tuples from a relation based on some condition.

Syntax: alias = FILTER alias BY expression;
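For example, keeping only the high-gpa tuples of the hypothetical students relation:

honor_roll = FILTER students BY gpa >= 3.5;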

GROUP

Groups the data in one or more relations.

Syntax: alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];

Note: The GROUP and COGROUP operators are identical. Both operators work with one or more relations. For readability, GROUP is used in statements involving one relation and COGROUP is used in statements involving two or more relations. We can COGROUP up to but no more than 127 relations at a time.
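For example, continuing with the hypothetical students relation:

by_gpa = GROUP students BY gpa;  -- one group per distinct gpa
everyone = GROUP students ALL;   -- a single group holding every tuple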

LIMIT

Limits the number of output tuples.

Syntax: alias = LIMIT alias n;
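For example, keeping at most ten tuples of the hypothetical students relation:

first_ten = LIMIT students 10;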

ORDER BY

Sorts a relation based on one or more fields.

Syntax: alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];

where * denotes a tuple and field_alias is a field in the relation "alias".
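For example, sorting the hypothetical students relation by gpa in descending order:

sorted = ORDER students BY gpa DESC;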

DUMP

Dumps or displays results to screen.

Syntax: DUMP alias;
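For example, printing the hypothetical sorted relation from above:

DUMP sorted;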

STORE

Stores or saves results to the file system.

Syntax: STORE alias INTO 'directory' [USING function];
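For example, writing the hypothetical sorted relation to a hypothetical output directory as comma-separated text:

STORE sorted INTO 'output/sorted_students' USING PigStorage(',');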

FLATTEN

The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. FLATTEN un-nests tuples as well as bags. The idea is the same, but the operation and result are different for each type of structure:

  • For tuples, FLATTEN substitutes the fields of a tuple in place of the tuple. For example, consider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, FLATTEN($1) will cause that tuple to become (a, b, c).

  • For bags, the situation becomes more complicated. When we un-nest a bag, we create new tuples. If we have a relation that is made up of tuples of the form ({(b,c),(d,e)}) and we apply GENERATE FLATTEN($0), we end up with two tuples, (b,c) and (d,e). When we remove a level of nesting in a bag, we can also cause a cross product to happen. For example, consider a relation that has a tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP operator. If we apply the expression GENERATE $0, FLATTEN($1) to this tuple, we will create new tuples: (a, b, c) and (a, d, e). See the sketch after this list.
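A short sketch of the bag case, assuming the hypothetical by_gpa relation from the GROUP example above (its fields are group and a bag named students):

-- each (group, {(name1, gpa1), (name2, gpa2)}) tuple
-- becomes (group, name1, gpa1) and (group, name2, gpa2)
flat = FOREACH by_gpa GENERATE group, FLATTEN(students);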

2. Execution

We can run Pig in various modes:

  • Interactive Mode

  • Batch Mode

  • Programmatically

Regardless of how we run Pig, there are two underlying execution modes:

  • Local Mode: To run Pig in Local mode, we need access to a single machine; all files are installed and run using our local host and file system. We must specify local mode using the -x flag (i.e. pig -x local)

  • MapReduce Mode: To run Pig in MapReduce mode, we need access to a Hadoop cluster and HDFS installation. MapReduce mode is the default mode; we can, but don't need to, specify it using the -x flag (i.e. pig OR pig -x mapreduce)

All three of the modes listed above support both the Local and MapReduce execution modes.

Interactive Mode

We can run Pig in Interactive mode using the Grunt shell.

We can invoke the Grunt shell using the "pig" command (as shown below) and then enter our Pig Latin statements and Pig commands interactively at the command line.

[root@quickstart /]# pig
...
grunt>

We can also use "pig -x local" OR "pig -x mapreduce" to invoke the Grunt shell in the respective modes.

Batch Mode

We can run Pig in Batch mode using Pig scripts (.pig files) and the "pig" command (in Local or MapReduce mode).

Pig scripts are basically Pig Latin statements and Pig commands in a single .pig file.

Note: Comments in Pig scripts are written using -- (for single line comments) and /* ... */ (for multi-line comments)
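For example, a minimal script sketch (the file and relation names are hypothetical):

/* sample.pig: load a file and print its contents */
students = LOAD 'students.txt' AS (name:chararray, gpa:float); -- a single-line comment
DUMP students;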

We can run Pig scripts using:

pig pig_script_name.pig

OR

pig -x local pig_script_name.pig

OR

pig -x mapreduce pig_script_name.pig

The third method is equivalent to the first one, since MapReduce mode is the default.

Programmatically

We can also run Pig programmatically from Java using the PigRunner class (the entry point used by the pig command) or the PigServer class (an embedded, JDBC-like interface).

Note: Running Pig from a host language in this way is referred to as Embedded Mode. Separately, Pig allows us to define our own functions (User Defined Functions, or UDFs) and use them in our Pig scripts.

3. Compiler

As discussed earlier, Pig compiles Pig Latin scripts into a series of MapReduce jobs that are then run on the Hadoop cluster.
