An Introduction to Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin.

Pig Latin abstracts programming away from the Java MapReduce idiom into a higher-level notation. In this respect it is similar to SQL, which plays the same role for relational databases.

Pig runs as an execution engine on top of Hadoop. It lets us write complex data transformations without knowing Java: Pig compiles Pig Latin scripts into MapReduce jobs and runs them on the Hadoop cluster.
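To give a feel for the kind of work Pig Latin hides, the sketch below spells out a word count as explicit map, shuffle and reduce phases in plain Python. It is only an analogy: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are made up for illustration, and this is not the code Pig generates.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases together."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Illustrative input, not taken from any real dataset.
counts = word_count(["pig runs on hadoop", "pig latin is high level"])
print(counts)
```

In Pig Latin, the same job is typically expressed as a handful of statements (load, tokenize, group, count), and Pig takes care of turning them into jobs like the one sketched here.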

Pig removes the need for users to tune Hadoop to their needs, and it insulates users from changes in Hadoop's interfaces.

Some Noteworthy Characteristics of Pig

  • Schema on Read: Relational (SQL) databases use schema on write. This means that the schema has to be defined before data is written to the database, so only data that fits that structure can be stored. Schema on read, on the other hand, means that the schema is defined on the fly while the data is being read. This allows Pig to work with both structured and unstructured data.

  • Lazy Evaluation: No value or operation is evaluated until the value or the transformed data is actually required. In Pig, the statements of a script are not executed until output is requested, for example with DUMP or STORE. This avoids repeated calculations and leaves room for optimization.
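Both characteristics can be illustrated with a small Python sketch. It is an analogy under stated assumptions, not Pig's implementation: the tab-separated sample data, the `SCHEMA` list, and the helpers `load` and `filter_older_than` are all hypothetical. The raw lines carry no schema of their own; one is applied only when they are read, and Python generators stand in for lazy evaluation.

```python
# Raw lines are stored as-is; no schema was declared when they were written.
RAW_LINES = ["alice\t34", "bob\t27", "carol\t41"]

# Hypothetical schema: (field name, converter) pairs chosen at read time.
SCHEMA = [("name", str), ("age", int)]

parsed_count = 0  # how many lines have actually been parsed so far


def load(lines, schema):
    """Schema on read: apply the schema to each line only as it is read."""
    global parsed_count
    for line in lines:
        parsed_count += 1
        fields = line.split("\t")
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def filter_older_than(records, min_age):
    """Lazy filter: returns a generator, so nothing is computed yet."""
    return (record for record in records if record["age"] > min_age)


# Building the pipeline only describes the work; no line is parsed here.
pipeline = filter_older_than(load(RAW_LINES, SCHEMA), 30)
evaluated_before = parsed_count

# Asking for the result is what triggers parsing and filtering.
result = [record["name"] for record in pipeline]
evaluated_after = parsed_count

print(evaluated_before, evaluated_after, result)
```

Swapping in a different `SCHEMA` would reinterpret the same raw lines without rewriting them, which is the practical benefit of deferring the schema to read time.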

The next section gives an overview of the components of Pig.
