Common Examples of MapReduce Jobs

WordCount

WordCount is the most common example of MapReduce.

Goal: Count the number of occurrences of each word in a file

Input to Mappers: Chunks of the input file

Mapper Output: (word, 1) pairs

Input to Reducers: (word, <1, 1, 1, ..., 1>) pairs

Operation at Reducer: Sum up the values for each word to get the count and output (word, count) pairs

Output of MapReduce Job: File containing (word, count) pairs

Filtering

Goal: Filter out records that are of no interest

Input to Mappers: Chunks of the input file

Mapper Output: Filtered records

Input to Reducer: Filtered records grouped by key

Output of MapReduce Job: Filtered records

We perform filtering at the mappers itself because the sort/shuffle phase of MapReduce is I/O heavy, and we want to reduce the dataset as much as possible in the map phase itself.

Basically, instead of emitting all (key, value) pairs, the mappers will emit only those pairs that meet a certain condition:

Map(k,v) {
if (f(k,v))  // check condition (filter)
{    
// do something
emit()
}
}

Top k

Goal: Select the top k records given any number of input records

Input to Mappers: Chunks of the input file (several records)

Mapper Output: Local top k records (each mapper will output top k records depending on its input records)

Input to Reducer: Top k records based on input records from mappers

Output of MapReduce Job: Top k records from the entire dataset

So, instead of analysing all n records at once, the reducer only has to process k*m records (where m is the number of mappers)

Binning

Goal: Move records into bins/categories, irrespective of the order of records

But partitioners do the exact same task! Why use binning? We use binning because the task of partitioners is I/O intensive.

In binning, the mappers receive chunks of the input file. The mappers itself categorize records into bins. There is no need for a reducer. We simply concatenate the intermediate results of the mappers to get the records in each bin.

Other Applications

MapReduce can also be used for Breadth First Search (BFS) and PageRank (Google's algorithm to rank websites in their search engine results). I have not covered their implementations in detail.

The next section lists the advantages and disadvantages of MapReduce programming.

Last updated