MapReduce in Hadoop

MapReduce:

MapReduce (MR) is the core processing component of Hadoop, designed to process huge volumes of data in parallel on clusters of commodity hardware. The model consists of two main tasks, Map and Reduce:

Map: Takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs).

Reduce: Takes the output of a map as its input and combines those data tuples into a smaller set of tuples.
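
As a rough illustration of the two tasks, here is a minimal word-count sketch in plain Python. It does not use the Hadoop API; the function names map_words and reduce_counts and the sample sentence are assumptions made up for this example, and the reduce step is simplified to receive all tuples at once.

```python
from collections import defaultdict


def map_words(line):
    """Map: break one line of text into (word, 1) tuples."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce: combine tuples that share a key into one (word, total) tuple per key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


# Hypothetical sample input, not taken from a real data set.
mapped = map_words("the quick fox jumps over the lazy dog the end")
reduced = reduce_counts(mapped)

print(len(mapped), "tuples in,", len(reduced), "tuples out")
print(reduced)
```

The map step emits one tuple per word, and the reduce step folds the tuples that share a key together, which is why its output is the smaller set.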

MapReduce Life Cycle:

A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job.
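
To make the flow of <key, value> pairs concrete, the sketch below follows pairs from job input to job output in plain Python. The "year,temperature" record format, the byte offsets used as input keys, and the names map_fn and reduce_fn are all assumptions for illustration; they are not part of the Hadoop API.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Input pair <offset, line> becomes an intermediate pair <year, temperature>."""
    year, temp = line.split(",")
    yield year, int(temp)


def reduce_fn(year: str, temps: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """All values for one key become an output pair <year, max temperature>."""
    yield year, max(temps)


# Hypothetical job input: <byte offset of the line, line text>.
input_pairs = [(0, "1950,22"), (8, "1950,31"), (16, "1951,27")]

# Map every input pair to intermediate pairs.
intermediate = [kv for offset, line in input_pairs for kv in map_fn(offset, line)]

# Collect the values that belong to each key.
grouped: Dict[str, List[int]] = {}
for year, temp in intermediate:
    grouped.setdefault(year, []).append(temp)

# Reduce each key, in sorted key order, to the output pairs of the job.
output_pairs = [kv for year in sorted(grouped) for kv in reduce_fn(year, grouped[year])]
print(output_pairs)  # [('1950', 31), ('1951', 27)]
```

Note that the key and value types may differ at each stage: the input keys here are integers, while the intermediate and output keys are strings.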

MapReduce Programming Model:

  • The input data is split into independent chunks, and the map tasks process these chunks in parallel, emitting key/value pairs.
  • The output of the map tasks is sorted by key.
  • The sorted output is the input to the reduce tasks, which produce the final output of the job that is returned to the client.
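
The three steps above can be sketched as a tiny single-process pipeline. This is only a simulation under stated assumptions: run_job, sales_map, sales_reduce, and the sample sales records are hypothetical names and data, and the chunks are processed one after another here, whereas Hadoop would run the map tasks in parallel across machines.

```python
from itertools import groupby
from operator import itemgetter


def run_job(chunks, map_fn, reduce_fn):
    """Simulate map, sort, and reduce in one process (illustrative only)."""
    # Step 1: each chunk is handled by its own map task. The chunks are
    # independent, so a real framework could process them in parallel.
    mapped = []
    for chunk in chunks:
        for record in chunk:
            mapped.extend(map_fn(record))

    # Step 2: sort the map output by key.
    mapped.sort(key=itemgetter(0))

    # Step 3: feed each key and its values to the reduce function.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.append(reduce_fn(key, [value for _, value in group]))
    return results


def sales_map(record):
    """Emit one (product, amount) pair per sales record."""
    product, amount = record
    return [(product, amount)]


def sales_reduce(product, amounts):
    """Sum all amounts recorded for one product."""
    return (product, sum(amounts))


# Hypothetical input, already split into two independent chunks.
chunks = [
    [("pen", 3), ("ink", 5)],
    [("pen", 4), ("cap", 1)],
]
print(run_job(chunks, sales_map, sales_reduce))  # [('cap', 1), ('ink', 5), ('pen', 7)]
```

Sorting before reducing is what lets each reduce call see all the values for its key together, even though the two "pen" records started out in different chunks.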