Saturday, April 27, 2019

MapReduce Process


It's easy to learn the MapReduce process

 

In this article, I try to cover the MapReduce process by explaining the Map and Reduce cycle. MapReduce is one of the core components of Hadoop, built to process huge amounts of data in parallel on commodity machines. The Map phase turns traditional query, decomposition, and data-analysis tasks into distributed work, handling the data and allocating tasks to different nodes.

 

The MapReduce algorithm parallelizes the solution of many kinds of problems. The Reduce phase combines the intermediate results coming from the Map phase, computing the result sets and producing the reduced answer. MapReduce is the programming model of the Hadoop system and is used to process data-intensive workloads in less time.

 

Figure: Process flow of MapReduce

 

The figure shows that the map phase and the reduce phase each take key-value pairs as input and produce key-value pairs as output. The shuffle phase distributes the outputs of the map phase evenly to the inputs of the reduce phase using the MapReduce library. The map phase runs a user-defined mapper function on a set of input key-value pairs [kj, vj] and generates a set of intermediate key-value pairs. Each reduce step typically shrinks the number of data objects (key-value pairs) by an order of magnitude or more.
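To make this contract concrete, here is a minimal word-count sketch (the class names WordCountMapper and WordCountReducer are my own illustration, not anything prescribed by Hadoop): the mapper receives (byte offset, line) pairs and emits (word, 1) pairs, and after the shuffle the reducer receives each word together with all of its counts.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: turns each input (byte offset, line) pair into intermediate (word, 1) pairs.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit an intermediate key-value pair
            }
        }
    }

    // Reducer: after the shuffle, all values for one word arrive together and are summed.
    // (In a real project each class lives in its own .java file.)
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();          // many intermediate pairs collapse into one result
            }
            context.write(word, new IntWritable(sum));
        }
    }

The shuffle guarantees that every (word, 1) pair with the same word reaches the same reducer call, which is exactly what lets the reduce step collapse many pairs into one.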

 

The “org.apache.hadoop.mapreduce.InputFormat” class is responsible for dividing the given input data into multiple input splits. Hadoop application execution time is heavily affected by the shuffle phase, where a large amount of data is transferred from the map tasks to the reduce tasks.
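As a rough sketch of how a driver can influence both of these (the class name SplitTuningSketch is hypothetical, the split sizes are arbitrary, and Hadoop 2.x-style property names are assumed), the job can hint at the split size and compress intermediate map output so that less data crosses the network during the shuffle:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitTuningSketch {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            // Compress intermediate map output so less data crosses the network in the shuffle.
            conf.setBoolean("mapreduce.map.output.compress", true);

            Job job = Job.getInstance(conf, "split and shuffle tuning sketch");
            job.setInputFormatClass(TextInputFormat.class);  // decides how the input is split and read

            // Hint at the split size; the framework launches one map task per input split.
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
            return job;
        }
    }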

 

The “org.apache.hadoop.mapreduce.RecordReader” class is responsible for breaking a split into (key, value) pairs that the mapper phase consumes. Below are the steps that happen as part of the MapReduce cycle.
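For illustration only, a custom RecordReader can be sketched by delegating to the framework's own LineRecordReader (the class name DelegatingLineRecordReader is hypothetical); it makes the contract visible: walk one input split and surface it to the mapper as (key, value) pairs, where the key is the line's byte offset and the value is the line itself.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Sketch of the RecordReader contract, delegating all real work to LineRecordReader.
    public class DelegatingLineRecordReader extends RecordReader<LongWritable, Text> {

        private final LineRecordReader delegate = new LineRecordReader();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);   // open the split assigned to this map task
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return delegate.nextKeyValue();        // advance to the next record, if any
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return delegate.getCurrentKey();       // key = byte offset of the line in the file
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return delegate.getCurrentValue();     // value = the line itself
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return delegate.getProgress();         // fraction of the split consumed so far
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }

To actually use such a reader you would also subclass FileInputFormat and return it from createRecordReader; with the stock TextInputFormat the framework already wires up LineRecordReader for you.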

 

Step 1 - The job is submitted as a .jar file that contains the Driver code, Mapper code, and Reducer code (a Driver sketch follows the steps below).

 

Step 2 - The Job Tracker assigns the mapper tasks, distributing the business logic from the .jar file across all the available Task Trackers.

Step 3 - Once the Task Trackers are done with their mapper tasks, they report that status back to the Job Tracker.

Step 4 - When all the Task Trackers have finished the mapper phase, the Job Tracker initiates the sort and shuffle phase on all the mapper outputs.

Step 5 - When the sort and shuffle is done, the Job Tracker initiates the reducer phase on all available Task Trackers.


Step 6 - Once all Task Trackers have finished the reducer phase, they report that status back to the Job Tracker.
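Here is a minimal sketch of the Driver mentioned in Step 1, reusing the hypothetical WordCountMapper and WordCountReducer classes from above; it wires the two phases into a Job, points the job at input and output paths, and submits it, after which the framework walks through the phases described in the steps.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCountDriver.class);     // the .jar shipped to the cluster (Step 1)
            job.setMapperClass(WordCountMapper.class);    // mapper phase (Steps 2-3)
            job.setReducerClass(WordCountReducer.class);  // reducer phase (Steps 5-6)

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submit the job and block until map, sort/shuffle and reduce have all finished.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, such a job would typically be launched with something like hadoop jar wordcount.jar WordCountDriver /input /output, where the jar name and paths are placeholders.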


Mapper and Reducer are user-defined phases. Output files written by the Mapper in a map-only job are named “part-m-00000”, while output files written by the Reducer are named “part-r-00000”. Job Tracker and Task Tracker are the two daemons entirely responsible for MapReduce processing.
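A small illustrative snippet of that distinction (the class and method names are hypothetical): the reduce-task count set in the Driver decides which prefix appears in the output directory.

    import org.apache.hadoop.mapreduce.Job;

    public class OutputNamingSketch {
        // Illustration only: the reduce-task count decides how the output files are named.
        static void chooseOutputNaming(Job job) {
            job.setNumReduceTasks(2);    // reducers write part-r-00000, part-r-00001, ...
            // job.setNumReduceTasks(0); // a map-only job writes part-m-00000, part-m-00001, ... instead
        }
    }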

 

Kindly follow my blog and stay tuned for more advanced posts on regression techniques.

Thank you!



