Product SiteDocumentation Site

1.2. What is Map-Reduce?

Seen from a high level, Map-Reduce is really two things:
  1. A strategy or model for writing programs that can easily be made to process data in parallel.
  2. A framework that runs these programs in parallel, automatically handling the details of division of labor, distribution, synchronization, and fault-tolerance.
The model and the framework work together to make programs that are scalable, distributed, and fault-tolerant.

1.2.1. The Model

In the map-reduce programming model, work is divided into two phases: a map phase and a reduce phase. Both of these phases work on key-value pairs. What these pairs contain is completely up to you: they could be URLs paired with counts of how many pages link to them, or movie IDs paired with ratings. It all depends on how you write and set up your map-reduce job.
A map-reduce program typically acts something like this:
  1. Input data, such as a long text file, is split into key-value pairs. These key-value pairs are then fed to your mapper. (This is the job of the map-reduce framework.)
  2. Your mapper processes each key-value pair individually and outputs one or more intermediate key-value pairs.
  3. All intermediate key-value pairs are collected, sorted, and grouped by key (again, the responsibility of the framework).
  4. For each unique key, your reducer receives the key with a list of all the values associated with it. The reducer aggregates these values in some way (adding them up, taking averages, finding the maximum, etc.) and outputs one or more output key-value pairs.
  5. Output pairs are collected and stored in an output file (by the framework).
Visualization of Map-Reduce
Figure 1.1. Visualization of Map-Reduce

What makes this model so good for parallel programming should be apparent from the figure above: each key-value pair can be mapped or reduced independently. This means that many different processors, or even machines, can each take a section of the data and process it separately—a classic example of data parallelism. The only real step where synchronization is needed is during the collecting and sorting phase, which can be handled by the framework (and, when done carefully, even this can be parallelized).
So, when you can fit a problem into this model, it can make parallelization very easy. What may seem less obvious is how a problem can be solved with this model in the first place. For this, a concrete example is the best place to start, which we will do in Section 1.2.3, “Example: WordCount”. First, though, we will talk about the role of map-reduce frameworks.