Product SiteDocumentation Site

1.2.3. Example: WordCount

A very often-used example to demonstrate map-reduce is WordCount, a simple program that counts the frequency of words in a text. (It is so common, in fact, that it deserves to be called the "Hello, World!" of map-reduce.) There are of course many other ways to apply map-reduce. This example is only a basic introduction.
The input to a WordCount map-reduce application typically consists of plain lines of text. There are many different possibilities for translating these simple lines into key-value pairs. One option is to give the whole line as a key and leave the value empty.[1] Since that is what is typically done in WebMapReduce, we will illustrate it this way. So, given two lines of text like the following:
one fish two fish
red fish blue fish
The mapper will receive these two key-value pairs:

<"one fish two fish", "">
<"red fish blue fish", "">

Note

The <key, value> format is just a convention denoting an abstract key-value pair. It does not mean to suggest that your mapper will receive a string that looks like this. In WebMapReduce, the exact representation of a key-value pair depends on your language. They are typically just two separate arguments to a function.
What should the mapper do with these? Since the goal is to produce a list of words and their associated counts, a good strategy is to split the line into its constituent words, outputting each word as a key (remember that a single map can output multiple pairs!), along with a count as the value. These lines would then become:

<"one fish two fish", "">  =>  <"one", 1>
                               <"fish", 1>
                               <"two", 1>
                               <"fish", 1>
<"red fish blue fish", "">  =>  <"red", 1>
                                <"fish", 1>
                                <"blue", 1>
                                <"fish", 1>

We will not give code for such a mapper here, but as an exercise, think of how you would write a function to split a line like this in your favorite programming language. It shouldn't take much more than a few moments of thought and a dash of code. In Section 2.6, “Example Job”, we will show how easy it is to use WebMapReduce to write a WordCount application in Python. Hold your breath!
Now to the reducer. The obvious task that remains is to add together all the individual counts for every unique word produced in the map phase. The map-reduce framework makes this task easy by grouping the counts by word for us. Each reduce task should receive a key and a list of the values associated with it. For the lines above, the pairs would be:

<"blue", ["1"]>
<"fish", ["1", "1", "1", "1"]>
<"one",  ["1"]>
<"red",  ["1"]>
<"two",  ["1"]>

Note

A small aside: notice that the previously-numeric 1's that were produced in the map phase have become strings before being input to the reducer. In WebMapReduce, this may or may not happen, depending on your language. It is simplification that WMR typically makes in order to make configuration easier for the user.
For each key and list of values, the reducer should simply iterate through the list and produce a final count, which it should output with the key. The final key-value pairs should then be:

<"blue", 1>
<"fish", 4>
<"one", 1>
<"red", 1>
<"two", 1>

Finally, we have our result! The only remaining question is, just how will these key-value pairs be represented in the output file? In WebMapReduce, the answer is that all files—both input and output—are text files where each line is a key-value pair, and keys are separated from values by a tab character. The output file in this example will be:
blue    1
fish    4
one     1
red     1
two     1


[1] Other possibilites are to give the line as a value and leave the key empty, or to give another property, like the byte offset of the line in the file, as the key. That is how Hadoop, the framework that underlies WebMapReduce, typically gives lines.