2.6.2. Configure & Write Job

Once logged in, you need to configure your job's basic settings. Use any job name and select Python 3 for the source language. There is no need to set the number of map or reduce tasks in this example, so leave these fields unchanged.

Figure 2.2. Job Configuration

Next, supply input to the job. A small amount of text will suffice for this example, which lets us check at a glance whether the results are correct. The same example could, however, be run on a much larger dataset (try it out when you are finished!). Use something simple, preferably with repeated words, like the text shown in Figure 2.3, “Job Input”:

Figure 2.3. Job Input

Now define a mapper. The following mapper does exactly what is described in Section 1.2.3, “Example: WordCount” in the introduction to map-reduce: it takes an entire line of input as its key (ignoring the empty value), and outputs every word in the line along with a 1 as the value:

def mapper(key, val):
  words = key.split()
  for word in words:
    Wmr.emit(word, '1')

Notice the following parts of the code:

The function is named mapper(). This name is required so that WebMapReduce knows which function to call to feed input.^[3] You are free to write other auxiliary functions to support your mapper.
The function takes two arguments, named key and val in this example. This is the input key-value pair described in the introduction to map-reduce. Both arguments will be strings.
In this case, since the input we gave was simple text with no tabs to separate keys from values, the entire line will be contained in the key, and the value will be empty.^[4]
The Wmr.emit() function is how key-value pairs are output in the Python library for WebMapReduce. It takes two arguments: again, a key and a value.^[5]
Even though the value we output is a number, we have put it inside a string. This is to highlight that WebMapReduce always uses strings for keys and values.
Technically, we could have just used the number 1. The Python library for WebMapReduce will automatically convert any type of argument to a string using Python's str() function. When the value is fed to the reducer, though, it will not automatically be converted back.^[4]
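
The way a line of input becomes a key and a value can be imitated locally. The following is only a minimal sketch, not WebMapReduce's actual input handling: the helper name split_record() is hypothetical, and splitting at the first tab on the line is an assumption.

```python
def split_record(line):
    """Split one input line into a (key, value) pair of strings.

    Hypothetical helper: assumes the split happens at the first tab.
    """
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


# No tab: the whole line ends up in the key and the value is empty.
print(split_record("one fish two fish"))

# With a tab: the text before it is the key, the text after it is the value.
print(split_record("fish\t2"))
```

With plain text like the input used in this example, the first case applies, which is why the mapper only needs to look at its key.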

If the mapper receives the line "one fish" as input, its output should be:

one    1
fish   1

Figure 2.4. Mapper Source Code
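
To try the mapper before submitting a job, it can be run locally against a stand-in for the Wmr library. The Wmr class below is an assumption made for illustration only: it records emitted pairs in a list and converts both arguments with str(), as described above. It is not the real WebMapReduce library.

```python
class Wmr:
    """Local stand-in for the WebMapReduce Python library (testing only)."""
    emitted = []

    @staticmethod
    def emit(key, value):
        # Keys and values are always strings, so convert both on the way out.
        Wmr.emitted.append((str(key), str(value)))


def mapper(key, val):
    words = key.split()
    for word in words:
        Wmr.emit(word, '1')


# The whole line is the key; the value is empty.
mapper("one fish", "")
for k, v in Wmr.emitted:
    print(k + "\t" + v)
```

Running this prints one pair per word, each with the value 1, matching the output shown above.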

Finally, we will write a reducer. This reducer adds all the 1's associated with each word to get a final count:

def reducer(key, values):
  sum = 0
  for v in values:
    sum = sum + int(v)
  Wmr.emit(key, str(sum))

Notice the following:

The function is named reducer(). As with mapper(), this is required.
This time, the second argument to reducer() is actually a list of values. This is the list of all values output from the map phase whose key was equal to key. As described in the introduction to map-reduce, the framework collects these values and generates the list automatically.
In this case, values is a list of "1"s with as many elements as there were occurrences of key in the input.
Before being added to sum, each value is converted from a string to an integer using Python's int() function.
Before being output with Wmr.emit(), the sum is converted from an integer to a string using Python's str() function. As with the mapper, this is not required, but is included to make the conversion explicit.
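
The need for int() can be seen with a short, standalone sketch in plain Python. The list below is an assumed example of what a reducer might receive for a word that occurred three times; no WebMapReduce code is involved.

```python
values = ['1', '1', '1']  # assumed reducer input for a word seen three times

# Without int(), adding a string to an integer raises a TypeError.
try:
    total = 0
    for v in values:
        total = total + v
except TypeError as err:
    print("TypeError:", err)

# With int(), each value is converted first and the count adds up.
total = 0
for v in values:
    total = total + int(v)
print(total)
```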

Figure 2.5. Reducer Source Code
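
The whole job can also be rehearsed locally. The sketch below reuses the mapper and reducer from this section with a stand-in Wmr class and a hypothetical run_job() driver. Both are assumptions for illustration, as is the choice to call the reducer in sorted key order; the real framework distributes this work across many machines.

```python
from collections import defaultdict


class Wmr:
    """Local stand-in for the WebMapReduce library; not the real implementation."""
    output = []

    @staticmethod
    def emit(key, value):
        Wmr.output.append((str(key), str(value)))


def mapper(key, val):
    words = key.split()
    for word in words:
        Wmr.emit(word, '1')


def reducer(key, values):
    sum = 0
    for v in values:
        sum = sum + int(v)
    Wmr.emit(key, str(sum))


def run_job(lines):
    """Hypothetical driver: map every line, group by key, then reduce."""
    # Map phase: each line is passed as the key with an empty value.
    Wmr.output = []
    for line in lines:
        mapper(line, '')
    # Shuffle phase: collect all values emitted under the same key.
    groups = defaultdict(list)
    for key, value in Wmr.output:
        groups[key].append(value)
    # Reduce phase: one reducer call per distinct key (sorted order assumed).
    Wmr.output = []
    for key in sorted(groups):
        reducer(key, groups[key])
    return Wmr.output


for word, count in run_job(["one fish two fish", "red fish blue fish"]):
    print(word + "\t" + count)
```

For this input, "fish" is counted four times and every other word once.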

^[3] All languages require certain names for mappers and reducers, although the exact name may differ. The chapter on each language will give the specific requirement.

^[4] Although this is generally true in WebMapReduce, some languages may behave differently. These differences will be noted in the chapter for each language.

^[5] Other languages may have similar functions, or they may use the return value of the function as output. See the chapter on each language.