1. Introduction

1.1. What is WebMapReduce?

WebMapReduce is a simple web-based user interface for creating and submitting Hadoop Map-Reduce jobs in practically any language. It is ideally suited for use in the introductory computer science classroom, requiring very little programming experience to write massively parallel programs.

Some of its features include:

  • Simplified Map-Reduce model: WebMapReduce offers the features of Map-Reduce that are crucial to the core concept, without details that add to the learning curve.
  • Extensible language support: Mappers and reducers can be written in practically any language. All that is needed to support a new language is a simple wrapper library, possibly with an API for users to easily perform common tasks such as string processing.
  • Adaptable APIs: In addition, preincluded languages can be customized to alter their APIs—for example, in order to support libraries already introduced in a class, or to suit a particular teaching style or programming paradigm. Purely functional, imperative, or object-oriented strategies are all fair game.

1.2. What is Map-Reduce?

Seen from a high level, Map-Reduce is really two things:

  1. A strategy or model for writing programs that can easily be made to process data in parallel.
  2. A framework that runs these programs in parallel, automatically handling the details of division of labor, distribution, synchronization, and fault-tolerance.

The model and the framework work together to make programs that are scalable, distributed, and fault-tolerant.

1.2.1. The Model

In the map-reduce programming model, work is divided into two phases: a map phase and a reduce phase. Both of these phases work on key-value pairs. What these pairs contain is completely up to you: they could be URLs paired with counts of how many pages link to them, or movie IDs paired with ratings. It all depends on how you write and set up your map-reduce job.

A map-reduce program typically acts something like this:

  1. Input data, such as a long text file, is split into key-value pairs. These key-value pairs are then fed to your mapper. (This is the job of the map-reduce framework.)
  2. Your mapper processes each key-value pair individually and outputs one or more intermediate key-value pairs.
  3. All intermediate key-value pairs are collected, sorted, and grouped by key (again, the responsibility of the framework).
  4. For each unique key, your reducer receives the key with a list of all the values associated with it. The reducer aggregates these values in some way (adding them up, taking averages, finding the maximum, etc.) and outputs one or more output key-value pairs.
  5. Output pairs are collected and stored in an output file (by the framework).
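
In code, steps 2 and 4 are the only parts you write yourself; everything else belongs to the framework. The following sketch shows the general shape of those two functions in Python. The names mapper, reducer, and emit here are only illustrative stand-ins, not WebMapReduce's exact interface (the exact interface depends on the language wrapper you choose):

def emit(key, value):
    # Hypothetical stand-in for the framework's real output call.
    print(key, value, sep="\t")

def mapper(key, value):
    # Step 2: called once per input pair; may emit any number of
    # intermediate pairs (this one just passes the pair through).
    emit(key, value)

def reducer(key, values):
    # Step 4: called once per unique intermediate key, with every value
    # that shared that key; aggregates them into output pairs.
    emit(key, len(values))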
[Figure: Visualization of Map-Reduce]

What makes this model so good for parallel programming should be apparent from the figure above: each key-value pair can be mapped or reduced independently. This means that many different processors, or even machines, can each take a section of the data and process it separately—a classic example of data parallelism. The only real step where synchronization is needed is during the collecting and sorting phase, which can be handled by the framework (and, when done carefully, even this can be parallelized).

So, when you can fit a problem into this model, parallelization becomes very easy. What may seem less obvious is how a problem can be solved with this model in the first place. For that, a concrete example is the best place to start, which we will do in Example: WordCount. First, though, we will talk about the role of map-reduce frameworks.

1.2.2. The Framework

While it is possible to perform every step of the map-reduce model yourself, one of the other benefits of the model is that certain steps can easily be automated. This is where a map-reduce framework comes in. As explained above, the splitting, sorting, and storage steps are usually handled by this framework. Other jobs that a framework typically performs include:

  • Spawning mappers and reducers. These could be in separate threads, separate processes, or even separate machines. The framework, perhaps guided by a job’s configuration, can decide how many processes are necessary.
  • Distributing data to the mappers and reducers. This is especially important when running on many different machines.
  • Recovering from errors. If a map or reduce task fails, the framework can start another identical task (perhaps on a different machine this time) to make up the lost work. This fault-tolerance is an important feature of map-reduce.

1.2.3. Example: WordCount

A very often-used example to demonstrate map-reduce is WordCount, a simple program that counts the frequency of words in a text. (It is so common, in fact, that it deserves to be called the “Hello, World!” of map-reduce.) There are of course many other ways to apply map-reduce. This example is only a basic introduction.

The input to a WordCount map-reduce application typically consists of plain lines of text. There are many different possibilities for translating these simple lines into key-value pairs. One option is to give the whole line as a key and leave the value empty. [1] Since that is what is typically done in WebMapReduce, we will illustrate it this way. So, given two lines of text like the following:

one fish two fish
red fish blue fish

The mapper will receive these two key-value pairs:

<"one fish two fish", "">
<"red fish blue fish", "">

Note

The <key, value> format is just a convention for denoting an abstract key-value pair. It is not meant to suggest that your mapper will receive a string that looks like this. In WebMapReduce, the exact representation of a key-value pair depends on your language; typically, the key and value are just two separate arguments to a function.

What should the mapper do with these? Since the goal is to produce a list of words and their associated counts, a good strategy is to split the line into its constituent words, outputting each word as a key (remember that a single map can output multiple pairs!), along with a count as the value. These lines would then become:

<"one fish two fish", "">  =>  <"one", 1>
                               <"fish", 1>
                               <"two", 1>
                               <"fish", 1>
<"red fish blue fish", "">  =>  <"red", 1>
                                <"fish", 1>
                                <"blue", 1>
                                <"fish", 1>

A note on counts

You might wonder why the mapper output two ‘fishes’ with a count of 1 rather than one ‘fish’ with a count of 2. We certainly could have done it the latter way. However, that would have meant keeping track—with some sort of associative array—of all the words encountered in a given line.

Besides making the process slightly more complicated, it could add to overhead. In this trivial example, that overhead certainly wouldn’t amount to much, but when we program map-reduce jobs, we always try to keep scalability in mind. What would happen if a whole document was written on one line?

The choice is ultimately up to you. Rest assured, though, that either way, our reducer will make sure all the counts are added correctly in the end.
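
For comparison, the aggregate-within-a-line variant described above might look roughly like this (same hypothetical emit() stand-in):

def emit(key, value):
    # Hypothetical stand-in, as before.
    print(key, value, sep="\t")

def mapper(key, value):
    counts = {}    # associative array of word -> count for this one line
    for word in key.split():
        counts[word] = counts.get(word, 0) + 1
    for word, count in counts.items():
        emit(word, count)    # one pair per distinct word in the line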

We will not give a complete WebMapReduce mapper here (the sketches above are only rough outlines), but as an exercise, think of how you would write a function to split a line like this in your favorite programming language. It shouldn't take much more than a few moments of thought and a dash of code. In the Example Job section of The Web Interface chapter, we will show how easy it is to use WebMapReduce to write a WordCount application in Python. Hold your breath!

Now to the reducer. The obvious task that remains is to add together all the individual counts for every unique word produced in the map phase. The map-reduce framework makes this task easy by grouping the counts by word for us. Each reduce task should receive a key and a list of the values associated with it. For the lines above, the pairs would be:

<"blue", ["1"]>
<"fish", ["1", "1", "1", "1"]>
<"one",  ["1"]>
<"red",  ["1"]>
<"two",  ["1"]>

Note

A small aside: notice that the previously numeric 1s produced in the map phase have become strings before being input to the reducer. In WebMapReduce, this may or may not happen, depending on your language. It is a simplification that WMR typically makes in order to keep configuration simple for the user.

For each key and list of values, the reducer should simply iterate through the list and produce a final count, which it should output with the key. The final key-value pairs should then be:

<"blue", 1>
<"fish", 4>
<"one", 1>
<"red", 1>
<"two", 1>

Finally, we have our result! The only remaining question is, just how will these key-value pairs be represented in the output file? In WebMapReduce, the answer is that all files—both input and output—are text files where each line is a key-value pair, and keys are separated from values by a tab character. The output file in this example will be:

blue    1
fish    4
one     1
red     1
two     1
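
Because the format is so simple, this output is also easy to consume outside of WebMapReduce. For example, in Python each line splits cleanly on the tab character (the file name below is just a placeholder):

# Read WMR-style output: one key-value pair per line, separated by a tab.
with open("wordcount-output.txt") as output:
    for line in output:
        key, value = line.rstrip("\n").split("\t", 1)
        print(key, "->", value)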

Footnotes

[1] Other possibilities are to give the line as a value and leave the key empty, or to use another property, such as the byte offset of the line in the file, as the key. The latter is how Hadoop, the framework that underlies WebMapReduce, typically presents lines.

1.3. Hadoop and WebMapReduce

There have been many different implementations of map-reduce frameworks. The most well-known is Google’s MapReduce, which was described in a seminal paper [2] that popularized the idea of map-reduce on clusters. However, Google MapReduce is closed-source and not publicly available.

In an effort to provide a map-reduce framework that can be used by anyone, the Apache Foundation started the Hadoop project. Owing to its open-source nature, Hadoop has become easily the most widely used implementation of map-reduce. It is used at Yahoo, Facebook, IBM, and Amazon.com (including Amazon Elastic MapReduce, a service dedicated to providing Hadoop), among many others. WebMapReduce is built on top of Hadoop.

Footnotes

[2] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google, Inc., 2004. http://labs.google.com/papers/mapreduce-osdi04.pdf.

1.3.1. Hadoop Simplified

Unfortunately for the casual user, Hadoop's powerful features can create a bit of a learning curve. Hadoop is written in Java, so map-reduce jobs native to Hadoop must be written in Java as well. Programmers not only need to be familiar with Java program structure, compilation, and dependencies, but also need to be able to use Hadoop's command-line interface to submit and manage jobs and view results.

A few projects exist to make Hadoop easier to use:

  • Hadoop Streaming allows map-reduce programs to be written in any language. However, with Streaming, the key-value pairs for mapper input and output are streamed through standard input and output in text format—a good generic interface, but one that is not quite idiomatic in many languages (see the sketch after this list).
  • Hue, or the Hadoop User Interface (formerly known as Cloudera Desktop), provides a rich web interface for managing Hadoop clusters and map-reduce jobs. It is a very promising project that can make life easier for administrators and users, and nearly all of Hadoop's powerful features are available through this interface. Nonetheless, it still requires users to manage compilation and dependencies for Java map-reduce applications.
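
To see the Streaming issue concretely, a word-count mapper under Streaming is an ordinary program that reads raw lines from standard input and writes tab-separated pairs to standard output, roughly like this sketch:

import sys

# Rough sketch of a word-count mapper for Hadoop Streaming: raw text lines
# arrive on standard input, and each output line is a tab-separated
# key-value pair written to standard output.
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")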

WebMapReduce aims to combine the best parts of these two projects into an environment that is suitable for novice users. It is a web interface (like Hue) where jobs can be written in any language (like Streaming), but with the following additions over both:

  • Mappers and reducers can be written idiomatically in whatever language they use. Users write functions or classes that are called automatically by the framework, rather than interfacing with standard input and output and parsing data manually.

  • In compiled languages like Java and C++, compilation and dependency management are handled automatically. Users just submit the source code for their mappers and reducers. Everything else is up to WebMapReduce.

  • Some features of Hadoop are omitted or put in the background so that configuration is not as difficult. Examples:

    • Only one type of input/output format is supported: plain text with keys and values separated by tabs.
    • Keys and values are always given as strings (though some languages may have exceptions).
    • Combiners are not available.

    In the future, we may support some of these features for more advanced users. However, we believe that a basic introduction to map-reduce doesn’t need these details.

Since simplification means a smaller feature set, users who are already familiar with map-reduce will most likely prefer tools like Hue. WebMapReduce is intended to demonstrate the core concepts of map-reduce and is particularly geared toward classroom use. We think that it serves this purpose well.