WebMapReduce is a simple web-based user interface for creating and submitting Hadoop Map-Reduce jobs in practically any language. It is ideally suited for use in the introductory computer science classroom, requiring very little programming experience to write massively parallel programs.
Some of its features include:
Seen from a high level, Map-Reduce is really two things:
The model and the framework work together to make programs that are scalable, distributed, and fault-tolerant.
In the map-reduce programming model, work is divided into two phases: a map phase and a reduce phase. Both of these phases work on key-value pairs. What these pairs contain is completely up to you: they could be URLs paired with counts of how many pages link to them, or movie IDs paired with ratings. It all depends on how you write and set up your map-reduce job.
A map-reduce program typically acts something like this:
What makes this model so good for parallel programming should be apparent from the figure above: each key-value pair can be mapped or reduced independently. This means that many different processors, or even machines, can each take a section of the data and process it separately—a classic example of data parallelism. The only real step where synchronization is needed is during the collecting and sorting phase, which can be handled by the framework (and, when done carefully, even this can be parallelized).
So, when you can fit a problem into this model, it can make parallelization very easy. What may seem less obvious is how a problem can be solved with this model in the first place. For this, a concrete example is the best place to start, which we will do in Example: WordCount. First, though, we will talk about the role of map-reduce frameworks.
While it is possible to perform every step of the map-reduce model yourself, one of the other benefits of the model is that certain steps can easily be automated. This is where a map-reduce framework comes in. As explained above, the splitting, sorting, and storage steps are usually handled by this framework. Other jobs that a framework typically performs include:
A very often-used example to demonstrate map-reduce is WordCount, a simple program that counts the frequency of words in a text. (It is so common, in fact, that it deserves to be called the “Hello, World!” of map-reduce.) There are of course many other ways to apply map-reduce. This example is only a basic introduction.
The input to a WordCount map-reduce application typically consists of plain lines of text. There are many different possibilities for translating these simple lines into key-value pairs. One option is to give the whole line as a key and leave the value empty. [1] Since that is what is typically done in WebMapReduce, we will illustrate it this way. So, given two lines of text like the following:
one fish two fish
red fish blue fish
The mapper will receive these two key-value pairs:
<"one fish two fish", "">
<"red fish blue fish", "">
Note
The <key, value> format is just a convention denoting an abstract key-value pair. It does not mean to suggest that your mapper will receive a string that looks like this. In WebMapReduce, the exact representation of a key-value pair depends on your language. They are typically just two separate arguments to a function.
What should the mapper do with these? Since the goal is to produce a list of words and their associated counts, a good strategy is to split the line into its constituent words, outputting each word as a key (remember that a single map can output multiple pairs!), along with a count as the value. These lines would then become:
<"one fish two fish", ""> => <"one", 1>
<"fish", 1>
<"two", 1>
<"fish", 1>
<"red fish blue fish", ""> => <"red", 1>
<"fish", 1>
<"blue", 1>
<"fish", 1>
A note on counts
You might wonder why the mapper output two ‘fishes’ with a count of 1 rather than one ‘fish’ with a count of 2. We certainly could have done it the latter way. However, that would have meant keeping track—with some sort of associative array—of all the words encountered in a given line.
Besides making the process slightly more complicated, it could add to overhead. In this trivial example, that overhead certainly wouldn’t amount to much, but when we program map-reduce jobs, we always try to keep scalability in mind. What would happen if a whole document was written on one line?
The choice is ultimately up to you. Rest assured, though, that either way, our reducer will make sure all the counts are added correctly in the end.
We will not give code for such a mapper here, but as an exercise, think of how you would write a function to split a line like this in your favorite programming language. It shouldn’t take much more than a few moments of thought and a dash of code. In sect-User_Guide-The_Web_Interface-Example_Job, we will show how easy it is to use WebMapReduce to write a WordCount application in Python. Hold your breath!
Now to the reducer. The obvious task that remains is to add together all the individual counts for every unique word produced in the map phase. The map-reduce framework makes this task easy by grouping the counts by word for us. Each reduce task should receive a key and a list of the values associated with it. For the lines above, the pairs would be:
<"blue", ["1"]>
<"fish", ["1", "1", "1", "1"]>
<"one", ["1"]>
<"red", ["1"]>
<"two", ["1"]>
Note
A small aside: notice that the previously-numeric 1‘s that were produced in the map phase have become strings before being input to the reducer. In WebMapReduce, this may or may not happen, depending on your language. It is simplification that WMR typically makes in order to make configuration easier for the user.
For each key and list of values, the reducer should simply iterate through the list and produce a final count, which it should output with the key. The final key-value pairs should then be:
<"blue", 1>
<"fish", 4>
<"one", 1>
<"red", 1>
<"two", 1>
Finally, we have our result! The only remaining question is, just how will these key-value pairs be represented in the output file? In WebMapReduce, the answer is that all files—both input and output—are text files where each line is a key-value pair, and keys are separated from values by a tab character. The output file in this example will be:
blue 1
fish 4
one 1
red 1
two 1
Footnotes
[1] | Other possibilites are to give the line as a value and leave the key empty, or to give another property, like the byte offset of the line in the file, as the key. That is how Hadoop, the framework that underlies WebMapReduce, typically gives lines. |
There have been many different implementations of map-reduce frameworks. The most well-known is Google’s MapReduce, which was described in a seminal paper [2] that popularized the idea of map-reduce on clusters. However, Google MapReduce is closed-source and not publicly available.
In an effort to provide a map-reduce framework that can be used by anyone, the Apache Foundation started the Hadoop project. Owing to its open-source nature, Hadoop has become easily the most widely used implementation of map-reduce. It is used at Yahoo, Facebook, IBM, and Amazon.com (including Amazon Elastic Map-Reduce, a service dedicated to providing Hadoop), among many others. WebMapReduce is built on top of Hadoop.
Footnotes
[2] | Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google, Inc., 2004. http://labs.google.com/papers/mapreduce-osdi04.pdf. |
Unfortunately for the casual user, Hadoop’s powerful features can create a bit of a learning curve. Hadoop is written in Java, so map-reduce jobs native to Hadoop must be written in Java as well. Programmers not only need to be familiar with Java program structure, compliation, and dependencies, they also need to be able to use Hadoop’s command-line interface to submit and manage jobs and view results.
A few projects exist to make Hadoop easier to use:
WebMapReduce aims to combine the best parts of these two projects into an environment that is suitable for novice users. It is a web interface (like Hue) where jobs can be written in any language (like Streaming), but with the following additions over both:
Mappers and reducers can be written idomatically in whatever language they use. Users write functions or classes that are called automatically by the framework, rather than interfacing with standard input and output and parsing data manually.
In compiled languages like Java and C++, compliation and dependency management is handled automatically. Users just sumbit the source code for their mappers and reducers. Everything else is up to WebMapReduce.
Some features of Hadoop are omitted or put in the background so that configuration is not as difficult. Examples:
In the future, we may support some of these features for more advanced users. However, we believe that a basic introduction to map-reduce doesn’t need these details.
Since part of simplification means a smaller feature set, users who are already familiar with map-reduce will most likely prefer tools like Hue. WebMapReduce is intended to demonstrate the core concepts of map-reduce and is particularly geared toward classroom use. We think that it serves this purpose well.