Product SiteDocumentation Site

1.3. Hadoop and WebMapReduce

There have been many different implementations of map-reduce frameworks. The most well-known is Google's MapReduce, which was described in a seminal paper[2] that popularized the idea of map-reduce on clusters. However, Google MapReduce is closed-source and not publicly available.
In an effort to provide a map-reduce framework that can be used by anyone, the Apache Foundation started the Hadoop project. Owing to its open-source nature, Hadoop has become easily the most widely used implementation of map-reduce. It is used at Yahoo, Facebook, IBM, and Amazon.com (including Amazon ECC, a service dedicated to providing Hadoop), among many others. WebMapReduce is built on top of Hadoop.

1.3.1. Hadoop Simplified

Unfortunately for the casual user, Hadoop's powerful features can create a bit of a learning curve. Hadoop is written in Java, so map-reduce jobs native to Hadoop must be written in Java as well. Programmers not only need to be familiar with Java program structure, compliation, and dependencies, they also need to be able to use Hadoop's command-line interface to submit and manage jobs and view results.
A few projects exist to make Hadoop easier to use:
  • Hadoop Streaming allows map-reduce programs be written in any language. However, with Streaming, the key-value pairs for mapper input and output are streamed through standard input and output in text format—a good generic interface, but one that is not quite idomatic in many languages.
  • Hue, or the Hadoop User Interface (formerly known as Cloudera Desktop), provides a rich web interface for managing Hadoop clusters and map-reduce jobs. It is a very promising project that can make life easier for administrators and users. Nearly all of the powerful features of Hadoop are available through this interface. Nonetheless, it still requires users to manage compliation and dependencies for Java map-reduce applications.
WebMapReduce aims to combine the best parts of these two projects into an environment that is suitable for novice users. It is a web interface (like Hue) where jobs can be written in any language (like Streaming), but with the following additions over both:
  • Mappers and reducers can be written idomatically in whatever language they use. Users write functions or classes that are called automatically by the framework, rather than interfacing with standard input and output and parsing data manually.
  • In compiled languages like Java and C++, compliation and dependency management is handled automatically. Users just sumbit the source code for their mappers and reducers. Everything else is up to WebMapReduce.
  • Some features of Hadoop are omitted or put in the background so that configuration is not as difficult. Examples:
    • Only one type of input/output format is supported: plain text with keys and values separated by tabs.
    • Keys and values are always given as strings (though some languages may have exceptions).
    • Combiners are not available.
    In the future, we may support some of these features for more advanced users. However, we believe that a basic introduction to map-reduce doesn't need these details.
Since part of simplification means a smaller feature set, users who are already familiar with map-reduce will most likely prefer tools like Hue. WebMapReduce is intended to demonstrate the core concepts of map-reduce and is particularly geared toward classroom use. We think that it serves this purpose well.


[2] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google, Inc., 2004. http://labs.google.com/papers/mapreduce-osdi04.pdf.