WebMapReduce in Education
WebMapReduce is geared toward use in the classroom. Free teaching materials for introductory computer science courses are available in several languages through the CS in Parallel web site at its Map-reduce Computing page.
What is Map-Reduce?
Map-reduce is a model for writing programs that can easily be made to process data in parallel. It usually goes along with a framework that manages the details of parallelism, such as distributing work and synchronizing shared resources. Because the framework handles those details, map-reduce programs are also easy to write.
In the map-reduce model, work is divided into two phases: a map phase and a reduce phase. The map phase takes a piece of input and performs some operation on it (e.g., extracting a field), and the reduce phase aggregates similar pieces of information that are produced by the map phase (e.g., averaging fields with the same name). These pieces of information are represented by key-value pairs whose content is determined by the program.
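The two phases can be illustrated with a small single-process sketch in Python. It is not the WebMapReduce API: the names `mapper`, `reducer`, and `run_map_reduce`, along with the `name, value` record format, are assumptions made for this illustration, and the grouping step stands in for work a real framework would do across many machines.

```python
from collections import defaultdict


def mapper(record):
    """Map phase: extract the fields of one 'name, value' record
    and emit them as a single key-value pair."""
    name, value = record.split(",")
    yield name.strip(), float(value)


def reducer(key, values):
    """Reduce phase: average all values that share the same key."""
    return key, sum(values) / len(values)


def run_map_reduce(records):
    """Hypothetical driver: group mapper output by key, then reduce each group.

    A real framework would perform this grouping (the "shuffle") and run
    the mapper and reducer calls in parallel on different machines.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


records = ["temp, 20", "temp, 30", "humidity, 40"]
averages = run_map_reduce(records)
print(averages)  # {'humidity': 40.0, 'temp': 25.0}
```

Because each `mapper` call sees only one record and each `reducer` call sees only one key, neither function depends on shared state, which is what lets a framework run them in any order or in parallel.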
For more information and examples, see:
- What is Map-Reduce? in the WebMapReduce User Guide
- MapReduce on Wikipedia
- MapReduce: Simplified Data Processing on Large Clusters by Dean and Ghemawat
Why Map-Reduce?
Map-reduce can serve as an ideal introduction to parallelism for a number of reasons. It avoids issues that can be particularly difficult with parallel programming, like deadlock and race conditions. At the same time, though, it does demonstrate many important concepts, such as:
- Data vs. task parallelism: Input is divided among many tasks (data parallelism), while responsibilities are divided between the map and reduce phases to form a pipeline (task parallelism).
- Scalability & speedup: Map-reduce can scale to extremely large data sets, on the order of petabytes. Increasing the number of map and reduce tasks can decrease the run time, but overhead limits the theoretical speedup, as governed by Amdahl's Law.
- Functional techniques: The map-reduce model is inspired by patterns in functional languages, where the lack of side effects and state makes parallelization simple. Thinking in a functional way can help solve parallel problems in many different domains.
- Fault-tolerance: While not managed by the user, fault-tolerance is an important aspect of map-reduce. Map and reduce tasks that fail can be reassigned to different machines.
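The speedup limit mentioned under scalability can be made concrete with Amdahl's Law, which gives the best-case speedup as 1 / ((1 - p) + p / n) for a job whose parallelizable fraction is p when run on n workers. The sketch below is a minimal illustration; the function name `amdahl_speedup` and the 95% figure are assumptions chosen for the example, not measurements from any map-reduce system.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Best-case speedup under Amdahl's Law.

    parallel_fraction: share of the run time that can be parallelized (0 to 1).
    workers: number of tasks running at once (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed example: 95% of the job is parallelizable.
for n in (1, 2, 8, 64, 1024):
    print(f"{n:>5} workers -> speedup {amdahl_speedup(0.95, n):.2f}")
```

With 5% of the work remaining serial, the speedup approaches but never reaches 20, no matter how many map and reduce tasks are added.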