The web interface exposes these general configuration options:
A suggested number of tasks to divide the reduce phase of the job into. Like the number of map tasks, it does not need to be set manually; however, the map-reduce system follows this suggestion more closely than it does the suggested number of map tasks. The default is 1. Set it higher for large jobs where the reduce phase becomes a bottleneck.
Note
For test jobs, neither the map phase nor the reduce phase is split into tasks, so these numbers are ignored.
There are four different ways to provide input for a job:
When the Upload and Direct Input options are used, the data is uploaded to the DFS when the job is saved or submitted. The form is updated so that the job input is set to the path on the DFS to which the data was uploaded. The data can be reused through the Datasets interface in the navigation bar.
The mapper and reducer must be written in the language specified by the Source Code Language selection box, following the appropriate format for that language. (See later chapters for examples and API specifications for each language.) Each piece of code can be submitted in either of two ways:
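Whichever way the code is provided, it must follow the expected format. As a preview, a Python 3 job (the language used in the walkthrough below) takes roughly the following shape; this is only a sketch, and the exact requirements for each language are given in its own chapter:

# Sketch of the required Python 3 format (see the Python chapter for
# the full specification). WMR supplies the Wmr class at run time.

def mapper(key, val):
    # Called for each input record; emit any number of key-value pairs.
    Wmr.emit(key, val)

def reducer(key, vals):
    # Called once per distinct key, with an iterable of that key's values.
    for val in vals:
        Wmr.emit(key, val)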
WebMapReduce allows users to save job configurations for future use. This is accomplished by clicking the Save button at the bottom of the form. Configurations are saved according to the job name, and saving a job with the same name as an existing configuration will overwrite that configuration. Saving a configuration also sends any uploaded or manually entered job input to the DFS—if the job configuration is reused, it will provide the path to this data. Job configurations may be reused through the Saved Configurations interface. To load a configuration:
Once a job has been properly configured, it can be submitted using the button(s) at the bottom of the interface. Jobs can be submitted in two ways: as test jobs, or as standard jobs.
Test Jobs
Test jobs do not run on a cluster or in any distributed context. Instead, they are executed immediately on the machine that hosts the WebMapReduce server, inside an environment that emulates the map-reduce process. This delivers output quickly and allows for more helpful error output. The purpose of test jobs is to give users an easy way to debug their programs without having to write a test framework of their own. In general, jobs should be tried in the test environment before being run full-scale, to save time and resources.
Note that the output of test jobs is saved only temporarily, and there is a limit on the size of input that test jobs may use. Since this limit is configurable, questions concerning test job input size should be directed to local system administrators. If the WebMapReduce backend is restarted, test jobs automatically expire and become inaccessible from the web interface.
Standard Jobs
Standard jobs are “real” map-reduce jobs that are run in a distributed context, most commonly a Beowulf cluster. These jobs have no limits on input size, can save and reuse output, and support monitoring.
Once a job of this type is successfully submitted, users are redirected to the monitoring page, which automatically refreshes to update the job status. Additionally, users can cancel their own jobs from this interface. If the job completes without error, its output will be displayed on the next refresh (across multiple pages if necessary) and options will be provided to reuse the output as the input of another job, and to reload the job configuration for a new job. Currently, WebMapReduce does not allow users to download their job output through the web interface, but it can be accessed directly through the DFS if necessary.
Users who submit too many jobs within a short period of time will receive an error asking them to wait a few minutes before submitting another job.
All submitted jobs are recorded in the Job History interface. This feature allows users to revisit past jobs, sorted by submission time, to view their monitoring page. A previous job may be accessed by:
Note that test jobs are displayed in italics, and may expire and become inaccessible.
This example will walk through the process of using WebMapReduce, from signing in to monitoring a job. We will demonstrate WordCount (the simple map-reduce application described in Example: WordCount) using Python 3. For WordCount examples in other languages, please see the documentation for each language in later chapters.
To begin using WMR, you must first sign in. Navigate to the WMR main page and log in using the form provided.
Once logged in, you need to configure your job’s basic settings. Use any job name and select Python 3 for the source language. There is no need to set the number of map or reduce tasks in this example, so leave these fields unchanged.
Next, supply input to the job. A small amount of text will suffice for this example, with the benefit that we can check at a glance whether we received the right results. The same job could, however, be run on a much larger dataset (try it out when you are finished!). Use something simple, preferably with repeated words, like the text shown in the screenshot in Job Input:
Now define a mapper. The following mapper does exactly what is described in Example: WordCount in the introduction to map-reduce: it takes an entire line of input as its key (ignoring the empty value), and outputs every word in the line along with a 1 as the value:
def mapper(key, val):
    for word in key.split():
        Wmr.emit(word, '1')
Notice the following parts of the code:
If the mapper receives the line “one fish” as input, its output should be:
one 1
fish 1
Finally, we will write a reducer. This reducer adds all the 1's associated with each word to get a final count:
def reducer(key, vals):
    total = 0
    for val in vals:
        total += int(val)
    Wmr.emit(key, str(total))
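For example, if the reducer receives the key “fish” with the values ‘1’ and ‘1’, its output should be:

fish 2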
Notice the following:
Footnotes
[1] All languages require certain names for mappers and reducers, although the exact name may differ. The chapter on each language will give the specific requirement.
[2] Although this is generally true in WebMapReduce, some languages may behave differently. These differences will be noted in the chapter for each language.
[3] Other languages may have similar functions, or they may use the return value of the function as output. See the chapter on each language.
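Incidentally, the whole map-shuffle-reduce cycle can be emulated outside WMR, much as a test job does internally. The following stand-alone sketch is hypothetical: the stub Wmr class below is not part of WebMapReduce (the real one is provided at run time); it simply wires the mapper and reducer above together so they can run locally:

# Hypothetical stand-alone harness emulating what a WMR test job does.
# The stub Wmr class is NOT part of WebMapReduce; it only collects
# emitted pairs so the mapper and reducer can run outside WMR.
class Wmr:
    pairs = []

    @staticmethod
    def emit(key, value):
        Wmr.pairs.append((key, value))

def mapper(key, val):
    for word in key.split():
        Wmr.emit(word, '1')

def reducer(key, vals):
    total = 0
    for val in vals:
        total += int(val)
    Wmr.emit(key, str(total))

# Map phase: each input line is passed to the mapper as its key.
for line in ['one fish', 'two fish']:
    mapper(line, '')

# Shuffle: group the mapper's output values by key.
grouped = {}
for key, value in Wmr.pairs:
    grouped.setdefault(key, []).append(value)

# Reduce phase: one call per distinct key, in sorted order.
Wmr.pairs = []
for key in sorted(grouped):
    reducer(key, grouped[key])

# Print the final tab-separated counts.
for key, value in Wmr.pairs:
    print(key, value, sep='\t')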
Now we can submit the job. To make sure it is written correctly, we will first submit it as a test job by clicking Test.
The result screen for test jobs appears immediately, and results appear as soon as the test is completed. Test jobs have a much faster turnaround than distributed jobs, and you can see any error output (if available) in addition to the regular output of your job. This makes test jobs a helpful tool for debugging map-reduce programs. As seen below, the test job run for the example succeeded, and the output is correct:
Under Mapper, we see each key-value pair that was output from our mapper (keys separated from values by tabs), and under Reducer, we see the final, sorted result of our job.
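For instance, if the input had been the single line “one fish two fish”, the Reducer section would show:

fish 2
one 1
two 1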
Once jobs have been tested, they can be run on the cluster, where there is much greater computing power and no limit on input size. Click the Resubmit Job link to reload the job form. Notice that the input has been changed to a Cluster Path: when a job is submitted (or saved), any external input is uploaded to the DFS, and WebMapReduce automatically locates and uses this data when the configuration is reused.
To submit the job as a cluster job, click the Submit button at the bottom of the form.
If the job is submitted without errors, you will be redirected to the monitoring page. This page provides progress information about the running job and automatically refreshes to update the information. Additionally, you can cancel the job by clicking Kill Job if it is taking too long, not working properly, stuck in an infinite loop, etc.
Important
Before you decide to kill a job, be patient! Sometimes, running a job on a full-blown cluster can take significant set-up time. Also, other users may be running their own jobs which have higher priority than yours, causing your job to be queued until more resources are available.
However, if you see that your job progresses for a while and then stalls for a significant amount of time (or sometimes goes backwards and retries many times), your program may be stuck in an infinite loop. Kill the job, go back, and try to debug it using test jobs or other tools. If that doesn't help, there may be a problem with the cluster; contact your administrator.
There can be a significant amount of overhead between the map phase and reduce phase of a map-reduce job, so there may be a period of time during which the mapper progress will be at 100%, and the reducer progress will remain at 0%. This is normal behavior, especially for small jobs, which may spend more time being distributed over the cluster than they spend actually running. Similarly, the mapper and reducer progress may remain at 100% while the output is being collected and the job finalized.
If the map-reduce system cannot run your reducer, the job will fail and you will be given an error message. If this happens, go back and try to debug your job using test jobs. If your job works when testing but fails on the cluster, contact your administrator.
Once the job is finished, the output is collected from the DFS and the user is given the option to use the output as the input for a new job. The output is displayed on-page, and is split into multiple pages if it exceeds a certain size.
For any questions related to the use of WebMapReduce, please see the WMR help and discussion forum at:
https://sourceforge.net/apps/phpbb/webmapreduce
For documentation for each language, please see subsequent chapters of this User Guide.