Chapter 3. Administration
If you are using WebMapReduce in a class, it is a good idea to provide a few data sets that your students can use as input for their jobs. Medium and large data sets can be stored on the HDFS in a known location. Then, users can enter the path to the data in the Distributed File System field as input for their job.
If these data sets will only be used by WebMapReduce, put them underneath the directory wmr.dfs.home, which defaults to the WebMapReduce user's home directory on the DFS. Users can then specify relative paths, which are resolved against this directory.
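The resolution rule described above can be sketched as follows. This is an illustrative model, not WebMapReduce's actual code; the name WMR_DFS_HOME and the function resolve_input_path are hypothetical.

```python
# Sketch of how a job-input path might be resolved against wmr.dfs.home.
# WMR_DFS_HOME and resolve_input_path are illustrative names, not WMR APIs.
from posixpath import isabs, join, normpath

WMR_DFS_HOME = "/user/wmr"  # example value of wmr.dfs.home

def resolve_input_path(user_path: str) -> str:
    """Absolute paths are used as-is; relative paths are resolved
    against the wmr.dfs.home directory."""
    if isabs(user_path):
        return normpath(user_path)
    return normpath(join(WMR_DFS_HOME, user_path))
```

With this rule, a user who types books/war-and-peace.txt gets /user/wmr/books/war-and-peace.txt, while /data/wiki/snapshot.txt is used unchanged.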
For example, suppose the WMRServer daemon runs under the account wmr, and you upload a file named war-and-peace.txt using these commands:
$ hadoop fs -mkdir /user/wmr/books
$ hadoop fs -put war-and-peace.txt /user/wmr/books
WebMapReduce users can then use this data by specifying the relative path books/war-and-peace.txt as the path to their job input.
Otherwise, as long as the setting wmr.input.containment.enforce is set to false (the default), users can access data from anywhere on the DFS. A file stored under /data/wiki/snapshot.txt, for example, can be used by specifying its absolute path.
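If your deployment restricts input to the wmr.dfs.home tree, the setting would be flipped to true. Assuming a Hadoop-style XML configuration file (the exact file and format for your WebMapReduce installation may differ), the entry might look like:

```xml
<!-- Restrict job input to paths under wmr.dfs.home.
     File location and format are deployment-specific assumptions. -->
<property>
  <name>wmr.input.containment.enforce</name>
  <value>true</value>
</property>
```

With enforcement enabled, absolute paths outside the WebMapReduce home directory would be rejected as job input.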
Here are some websites that offer raw, downloadable data that might be interesting to process with map-reduce:
3.1.2. Data Sets from Examples
The data sets used in the examples in this guide are available for download from the SourceForge website. Descriptions of the data sets and instructions for their use are contained in the package.