Chapter 3. Administration
If you are using WebMapReduce in a class, it is a good idea to provide a few data sets that your students can use as input for their jobs. Medium and large data sets can be stored on the HDFS in a known location. Then, users can enter the path to the data in the Distributed File System field as input for their job.
If these data sets will only be used by WebMapReduce, put them underneath the directory wmr.dfs.home, which defaults to the WebMapReduce user's home directory on the DFS. Users can then specify relative paths, which are resolved against this directory.
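The resolution rule described above can be sketched as follows. This is an illustrative model, not WebMapReduce's actual code; the name WMR_DFS_HOME and the function resolve_input_path are hypothetical.

```python
# Sketch of how a job-input path might be resolved against wmr.dfs.home.
# WMR_DFS_HOME and resolve_input_path are illustrative names, not WMR APIs.
from posixpath import isabs, join, normpath

WMR_DFS_HOME = "/user/wmr"  # example value of wmr.dfs.home

def resolve_input_path(user_path: str) -> str:
    """Absolute paths are used as-is; relative paths are resolved
    against the wmr.dfs.home directory."""
    if isabs(user_path):
        return normpath(user_path)
    return normpath(join(WMR_DFS_HOME, user_path))
```

With this rule, a user who types books/war-and-peace.txt gets /user/wmr/books/war-and-peace.txt, while /data/wiki/snapshot.txt is used unchanged.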
For example, suppose the WMRServer daemon runs under the account wmr, and you upload a file named war-and-peace.txt using these commands:
$ hadoop fs -mkdir /user/wmr/books
$ hadoop fs -put war-and-peace.txt /user/wmr/books
WebMapReduce users can then use this data by specifying the relative path books/war-and-peace.txt as the path to their job input.
Otherwise, as long as the setting wmr.input.containment.enforce is set to false (the default), users can access data from anywhere on the DFS. A file stored under /data/wiki/snapshot.txt, for example, can be used by specifying its absolute path.
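If your deployment restricts input to the wmr.dfs.home tree, the setting would be flipped to true. Assuming a Hadoop-style XML configuration file (the exact file and format for your WebMapReduce installation may differ), the entry might look like:

```xml
<!-- Restrict job input to paths under wmr.dfs.home.
     File location and format are deployment-specific assumptions. -->
<property>
  <name>wmr.input.containment.enforce</name>
  <value>true</value>
</property>
```

With enforcement enabled, absolute paths outside the WebMapReduce home directory would be rejected as job input.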
Here are some websites that offer raw, downloadable data that might be interesting to process with map-reduce:
3.1.2. Data Sets from Examples
The data sets used in the examples in this guide are available for download from the SourceForge website. Descriptions of the data sets and instructions for their use are contained in the package.