3.1. Data Sets
If you are using WebMapReduce in a class, it is a good idea to provide a few
data sets that your students can use as input for their jobs. Medium and large
data sets can be stored in a known location on HDFS. Users can then enter
the path to the data in the Distributed File System field as the input
for their job.
If these data sets will only be used by WebMapReduce, put them under the
directory named by the wmr.dfs.home setting, which defaults to the
WebMapReduce user’s home directory on the DFS. Users can then specify relative
paths, which are resolved against this directory.
For example, if the WebMapReduce backend daemon runs under the account wmr,
and you upload a file named war-and-peace.txt using these commands:
$ hadoop fs -mkdir /user/wmr/books
$ hadoop fs -put war-and-peace.txt /user/wmr/books
Then WebMapReduce users can use this data by specifying the relative path
books/war-and-peace.txt as the path to their job input.
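If you need to upload several data sets, the commands above can be wrapped in a small helper. The following is a minimal sketch, not part of WebMapReduce: the function name upload_datasets and the HADOOP override variable are hypothetical, and it assumes the target directory does not exist yet (in which case hadoop fs -mkdir may fail). Setting HADOOP="echo hadoop" prints the commands instead of running them.

```shell
# Hypothetical helper (not part of WebMapReduce): create a DFS directory,
# upload one or more local files into it, then list the result.
# Usage: upload_datasets DFS_DIR FILE...
upload_datasets() {
    dfs_dir="$1"
    shift
    # ${HADOOP:-hadoop} runs the real hadoop command unless HADOOP is set.
    ${HADOOP:-hadoop} fs -mkdir "$dfs_dir" || return 1
    for f in "$@"; do
        ${HADOOP:-hadoop} fs -put "$f" "$dfs_dir" || return 1
    done
    ${HADOOP:-hadoop} fs -ls "$dfs_dir"
}

# Dry run: print the commands instead of executing them.
HADOOP="echo hadoop" upload_datasets /user/wmr/books war-and-peace.txt
```

Run without the HADOOP override, the same call executes the commands against the cluster.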
Otherwise, as long as the setting wmr.input.containment.enforce is set to
false (the default), users can access data from anywhere on the DFS. A file
stored at /data/wiki/snapshot.txt, for example, can be used by
specifying its absolute path.
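The difference between relative and absolute input paths can be sketched as follows. This is an illustration only and not WebMapReduce's actual code: the function resolve_input_path is hypothetical, and it assumes that a relative path is simply joined to the wmr.dfs.home directory while an absolute path is used as given.

```shell
# Illustrative only: mimics how an input path could be resolved.
# resolve_input_path is a hypothetical name, not a WebMapReduce command.
resolve_input_path() {
    dfs_home="$1"     # e.g. the value of wmr.dfs.home
    input_path="$2"   # path entered in the Distributed File System field
    case "$input_path" in
        # Absolute path: used as given.
        /*) printf '%s\n' "$input_path" ;;
        # Relative path: joined to the DFS home (any trailing slash removed).
        *)  printf '%s/%s\n' "${dfs_home%/}" "$input_path" ;;
    esac
}

resolve_input_path /user/wmr books/war-and-peace.txt   # prints /user/wmr/books/war-and-peace.txt
resolve_input_path /user/wmr /data/wiki/snapshot.txt   # prints /data/wiki/snapshot.txt
```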
3.1.1. Sources of Data
Here are some websites that offer raw, downloadable data that might be
interesting to process with map-reduce:
- Wikipedia
- Full dumps of English Wikipedia are available in many forms from
http://download.wikimedia.org. One of the easiest to process is the Static
HTML dump, which contains only the “current” (as of the latest snapshot)
pages as they are delivered to browsers, rather than in raw WikiText/SQL.
This is available at:
http://static.wikipedia.org/downloads/current/en/wikipedia-en-html.tar.7z
Note that this file is 15 GB compressed, and nearly 230 GB uncompressed!
- Project Gutenberg
- This site provides free public domain (and a few copyrighted) eBooks. You can
download individual books in plain text format (ASCII is preferred, since
some programming languages have spotty Unicode support), or download a CD or
DVD release at:
http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project
- Data.gov
- This is a centralized source for many U.S. government reports, census data,
etc. Data can be downloaded through the Raw Data Catalog, where you can
search by format, topic, and agency. Some individual agencies (the Census
Bureau and Federal Election Commission, for example) provide more options on
their individual websites.
- NCBI
- The National Center for Biotechnology Information houses a huge quantity of
biological and biochemical data, including GenBank, the de facto standard
source of genome sequences. Snapshots of much of this data can be downloaded
through various FTP sites, which are described at
http://www.ncbi.nlm.nih.gov/guide/data-software/.
- GroupLens
- This research group at the University of Minnesota offers data they have
collected on recommendations for movies, books, etc. This data is available
at http://www.grouplens.org/node/12.
3.1.2. Data Sets from Examples
Earlier versions of the data sets used in the examples in this guide are available for download from the SourceForge website. They have not been tested with the current version, but may work with it. The package contains descriptions of the data sets and instructions for their use.