**************
Administration
**************

.. _sect-Admin_Guide-Administration-Data_Sets:

Data Sets
=========

If you are using WebMapReduce in a class, it is a good idea to provide a few
data sets that your students can use as input for their jobs. Medium and large
data sets can be stored at a known location on the HDFS; users can then enter
the path to the data in the :guilabel:`Distributed File System` field when
submitting a job.

If these data sets will only be used by WebMapReduce, put them underneath the
directory given by the ``wmr.dfs.home`` setting, which defaults to the
WebMapReduce user's home directory on the DFS. Users can then specify relative
paths, which are resolved against this directory.

For example, if the WebMapReduce backend daemon runs under the account ``wmr``,
and you upload a file named :file:`war-and-peace.txt` using these commands:

.. code-block:: console

   $ # create a directory for shared data sets on the DFS
   $ hadoop fs -mkdir /user/wmr/books
   $ # upload the file into that directory
   $ hadoop fs -put war-and-peace.txt /user/wmr/books

then WebMapReduce users can use this data by specifying the relative path
:file:`books/war-and-peace.txt` as the path to their job input.
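
If you want to confirm where a relative path will resolve, you can list the
directory on the DFS yourself (this sketch assumes the default
``wmr.dfs.home``, i.e. the ``wmr`` account's DFS home directory):

.. code-block:: console

   $ # check that the file is where relative paths will find it
   $ hadoop fs -ls /user/wmr/books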

Data sets need not live under this directory, however. As long as the setting
``wmr.input.containment.enforce`` is set to false (the default), users can
access data from anywhere on the DFS. A file stored at
:file:`/data/wiki/snapshot.txt`, for example, can be used by specifying its
absolute path.
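
For instance, an administrator could stage that file at its absolute path as
follows (a sketch; the exact permissions needed depend on how your DFS is
configured, but the data must at least be readable by the ``wmr`` account):

.. code-block:: console

   $ # stage a shared data set at an absolute path on the DFS
   $ hadoop fs -mkdir /data/wiki
   $ hadoop fs -put snapshot.txt /data/wiki
   $ # make the data readable by all users, including the wmr account
   $ hadoop fs -chmod -R a+r /data/wiki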

Sources of Data
---------------

Here are some websites that offer raw, downloadable data that might be
interesting to process with map-reduce:

`Wikipedia`_
   Full dumps of English Wikipedia are available in many forms from
   http://download.wikimedia.org. One of the easiest to process is the Static
   HTML dump, which contains the "current" (as of the latest snapshot) version
   of each page as it is delivered to browsers, rather than as raw WikiText or
   SQL.
   This is available at:
   http://static.wikipedia.org/downloads/current/en/wikipedia-en-html.tar.7z
   Note that this file is 15 GB compressed, and nearly 230 GB uncompressed!

`Project Gutenberg`_
   This site provides free public domain (and a few copyrighted) eBooks. You can
   download individual books in plain text format (ASCII is preferred, since
   some programming languages have spotty Unicode support), or download a CD or
   DVD release at:
   http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project

`Data.gov`_
   This is a centralized source for many U.S. government reports, census data,
   etc. Data can be downloaded through the `Raw Data Catalog`_, where you can
   search by format, topic, and agency. Some individual agencies (the Census
   Bureau and Federal Election Commission, for example) provide more options on
   their individual websites.

`NCBI`_
   The National Center for Biotechnology Information houses a huge quantity of
   biological and biochemical data, including GenBank, the *de facto* standard
   source of genome sequences. Snapshots of much of this data can be downloaded
   through various FTP sites, which are described at
   http://www.ncbi.nlm.nih.gov/guide/data-software/.

`GroupLens`_
   This research group at the University of Minnesota offers data they have
   collected on recommendations for movies, books, etc. This data is available
   at http://www.grouplens.org/node/12.

.. _Wikipedia: http://en.wikipedia.org
.. _Project Gutenberg: http://www.gutenberg.org
.. _Data.gov: http://www.data.gov
.. _Raw Data Catalog: http://www.data.gov/catalog/raw
.. _NCBI: http://www.ncbi.nlm.nih.gov/
.. _GroupLens: http://www.grouplens.org/

Data Sets from Examples
-----------------------

An earlier version of the data sets used in the examples in this guide is
available for download from the SourceForge website. These data sets have not
been tested with the current version of WebMapReduce, but may still work with
it. Descriptions of the data sets and instructions for their use are included
in the package.