2. Installation

2.1. Requirements

As explained in Architecture, WebMapReduce is split into two components. We will discuss the requirements for each component separately.

2.1.1. Frontend

The WebMapReduce frontend requires the following:

Footnotes

[1] Technically, any webserver with WSGI support will work. Apache is just the officially-supported option. Consult the Django documentation for instructions relating to different webservers.
[2] We plan to backport to Python 2.4 soon. Django is the main limitation for Python 3 support. Once Django adds Python 3 support (as of this writing, the Django developers have a rough roadmap), WebMapReduce will trivially support Python 3.
[3] A small modification to enable WebMapReduce to work with Django 1.2.4 or above is available upon request.
[4] The Apache Thrift libraries, version 0.7.0, are included with the WebMapReduce package. However, the thrift.protocol.fastbinary compiled Python module, which may improve performance, is not included. Download and compile Apache Thrift yourself to enable this module.

2.1.2. Backend

The WebMapReduce backend requires the following:

  • Apache Hadoop 0.20.x, or an equivalent distribution such as Cloudera Distribution for Hadoop version 2 or 3.
  • A POSIX-compatible OS (e.g., Linux, Mac OS X, BSD, Solaris) or support layer (e.g., Cygwin)
  • Java SE 6 or above
  • Recommended: sudo or an equivalent system that allows unprivileged users to execute commands under other (unprivileged) accounts. (Some OSs have this support built in.)

Additionally, each language that you want to use must have its compiler or interpreter installed on all Hadoop nodes. For example, C and C++ require GNU make and GCC, Java requires the JDK, and C# requires Mono. Interpreted languages (Python, Scheme, etc.) require their interpreters. See Language Configuration for more details.

2.2. Backend Installation

This section will guide you through installing the WebMapReduce backend.

2.2.1. Download and Unpack

Download the WebMapReduce distribution from the SourceForge website and unpack it to an appropriate directory (e.g., /opt). You should see a subdirectory named WMR-backend-version. For the rest of this guide, we will refer to this directory as $WMR_HOME. Replace this variable with the appropriate path whenever you see it in these instructions.

Example Installation

$ tar -xzf WebMapReduce-backend-*.tar.gz -C /opt
$ cd /opt
$ ls
WMR-backend-<version>/

In this example, $WMR_HOME would be /opt/WMR-backend-version.

2.2.2. Create Account & Set Permissions

Good security practice dictates that server processes should be run under separate, limited accounts. WebMapReduce is no exception. In fact, because it allows users to submit and run programs on your server, keeping its privileges limited is especially important. For this reason, we highly recommend setting up a special account for the WebMapReduce backend daemon.

Security Goals

Whatever you choose to do, keep two considerations in mind:

  • Some information used to configure WebMapReduce should be kept confidential (SSL keys and passwords, for instance). Files containing this information should only be readable by WebMapReduce.

  • Normal jobs [5] submitted through WebMapReduce are processed and run by Hadoop. If the Hadoop daemons have the same privileges as WebMapReduce, users will be able to write jobs that read this confidential information. Do not run Hadoop and WebMapReduce under the same account.

    Warning

    Never run the Hadoop (or WebMapReduce) daemons as root or any other administrator account. If you do, jobs will run with administrator privileges, potentially granting unlimited access to anyone.

    Footnotes

    [5] Test jobs are a special exception: they are run directly by the WebMapReduce daemon. Securing these jobs is more complicated (they are disabled by default for this reason), but a good approach is explained later.

The details for setting up an account and setting permissions vary from system to system. Aim for the following:

  1. Create an account named wmr, granting it as few permissions as possible. On systems where this is applicable, make it a “system” account. On many Unix systems, the appropriate command would be:

    $ useradd -r wmr
    
  2. Set the permissions on the $WMR_HOME/conf directory to be readable only by the wmr account and optionally any appropriate administrator accounts. The standard Unix commands to do this are the following:

    $ chown -R wmr $WMR_HOME/conf
    $ chmod 500 $WMR_HOME/conf
    

    To additionally allow read/write access to an administrator group:

    $ chgrp -R <groupname> $WMR_HOME/conf
    $ chmod 570 $WMR_HOME/conf
    

    Where groupname is your administrator group.

  3. Do the same for the $WMR_HOME/db directory. For example:

    $ chown wmr $WMR_HOME/db
    $ chmod 700 $WMR_HOME/db
    
  4. Ensure that the wmr user can read and write to the Hadoop logs directory, which is usually linked to by $HADOOP_HOME/logs and found at /var/log/hadoop-<version>.
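The permission settings from the steps above can be spot-checked with a short script. The following is a minimal sketch in Python and is not part of WebMapReduce; the helper names mode_of and owner_only, and the fallback path /opt/WMR-backend, are assumptions made for illustration only.

```python
import os
import stat


def mode_of(path):
    """Return the permission bits of path as an octal string, e.g. '700'."""
    return format(stat.S_IMODE(os.stat(path).st_mode), "o")


def owner_only(path):
    """True if neither the group nor others have any access to path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


if __name__ == "__main__":
    # Hypothetical fallback location; substitute your own $WMR_HOME.
    wmr_home = os.environ.get("WMR_HOME", "/opt/WMR-backend")
    for sub in ("conf", "db"):
        path = os.path.join(wmr_home, sub)
        if os.path.isdir(path):
            status = "owner-only" if owner_only(path) else "shared"
            print(path, mode_of(path), status)
```

If you granted access to an administrator group, the conf directory will be reported as shared, which is expected in that case.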

2.2.3. Configure

Before WebMapReduce will function, some basic settings must be configured. The relevant configuration is split into two parts: the main configuration and the supported languages configuration.

2.2.3.1. Main Configuration

For its main configuration, WebMapReduce uses a scheme similar to Hadoop’s:

  • Default settings are stored in a file named wmr-default.xml.
  • Site-specific settings are stored in wmr-site.xml.

Both of these files are usually stored in the $WMR_HOME/conf directory, and they have the exact same format and features as Hadoop’s configuration files.

First, make sure that the file wmr-site.xml exists in the $WMR_HOME/conf directory. If it is not present, create it by copying the example file wmr-site.xml.example:

$ cp $WMR_HOME/conf/wmr-site.xml.example $WMR_HOME/conf/wmr-site.xml

Next, if the defaults (localhost and 50100, respectively) will not work for your site, set the hostname and port to which the server should bind. Open wmr-site.xml and add these two stanzas inside the <configuration> tag:

<property>
  <name>wmr.server.bind.host</name>
  <value>HOSTNAME-OR-IP</value>
</property>
<property>
  <name>wmr.server.bind.port</name>
  <value>PORT</value>
</property>

Replace HOSTNAME-OR-IP and PORT with the appropriate values.

Note

On many systems, binding to localhost will not allow the server to accept outside connections. If you plan to run your frontend on another machine, setting wmr.server.bind.host to 0.0.0.0 is the easiest way to allow it to connect.
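To confirm what a site file actually contains after editing, the name/value pairs can be read back programmatically. The sketch below assumes only the <property>/<name>/<value> layout shown above; the function name read_properties and the sample values are illustrative, and the code does not merge wmr-default.xml or expand variables the way the real configuration loader may.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Map each <property>'s <name> to its <value> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Example content only; the values are placeholders for illustration.
SAMPLE = """<configuration>
  <property>
    <name>wmr.server.bind.host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>wmr.server.bind.port</name>
    <value>50100</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    settings = read_properties(SAMPLE)
    print(settings["wmr.server.bind.host"], settings["wmr.server.bind.port"])
```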

Those settings should be enough for a first run of WebMapReduce. Other settings that might be of interest at this point are wmr.dfs.home and wmr.temp.dir. You may wish to change the arbitrary, internally used database password in wmr.db.password to your own random string. The settings beginning with wmr.quota. configure the number of jobs a user may submit at a particular time. Look at wmr-default.xml for a full list of possible settings and their descriptions. Some of these will be covered in later sections.

2.2.3.2. Language Configuration

Next, we will configure the language support for WebMapReduce. The language configuration is stored in a separate INI-like file called languages.conf, usually in the $WMR_HOME/lang-support directory.

If languages.conf does not exist, copy it from the example file:

$ cp $WMR_HOME/lang-support/languages.conf.example $WMR_HOME/lang-support/languages.conf

Each supported language is denoted by a section similar to the following:

[python3]
interpreter = /usr/bin/python3
library = python3
extension = py

You should not have to worry about many of the details of this file at this point. Just check the following:

  1. Comment out (or remove) any languages that you do not want to support.
  2. Make sure the programs listed in the remaining interpreter lines can be executed as named on all of your Hadoop nodes. You will need to provide absolute paths.

See the comments in the languages.conf.example file for more information on configuring the WebMapReduce languages.
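The second check in the list above can be automated on each node. This is a rough sketch that assumes languages.conf is close enough to INI syntax for Python's configparser to read it, and that each interpreter value is a bare path with no arguments; neither assumption is confirmed by the WebMapReduce documentation. The function name check_interpreters is hypothetical.

```python
import configparser
import os


def check_interpreters(conf_text):
    """Return {language: problem} for interpreter entries that look wrong."""
    # Interpolation is disabled so a literal '%' in a value is not an error.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    problems = {}
    for lang in parser.sections():
        interp = parser.get(lang, "interpreter", fallback=None)
        if interp is None:
            continue  # section has no interpreter line; nothing to check
        if not os.path.isabs(interp):
            problems[lang] = "not an absolute path"
        elif not os.access(interp, os.X_OK):
            problems[lang] = "not executable on this node"
    return problems


if __name__ == "__main__":
    sample = "[python3]\ninterpreter = /usr/bin/python3\nextension = py\n"
    print(check_interpreters(sample))
```

An empty result means every listed interpreter is an absolute path that the current user can execute on the node where the script ran.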

2.2.4. Start & Stop

Now we will start the WebMapReduce daemon for the first time. Similar to Hadoop itself, WebMapReduce is controlled by scripts named start-wmr.sh and stop-wmr.sh, which are found in the $WMR_HOME/bin directory.

Because these scripts actually make use of Hadoop’s startup scripts, they must be able to find the Hadoop installation. As a result, you must either:

  • put the $HADOOP_HOME/bin directory on the system $PATH, OR
  • set the environment variable $HADOOP_HOME.

You must also start Hadoop before starting WebMapReduce (not doing so will give an error).

Note

If you are using Cloudera Distribution for Hadoop, having the hadoop command on the $PATH is not enough. You will have to set $HADOOP_HOME to /usr/lib/hadoop-0.20.

To start the daemon, run the start-wmr.sh script (be sure to run it as the wmr user):

$ sudo -u wmr /bin/bash  # Open a shell as wmr -- this command varies
$ cd $WMR_HOME
$ bin/start-wmr.sh
starting edu.stolaf.cs.wmrserver.ThriftServer, logging to
 $HADOOP_HOME/logs/hadoop-user-edu.stolaf.cs.wmrserver.ThriftServer-hostname.out

If all goes well, the command should produce output similar to that shown above. Especially when you first set up the site, however, check the log file listed in the output for any errors (wait 10 seconds before doing so to allow it to be updated).
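Besides reading the log, you can check whether anything is listening on the configured port. The sketch below only tests that a TCP connection succeeds; it does not speak the Thrift protocol, so it cannot tell WebMapReduce apart from another service on the same port. The function name port_open is an illustrative assumption.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 50100 is the default port named earlier; adjust if you changed it.
    print("backend reachable:", port_open("localhost", 50100))
```

Running the same check from the frontend machine, with the backend's hostname, is a quick way to see whether the bind address allows outside connections.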

At this point, WebMapReduce should be running successfully. Until you have set up a frontend, however, it is not very useful. Go ahead and stop the daemon:

$ bin/stop-wmr.sh
stopping edu.stolaf.cs.wmrserver.EmbeddedServer
$ exit  # Exit the wmr shell

2.2.5. Enable Test Jobs (Optional)

Test jobs allow users to run small jobs without submitting them to Hadoop. WebMapReduce mimics the behavior of Hadoop Streaming while running the mapper and reducer programs, which allows the daemon to report errors and view intermediate output, providing a useful debugging facility.

The disadvantage to this scheme is that, since the user’s mapper and reducer are run directly by the WebMapReduce daemon, they inherit its privileges and can potentially view confidential information. For this reason, test jobs are disabled by default.

WebMapReduce provides a way to secure test jobs by setting a command that the daemon can use to run test jobs under a different, unprivileged user. On most Unix-like systems, the easiest way to achieve this is through the sudo package. Instructions relating to sudo are provided; you will have to adapt these instructions for other systems.

2.2.5.1. Sudo Configuration

With the sudo package installed, open the sudo configuration for editing using the visudo command:

$ export EDITOR="nano" # or your favorite command-line text editor
$ visudo

We want to allow WebMapReduce to execute arbitrary commands under a specific, unprivileged account without providing a password. Many systems provide a nobody account with the least privileges possible; we will use this in the documentation. If your system does not have one of these accounts, replace nobody with another suitable unprivileged account.

Add the following line to the sudo configuration:

wmr ALL=(nobody) NOPASSWD: ALL

Note

Be aware that this allows any process or user running under the wmr account to run any command under the nobody account. This shouldn’t be a serious problem as long as you don’t use these accounts for other purposes.

You may also need to add the following setting, which allows non-interactive sessions to use sudo:

Defaults    !requiretty

Finally, quit visudo (by exiting the text editor) and run the following command under the wmr account to test your configuration:

$ sudo -nu nobody whoami
nobody

If this command gave the wrong output or prompted for a password, check your configuration. Also, make sure you are running it as wmr.

2.2.5.2. Backend Configuration

Now add the following directives to wmr-site.xml:

<property>
  <name>wmr.tests.allow</name>
  <value>true</value>
</property>
<property>
  <name>wmr.tests.su.cmd</name>
  <value>/usr/bin/sudo -Enu nobody ${cmd}</value>
</property>

WebMapReduce will now allow test jobs, which it will run by replacing the special variable ${cmd} in the wmr.tests.su.cmd property with the path to the mapper or reducer executable.
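To make the ${cmd} substitution concrete, here is a small illustration in Python. It mimics the described behavior with string.Template and is not the daemon's actual implementation, which may quote or split the command differently; the function name build_test_command and the example path are assumptions.

```python
import shlex
from string import Template

# The value configured for wmr.tests.su.cmd above.
SU_CMD = "/usr/bin/sudo -Enu nobody ${cmd}"


def build_test_command(executable):
    """Fill in ${cmd} and split the result into an argument list."""
    # Quote the path so that one containing spaces stays a single argument.
    filled = Template(SU_CMD).substitute(cmd=shlex.quote(executable))
    return shlex.split(filled)


if __name__ == "__main__":
    # Hypothetical mapper path, for illustration only.
    print(build_test_command("/tmp/job/mapper"))
```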

Restart the WebMapReduce daemon (as wmr) to reload the configuration:

$ cd $WMR_HOME
$ bin/stop-wmr.sh
$ bin/start-wmr.sh

2.3. Frontend Installation

This section will guide you through installing the web interface that makes up the frontend of WebMapReduce.

These instructions assume you have all the requirements listed in Frontend—namely, a working Apache webserver with mod_wsgi, and a Python installation with Django available on the $PYTHONPATH. You do not need to have any kind of Django deployment on your webserver; these instructions will show you how to do this for WebMapReduce.

2.3.1. Download and Unpack

Download the WebMapReduce frontend distribution from the SourceForge website and unpack it to an appropriate directory (e.g., /opt). Also, change its owner to the wmr user and ensure that its name does not have any periods (Python uses periods for namespaces).

Warning

Unlike static HTML files and PHP web apps, Python web apps should not be placed under your webserver’s DocumentRoot. Doing so is a security risk, because it would expose the application’s source code to the client. Instead, we configure Apache to serve Python apps from a different location through a WSGI gateway.

Example Installation

$ tar -xzf WebMapReduce-frontend-*.tar.gz -C /opt
$ cd /opt
$ ls
WMR-frontend-<version>/
$ sudo mv WMR-frontend-<version>/ WMR-frontend
$ sudo chown -R wmr WMR-frontend

2.3.2. Configure

Like the backend, configuration for the WebMapReduce frontend is stored in two files: default settings in settings.py, and local overrides in settings_local.py. You should edit the second file.

Both files simply contain a list of global variables and their values—mostly simple strings—so even if you are unfamiliar with Python, editing should not be difficult.

If the settings_local.py file does not exist already, use a copy of the example file, settings_local.py.example:

$ cp settings_local.py.example settings_local.py

2.3.2.1. Connection Settings

If your WebMapReduce backend is running on a different machine than the frontend—or if it is running on a custom port—we will need to tell the frontend where to find it. Open settings_local.py and uncomment the following values:

WMR_HOST = '<hostname>'
WMR_PORT = <port>

hostname is the hostname or IP of the machine running the backend (the default is localhost), and port is the value of wmr.server.bind.port in the backend configuration (50100 by default).

Note

The hostname you put here may need to match the value of wmr.server.bind.host in the backend configuration. If you experience connection errors, check that these match.

2.3.2.2. Interface Settings

Also included in settings_local.py are a few settings that determine what options are presented to the user.

The option WMR_ALLOW_TEST_JOBS is a boolean that determines whether an option should be provided to submit a job as a test job. If you enabled test jobs on the backend as described in Enable Test Jobs (Optional), set this to True:

WMR_ALLOW_TEST_JOBS = True

The option SECRET_KEY is used by Django as a seed to generate various random strings used by a running web server. Change it to your own long random string.
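One way to produce such a string is Python's secrets module. This is a minimal sketch; the helper name make_secret and the length of 48 random bytes are arbitrary choices, not requirements of Django or WebMapReduce. The secrets module needs Python 3.6 or later, so run this with any recent Python, which need not be the one serving the frontend.

```python
import secrets


def make_secret(nbytes=48):
    """Return a URL-safe random string drawn from the OS's secure source."""
    return secrets.token_urlsafe(nbytes)


if __name__ == "__main__":
    # Paste the printed value into settings_local.py as SECRET_KEY.
    print(make_secret())
```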

The options REGISTRATION_ENABLED and USE_TIMED_REGISTRATION_KEYS are booleans in settings_local.py that together determine whether and how anyone can register accounts. Both are true by default. The REQUIRE_REGISTRATION_KEY setting requires anyone registering their own account to enter a secret key. These keys are created in the admin interface at /wmr/admin/registration/registrationkey/add/, each with a time at which it will expire. To prevent anyone but administrators from adding accounts, even with a registration key, set REGISTRATION_ENABLED to False by uncommenting the following line in settings_local.py:

REGISTRATION_ENABLED=False

Next, we will configure which languages are presented as options for mappers and reducers. The list of languages is stored in the variable WMR_LANGUAGES as a list of tuples, like this:

WMR_LANGUAGES = [
    ('python3',  'Python 3'),
    ('scheme-i', 'Scheme (Imperative)'),
    ('scheme-f', 'Scheme (Functional)'),
    ('cpp',      'C++'),
    ('java',     'Java'),
]

Each tuple represents a language, where the value on the right is the name of the language as the user sees it, and the key on the left is the name of the language as it is known to the WebMapReduce backend—the name in languages.conf, as described in Language Configuration.

The default set of languages should already be listed in the file. Simply uncomment it, and remove any options that you also removed from the backend configuration.

Also, set the language that should be selected by default:

WMR_DEFAULT_LANGUAGE = '<language>'

where language is one of the machine-readable keys in the WMR_LANGUAGES list (i.e., python3, not Python 3).
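A common mistake is to put the display name in WMR_DEFAULT_LANGUAGE instead of the key. The check below is a small sketch of how the two settings relate; the function default_is_valid is hypothetical and is not part of the frontend, and the shortened language list is for illustration.

```python
# Example values in the same shape as settings_local.py.
WMR_LANGUAGES = [
    ('python3',  'Python 3'),
    ('scheme-i', 'Scheme (Imperative)'),
    ('cpp',      'C++'),
]
WMR_DEFAULT_LANGUAGE = 'python3'


def default_is_valid(languages, default):
    """True if default matches the left-hand key of some language tuple."""
    return default in {key for key, _label in languages}


if __name__ == "__main__":
    print(default_is_valid(WMR_LANGUAGES, WMR_DEFAULT_LANGUAGE))
```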

2.3.3. Setup Database

The frontend maintains a database of submissions and saved job configurations (unrelated to the backend database). To set up this database, run the following commands as the Apache user:

$ cd WMR_FRONTEND_HOME
$ python manage.py syncdb
$ python manage.py migrate wmr

Replace WMR_FRONTEND_HOME with the path where you unpacked the frontend (e.g., /opt/WMR-frontend).

The syncdb command will ask you whether you want to create a superuser account. This account can be used to manage other accounts and application data through a special administration interface in WebMapReduce accessible at the /admin sub-url of the WebMapReduce site. Set the account up using a strong password.

2.3.4. Configure Apache

Now, we need to set up your webserver with WebMapReduce. Add the following to an appropriate place in your Apache configuration (after other Directory stanzas or inside the appropriate VirtualHost stanza, for instance):

WSGIScriptAlias URL_PREFIX WMR_FRONTEND_HOME/wsgi/django.wsgi

<Directory WMR_FRONTEND_HOME/wsgi>
Order allow,deny
Allow from all
</Directory>

Replace URL_PREFIX with the sub-url that you would like WebMapReduce to occupy on your webserver (e.g., /wmr). If you would like WebMapReduce to operate from the root of your webserver, use /. WMR_FRONTEND_HOME is the filesystem path where you unpacked the WebMapReduce frontend (e.g., /opt/WMR-frontend).

Note

Make sure the folder WMR_FRONTEND_HOME/wsgi/ and its contents are accessible by the appropriate Apache user account on your system. (www, www-data and apache are common account names. If you are unsure, check your system documentation.) Also, ensure that the frontend database in WMR_FRONTEND_HOME/db is writeable by the Apache user account: this may be done by giving that directory to the Apache user’s group and enabling group write permissions for that directory.

Next, add this setting to your settings_local.py:

FORCE_SCRIPT_NAME='<URL_PREFIX>'

Replace URL_PREFIX with the value you used in the Apache configuration (without the angle brackets).

Finally, restart your Apache server to load the new configuration (some systems provide an apachectl or apache2ctl program to do this).