3. Cluster Administration

3.1. Logging In

While a cluster can be started and stopped using only the AWS Web Console, you will need to log in to the master node of your cluster through SSH to perform some administration tasks, like uploading data sets to the HDFS.

If you are familiar with SSH, here are the details:

  • User: ubuntu (this account has sudo privileges)
  • Host: The Public DNS associated with your master node on EC2
  • You must connect with the keypair you created and downloaded when you started the master instance. You cannot log in with a password.

The exact process for logging in depends on whether you are running a Unix-like operating system (e.g., Linux or Mac OS X) or Windows.

3.1.1. From Unix/Linux/OS X

Open a shell and issue this command:

$ ssh -i <keypair> ubuntu@<ec2-public-hostname>
The authenticity of host '<ec2-public-hostname>' can't be established.
RSA key fingerprint is [...].
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '<ec2-public-hostname>' (RSA) to the list of known hosts.

keypair is the filename of the keypair you created and downloaded when setting up the master instance; it should have the extension *.pem. ec2-public-hostname is that instance’s Public DNS.

You can log out at any time by typing the exit command.
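On many systems, SSH will refuse a private key whose permissions are too open ("WARNING: UNPROTECTED PRIVATE KEY FILE!"). A minimal sketch of the fix, using a hypothetical key filename mycluster.pem (the touch line only creates a stand-in file so the sketch runs anywhere):

```shell
# mycluster.pem is a placeholder -- use the *.pem keypair file you downloaded from AWS.
touch mycluster.pem        # stand-in for the downloaded key, for illustration only
chmod 400 mycluster.pem    # owner read-only; SSH refuses keys with looser permissions

# With permissions fixed, the login command from above works:
#   ssh -i mycluster.pem ubuntu@<ec2-public-hostname>
```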

3.1.2. From Windows

On Windows, you will need to install an SSH client such as PuTTY. Download the latest package and install it.


Note: You will need the whole suite of programs, not just an individual executable like putty.exe. Look for a package named putty-version-installer.exe or putty.zip.

First, we will need to take the keypair you created and downloaded when setting up the master instance and convert it to a format that PuTTY understands natively.

  1. Open the program PuTTYgen.
  2. In the menu, go to Conversions ‣ Import Key.
  3. Choose the keypair file (with the extension *.pem) that you downloaded from AWS.
  4. Set a password for the private key file in the Key Passphrase box.
  5. Click the Save Private Key button in the main screen and save the new key. It should have a *.ppk extension.

Now we can log into the master instance using the main PuTTY program:

  1. Open PuTTY. The program will start with a screen that lets you configure the connection.
  2. In the Session category, enter the Public Hostname of your EC2 master node in the Host Name box.
  3. In the Connection ‣ SSH ‣ Auth category, click Browse next to the Private Key for Authentication box and choose your private key file (the one ending in *.ppk).
  4. Click Open to connect.
  5. Enter ubuntu for the username.

You should now have a shell on your master node on EC2.

3.2. Uploading Data Sets

If you have data on your machine that you would like to use with WebMapReduce, you first need to use scp (Unix) or PSCP (Windows) to copy the data to your instance. Then, you must upload your data to the cluster’s Distributed File System.


Note: If you are uploading a particularly large data set (on the order of several gigabytes), first check that you have enough disk space on your master node, and be aware that Amazon charges for data transfers; the cost of a large transfer may be significant. Check the EC2 pricing page for details.

3.2.1. Copying from Unix/Linux/OS X

The easiest way to copy data to your cluster is with the scp command. The syntax of the command is similar to ssh:

$ scp [-r] -i <keypair> <source> ubuntu@<ec2-public-hostname>:<dest>

keypair and ec2-public-hostname are the same as above. source is the path to the local file or directory you want to copy, and dest is the destination path on the EC2 node. Use the -r option if you are copying a directory.
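For example, to copy a local directory of text files into the ubuntu user's home directory on the master node (the names books/ and mycluster.pem are hypothetical; substitute your own):

```shell
# -r is required because books/ is a directory, not a single file.
scp -r -i mycluster.pem books/ ubuntu@<ec2-public-hostname>:books/
```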

Now proceed to Uploading to the HDFS.

3.2.2. Copying from Windows

The PSCP command is only available through the command line. Run cmd.exe and, at the prompt, add the PuTTY directory to your search path:

C:\> set PATH=%PATH%;"C:\Program Files\PuTTY"

Now you can use the PSCP command to transfer your files:

C:\> pscp [-r] -i <key.ppk> <source> ubuntu@<ec2-public-hostname>:<dest>

key.ppk is the path to the private key you exported and saved when first logging in. ec2-public-hostname is the Public Hostname of your master node. source is the local file or folder you want to copy, and dest is the target filename on the EC2 node. Use the -r option if you are copying a directory.

Now proceed to Uploading to the HDFS.

3.2.3. Uploading to the HDFS

Now, the data set needs to be put on the HDFS. The best location for data sets that will be used only through WebMapReduce is /user/wmr, since anything in this directory can be specified with a relative path in WMR.

When logged in to your master node (see Logging In), use the following command to put data onto the HDFS:

$ hadoop fs -put <local-path> <hdfs-path>

local-path is the name of the file or directory you want to upload, and hdfs-path is the name it will have on the HDFS.
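As a sketch, assuming a file named books.txt sits in your home directory on the master node (the filename is hypothetical), you could upload it to the WMR data directory and then confirm the result:

```shell
# Upload into /user/wmr so WMR jobs can refer to it simply as "books.txt".
hadoop fs -put books.txt /user/wmr/books.txt

# List the directory to confirm the upload.
hadoop fs -ls /user/wmr
```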

3.3. Copying Datasets from S3

For convenience, we have also made some datasets available using Amazon Simple Storage Service (S3). These files can be copied quickly and directly to your cluster, rather than having to be downloaded to your machine and uploaded from there.

First, we have to retrieve some access credentials for your AWS account:

  1. Visit http://aws.amazon.com/account/ and log in.
  2. Click on the Security Credentials link.
  3. Find the section called Access Credentials.
  4. Click on the Access Keys tab.
  5. Find your Access Key ID and Secret Access Key in the table. Copy these down for later.

Next, we will add a configuration file to your cluster that will allow you to access S3:

  1. Log in using the procedure above.

  2. Using the nano editor (or any editor you are familiar with), create a file named s3.conf:

    ubuntu@<...>$ nano s3.conf

    Add to it the following contents:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>ID</value>
      </property>
      <property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>SECRET</value>
      </property>
    </configuration>

    Replace ID with your Access Key ID and SECRET with the Secret Access Key you found earlier.

    In nano, press Control-x to exit, following the prompts at the bottom to save.

Finally, we can copy the data:

$ hadoop distcp -conf s3.conf s3n://webmapreduce-datasets/ /data

This command will run a map-reduce job that copies all of the data from the webmapreduce-datasets bucket on S3 to the filesystem on your cluster, under the /data folder. The progress of the job is shown on the command line.
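When the distcp job finishes, a listing shows what arrived (a sketch; the directory names under /data depend on the contents of the bucket):

```shell
# List the copied datasets on the cluster's HDFS.
hadoop fs -ls /data
```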