While a cluster can be started and stopped using only the AWS Web Console, you will need to log in to the master node of your cluster through SSH to perform some administration tasks, like uploading data sets to the HDFS.
If you are familiar with SSH, here are the details:
The exact process for logging in depends on whether you are running a Unix-like operating system (e.g., Linux or Mac OS X) or Windows.
Open a shell and issue this command:
$ ssh -i <keypair> ubuntu@<ec2-public-hostname>
The authenticity of host '<ec2-public-hostname>' can't be established.
RSA key fingerprint is [...].
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '<ec2-public-hostname>' (RSA) to the list of known hosts.
ubuntu@<ec2-private-hostname>$
keypair is the filename of the keypair you created and downloaded when setting up the master instance; it should have a .pem extension. ec2-public-hostname is that instance's Public DNS name, and ec2-private-hostname is its Private DNS name.
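For example, if the keypair were saved as wmr-cluster.pem and the master node's Public DNS were ec2-203-0-113-25.compute-1.amazonaws.com (both values here are purely illustrative; substitute your own), the command would be:
$ ssh -i wmr-cluster.pem ubuntu@ec2-203-0-113-25.compute-1.amazonaws.com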
You can log out at any time by typing the exit command.
On Windows, you will need to install an SSH client such as PuTTY. Download and install the latest package.
Note
You will need the whole suite of programs, not just individual executables like putty.exe. Look for a package named something like putty-version-installer.exe or putty.zip.
First, we need to convert the keypair you created and downloaded when setting up the master instance into PuTTY's native .ppk format, using the PuTTYgen tool included in the suite.
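In outline, the conversion works like this (the exact labels may differ slightly between PuTTY versions): start PuTTYgen, click Load and select your .pem file (you may need to change the file filter to All Files), then click Save private key to write out a .ppk file. This .ppk file is what PuTTY and PSCP will use below.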
Now we can log into the master instance using the main PuTTY program:
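Roughly (menu locations can vary between PuTTY versions): enter the master node's Public DNS as the Host Name, make sure the connection type is SSH, select the .ppk file you just saved under Connection > SSH > Auth, open the connection, and log in as ubuntu when prompted.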
You should now have a shell on your master node on EC2.
If you have data on your machine that you would like to use with WebMapReduce, you first need to use scp (Unix) or PSCP (Windows) to copy the data to your instance. Then, you must upload your data to the cluster's Hadoop Distributed File System (HDFS).
Important
If you are uploading a particularly large data set (on the order of several gigabytes), first check that you have enough disk space on your master node, and second, be aware that Amazon does charge for data transfers, and the cost of a large transfer may be significant. Check the EC2 pricing page for details.
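A quick way to check free space once you are logged in to the master node is the standard df command:
$ df -h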
The easiest way to copy data to your cluster is with the scp command. The syntax of the command is similar to ssh:
$ scp [-r] -i <keypair> <source> ubuntu@<ec2-public-hostname>:<dest>
keypair and ec2-public-hostname are the same as above. source is the path to the local file or directory you want to copy, and dest is the destination path on the EC2 node. Use the -r option if you are copying a directory.
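For instance, to copy a local directory named books into the ubuntu user's home directory on the master node (the file and host names here are only illustrative):
$ scp -r -i wmr-cluster.pem books ubuntu@ec2-203-0-113-25.compute-1.amazonaws.com:books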
Now proceed to Uploading to the HDFS.
The PSCP command is only available through the command line. Run cmd.exe and at the prompt, add the PuTTY directory to your search path:
C:\> set PATH=%PATH%;"C:\Program Files\PuTTY"
Now you can use the PSCP command to transfer your files:
C:\> pscp [-r] -i <key.ppk> <source> ubuntu@<ec2-public-hostname>:<dest>
key.ppk is the path to the private key you exported and saved when first logging in. ec2-public-hostname is the Public Hostname of your master node. source is the local file or folder you want to copy, and dest is the target filename on the EC2 node. Use the -r option if you are copying a directory.
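For example, again with purely illustrative names:
C:\> pscp -r -i wmr-cluster.ppk books ubuntu@ec2-203-0-113-25.compute-1.amazonaws.com:books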
Now proceed to Uploading to the HDFS.
Now, the data set needs to be put on the HDFS. The best location for data sets that will be used only through WebMapReduce is /user/wmr, since anything in this directory can be specified with a relative path in WMR.
When logged in to your master node (see Logging In), use the following command to put data onto the HDFS:
$ hadoop fs -put <local-path> <hdfs-path>
local-path is the name of the file or directory you want to upload, and hdfs-path is the name it will have on the HDFS.
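For example, to upload a file named books.txt (an illustrative name) into the /user/wmr directory suggested above, so that it can later be referred to in WMR simply as books.txt:
$ hadoop fs -put books.txt /user/wmr/books.txt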
For convenience, we have also made some datasets available using Amazon Simple Storage Service (S3). These files can be quickly copied directly to your cluster, rather than having to be downloaded to your machine and then uploaded from there.
First, we have to retrieve some access credentials for your AWS account:
Next, we will add a configuration file to your cluster that will allow you to access S3:
Log in using the procedure above.
Using the nano editor (or any editor you are familiar with), create a file named s3.conf:
ubuntu@<...>$ nano s3.conf
Add to it the following contents:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>SECRET</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>${fs.s3.awsAccessKeyId}</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>${fs.s3.awsSecretAccessKey}</value>
  </property>
</configuration>
Replace ID with the Access Key ID and SECRET with the Secret Access Key that you found earlier.
In nano, press Control-x to exit, following the prompts at the bottom to save.
Finally, we can copy the data:
$ hadoop distcp -conf s3.conf s3n://webmapreduce-datasets/ /data
This command will run a MapReduce job that copies all of the data from the webmapreduce-datasets bucket on S3 to the filesystem on your cluster, under the /data directory. The progress of the job will be shown on the command line.
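Once the job finishes, you can verify that the data arrived by listing the destination directory:
$ hadoop fs -ls /data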