3.3. Uploading Data Sets

If you have data on your machine that you would like to use with WebMapReduce, first copy it to your instance with scp (Unix) or PSCP (Windows). Then upload it to the cluster's Distributed File System.

Important

If you are uploading a particularly large data set (on the order of several gigabytes), first check that your master node has enough free disk space. Also be aware that Amazon charges for data transfers, and the cost of a large transfer may be significant. Check the EC2 pricing page for details.
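As a rough sketch of the disk-space check, the two helper functions below report the size of a data set and the free space on a filesystem, both in kilobytes. The function names and the example paths are hypothetical and not part of WebMapReduce; they only wrap the standard du and df utilities.

```shell
# Size of a file or directory in kilobytes.
# Run this on your local machine against the data set you plan to copy.
needed_kb() {
    du -sk "$1" | awk '{print $1}'
}

# Free space, in kilobytes, on the filesystem that holds the given path.
# Run this on the master node against the destination directory.
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}
```

For example, compare the output of `needed_kb ~/data/mydataset` on your machine with the output of `free_kb ~` on the master node (both paths are placeholders) before starting a large transfer.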

3.3.1. Copying from Unix/Linux/OS X

The easiest way to copy data to your cluster is with the scp command. The syntax of the command is similar to ssh:
$ scp [-r] -i keypair source ubuntu@ec2-public-hostname:dest
keypair and ec2-public-hostname are the same as above. source is the path to the local file or directory you want to copy, and dest is the destination path on the EC2 node. Use the -r option if you are copying a directory.
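To illustrate the syntax, here is a minimal sketch of a wrapper that adds -r only when the source is a directory. The function name copy_to_node, the default key pair path, and the DRY_RUN switch are assumptions made for this example; substitute your own key pair file and your node's public hostname.

```shell
# Hypothetical defaults; replace with your key pair file and the node's public hostname.
KEYPAIR="${KEYPAIR:-$HOME/.ssh/my-keypair.pem}"
EC2_HOST="${EC2_HOST:-ec2-public-hostname}"

# Copy a local file or directory to the node as the ubuntu user.
# Set DRY_RUN=1 to print the scp command instead of running it.
copy_to_node() {
    src=$1
    dest=$2
    if [ -d "$src" ]; then
        # Directories need the -r (recursive) option.
        set -- scp -r -i "$KEYPAIR" "$src" "ubuntu@$EC2_HOST:$dest"
    else
        set -- scp -i "$KEYPAIR" "$src" "ubuntu@$EC2_HOST:$dest"
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

Running `DRY_RUN=1 copy_to_node ~/data/books wmr-data` (both paths are placeholders) prints the command that would be executed, which is a convenient way to check the arguments before starting a large transfer.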