The first step to set up a cluster with our AMIs is to sign up for EC2. If you have never used EC2 before, we strongly suggest reading Amazon’s Getting Started Guide. It shouldn’t take any longer than 10 minutes, and it makes the process in the following section much clearer. Rest assured, however, that we will still target this guide to users who are new to EC2.
Important
Remember that if you start EC2 nodes, Amazon will begin charging your account! Simply running through these instructions should cost no more than a few dollars, as long as you terminate the cluster afterward.
Quick Start
The basic process for starting a cluster on EC2 is shown below. If you are already familiar with EC2, this might be all you need to get going. Otherwise, the process will be explained step-by-step in the following sections.
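Set up two security groups: one for all cluster nodes that allows all TCP, UDP, and ICMP traffic between its members, and one for the master node only that allows HTTP, SSH, and TCP ports 50030 and 50070 from outside.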
Start a master node. After it starts, record its Public DNS, Private DNS, and Availability Zone.
Start any number of slave nodes in the same availability zone as the master node, passing the following line as User Data:
MASTER_HOST=<master's private DNS>
Replace <master's private DNS> with the Private DNS of your master node.
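The User Data line above can also be generated rather than typed by hand. The following is a minimal sketch, not part of the original guide: the function name `build_user_data` and the example hostname are hypothetical, and the only thing taken from the guide is the `MASTER_HOST=` format itself.

```python
def build_user_data(master_private_dns: str) -> str:
    """Return the User Data line that tells a slave node where its master is."""
    host = master_private_dns.strip()
    # Reject empty input and anything containing whitespace, since the
    # value must be a single hostname on one line.
    if not host or any(ch.isspace() for ch in host):
        raise ValueError("master private DNS must be a single non-empty hostname")
    return f"MASTER_HOST={host}"


# Hypothetical private DNS name, used for illustration only.
print(build_user_data("ip-10-0-0-12.ec2.internal"))
```

Stripping surrounding whitespace guards against a stray newline picked up when copying the value out of the console.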
Once the cluster is set up, you will want to log in to the master node and add user accounts, as described in the next chapter.
Before we launch the cluster itself, we will want to create security groups for it. These define firewall rules for the EC2 instances that make up the cluster. This is a one-time-only procedure: once you have created appropriate security groups, you should not have to recreate them when starting new clusters unless you would like to keep your groups separate.
First, we will create a security group for all cluster nodes:
Log in to the AWS Management Console if you have not already.
Click on Security Groups in the left-hand Navigation pane.
In the toolbar at the top, click Create Security Group. We will first create a group for all the machines in the cluster.
Back in the Security Groups panel, your new group should be selected. In the lower pane, you can view and edit settings for this security group. We will create three rules to allow all traffic between all the instances that are assigned to this security group:
Click on the Inbound tab.
In the drop-down box labeled Create a new rule, choose “All TCP”.
For Source, type the name of your security group (hadoop-cluster). As you type, a list of options should appear under the box, showing Security Group IDs and their matching human-readable names. For instance:
sg-c520f351 (default)
sg-6e5270a9 (hadoop-cluster)
Click on the option that matches your security group.
Click the Add Rule button. A table should appear to the right, listing your new rule.
Repeat steps b-d, choosing “All UDP” and “All ICMP” for the rule types.
Once you have added all three rules, click Apply Rule Changes.
You should now have the following three rule tables:
ICMP Port (Service) | Source | Action
---|---|---
ALL | sg-6e5270a9 | Delete

TCP Port (Service) | Source | Action
---|---|---
0-65535 | sg-6e5270a9 | Delete

UDP Port (Service) | Source | Action
---|---|---
0-65535 | sg-6e5270a9 | Delete
This will allow your cluster nodes to communicate freely with each other.
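For readers who script their setup instead of using the console, the three rules above can be written down as plain data. This is only a sketch under assumptions: the guide itself uses the console, the function name `cluster_ingress_rules` is hypothetical, and the dictionary layout is modeled on the `IpPermissions` shape accepted by boto3's `authorize_security_group_ingress`. The block only builds the data and makes no AWS calls.

```python
def cluster_ingress_rules(group_id: str) -> list:
    """Build allow-all TCP, UDP, and ICMP rules whose source is the group itself."""
    def rule(protocol: str, from_port: int, to_port: int) -> dict:
        return {
            "IpProtocol": protocol,
            "FromPort": from_port,
            "ToPort": to_port,
            # Source is a security group rather than a CIDR range.
            "UserIdGroupPairs": [{"GroupId": group_id}],
        }

    return [
        rule("tcp", 0, 65535),
        rule("udp", 0, 65535),
        # For ICMP, -1/-1 stands for all ICMP types and codes.
        rule("icmp", -1, -1),
    ]


# Example group ID taken from the tables above.
for r in cluster_ingress_rules("sg-6e5270a9"):
    print(r["IpProtocol"], r["FromPort"], r["ToPort"])
```

Because the source of every rule is the group's own ID, any instance later added to the group can reach the others without further changes.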
Next, we will create a security group for the master node only. This will allow outside computers to access the web interface and log in via SSH:
Click Create Security Group in the toolbar again.
Back in the Security Groups panel, select your new security group, click the Inbound tab in the bottom panel, and create the following four rules:
Type | Port Range | Source
---|---|---
HTTP | n/a | 0.0.0.0/0
SSH | n/a | 0.0.0.0/0
Custom TCP rule | 50030 | 0.0.0.0/0
Custom TCP rule | 50070 | 0.0.0.0/0
Don’t forget to press Apply Rule Changes when you are finished.
The last two rules are for accessing the web interfaces for the Hadoop JobTracker and NameNode, respectively.
In this case, Source uses CIDR notation to specify the range of hosts that are allowed to connect. The value 0.0.0.0/0 allows access from the entire internet. If you know the IP range of your school or intended users, enter it here instead for greater security.
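To see what a given CIDR range actually admits, it can help to test a few addresses before entering the range in the console. The sketch below uses Python's standard `ipaddress` module. The helper names and the example addresses (drawn from the reserved documentation ranges) are hypothetical, and ports 80 and 22 are the standard HTTP and SSH ports, which the console fills in for you.

```python
import ipaddress


def is_allowed(client_ip: str, source_cidr: str) -> bool:
    """Return True if client_ip falls inside the CIDR range source_cidr."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(source_cidr)


def master_ingress_rules(source_cidr: str = "0.0.0.0/0") -> list:
    """List the four master-group rules for a given source range."""
    # ip_network raises ValueError for a malformed range, for example
    # one with host bits set such as 203.0.113.5/24.
    network = str(ipaddress.ip_network(source_cidr))
    ports = [("HTTP", 80), ("SSH", 22), ("JobTracker", 50030), ("NameNode", 50070)]
    return [{"name": n, "protocol": "tcp", "port": p, "source": network} for n, p in ports]


print(is_allowed("198.51.100.7", "0.0.0.0/0"))        # open to the whole internet
print(is_allowed("198.51.100.7", "203.0.113.0/24"))   # outside the restricted range
print(is_allowed("203.0.113.42", "203.0.113.0/24"))   # inside the restricted range
```

Passing a narrower range to `master_ingress_rules` yields the same four rules with the tighter source, which mirrors entering your school's IP range in the Source column.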
Now we will start our master node, the instance that controls the rest of the nodes and the one with which users will interact directly.
Log in to the AWS Management Console.
Click on EC2 Dashboard in the left-hand Navigation pane of the AWS Management Console.
Click the Launch Instance button.
In the window that pops up:
Click the Community AMIs tab.
Enter the appropriate AMI ID (see WebMapReduce AMIs and Choosing an Instance Type) in the search box at the top.
The AMI should appear in the list below. Click the Select button to its right.
Note
Occasionally, you may have to wait a while for the list of AMIs to update after you enter the ID in the search box.
In the next pane:
In the next pane (which contains the heading Advanced Instance Options), leave all fields as their defaults. Click Continue.
The next pane allows you to add custom Tags to your instance. Feel free to enter a name you can remember into the Value column next to the “Name” key.
In the next pane:
In the next pane:
In the next pane, review your details and click Launch.
Your master instance will now start. Click Close, and click Instances in the left-hand Navigation pane. You should see your new instance listed (click Refresh in the top right toolbar if you do not see it). Once its status changes to “Running,” select it. In the bottom pane, you should see information about your running instance. Three of these fields are of interest for this guide: Public DNS, Private DNS, and Availability Zone.
Important
Pay attention to the difference between public and private DNS. We need to configure the slave nodes to communicate via the private DNS. If they use the public DNS, all communication between the nodes will incur extra network fees and may be blocked by the firewall.
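Since mixing up the two names is an easy mistake, a quick check before pasting a value into User Data may be worthwhile. The sketch below is a heuristic only, and it rests on an assumption not stated in this guide: that private names look like `ip-…​.internal` and public names look like `ec2-…​.amazonaws.com`, which is the commonly seen EC2 naming pattern. The function name `dns_kind` and the example hostnames are hypothetical.

```python
def dns_kind(hostname: str) -> str:
    """Guess whether an EC2 hostname is a private or public DNS name."""
    host = hostname.strip().lower()
    # Assumed pattern for private names, e.g. ip-10-0-0-12.ec2.internal
    if host.startswith("ip-") and host.endswith(".internal"):
        return "private"
    # Assumed pattern for public names, e.g. ec2-203-0-113-25.compute-1.amazonaws.com
    if host.startswith("ec2-") and host.endswith(".amazonaws.com"):
        return "public"
    return "unknown"


print(dns_kind("ip-10-0-0-12.ec2.internal"))
print(dns_kind("ec2-203-0-113-25.compute-1.amazonaws.com"))
```

A result other than "private" is a signal to go back to the console and copy the Private DNS field again.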
After our master instance is fully operational, we will start the slave nodes, the ones that store the data and perform the work for Hadoop jobs.
Log in to the AWS Management Console.
Click on Instances in the left-hand Navigation pane and select your master instance. Copy its Private DNS and note (perhaps write down) its Availability Zone, both of which should be listed in the bottom pane.
Click the Launch Instance button in the toolbar at the top.
In the window that pops up:
In the next pane:
In the next pane:
Type the following line into the User Data field:
MASTER_HOST=<master's private DNS>
Replace master’s private DNS with the value you copied in step 2. Remember to use the Private DNS!
Click Continue.
In the next pane, give the instances a human-readable name as before.
In the next pane, select the key pair you created for your master host and click Continue.
In the next pane, select only the security group you created for the entire cluster (not the master group) and click Continue.
In the next pane, review your details and click Launch.
You should now have a running cluster. If you want to add more slave instances later, simply repeat this procedure.