2. Starting a Cluster

The first step in setting up a cluster with our AMIs is to sign up for EC2. If you have never used EC2 before, we strongly suggest reading Amazon’s Getting Started Guide. It should take no longer than 10 minutes, and it makes the process in the following sections much clearer. Rest assured, however, that this guide is still written for users who are new to EC2.

Important

Remember that as soon as you start EC2 nodes, Amazon will begin charging your account! Simply running through these instructions should cost no more than a few dollars, as long as you terminate the cluster afterward.

Quick Start

The basic process for starting a cluster on EC2 is shown below. If you are already familiar with EC2, this might be all you need to get going. Otherwise, the process will be explained step-by-step in the following sections.

  1. Set up two security groups:

    • One for all the nodes, allowing all communication within the group.
    • Another for the master node only, allowing web and SSH access from outside (ports 80, 22, 50030, and 50070).
  2. Start a master node. After it starts, record its Public DNS, Private DNS, and Availability Zone.

  3. Start any number of slave nodes in the same availability zone as the master node, passing the following line as User Data:

    MASTER_HOST=<master's private DNS>
    

    Replace <master's private DNS> with the appropriate value.

Once the cluster is set up, you will want to log in to the master node and add user accounts, as described in the next chapter.

2.1. Choosing an Instance Type

2.2. Creating Security Groups

Before we launch the cluster itself, we will want to create security groups for it. These define firewall rules for the EC2 instances that make up the cluster. This is a one-time-only procedure: once you have created appropriate security groups, you should not have to recreate them when starting new clusters unless you would like to keep your groups separate.

First, we will create a security group for all cluster nodes:

  1. Log in to the AWS Management Console if you have not already.

  2. Click on Security Groups in the left-hand Navigation pane.

  3. In the toolbar at the top, click Create Security Group. We will first create a group for all the machines in the cluster.

    1. For Name, enter hadoop-cluster, or anything else meaningful to you.
    2. For Description, enter anything you choose.
    3. Leave VPC on its default, “No VPC”.
    4. Click Create.
  4. Back in the Security Groups panel, your new group should be selected. In the lower pane, you can view and edit settings for this security group. We will create three rules to allow all traffic between all the instances that are assigned to this security group:

    1. Click on the Inbound tab.

    2. In the drop-down box labeled Create a new rule, choose “All TCP”.

    3. For Source, type the name of your security group (hadoop-cluster). As you type, a list of options should appear under the box, showing Security Group IDs and their matching human-readable names. For instance:

      sg-c520f351 (default)
      sg-6e5270a9 (hadoop-cluster)

      Click on the option that matches your security group.

    4. Click the Add Rule button. A table should appear to the right, listing your new rule.

    5. Repeat steps 2-4, choosing “All UDP” and “All ICMP” for the rule types.

    6. Once you have added all three rules, click Apply Rule Changes.

    You should now have the following three rule tables:

    ICMP Port (Service)   Source        Action
    ALL                   sg-6e5270a9   Delete

    TCP Port (Service)    Source        Action
    0-65535               sg-6e5270a9   Delete

    UDP Port (Service)    Source        Action
    0-65535               sg-6e5270a9   Delete

    This will allow your cluster nodes to communicate freely with each other.
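
If you would rather script this step than click through the console, the three rules above can be expressed as data. The following is a minimal sketch, assuming Python with the boto3 library and configured AWS credentials; neither is part of this guide. The function names are hypothetical, and the group ID is the example value shown above.

```python
def intra_group_permissions(group_id):
    """Build the three rules (TCP, UDP, ICMP) whose source is the group itself."""
    def rule(protocol, from_port, to_port):
        return {
            "IpProtocol": protocol,
            "FromPort": from_port,
            "ToPort": to_port,
            # The source is a security group rather than an IP range.
            "UserIdGroupPairs": [{"GroupId": group_id}],
        }

    return [
        rule("tcp", 0, 65535),
        rule("udp", 0, 65535),
        # For ICMP, -1 in both fields means "all types and codes".
        rule("icmp", -1, -1),
    ]


def apply_intra_group_rules(group_id):
    """Send the rules to EC2 (assumes boto3 is installed and credentials are set)."""
    import boto3  # imported lazily so the rule builder works without it

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=intra_group_permissions(group_id),
    )


# Inspect the rules locally; nothing is sent to AWS here.
for permission in intra_group_permissions("sg-6e5270a9"):
    print(permission["IpProtocol"], permission["FromPort"], permission["ToPort"])
```

Keeping the rule builder separate from the network call makes it possible to check the rules before anything is changed in your account.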

Next, we will create a security group for the master node only. This will allow outside computers to access the web interface and log in via SSH:

  1. Click Create Security Group in the toolbar again.

    1. For Name, enter hadoop-cluster-master, or anything else meaningful to you.
    2. For Description, enter anything you choose.
    3. Leave VPC on its default, “No VPC”.
    4. Click Create.
  2. Back in the Security Groups panel, select your new security group, click the Inbound tab in the bottom panel, and create the following four rules:

    Type              Port Range   Source
    HTTP              n/a          0.0.0.0/0
    SSH               n/a          0.0.0.0/0
    Custom TCP rule   50030        0.0.0.0/0
    Custom TCP rule   50070        0.0.0.0/0

    Don’t forget to press Apply Rule Changes when you are finished.

    The last two rules are for accessing the web interfaces for the Hadoop JobTracker and NameNode, respectively.

    In this case, Source uses CIDR notation to specify the range of hosts that are allowed to connect. The value 0.0.0.0/0 allows access from the entire internet. If you know the IP range of your school or intended users, enter it here instead for greater security.
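
To make the CIDR notation more concrete, the sketch below uses Python's standard ipaddress module to test whether a client address falls inside a given range, and builds the four master rules in the shape that boto3's authorize_security_group_ingress call accepts. This is only an illustration: the function names are hypothetical, and the sample addresses come from ranges reserved for documentation.

```python
import ipaddress

# HTTP, SSH, JobTracker web interface, NameNode web interface
MASTER_PORTS = (80, 22, 50030, 50070)


def is_allowed(client_ip, source_cidr):
    """Return True if client_ip falls inside the CIDR range source_cidr."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(source_cidr)


def master_permissions(source_cidr="0.0.0.0/0"):
    """Build one TCP rule per master port, open to source_cidr."""
    # ip_network raises ValueError if the string is not a valid CIDR range.
    network = ipaddress.ip_network(source_cidr)
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": str(network)}],
        }
        for port in MASTER_PORTS
    ]


# 0.0.0.0/0 matches every IPv4 address; a /24 matches only 256 of them.
print(is_allowed("203.0.113.7", "0.0.0.0/0"))
print(is_allowed("203.0.113.7", "192.0.2.0/24"))
print(is_allowed("192.0.2.200", "192.0.2.0/24"))
```

Passing your school's range to master_permissions instead of the default would restrict all four ports to that range.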

2.3. Starting the Master Instance

Now we will start our master node, the instance that controls the rest of the nodes and the one with which users will interact directly.

  1. Log in to the AWS Management Console.

  2. Click on EC2 Dashboard in the left-hand Navigation pane of the AWS Management Console.

  3. Click the Launch Instance button.

  4. In the window that pops up:

    1. Click the Community AMIs tab.

    2. Enter the appropriate AMI ID (see WebMapReduce AMIs and Choosing an Instance Type) in the search box at the top.

    3. The AMI should appear in the list below. Click the Select button to its right.

      Note

      Occasionally, you may have to wait a while for the list of AMIs to update after you enter the ID in the search box.

  5. In the next pane:

    1. Under Instance Type, select an appropriate option. See Choosing an Instance Type.
    2. Click Continue.
  6. In the next pane (which contains the heading Advanced Instance Options), leave all fields at their defaults. Click Continue.

  7. The next pane allows you to add custom Tags to your instance. Feel free to enter a name you can remember into the Value column next to the “Name” key.

  8. In the next pane:

    1. Select the “Create a new Key Pair” option.
    2. Choose an appropriate name for your key pair such as hadoop-keypair.
    3. Click Create and Download Key Pair. Remember where you save the downloaded keypair: you will need it to log in to your cluster.
    4. Click Continue.
  9. In the next pane:

    1. Select the “Choose one or more of your existing Security Groups” option.
    2. Select both of the security groups you created earlier (Tip: to select multiple items in a list, hold down the Ctrl key on a PC or the Command key on a Mac while clicking the items).
    3. Click Continue.
  10. In the next pane, review your details and click Launch.

Your master instance will now start. Click Close, and click Instances in the left-hand Navigation pane. You should see your new instance listed (click Refresh in the top right toolbar if you do not see it). Once its status changes to “Running,” select it. In the bottom pane, you should see information about your running instance. Three of these fields are of interest for this guide:

Public DNS
The domain name that you can use to communicate with your new instance from outside EC2, i.e., from your personal computer. Example: ec2-50-17-174-4.compute-1.amazonaws.com
Private DNS
The domain name that other cluster nodes inside EC2 should use to communicate with your master instance. Example: ip-10-85-54-114.ec2.internal
Availability Zone (or simply Zone)
An identifier for the physical site where your instance is running. We will want to start slave instances at this same site. Example: us-east-1c
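
The same three fields can also be read programmatically. The sketch below assumes the boto3 library, which this guide does not otherwise use, and the dictionary layout returned by its describe_instances call. The function names are hypothetical, and the sample record simply reuses the example values above.

```python
def master_details(instance):
    """Pick the three fields of interest out of one instance description."""
    return {
        "public_dns": instance["PublicDnsName"],
        "private_dns": instance["PrivateDnsName"],
        "zone": instance["Placement"]["AvailabilityZone"],
    }


def fetch_master_details(instance_id):
    """Look the instance up in EC2 (assumes boto3 and configured credentials)."""
    import boto3  # imported lazily so master_details works without it

    ec2 = boto3.client("ec2")
    response = ec2.describe_instances(InstanceIds=[instance_id])
    return master_details(response["Reservations"][0]["Instances"][0])


# A hand-written record in the same shape, using the example values above.
sample = {
    "PublicDnsName": "ec2-50-17-174-4.compute-1.amazonaws.com",
    "PrivateDnsName": "ip-10-85-54-114.ec2.internal",
    "Placement": {"AvailabilityZone": "us-east-1c"},
}
print(master_details(sample))
```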

Important

Pay attention to the difference between public and private DNS. We need to configure the slave nodes to communicate via the private DNS. If they use the public DNS, all communication between the nodes will incur extra network fees and may be blocked by the firewall.
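
Because mixing up the two names is an easy mistake, a quick check before configuring the slaves can help. The sketch below is a heuristic only: it assumes private DNS names follow the pattern of the example above (ip-, four numbers, and a domain ending in "internal"), and the function name is hypothetical.

```python
import ipaddress
import re

# Pattern based on the example name ip-10-85-54-114.ec2.internal (an assumption).
_PRIVATE_DNS = re.compile(r"^ip-(\d+)-(\d+)-(\d+)-(\d+)\.[\w.-]*internal$")


def looks_like_private_dns(hostname):
    """Return True if hostname has the shape of an EC2 private DNS name."""
    match = _PRIVATE_DNS.match(hostname)
    if match is None:
        return False
    try:
        address = ipaddress.ip_address(".".join(match.groups()))
    except ValueError:
        # One of the four numbers was not a valid octet (greater than 255).
        return False
    # The embedded address should come from a private range such as 10.0.0.0/8.
    return address.is_private


print(looks_like_private_dns("ip-10-85-54-114.ec2.internal"))
print(looks_like_private_dns("ec2-50-17-174-4.compute-1.amazonaws.com"))
```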

2.4. Starting Slave Instances

After our master instance is fully operational, we will start the slave nodes, the ones that store the data and perform the work for Hadoop jobs.

  1. Log in to the AWS Management Console.

  2. Click on Instances in the left-hand Navigation pane and select your master instance. Copy its Private DNS and note (perhaps write down) its Availability Zone, both of which should be listed in the bottom pane.

  3. Click the Launch Instance button in the toolbar at the top.

  4. In the window that pops up:

    1. Click the Community AMIs tab.
    2. Enter the AMI ID (the same one as before) in the search box at the top.
    3. Find that AMI in the list and click the Select button to the right.
  5. In the next pane:

    1. Choose an instance type (see Choosing an Instance Type).
    2. Enter the number of instances you would like to start. If you are trying out this process for the first time, start 2 instances (you can always add more later by following the steps in this section again).
    3. Pick the same Availability Zone as your master instance.
    4. Click Continue.
  6. In the next pane:

    1. Type the following line into the User Data field:

      MASTER_HOST=<master's private DNS>
      

      Replace master’s private DNS with the value you copied in step 2. Remember to use the Private DNS!

    2. Click Continue.

  7. In the next pane, give the instances a human-readable name as before.

  8. In the next pane, select the key pair you created for your master host and click Continue.

  9. In the next pane, select only the security group you created for the entire cluster (not the master group) and click Continue.

  10. In the next pane, review your details and click Launch.

You should now have a running cluster. If you want to add more slave instances later, simply repeat this procedure.
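
For readers who prefer to script the launch, the console steps above map onto a single request. The following is a minimal sketch, assuming the boto3 library and its run_instances call; the function names are hypothetical, and the AMI ID, instance type, and key pair name in the usage lines are placeholders rather than recommendations.

```python
def slave_launch_params(ami_id, instance_type, key_name, cluster_group_id,
                        zone, master_private_dns, count=2):
    """Build the keyword arguments for launching `count` slave instances."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        # Slaves join only the cluster-wide group, not the master group.
        "SecurityGroupIds": [cluster_group_id],
        # Same Availability Zone as the master instance.
        "Placement": {"AvailabilityZone": zone},
        # The User Data line that tells each slave where the master is.
        "UserData": "MASTER_HOST=" + master_private_dns,
        "MinCount": count,
        "MaxCount": count,
    }


def launch_slaves(**params):
    """Start the instances (assumes boto3 and configured AWS credentials)."""
    import boto3  # imported lazily so the parameter builder works without it

    return boto3.client("ec2").run_instances(**params)


# Placeholder AMI ID, instance type, and key name; substitute your own values.
params = slave_launch_params(
    ami_id="ami-xxxxxxxx",
    instance_type="m1.small",
    key_name="hadoop-keypair",
    cluster_group_id="sg-6e5270a9",
    zone="us-east-1c",
    master_private_dns="ip-10-85-54-114.ec2.internal",
)
print(params["UserData"])
```

Calling launch_slaves(**params) would actually start the instances and begin incurring charges, so the sketch stops at printing the User Data line.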
