.. default-role:: guilabel

******************
Starting a Cluster
******************

The first step to set up a cluster with our AMIs is to `sign up for EC2`__. If
you have never used EC2 before, we **strongly suggest** reading Amazon's
`Getting Started Guide`_. It shouldn't take any longer than 10 minutes, and it
makes the process in the following sections much clearer. Rest assured, however,
that we will still target this guide to users who are new to EC2.

.. __: http://aws-portal.amazon.com/gp/aws/developer/subscription/index.html?productCode=AmazonEC2

.. important::
   Remember that if you start EC2 nodes, Amazon will begin charging your
   account! Simply running through these instructions should cost no more than
   a few dollars, *as long as you terminate the cluster afterward*.

.. topic:: Quick Start

   The basic process for starting a cluster on EC2 is shown below. If you are
   already familiar with EC2, this might be all you need to get going.
   Otherwise, the process will be explained step-by-step in the following
   sections.

   #. Set up two security groups:

      - One for all the nodes, allowing all communication within the group.
      - Another for the master node only, allowing web and SSH access from
        outside (ports 80, 22, 50030, and 50070).

   #. Start a master node. After it starts, record its `Public DNS`,
      `Private DNS`, and `Availability Zone`.

   #. Start any number of slave nodes in the same availability zone as the
      master node, passing the following line as :dfn:`User Data`:

      .. parsed-literal::

         MASTER_HOST=\ *<master's private DNS>*

      Replace *master's private DNS* with the appropriate value.

   Once the cluster is set up, you will want to log in to the master node and
   add user accounts, as described in the next chapter.

.. _sect-EC2_Guide-InstanceType:

Choosing an Instance Type
=========================

.. todo:: Write!

Creating Security Groups
========================

Before we launch the cluster itself, we will want to create :dfn:`security
groups` for it.
These define firewall rules for the EC2 instances that make up the cluster.
This is a one-time-only procedure: once you have created appropriate security
groups, you should not have to recreate them when starting new clusters unless
you would like to keep your groups separate.

First, we will create a security group for all cluster nodes:

#. Log in to the `AWS Management Console`_ if you have not already.

#. Click on `Security Groups` in the left-hand Navigation pane.

#. In the toolbar at the top, click `Create Security Group`. We will first
   create a group for all the machines in the cluster.

   a. For `Name`, enter ``hadoop-cluster``, or anything else meaningful to you.
   b. For `Description`, enter anything you choose.
   c. Leave `VPC` on its default, "No VPC".
   d. Click `Create`.

#. Back in the Security Groups panel, your new group should be selected. In the
   lower pane, you can view and edit settings for this security group. We will
   create three rules to allow all traffic between all the instances that are
   assigned to this security group:

   a. Click on the `Inbound` tab.

   b. In the drop-down box labeled `Create a new rule`, choose "All TCP".

   c. For `Source`, type the name of your security group (``hadoop-cluster``).
      As you type, a list of options will appear under the box that should
      contain a list of :dfn:`Security Group IDs` and their matching
      human-readable names. For instance::

         sg-c520f351 (default)
         sg-6e5270a9 (hadoop-cluster)

      Click on the option that matches your security group.

   d. Click the `Add Rule` button. A table should appear to the right, listing
      your new rule.

   e. Repeat steps b-d, choosing "All UDP" and "All ICMP" for the rule types.

   f. Once you have added all three rules, click `Apply Rule Changes`.

   You should now have the following three rule tables:

   =================== =========== ==========
   ICMP Port (Service) Source      Action
   =================== =========== ==========
   ALL                 sg-6e5270a9 Delete
   =================== =========== ==========

   =================== =========== ==========
   TCP Port (Service)  Source      Action
   =================== =========== ==========
   0-65535             sg-6e5270a9 Delete
   =================== =========== ==========

   =================== =========== ==========
   UDP Port (Service)  Source      Action
   =================== =========== ==========
   0-65535             sg-6e5270a9 Delete
   =================== =========== ==========

This will allow your cluster nodes to communicate freely with each other.

Next, we will create a security group for the master node only. This will allow
outside computers to access the web interface and log in via SSH:

#. Click `Create Security Group` in the toolbar again.

   a. For `Name`, enter ``hadoop-cluster-master``, or anything else meaningful
      to you.
   b. For `Description`, enter anything you choose.
   c. Leave `VPC` on its default, "No VPC".
   d. Click `Create`.

#. Back in the Security Groups panel, select your new security group, click the
   `Inbound` tab in the bottom panel, and create the following four rules:

   =============== ========== =========
   Type            Port Range Source
   =============== ========== =========
   HTTP            n/a        0.0.0.0/0
   SSH             n/a        0.0.0.0/0
   Custom TCP rule 50030      0.0.0.0/0
   Custom TCP rule 50070      0.0.0.0/0
   =============== ========== =========

   **Don't forget** to press `Apply Rule Changes` when you are finished.

The last two rules are for accessing the web interfaces for the Hadoop
JobTracker and NameNode, respectively.

In this case, `Source` uses `CIDR notation`_ to specify the range of hosts that
are allowed to connect. The value ``0.0.0.0/0`` allows access from the entire
internet. If you know the IP range of your school or intended users, enter it
here instead for greater security.
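
To make the meaning of a CIDR `Source` value more concrete, here is a minimal
Python sketch using only the standard library ``ipaddress`` module. The helper
name ``is_allowed`` and the sample addresses (taken from the reserved
documentation ranges) are assumptions made for illustration; this is not part
of the AMIs or of any AWS tool, and it only approximates how a firewall rule
matches a connecting host.

.. code-block:: python

   import ipaddress


   def is_allowed(source_cidr: str, client_ip: str) -> bool:
       """Return True if client_ip falls inside the range named by source_cidr."""
       network = ipaddress.ip_network(source_cidr)
       return ipaddress.ip_address(client_ip) in network


   # 0.0.0.0/0 matches every IPv4 address, i.e. the entire internet.
   print(is_allowed("0.0.0.0/0", "203.0.113.7"))          # True

   # A narrower, hypothetical campus range admits only its own 256 addresses.
   print(is_allowed("198.51.100.0/24", "198.51.100.42"))  # True
   print(is_allowed("198.51.100.0/24", "203.0.113.7"))    # False

If you restrict `Source` to a range such as the hypothetical ``/24`` above, only
hosts inside that range could reach the master node's web and SSH ports.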

Starting the Master Instance
============================

Now we will start our :dfn:`master node`, the instance that controls the rest
of the nodes and the one with which users will interact directly.

#. Log in to the `AWS Management Console`_.

#. Click on `EC2 Dashboard` in the left-hand Navigation pane of the AWS
   Management Console.

#. Click the `Launch Instance` button.

#. In the window that pops up:

   a. Click the `Community AMIs` tab.
   b. Enter the appropriate AMI ID (see :ref:`table-EC2_Guide-AMIs` and
      :ref:`sect-EC2_Guide-InstanceType`) in the search box at the top.
   c. The AMI should appear in the list below. Click the `Select` button to
      its right.

   .. note::
      Occasionally, you may have to wait a while for the list of AMIs to
      update after you enter the ID in the search box.

#. In the next pane:

   a. Under `Instance Type`, select an appropriate option. See
      :ref:`sect-EC2_Guide-InstanceType`.
   b. Click `Continue`.

#. In the next pane (which contains the heading `Advanced Instance Options`),
   leave all fields as their defaults. Click `Continue`.

#. The next pane allows you to add custom :dfn:`Tags` to your instance. Feel
   free to enter a name you can remember into the `Value` column next to the
   "Name" key.

#. In the next pane:

   a. Select `Create a new Key Pair`.
   b. Choose an appropriate name for your key pair, such as
      ``hadoop-keypair``.
   c. Click `Create and Download Key Pair`. Remember where you save the
      downloaded key pair: you will need it to log in to your cluster.
   d. Click `Continue`.

#. In the next pane:

   a. Select `Choose one or more of your existing Security Groups`.
   b. Select *both* of the security groups you created earlier (Tip: to select
      multiple items in a list, hold down the :kbd:`Ctrl` key on a PC or the
      :kbd:`Command` key on a Mac while clicking the items).
   c. Click `Continue`.

#. In the next pane, review your details and click `Launch`.

Your master instance will now start. Click `Close`, then click `Instances` in
the left-hand Navigation pane.
You should see your new instance listed (click `Refresh` in the top right
toolbar if you do not see it). Once its status changes to "Running," select it.
In the bottom pane, you should see information about your running instance.
Three of these fields are of interest for this guide:

`Public DNS`
   The domain name that you can use to communicate with your new instance from
   outside EC2, i.e., from your personal computer.

   Example: ``ec2-50-17-174-4.compute-1.amazonaws.com``

`Private DNS`
   The domain name that other cluster nodes inside EC2 should use to
   communicate with your master instance.

   Example: ``ip-10-85-54-114.ec2.internal``

`Availability Zone` (or simply `Zone`)
   An identifier for the physical site where your instance is running. We will
   want to start slave instances at this same site.

   Example: ``us-east-1c``

.. important::
   Pay attention to the difference between *public* and *private* DNS. We need
   to configure the slave nodes to communicate via the *private* DNS. If they
   use the *public* DNS, all communication between the nodes will incur extra
   network fees and may be blocked by the firewall.

Starting Slave Instances
========================

After our master instance is fully operational, we will start the :dfn:`slave
nodes`, the ones that store the data and perform the work for Hadoop jobs.

#. Log in to the `AWS Management Console`_.

#. Click on `Instances` in the left-hand Navigation pane and select your master
   instance. Copy its `Private DNS` and note (perhaps write down) its
   `Availability Zone`, both of which should be listed in the bottom pane.

#. Click the `Launch Instance` button in the toolbar at the top.

#. In the window that pops up:

   a. Click the `Community AMIs` tab.
   b. Enter the AMI ID (the same one as before) in the search box at the top.
   c. Find that AMI in the list and click the `Select` button to the right.

#. In the next pane:

   a. Choose an instance type (see :ref:`sect-EC2_Guide-InstanceType`).
   b. Enter the number of instances you would like to start. If you are just
      trying out this process for the first time, start 2 instances (you can
      always add more later by repeating the steps in this section).
   c. Pick the *same* Availability Zone as your master instance.
   d. Click `Continue`.

#. In the next pane:

   a. Type the following line into the `User Data` field:

      .. parsed-literal::

         MASTER_HOST=\ *<master's private DNS>*

      Replace *master's private DNS* with the value you copied in step \2.
      *Remember to use the* **Private** *DNS!*

   b. Click `Continue`.

#. In the next pane, give the instances a human-readable name as before.

#. In the next pane, select the key pair you created for your master host and
   click `Continue`.

#. In the next pane, select *only* the security group you created for the
   entire cluster (not the master group) and click `Continue`.

#. In the next pane, review your details and click `Launch`.

You should now have a running cluster. If you want to add more slave instances
later, simply repeat this procedure.

.. Named Links

.. _Getting Started Guide: http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/
.. _AWS Management Console: http://aws.amazon.com/console/
.. _CIDR Notation: http://en.wikipedia.org/wiki/CIDR_notation
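
Because mixing up the public and private DNS names is an easy mistake when
filling in `User Data` for the slave instances, the following Python sketch
shows one way to build the ``MASTER_HOST`` line and catch the most obvious
slip. The function name ``build_user_data`` is hypothetical, and the check for
a public name is only a heuristic based on the example hostnames in this guide
(public names ending in ``.amazonaws.com``); it is not an official AWS rule.

.. code-block:: python

   def build_user_data(private_dns: str) -> str:
       """Return the User Data line for a slave node.

       The public-name check is a heuristic drawn from the example hostnames
       in this guide, not an official AWS rule.
       """
       host = private_dns.strip()
       if not host:
           raise ValueError("the master's private DNS name is required")
       if host.endswith(".amazonaws.com"):
           raise ValueError(
               f"{host} looks like a public DNS name; use the private one"
           )
       return f"MASTER_HOST={host}"


   # Using the example private DNS name from this guide:
   print(build_user_data("ip-10-85-54-114.ec2.internal"))
   # MASTER_HOST=ip-10-85-54-114.ec2.internal

Passing the example public name,
``ec2-50-17-174-4.compute-1.amazonaws.com``, raises ``ValueError`` instead of
producing a line that would route traffic through the public interface.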