Friday 25 July 2014

Setting up your own Hadoop Cluster using Azure HDInsight

How to get started with Hadoop and Hive

Install prerequisites to manage your cluster

Log in to your Windows Azure account

Sign up using the free trial link at http://azure.microsoft.com/en-us/
Then click the portal link to manage your Azure services. You should see a menu of services down the left side.

Create a new Storage account

  1. Click the storage link in the Azure left side menu
  2. Then click the new link at the bottom. This prompts you with the options for creating a new storage account
  3. Choose a unique name for your URL. If the tick box turns green, your account name is unique
  4. Choose create storage account at the bottom
  5. Azure then starts creating your storage account. You may need to wait about 5 minutes for it to complete

Create a new HDInsight cluster


  1. Click the HDInsight link in the Azure left side menu
  2. Then click the new link at the bottom. This prompts you with the options for creating a new Hadoop cluster
  3. Choose a unique name for your URL
  4. Choose 1 data node for the cluster size (unless you want to go crazy, then be my guest)
  5. Select the storage account you created in the section above
  6. Click Create HDInsight Cluster. This takes a while, especially the first time: anywhere from 5 to 40 minutes

Connecting to your Cluster


  1. Click All Items in the top left menu and confirm your HDInsight cluster is listed as running
  2. Open PowerShell ISE
  3. Run the following
    Get-AzureSubscription
    Get-AzureHDInsightCluster
  4. Download the publish settings file to your local computer and keep note of the path
  5. Click the right arrow next to your HDInsight cluster
  6. Then choose Dashboard
  7. Take note of your subscription name and your cluster name
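
The steps above do not say where the publish settings file comes from. Here is a minimal sketch of one way to get it from PowerShell. It assumes the Azure PowerShell module of this era is installed and that Get-AzurePublishSettingsFile opens a browser page where you sign in and save the file. The path is a placeholder, not a real location. Until a publish settings file has been imported, the two Get- cmdlets from step 3 will most likely return nothing.

```powershell
# Opens the default browser on the download page for the .publishsettings file.
# Save the file somewhere local and note the full path.
Get-AzurePublishSettingsFile

# Import the saved file so that later cmdlets can authenticate.
# The path below is a placeholder; substitute your own.
Import-AzurePublishSettingsFile "<FULL_PATH_TO_PUBLISH_SETTINGS_FILE>"

# With the file imported, these list your subscriptions and clusters,
# which is where the subscription name and cluster name come from.
Get-AzureSubscription
Get-AzureHDInsightCluster
```

The publish settings file contains a management certificate for your subscription, so keep it somewhere private.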

Running Hive Queries against your Cluster

  1. Run a new script in PowerShell, replacing the placeholder values where necessary
    # Load the credentials from the publish settings file downloaded earlier
    Import-AzurePublishSettingsFile "<FULL_PATH_TO_PUBLISH_SETTINGS_FILE>"
    $subscriptionName = "<SUBSCRIPTION_NAME>"
    $clusterName = "<CLUSTER_NAME>"
    $queryString = "select country, state, count(*) as records from hivesampletable group by country, state order by records desc limit 5"
    # Pick the subscription and cluster, then submit the Hive query
    Select-AzureSubscription -SubscriptionName $subscriptionName
    Use-AzureHDInsightCluster $clusterName
    Invoke-Hive -Query $queryString

    Here is an example I have used
    Import-AzurePublishSettingsFile "C:\Powershell\Hadoop\jeremyking77Azure.publishsettings"
    $subscriptionName = "Visual Studio Professional with MSDN"
    $clusterName = "jeremyking77"
    $queryString = "select country, state, count(*) as records from hivesampletable group by country, state order by records desc limit 5"
    Select-AzureSubscription -SubscriptionName $subscriptionName
    Use-AzureHDInsightCluster $clusterName
    Invoke-Hive -Query $queryString
  2. You should get output like the following
    Successfully connected to cluster jeremyking77
    Submitting Hive query..
    Started Hive query with jobDetails Id : job_1405933745625_0003
    Hive query completed Successfully
    United States   California  6881
    United States   Texas   6539
    United States   Illinois    5120
    United States   Georgia 4801
    United States   Massachusetts   4450
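
If you want to work with the result rows rather than just read them, here is a rough sketch of turning them into PowerShell objects. It assumes the rows come back as tab-separated text, which is how Hive normally writes query output. The function name ConvertFrom-HiveRow and the column names Country, State and Records are made up for this example and are not part of the Azure module. The sample rows are copied from the output above, with the tabs written out explicitly.

```powershell
# Sample lines copied from the output above. The tab separators are an assumption.
$rows = @(
    "Hive query completed Successfully",
    "United States`tCalifornia`t6881",
    "United States`tTexas`t6539",
    "United States`tIllinois`t5120",
    "United States`tGeorgia`t4801",
    "United States`tMassachusetts`t4450"
)

# Hypothetical helper: turn tab-separated Hive rows into objects.
function ConvertFrom-HiveRow {
    param([string[]]$Lines)
    foreach ($line in $Lines) {
        $fields = $line -split "`t"
        # Skip status lines and anything else that is not a three-column row.
        if ($fields.Count -ne 3) { continue }
        [pscustomobject]@{
            Country = $fields[0]
            State   = $fields[1]
            Records = [int]($fields[2])
        }
    }
}

$results = ConvertFrom-HiveRow -Lines $rows
$results | Sort-Object Records -Descending | Format-Table -AutoSize
```

Once the rows are objects you can filter, sort or export them with the usual cmdlets, for example Where-Object or Export-Csv.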