Automating Azure: Creating an On-Demand HDInsight Cluster

Automating Azure for Resources On-Demand

  1. Automating Azure: How to Deploy a Temporary Virtual Machine Resource
  2. Automating Azure: Creating an On-Demand HDInsight Cluster

See also: Creating a Custom .NET Activity Pipeline for Azure Data Factory

HDInsight in Azure is a great way to process Big Data, because it scales very well with large volumes of data and with complex processing requirements. Unfortunately, HDInsight clusters in Azure are expensive. The minimal configuration, as of now, costs about €5 for every hour that the cluster is running, whether you’re using it or not. Depending on your contract, the monthly cost for an HDInsight cluster can amount to thousands of Euros. Your requirements are unlikely to be anywhere near this; more likely you need the cluster just once a week for two hours: fifty Euros’ worth of computing power rather than thousands.

For small-scale use of HDInsight, you will need a way to automate the “on-demand” creation and deletion of an HDInsight cluster. In this article, I’ll be showing you how to do this. I’ll create an HDInsight cluster with R Server, run a very simple R script on the cluster, and then delete the cluster.

To do this, we will reuse some of the ideas from my previous articles: creating resources in Azure by using Custom .NET activities and ARM templates (Create a Virtual Machine in Azure by using ARM template and C#), and automating the Custom .NET activity by scheduling it via a Data Factory deployment (Creating a Custom .NET Activity Pipeline for Azure Data Factory).

To get the task done, we must prepare the following:

  1. Obtain the Template and Parameters files for the HDInsight cluster
  2. Create a BLOB container where the template is stored
  3. Create an AAD application (to be used for service-to-service authentication) and authorize it for the BLOB container
    1. Get the App ID, key and tenant ID (directory ID): https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-authenticate-using-active-directory, https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal#get-tenant-id
    2. Assign the application to a role: https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal#assign-application-to-role
  4. Create a Visual Studio class for the code that will:
    1. create the cluster from the template
    2. use SSH.NET to run the R script and write the script’s output to BLOB storage
    3. delete the cluster
  5. Add an ADF project and add a reference to the class
  6. Create a Batch service and pool
  7. Create linked services and outputs
  8. Create a pipeline
  9. Deploy the ADF and test the HDInsight creation, R script execution and HDInsight cluster deletion

This may seem to be a complex solution with a lot of steps, but the idea is, in fact, very simple: we can schedule any custom C# code via Data Factory, and that C# code can bring up any resource in Azure through the Azure API. The whole process is scheduled by ADF and executed by an Azure Batch service, which in this case uses the smallest “behind the scenes” VM to run our custom code. We also use the SSH.NET library to pass SSH commands to the HDInsight cluster; this lets us trigger the R script execution and ensure that the output is written back to BLOB storage, outside of the HDInsight cluster. Finally, we delete the HDInsight cluster we created by simply requesting the deletion of the entire resource group once the work is done. This is easy because we make sure to create the HDInsight cluster in its own resource group, so our C# class has a handle to that resource group.

Let’s get started:

Obtain the Template and Parameters files for the HDInsight cluster

As I mentioned in my previous article “Create a Virtual Machine in Azure by using ARM template and C#”, it is easy to obtain the template and parameters JSON files for any resource in Azure, whether the resource is up and running or about to be created.

In this case I will start creating the HDInsight cluster via the Azure portal, set everything up as I need it, and, before I click ‘Create’, download its JSON definition to use later on to bring up the cluster via C#.

For my purpose here I will be setting up an HDInsight cluster of the R Server type on Linux:

Before clicking on the ‘Create’ button, I will specify everything I need, and then click on the “Download template and parameters” link (ringed in orange below) to save the JSON files. These files will be referenced in my C# code later on.

Create a BLOB container where the template is stored

I will not spend too much time here explaining how to create a BLOB container. The important thing to note is that the container needs to contain the two JSON files for the template and the parameters. In our next step, we will create an AAD application, and this will need to have access to this BLOB container.

Create an AAD application

In this step we create an Azure Active Directory application that will be used for service-to-service authentication, and we note its App ID, key and Active Directory tenant ID. For further details on how to create an AAD application and assign it to a role, follow the Azure documentation linked in the task list above.
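
These three values are what the custom activity will later use to authenticate against Azure from code. As a rough sketch, using the Fluent management packages listed further down (all the values here are placeholders):

    using Microsoft.Azure.Management.Fluent;
    using Microsoft.Azure.Management.ResourceManager.Fluent;

    // Placeholder values: the App ID, key and tenant ID noted when creating the AAD application
    var credentials = SdkContext.AzureCredentialsFactory.FromServicePrincipal(
        "<application-id>",
        "<application-key>",
        "<tenant-id>",
        AzureEnvironment.AzureGlobalCloud);

    // Authenticated entry point to the Azure management API
    IAzure azure = Azure
        .Configure()
        .Authenticate(credentials)
        .WithSubscription("<subscription-id>");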

Create a VS class for the code

As mentioned earlier, we will be using a C# class to create the cluster from the template. This class implements the IDotNetActivity interface provided by Microsoft.

The C# class will be similar to:
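
Here is a minimal sketch; the resource group name, region and template/parameters URLs are placeholders, and the two helper methods stand in for the authentication call shown above and for the SSH step shown further down:

    using System.Collections.Generic;
    using Microsoft.Azure.Management.DataFactories.Models;
    using Microsoft.Azure.Management.DataFactories.Runtime;
    using Microsoft.Azure.Management.Fluent;
    using Microsoft.Azure.Management.ResourceManager.Fluent;
    using Microsoft.Azure.Management.ResourceManager.Fluent.Core;
    using Microsoft.Azure.Management.ResourceManager.Fluent.Models;

    public class HDInsightOnDemandActivity : IDotNetActivity
    {
        public IDictionary<string, string> Execute(
            IEnumerable<LinkedService> linkedServices,
            IEnumerable<Dataset> datasets,
            Activity activity,
            IActivityLogger logger)
        {
            logger.Write("Authenticating with the service principal...");
            IAzure azure = GetAzureClient();

            // The cluster gets its own resource group, so it can be deleted as a whole later on
            string resourceGroupName = "rg-hdinsight-ondemand";
            azure.ResourceGroups
                 .Define(resourceGroupName)
                 .WithRegion(Region.EuropeNorth)
                 .Create();

            logger.Write("Deploying the HDInsight ARM template...");
            azure.Deployments
                 .Define("hdinsight-ondemand-deployment")
                 .WithExistingResourceGroup(resourceGroupName)
                 // The two JSON files uploaded to the BLOB container earlier
                 .WithTemplateLink("https://<account>.blob.core.windows.net/templates/template.json", "1.0.0.0")
                 .WithParametersLink("https://<account>.blob.core.windows.net/templates/parameters.json", "1.0.0.0")
                 .WithMode(DeploymentMode.Incremental)
                 .Create();

            logger.Write("Cluster is up; running the R script over SSH...");
            RunRScriptOverSsh(logger);

            logger.Write("Deleting the resource group (and the cluster with it)...");
            azure.ResourceGroups.DeleteByName(resourceGroupName);

            return new Dictionary<string, string>();
        }

        private static IAzure GetAzureClient()
        {
            // Service-principal authentication, as shown in the previous section
            var credentials = SdkContext.AzureCredentialsFactory.FromServicePrincipal(
                "<application-id>", "<application-key>", "<tenant-id>", AzureEnvironment.AzureGlobalCloud);
            return Azure.Configure().Authenticate(credentials).WithSubscription("<subscription-id>");
        }

        private static void RunRScriptOverSsh(IActivityLogger logger)
        {
            // The SSH.NET part is shown in the next section
        }
    }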

It is important to note that in order for this code to work, we need to install the following NuGet packages:

  • Install-Package Microsoft.Azure.Management.ResourceManager.Fluent -Version 1.2.0
  • Install-Package Microsoft.Azure.Management.Fluent -Version 1.2.0
  • Install-Package SSH.NET -Version 2016.0.0

As for the R script that runs on the HDInsight cluster, I am using a very simple one. It resides in BLOB storage, is copied onto the cluster during the SSH session and executed, and its output is then copied back out to BLOB storage. In this case the script contains a very simple computation:
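
Any script whose console output can be captured to a file will do; a hypothetical example:

    # A trivial computation; the console output is captured to a text file and copied to BLOB storage
    x <- 1:100
    cat("Hello from the on-demand HDInsight cluster\n")
    cat("The sum of 1 to 100 is", sum(x), "\n")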

Running this script produces a file with the console output, saved as a text file. There is nothing special about this particular computation; all it does is prove that a ‘hello world’ R script can be executed on the HDInsight cluster we just created, and that its output can be written back to BLOB storage, outside of the HDFS file system.
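
The SSH step itself is driven from the custom activity with SSH.NET. A rough sketch, filling in the RunRScriptOverSsh stub from the class above and assuming placeholder values for the cluster’s SSH endpoint and credentials, plus hypothetical wasb paths for the script and its output:

    using Renci.SshNet;

    private static void RunRScriptOverSsh(IActivityLogger logger)
    {
        // Placeholder values: the SSH endpoint and credentials chosen when the cluster was defined
        using (var ssh = new SshClient("<clustername>-ssh.azurehdinsight.net", "sshuser", "<ssh-password>"))
        {
            ssh.Connect();

            // Copy the R script from BLOB (wasb) storage onto the head node, run it,
            // and copy the console output back out to BLOB storage
            logger.Write("{0}", ssh.RunCommand("hdfs dfs -copyToLocal wasb:///scripts/simple.R simple.R").Result);
            logger.Write("{0}", ssh.RunCommand("Rscript simple.R > output.txt").Result);
            logger.Write("{0}", ssh.RunCommand("hdfs dfs -copyFromLocal -f output.txt wasb:///output/output.txt").Result);

            ssh.Disconnect();
        }
    }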

Delete the cluster

As mentioned earlier, it is very easy to delete the cluster, because we created it from the C# code and gave it its own resource group. Hence, we can just delete the entire resource group by calling:
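
With the Fluent management library this is a single call, azure being the authenticated client and resourceGroupName the group the cluster was created in:

    // Removing the resource group takes the HDInsight cluster, and everything else in it, down with it
    azure.ResourceGroups.DeleteByName(resourceGroupName);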

Add ADF project and add reference to the class

Now it is time to add the ADF project to our solution and get ready to deploy our custom .NET activity. One very important step is to add a reference from the ADF project to the project containing the custom class above. We do this by right-clicking on the ADF project’s References node and then clicking on ‘Add Reference…’. This way, we can deploy the entire solution to Data Factory from Visual Studio, including the DLL files of the custom .NET activity.

Create a batch service and pool

As mentioned earlier, the creation of this ‘on-demand’ HDInsight cluster depends on being able to use ADF to schedule the execution of custom C# code, which is executed by a very simple virtual machine behind the scenes in our Batch service account.

For more details on how to create a Batch Service and a batch pool you can refer to the Azure documentation or even to my previous article “Creating a Custom .NET Activity Pipeline for Azure Data Factory”.

Create linked services and outputs

For this Data Factory we will need two linked services: one for the Batch Service …
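
In ADF (v1) JSON this is an AzureBatch linked service, roughly like the following (account, key, pool and region values are placeholders):

    {
      "name": "AzureBatchLinkedService",
      "properties": {
        "type": "AzureBatch",
        "typeProperties": {
          "accountName": "<batch-account-name>",
          "accessKey": "<batch-account-key>",
          "poolName": "<pool-name>",
          "batchUri": "https://<region>.batch.azure.com",
          "linkedServiceName": "StorageLinkedService"
        }
      }
    }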

… and another one for the Storage:
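
This is a standard AzureStorage linked service; the connection string below is a placeholder:

    {
      "name": "StorageLinkedService",
      "properties": {
        "type": "AzureStorage",
        "typeProperties": {
          "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<storage-key>"
        }
      }
    }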

For the output, we need the following:
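
The output is just a BLOB dataset that the activity run is tracked against; a sketch with placeholder names:

    {
      "name": "OutputDataset",
      "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
          "folderPath": "adf/output/"
        },
        "availability": {
          "frequency": "Day",
          "interval": 1
        }
      }
    }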

Create a pipeline

The pipeline JSON looks like this:
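
Condensed, and with placeholder names for the assembly, entry point, package location and scheduling window, it looks roughly like this:

    {
      "name": "RunRScriptOnDemandHDInsight",
      "properties": {
        "description": "Creates an HDInsight cluster, runs the R script and deletes the cluster",
        "activities": [
          {
            "name": "HDInsightOnDemandActivity",
            "type": "DotNetActivity",
            "linkedServiceName": "AzureBatchLinkedService",
            "typeProperties": {
              "assemblyName": "<custom-activity-assembly>.dll",
              "entryPoint": "<Namespace>.HDInsightOnDemandActivity",
              "packageLinkedService": "StorageLinkedService",
              "packageFile": "<container>/<custom-activity-package>.zip"
            },
            "outputs": [ { "name": "OutputDataset" } ],
            "policy": {
              "timeout": "02:30:00",
              "concurrency": 1,
              "retry": 1
            }
          }
        ],
        "start": "<window-start>",
        "end": "<window-end>"
      }
    }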

One thing to note is that the timeout is set as “timeout”: “02:30:00”. This is important because it can take up to an hour to bring up the HDInsight cluster and to delete it; if the timeout is set any lower, the ADF pipeline will be marked as failed, even though the routine might do its job anyway.

Deploy the ADF and test the HDInsight creation, R script execution and HDInsight cluster deletion

All that is left to do is to deploy the ADF pipeline to Azure by right-clicking on the project, clicking on ‘Publish’ and going through the publish wizard. Once the pipeline is published, it needs to be executed and monitored. Feel free to add more logger messages to the C# code above and follow the custom messages when debugging.

Conclusion

In this article we saw yet again the power of custom activities in Azure, and how they can be used to perform occasional scalable computations with HDInsight. Microsoft Azure does not yet provide out-of-the-box functionality for creating on-demand HDInsight clusters to perform periodic workloads on R Server; however, the solution I’ve described in this article opens up a new field of possibilities and makes HDInsight a cost-effective choice for processing large volumes of data.

About the author

Feodor Georgiev

Feodor has a background of many years working with SQL Server and is now mainly focusing on data analytics, data science and R.

Over more than 15 years Feodor has worked on assignments involving database architecture, Microsoft SQL Server data platform, data model design, database design, integration solutions, business intelligence, reporting, as well as performance optimization and systems scalability.

In the past 3 years he has expanded his focus to coding in R for assignments relating to data analytics and data science.

Alongside his day-to-day work, he blogs, shares tips on forums and writes articles.