Moving HDInsight workloads to Qubole: Cost comparison and benefits

November 9, 2017, updated April 12th, 2024

Many companies start their big data journey in the cloud on Azure by gravitating to Microsoft’s native offering, HDInsight (HDI). With data already in Azure Blob Storage or ADLS, it is easy to get started on projects and experiment.

For an experienced team with deep big data expertise and data engineers and data scientists on staff, configuring and tuning an infrastructure on HDI can make sense. However, for most organizations starting out, managing and tuning this infrastructure can be a barrier to scaling beyond a department or POC project. Setup requires configuring many components separately to customize the platform for your specific use case. The administrative overhead of maintaining the platform also requires additional IT staff to keep it operational. This creates a dependency on IT and a potential bottleneck for end users, because every new project may require spinning up clusters or new infrastructure.

As projects mature to production or expand to different teams, an HDI platform will require constant support and revision. Software upgrades must still be managed manually, even if this is simpler than with open source software installed on-premises. Costs can quickly accumulate, because HDI does not have automated ways to size clusters according to the workload at any given time.

A Different Approach to Big Data on Azure

Qubole provides a managed service focused on automating big data infrastructure in the cloud. Our policy-driven, workload-aware auto-scaling capability reduces the setup, administration, and operational management overhead of a big data platform. Similar to HDI, Qubole leverages the cloud data store and Azure’s compute to spin up infrastructure on demand. The difference is that Qubole looks at the workload itself and re-sizes the cluster accordingly. In addition, Qubole Data Service (QDS) automates the startup and shutdown of a cluster for you.
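To make the idea of workload-aware sizing concrete, here is a minimal Python sketch. It is not Qubole's actual algorithm or API: the ClusterPolicy class, the target_nodes function, and the tasks_per_node capacity are all hypothetical names and values chosen for illustration. It only shows the general pattern of deriving a node count from pending work and clamping it to policy bounds.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterPolicy:
    """Hypothetical policy: bounds on cluster size plus a per-node capacity."""
    min_nodes: int
    max_nodes: int
    tasks_per_node: int  # assumed number of concurrent tasks one node can run


def target_nodes(policy: ClusterPolicy, pending_tasks: int) -> int:
    """Return the node count needed for the pending work, clamped to the policy bounds."""
    needed = math.ceil(pending_tasks / policy.tasks_per_node)
    return max(policy.min_nodes, min(policy.max_nodes, needed))


# Illustrative bounds only; tasks_per_node=8 is an assumption, not a measured value.
policy = ClusterPolicy(min_nodes=2, max_nodes=50, tasks_per_node=8)
for pending in (0, 40, 1000):
    print(pending, "pending tasks ->", target_nodes(policy, pending), "nodes")
```

A real autoscaler would also weigh factors such as job progress and node startup time. The clamp to a minimum and maximum is the part a policy makes explicit.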

This type of automation enables platforms to scale cost-effectively as demand grows. By optimizing compute and the number of nodes in a cluster, organizations can expand and contract flexibly and no longer need to maintain clusters 24/7. As organizations scale into production workloads, this capability has a large impact on total cost of ownership (TCO).

As an example, an organization wants to run several data science workloads on HDI. It requires support and only wants to run the platform during working hours in one region, for a maximum of 10 hours per day, 5 days per week. It sets up the infrastructure for this team: a 50-node HDI cluster running on A10 VMs (8 CPUs, 56 GB of RAM, and 382 GB HDD). It also assigns two resources to manage the infrastructure and ensure that the cluster starts and stops at the agreed-upon times.

In contrast, with Qubole an administrator defines a cluster policy that matches the same hardware specification but can auto-scale between a minimum of 2 nodes and a maximum of 50 nodes. A data scientist submitting the first workload starts up the infrastructure through self-service. Throughout the day, the cluster re-sizes automatically based on activity. By optimizing over the course of days, weeks, and months, Qubole lowers infrastructure costs, enabling the team to scale as needs grow.
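The arithmetic behind this comparison can be sketched in a few lines of Python. The fixed-cluster figures (50 nodes, 10 hours per day, 5 days per week) come from the example above. The hourly node profile for the auto-scaled cluster is invented purely for illustration, as is the assumption of 52 working weeks per year, so the resulting percentage is not a measured or promised saving.

```python
HOURS_PER_DAY = 10     # from the example: at most 10 hours per day
DAYS_PER_WEEK = 5      # from the example: 5 days per week
WEEKS_PER_YEAR = 52    # assumption: the platform runs every week of the year
FIXED_NODES = 50       # from the example: 50-node fixed cluster

# Hypothetical node count for each of the 10 working hours on an auto-scaled
# cluster bounded by 2 and 50 nodes. These values are made up for illustration.
HOURLY_NODE_PROFILE = [2, 10, 25, 50, 50, 30, 20, 10, 5, 2]


def annual_node_hours(nodes_by_hour):
    """Sum one day's node-hours and scale to a year of working days."""
    return sum(nodes_by_hour) * DAYS_PER_WEEK * WEEKS_PER_YEAR


fixed = annual_node_hours([FIXED_NODES] * HOURS_PER_DAY)
autoscaled = annual_node_hours(HOURLY_NODE_PROFILE)
reduction = 1 - autoscaled / fixed

print(f"Fixed cluster:      {fixed} node-hours per year")
print(f"Auto-scaled sketch: {autoscaled} node-hours per year")
print(f"Reduction:          {reduction:.1%}")
```

Multiplying node-hours by your own per-node hourly rate gives a rough compute cost for each case. Staffing and support costs, which the example also mentions, are outside this sketch.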

TCO Comparison

Figure: Annual QDS vs. HDI

Collaboration and Productivity

An organization investing in big data will have additional personas running other big data workloads. Data engineers can create data pipelines for data analysts running business intelligence queries or data scientists running machine learning algorithms. QDS provides a single environment for each of these user types to access the same data and work together through a common interface. The ability to share queries, troubleshoot issues, and plug into BI tools such as Power BI or Tableau accelerates team productivity.

Most enterprises lack the deep experience to manage big data at scale. Qubole provides automation, scale, cost reduction, and self-service data access. As a managed service, Qubole also takes care of upgrading open source technology seamlessly, so your IT team can focus on other needs. Qubole lowers the barrier to scaling your big data initiatives on Azure, and trying another option on Azure is low risk with our Business Edition.

Free QDS Business Edition on Azure

Sign up for free* QDS Business Edition on Azure by visiting https://azure.qubole.com.
For detailed steps, visit https://docs.qubole.com/en/latest/quick-start-guide/Azure-quick-start-guide/azure.html

*Qubole offers Qubole Data Service (QDS) Business Edition at no cost, but usage is limited by Qubole compute hours (AVMU for Azure) per month, which is approximately a $1000/month value. You must provide your own Azure cloud account and you are responsible for the infrastructure costs managed by Qubole on your behalf.
