Reducing Big Data TCO on Azure with Qubole

Published May 9, 2018

Many companies start their big data cloud journey on Azure by testing HDInsight (HDI), Microsoft’s native offering. With data already in Blob Storage or Azure Data Lake Store (ADLS), HDI makes it easy to get started on projects.

For an experienced team with deep big data expertise, including data engineers and data scientists on staff, configuring and tuning infrastructure on HDI can make sense. For most organizations, however, managing and tuning big data infrastructure can quickly become a barrier to scaling beyond a departmental project or proof of concept. The setup requires configuring many components separately to fit the platform to your specific use cases, and keeping the platform operational requires additional IT staff. This creates a dependency on IT and a potential bottleneck for end users, because every new project requires manual setup and maintenance of clusters and new infrastructure.

As projects mature to production or expand to different teams, an HDI platform will require constant support and revision. Software upgrades must also be managed manually. Costs can quickly accumulate because HDI has no automated way to scale clusters to match the workload at any given time.

A Different Approach to Activating Big Data on Azure

Qubole provides a cloud-native activation platform that automates big data infrastructure management in the cloud. Our policy-driven, workload-aware auto-scaling capability reduces setup, administration, and operational management overhead for big data. Qubole does not store your big data; it leverages Azure storage and compute and automates infrastructure management on demand. The difference with Qubole Data Service (QDS) is that it learns from the different workloads and re-sizes clusters accordingly, and it automates the start-up and shut-down of clusters.

This type of automation enables platforms to scale cost-effectively as demand grows. By optimizing the number of compute nodes in a cluster, organizations can expand and shrink their infrastructure footprint as needed and no longer have to maintain clusters 24/7, which reduces delays and waste. As organizations scale into production workloads, this capability has a large impact on TCO.

For example, consider an organization that wants to run several data science workloads on HDI. The team requires R Server support but only wants to run the platform during working hours in one region, for a maximum of 10 hours per day, 5 days per week. IT sets up the infrastructure for this team: a 50-node HDI cluster running on D4 v2 Azure VMs. IT would also need to assign two administrators to manage the infrastructure and ensure that the cluster starts and stops at predetermined times.

In contrast, with Qubole an administrator defines a cluster policy that matches the same hardware specification but can auto-scale from a minimum of 2 nodes to a maximum of 50 nodes. The first workload a data scientist submits triggers an automatic start-up of the infrastructure. Throughout the day, the cluster re-sizes automatically based on activity. By optimizing over the course of days, weeks, and months, Qubole lowers infrastructure costs and lets the team scale as needs grow.
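
As a rough illustration of the example above, the sketch below compares node-hours for a fixed 50-node cluster with a cluster that auto-scales between 2 and 50 nodes over the same 10-hour working day. The hourly demand profile is hypothetical and chosen only for illustration. It is not measured data, it does not reproduce the TCO figures in this post, and it ignores pricing and administration costs. The names `clamp_nodes` and `hourly_demand` are illustrative and are not part of any Qubole or Azure API.

```python
# Illustrative node-hour comparison. The demand profile is HYPOTHETICAL.
MIN_NODES = 2        # auto-scaling floor from the example
MAX_NODES = 50       # auto-scaling ceiling, also the fixed cluster size
HOURS_PER_DAY = 10   # working hours per day from the example
DAYS_PER_WEEK = 5
WEEKS_PER_YEAR = 52


def clamp_nodes(requested: int) -> int:
    """Keep the cluster size within the policy's minimum and maximum."""
    return max(MIN_NODES, min(MAX_NODES, requested))


# Hypothetical number of nodes the workload needs in each working hour.
# Assumption: the cluster stays at the minimum size while idle during the day.
hourly_demand = [0, 5, 20, 50, 60, 30, 10, 10, 5, 1]
assert len(hourly_demand) == HOURS_PER_DAY

working_days = DAYS_PER_WEEK * WEEKS_PER_YEAR

# Fixed cluster: all 50 nodes run for every working hour.
fixed_daily = MAX_NODES * HOURS_PER_DAY
fixed_annual = fixed_daily * working_days

# Auto-scaled cluster: node count follows demand, clamped to the policy.
autoscaled_daily = sum(clamp_nodes(n) for n in hourly_demand)
autoscaled_annual = autoscaled_daily * working_days

reduction_pct = 100 * (fixed_annual - autoscaled_annual) / fixed_annual

print(f"Fixed cluster:       {fixed_annual:,} node-hours per year")
print(f"Auto-scaled cluster: {autoscaled_annual:,} node-hours per year")
print(f"Node-hour reduction: {reduction_pct:.1f}% (for this hypothetical profile)")
```

The result depends entirely on the demand profile: a team that keeps all 50 nodes busy for the whole day would see no reduction in node-hours, while a bursty workload would see a large one.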

TCO Comparison

The chart below compares annual TCO for a 50-node cluster on Qubole versus HDInsight using D4 v2 Azure instances. By using sophisticated workload-aware auto-scaling and cluster lifecycle management, Qubole delivers 52% lower TCO compared to HDInsight.

Performance Comparison

Performance is another important factor to consider when choosing a big data activation platform. We used the industry-standard TPC-DS benchmark to compare the query performance of Qubole and HDInsight.

A summary of the results shows:

  1. Apache Spark on Qubole outperforms Apache Spark on HDInsight by over 40%
  2. Apache Hive on Qubole outperforms Apache Hive on HDInsight by about 17%

For the comparison, we ran a subset of TPC-DS queries as-is without any modifications. We measured end-to-end times for all queries, and not just query execution or run times. Below is the setup we used for the comparison.

Hardware configuration (cluster totals):
Machine type: 5 nodes of D14 v2
CPU cores: 80 virtual cores
Memory: 560 GB
Local disk: 4 TB

Data sets:
TPC-DS at scale factor 1,000 on Azure Blob Storage for the Spark 2.1.1 comparison.
TPC-DS at scale factor 1,000 on Azure Data Lake Store for the Hive 1.2 comparison.
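
The totals in the hardware configuration can be cross-checked from per-node figures. The sketch below assumes the published Azure specification for a single D14 v2 VM (16 vCPUs, 112 GB of memory, 800 GB of local temporary SSD storage). These per-node numbers are not stated in this post and are included only as an assumption for the check.

```python
# Cross-check of the cluster totals listed in the hardware configuration.
# Per-node figures are ASSUMED from the Azure D14 v2 specification.
NODES = 5
VCPUS_PER_NODE = 16
MEMORY_GB_PER_NODE = 112
LOCAL_DISK_GB_PER_NODE = 800

total_vcpus = NODES * VCPUS_PER_NODE
total_memory_gb = NODES * MEMORY_GB_PER_NODE
total_local_disk_tb = NODES * LOCAL_DISK_GB_PER_NODE / 1000

print(f"Virtual cores: {total_vcpus}")
print(f"Memory (GB):   {total_memory_gb}")
print(f"Local disk:    {total_local_disk_tb:g} TB")
```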

As shown in the results below, Qubole delivers better performance because of its auto-scaling capabilities and the optimizations we have built on top of open-source engines such as Apache Spark and Hive.

  1. Apache Spark 2.1.1 comparison:
    The chart below compares total query times for 19 queries on Apache Spark 2.1.1 on Qubole vs. HDInsight. Qubole delivers 40% better performance on total query time and up to 60% better performance on a few queries.

  2. Apache Hive 1.2 comparison:
    The chart below compares total query times for 18 queries on Apache Hive 1.2 on Qubole vs. HDInsight. Qubole delivers 17% better performance on total query time across the 18 queries and up to 40% better performance on a few queries.
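
For readers who want to run a similar comparison, the sketch below shows one way to turn per-query end-to-end times into a total and a per-query improvement. It assumes that "better performance" means a percentage reduction in elapsed time, which is one possible reading of the figures above and is not confirmed by this post. The query names and timings are hypothetical placeholders, not the benchmark results.

```python
# HYPOTHETICAL end-to-end times in seconds; not the benchmark results.
baseline_times = {"q1": 100.0, "q2": 200.0, "q3": 300.0}
candidate_times = {"q1": 60.0, "q2": 150.0, "q3": 150.0}


def percent_reduction(baseline_s: float, candidate_s: float) -> float:
    """Percentage by which the candidate's elapsed time is lower."""
    return 100 * (baseline_s - candidate_s) / baseline_s


# Improvement on the summed end-to-end time across all queries.
total_reduction = percent_reduction(
    sum(baseline_times.values()), sum(candidate_times.values())
)

# Improvement for each query, to find the best single-query result.
per_query = {
    name: percent_reduction(baseline_times[name], candidate_times[name])
    for name in baseline_times
}
best_query = max(per_query, key=per_query.get)

print(f"Total time reduction: {total_reduction:.1f}%")
print(f"Best single query: {best_query} at {per_query[best_query]:.1f}%")
```

Reporting both numbers matters because a total can be dominated by a few long-running queries, so the total and the best per-query figure can differ widely.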

Collaboration and Productivity

An organization investing in big data will have multiple user types running diverse big data workloads. For example, data engineers create data pipelines for data analysts running business intelligence queries or data scientists running machine learning algorithms. QDS provides a single environment for each of these user types to access the same data and work together through a common interface. The ability to share queries, troubleshoot issues, and use common BI tools such as Power BI or Tableau accelerates team productivity. Qubole offers multiple productivity interfaces within the platform:

  1. Notebooks: Unlike HDInsight, which uses a Livy server integration for notebooks, Qubole notebooks are natively integrated with Apache Spark. This delivers better multi-tenancy for notebooks. Additionally, we have built sophisticated capabilities such as offline viewing, auto-saving to cloud storage, scheduling of notebooks as jobs, dashboards, collaboration, and Access Control Lists (ACLs) for Qubole notebooks.
  2. Analyst workbench: Qubole offers a native analyst workbench within the platform with query history and search that allows multiple users to easily collaborate.
  3. Workflow: Qubole offers built-in scheduling and workflow in addition to support for Apache Airflow.


Most enterprises lack the deep experience needed to manage big data at scale on Azure. Qubole provides the automation, scale, cost reduction, performance, and self-service data access that allow enterprises to activate their big data. Qubole also takes care of upgrading open-source technology seamlessly, so your IT team can focus on more productive tasks. In short, Qubole lowers the barrier to scaling your big data initiatives on Azure.