Hive on Qubole runs 4x faster than Hive on Alternative Platforms

October 15, 2019 by Updated October 24th, 2019

Introduction

ETL workloads form a major component of big data processing at any data-driven organization – from SMBs to enterprises, and ETL data pipelines at these organizations utilize engines like Hive to extract, transform and load high volumes of data to derive timely insights.

Qubole provides a managed and cloud-optimized implementation of Hive on AWS, Azure, and GCP. This blog highlights the benefits of Hive optimizations on Qubole and their direct impact on cost reduction when compared to alternatives. We have used the TPC-DS (Scale 3000) [1] benchmark to compare the speed and cost of Hive on Qubole using unpartitioned ORC data stored on AWS S3.

The Results

To run these benchmarks, we chose two cluster configurations with different node types and number of nodes, to avoid observation discrepancies arising from node type or count bias. The two cluster-configurations we chose were m5.4xlarge with 4 worker nodes and r5.4xlarge with 8 worker nodes, and we present the results below, where the red bars represent Qubole’s results and the blue bars a competing alternative platform in the market:

As shown above, Hive on Qubole outperformed Hive on the alternative in all queries, and was 4 times faster with the first configuration and 3 times faster with the second.

With respect to cost, we observed that the total cost [2] of running TPC-DS queries on Qubole (green bars) was 3.4 times and 5 times less compared to the alternative platform (yellow bars) on the two different configurations, as shown below:

Details:

1. Setup

Benchmark suite: TPC-DS
Scale: 3000
Format: ORC
Type: Unpartitioned
Storage: AWS S3
Statistics: Pre-computed column level stats
Instance type: m5.4xlarge
No. of worker nodes: four

Software & Components

We ran our experiments on the default Hive configuration available on both the alternative platform and Qubole’s setups, except the alternative platform did not have hive.strict.checks.cartesian.product disabled which caused few queries to fail. To get the results for those failing queries as well, we have disabled this config on that platform. Furthermore, we based this comparison on only the queries that succeeded on both platforms.

2. Results

The above chart shows the results for the first 20 queries (click for a larger image), where the red bars represent Qubole and the blue bars represent the alternative platform.

Key observations:

  • Hive on Qubole outperforms Hive on the alternative platform for all TPC-DS queries
  • Hive on Qubole is 4 times faster than Hive on the competing platform hive [3]
  • The cost of running Hive queries on Qubole is 3 to 5 times lower

The difference in the performance of Hive on Qubole can be attributed to the numerous and continuous optimizations to the Hive engine that the Qubole team makes. These range from faster writes, object store specific optimizations, ORC improvements, optimized resource allocation, utilization and efficient use of statistics.

At Qubole, our dedicated performance teams are working nonstop to identify new and innovative ways to improve performance.

This test was run using on-demand nodes and compute cost can be further brought down significantly by using Qubole’s cluster lifecycle management and intelligent low cost1 nodes management capabilities – more information on these features available here.

Qubole recently announced support for Hive 3.1.1 which has new improvements and features and is 1.8 times faster than Qubole 2.3.4 Hive. We encourage you to take advantage of the Hive performance improvements. If you don’t yet have a Qubole environment, you can sign up for a free 14-day Qubole Test Drive.

———-

1 Refers to AWS’ Spot instances, Azure’s Low Cost VMs, and Google’s Pre-emptible VMs
[1] TPC-H, TPC-DS, and TPCx-BB are some of the widely used industry-standard performance benchmarks to examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions. TPC-DS is a decision support benchmark for evaluating the read performance of big-data processing engines.
[2] This experiment was conducted on On-Demand hosts, so we have referenced Amazon EC2 (On-Demand) price and Qubole Compute Unit (QCU) on AWS to calculate the total cost.
[3] We have considered the geometric mean of the percentage improvement of all queries.
  • Blog Subscription

    Get the latest updates on all things big data.
  • Recent Posts

  • Categories

  • Events

    AWS re:Invent

    Dec. 2, 2019 | Las Vegas, NV