Hadoop in the Cloud: Qubole shows 2x – 8x speedup in performance over Apache Hadoop

May 14, 2014 (updated September 15, 2021)


Qubole aims to provide the best platform for big data analysis in the cloud. Previous posts discussed our Hadoop/Hive optimizations for the cloud, a performance analysis of Qubole versus Amazon EMR, and our Presto offering. This post compares Qubole with Apache Hadoop/Hive in the cloud.


For Qubole, the cluster consisted of one master (m1.large) and 10 slaves (m1.xlarge). For Apache Hadoop, we used essentially the same configuration, except that we used three master nodes to run the replica master and ZooKeeper. This difference in configuration should not have any noticeable impact on performance. We tested the following open source versions:

  • Hive 0.11 and Hadoop 1.2.0
  • Hive 0.12 and Hadoop 2.2.0



We used default configurations in both setups with a few minor changes (applied to both) which are listed here:

  • hive.auto.convert.join=true
  • hive.auto.convert.join.noconditionaltask=true
  • mapred.child.java.opts=-Xmx1536m for Apache (as the default configuration of Apache is -Xmx200m which is insufficient to launch map tasks)
  • hive.mapper.cannot.span.multiple.partitions=false
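As a rough illustration, the settings above could be rendered as per-session Hive `SET` statements. This is only a sketch: it assumes the overrides are applied per session rather than in the site XML files, and the helper name `hive_set_statements` and the platform labels are hypothetical, not part of the original setup.

```python
# Settings copied from the list above. How they were actually applied
# (session-level SET vs. site XML) is an assumption of this sketch.
COMMON_SETTINGS = {
    "hive.auto.convert.join": "true",
    "hive.auto.convert.join.noconditionaltask": "true",
    "hive.mapper.cannot.span.multiple.partitions": "false",
}

# The heap override is listed for Apache only, where the default is -Xmx200m.
APACHE_ONLY_SETTINGS = {
    "mapred.child.java.opts": "-Xmx1536m",
}


def hive_set_statements(platform):
    """Return Hive SET statements for 'qubole' or 'apache' (hypothetical labels)."""
    settings = dict(COMMON_SETTINGS)
    if platform == "apache":
        settings.update(APACHE_ONLY_SETTINGS)
    return [f"SET {key}={value};" for key, value in settings.items()]


if __name__ == "__main__":
    for statement in hive_set_statements("apache"):
        print(statement)
```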



In our experiments, we used the standard TPC-H dataset. We generated a 75GB dataset using the dbgen utility and split it into 50 partitions, each containing several delimited text files. We then uploaded the data to Amazon S3 using the s3cmd utility. Finally, we created partitioned external Hive tables over this dataset.
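The table setup described above might look something like the following sketch, which generates Hive DDL for one TPC-H table. Several details are assumptions rather than facts from this post: the bucket path `s3n://example-bucket/tpch/orders`, the integer partition column `part_id`, the choice of the `orders` table, and storing the date column as a string. The `|` delimiter matches what dbgen emits by default.

```python
NUM_PARTITIONS = 50  # partition count stated in the post

# Hypothetical S3 location; the actual bucket and layout are not given.
BASE_LOCATION = "s3n://example-bucket/tpch/orders"

# Columns of the TPC-H orders table; Hive types are chosen for this sketch.
ORDERS_COLUMNS = [
    ("o_orderkey", "BIGINT"),
    ("o_custkey", "BIGINT"),
    ("o_orderstatus", "STRING"),
    ("o_totalprice", "DOUBLE"),
    ("o_orderdate", "STRING"),
    ("o_orderpriority", "STRING"),
    ("o_clerk", "STRING"),
    ("o_shippriority", "INT"),
    ("o_comment", "STRING"),
]


def create_table_ddl():
    """Build a CREATE EXTERNAL TABLE statement over pipe-delimited text files."""
    columns = ",\n".join(f"  {name} {hive_type}" for name, hive_type in ORDERS_COLUMNS)
    return (
        "CREATE EXTERNAL TABLE orders (\n"
        f"{columns}\n"
        ")\n"
        "PARTITIONED BY (part_id INT)\n"  # hypothetical partition column
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'\n"
        f"LOCATION '{BASE_LOCATION}';"
    )


def add_partition_ddl(part_id):
    """Register one partition directory with the external table."""
    return (
        f"ALTER TABLE orders ADD PARTITION (part_id={part_id}) "
        f"LOCATION '{BASE_LOCATION}/part_id={part_id}';"
    )


statements = [create_table_ddl()] + [
    add_partition_ddl(i) for i in range(NUM_PARTITIONS)
]

if __name__ == "__main__":
    print("\n".join(statements[:3]))
```

Because the tables are external, dropping them in Hive leaves the files in S3 untouched, so the same uploaded dataset can be reused across the clusters being compared.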


Below are the results comparing the speedups offered by Qubole against the two versions of Apache Hadoop.

Figure: Performance of Qubole against Apache Hadoop

Qubole shows 2x-8x speedup for the query set over Apache Hadoop 1.x and 2.x platforms.
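For readers who want to reproduce this kind of comparison, a speedup figure is conventionally the baseline run time divided by the run time on the system under test. The sketch below assumes that definition; the post does not spell out its exact formula. The function names are hypothetical and the timings are placeholder values chosen for round numbers, not measurements from this benchmark.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Speedup of the candidate over the baseline (values above 1 mean faster)."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("run times must be positive")
    return baseline_seconds / candidate_seconds


def speedup_range(timings):
    """Return (min, max) speedup over a dict of query -> (baseline, candidate)."""
    ratios = [speedup(base, cand) for base, cand in timings.values()]
    return min(ratios), max(ratios)


# Placeholder timings in seconds: (Apache Hadoop, Qubole). Illustrative only.
example_timings = {
    "query_a": (200.0, 100.0),
    "query_b": (800.0, 100.0),
}

low, high = speedup_range(example_timings)

if __name__ == "__main__":
    print(f"speedup range: {low:.1f}x to {high:.1f}x")
```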


Qubole previously showed substantially better performance than Amazon EMR. These results show a similar advantage over Apache Hadoop, which supports our claim of offering the best Hadoop service in the cloud.


  • We have not yet tested the recently released Apache Hadoop 2.4.0 and Hive 0.13, but they are on our radar.
  • We used Hortonworks Data Platform for the Apache Hadoop testing because its built-in Ambari support makes installation easy.