Hadoop in the Cloud: Qubole shows 2x – 8x speedup in performance over Apache Hadoop

May 14, 2014 | Updated March 1st, 2019

Qubole aims to provide the best platform for big data analysis in the cloud. In previous posts, we discussed our Hadoop/Hive optimizations for the cloud, a performance analysis of Qubole versus Amazon EMR, and our Presto offering. In this post, we discuss how Qubole compares to Apache Hadoop/Hive in the cloud.


For Qubole, we used a cluster with one master node (m1.large) and 10 slave nodes (m1.xlarge). For Apache Hadoop, we used essentially the same configuration, except that we used three master nodes to run the replica master and ZooKeeper. We do not expect this difference in configuration to have any noticeable impact on performance. We tested the following open source versions:

  • Hive 0.11 and Hadoop 1.2.0
  • Hive 0.12 and Hadoop 2.2.0



We used default configurations in both setups with a few minor changes (applied to both) which are listed here:

  • hive.auto.convert.join=true
  • hive.auto.convert.join.noconditionaltask=true
  • mapred.child.java.opts=-Xmx1536m for Apache Hadoop (the Apache default of -Xmx200m is insufficient to launch map tasks)
  • hive.mapper.cannot.span.multiple.partitions=false
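
For readers who want to reproduce these settings in a Hive session, the list above can be rendered as `SET` statements. The following is a minimal sketch, not the script we used: the dictionary and the helper name `to_set_statements` are illustrative, and the property values are copied from the list.

```python
# Properties copied from the list above. Per the post, the
# mapred.child.java.opts override was needed for the Apache setup.
HIVE_SETTINGS = {
    "hive.auto.convert.join": "true",
    "hive.auto.convert.join.noconditionaltask": "true",
    "mapred.child.java.opts": "-Xmx1536m",
    "hive.mapper.cannot.span.multiple.partitions": "false",
}


def to_set_statements(settings):
    """Render each property as a Hive `SET key=value;` statement."""
    return [f"SET {key}={value};" for key, value in settings.items()]


if __name__ == "__main__":
    # Print the statements so they can be pasted at the top of a Hive script.
    print("\n".join(to_set_statements(HIVE_SETTINGS)))
```

Session-level `SET` statements keep the overrides next to the queries they affect; the same properties could instead be placed in the cluster's site configuration files.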



In our experiments, we used the standard TPC-H dataset. We generated a 75GB dataset using the dbgen utility and split it into 50 partitions, each containing several delimited text files. We then uploaded the data to Amazon S3 using the s3cmd utility. Finally, we created partitioned external Hive tables against this dataset.
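
As a rough illustration of the last step, registering 50 S3-backed partitions with an external Hive table can be scripted. This is a sketch under stated assumptions: the bucket `example-bucket`, the directory layout, and the partition column `part_id` are hypothetical and do not come from our actual setup; only the `lineitem` table name (a standard TPC-H table) and the partition count are taken from the benchmark.

```python
def add_partition_statements(table, base_uri, num_partitions, column="part_id"):
    """Build one ALTER TABLE ... ADD PARTITION statement per partition.

    `column` and the `<base_uri>/<column>=<n>/` layout are assumptions
    made for this example, not details of the benchmark setup.
    """
    statements = []
    for i in range(num_partitions):
        location = f"{base_uri.rstrip('/')}/{column}={i}/"
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({column}={i}) LOCATION '{location}';"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical bucket and prefix; replace with a real location.
    for stmt in add_partition_statements(
        "lineitem", "s3://example-bucket/tpch/lineitem", 50
    ):
        print(stmt)
```

Because the tables are external, these statements only record where each partition's files live; the data itself stays in S3.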


Below are the results comparing the speedups offered by Qubole against the two versions of Apache Hadoop.

Performance of Qubole against Apache Hadoop

Qubole shows a 2x to 8x speedup on the query set over the Apache Hadoop 1.x and 2.x platforms.
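
For clarity, speedup here is taken to mean the ratio of a query's runtime on the baseline platform to its runtime on Qubole. The sketch below shows that arithmetic only; the query names and runtimes are placeholders chosen for illustration, not measurements from this benchmark.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / candidate_seconds


# Placeholder (baseline, candidate) runtimes in seconds; NOT measured results.
placeholder_runtimes = {
    "query_a": (400.0, 100.0),
    "query_b": (300.0, 150.0),
}

for name, (baseline, candidate) in placeholder_runtimes.items():
    print(f"{name}: {speedup(baseline, candidate):.1f}x")
```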


Qubole had previously set a high benchmark by substantially outperforming Amazon EMR. These results show that it also performs much better than Apache Hadoop, which supports its claim to offer the best Hadoop service in the cloud.


  • We have not yet tested the recently released Apache Hadoop 2.4.0 and Hive 0.13, but they are on our radar.
  • We used the Hortonworks Data Platform for Apache Hadoop testing because its built-in Ambari support makes installation easy.