Hadoop in the Cloud: Qubole shows 2x – 8x speedup in performance over Apache Hadoop

May 14, 2014 (updated January 5th, 2024)


Qubole aims to provide the best platform for big data analysis in the cloud. In previous posts, we discussed our Hadoop/Hive optimizations for the cloud, a performance analysis of Qubole versus Amazon EMR, and our Presto offering. In this post, we discuss how Qubole compares to Apache Hadoop/Hive in the cloud.

Setup

For Qubole, we used a cluster of one master node (m1.large) and 10 slave nodes (m1.xlarge). For Apache Hadoop, we used essentially the same configuration, except that we used three master nodes to run the replica master and ZooKeeper. This difference in configuration should not have any noticeable impact on performance. We tested the following open source versions:

  • Hive 0.11 and Hadoop 1.2.0
  • Hive 0.12 and Hadoop 2.2.0

Configuration

We used the default configurations in both setups, with a few minor changes (applied to both unless noted otherwise), listed here:

  • hive.auto.convert.join=true
  • hive.auto.convert.join.noconditionaltask=true
  • mapred.child.java.opts=-Xmx1536m for Apache (the Apache default of -Xmx200m is insufficient to launch map tasks)
  • hive.mapper.cannot.span.multiple.partitions=false
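One way to apply these overrides is per session, as Hive SET statements placed ahead of the queries. The post does not say whether the settings were applied per session or in the site configuration files, so the sketch below is only an assumption about the mechanism. The helper name render_set_statements is hypothetical.

```python
# Settings as listed in the post. Per the post, mapred.child.java.opts
# was the override used for the Apache setup.
SETTINGS = {
    "hive.auto.convert.join": "true",
    "hive.auto.convert.join.noconditionaltask": "true",
    "mapred.child.java.opts": "-Xmx1536m",
    "hive.mapper.cannot.span.multiple.partitions": "false",
}


def render_set_statements(settings):
    """Return one Hive `SET key=value;` line per setting, in insertion order."""
    return [f"SET {key}={value};" for key, value in settings.items()]


if __name__ == "__main__":
    # Print the statements so they can be pasted at the top of a query file.
    print("\n".join(render_set_statements(SETTINGS)))
```

Keeping the overrides in one dictionary makes it easy to apply the same list to both setups, which matters when comparing them.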

Dataset

In our experiments, we used the standard TPC-H dataset. We generated a 75GB dataset using the dbgen utility and split it into 50 partitions, with several delimited text files per partition. We then uploaded the data to Amazon S3 using the s3cmd utility. Finally, we created partitioned external Hive tables against this dataset.
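To make the last step concrete, here is a minimal sketch that generates Hive DDL for one partitioned external table over pipe-delimited dbgen output. It is not the script used for this benchmark. The bucket name, the s3n:// path layout, the partition column name (part), and the choice of column types are all assumptions for illustration. The column names follow the TPC-H orders table.

```python
# Columns of the TPC-H "orders" table. The Hive types are an assumption
# (dates are kept as STRING for compatibility with older Hive versions).
ORDERS_COLUMNS = [
    ("o_orderkey", "BIGINT"),
    ("o_custkey", "BIGINT"),
    ("o_orderstatus", "STRING"),
    ("o_totalprice", "DOUBLE"),
    ("o_orderdate", "STRING"),
    ("o_orderpriority", "STRING"),
    ("o_clerk", "STRING"),
    ("o_shippriority", "INT"),
    ("o_comment", "STRING"),
]


def create_table_ddl(table, columns, location, partition_col="part"):
    """Build a CREATE EXTERNAL TABLE statement for '|'-delimited text files."""
    cols = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{cols}\n)\n"
        f"PARTITIONED BY ({partition_col} INT)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'\n"
        f"LOCATION '{location}';"
    )


def add_partition_ddl(table, location, index, partition_col="part"):
    """Build an ALTER TABLE statement that registers one partition directory."""
    return (
        f"ALTER TABLE {table} ADD PARTITION ({partition_col}={index}) "
        f"LOCATION '{location}{partition_col}={index}/';"
    )


if __name__ == "__main__":
    base = "s3n://example-bucket/tpch/orders/"  # hypothetical bucket and path
    print(create_table_ddl("orders", ORDERS_COLUMNS, base))
    for i in range(50):  # 50 partitions, as described in the post
        print(add_partition_ddl("orders", base, i))
```

Because the tables are external, dropping them in Hive leaves the files in S3 untouched, so the same uploaded dataset can be reused across the clusters being compared.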

Benchmark

Below are the results comparing the speedups offered by Qubole against the two versions of Apache Hadoop.

Performance of Qubole against Apache Hadoop

Qubole shows a 2x-8x speedup across the query set over the Apache Hadoop 1.x and 2.x platforms.
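For readers who want to run a similar comparison, speedup here can be read as the baseline runtime divided by the Qubole runtime for the same query. The sketch below assumes that definition. The function names are hypothetical, and it contains no measurements from this benchmark.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate ran than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / candidate_seconds


def speedups_by_query(baseline, candidate):
    """Compute per-query speedups for queries present in both result sets.

    Both arguments map a query name to its runtime in seconds.
    """
    return {
        query: speedup(baseline[query], candidate[query])
        for query in baseline
        if query in candidate
    }
```

With this definition, a query that takes 100 seconds on the baseline and 50 seconds on the candidate has a speedup of 2x.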

Conclusion

Qubole had previously set a high benchmark by delivering vastly superior performance over Amazon EMR. It continues that record here with much better performance than Apache Hadoop, reinforcing its claim of offering the best Hadoop service in the cloud.

Disclaimers:

  • We have not yet tested the recently released Apache Hadoop 2.4.0 and Hive 0.13, but they are on our radar.
  • We used Hortonworks Data Platform for Apache Hadoop testing since it has built-in Ambari support, which makes installation easy.