Here at Qubole, our core focus is on providing the best platform to analyze data in the Cloud. We are simplifying the complicated infrastructure necessary for analytics and making the tools accessible for users with various skill sets and experience levels. We have also continued to optimize Hadoop and Hive for use in the Cloud and are proud to report that we are faster than ever! In our latest direct comparison with Amazon’s Elastic MapReduce (EMR) performance, Qubole was up to:
- 2x faster in launching a cluster
- 5x faster in query execution against data in S3
- 2x faster in writing data to S3
For details on how we set up these tests, keep reading. And if you have feedback on scenarios that you’d like to see us explore and test in the future, let us know!
We used 3-node Hadoop clusters in Qubole and Amazon’s distribution of EMR. All the instances were of type m1.xlarge and were on-demand instances. As is standard in Hadoop, one instance (master) was dedicated to running the Hadoop JobTracker and NameNode. In both cases, Hive queries were run from the master node. Qubole used a local (Derby) database as the Hive metastore while EMR used a local mysql instance.
Hadoop and Hive are highly sensitive to configuration parameters. We used default configuration settings for each platform – so the measurements here reflect the default experience for a user on these platforms. There were two exceptions. In EMR, we set the parameter hive.optimize.s3.query to true. This parameter is supposed to optimize queries against S3 datasets per EMR documentation. Furthermore, in Qubole, we disabled our caching framework that automatically caches S3 data in HDFS as we wanted to do a fair comparison where both systems access S3 data directly.
Cluster Bringup Time
Launching the above mentioned cluster took 302 seconds in EMR, while it took 147 seconds in Qubole. Qubole was 2x faster. In the case of Qubole, a cluster is considered to be launched when HDFS is up and running. In the case of EMR, we considered startup time to be the difference between StartDate and CreationDate in the AWS console. Launching a 11-node cluster showed similar characteristics.
Dataset and Queries
In our experiments, we used the popular TPC-H dataset. We generated a TPC-H scale 5 (5GB) dataset using the dbgen utility. Users tend to partition data by day. To simulate this, we split the lineitem into 1000 partitions (for approximately 3 years of data), each with one file. None of the other tables were partitioned. The data was in delimited text format. We uploaded the dataset to S3 using the s3cmd utility. We created external Hive tables against datasets residing in S3. Here’s a sample DDL for the lineitem table:
create external table lineitem (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY INT, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING) partitioned by (dummy STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 's3://public-qubole/datasets/tpch5G/lineitem/txt_part_1000'; alter table lineitem recover partitions;
Hive does not yet support the original forms of the TPC-H queries. We used the modified versions of the TPC-H queries contributed by Reynold Xin available here. We’ve uploaded the data to a publically accessible S3 location. The complete set of DDLs and queries is available in our public bitbucket repository via the following git command:
git clone 'https://bitbucket.org/qubole/tpch.git'
For each query, we took an average of 3 runs in Qubole and EMR and plotted speedup times calculated as
Speedup = EMR_time / Qubole_time
Qubole shows an average speedup of ~4x over EMR. We repeated the experiment for smaller scale dataset of 1GB with 303 partitions mimicing the use-case of queries over data filtered over time (e.g. last year of data). Qubole showed an average speedup of 5x for the smaller datasets.
Writing data to S3
The TPC-H queries tested above are read-heavy. We wanted to compare write performance to S3 in both platforms. We performed a table loading test by creating a partitioned version of the orders table using the following statements:
create external table orders_part (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) partitioned by (O_ORDERDATE STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 's3:///tpch/tpch5G/orders_part'; insert overwrite table orders_part partition(o_orderdate) select O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT, O_ORDERDATE from orders;
Qubole executed the above statement twice as fast as Amazon Elastic MapReduce.
Query performance also has an impact on user experience – cleansing and analyzing a few terabytes of semi-structured data such as social media data, web and online gaming logs can be time consuming. With Qubole, in many cases, these processes can finish in a fraction of the time and cost.
Developers authoring queries for their pipelines or business analysts preparing or analyzing data can be much more productive. Of course, performance comparisons have their limitations and the proof of the pudding is in the eating. So don’t take our word for it – bring your datasets and workloads and signup for a free trial!