The Accenture Technology Labs Hadoop Deployment Comparison study recently stated something that we at Qubole have known for a long time: an investment in Hadoop-as-a-Service has many advantages over implementing a bare-metal Hadoop cluster.
The study used Accenture’s Data Platform Benchmark to assess the Total Cost of Ownership for both solutions. This method of analysis produced a more useful conclusion than a comparison based on more limited metrics would have.
The approach was thorough and well executed, using a cluster with a client node, a primary NameNode, a secondary NameNode, a JobTracker node and 22 worker-node servers, each of which ran a DataNode for HDFS as well as a TaskTracker for MapReduce. With a test setup of this breadth, the results should be consistent across a wide range of industries, uses and workloads.
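To make the layout described above concrete, here is a minimal sketch that models it as data and counts nodes and daemons. The `NodeGroup` class, the role names and the helper functions are illustrative assumptions, not anything taken from the Accenture study; only the node counts and daemon names come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    """A group of identical nodes (hypothetical helper type for illustration)."""
    role: str
    count: int
    daemons: tuple


# Node counts and daemons as described in the text; role labels are our own.
CLUSTER = [
    NodeGroup("client", 1, ()),
    NodeGroup("primary-namenode", 1, ("NameNode",)),
    NodeGroup("secondary-namenode", 1, ("SecondaryNameNode",)),
    NodeGroup("jobtracker", 1, ("JobTracker",)),
    NodeGroup("worker", 22, ("DataNode", "TaskTracker")),
]


def total_nodes(cluster):
    """Total number of machines in the cluster."""
    return sum(group.count for group in cluster)


def daemon_count(cluster, daemon):
    """Number of machines that run the given daemon."""
    return sum(group.count for group in cluster if daemon in group.daemons)


print(total_nodes(CLUSTER))                  # 26
print(daemon_count(CLUSTER, "DataNode"))     # 22
print(daemon_count(CLUSTER, "TaskTracker"))  # 22
```

In other words, the setup amounts to 26 machines, 22 of which do both storage (HDFS) and compute (MapReduce) work.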
The study’s headline finding was clear: “Accenture’s study revealed that Hadoop-as-a-Service offers better price-performance ratio.”
It also debunked the idea that Hadoop clusters run slowly in virtual environments compared to their bare-metal equivalents. We have long believed that idea to be mistaken, and the report’s categorical finding backs up that belief.
“Throughout the study, we learned that the I/O virtualization overhead that comes with the cloud environment is a worthy investment to enable optimization opportunities relevant only to the cloud.”
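One simple way to read a price-performance comparison is sketched below: hold the monthly budget equal for both options and compare what a single run of the workload costs. This is our own illustration, not the formula used in the Accenture study, and every number in it is hypothetical.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def cost_per_run(monthly_tco_usd, runtime_minutes):
    """Share of the monthly TCO consumed by one run of the workload."""
    return monthly_tco_usd * runtime_minutes / MINUTES_PER_MONTH


def price_performance_gain(baseline_cost, candidate_cost):
    """How many times cheaper per run the candidate is than the baseline."""
    return baseline_cost / candidate_cost


# Hypothetical figures for illustration only; not taken from the study.
bare_metal = cost_per_run(monthly_tco_usd=10_000, runtime_minutes=60)
cloud = cost_per_run(monthly_tco_usd=10_000, runtime_minutes=40)

print(round(price_performance_gain(bare_metal, cloud), 2))  # 1.5
```

With equal budgets, the ratio reduces to the ratio of runtimes, which is why any tuning that shortens a job translates directly into better price-performance.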
The speed and customization of a cloud-based solution also allowed for further increases in performance compared to a bare-metal equivalent. One aspect of this that the paper discusses is performance tuning.
In bare-metal offerings, performance tuning for individual tasks is often difficult and time consuming. Because the hardware is configured for multiple uses, it is regularly set up to work adequately for all of the tasks rather than optimally for any one of them. And because the same cluster serves many workloads across the enterprise, tuning it for an individual task takes considerable time.
The nature of Hadoop-as-a-Service, whereby clusters are used for individual tasks, means that they can be easily customized and tuned for specific needs. This is a considerable advantage, as it improves the speed and reliability of each individual task.
One of the issues that Accenture pointed out was that tuning in the cloud is often more difficult and requires a deep knowledge of Hadoop. They worked through this with an automated tuning system on the Amazon EMR service. For this they used Starfish, which helped to optimize Hadoop for the tasks.
Given that this study was conducted on Amazon EMR, it is important to state that Qubole’s service benefits from a highly optimized and tuned deployment of Hadoop that provides faster and more efficient job scheduling as well as auto-scaling benefits.
Given this optimization opportunity in the Qubole platform, and the attention the paper gives to tuning, the research would perhaps have benefited from running the same tests across multiple cloud-based systems rather than purely on Amazon EMR.
Qubole’s offerings would have bypassed some of the issues and additional expenses that come with sourcing and licensing external optimization programs.
Alongside this customizable auto-tuning facility, an important aspect to consider is the variability in speed and performance among the various Hadoop-as-a-Service providers.
Qubole Outperforms Amazon EMR
A recent study showed that Qubole outperformed Amazon EMR on a series of KPIs, including:
- 2x faster in launching a cluster
- 5x faster in query execution against data in S3
- 2x faster in writing data to S3
Including differences like these would have created a better range of results and comparisons between the two options. Although the results of this study were unequivocally in favor of Hadoop-as-a-Service, it would have been beneficial to look at a broader spectrum of providers.
In conclusion, we are pleased with the findings in this report, as they uphold many aspects of Hadoop-as-a-Service that we have known to be the case. That an independent paper has shown the advantages in a well-researched and fully documented format will hopefully push forward the idea that we have long held: Hadoop-as-a-Service is the best option for many companies and enterprises today.