Big Data’s Moment in the Cloud Has Been Acknowledged

Published January 29, 2016. Updated September 19, 2017.

We were delighted to see the announcement of the latest version of Cloudera Director, and a corresponding write-up on Curt Monash’s DBMS2 blog. The industry’s movement toward cloud-optimized features, such as support for Spot Instances and dynamic creation and termination of clusters, validates the direction that we’ve set for our company and product.

Qubole’s Vision

At Qubole, we believe strongly that Big Data analytics workloads belong in the cloud. Our vision is to take all the benefits of the cloud and apply them to Big Data. We’ve believed this from day one, way back when Ashish Thusoo and Joydeep Sen Sarma founded the company in 2011. It’s why we have a product that runs on all three major public infrastructure clouds: Google Cloud Platform, Amazon Web Services, and Microsoft Azure.

Big Data as a Service

The cloud allows for delivery as-a-service, rather than as software. This accelerates the time to value for our customers by eliminating installation and integration. In fact, in our recent customer survey, we found that on average our customers are able to successfully query their data, whether through Hive, Spark, MapReduce, or Presto, by the second day after signing up for Qubole Data Service. Data teams can get going right away, rather than waiting 6-18 months for long and complex professional services engagements to build out the application.

Delivery as-a-service also significantly reduces the effort and complexity of maintaining the platform for end users. Simple things that every user expects, such as a SQL workbench UI, collaborative notebook data visualization capabilities, the ability to drill down into logs and debug a job, and a simple template for cluster definition, are all delivered as part of our core service. These core elements reduce the time investment needed from administrators. In our customer base, we see on average 21 users for every one administrator. Not only does this free up valuable IT resources, but it also empowers data scientists and analysts to directly access and analyze data to accelerate business decision making.

Optimizations on Object Storage

As Cloudera has already found, there are challenges to making some aspects of the cloud work well with existing Hadoop and Spark software. Object stores, despite offering nearly unlimited capacity and scalability, don’t always behave like file systems, yet software written for HDFS expects file-system behavior. In particular, it can be challenging to optimize performance on an object store such as Amazon S3, where directory listings and move operations are slow. Fortunately, we’ve made significant optimizations to how Hadoop runs with S3, and our customers have enjoyed these benefits for several years.
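To make the move-operation problem concrete, here is a minimal sketch of why a "directory rename" is expensive on a flat key-value object store. It is a toy in-memory model, not the S3 API and not a description of our optimizations: the `FlatObjectStore` class, its method names, and the request counter are all hypothetical and exist only for illustration. The point it shows is that there is no directory to rename, so every object under the prefix has to be copied and then deleted.

```python
class FlatObjectStore:
    """Toy object store: a flat map of keys to bytes, with a request counter.

    Hypothetical model for illustration only; real stores paginate listings
    and charge per request, which makes the effect larger, not smaller.
    """

    def __init__(self):
        self.objects = {}
        self.requests = 0

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def list_prefix(self, prefix):
        # A "directory listing" is really a scan for keys sharing a prefix.
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]


def rename_prefix(store, src_prefix, dst_prefix):
    """Emulate a directory move: one copy plus one delete per object."""
    for key in store.list_prefix(src_prefix):
        store.copy(key, dst_prefix + key[len(src_prefix):])
        store.delete(key)


store = FlatObjectStore()
for i in range(3):
    store.put(f"tmp/job1/part-{i}", b"rows")

store.requests = 0  # count only the requests made by the rename itself
rename_prefix(store, "tmp/job1/", "warehouse/table/")
print(store.requests)
```

On a file system such as HDFS, the same move is a single metadata operation. In this toy model the cost grows linearly with the number of objects, which is why the final commit step of a job that renames its output directory can become a bottleneck.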

Doubling down on object storage is a big part of our reference architecture for customers. We’ve gone ahead and made similar optimizations for Google Cloud Storage and Azure Blob Storage. For our customers, keeping the “source of truth” version of their data in a dedicated storage service means flexibility on the compute side, with clusters that can scale up and down and workloads that can be parallelized across different clusters. This was a major consideration for Pinterest when they built their Big Data platform.

Cloud Scale

Finally, we’re able to take advantage of the scale of the cloud. There’s no need for capacity planning when instances can be provisioned on-demand in a matter of minutes. Cloudera Director can now automatically start and stop clusters, which is a nice feature. At Qubole, we’ve taken this even further. We actually dynamically and automatically scale the size of the cluster according to the presented workload. We even do this in the middle of a job so that when bursty workloads happen, our customers have the right amount of infrastructure on hand to handle it. Of course, we also downscale gracefully, which is the much harder part of the equation, so that compute costs remain reasonable. Our customers love auto-scaling – the average customer cluster can scale up to 34x its minimum size. After moving to Qubole, TubeMogul saw a 33% cost reduction and five-fold increase in query execution from auto-scaling.
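As a rough illustration of the two halves of auto-scaling, here is a minimal sketch under simplifying assumptions. It is not our production algorithm: the `Node` type, the function names, and the idea of sizing purely by task slots are hypothetical choices made for this example. Upscaling sizes the cluster to the presented workload within configured bounds, and downscaling only releases nodes that have no running tasks.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    running_tasks: int


def target_size(pending_tasks, running_tasks, slots_per_node, min_nodes, max_nodes):
    """Nodes needed to hold all pending and running tasks, clamped to bounds."""
    needed = math.ceil((pending_tasks + running_tasks) / slots_per_node)
    return max(min_nodes, min(max_nodes, needed))


def nodes_to_remove(nodes, target):
    """Pick surplus nodes to release, but only ones with no running tasks.

    Assumption: an idle node is safe to remove. A real system would also
    have to account for data and intermediate output stored on the node.
    """
    surplus = len(nodes) - target
    if surplus <= 0:
        return []
    idle = [n for n in nodes if n.running_tasks == 0]
    return idle[:surplus]


# A burst arrives: 100 pending and 20 running tasks, 10 task slots per node.
print(target_size(100, 20, 10, min_nodes=2, max_nodes=68))

# The burst is over: shrink toward 2 nodes without touching the busy node.
cluster = [Node("a", 0), Node("b", 3), Node("c", 0), Node("d", 0)]
print([n.name for n in nodes_to_remove(cluster, target=2)])
```

The bounds in the example (2 and 68) were picked so that the maximum is 34 times the minimum, matching the average ratio quoted above. The downscale function is deliberately conservative: if too few nodes are idle, it removes fewer than the surplus and waits for a later pass.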

We also use the capacity of scale as an offensive strategy. For example, AWS’s Spot Instances represent excess capacity that can be purchased at significant discounts of up to 80-90%. Qubole makes Spot integration easy – it’s a checkbox feature for clusters and nearly all of our customers take advantage of it for scaling up their clusters. Spot integration is so pervasive for us that 47% of all of our customers’ compute hours use Spot. Imagine that – nearly half of all of our customers’ Big Data computation is so heavily discounted that it is almost free. Of course, Spot has a downside – the instance can be taken away at any time without notice. To combat this, we’ve optimized HDFS so that the loss of a single Spot instance is not catastrophic to a running job.
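For a sense of what those two numbers mean together, here is a back-of-the-envelope sketch. It assumes every Spot hour receives the full quoted discount, which the "up to" wording does not guarantee, and that all instance hours cost the same at on-demand rates. The function name is hypothetical.

```python
def blended_cost_fraction(spot_share, spot_discount):
    """Fraction of the all-on-demand bill still paid.

    spot_share: fraction of compute hours run on Spot (0.0 to 1.0)
    spot_discount: discount on those hours relative to on-demand (0.0 to 1.0)
    """
    if not (0.0 <= spot_share <= 1.0 and 0.0 <= spot_discount <= 1.0):
        raise ValueError("spot_share and spot_discount must be between 0 and 1")
    on_demand_part = 1.0 - spot_share
    spot_part = spot_share * (1.0 - spot_discount)
    return on_demand_part + spot_part


for discount in (0.80, 0.90):
    paid = blended_cost_fraction(0.47, discount)
    print(f"discount {discount:.0%}: pay {paid:.1%} of the on-demand bill")
```

Under these assumptions, running 47% of hours on Spot brings the compute bill down to roughly 58-62% of what the same hours would cost on demand.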

Looking Forward

You’ll notice that the major optimizations we discussed (efficient performance for the object store, auto-scaling, and Spot integration) were all introduced in 2012. Since that time, we’ve iterated and improved them to meet our customers’ needs (such as auto-scaling for Spark and Presto clusters and re-balancing for Spot instances). More importantly, we’ve moved on to the next set of improvements by adding features around security, collaboration, performance, and simplicity. And we’ve continued to grow our list of happy customers, which includes enterprises such as Oracle, Under Armour, and Fanatics.

We’re excited that Big Data in the cloud has reached this inflection point of market recognition in 2016. As the market scrambles to catch up with Qubole, we are constantly adding new data engines, cloud providers and enterprise features, in addition to cementing our leadership role as the definitive best as-a-service delivery model for Big Data.

Xing is the Director of Product Management for Qubole.