Intelligence in QDS
- By Xing Quan
- September 27, 2016
Intelligent automation has always played a key role in Qubole Data Service (QDS). It’s one of the main reasons we can help our customers bring self-service Big Data access to the entire enterprise. Intelligent automation also contributes to the small ops footprint that QDS requires, helping our customers achieve a 1:21 admin-to-user ratio.
We’ve been busy working on the next generation of intelligent automation, spanning areas such as performance and cost optimization. Today, I’d like to take you on a tour of the QDS product and show how intelligent automation accelerates Big Data deployments for our customers. I’ll highlight the new areas that we’re investing in (email [email protected] to try out our beta features) and also review how we’ve improved long-standing features such as auto-scaling.
We recently introduced RubiX as an open source project for caching data read from cloud object storage. RubiX is an example of performance automation without any tuning necessary – QDS provides the best configuration by default. Presto queries with RubiX performed up to 8x faster than queries reading directly from the object store. RubiX was designed to provide one central caching story across all the SQL engines that QDS supports, including Presto, Hive, and Spark SQL. To optimize for the cloud object store, we completely changed how we cache data, doing so at the file level rather than at the table level. We’re excited to announce that RubiX is now available in beta for Hive. Going forward, we will also look to use the indexing information within data formats to reduce the amount of data that needs to be processed, further improving performance.
We’ve also invested heavily in optimizing split computation, both for Hive and Spark SQL. Split computation is one of the most common tasks for distributed big data engines: it is the step where work is divided up and allocated for parallel execution. It can be very slow with cloud object stores, and our optimizations can speed up a Spark SQL job by 6.5 times. Our optimizations for split computation are included automatically in the integration with object storage.
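To make the term concrete, here is a minimal sketch of what split computation does in general. It is not the QDS implementation and does not show our optimizations. The `Split` type, the `compute_splits` function, the bucket paths, and the sizes are all hypothetical, and the file sizes are assumed to be known up front.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Split:
    """A contiguous byte range of one file, processed by one task."""
    path: str
    offset: int
    length: int


def compute_splits(files: List[Tuple[str, int]], split_size: int) -> List[Split]:
    """Divide each (path, size_in_bytes) file into ranges of at most split_size."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits: List[Split] = []
    for path, size in files:
        offset = 0
        while offset < size:
            length = min(split_size, size - offset)
            splits.append(Split(path, offset, length))
            offset += length
    return splits


# Hypothetical input: two files of 300 and 100 bytes, split size of 128 bytes.
files = [("s3://example-bucket/a.orc", 300), ("s3://example-bucket/b.orc", 100)]
splits = compute_splits(files, 128)
```

Each resulting split can be handed to a separate task, which is what allows the engine to run the job in parallel.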
We often have customers asking us how they can optimize performance for their specific setup, which includes data formats (ORC, Parquet, or Avro as examples), data partitioning, and schema design. To meet this challenge, we set out to build a general recommendation system that can be tailored to the exact workloads that each customer runs and also learns and evolves as the workloads change.
We’re excited to announce Tenali, a beta product for QDS that can automatically analyze command workload history and find the queries that are the most representative of the workload. With this information, QDS solutions architects can make specific recommendations and chart the performance of the representative queries over time. Going forward, we will add recommendations for cluster setup and tuning, as well as automatic actions that can be invoked from recommendations. We’ll write more about Tenali in the near future.
QDS has always supported AWS Spot instances as part of the cluster definition. AWS Spot instances represent excess capacity and are priced at discounts of up to 80% off on-demand instance prices. Given a simple policy, such as “bid up to 100% of the on-demand price and maintain a 50/50 on-demand to Spot ratio”, QDS will automatically manage the composition and scaling of the cluster while making bids for Spot instances. Our integration with Spot instances is a very popular feature, so much so that 82% of all QDS clusters make use of Spot instances.
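As an illustration only, the example policy above can be read as a small calculation. The sketch below is not QDS code: the function `plan_nodes`, its parameters, and the $2.00 on-demand price are assumptions made for the example.

```python
import math
from typing import Tuple


def plan_nodes(total_nodes: int, spot_fraction: float,
               on_demand_price: float, max_bid_pct: float) -> Tuple[int, int, float]:
    """Return (on_demand_nodes, spot_nodes, max_bid_per_hour) for a policy.

    spot_fraction is the share of nodes that should be Spot (0.5 for 50/50).
    max_bid_pct is the bid ceiling as a percentage of the on-demand price.
    """
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    # Round Spot down so any odd node out is kept on on-demand capacity.
    spot_nodes = math.floor(total_nodes * spot_fraction)
    on_demand_nodes = total_nodes - spot_nodes
    max_bid = on_demand_price * max_bid_pct / 100
    return on_demand_nodes, spot_nodes, max_bid


# "Bid up to 100% of the on-demand price and maintain a 50/50 ratio"
# for a 10-node cluster with a hypothetical $2.00/hour on-demand price.
plan = plan_nodes(total_nodes=10, spot_fraction=0.5,
                  on_demand_price=2.0, max_bid_pct=100)
```

Rounding the Spot share down is a design choice made for this sketch; it errs on the side of stable capacity when the node count is odd.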
We recently started extending our Spot integration. Heterogeneous clusters is a beta feature for QDS that enables the inclusion of multiple instance types for nodes within a cluster. By casting a wider net across instance types, QDS can take greater advantage of the broader Spot market, where there can be pricing efficiencies. For example, suppose a cluster needs to scale up by 2 r3.4xlarge nodes. If the price of one r3.8xlarge node is lower than the combined cost of the 2 r3.4xlarge nodes, QDS will go ahead and purchase the r3.8xlarge. By taking advantage of these efficiencies, we’ve found that customers can save up to 90% compared with on-demand instances. As with our basic Spot integration, heterogeneous clusters completely automates the purchasing decision.
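The r3.4xlarge versus r3.8xlarge comparison can be sketched in a few lines. This is a simplified illustration rather than the actual purchasing logic: the function `cheapest_option` and the Spot prices are made up, each purchase is assumed to use a single instance type, and one r3.8xlarge is assumed to provide the capacity of two r3.4xlarge nodes.

```python
import math
from typing import Dict, Tuple


def cheapest_option(units_needed: int,
                    offers: Dict[str, Tuple[int, float]]) -> Tuple[str, int, float]:
    """Pick the instance type that covers units_needed at the lowest hourly cost.

    offers maps instance type -> (capacity_units_per_node, price_per_node_hour).
    Returns (instance_type, node_count, total_hourly_cost).
    """
    if not offers:
        raise ValueError("offers must not be empty")
    best = None
    for name, (units, price) in offers.items():
        count = math.ceil(units_needed / units)  # nodes needed of this type
        cost = count * price
        if best is None or cost < best[2]:
            best = (name, count, cost)
    return best


# Hypothetical Spot prices; capacity is measured in r3.4xlarge equivalents.
offers = {
    "r3.4xlarge": (1, 0.30),
    "r3.8xlarge": (2, 0.50),
}
choice = cheapest_option(2, offers)
```

With these made-up prices, two r3.4xlarge nodes would cost $0.60 per hour, so the single r3.8xlarge at $0.50 per hour is selected.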
Cluster Lifecycle Management
QDS automatically manages the entire lifecycle of Hadoop, Spark, and Presto clusters. This simplifies both the user and admin experiences. Users such as data analysts and data scientists can simply submit jobs to a cluster label and QDS will automatically bring up clusters. There is no dependency on an admin to ensure cluster resources. Similarly, admins no longer need to spend time manually deploying clusters or developing scripts or templates to automate this action.
QDS also automatically shuts down the cluster when there are no more active jobs running on it. This provides protection against accidentally leaving a cluster on, which could incur unnecessary compute charges. Intelligence is used to detect what type of activity occurred on the cluster. QDS is more aggressive in shutdown when all jobs were submitted for programmatic or batch execution. Conversely, QDS is less aggressive in shutdown for interactive ad hoc cases, where there is a higher chance a user will come back to an idle cluster.
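The shutdown behavior described above can be pictured as an idle timeout that depends on the kind of activity the cluster has seen. The sketch below is an assumption-laden illustration, not QDS internals: the `Activity` enum, the function names, and the 10 and 60 minute timeouts are all hypothetical.

```python
from enum import Enum
from typing import Iterable


class Activity(Enum):
    BATCH = "batch"              # programmatic or scheduled submissions
    INTERACTIVE = "interactive"  # ad hoc queries from a user


def idle_timeout_minutes(activities: Iterable[Activity],
                         batch_timeout: int = 10,
                         interactive_timeout: int = 60) -> int:
    """Use the longer timeout if any interactive activity was observed."""
    if any(a is Activity.INTERACTIVE for a in activities):
        return interactive_timeout
    return batch_timeout


def should_shut_down(active_jobs: int, idle_minutes: int,
                     activities: Iterable[Activity]) -> bool:
    """Never shut down with running jobs; otherwise compare idle time to the timeout."""
    if active_jobs > 0:
        return False
    return idle_minutes >= idle_timeout_minutes(activities)


# A batch-only cluster idle for 15 minutes is shut down; the same idle
# period on a cluster with interactive use is tolerated.
batch_only = should_shut_down(0, 15, [Activity.BATCH, Activity.BATCH])
with_adhoc = should_shut_down(0, 15, [Activity.BATCH, Activity.INTERACTIVE])
```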
QDS automatically adds or removes nodes on a running Hadoop, Spark, or Presto cluster to better match cluster sizing with workload. Auto-scaling is particularly effective for workloads that are less predictable and for which there are many users running jobs concurrently on the cluster. In our benchmark, we see QDS auto-scaled clusters performing much faster than a minimally sized cluster and at a much lower price point than maximally sized clusters, creating up to $300,000 of value for our largest customers.
QDS auto-scaling works right out of the box without any tuning necessary. In addition, QDS has added optimizations above and beyond basic auto-scaling. For example, QDS can auto-scale based on the rate of HDFS consumption. QDS also ensures that HDFS replication is not affected when a node is removed, re-distributing HDFS blocks across the remaining nodes prior to down-scaling. Finally, auto-scaling integrates tightly with AWS Spot instances, even re-balancing the cluster when Spot prices fluctuate. You can read further about the optimizations we’ve made to auto-scaling in the QDS documentation.
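One way to picture scaling on the rate of HDFS consumption is to project storage use forward and add nodes before the projection exceeds a utilization target. The following is a rough sketch under that assumption, not the algorithm QDS uses: the function `nodes_to_add`, the 80% target, and all the numbers are hypothetical.

```python
import math


def nodes_to_add(used_gb: float, capacity_gb: float, growth_gb_per_min: float,
                 horizon_min: float, node_capacity_gb: float,
                 target_utilization: float = 0.8) -> int:
    """Return how many nodes to add so projected usage stays under the target.

    Usage is projected linearly: used_gb + growth_gb_per_min * horizon_min.
    """
    if node_capacity_gb <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("invalid capacity or utilization target")
    projected_gb = used_gb + growth_gb_per_min * horizon_min
    required_capacity_gb = projected_gb / target_utilization
    shortfall_gb = required_capacity_gb - capacity_gb
    if shortfall_gb <= 0:
        return 0
    return math.ceil(shortfall_gb / node_capacity_gb)


# 600 GB used of 1000 GB, growing 10 GB/min, looking 30 minutes ahead,
# with 500 GB of HDFS capacity per added node.
to_add = nodes_to_add(600, 1000, 10, 30, 500)
```

In this example the projected usage is 900 GB, which needs about 1125 GB of capacity to stay at or below 80% utilization, so one additional node is requested.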
All of these areas (performance optimization, recommendations, cost optimization, cluster lifecycle management, and cluster elasticity) are examples of QDS intelligent automation at work in the deployment of a Big Data platform. We will continue to invest in these areas to bring the best experience to our customers. If you’re curious and want to try out QDS, get started with our 15-day free trial!