Spot nodes on AWS (and preemptible VMs on Google Cloud Platform, GCP) are a great way to reduce your Total Cost of Ownership (TCO) for big data workloads. However, Presto queries can fail if a Spot node is interrupted before a query result set is returned. Presto on Qubole differentiates itself via value additions that handle Spot interruptions without sacrificing reliability. These features are switched on by default and provide higher reliability and stability in a transparent manner to the majority of Qubole users who find Spot nodes to be a great way to reduce their cloud bills.
In this blog, we quantify these benefits using real-world Spot interruption data from a customer’s workload on Amazon Web Services (AWS). We also compare against a competitive Presto offering (referred to as “ABC” throughout this blog). Our analysis shows that Qubole’s Spot interruption handling can help Presto on Qubole achieve up to 15% higher success rates (no query failures) than the competition. To achieve a similar success rate, we would need to use all On-Demand nodes on ABC. As a result, ABC will incur more than 2X the cost incurred with Presto on Qubole.
This blog starts with a brief background of Spot interruption handling in Presto on Qubole, followed by a brief description of our benchmark. We present our analysis in two parts:
- We first compare the performance of Presto on Qubole’s Spot interruption handling against Presto on ABC through experimental runs. These runs replay, on both platforms, a mapping of real-world production workloads onto TPC-DS queries along with real Spot interruption patterns observed previously.
- We analyse production Presto queries and Spot interruption data from a Qubole customer and showcase how many of their queries succeeded due to our Spot interruption handling.
The analysis below holds true for any offering built on plain open source Presto, as these features are unique to Qubole. While this study uses representative production data on queries and Spot interruptions, actual Spot-related benefits may vary because Spot interruptions vary widely across AWS instance types, regions, Availability Zones (AZs), and time.
Handling Spot Interruption Notifications
Let’s start with a short summary of Spot interruption handling features of Presto on Qubole. Please see Spot Interruption in Presto on Qubole for a more detailed explanation.
AWS issues a notification two minutes before it reclaims a Spot node. Qubole’s handling of this notification is referred to as Spot Interruption Notification Handling (STNH). When a Spot interruption notification is received for a node, Presto on Qubole stops scheduling new queries on that node.
At the same time, Qubole adds new nodes to the cluster to replace the nodes that are about to be lost (“interrupted”), in order to maintain a steady cluster size for incoming queries.
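The behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Qubole's implementation: the `Cluster` class, its method names, and the node names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of a cluster reacting to Spot interruption notifications."""
    active: set = field(default_factory=set)    # nodes eligible for new queries
    draining: set = field(default_factory=set)  # nodes that received a notification
    _next_id: int = 0

    def add_node(self) -> str:
        name = f"node-{self._next_id}"
        self._next_id += 1
        self.active.add(name)
        return name

    def on_interruption_notice(self, node: str) -> str:
        """Stop scheduling on `node` and request a replacement right away."""
        self.active.discard(node)
        self.draining.add(node)
        return self.add_node()  # keeps the schedulable cluster size steady

    def on_termination(self, node: str) -> None:
        """The node is gone once the two-minute window elapses."""
        self.draining.discard(node)

    def schedulable_nodes(self) -> list:
        """Nodes that a newly arriving query may use."""
        return sorted(self.active)


cluster = Cluster()
for _ in range(3):
    cluster.add_node()

replacement = cluster.on_interruption_notice("node-1")
print(cluster.schedulable_nodes())
```

In this sketch the node that received the notification keeps running its existing work but is invisible to new queries, while the replacement restores the number of schedulable nodes.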
Even with Spot Interruption Notification Handling described above, there is still a chance that a query might fail due to a Spot interruption. This is because not all the tasks running on the about-to-be-lost Spot nodes will complete during the Spot Interruption notification window of 2 minutes. For such cases, Presto on Qubole uses a query retry mechanism that determines if and when to retry the query for a higher chance of success.
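A retry of this kind might look roughly like the sketch below. It is an assumption-laden illustration: `NodeLostError`, `run_with_retry`, and the `max_attempts` limit are hypothetical names, and the actual criteria Presto on Qubole uses to decide if and when to retry are not covered here. The sketch only captures the "if" part, retrying when a node was lost and letting other errors through.

```python
class NodeLostError(Exception):
    """Hypothetical error raised when a worker disappears mid-query."""


def run_with_retry(run_query, max_attempts=3):
    """Run `run_query`, retrying only when the failure was a lost node.

    Returns (result, attempts). Other errors, such as syntax errors,
    propagate immediately because a retry would not help.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(), attempt
        except NodeLostError:
            if attempt == max_attempts:
                raise


# Usage: a query that loses a node on its first attempt, then succeeds.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] == 1:
        raise NodeLostError("worker terminated by Spot interruption")
    return "42 rows"


result, attempts = run_with_retry(flaky_query)
print(result, attempts)
```

A production mechanism would also need to decide when to retry, for example by waiting until replacement nodes have joined the cluster; that part is left out of this sketch.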
The Benchmark for Costs Saved by Spot-interruption Handling
Our benchmark comprises two parts: a query runner that uses TPC-DS queries to emulate customer production workloads, and a Spot interruption simulator that uses production Spot interruption data to introduce node interruptions in a Presto cluster.
- Query runner: Emulates query run patterns (latency and periodicity) observed on a Presto on Qubole production cluster, using TPC-DS queries 38, 86, 84, 58, 23, and 88. It runs them a few times with varying concurrency, as observed in the customer production workload, over a period of 200 minutes.
- Spot interruption simulator: Replays given Spot interruption patterns on a Qubole cluster to simulate Spot interruptions on an On-Demand cluster. Qubole captures Spot interruption patterns from every cluster running on the Qubole platform. We pick one representative pattern from a Presto on Qubole cluster with 10 Spot interruption events in 200 minutes in the following simulation.
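To make the simulator's role concrete, here is a minimal sketch of how a recorded interruption pattern could be expanded into a replay timeline. The `Event` type, `build_replay` function, and the sample pattern are hypothetical and made up for illustration; they are not the tooling or the production data used in this benchmark.

```python
from dataclasses import dataclass

NOTICE_WINDOW_S = 120  # two-minute Spot interruption notification window


@dataclass(frozen=True)
class Event:
    time_s: int  # seconds since the start of the run
    kind: str    # "notify" or "terminate"
    node: str


def build_replay(pattern):
    """Expand (notification_time_s, node) pairs into a sorted event timeline."""
    events = []
    for notify_s, node in pattern:
        events.append(Event(notify_s, "notify", node))
        events.append(Event(notify_s + NOTICE_WINDOW_S, "terminate", node))
    return sorted(events, key=lambda e: (e.time_s, e.kind))


# Made-up pattern for illustration only.
pattern = [(600, "worker-3"), (660, "worker-7")]
for event in build_replay(pattern):
    print(event.time_s, event.kind, event.node)
```

A replay loop would then walk this timeline, delivering each notification to the cluster under test and terminating the named node two minutes later.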
Part I: Spot interruption Benchmark run comparison (Presto on Qubole vs ABC)
We used the following cluster configuration for running the benchmark described above on Presto on Qubole and ABC:
Coordinator Node Type: r5.4xlarge – 16 cores, 128GiB memory
Worker Node Type: r5.4xlarge – 16 cores, 128GiB memory
Minimum Worker Nodes: 10
Maximum Worker Nodes: 10
Spot nodes percentage: 75%
We used the latest Presto versions supported on both platforms – Presto 317 on Qubole, and Presto 0.2xx on ABC.
Figures A and B show the results of our benchmark run (with 10 Spot interruption events) on a timeline graph, over a period of 3 hours.
The vertical gray lines represent Spot interruption events. The left edge of each line marks the Spot interruption notification, and the right edge marks the actual Spot interruption (i.e., when the node was terminated by the Spot interruption simulator).
There were 13 queries that failed due to a Spot interruption. The graph for ABC (Figure A) shows these as red bars. A few of these queries failed right away and are therefore drawn as very thin lines overlapping the Spot interruption events at 23:33.
Fig. A: Benchmark run with 10 Spot interruption events on ABC – 13 queries failed
Figure B below shows the timeline of the same benchmark run (with 10 Spot interruption events) on Presto on Qubole.
As seen in Figure B, all the queries succeeded with Presto on Qubole. The queries that could not be completed within the Spot interruption notification window were automatically retried (refer to Smart Query Retries in Presto on Qubole for more details) and succeeded. These are represented by “successful-with-retry” in Figure B. As seen in the figure, 4 queries succeeded due to Smart Query Retries.
The queries that arrived immediately after a Spot interruption notification were not scheduled on the about-to-be-lost Spot nodes, thanks to Qubole’s handling of Spot interruption notifications. As a result, these queries did not fail when the about-to-be-lost nodes were terminated. These are represented by “successful-STNH” in Figure B. As seen in the figure, 6 queries succeeded due to Spot Interruption Notification Handling.
Fig. B: Benchmark Run with 10 Spot interruption events on Presto on Qubole – No query failures
Figure C below shows the query success count and success percentage on both the Presto on Qubole and ABC clusters. There were no query failures across multiple runs of the benchmark for Presto on Qubole (100% success rate), whereas ABC suffered 11 query failures (resulting in only an 85.9% success rate).
Fig C: Presto on Qubole (QDS) vs ABC Query Success Rate
To achieve a similar success rate in the given time (~2.5 hours), ABC’s Presto offering would have had to use 100% On-Demand nodes and incur 1.6X the cost incurred with Presto on Qubole, that is, 60% more.
Figure D below depicts the same.
Fig D: Presto on Qubole costs 1.6X less than ABC with 100% success rate
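For readers who want to see how a ratio of this kind can arise, the sketch below computes the cost of an all-On-Demand cluster relative to a mixed Spot cluster. It rests on assumptions that are not stated in this blog: only worker-node cost is counted, every node has the same On-Demand price, and the Spot discount is a free parameter. The 50% discount used here is an illustrative value, not a measured one, and the function name is hypothetical.

```python
def on_demand_cost_ratio(spot_fraction, spot_discount):
    """Cost of an all-On-Demand cluster relative to a mixed Spot cluster.

    spot_fraction: share of nodes that are Spot (0..1)
    spot_discount: assumed Spot price discount versus On-Demand (0..1)
    """
    # Relative cost of the mixed cluster, with On-Demand price normalized to 1.
    mixed = (1 - spot_fraction) + spot_fraction * (1 - spot_discount)
    return 1 / mixed


# 75% Spot nodes as in the benchmark cluster; the 50% discount is assumed.
ratio = on_demand_cost_ratio(spot_fraction=0.75, spot_discount=0.5)
print(f"{ratio:.2f}x")
```

Under these assumptions the all-On-Demand cluster comes out at 1.6 times the cost of the 75% Spot cluster. Actual Spot discounts change by instance type, region, and time, so the ratio will differ in practice.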
Part II: Analysis of Customer Queries and Spot interruption patterns
To showcase how effective Qubole’s Spot interruption handling features are in real customer use cases, we took a snapshot of the queries run and the Spot interruption events encountered during a 5-hour period on a Presto on Qubole cluster belonging to a customer (referred to as Customer “X”).
Production Cluster configuration of Presto on Qubole; Customer X:
Coordinator node type: r5.2xlarge
Worker node type: r5.4xlarge
Min node count: 2
Max node count: 25
Spot percentage: 90%
Over the lifetime of this cluster, 5-hour periods commonly saw between 4 and 10 Spot interruption events. We picked a 5-hour period with 8 Spot interruption events for our analysis.
Figure E below shows a timeline of 341 queries that were executed over a period of 5 hours. The solid lines in different colors (dark green, pink, yellow, and light green) show exactly when each query was running. The graph legend maps each color to a query status, i.e., whether the query succeeded on its own, due to Spot Interruption Notification Handling, or due to query retry.
As seen in Figure E, none of the queries failed due to a Spot interruption, although there were some unrelated failures (due to syntax issues, etc.). All the queries that hit Spot interruption boundaries (i.e., the two-minute window between receiving the Spot interruption notification and the actual Spot interruption) were successful, either due to the handling of Spot interruption notifications or due to automatic query retry.
Fig E: Customer query runs over 5 hours with 8 Spot interruption events on Presto on Qubole – No query failures
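Whether a query "hits a Spot interruption boundary" amounts to an interval overlap check between the query's runtime and the two-minute notification window. The sketch below shows one way to express that check. The function name and the sample timestamps are hypothetical and are not taken from Customer X's data.

```python
NOTICE_WINDOW_S = 120  # two-minute window between notification and termination


def hits_interruption(query_start_s, query_end_s, notifications_s):
    """True if the query's runtime overlaps any interruption window."""
    return any(
        query_start_s < notify_s + NOTICE_WINDOW_S and query_end_s > notify_s
        for notify_s in notifications_s
    )


# Illustrative values only: notification times and (start, end) per query.
notifications = [1000, 5000]
queries = {
    "q1": (100, 400),
    "q2": (950, 1050),
    "q3": (1100, 1300),
    "q4": (4000, 4999),
}
affected = [
    name
    for name, (start, end) in queries.items()
    if hits_interruption(start, end, notifications)
]
print(affected)
```

Here only the queries whose runtime overlaps a window are flagged; these are the ones that would depend on notification handling or a retry to succeed.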
Customer X Cost comparison:
Figure F shows the cluster cost comparison for customer X. In this case, customer X used 90% Spot instances in the cluster and was able to get a nearly 100% query success rate due to Qubole’s Spot handling features. As seen in the figure, 3 queries were successful as a result of Smart Query Retry and 11 were successful with the help of Spot Interruption Notification Handling (STNH). Overall, 14 queries would have failed if not for Qubole’s Spot handling features.
To guarantee a similar success rate, any competitive Presto offering or vanilla open source Presto would have to use 100% On-Demand nodes and incur 1.6X the cost incurred with Presto on Qubole (that is, 60% more).
Fig F: Presto on Qubole 1.6X Cheaper for Customer X
Presto on Qubole helps customers reduce costs with Spot nodes without sacrificing reliability
Spot Nodes on AWS (and Preemptible VMs on GCP) are extremely popular among our customers who find them very effective in reducing their cloud costs. Qubole helps Presto customers utilize Spot nodes without sacrificing reliability through built-in features that gracefully handle Spot interruptions.
In this blog, we quantified the gains from Qubole’s Spot interruption handling features using real production queries and Spot interruption patterns from our customers. Our analysis showed that Presto on Qubole can achieve up to a 15% higher success rate than a competitive Presto offering and can be effectively 1.6X cheaper than alternative mechanisms for achieving a 100% success rate.
Read more on how Presto on Qubole can be used for interactive and ad-hoc queries.