Introducing Capacity Reservation for Application Master to increase Workload Reliability despite Spot Interruptions

August 13, 2020 Shefali Aggarwal

AWS Spot instances reduce cloud costs by up to 90%, but AWS can interrupt them at any time, causing running workloads to fail.

Introducing Capacity Reservation for Application Masters:

To address this problem and ensure the reliability of running workloads despite Spot interruptions, Qubole is excited to announce the public preview of Capacity Reservation for Application Masters (AMs). With this feature, Qubole clusters reserve a certain amount of memory and CPU on On-Demand Instances for AMs, and no other tasks are scheduled in this reserved space. This reservation maximizes the chances of scheduling the AMs on On-Demand Instances instead of Spot Instances, thereby avoiding Application Master loss due to Spot interruption, which can fail the entire workload.

Capacity Reservation for AMs on On-Demand Instances is an excellent choice for clusters that require high workload reliability but run a high percentage of Spot Instances for cost reduction.

How to Enable It?

To use this feature, pass yarn.scheduler.Application Master-reservation.enabled=true through Override Hadoop Configuration Variables under Hadoop Cluster Settings in the cluster UI’s Advanced Configuration. Once it is enabled, space for 4 Application Masters is reserved on the cluster by default. This can be customized further based on specific environment and workload requirements. Instructions are provided in the Qubole documentation.

Fig 1: Enabling the Application Master Reservation feature on Qubole Clusters

Job Failure due to Application Master running on Spot Instances

The Application Master coordinates all the tasks, or executors in Spark, launched as part of a particular job. If the Application Master is scheduled on a Spot instance that is then interrupted by AWS, the job fails. Qubole’s built-in resilience feature retries the entire job, but this leads to job completion delays and additional cost due to re-computation. As illustrated in Figure 2 below, there is no Spot interruption on the instances where the tasks are executed. Yet the job fails because the Spot instance hosting the Application Master is interrupted.

Fig 2: Job failure from Application Master loss due to Spot interruption

Increased Fault Tolerance with Application Master on On-Demand

An Application Master running on an On-Demand instance provides higher reliability for job execution, as shown in Figure 3. When a Spot instance on which tasks are executing is interrupted by AWS, Qubole’s fault tolerance mechanism retries those tasks on a different instance, thus minimizing the job interruption. For more details on Qubole’s fault tolerance mechanism, read this blog.
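The difference between losing an Application Master and losing a task node can be sketched in a few lines of Python. This is only an illustrative model, not Qubole's implementation: the function name `handle_spot_interruption`, the node names, and the status strings are all hypothetical, and the retry target is simply the first healthy node.

```python
from typing import Dict, List


def handle_spot_interruption(am_node: str,
                             task_nodes: Dict[str, str],
                             interrupted_node: str,
                             healthy_nodes: List[str]) -> Dict[str, object]:
    """Return the job's state after `interrupted_node` is reclaimed.

    Hypothetical model: `task_nodes` maps a task name to the node it runs on.
    """
    if interrupted_node == am_node:
        # Losing the Application Master fails the whole job,
        # even if every task node is still healthy.
        return {"status": "FAILED", "task_nodes": {}}

    # Nodes that can take over the work of the interrupted one.
    candidates = [n for n in healthy_nodes if n != interrupted_node]

    reassigned = {}
    for task, node in task_nodes.items():
        if node == interrupted_node:
            if not candidates:
                raise ValueError("no healthy node available for task retry")
            # Retry only the affected task on a different instance.
            node = candidates[0]
        reassigned[task] = node
    return {"status": "RUNNING", "task_nodes": reassigned}
```

With the Application Master on an On-Demand node, interrupting a Spot node only moves that node's tasks, while interrupting the node that hosts the Application Master ends the job.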

Fig 3: Task retries upon failure due to Spot instance interruption

Smart Scheduling: Application Masters on On-Demand Instances

Qubole’s smart scheduler first checks for available capacity on On-Demand instances to avoid job failures due to Application Master loss. If the required capacity is available, Application Masters are scheduled on On-Demand instances, protecting them from Spot interruptions.

However, in cases when the required capacity is not available across all the On-Demand instances, Qubole schedules Application Masters on Spot nodes so that the job can start its execution without any delays. This increases the risk of job failure, which can be avoided by proactively ensuring that a specific capacity is reserved for Application Masters.
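The placement order described above (On-Demand first, Spot as a fallback) can be illustrated with a minimal Python sketch. It is not Qubole's scheduler: the `Node` class, the `place_application_master` function, and the memory and vcore figures are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical view of a cluster node's free resources."""
    name: str
    on_demand: bool
    free_memory_mb: int
    free_vcores: int

    def fits(self, memory_mb: int, vcores: int) -> bool:
        return self.free_memory_mb >= memory_mb and self.free_vcores >= vcores


def place_application_master(nodes: List[Node],
                             memory_mb: int,
                             vcores: int) -> Optional[Node]:
    """Pick a node for an AM, preferring On-Demand over Spot."""
    # False sorts before True, so On-Demand nodes are tried first;
    # sorted() is stable, so the original order is kept within each group.
    for node in sorted(nodes, key=lambda n: not n.on_demand):
        if node.fits(memory_mb, vcores):
            node.free_memory_mb -= memory_mb
            node.free_vcores -= vcores
            return node
    # No node has room at all; the caller decides what happens next.
    return None
```

In this model, an Application Master lands on a Spot node only when no On-Demand node has enough free memory and vcores, which is the situation that reserved capacity is meant to prevent.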

Fig 4: Qubole Smart Scheduling allocating AMs on On-Demand Instances whenever capacity is available

On-Demand Instance Capacity Reservation

By reserving capacity on On-Demand instances, Qubole ensures that no tasks are scheduled in this reserved space. When a new job is submitted, its Application Master can be scheduled on an On-Demand instance, giving the job additional reliability and allowing it to complete despite Spot interruptions. If there is no available capacity, clusters can still be configured to provision new On-Demand nodes to satisfy the request. This guarantees that all the AMs are scheduled on highly reliable instances, thereby reducing job failures due to Spot interruptions.
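As a rough illustration of the reservation idea, the Python sketch below models the On-Demand capacity as a single memory pool in which tasks are kept out of the space set aside for Application Masters. The class `OnDemandPool`, its fields, and the per-AM memory size are assumptions. Only the default of 4 reserved Application Masters comes from this post, and CPU is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class OnDemandPool:
    """Hypothetical memory-only model of On-Demand capacity."""
    total_memory_mb: int
    am_memory_mb: int           # assumed memory footprint of one AM
    reserved_am_slots: int = 4  # default number of reserved AMs per the post
    used_by_tasks_mb: int = 0
    used_by_ams_mb: int = 0

    @property
    def reserved_mb(self) -> int:
        return self.reserved_am_slots * self.am_memory_mb

    def schedule_task(self, memory_mb: int) -> bool:
        """Tasks may only use the unreserved part of the pool."""
        task_limit = self.total_memory_mb - self.reserved_mb
        if self.used_by_tasks_mb + memory_mb > task_limit:
            return False
        self.used_by_tasks_mb += memory_mb
        return True

    def schedule_am(self) -> bool:
        """AMs may use any free memory, including the reserved part."""
        free = self.total_memory_mb - self.used_by_tasks_mb - self.used_by_ams_mb
        if free < self.am_memory_mb:
            return False
        self.used_by_ams_mb += self.am_memory_mb
        return True
```

Even when tasks have filled everything they are allowed to use, the pool in this model still has room for the reserved number of Application Masters.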

Fig 5: Capacity reserved in On-Demand Instances for AMs

Summary:

Spot instances are an integral part of Qubole and can provide up to 90% in cost savings, but they can cause job failures when interrupted. Capacity Reservation for Application Masters on On-Demand instances, along with Qubole’s Smart Scheduling, provides a reliable path to keep leveraging Spot instances to reduce cost.

To learn more about Qubole’s Intelligent Spot Management, read our AWS Blog on Spot Optimization. You can also experience the benefits first hand by signing up for a free trial.

The post Introducing Capacity Reservation for Application Master to increase Workload Reliability despite Spot Interruptions appeared first on Qubole.
