Intelligent Spot Instance Availability: How Machine Learning Reduces Interruptions by up to 94%

Discover how Cast AI identifies Spot Instances with low interruption rates and prioritizes them when scaling clusters.

Mantas Čepulkovskis

Running workloads on Spot Instances is one of the most effective ways to reduce cloud compute costs, but it comes with a tradeoff: interruptions. Cloud providers may reclaim spot capacity with minimal notice, resulting in workload interruptions that can affect application stability and user experience.

The traditional approach to managing Spot interruptions has focused on graceful handling after the fact – draining nodes quickly, restarting pods elsewhere, and hoping the disruption is minimal. 

While these strategies help, they treat all Spot Instances as equally unreliable. In reality, not all Spot Instances experience equal rates of interruption. Some instance types in certain availability zones maintain stable capacity for days or weeks, while others face frequent interruptions within hours or even minutes. 

Knowing which Spot Instances are more reliable opens the door to proactive decision-making when selecting capacity for workloads. By picking Spot Instances that are statistically less likely to be interrupted, teams can avoid many interruptions before they occur. This approach improves workload stability while maintaining the cost advantages of Spot Instances. 

Our latest Reliable Spot Instances feature does that automatically: it identifies Spot Instances with lower historical interruption rates and prioritizes them when scaling clusters.

On average, the feature reduces the Spot Instance interruption rate by:

  • 23.2% on AWS
  • 28.7% on Google Cloud Platform
  • 40% on Azure

How does the Reliable Spot Instances feature work? Here’s a real-life example showing its reliability and cost impact.

Customer impact: measured interruption reductions

Cast AI measured the effectiveness of Reliable Spot Instances across customer clusters, with significant and consistent interruption reductions across cloud providers.

Overall results by cloud provider

AWS

Across AWS clusters using Spot Reliability, we observe an average 23.2% reduction in Spot node interruptions compared to baseline Spot usage.

Google Cloud Platform

Google Cloud Platform shows even stronger results, with an average 28.7% reduction in interruption rates when Spot Reliability is enabled.

Azure

We observed the strongest interruption reduction on Azure: 40%.

Azure cluster-specific performance: interruption reduction of up to 94% 

The impact of our feature varies depending on specific cluster configurations. Some clusters achieve up to a 94% reduction in interruptions, which gives their Spot Instances close to on-demand reliability at Spot pricing.

Survival analysis: a statistical approach to Spot Instance reliability

To identify which Spot Instances are more reliable, Cast AI uses survival analysis – a statistical methodology originally developed for medical research and reliability engineering. Survival analysis is particularly well-suited to assessing Spot Instance reliability because it handles partial (censored) observations: nodes whose full lifetime is unknown because they were still running, or were removed for a reason other than an interruption, when the data was collected.

In the context of Spot Instances, survival analysis models answer a fundamental question: “What is the probability that a spot node will remain available for at least X minutes?” 

By analyzing historical interruption patterns across hundreds of millions of node observations, we can estimate survival functions that quantify reliability for each instance type and availability zone combination.

The methodology works by examining node lifetimes – both interrupted and non-interrupted nodes – to build statistical models of reliability. Crucially, survival analysis can incorporate nodes that were removed for reasons other than interruptions, making efficient use of all available data to produce accurate reliability estimates. The result is a reliability score for each instance type and zone that reflects its true interruption risk based on recent historical patterns.
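To make the idea concrete, here is a minimal sketch of a survival estimate using the Kaplan-Meier estimator, a standard non-parametric method from survival analysis. This is an illustration only: the article does not say which estimator Cast AI uses, and the function names and sample data below are hypothetical. Nodes removed for reasons other than an interruption are treated as censored: they count as "at risk" while they were running, but their removal does not count as an interruption.

```python
from collections import Counter


def kaplan_meier(lifetimes, interrupted):
    """Estimate the survival curve S(t) = P(node lifetime > t).

    lifetimes   -- observed node lifetimes in minutes
    interrupted -- True if the provider reclaimed the node,
                   False if it was removed for another reason (censored)
    Returns a list of (time, survival) pairs, one per distinct
    interruption time, in increasing order of time.
    """
    # Interruptions observed at each time.
    events = Counter(t for t, e in zip(lifetimes, interrupted) if e)
    # Nodes leaving the risk set at each time, for any reason.
    exits = Counter(lifetimes)

    at_risk = len(lifetimes)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk nodes that survived time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def survival_at(curve, minutes):
    """Estimated probability that a node is still running beyond `minutes`."""
    s = 1.0
    for t, value in curve:
        if t > minutes:
            break
        s = value
    return s


# Hypothetical observations for one instance type in one zone.
lifetimes = [10, 20, 20, 30, 40]
interrupted = [True, True, False, True, False]
curve = kaplan_meier(lifetimes, interrupted)
print(survival_at(curve, 25))
```

In this sketch the two censored nodes still shrink the risk set when they leave, so they inform the estimate without being counted as interruptions. A per-instance-type, per-zone reliability score could then be read off the curve at a chosen horizon, though how Cast AI derives its actual score is not described here.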

These reliability scores are updated frequently using the latest interruption data, allowing the system to adapt to changing cloud capacity conditions over time.

Automating Spot Instance selection: integration with Cast AI Autoscaler

Cast AI’s Reliable Spot Instances feature integrates machine learning-based reliability scores directly into the autoscaling decision process, creating a powerful combination of efficiency and stability. 

The key innovation is that reliability assessment happens automatically during every autoscaling event. When the Cast AI Autoscaler needs to provision additional capacity and the Reliable Spot Instances feature is enabled, it considers both cost and reliability simultaneously. The combination of efficient autoscaling and intelligent reliability selection ensures both cost optimization and workload stability.
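One way to picture "considering both cost and reliability simultaneously" is a risk-adjusted price. The sketch below is purely an assumption for illustration: the article does not describe the Autoscaler's actual scoring, and `SpotCandidate`, `rank_candidates`, the `interruption_penalty` parameter, and all candidate values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotCandidate:
    instance_type: str   # placeholder name, not a real instance type
    zone: str            # placeholder zone
    hourly_price: float  # illustrative price per hour
    reliability: float   # reliability score in [0, 1], higher is better


def rank_candidates(candidates, interruption_penalty=1.0):
    """Order candidates by price inflated in proportion to interruption risk.

    interruption_penalty = 0 ranks purely by price; larger values
    favor reliable capacity more strongly. (Hypothetical formula.)
    """
    def risk_adjusted_price(c):
        return c.hourly_price * (1.0 + interruption_penalty * (1.0 - c.reliability))

    return sorted(candidates, key=risk_adjusted_price)


candidates = [
    SpotCandidate("type-a", "zone-1", hourly_price=0.10, reliability=0.50),
    SpotCandidate("type-b", "zone-2", hourly_price=0.12, reliability=0.95),
]
best = rank_candidates(candidates)[0]
print(best.instance_type)
```

With these made-up numbers, the slightly more expensive but far more reliable candidate ranks first, while setting `interruption_penalty=0` falls back to choosing the cheapest option.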

Getting started with Spot Reliability

With the Spot Instance Reliability feature, teams can avoid the trade-off between cost efficiency and stability. By combining real-time cost data with interruption insights, our Autoscaler makes smarter, reliability-aware scaling decisions automatically. The result: fewer disruptions, optimized spending, and clusters that stay resilient – even in the dynamic world of Spot pricing.

Reliable Spot Instances is a feature available for AWS, GCP, and Azure clusters managed by Cast AI. To enable this feature, refer to the Spot Instance Reliability documentation.

And check out our Spot Instance Availability Map to get more insights into Spot Instance interruption rates, pricing, and insufficient capacity errors by cloud region. 
