Spot Instance Availability Demystified: AWS, Azure, and GCP 


Leon Kuperman

Many companies steer clear of Spot instances because they think they’re unstable. But in reality, Spot instances are a stable and cost-effective resource as long as you use automation to select, provision, manage, and decommission them. 

Despite the potential to reduce compute costs by up to 90%, many businesses avoid deploying Spot instances because they see them as volatile: cloud providers can interrupt and reclaim these instances with little notice. As a result, these organizations miss out on dependable, scalable, and cost-effective instances available across many geographical locations.

The industry has formed a blanket view of Spot instance availability and pricing. A better approach is to make data-driven decisions for each cloud provider and each region.

To manage Spot instances successfully, Kubernetes clusters need automation and decision-making algorithms. This entails reacting to Spot instance terminations on time and actively managing the mix of instance types and pricing models (Spot, on-demand, and reserved) depending on cost, availability, and workload needs.

With automation in place to select, provision, manage, and decommission compute instances, Spot instances can be a highly reliable and cost-effective solution.

Spot Instance Hurdles: What You Need to Know

The potential interruption of Spot instances is the biggest hurdle that application and DevOps teams face. Once the provider gives notice that it is reclaiming a Spot instance, clusters and their underlying provisioning algorithms have only 30 seconds on Azure and Google Cloud, and 2 minutes on Amazon Web Services, to replace it. If you don’t react fast enough, your application could experience downtime.
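To make the timing constraint concrete, here is a minimal Python sketch. The notice windows are the figures quoted above; the function name and the provisioning and drain durations are hypothetical inputs chosen for illustration, not measurements.

```python
# Notice windows in seconds, taken from the figures quoted in the text.
NOTICE_WINDOW_SECONDS = {"aws": 120, "azure": 30, "gcp": 30}


def fits_notice_window(provider: str, provision_seconds: float, drain_seconds: float) -> bool:
    """Return True if a replacement node can be provisioned and the old node
    drained before the interruption notice expires (hypothetical helper)."""
    window = NOTICE_WINDOW_SECONDS[provider.lower()]
    return provision_seconds + drain_seconds <= window


if __name__ == "__main__":
    # Assumed timings for illustration only: 45 s to provision, 20 s to drain.
    for provider in NOTICE_WINDOW_SECONDS:
        print(provider, fits_notice_window(provider, 45, 20))
```

Under these assumed timings, only the 2-minute window leaves enough room to react after the notice arrives. With a 30-second window, replacement capacity would need to be available faster than this example assumes.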

There are other challenges as well:

  • Where a provider lets you set a maximum price for Spot instances, availability depends in part on how much per hour an organization is prepared to spend. The Spot instance is retained only as long as the Spot price stays at or below the specified maximum and inventory is available. Once either condition is breached, the provider reclaims the instance after the short notice described above.
  • The cloud provider might run out of Spot capacity, often during busy seasons like BFCM (Black Friday, Cyber Monday). For example, before implementing automation, OpenX encountered network, compute, and storage capacity limits on such high-traffic occasions.

So, it’s no surprise that teams are hesitant to run Spot instances. They just want to sleep well at night, knowing all their workloads have a place to run. During high-pressure moments, the drive to lower costs takes a back seat, but it doesn’t have to be this way. This is where automation becomes a game-changer. 

Mastering Spot Instances with Smart Automation Strategies

Automating Spot instances over their full lifecycle opens the door to significant savings while maintaining performance. Here are examples of how automation can help you manage Spot instances effectively.

Automated Spot instance provisioning and termination 

An automated tool should be capable of swiftly analyzing a workload’s needs, determining the best match among available Spot instances, and provisioning the necessary resources. When no more tasks are left to run, the solution automatically shuts down these instances. You don’t want to waste money on resources that don’t provide value to your organization, even if they are inexpensive on a per-CPU-hour basis.
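A minimal Python sketch of that select-and-decommission logic might look like the following. The offer names and prices are made up for illustration, and `SpotOffer`, `pick_offer`, and `should_decommission` are hypothetical helpers, not part of any provider's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpotOffer:
    name: str              # hypothetical instance type name
    vcpus: int
    memory_gib: float
    price_per_hour: float  # illustrative Spot price, not real data


def pick_offer(offers: List[SpotOffer], need_vcpus: int, need_memory_gib: float) -> Optional[SpotOffer]:
    """Return the cheapest offer that satisfies the workload, or None if none fits."""
    candidates = [
        o for o in offers
        if o.vcpus >= need_vcpus and o.memory_gib >= need_memory_gib
    ]
    return min(candidates, key=lambda o: o.price_per_hour, default=None)


def should_decommission(running_tasks: int) -> bool:
    """An instance with no tasks left is a candidate for shutdown."""
    return running_tasks == 0


if __name__ == "__main__":
    offers = [
        SpotOffer("small", 2, 4, 0.02),
        SpotOffer("medium", 4, 16, 0.05),
        SpotOffer("large", 8, 32, 0.04),
    ]
    print(pick_offer(offers, need_vcpus=4, need_memory_gib=8))
    print(should_decommission(running_tasks=0))
```

A real selector would also weigh factors such as interruption rates and availability zones. Price is used alone here only to keep the sketch short.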

For example, Branch, a marketing automation industry leader, wanted to employ Spot instances since they offered the largest savings on a price-per-CPU-hour basis. However, the risk of downtime, along with the time commitment required to maintain Spot instances, made this impractical. They used automation to safely use Spot instances within their Kubernetes clusters, moving away from upfront reservations while scoring additional savings and discounts.

Branch removed the upfront cost of several million dollars per year for reservations while saving millions of dollars in cloud OpEx (more than 25% on EC2 compute expenses) by properly using Spot instances through the entire lifecycle.

Partnering with CAST AI has been a big success for Branch, saving us several millions of dollars per year in AWS Cloud compute costs for our Kubernetes clusters, while maintaining our reliability SLAs.

Mark Weiler, former Senior VP of Engineering at Branch 

Ensuring Continuity: On-Demand Backup for Spot Instance Gaps

There may be periods during the year, such as BFCM, when there aren’t enough Spot instances available on the market. Automation solutions can move workloads from Spot to on-demand instances as needed to ensure all workloads have a place to run.

In another example, the programmatic advertising company OpenX runs all of its computing on Spot instances. The company employs automation and Spot fallback functionality to guarantee workloads always have a place to operate by transferring them to on-demand resources in the event of a Spot “drought” in a specific region or availability zone. 

We certainly have spot fallback always enabled, and it’s a normal situation for us to be unable to obtain spot capacity at the moment. But the capacity situation at Google Cloud is very dynamic. If you can’t obtain the spot capacity now, you might be able to in 10 minutes. That’s why spot fallback works great for us – we can expect CAST AI to maintain the best possible cost for the cluster by constantly attempting to replace the on-demand capacity with spot.

Ivan Gusev, Principal Cloud Architect at OpenX
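The fallback pattern described in this section can be sketched in a few lines of Python. This is an illustrative sketch only: `provision_with_fallback`, `rebalance`, and the simulated capacity check are hypothetical and do not represent how CAST AI or any cloud provider implements it.

```python
from typing import Callable


def provision_with_fallback(spot_available: Callable[[], bool]) -> str:
    """Pick Spot when capacity exists, otherwise fall back to on-demand."""
    return "spot" if spot_available() else "on-demand"


def rebalance(current: str, spot_available: Callable[[], bool]) -> str:
    """Move a node running on on-demand back to Spot once capacity returns."""
    if current == "on-demand" and spot_available():
        return "spot"
    return current


if __name__ == "__main__":
    # Simulated capacity checks: a Spot drought, then capacity returning.
    capacity = iter([False, False, True])

    def check() -> bool:
        return next(capacity)

    node = provision_with_fallback(check)  # drought: falls back to on-demand
    print(node)
    node = rebalance(node, check)          # still no Spot capacity
    print(node)
    node = rebalance(node, check)          # capacity is back
    print(node)
```

The point of the second function is that fallback is not a one-way move: the on-demand node is treated as temporary and is swapped back as soon as a later capacity check succeeds.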

The Best of Both Worlds: Partial Use of Spot Instances

Automation solutions shouldn’t be a black box. Fine-tuned configurability is important when selecting the mix between Spot and on-demand infrastructure. Some solutions allow users to run a subset of workloads on Spot instances without modifying their manifest files.

One approach is to use a Mutating Admission Webhook (mutating webhook), which modifies the workload manifest at admission time and adds a Spot toleration so that the Kubernetes scheduler can place the pod on Spot nodes.
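As a rough illustration, the core of such a webhook could look like the Python sketch below. It builds the AdmissionReview response that Kubernetes expects from a mutating webhook, carrying a base64-encoded JSON Patch. The toleration key `example.com/spot` is a placeholder assumption (a real cluster would use whatever taint its Spot nodes carry), and the HTTPS server and TLS setup a real webhook needs are omitted.

```python
import base64
import json

# Hypothetical toleration; the key is a placeholder, not a real product's taint.
SPOT_TOLERATION = {
    "key": "example.com/spot",
    "operator": "Exists",
    "effect": "NoSchedule",
}


def build_patch(pod: dict) -> list:
    """Return JSON Patch operations that add the Spot toleration to a pod."""
    tolerations = pod.get("spec", {}).get("tolerations")
    if tolerations is None:
        # No tolerations field yet: create the list with our entry.
        return [{"op": "add", "path": "/spec/tolerations", "value": [SPOT_TOLERATION]}]
    if SPOT_TOLERATION in tolerations:
        return []  # already present, nothing to change
    # Append to the existing list ("-" means end of array in JSON Patch).
    return [{"op": "add", "path": "/spec/tolerations/-", "value": SPOT_TOLERATION}]


def admission_response(review: dict) -> dict:
    """Build the AdmissionReview response for an incoming pod admission request."""
    request = review["request"]
    response = {"uid": request["uid"], "allowed": True}
    patch = build_patch(request["object"])
    if patch:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Because the patch is applied by the API server at admission time, the manifests stored in source control stay unchanged, which is what lets teams opt workloads into Spot without editing them.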

In the case of CAST AI automation, clusters can be configured to fully run on Spot instances, and the right instance types will be identified for all workloads as appropriate. This is a great choice for development and staging environments, batch job processing clusters, and other cases where interruptions don’t create issues.

Should the prospect of allocating 100% of workloads to Spot instances appear too risky, the ratio can be adjusted, for example to 60% on stable on-demand instances and 40% on Spot instances.

This type of conservative configuration keeps enough pods on dependable computing resources to handle the baseline load, while still delivering significant cost reductions for the pods that exceed that baseline. This strategy proves highly effective in production environments.
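The arithmetic behind such a split is simple, as the Python sketch below shows. The function name is hypothetical, and rounding the on-demand share up is an assumption made here to stay on the conservative side; an actual tool may round differently.

```python
from typing import Tuple


def split_replicas(total: int, on_demand_percent: int = 60) -> Tuple[int, int]:
    """Split a replica count into (on_demand, spot) by percentage.

    The on-demand share is rounded up so the baseline is never under-provisioned
    (an assumption of this sketch).
    """
    if total < 0 or not 0 <= on_demand_percent <= 100:
        raise ValueError("total must be >= 0 and on_demand_percent within 0..100")
    on_demand = -(-total * on_demand_percent // 100)  # integer ceiling division
    return on_demand, total - on_demand


if __name__ == "__main__":
    print(split_replicas(10))       # 60/40 split of 10 replicas
    print(split_replicas(7))        # uneven case, on-demand rounded up
    print(split_replicas(10, 0))    # everything on Spot
```

For 10 replicas at 60%, this yields 6 on-demand and 4 Spot pods. For 7 replicas, it yields 5 and 2 because of the rounding.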

Achieving Spot Instance Dependability Through Automation

Strategic automation enables organizations to harness Spot instances for their availability and reliability benefits while optimizing cloud expenditure.

For those considering the adoption of Spot instances, we invite you to explore our newly developed Spot Instance Availability Map. This innovative tool is an interactive, global heat map that provides insights into the availability, reliability, and cost-efficiency of Spot instances across various regions and availability zones of AWS, Azure, and GCP. 

By presenting a clear visualization of Spot instance metrics, the map serves as a practical resource for debunking misconceptions and evaluating the potential risks and rewards of using Spot instances in specific locales.

We encourage users to engage with the map and share their feedback. What additional features or data would enhance capacity planning efforts, especially in the context of increasing industry-wide emphasis on cost savings? 

We hope that the Spot Instance Availability Map will inform and inspire organizations to reconsider the strategic role of Spot instances within their FinOps frameworks, potentially transforming perspectives on this valuable resource.
