How To Run Fault-Tolerant Clusters On Spot Instances

Laurent Gil

Even when looking to cut cloud costs, many companies still shy away from using spot instances. That’s a fair point – not every workload out there is a good match for spot capacity, which the provider can reclaim at any minute. 

Still, spot instances offer savings of up to 90% off on-demand rates without locking you into commitments like reserved instances or savings plans.

The good news is that even if fault tolerance and availability are extremely important to you, spot instances are still an option, as long as you have some automation in place.

Automation plays a crucial role in enabling teams to run fault-tolerant clusters on spot instances, reducing cloud costs for Kubernetes users, and keeping the lights on no matter what. 

But how do you get there? Let’s start at the beginning.

Why are spot instances so tricky for running fault-tolerant clusters?

Fault tolerance refers to the system’s ability to continue operating in spite of failures or issues with hardware or software.

This capability might be more important for some workloads than others. Few companies dare run production workloads on spot instances precisely for that reason. 

You only get a brief warning – from 30 seconds (Azure and Google Cloud) to 2 minutes (AWS) – before the provider reclaims a spot instance. Spinning up a replacement takes longer than that, so your workloads may be left without a place to run, potentially taking your application down.
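To act within that window, you first have to notice it. On AWS, the instance metadata service exposes a spot interruption notice at `/latest/meta-data/spot/instance-action`, which returns 404 until an interruption is scheduled. Here is a minimal polling sketch in Python; the drain callback and polling interval are illustrative, and production setups using IMDSv2 would also need to fetch a session token first:

```python
import json
import time
import urllib.request
from urllib.error import HTTPError, URLError

# AWS instance metadata endpoint for spot interruption notices.
# Returns 404 until an interruption is scheduled, then a JSON body
# such as {"action": "terminate", "time": "2024-01-01T12:00:00Z"}.
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_notice(body: str) -> tuple[str, str]:
    """Extract the action and scheduled time from an interruption notice."""
    notice = json.loads(body)
    return notice["action"], notice["time"]


def poll_for_interruption(drain_callback, interval: float = 5.0) -> None:
    """Poll the metadata service; invoke drain_callback once a notice appears."""
    while True:
        try:
            with urllib.request.urlopen(METADATA_URL, timeout=2) as resp:
                action, when = parse_notice(resp.read().decode())
                drain_callback(action, when)  # e.g. cordon and drain the node
                return
        except HTTPError:
            pass  # 404: no interruption scheduled yet
        except URLError:
            pass  # metadata service unreachable (not running on EC2?)
        time.sleep(interval)
```

An automation tool runs this kind of watcher on every spot node so it can start rescheduling pods the moment the notice lands, rather than when the instance disappears.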

Here are just a few of the many challenges of spot instances:

  • Spot pricing on AWS no longer works as a live auction: you can optionally set the maximum price per hour you’re willing to pay (it defaults to the on-demand rate), and the instance is yours as long as the spot price stays below that cap.
  • Spot instances are reclaimed when the spot price rises above your maximum – or, more commonly, when the provider simply needs the capacity back.
  • The provider might run out of spot instances to offer, something that often happens during busy seasons like the Christmas holidays. Our customer OpenX encountered limits in the network, compute, and storage capacity on such occasions. 

Result? Teams hesitate to run spot instances, fearing that their applications aren’t sufficiently fault-tolerant and able to handle interruptions. After all, everyone wants to sleep well at night.

Automation is here to change all that.

Automation for fault-tolerant clusters on spot instances

Automating spot instances across their entire lifecycle opens the door to snatching some really great discounts and keeping all the lights on. Here are some examples of features that accomplish just that.

Automated spot instance provisioning and termination 

An automation tool should be able to quickly analyze the requirements of a workload, find the best match among the available spot instances, and provision those resources.

Once there are no more jobs to be done, the tool shuts instances down automatically. You don’t want to spend money on resources that don’t bring your business any value, even when they’re as cost-efficient as spot instances.
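At its core, that matching step is a constrained cheapest-fit search over the current spot offers. The sketch below shows the idea in Python; the `InstanceOffer` type, instance names, and prices are hypothetical placeholders, not real quotes:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InstanceOffer:
    name: str
    vcpus: int
    memory_gib: float
    spot_price_hourly: float  # current spot price; varies by region and zone


def cheapest_fit(
    offers: list[InstanceOffer], cpu: int, memory_gib: float
) -> InstanceOffer | None:
    """Pick the cheapest spot offer that satisfies the workload's requests."""
    fits = [o for o in offers if o.vcpus >= cpu and o.memory_gib >= memory_gib]
    return min(fits, key=lambda o: o.spot_price_hourly, default=None)


# Example with made-up prices: a 4 vCPU / 8 GiB workload picks the
# cheapest type that fits, not the cheapest type overall.
offers = [
    InstanceOffer("m5.large", 2, 8, 0.035),
    InstanceOffer("c5.xlarge", 4, 8, 0.064),
    InstanceOffer("m5.xlarge", 4, 16, 0.067),
]
best = cheapest_fit(offers, cpu=4, memory_gib=8)
```

A real provisioner layers interruption-rate data and diversification across instance families on top of this, but the price-under-constraints core is the same.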

The marketing automation leader Branch was looking to use spot instances since they provided the highest discount on the price per compute hour, but the risk of downtime and the time investment to manage spots made it impractical. They partnered with CAST AI to unlock the ability to safely use spot instances within their Kubernetes compute clusters, transitioning away from upfront reservations via savings plans.

As a result, Branch eliminated the upfront spend of several million dollars per year on savings plans and saved millions of dollars of cloud OpEx spend (over 25% of EC2 compute costs) by leveraging spot instances safely.

Fallback to on-demand when no spot instances are available

At certain times of the year, spot capacity can run short. To mitigate this, automation tools can move workloads from spot instances to on-demand capacity when needed, making sure that all workloads have a place to run even when there are no spot resources in sight.
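The fallback logic itself is simple: try spot first, and only pay on-demand rates when spot capacity is unavailable. The Python sketch below illustrates the shape of it; `CapacityError` and the provisioner callables are hypothetical names for this example, and a real tool would also run a background loop that keeps retrying spot so fallback nodes get replaced once capacity returns:

```python
class CapacityError(Exception):
    """Raised when the provider has no capacity of the requested kind."""


def provision_with_fallback(provision_spot, provision_on_demand):
    """Try to provision a spot node; fall back to on-demand on a capacity error.

    Returns the provisioned node and which capacity type it landed on.
    """
    try:
        return provision_spot(), "spot"
    except CapacityError:
        # No spot capacity right now – keep the workload running on on-demand.
        # A separate reconciliation loop can later swap this node back to spot.
        return provision_on_demand(), "on-demand"
```

The key property is that a spot shortage degrades cost, never availability: the workload always gets a node, just sometimes a pricier one.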

For example, the programmatic advertising platform OpenX runs nearly 100% of its compute on spot instances and uses automation and the spot fallback feature to ensure workloads always have a place to run by moving them to on-demand resources in case of a spot drought. 

We certainly have spot fallback always enabled, and it’s a normal situation for us to be unable to obtain spot capacity at the moment. But the capacity situation at Google Cloud is very dynamic. If you can’t obtain the spot capacity now, you might be able to in 10 minutes. That’s why spot fallback works great for us – we can expect CAST AI to maintain the best possible cost for the cluster by constantly attempting to replace the on-demand capacity with spot.

Ivan Gusev, Principal Cloud Architect at OpenX

Partial utilization of spot instances

It’s important that you pick an automation tool that isn’t a black box and gives you some control over configuration. CAST AI includes the features listed above, but it also gives users the flexibility to run just a portion of workloads on spot instances without having to modify manifest files. 

All you need to do is install and configure a Mutating Admission Webhook, which mutates the workload manifest and adds a spot toleration to influence pod placement by the Kubernetes Scheduler.

If you set the webhook to spot-only, it will mark all workloads in your cluster as suitable for spot instances. As a result, the platform’s autoscaling mechanism will prefer spot instances when scaling your cluster up. 
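The webhook’s effect is equivalent to hand-editing the pod spec with a toleration for the taint that spot nodes carry, plus an optional node selector to require (not just allow) spot placement. A sketch of what the mutated manifest looks like is below; the `scheduler.cast.ai/spot` taint key is shown as an illustration, since the exact key depends on the platform tainting your spot nodes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  # Allow this pod onto spot nodes, which carry a NoSchedule taint.
  # The taint key is platform-specific; shown here for illustration.
  tolerations:
    - key: scheduler.cast.ai/spot
      operator: Exists
      effect: NoSchedule
  # Optionally *require* spot nodes rather than merely tolerating them:
  nodeSelector:
    scheduler.cast.ai/spot: "true"
  containers:
    - name: worker
      image: busybox
      command: ["sh", "-c", "echo processing && sleep 3600"]
```

With the webhook in place, none of this needs to appear in your own manifests – it is injected at admission time.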

This is especially recommended for development and staging environments, batch job processing clusters, and other scenarios where interruptions won’t cause any havoc.

If running 100% of your workloads on spot instances seems scary, you can adjust the ratio – for example, keeping 60% on stable on-demand instances and the remaining 40% on spot instances.

This conservative configuration ensures there are enough pods on stable compute for the base load while still allowing significant savings on the pods above it. All types of environments can benefit from this, including production.
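Without a webhook, one manual way to approximate such a split is to run the base load and the burst capacity as two Deployments, where only the burst one tolerates spot nodes. The manifest below is a hand-rolled sketch of that pattern for a 10-replica service; the image and the taint key are placeholders:

```yaml
# Base load: 6 replicas with no spot toleration stay on on-demand nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-base
spec:
  replicas: 6
  selector:
    matchLabels: {app: api, tier: base}
  template:
    metadata:
      labels: {app: api, tier: base}
    spec:
      containers:
        - name: api
          image: my-api:1.0   # placeholder image
---
# Burst capacity: 4 replicas tolerate and require spot nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-burst
spec:
  replicas: 4
  selector:
    matchLabels: {app: api, tier: burst}
  template:
    metadata:
      labels: {app: api, tier: burst}
    spec:
      tolerations:
        - key: scheduler.cast.ai/spot   # illustrative taint key
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        scheduler.cast.ai/spot: "true"
      containers:
        - name: api
          image: my-api:1.0   # placeholder image
```

An admission webhook with a configurable ratio achieves the same outcome without maintaining two copies of every workload.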

Embrace automation to build fault tolerance into your clusters

Automation empowers teams to achieve fault tolerance while reducing costs. By embracing automation, you get to unlock some serious opportunities for reducing your cloud bill.

Book a demo and get a personalized walkthrough of the CAST AI platform with a focus on spot instance automation.
