You may already know what the catch is with spot instances. The cloud provider can pull the plug at any time with as little as 30 seconds’ notice.
We’re not saying that you should opt to reserve instances instead. Far from it. Reserved Instances are a path to vendor lock-in and paying more in the long term.
There is a way to use spot instances effectively. Even for production workloads.
Read this guide and learn how to handle spot instances and make the financial team pleasantly surprised when they see the bill.
This is part 3 of our cost optimization series. The rest of the series includes:
- Surprised by your cloud bill? 5 common issues & how to deal with them
- How to choose the best VM type for the job and save on your cloud bill
Note: The cloud world changes quickly, so this article is regularly updated to reflect that. Last update: 25.03.2022.
Why are spot instances so tricky?
Interruptions are inevitable
CSPs offer their unused capacity at prices that offer savings up to 90%. The only catch is that they can pull the plug with short notice, from 2 minutes to as little as 30 seconds. This is why spot instances are more difficult to manage for production workloads.
Since you bid on spare computing resources, you have no guarantee of how long that capacity will stay available. Interruptions are bound to happen. That’s why you shouldn’t use spot instances for critical workloads that can’t tolerate interruptions.
Pulling the plug happens fast
CSPs offer short interrupt notice. Amazon gives you 2 minutes, Azure and Google only 30 seconds. Is that enough time to drop everything and find a replacement for your instance? Not for a human.
Let’s say that you have already set your sights on an on-demand instance as a replacement. Creating a new VM takes around 5 minutes on AWS (and even longer if you use Kubernetes), so you’re looking at a few minutes of potential downtime. Another method is keeping some paused machines that can step in whenever you lose an instance. But then your savings aren’t going to be so spectacular.
The best way to handle spot instance interruptions is through automation.
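As a minimal sketch of the first thing such automation has to do, the snippet below parses an interruption notice and works out how many seconds are left before the instance is reclaimed. It assumes the JSON shape that AWS documents for the `spot/instance-action` instance metadata item (an `action` plus a UTC `time`). The function names and the drain budget are hypothetical, and a real agent would fetch the payload from the metadata service instead of using a hard-coded string.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_json: str, now: datetime) -> float:
    """Return the seconds left before the instance is reclaimed.

    Assumes a payload shaped like
    {"action": "terminate", "time": "2022-03-25T08:22:00Z"}.
    """
    notice = json.loads(notice_json)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return (deadline - now).total_seconds()


def can_drain_in_time(notice_json: str, now: datetime, drain_seconds: float) -> bool:
    """True if the workload can be drained before the deadline (hypothetical helper)."""
    return seconds_until_interruption(notice_json, now) >= drain_seconds


# Example payload and clock, both made up for illustration.
notice = '{"action": "terminate", "time": "2022-03-25T08:22:00Z"}'
now = datetime(2022, 3, 25, 8, 20, 0, tzinfo=timezone.utc)

print(seconds_until_interruption(notice, now))  # 120.0
print(can_drain_in_time(notice, now, drain_seconds=45))  # True
```

With a 30-second notice, the same check fails for any workload that needs longer than that to drain, which is why the replacement capacity has to be requested automatically.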
Note: Rebalance recommendation
In November 2020, AWS introduced a new feature that you can use to proactively rebalance workloads running on EC2 spot instances without having to wait until your instance receives the interruption notice. It’s a signal that notifies you when the risk of interruption increases for a Spot Instance that you’re using. It arrives sooner than the interruption notice, giving you time to rebalance your workload to new or existing instances.
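One way to act on the two signals is to treat them as different levels of urgency. The sketch below is an assumption about how a handler might prioritize them, not AWS's or any tool's actual logic: an interruption notice always wins, a rebalance recommendation alone triggers a calmer replacement. The `Action` enum and `choose_action` function are hypothetical names, and the payload dictionaries stand in for whatever the instance metadata service returned.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "keep running"
    REBALANCE = "launch a replacement and shift load gradually"
    DRAIN = "drain immediately"


def choose_action(
    rebalance_notice: Optional[dict], interruption_notice: Optional[dict]
) -> Action:
    """Pick the most urgent action for the signals currently present."""
    if interruption_notice is not None:
        # The final notice has arrived: only the short notice window is left.
        return Action.DRAIN
    if rebalance_notice is not None:
        # Elevated risk, but no deadline yet: replace capacity proactively.
        return Action.REBALANCE
    return Action.NONE


# Hypothetical payloads for illustration.
print(choose_action({"noticeTime": "2022-03-25T08:10:00Z"}, None).name)  # REBALANCE
print(choose_action(None, {"action": "terminate", "time": "2022-03-25T08:22:00Z"}).name)  # DRAIN
```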
The amount of available capacity sold as spot instances can vary a lot based on size, region, time of day, and many other factors. And all of them are subject to frequent changes.
The availability of a Spot Instance is based on supply & demand. This might lead to unexpected behavior if you happen to pick the most popular instance types and a market surge like Black Friday occurs.
So, why use spot instances at all?
Some of your workloads probably don’t need on-demand machines at all times. Tech companies like Salesforce, Lyft, or AutoDesk use spot instances.
If you’re still harboring some doubts, consider this scenario:
Let’s say that you have 10 pods running for your application – a product catalog service. Half of the pods are running on a Spot Instance.
At some point, the instance gives you a preemption notice – it’s about to be taken away from you. If that happens, you’ll lose half of your capacity.
You’re not going to experience downtime immediately. Instead, the pods will be redistributed to other machines that are still available after the interruption.
But what if you want to handle the interruption gracefully and replace that capacity before it becomes an issue?
You can quickly order a new instance within the allotted time – for example, a different type of Spot Instance. Or go with an on-demand instance if there’s no Spot Instance capacity on the marketplace.
And you can replace that on-demand instance with a Spot Instance a couple of hours later when the market pressures are alleviated.
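The replacement logic in this scenario can be sketched as a simple preference list: try each acceptable spot type in order and fall back to on-demand only when none has room. This is an illustrative sketch, not a real provisioning call. The `pick_replacement` function, the instance type names, and the capacity numbers are all hypothetical (providers do not publish exact spare capacity).

```python
from typing import Dict, List, Tuple


def pick_replacement(
    preferred_spot_types: List[str],
    spot_capacity: Dict[str, int],
    nodes_needed: int,
) -> Tuple[str, str]:
    """Return (purchase_option, instance_type) for the replacement capacity."""
    for instance_type in preferred_spot_types:
        if spot_capacity.get(instance_type, 0) >= nodes_needed:
            return ("spot", instance_type)
    # No spot pool can cover the request: fall back to on-demand for now.
    return ("on-demand", preferred_spot_types[0])


preferred = ["m5.large", "m5a.large", "m4.large"]

# The first choice is sold out, so the next spot type is used.
print(pick_replacement(preferred, {"m5.large": 0, "m5a.large": 3}, nodes_needed=1))
# ('spot', 'm5a.large')

# Nothing available on the spot market: on-demand keeps the service up.
print(pick_replacement(preferred, {}, nodes_needed=1))
# ('on-demand', 'm5.large')
```

Running the same function again a few hours later, once spot capacity is back, is what moves the workload off on-demand.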
By not locking yourself up in a reserved plan, you get a lot of flexibility and avoid getting locked in with your vendor (or even a specific instance type). That’s why using spot instances is a good idea. Take a look here: How to find out exactly how much you can save with spot instances
When to use spot instances?
If a service is stateless and can be scaled out – that is, have more than one replica – it is a good candidate for spot instances.
The good news is that most services are stateless in modern architectures. K8s was designed for stateless architectures.
Here are some examples of workloads that work well in spot instances:
- Batch processing jobs – they’re fault-tolerant and instance-flexible.
- Containers and microservices – they’re typically self-contained, highly available, fault-tolerant, and capable of handling interruptions.
- High Performance Computing (HPC) – these apps usually need very high compute capabilities, massive amounts of memory, fast storage, and high network performance. Spot instances can support them via bursting or even serve as primary compute infrastructure.
- CI/CD operations – it doesn’t matter what tools you use; these instances can come in handy in your deployment process.
- Distributed databases – Elasticsearch or MongoDB can handle an interruption without losing any data or affecting the service.
- Any app on an orchestrated environment
Which CSP to choose for spot instances?
| Product name | Spot Instance (AWS) | Spot VM (Azure) | Preemptible VM instance (Google Cloud) |
|---|---|---|---|
| Pricing | Variable (based on demand and updated every 5 minutes). Check the Spot Instance Advisor. | Query pricing information using the Azure retail prices API. | To learn more, explore the VM instances pricing lists. |
| Limitations | Limit of 20 spot instances per AWS Region. The limits are dynamic – they might be lower than 20 spot instances for new accounts and then increase over time. Your account might have limits on specific Spot Instance types. | No support for sizes: promo versions of any size (like Dv2 or NV promo sizes). No support for region: Microsoft Azure China 21Vianet. | A Preemptible Instance can be stopped at any time due to system events (depending on current conditions, varies by zone and day). You can’t migrate Preemptible Instances to a regular VM instance. |
| Interruption notice | 2 minutes | 30 seconds | 30 seconds |
| Maximum time limit | Unlimited (depends on extra capacity) | Unlimited (depends on extra capacity) | 24 hours / 6 hours in some instances (you can reset the counter) |
Do this before getting a Spot Instance
1. Know your workload
How aggressive are you going to be about implementing spot instances? Before getting into the Spot Instance business, you need to know how much time your application needs to finish a job.
Can it handle interruptions well? Will you have an automation tool in place to move your workload to another instance before your time runs out?
2. Cherry-pick your instances
Next, it’s time to examine what the CSP has to offer. Take a look around and consider going for slightly less popular instances. They might come with a lower chance of interruptions and run stable for a longer time.
When looking through the available instances, be sure to check the frequency of interruption. The frequency of interruption is the rate at which the provider reclaimed capacity for a given instance type during the trailing month.
AWS displays it in the Spot Instance Advisor in ranges of <5%, 5-10%, 10-15%, 15-20%, and >20%.
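To illustrate how those ranges can drive instance selection, here is a small sketch that keeps only the instance types whose interruption band is at or below a chosen threshold. The band labels come from the ranges above. The candidate instance types and the bands assigned to them are made-up sample data, not figures from the Spot Instance Advisor.

```python
# Interruption-frequency bands, ordered from least to most frequently interrupted.
BANDS = ["<5%", "5-10%", "10-15%", "15-20%", ">20%"]


def within_band(candidates: dict, worst_acceptable: str) -> list:
    """Return the instance types whose band is no worse than worst_acceptable."""
    limit = BANDS.index(worst_acceptable)
    return sorted(
        name for name, band in candidates.items() if BANDS.index(band) <= limit
    )


# Hypothetical sample data for illustration only.
candidates = {
    "m5.large": "10-15%",
    "m5a.large": "<5%",
    "c5.xlarge": "5-10%",
    "r5.large": ">20%",
}

print(within_band(candidates, "5-10%"))  # ['c5.xlarge', 'm5a.large']
```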
3. You can still use spot instances for more important workloads
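For example, AWS has offered a type of Spot Instance with a defined duration (Spot blocks), where you get an uninterrupted time guarantee for up to 6 hours (in hourly increments) and pay just a little more. Check current availability before you plan around it: AWS stopped offering defined-duration Spot Instances to new customers in July 2021.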
A Spot Instance running for a predefined duration can achieve a discount of up to 30-50% compared to on-demand pricing.
4. Bid your price
Now it’s time to set the maximum price you’re willing to pay for the Spot Instance. Your Spot Instance will run only when the marketplace price matches your bid or is lower.
The rule of thumb here is to set a maximum price equal to the on-demand price.
If you set a custom amount and the price goes up, you risk getting interrupted.
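The effect of the maximum price can be shown with a few lines of arithmetic. In this sketch, an instance is considered running in a given hour only when the spot price is at or below the maximum price, as described above. The price series and the on-demand rate are hypothetical numbers chosen for illustration.

```python
def running_hours(spot_prices, max_price):
    """For each hourly spot price, report whether the instance would be running."""
    return [price <= max_price for price in spot_prices]


on_demand_price = 0.096  # hypothetical on-demand rate, USD per hour
spot_prices = [0.031, 0.035, 0.052, 0.041, 0.033]  # hypothetical hourly spot prices

# A tight custom maximum gets interrupted as soon as the price climbs.
print(running_hours(spot_prices, max_price=0.04))
# [True, True, False, False, True]

# A maximum equal to the on-demand price rides out the same spike.
print(running_hours(spot_prices, max_price=on_demand_price))
# [True, True, True, True, True]
```

Note that a high maximum price only removes price-driven interruptions. The provider can still reclaim the instance when it needs the capacity back.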
5. Manage spot instances in groups
When using groups of spot instances, you can request multiple instance types at the same time. As a result, you increase your chances of getting filled.
Another perk is that you can set a maximum price per hour for the entire fleet rather than for a given spot pool (a group of instances with the same type, OS, availability zone, and network platform).
- AWS Spot Fleet – you can manage a large fleet of spot instances with different allocation strategies (for example, considering the lowest price or only capacity optimized types).
- Azure VM scale set – use this feature to create and manage a group of load-balanced VMs, increasing or decreasing their number automatically.
- Google managed instance group – you can bring preemptible instances together in a group after specifying the preemptible option in the instance template.
But to make it all work, prepare for a massive number of manual configuration, setup, and maintenance tasks.
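To make the idea of allocation strategies more concrete, the sketch below chooses a spot pool in two ways: by lowest price or by most spare capacity. It is a simplified model and not the AWS Spot Fleet API. The `SpotPool` class, the strategy strings, the prices, and the `spare_capacity` scores are all hypothetical (providers do not expose exact spare capacity, so treat that field as a stand-in for whatever availability signal you have).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpotPool:
    instance_type: str
    zone: str
    price_per_hour: float
    spare_capacity: int  # hypothetical availability score


def choose_pool(pools: List[SpotPool], strategy: str) -> SpotPool:
    """Pick one pool according to a simplified allocation strategy."""
    if strategy == "lowest-price":
        return min(pools, key=lambda pool: pool.price_per_hour)
    if strategy == "capacity-optimized":
        return max(pools, key=lambda pool: pool.spare_capacity)
    raise ValueError(f"unknown strategy: {strategy}")


pools = [
    SpotPool("m5.large", "us-east-1a", 0.035, spare_capacity=2),
    SpotPool("m5a.large", "us-east-1b", 0.038, spare_capacity=9),
    SpotPool("m4.large", "us-east-1a", 0.031, spare_capacity=1),
]

print(choose_pool(pools, "lowest-price").instance_type)  # m4.large
print(choose_pool(pools, "capacity-optimized").instance_type)  # m5a.large
```

The cheapest pool is often the most contested one, which is why the two strategies can disagree, as they do here.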
6. Turn to automation
You can avoid downtime from lost instances by implementing automation tools for managing your cloud infrastructure via autoscaling methods.
By using an automation tool, you can pick how much of your workload will be running on a Spot Instance, and then automatically fall back to on-demand instances in case of interruptions.
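The split between spot and on-demand capacity comes down to simple arithmetic, sketched below. The `plan_nodes` function and its parameters are hypothetical and do not represent any particular tool's policy: it takes the desired share of spot nodes, caps it by what the spot market can currently supply, and fills the remainder with on-demand nodes.

```python
import math


def plan_nodes(total_nodes: int, spot_fraction: float, spot_available: int) -> dict:
    """Split total_nodes between spot and on-demand, with on-demand as fallback."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    wanted_spot = math.floor(total_nodes * spot_fraction)
    spot = min(wanted_spot, spot_available)
    return {"spot": spot, "on_demand": total_nodes - spot}


# Enough spot capacity: half of the nodes run on spot.
print(plan_nodes(total_nodes=10, spot_fraction=0.5, spot_available=10))
# {'spot': 5, 'on_demand': 5}

# Spot capacity dries up: on-demand absorbs the difference.
print(plan_nodes(total_nodes=10, spot_fraction=0.5, spot_available=2))
# {'spot': 2, 'on_demand': 8}
```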
Automation is here to make sure that your workload has a place to run. And thanks to features like AWS Rebalance events, you can mitigate the risk even before receiving the interrupt notice.
You can get away with adding some basic levels of automation to how you manage these instances. But to achieve the best results, you need a solution that carries out automated actions based on predictive analytics.
Here’s what automating spot instances can do for you
Remember the case study we shared with you in the article about choosing the best VM for the job?
To test our cost savings approach at CAST AI, we opted for an open-source e-commerce demo app that we adapted from Google.
We first prepared the app:
- We load-tested our application with ~1k concurrent users.
- We scaled the pods for every microservice accordingly (using AWS EKS with a statically-scaled deployment).
- To capture the metrics, we ran the test for 30-60 minutes.
- We then extrapolated the costs over a 30-day period (we took assumed traffic seasonality into account).
- To make it happen, we generated a likely 30-day usage pattern using a Python script. The example usage experienced spikes every day at around the same time and had several days of the week with heavier traffic.
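The original script is not reproduced here, but a pattern with those two properties (a daily spike at the same time and a few heavier days per week) can be sketched as follows. Every number in it is an assumption made for illustration: the base load, the spike window, the multipliers, and which days count as heavy.

```python
def hourly_load(day: int, hour: int, base_users: int = 400) -> int:
    """Concurrent users for a given day (0-29) and hour (0-23)."""
    weekday = day % 7
    day_factor = 1.3 if weekday in (0, 1, 2) else 1.0  # assumed heavier days
    spike_factor = 2.5 if 18 <= hour < 21 else 1.0  # assumed daily spike window
    return round(base_users * day_factor * spike_factor)


# One value per hour over a 30-day period.
pattern = [hourly_load(day, hour) for day in range(30) for hour in range(24)]

print(len(pattern))  # 720
print(min(pattern), max(pattern))  # 400 1300
```

Multiplying each hourly value by the number of nodes it requires and by the hourly instance price is then enough to extrapolate a monthly bill from a short load test.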
This is how we calculated the monthly costs of running the demo app on the AWS test instance.
Our initial monthly cost of running the app was $691.20.
We applied the CAST AI Spot Instance policy ensuring that our application saved on costs relative to the on-demand pricing.
We used the most aggressive policy settings where all the instances in use are spot instances.
This brought the total compute costs down to $65.01 – saving 90% over the original costs of an unoptimized deployment.
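The percentage follows directly from the two monthly figures quoted above. The helper name below is hypothetical.

```python
def savings_percent(cost_before: float, cost_after: float) -> float:
    """Relative saving, as a percentage of the original cost."""
    return (cost_before - cost_after) / cost_before * 100


print(round(savings_percent(691.20, 65.01), 1))  # 90.6
```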
Spot instances are a beast that can be tamed. But to reap the pricing benefits and use them safely in production, you will need to use an automation solution like CAST AI.
Do you run Kubernetes on EKS, GKE, or AKS? Discover how much you could save
Generate a free savings report to see your potential savings. To do that, you’ll need to connect your cluster to CAST AI and let the read-only agent analyze your setup. Once it generates the report, you can apply the recommendations manually or let the platform do the work for you and keep on reducing your cloud bill by taking advantage of spot instances.
Wondering how it all works? Read this: How to reduce your Amazon EKS costs by half in 15 minutes
A Spot Instance is a type of virtual machine instance offered by all three major cloud service providers (in Google Cloud, it’s called a Preemptible Instance).
Spot Instances are unused instances that providers offer at a lower price than On-Demand instances. For example, AWS Spot Instances are up to 90% cheaper than On-Demand instances and can lead to massive cost savings.
However, Spot Instances can be reclaimed by providers at any time following a short preemption notice that gives you from 30 seconds to 2 minutes to move your workloads to another instance.
Spot Instances are instances that are not being used at the moment. It all starts with sending a request for a Spot Instance where you set the maximum price you’re willing to pay for it. If the highest amount you’re prepared to pay exceeds the Spot pricing and the provider has some free capacity, your Spot Instance will start. The instance won’t launch if the highest price you set for it is lower than the Spot pricing.
Tip: To maximize your chances of snatching a Spot Instance, set a maximum price that equals the On-Demand price. If you set a different amount and the pricing of Spot Instances goes up, you risk getting your workload interrupted.
In general, any workload that is stateless, fault-tolerant, and can be scaled out (have more than a single replica) is a good candidate for Spot Instances. Today, most services developed in modern architectures are stateless.
Here are some examples of workloads that tend to work well with Spot Instances:
– Batch processing jobs
– Containers and microservices
– High Performance Computing (HPC) and machine learning applications
– CI/CD operations, no matter what tools you use
– Distributed databases such as Elasticsearch or MongoDB
– Any application in an orchestrated environment
Reserved Instances are a popular product among companies looking to save cloud costs. But you can save more money with Spot Instances than you would with even a 3-year Reserved Instance commitment. For example, Amazon EC2 Reserved Instances provide reductions of up to 75% when compared to On-Demand pricing – while Spot Instances can lower your bill by up to 90%.
A Reserved Instance works based on the principle of “use it or lose it.” Every hour that your instance is idle causes you to lose the financial benefits.
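The “use it or lose it” effect can be put into numbers. In the sketch below, the reserved rate is paid for every hour whether the instance is busy or idle, so the cost per hour of actual use rises as utilization falls. The 75% discount is the upper bound quoted above. The on-demand rate and the utilization levels are hypothetical.

```python
def effective_reserved_rate(
    on_demand_rate: float, discount: float, utilization: float
) -> float:
    """Cost per hour of actual use for a reservation billed around the clock."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return on_demand_rate * (1.0 - discount) / utilization


on_demand_rate = 0.10  # hypothetical USD per hour

for utilization in (1.0, 0.5, 0.25):
    rate = effective_reserved_rate(on_demand_rate, discount=0.75, utilization=utilization)
    print(f"{utilization:.0%} utilized: ${rate:.3f} per used hour")
```

Under these assumptions, a reservation that is busy only a quarter of the time costs as much per used hour as plain on-demand, and the discount is gone.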
Another disadvantage of Reserved Instances is that you’re required to make a commitment. With Reserved Instances, you agree to use a specific type of resource for one or three years. But how can you forecast the needs of your business during this time? Changing requirements are bound to generate even more costs. Spot Instances, on the other hand, can be turned off at any moment.
Read this article to learn more: Do AWS Reserved Instances and Savings Plans really reduce costs?
You get charged depending on the Spot pricing at the start of each instance hour. If AWS terminates your Spot Instance because the instance pricing surpassed the amount you bid for it, you won’t be charged for that partial hour of usage.
To start a Spot Instance, you can either create a Spot Instance request on your own or use a managed service that spins up instances for you whenever your workloads need extra resources. Once the Spot Instance request is completed, the Spot Instance is launched and runs as long as it’s available and its pricing matches the amount you bid for it.
Spot Instances are only interrupted about 5% of the time on average, although this ratio varies significantly based on the Availability Zone and instance type.
Checking the frequency of interruption when going through the available instances is a smart move. The frequency of interruption is the rate at which the provider reclaimed capacity for a given instance type during the preceding month.
If you’re using AWS, take a look at Amazon’s Spot Instance Advisor to get a rough idea of how often you may expect interruptions. It displays that frequency in ranges of <5%, 5-10%, 10-15%, 15-20%, and >20%.
Spot Instances are definitely worth your attention if you’re looking to optimize your cloud costs. They offer a fantastic method to save money on flexible or fault-tolerant workloads, adding capacity to your existing resources.
Leave a reply
I’m interested, if there’s no spot instance capacity when the notice is issued, will CAST deploy on-demand instance or this scenario is improbable?
Hi Mark. Glad this caught your interest. If there is no spot instance capacity – CAST AI will deploy on-demand instances.
Can I save as much as with spot instances while not actually using them? My current workloads are not interruption-friendly..
Without using spot instances the savings percentage will decrease but you can still save plenty on your cloud bill. Our engineering team could consult you on how to prepare your workloads to be interruption-friendly.
You can connect our read-only agent to your EKS, GKE or AKS cluster and get immediate results of what the savings could be
Also the following article explores best practices to optimize cost:
Do my chances of getting more savings increases while using less popular instances and how much can that saving % fluctuate from time to time as your experience goes? ty for the answers and a great article.
That is correct, you can get even more savings while using less popular instances. The percentage can fluctuate a bit based on instance prices for a given month. Usually, it is around 1-2%.
Thanks for the article Zilvinas, it was well structured and easy to read.
saving up to 90% is a pretty bold claim but I guess its technically possible if you do everything just tragically.
Spot instances seems to be a great additional way to save more money, especially when there’s automation to handle it. Thanks for the article, it was really insightful!
If cast can really manage spot instance interruption effortlessly, I’ll have to give you guys a shot
Informative article, a solution I was looking for
Informative article but not gonna lie I as skeptical to the claims at first. So I went to check the eks optimizer tool and quick inspection confirmed it was read-only access on the script. I ran it on my cluster and although it wasnt 90% the savings were still quite astonishing. Gonna run it for some time and see if it changes
as a person with a short attention span, I’d love some TL;DR bullet points about spot instances in general at the bottom of the article
The alerting notices are really short and that’s where automation shines if configured properly
I was wondering how cast EKS cluster optimization decided on what clusters suit for spot instances, but this clears it up a bit. My potential savings are pretty low but thats just because I’ve used a test cluster that our company uses
Even though the EKS optimizer showed only a few clusters that are can be optimized with spot instances, it still saves me 1k$ a month. And now I know for a fact that those clusters are spot friendly, so good job on accurate identification
So spot instances are just low commitment but high risk-reward and reserved instances are just otherwise
The estimator offered 60% reduction with a suggestion to move half of my nodes to spot instances, thats interesting, I’ll see how it turns out on our other clusters and if its more complex, we might consider it with the team
Recently we started looking in to spot instances a bit more than before due to the accumulating costs that we reach each month and which never seem to stop rising. I’ll use your other guide on deciding if our workloads are spot instance friendly
Going with important deployments and loads in to spot instances is kind of nerve racking due to you never knowing when its going to be shut with a small notice. And I bet that no dev can react fast enough to move all the workloads around in time to a new one.. Automation is king
Thanks Zilvinas for insightfull article and I like the direction cast is going, overcoming spot interruption issues
I guess most of the things have been said right from fellow commenters, about this article and spot instances in general, great educational content and keep showing stuff with graphs, easiest way to understand things, at least for me
Everyone understands that spot instances will save you money, but not everyone can use them to max potential. Thanks for your shared knowledge
great table for spot instance comparison between CSP’s