Spot Instances: How to reduce AWS, Azure, and GCP costs by 90%

Žilvinas Urbonas, Senior Software Engineer
· 14 min read

You may already know what the catch is.

The cloud provider can pull the plug at any time with as little as 30 seconds’ notice.

We’re not saying that you should opt to reserve VM instances instead. Far from it. Reserved Instances are a path to vendor lock-in and paying more in the long term.

There is a way to use Spot Instances effectively. Even for production workloads. 

Read this guide to learn how to handle Spot Instances and give your finance team a pleasant surprise when they see the bill.

This is part 3 of our cost optimization series. The rest of the series includes:

  1. Surprised by your cloud bill? 5 common issues & how to deal with them
  2. How to choose the best VM type for the job and save on your cloud bill

____

Note: The cloud world changes quickly, so this article is regularly updated to reflect that. Last update: 01.07.2021.

Why are Spot Instances so tricky?


Interruptions are inevitable

Cloud service providers (CSPs) offer their unused capacity at discounts of up to 90%. The catch is that they can reclaim it at short notice, from 2 minutes down to as little as 30 seconds. This is why Spot Instances are harder to manage for production workloads.

Since you bid on spare computing resources, you have no guarantee of how long that capacity will stay available. Interruptions are bound to happen, which is why you shouldn’t use Spot Instances for critical workloads that can’t tolerate them.

Pulling the plug happens fast

CSPs give you very little warning before an interruption. Amazon gives you 2 minutes; Azure and Google give you only 30 seconds. Is that enough time to drop everything and find a replacement for your instance? Not for a human.

Let’s say that you have already picked out an on-demand instance as a replacement. Creating a new VM takes around 5 minutes on AWS (and even longer if you use Kubernetes), so you’re looking at a few minutes of potential downtime. Another method is keeping some paused machines on standby that can step in whenever you lose an instance. But then your savings aren’t going to be so spectacular.

The best way to handle Spot Instance interruptions is through automation.
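To make this concrete, here is a minimal sketch of the first step of such automation on AWS: reading an interruption notice and working out how much time is left. The metadata path and the JSON shape follow the EC2 documentation; the helper names, the `drain_seconds` budget, and the two response labels are our own assumptions for illustration, and the HTTP polling itself is left out.

```python
import json
from datetime import datetime, timezone

# Documented EC2 instance metadata path for Spot interruption notices.
# It returns HTTP 404 until an interruption is scheduled. Fetching it is
# omitted here so the sketch runs anywhere.
INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def seconds_until_interruption(notice_body: str, now: datetime) -> float:
    """Return how many seconds remain before the instance is reclaimed."""
    notice = json.loads(notice_body)
    # The "time" field is a UTC timestamp such as 2021-07-01T12:02:00Z.
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


def plan_shutdown(seconds_left: float, drain_seconds: float) -> str:
    """Pick a (hypothetical) response based on the remaining time budget."""
    if seconds_left >= drain_seconds:
        return "drain"  # enough time to finish in-flight work
    return "checkpoint-and-exit"  # save state and stop right away


# Sample notice body in the documented format.
sample = '{"action": "terminate", "time": "2021-07-01T12:02:00Z"}'
now = datetime(2021, 7, 1, 12, 0, 0, tzinfo=timezone.utc)
left = seconds_until_interruption(sample, now)
print(left, plan_shutdown(left, drain_seconds=90))
```

With a 2-minute notice and a 90-second drain budget, the sketch chooses to drain. With a 30-second notice, the same workload would have to checkpoint and exit instead.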

Note: Rebalance recommendation

In November 2020, AWS introduced a new feature that you can use to proactively rebalance workloads running on EC2 Spot Instances without having to wait until your instance receives the interruption notice. It’s a signal that notifies you when the risk of interruption increases for a Spot Instance that you’re using. It arrives sooner than the interruption notice, giving you time to rebalance your workload to new or existing instances.
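A rough sketch of how automation might treat the two signals differently is shown below. It assumes that something else has already polled the instance metadata and passes in the parsed notices (or `None` when a signal is absent). The function name and the action labels are hypothetical.

```python
from typing import Optional


def next_action(rebalance_notice: Optional[dict],
                interruption_notice: Optional[dict]) -> str:
    """Decide what to do based on which signals have arrived.

    The interruption notice always wins because its deadline is fixed,
    while a rebalance recommendation only says the risk has gone up.
    """
    if interruption_notice is not None:
        return "drain-now"
    if rebalance_notice is not None:
        return "launch-replacement"
    return "keep-running"


# Only the early warning has arrived: start a replacement proactively.
print(next_action({"noticeTime": "2021-07-01T11:50:00Z"}, None))
```

The point of the ordering is that a rebalance recommendation buys extra time to bring up a replacement, so by the time the interruption notice lands there is only draining left to do.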

Limited capacity

The amount of available capacity sold as Spot Instances can vary a lot based on size, region, time of day, and many other factors. And all of them are subject to frequent changes. 

The availability of a Spot Instance is based on supply & demand. This might lead to unexpected behavior if you happen to pick the most popular instance types and a market surge like Black Friday occurs.

So, why use Spot Instances at all?

Some of your workloads probably don’t need on-demand machines at all times. Tech companies like Salesforce, Lyft, and Autodesk use Spot Instances.

If you’re still harboring some doubts, consider this scenario:

Let’s say that you have 10 pods running for your application – a product catalog service. Half of the pods are running on a Spot Instance.

At some point, the instance gives you a preemption notice – it’s about to be taken away from you. If that happens, you’ll lose half of your capacity. 

You’re not going to experience downtime immediately. Instead, the pods will be redistributed to other machines that are still available after the interruption. 

But what if you want to handle the interruption gracefully and replace that capacity before it becomes an issue?

You can quickly order a new instance within the allotted time – for example, a different type of Spot Instance. Or go with an on-demand instance if there’s no Spot Instance capacity on the marketplace.

And you can replace that on-demand instance with a Spot Instance a couple of hours later when the market pressures are alleviated. 
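The replacement logic in this scenario can be sketched in a few lines. This is an illustration only: the availability map is a made-up input (a real system would have to discover it by attempting to launch or by querying the provider), and the function name is ours.

```python
from typing import Dict, List, Tuple


def choose_replacement(preferred_spot_types: List[str],
                       spot_available: Dict[str, bool],
                       on_demand_type: str) -> Tuple[str, str]:
    """Return (purchase option, instance type) for the replacement node."""
    # Try Spot Instance types in order of preference first.
    for instance_type in preferred_spot_types:
        if spot_available.get(instance_type, False):
            return ("spot", instance_type)
    # No Spot capacity anywhere: fall back to on-demand for now.
    return ("on-demand", on_demand_type)


# Hypothetical market snapshot: the first choice is sold out.
snapshot = {"m5.large": False, "m5a.large": True, "m4.large": True}
print(choose_replacement(["m5.large", "m5a.large", "m4.large"], snapshot, "m5.large"))
```

Running the same function again a couple of hours later, with a fresh snapshot, is what moves the workload from the on-demand fallback back to a Spot Instance.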

By staying out of a reserved plan, you keep a lot of flexibility and avoid getting locked in with your vendor (or even a specific instance type). That’s why using Spot Instances is a good idea.

When to use Spot Instances?


If a service is stateless and can be scaled out – that is, have more than one replica – it is a good candidate for Spot Instances.

The good news is that most services are stateless in modern architectures. Kubernetes (K8s) was designed with stateless architectures in mind.

Here are some examples of workloads that work well in Spot Instances:

  • Batch processing jobs – they’re fault-tolerant and instance-flexible.
  • Containers and microservices – they’re typically self-contained, highly available, fault-tolerant, and capable of handling interruptions.
  • High Performance Computing (HPC) – these apps usually need very high compute capabilities, massive amounts of memory, fast storage, and high network performance. Spot Instances can support them via bursting or even serve as primary compute infrastructure. 
  • CI/CD operations – it doesn’t matter what tools you use; these instances can come in handy in your deployment process. 
  • Distributed databases – Elasticsearch or MongoDB, when replicated across several nodes, can handle an interruption without losing data or affecting the service.
  • Any app in an orchestrated environment.

Which CSP to choose for Spot Instances?

Product name
  • AWS: Spot Instance
  • Azure: Spot VM
  • GCP: Preemptible VM instance

Pricing
  • AWS: Variable (based on demand and updated every 5 minutes). Check the Spot Instance Advisor.
  • Azure: Fixed. Query pricing information using the Azure retail prices API.
  • GCP: Fixed. To learn more, explore the VM instances pricing lists.

Support limitations
  • AWS: Limit of 20 Spot Instances per AWS Region. The limits are dynamic: they might be lower than 20 Spot Instances for new accounts and then increase over time. Your account might also have limits on specific Spot Instance types.
  • Azure: No support for B-series sizes or promo versions of any size (like Dv2 or NV promo sizes). No support for the Microsoft Azure China 21Vianet region.
  • GCP: A Preemptible Instance can be stopped at any time due to system events (depending on current conditions; this varies by zone and day). You can’t migrate a Preemptible Instance to a regular VM instance.

Preemption time
  • AWS: 2 minutes
  • Azure: 30 seconds
  • GCP: 30 seconds

Maximum time limit
  • AWS: Unlimited (depends on extra capacity)
  • Azure: Unlimited (depends on extra capacity)
  • GCP: 24 hours / 6 hours in some instances (you can reset the counter)

Do this before getting a Spot Instance


1. Know your workload

How aggressive are you going to be about implementing Spot Instances? Before getting into the Spot Instance business, you need to know how much time your application needs to finish a job. 

Can it handle interruptions well? Will you have an automation tool in place to move your workload to another instance before your time runs out?
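One quick way to answer these questions is to compare how long your application needs to shut down cleanly with the notice each provider gives. The sketch below uses the notice periods quoted in this article; the 80% safety margin and the function name are assumptions made for the example.

```python
# Interruption notice per provider, in seconds, as quoted in this article.
NOTICE_SECONDS = {"aws": 120, "azure": 30, "gcp": 30}


def fits_notice_window(provider: str, shutdown_seconds: float,
                       safety_margin: float = 0.8) -> bool:
    """Check whether a clean shutdown fits inside the provider's notice.

    Only a fraction of the notice (safety_margin) is treated as usable,
    because detecting the notice and reacting to it also takes time.
    """
    budget = NOTICE_SECONDS[provider] * safety_margin
    return shutdown_seconds <= budget


# A service that needs 45 seconds to drain connections and flush state.
for provider in NOTICE_SECONDS:
    print(provider, fits_notice_window(provider, shutdown_seconds=45))
```

In this example the service fits comfortably on AWS but not on Azure or GCP, which tells you that on those two providers it would need faster shutdown or checkpointing before it goes anywhere near a Spot Instance.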

2. Cherry-pick your instances

Next, it’s time to examine what the CSP has to offer. Take a look around and consider going for slightly less popular instances. They might come with a lower chance of interruptions and run stable for a longer time.

When looking through the available instances, be sure to check the frequency of interruption. This is the rate at which the provider reclaimed capacity for a given instance type during the trailing month.

AWS displays it in the Spot Instance Advisor in ranges of <5%, 5-10%, 10-15%, 15-20%, and >20%.
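If you note down these ranges for the instance types you are considering, ranking them is straightforward. The sketch below assumes you have copied the values by hand from the Spot Instance Advisor; the instance types and their ranges are invented for the example.

```python
from typing import Dict, List

# Frequency-of-interruption ranges, ordered from most to least stable.
BUCKETS = ["<5%", "5-10%", "10-15%", "15-20%", ">20%"]


def rank_by_stability(candidates: Dict[str, str]) -> List[str]:
    """Sort instance types from least to most frequently interrupted."""
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(candidates, key=lambda t: (BUCKETS.index(candidates[t]), t))


# Hypothetical values, not real Spot Instance Advisor data.
candidates = {
    "m5.large": ">20%",
    "m5a.large": "5-10%",
    "m4.large": "<5%",
    "c5.large": "5-10%",
}
print(rank_by_stability(candidates))
```

The same ordering can be fed into whatever picks your preferred instance types, so the most stable options are tried first.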

3. You can still use Spot Instances for more important workloads

For example, AWS offers Spot Instances with a defined duration (also known as Spot blocks), which give you an uninterrupted-time guarantee of up to 6 hours (in hourly increments) for a slightly higher price.

A Spot Instance running for a predefined duration can achieve a discount of 30-50% compared to on-demand pricing.

4. Bid your price

Now it’s time to set the maximum price you’re willing to pay for the Spot Instance. Your Spot Instance will run only when the marketplace price matches your bid or is lower.

The rule of thumb here is to set a maximum price equal to the on-demand price.

If you set a custom amount and the price goes up, you risk getting interrupted.
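A small simulation shows why a low custom maximum is risky. It is a sketch under simple assumptions: the price series is invented, the instance restarts as soon as the price drops back under the maximum, and only price-driven interruptions are counted (capacity reclaims are not modelled).

```python
from typing import List


def count_price_interruptions(spot_prices: List[float], max_price: float) -> int:
    """Count how often a running instance is stopped by a price rise."""
    interruptions = 0
    running = False
    for price in spot_prices:
        if price <= max_price:
            running = True  # the instance runs (or restarts) at this price
        elif running:
            interruptions += 1  # price rose above the maximum mid-run
            running = False
    return interruptions


ON_DEMAND_PRICE = 0.096  # hypothetical hourly on-demand price in USD
prices = [0.030, 0.035, 0.050, 0.032, 0.060, 0.031]  # hypothetical Spot prices

print(count_price_interruptions(prices, max_price=0.040))
print(count_price_interruptions(prices, max_price=ON_DEMAND_PRICE))
```

With the low custom maximum, the instance is interrupted twice over the series. With the maximum set to the on-demand price, it is never interrupted by price, and you still pay only the current Spot price while it runs.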

5. Manage Spot Instances in groups

When using groups of Spot Instances, you can request multiple instance types at the same time. As a result, you increase your chances of getting filled. 

Another perk is that you can set a maximum price per hour for the entire fleet rather than for a given spot pool (a group of instances with the same type, OS, availability zone, and network platform).

  • AWS Spot Fleet – you can manage a large fleet of Spot Instances with different allocation strategies (for example, considering the lowest price or only capacity optimized types).
  • Azure VM scale set – use this feature to create and manage a group of load-balanced VMs, increasing or decreasing their number automatically.
  • Google managed instance group – you can bring preemptible instances together in a group after specifying the preemptible option in the instance template.
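To illustrate what an allocation strategy does, here is a simplified sketch that picks one spot pool out of several. It is not how any provider implements its strategies: real providers do not publish spare capacity as a number, so the `spare_capacity` field, the pool data, and the strategy labels are all assumptions for the example.

```python
from typing import List, NamedTuple


class Pool(NamedTuple):
    instance_type: str
    zone: str
    price: float         # hypothetical hourly Spot price in USD
    spare_capacity: int  # hypothetical measure of unused capacity


def pick_pool(pools: List[Pool], strategy: str) -> Pool:
    """Choose a spot pool according to a simplified allocation strategy."""
    if strategy == "lowest-price":
        return min(pools, key=lambda p: p.price)
    if strategy == "capacity-optimized":
        # More spare capacity should mean a lower chance of interruption.
        return max(pools, key=lambda p: p.spare_capacity)
    raise ValueError(f"unknown strategy: {strategy}")


pools = [
    Pool("m5.large", "us-east-1a", 0.031, 40),
    Pool("m5a.large", "us-east-1b", 0.034, 250),
    Pool("m4.large", "us-east-1a", 0.029, 15),
]
print(pick_pool(pools, "lowest-price").instance_type)
print(pick_pool(pools, "capacity-optimized").instance_type)
```

The two strategies pick different pools here: the cheapest pool is also the one with the least spare capacity, which is exactly the trade-off you make when choosing between them.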

But to make it all work, prepare for a massive number of manual configuration, setup, and maintenance tasks. 

6. Turn to automation

You can avoid downtime from lost instances by implementing automation tools for managing your cloud infrastructure via autoscaling methods.

By using an automation tool, you can pick how much of your workload will be running on a Spot Instance, and then automatically fall back to on-demand instances in case of interruptions.
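The core of that idea fits in one small function. This is a sketch, not the behavior of any particular tool: the function name and its inputs are assumptions, and a real autoscaler would learn the available Spot capacity from launch attempts rather than receive it as a number.

```python
from typing import Dict


def split_capacity(total_nodes: int, spot_percent: int,
                   spot_nodes_available: int) -> Dict[str, int]:
    """Split the desired node count between Spot and on-demand.

    spot_percent is the share of nodes you would like on Spot Instances.
    Whatever cannot be filled with Spot capacity falls back to on-demand.
    """
    wanted_spot = total_nodes * spot_percent // 100
    spot = min(wanted_spot, spot_nodes_available)
    return {"spot": spot, "on_demand": total_nodes - spot}


# Normal market: half of 10 nodes on Spot, as requested.
print(split_capacity(10, spot_percent=50, spot_nodes_available=10))
# Tight market: all-Spot policy, but only 6 Spot nodes can be obtained.
print(split_capacity(10, spot_percent=100, spot_nodes_available=6))
```

In the second call the workload still gets its 10 nodes. Four of them are simply on-demand until Spot capacity returns.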

Automation is here to make sure that your workload has a place to run. And thanks to features like AWS Rebalance events, you can mitigate the risk even before receiving the interrupt notice.

You can get away with adding some basic levels of automation to how you manage these instances. But to achieve the best results, you need a solution that carries out automated actions based on predictive analytics. 

Here’s what automating Spot Instances can do for you

Remember the case study we shared with you in the article about choosing the best VM for the job?

To test our cost savings approach at CAST AI, we opted for an open-source e-commerce demo app that we adapted from Google.

We first prepared the app:

  1. We load-tested our application with ~1k concurrent users.
  2. We scaled the pods for every microservice accordingly (using AWS EKS with a statically-scaled deployment). 
  3. To capture the metrics, we ran the test for 30-60 minutes.
  4. We then extrapolated the costs over a 30-day period (we took assumed traffic seasonality into account). 
  5. To make it happen, we generated a likely 30-day usage pattern using a Python script. The example usage experienced spikes every day at around the same time and had several days of the week with heavier traffic.
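The script itself is not reproduced in this article. As an illustration of the idea only, a usage pattern with a daily spike and heavier days could be generated like this; the peak hour, spike size, heavy days, and multipliers below are all invented for the sketch and are not the values we used.

```python
import math
from typing import List

PEAK_HOUR = 18       # assumed hour of the daily spike
HEAVY_DAYS = (4, 5)  # assumed heavier days of the week (0 = first day)


def hourly_load(hour_of_day: int, day_of_week: int, base: float = 1.0) -> float:
    """Relative load for one hour: a daily bell-shaped spike plus heavy days."""
    # Gaussian bump centered on the peak hour, about 2 hours wide.
    spike = 2.0 * math.exp(-((hour_of_day - PEAK_HOUR) ** 2) / (2 * 2.0 ** 2))
    day_boost = 1.5 if day_of_week in HEAVY_DAYS else 1.0
    return base * (1.0 + spike) * day_boost


def thirty_day_pattern() -> List[float]:
    """Return 720 hourly load values covering 30 days."""
    return [hourly_load(h % 24, (h // 24) % 7) for h in range(30 * 24)]


pattern = thirty_day_pattern()
print(len(pattern), round(max(pattern), 2), round(min(pattern), 2))
```

Multiplying each hourly value by the number of nodes needed at base load, and then by the hourly instance price, turns a pattern like this into a monthly cost estimate.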

This is how we calculated the monthly costs of running the demo app on the AWS test instance.

Our initial monthly cost of running the app was $691.20.

We applied the CAST AI Spot Instance policy ensuring that our application saved on costs relative to the on-demand pricing. 

We used the most aggressive policy settings where all the instances in use are Spot Instances.

This brought the total compute costs down to $65.01 – saving 90% over the original costs of an unoptimized deployment.
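For anyone who wants to check the arithmetic, the savings figure follows directly from the two monthly costs above:

```python
on_demand_monthly = 691.20  # initial monthly cost in USD
spot_monthly = 65.01        # monthly cost with the all-Spot policy in USD

savings = on_demand_monthly - spot_monthly
savings_percent = 100 * savings / on_demand_monthly

print(f"Saved ${savings:.2f} per month ({savings_percent:.1f}%)")
```

That works out to $626.19 per month, or roughly 90% of the original bill.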

Wrap up

Spot Instances are a beast that can be tamed. But to reap the pricing benefits and use them safely in production, you will need to use an automation solution like CAST AI.

Do you run Kubernetes on EKS? Discover how much you could save

Generate a free savings report to see your potential savings. To do that, you’ll need to connect your cluster to CAST AI and let the read-only agent analyze your setup. Once it generates the report, you can apply the recommendations manually or let the platform do the work for you and keep on reducing your cloud bill by taking advantage of Spot Instances.

Wondering how it all works? Read this: How to reduce your Amazon EKS costs by half in 15 minutes

20 Comments
Mark
2021-05-18 8:02 AM

I’m interested, if there’s no spot instance capacity when the notice is issued, will CAST deploy on-demand instance or this scenario is improbable?

R.K
2021-05-18 8:18 AM

Can I save as much as with spot instances while not actually using them? My current workloads are not interruption-friendly..

J.mil
2021-05-18 8:36 AM

Do my chances of getting more savings increases while using less popular instances and how much can that saving % fluctuate from time to time as your experience goes? ty for the answers and a great article.

Kimi
2021-05-18 9:00 AM

Thanks for the article Zilvinas, it was well structured and easy to read.

Alexander
2021-05-18 9:19 AM

saving up to 90% is a pretty bold claim but I guess its technically possible if you do everything just tragically.

JJredi
2021-05-20 6:07 AM

Spot instances seems to be a great additional way to save more money, especially when there’s automation to handle it. Thanks for the article, it was really insightful!

ano
2021-05-20 6:35 AM

If cast can really manage spot instance interruption effortlessly, I’ll have to give you guys a shot

Steven T.
2021-05-21 10:36 AM

Informative article, a solution I was looking for

Jeremy
2021-05-25 11:23 AM

Informative article but not gonna lie I as skeptical to the claims at first. So I went to check the eks optimizer tool and quick inspection confirmed it was read-only access on the script. I ran it on my cluster and although it wasnt 90% the savings were still quite astonishing. Gonna run it for some time and see if it changes

Rawiya
2021-05-26 10:20 AM

as a person with a short attention span, I’d love some TL;DR bullet points about spot instances in general at the bottom of the article

Fahid
2021-06-09 10:27 AM

The alerting notices are really short and that’s where automation shines if configured properly

Gonzales
2021-08-11 7:16 AM

I was wondering how cast EKS cluster optimization decided on what clusters suit for spot instances, but this clears it up a bit. My potential savings are pretty low but thats just because I’ve used a test cluster that our company uses

nicha
2021-08-11 8:18 AM

Even though the EKS optimizer showed only a few clusters that are can be optimized with spot instances, it still saves me 1k$ a month. And now I know for a fact that those clusters are spot friendly, so good job on accurate identification

don
2021-08-12 8:35 AM

So spot instances are just low commitment but high risk-reward and reserved instances are just otherwise

dom
2021-08-12 10:21 AM

The estimator offered 60% reduction with a suggestion to move half of my nodes to spot instances, thats interesting, I’ll see how it turns out on our other clusters and if its more complex, we might consider it with the team

Gerard
2021-08-16 3:47 AM

Recently we started looking in to spot instances a bit more than before due to the accumulating costs that we reach each month and which never seem to stop rising. I’ll use your other guide on deciding if our workloads are spot instance friendly

zack
2021-08-17 8:18 AM

Going with important deployments and loads in to spot instances is kind of nerve racking due to you never knowing when its going to be shut with a small notice. And I bet that no dev can react fast enough to move all the workloads around in time to a new one.. Automation is king