Iterable saves over 60% on Amazon EKS by automating Spot Instances

Company

Iterable is the top-rated AI-powered customer communication platform that enables brands like Redfin, Priceline, Calm, and Box to deliver joyful experiences with harmonized, individualized, and dynamic cross-channel communications at scale. The company raised $343 million of funding at a valuation of $2+ billion.

Challenge

As an advanced Kubernetes user, Iterable faced the common issue of rapidly growing cloud infrastructure costs. After solving challenges around cost transparency and monitoring, Iterable was ready for a solution that would manage cloud resources automatically with cost optimization in mind.

Solution

By implementing Cast AI, Iterable mitigated key pain points around resource utilization and reduced its annual cloud bill by over 60%. The team can now use automation mechanisms to provision the most cost-efficient resources, scale them to match real-time demand, and decommission idle VMs to reduce cloud waste. This is how Iterable can achieve an optimal cost-performance balance and fully utilize its AWS Savings Plans.

Results

  • 60% cloud cost savings
  • Maximized value from AWS Savings Plans
  • Full Spot Instance automation without interruptions

Autoscaling resources up and down

Due to Iterable’s unique use case, the application experiences increased resource demand at the beginning of the hour. Cast scales resources up and down automatically in line with changing demand.

Provisioning the most cost-effective Spot Instances

Cast’s autoscaler analyzes workload demands in real time and selects the most cost-effective resources from the entire EC2 portfolio. As a result, Iterable’s workloads run on machines that AWS has just recently released – for example, the i7ie family.

The autoscaler also leverages the older generation metal compute instances that AWS sells at a much lower price, often because customers are upgrading to newer types or migrating to other hyperscalers.  

By selecting instances for the best deals that match baseline performance characteristics, the autoscaler casts a very wide net while searching for the best Spot discounts on the market. In Iterable’s month with the highest savings, the average Spot discount was 87%!

Hourly rebalancing of Spot Instances

During rebalancing, suboptimal nodes are automatically replaced with new ones that are more cost-efficient and run the most up-to-date node configuration settings.

As a result, Cast removes around 2000 CPUs, reducing the number of nodes by 10% and generating significant savings. During low-demand periods like nighttime, rebalancing can achieve even 100% savings by draining and removing the nodes altogether. 

Interruption rate 

By keeping the VMs fresh, Cast reduces the interruption rate significantly. Other similar-sized customers running in the same regions as Iterable have faced 50-60 interruptions per day, with Cast handling them automatically.

At the same time, Iterable’s production cluster saw fewer than 5 interruptions. Iterable can proactively refresh its instances when demand is lowest and cluster cost savings are highest.

When you start talking about those numbers, leadership cares, finance cares, they start building it into their cost projections and estimates. Even 1 or 3% of a massive AWS bill is significant.

So when you say numbers like 20%, people are losing their minds. And our total possible savings take that 20% closer to 60 or 70%. 

That’s just an insane amount of savings. You can hire more people or spend more on infrastructure elsewhere.

Jason Sanghi, Staff Software Engineer, SRE at Iterable

Understanding Kubernetes costs is a challenge

To optimize its operational expenses, Iterable planned to reduce cloud spend, with Kubernetes as the second-largest item on its list of optimization candidates. However, the team didn’t have the required mechanisms for measuring or reducing these costs.

With reliability as its critical priority, the team was hesitant to shrink the cluster’s size without full clarity on costs and potential overprovisioning. This was when Iterable’s Staff Software Engineer, SRE Jason Sanghi, turned to Cast.

There’s a lot of offerings on the marketplace, a lot of services. But one of the things about Cast AI was that I could use it in a read-only mode, which tells me how much we’re spending on our cluster at any given time. 

Once I know how much we’re spending at any given time, I can run tests on adding new node types or moving workloads and see if that delivers the savings I’m looking for. Before that, it was all shots in the dark.

Cost visibility and monitoring were just the starting point, as Iterable sought a solution combining these capabilities with cost optimization. Cast delivered both, along with significant cost savings.

Onboarding Cast AI was an efficient process

Iterable moved along the onboarding process slowly, carefully checking the impact of Cast on selected clusters across staging and production. 

“When I saw the crazy benefits of Cast in staging, I was still skeptical. I thought this was an ideal world. Even now, when Cast is running on Iterable architecture in our services, I’m still in disbelief that we get the same impact in production,” said Jason Sanghi.

The implementation process went smoothly thanks to the engagement of the Cast support team, which was involved at every stage, always providing expert advice and solving any challenges that arose. 

The Cast support and sales team are so encompassing in their efforts. They will ping you when they notice a problem in your cluster, help with your onboarding, share best practices, and jump on meetings with you. Any issues or bugs I’ve encountered were solved by the next day.

60% cost savings and full transparency

After carefully monitoring and gradually onboarding its clusters to Cast, Iterable was confident enough to deploy the solution in both staging and production environments.

“As SRE/cloud engineers/developers, we’re trained to be skeptics. And when people come to us with this type of savings, we ask many questions regarding reliability. Moving slowly, we gained confidence by running on a small segment of our send services. 

Ultimately, we put Cast management across all of our send services and let it run for a day or two, and I saw our cluster cost go way below a number I had never seen it even get close to. And then my eyes opened really wide, and I said, ‘Wow, we can use this.’”

Cast generated 80% savings on clusters in the staging environment and 20% on clusters in the production environment. Once fully onboarded, Iterable achieved over 60% cost savings.

These automation features were game-changers

Iterable achieved these savings using the following Cast features:

Dynamic node types

When Iterable migrated its services from Amazon EC2 to EKS, many were given generic or standard pod sizes, such as large or medium pods. Minor changes in their sizes could make them fit better on instance types. But back then, the team didn’t know how these changes would work.

“As we onboard workloads, we can see what the engine will choose as the best fitting node type, maybe 40-core CPU, 64-core CPU, various combinations of these can fit on the nodes in a more compatible manner, but for us to try all of those out on our own, it’s like the traveling salesman problem. We would spend all of our days assigning node groups for things that can fit together,” said Jason Sanghi. 

Bin packing

Bin packing introduces dynamic node sizes and lets teams gauge what nodes they will request and which ones are cheaper. It handles the constant compression and deletion of nodes that aren’t needed anymore. 

“We saw about 20% savings just by compressing things down through the bin packing service and sticking with on-demand nodes. We just had one or two node types that we trusted to use. Cast can say, well, you guys trusted these, but we think that these other 10 or 15 will perform in the same manner. And we can just test it over the next couple of weeks for you,” said Jason Sanghi. 

Bin packing quickly brought the expected results. “Running it, just soaking it in there and seeing no issues, no noticeable performance decreases, no customer complaints, no issues or outages was amazing. And then it’s just immediate 20% savings. And for very large clusters, 20% savings is easily over a million per year,” he added.
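To make the idea of "compressing things down" concrete, here is a minimal sketch of bin packing using the classic first-fit decreasing heuristic. It is an illustration only and not Cast's actual algorithm, which the case study does not describe. The pod CPU requests, the 16-core node size, and the names `Node` and `pack_pods` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical node with a fixed CPU capacity (in cores)."""
    cpu_capacity: int
    pods: list = field(default_factory=list)

    @property
    def cpu_used(self) -> int:
        return sum(self.pods)

    def fits(self, cpu_request: int) -> bool:
        return self.cpu_used + cpu_request <= self.cpu_capacity


def pack_pods(cpu_requests, cpu_capacity):
    """First-fit decreasing: place the largest pods first and open a
    new node only when no existing node has room."""
    nodes = []
    for request in sorted(cpu_requests, reverse=True):
        if request > cpu_capacity:
            raise ValueError(f"pod requesting {request} cores cannot fit on any node")
        target = next((node for node in nodes if node.fits(request)), None)
        if target is None:
            target = Node(cpu_capacity)
            nodes.append(target)
        target.pods.append(request)
    return nodes


# Hypothetical workload: ten pods with CPU requests in cores.
pod_requests = [8, 4, 4, 2, 2, 6, 6, 8, 4, 4]
packed_nodes = pack_pods(pod_requests, cpu_capacity=16)
for index, node in enumerate(packed_nodes, start=1):
    print(f"node {index}: pods={node.pods} used={node.cpu_used}/{node.cpu_capacity}")
```

With these made-up numbers, the ten pods fit on three fully used 16-core nodes. A real scheduler also has to account for memory, affinity rules, and disruption budgets, which this sketch ignores.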

Spot Instance automation

Initially, the team wasn’t sure if it could use Spot Instances at all, given the potential interruptions and periods when Spot Instances become unavailable (spot droughts).  

“We weren’t sure how a spot drought would be handled. Would the on-demand node be requested in time to keep our services available? Would deployments go through in the same amount of time? If we have a deployment window of five minutes, can we request new nodes and deploy them in five minutes? Those were our main concerns,” said Jason Sanghi.

As the team at Iterable found out, Spot Instances managed by Cast can provide the same reliability as on-demand nodes. If the team is running 50 on-demand nodes, it could switch to a larger number of Spot Instances to absorb the potential interruption rate. If that rate is 20%, one can run 70 or 80 Spot nodes, which would still be far cheaper than running 50 nodes on demand. 
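The arithmetic behind the 50-node example can be sketched as follows. This is a rough back-of-the-envelope model, not a Cast feature: it assumes the interruption rate means the share of Spot nodes unavailable at any one time, and it assumes Spot capacity costs about a third of the on-demand price, a figure borrowed from the "a third of the price" remark quoted later in this case study. Actual Spot prices vary by instance type and over time. The function names are hypothetical.

```python
import math


def spot_nodes_needed(on_demand_nodes: int, interruption_rate: float) -> int:
    """Spot nodes required so that the capacity expected to survive
    interruptions still matches the on-demand baseline."""
    if not 0 <= interruption_rate < 1:
        raise ValueError("interruption_rate must be in [0, 1)")
    return math.ceil(on_demand_nodes / (1 - interruption_rate))


def relative_cost(spot_nodes: int, on_demand_nodes: int, spot_price_ratio: float) -> float:
    """Cost of the Spot fleet as a fraction of the on-demand fleet's cost."""
    return spot_nodes * spot_price_ratio / on_demand_nodes


# Numbers from the example: 50 on-demand nodes, 20% interruption rate.
minimum_spot_nodes = spot_nodes_needed(50, 0.20)

# Assumed price ratio: Spot at roughly one third of the on-demand price.
cost_with_80_nodes = relative_cost(80, 50, spot_price_ratio=1 / 3)

print(minimum_spot_nodes)            # 63
print(round(cost_with_80_nodes, 2))  # 0.53
```

Under these assumptions, about 63 Spot nodes would cover the baseline, so running 70 or 80 leaves headroom, and even 80 Spot nodes cost roughly half of what 50 on-demand nodes do.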

With Spot Instances, you can get more reliability, more volume, more compute, and more capacity than on on-demand nodes – and for a much lower price. You may have to reframe it to people who claim their workloads can’t handle Spot Instances. And we can respond: well, you only had 40 pods before, now I’ll give you 60 for a lower price.

There’s a lot of ability to just negotiate within Cast and our teams about costs. Before that, all we had was shared cluster costs in a black box. Now, I have a slider inside the Cast platform that lets me choose between on-demand and Spot Instances.

Scheduled rebalancing

Due to Iterable’s use case, demand for cloud resources increases at the top of the hour and falls within the next 20 minutes. To reduce cloud waste, the team uses the Scheduled Rebalancing feature and runs it every hour.

During rebalancing, Cast automatically replaces suboptimal nodes with new ones that are more cost-efficient and run the most up-to-date node configuration settings.

Every hour, Cast removes around 2000 CPUs for Iterable, reducing the number of nodes by 10% and generating significant savings. During low-demand periods like nighttime, rebalancing action can achieve even 100% savings by draining and removing the nodes altogether. 

Making the most of AWS Savings Plans

Like many companies that heavily rely on cloud services, Iterable purchased AWS Savings Plans to cover some of its services at a discounted rate. As a result of implementing Cast, the company started saving so much that the capacity covered by Savings Plans allows more flexibility.

This is a good place to be in, but if you’re paying for certain reliability, you should use it. Now, we can scale any feature, knowing that we have the ability to offset that growth with Spot Instances. And it’s cheaper growth – a third of the price. All the while, we keep our Savings Plans 100% utilized, not overutilized or underutilized. This gives us an extra lever that we didn’t have before to really control our spend.

Incentivizing engineers to find cost savings

Another benefit of implementing Cast was increasing the accessibility of cost-effective Kubernetes solutions among engineers who would otherwise have no time to explore and set such tools up on their own.

“A lot of people want to use the advanced features of tools like Kubernetes, advanced features in the cloud, but they don’t have any access to them, or it’s past the scope of their job. But when you bring stuff like this to them – Spot Instances, dynamic node types – in the form of just a service configuration value, they’re more likely to test them, to use them in deployments or A/B tests, and to see what effect it has. They also feel more empowered in the ownership of their service and the infrastructure it runs on,” said Jason Sanghi.

The advanced reporting capabilities of Cast play a role in raising awareness of costs as well. “We’re excited about the namespace and workload cost reporting because that will fully give engineers insight into their spending,” he added.

I’d recommend Cast to anybody who doesn’t have Spot Instances in their Kubernetes cluster. I guarantee at least one service can handle Spot interruptions, with the 2-minute interval for that pod to be rescheduled on a new node without interrupting your availability. 

People running Kubernetes clusters that don’t utilize Spot Instances would benefit greatly from Cast – or anyone who feels like they’re overprovisioning their infra when they don’t need to. It’s a unique solution.


501-5000

Marketing automation

US

EKS
