Cloud Cost Transparency, Control, and Saving $3M on Amazon EKS: How Iterable Does It
“People running Kubernetes clusters that don’t utilize spot instances would benefit a lot from CAST AI – or anyone who feels like they’re overprovisioning their infra when they don’t need to be so. It’s a really unique solution.”
San Francisco, CA
Iterable, the customer activation platform that helps brands deliver joyful experiences with harmonized, individualized and dynamic communications at scale, was facing a common Kubernetes challenge: cost. Container compute expenses were one of the highest-cost items for the Iterable team. But before they could chart a cost-saving path forward, the team had to first solve unique challenges around cost transparency and cost monitoring.
By implementing CAST AI, Iterable mitigated key pain points and reduced its annual bill by over 60%. The team now has more control over its cloud resources and can flexibly adjust the ratio of on-demand to spot instances. This is how Iterable balances its costs and uses its AWS Savings Plans to the fullest.
“When you start talking about those numbers, leadership cares, finance cares, they start building it into their cost projections and estimates. Even 1 or 3% of a massive AWS bill is significant. So when you say numbers like 20% people are losing their minds. And our total possible savings take that 20% closer to 60 or 70%, which translates to $3-4 million per year. So that’s just an insane amount of savings. You can hire more people, you can spend more on infrastructure in other places.”
Jason Sanghi, Staff Software Engineer, SRE
Understanding and optimizing Kubernetes costs became a challenge
To optimize its operational expenses, Iterable planned to reduce cloud spend, and Kubernetes was the second-largest cost item on the list of optimization candidates. At the time, however, the team lacked the mechanisms to measure or reduce these costs.
With reliability as its critical priority, the team was hesitant to shrink the cluster’s size without having full clarity on costs and potential overprovisioning. This was when Jason Sanghi turned to CAST AI.
“There’s a lot of offerings on the marketplace, a lot of services. But one of the things about CAST was that I can use it in a read-only mode where it tells me how much we’re spending on our cluster at any given time. Once I know at any given time how much we’re spending, I can run tests on adding new node types or moving workloads and seeing if that delivers the savings that I’m looking for. Before that it was all shots in the dark.”
Jason Sanghi, Staff Software Engineer, SRE
Cost visibility and monitoring were just the starting points. Ultimately, Iterable was looking for a solution that would combine these capabilities with cost optimization. Luckily, CAST AI covered this aspect of Kubernetes cost management as well.
Onboarding CAST AI was a smooth process
Iterable moved along the onboarding process slowly, careful to check the impact of CAST AI on selected clusters across staging and production.
“When I saw the crazy benefits of CAST AI in staging, I was still skeptical. I thought this was an ideal world. Even now, when CAST AI is running on Iterable architecture in our services, I’m still in disbelief that we get the same impact in production,” said Jason Sanghi.
The implementation went smoothly thanks to the CAST AI support team, which was involved at every stage, providing expert advice and resolving challenges as they arose.
“The CAST AI support and sales team are so encompassing in their efforts. They will ping you when they notice a problem in your cluster, help with your onboarding, share best practices, and jump on meetings with you. Any issues or bugs I’ve encountered were solved by the next day.”
Jason Sanghi, Staff Software Engineer, SRE
Get results like Iterable – book a demo with CAST AI now
Massive savings and full cost transparency
After careful monitoring and gradual onboarding of its clusters to CAST AI, Iterable was confident enough to bring the solution to both staging and production environments.
“As SRE/cloud engineers/developers, we’re trained to be skeptics. And when people come to us with this type of savings, we ask a lot of questions regarding reliability. Moving slowly, we gained confidence by running on a small segment of our send services. Ultimately, we put CAST management across all of our send services, let it run for like a day or two, and I saw our cluster cost go way below a number I had never seen it even get close to. And then my eyes opened really wide and I said, wow, we can use this.”
Jason Sanghi, Staff Software Engineer, SRE
CAST AI delivered savings of 80% on clusters in the staging environment and 20% on clusters in the production environment. In total, the company expects to reach a 60-70% cost reduction once fully onboarded.
Iterable achieved these savings using the following CAST AI features:
Dynamic node types
When Iterable migrated its services from Amazon EC2 to EKS, many of them were given generic or standard pod sizes such as large or medium. Minor changes to these sizes could make the pods fit better on the available instance types, but back then the team couldn’t be certain how such changes would work out.
“As we onboard workloads, we can see what the engine will choose as the best fitting node type, maybe 40-core CPU, 64-core CPU, various combinations of these can fit on the nodes in a more compatible manner, but for us to try all of those out on our own, it’s like the traveling salesman problem. We would spend all of our day assigning node groups for things that can fit together,” said Jason Sanghi.
Bin packing introduces dynamic node sizes and gives teams a sense of which nodes will be requested and which ones are cheaper, while continuously consolidating workloads and deleting nodes that are no longer needed.
“We saw about 20% savings just by compressing things down through the bin packing service and sticking with on-demand nodes. We just had one or two node types that we trusted to use, CAST can say, well, you guys trusted these ones, but we think that these other 10 or 15 will perform in the same manner. And we can just test it over the next couple of weeks for you,” said Jason Sanghi.
Bin packing quickly brought the expected results. “Running it, just soaking it in there and seeing no issues, no noticeable performance decreases, no customer complaints, no issues or outages was really amazing. And then it’s just immediate 20% savings. And for very large clusters, 20% savings is easily over a million per year,” he added.
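To make the bin packing idea above concrete, here is a minimal Python sketch of the classic first-fit-decreasing heuristic. It is not CAST AI's engine, whose algorithm the source does not describe. The pod sizes and the two candidate node sizes (40 and 64 CPUs, echoing the sizes mentioned in the quote) are hypothetical, and the sketch only considers CPU requests.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with a fixed CPU capacity and the pod requests placed on it."""
    cpu_capacity: int
    pod_requests: list = field(default_factory=list)

    @property
    def cpu_used(self) -> int:
        return sum(self.pod_requests)


def first_fit_decreasing(pod_cpu_requests, node_cpu):
    """Place pods, largest first, on the first node that still has room."""
    nodes = []
    for request in sorted(pod_cpu_requests, reverse=True):
        if request > node_cpu:
            raise ValueError(
                f"pod requesting {request} CPUs cannot fit on a {node_cpu}-CPU node"
            )
        for node in nodes:
            if node.cpu_used + request <= node.cpu_capacity:
                node.pod_requests.append(request)
                break
        else:
            # No existing node had room, so open a new one.
            nodes.append(Node(node_cpu, [request]))
    return nodes


def utilization(nodes):
    """Fraction of provisioned CPU that is actually requested by pods."""
    total = sum(node.cpu_capacity for node in nodes)
    used = sum(node.cpu_used for node in nodes)
    return used / total


# Hypothetical workload: six 24-CPU pods and four 16-CPU pods.
pods = [24] * 6 + [16] * 4
for node_cpu in (40, 64):
    packed = first_fit_decreasing(pods, node_cpu)
    print(f"{node_cpu}-CPU nodes: {len(packed)} needed, "
          f"{utilization(packed):.0%} of provisioned CPU requested")
```

For this particular mix the heuristic needs six 40-CPU nodes or four 64-CPU nodes, so the smaller node type provisions fewer CPUs in total. A different pod mix can flip the result, which is why trying the combinations by hand is so time-consuming.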
In the beginning, the team wasn’t sure if it could use spot instances at all, given the potential interruptions and entire periods when spot instances become unavailable (spot droughts).
“We weren’t sure how a spot drought would be handled. Would the on-demand node be requested in time to keep our services available? Would deployments go through in the same amount of time? If we have a deployment window of five minutes, can we request new nodes and deploy them in five minutes? Those were our main concerns,” said Jason Sanghi.
As the team at Iterable found out, CAST AI provides the same reliability with spot instances. If the team is running 50 on-demand nodes, it can switch to a larger number of spot nodes to absorb the expected rate of interruption. If that rate is 20%, it can run 70 or 80 spot nodes and still pay far less than it would for 50 on-demand nodes.
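The arithmetic behind this example can be sketched in a few lines of Python. Two assumptions are baked in: the 20% interruption rate is treated as the share of spot nodes that may be reclaimed at the same moment, and the spot price is taken as one third of the on-demand price, the ratio Jason Sanghi quotes later in this story. Actual interruption rates and prices vary by instance type and region, and the function names are illustrative only.

```python
from fractions import Fraction


def spot_nodes_for_capacity(required_nodes: int, interruption_pct: int) -> int:
    """Smallest spot fleet that keeps `required_nodes` running even if
    `interruption_pct` percent of the fleet is reclaimed at once."""
    if not 0 <= interruption_pct < 100:
        raise ValueError("interruption_pct must be in [0, 100)")
    surviving_pct = 100 - interruption_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-required_nodes * 100 // surviving_pct)


def cost_in_on_demand_nodes(spot_nodes: int, spot_price_ratio: Fraction) -> Fraction:
    """Cost of a spot fleet, expressed as an equivalent number of on-demand nodes."""
    return spot_nodes * spot_price_ratio


minimum_fleet = spot_nodes_for_capacity(required_nodes=50, interruption_pct=20)
print(f"Minimum spot fleet to keep 50 nodes running: {minimum_fleet}")

# Assumed price ratio: spot at one third of on-demand.
for fleet in (70, 80):
    cost = cost_in_on_demand_nodes(fleet, Fraction(1, 3))
    print(f"{fleet} spot nodes cost about as much as {float(cost):.1f} on-demand nodes")
```

Under these assumptions, 63 spot nodes are enough to keep 50 running after a 20% loss, so a fleet of 70 or 80 leaves headroom while costing roughly half of what 50 on-demand nodes would.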
“With spot instances, you can get more reliability, more volume, more compute, and more capacity than on on-demand nodes – and for a much lower price. You may have to reframe it to people who claim their workloads can’t handle spot instances. And we can respond: well, you only had 40 pods before, now I’ll give you 60 for a lower price. There’s a lot of ability to just negotiate within CAST AI and our teams about costs. Before that, all we had was shared cluster costs in a black box. Now, I have a slider inside the CAST AI platform that lets me choose between on-demand and spot instances.”
Jason Sanghi, Staff Software Engineer, SRE
Currently, two-thirds of nodes are fully onboarded to the bin packing service, and two-thirds run on spot instances. Iterable is planning to move to as many spot instances as possible to achieve even greater savings.
Making the most of AWS Savings Plans
Like many companies that heavily rely on cloud services, Iterable purchased AWS Savings Plans to cover a part of its services at a discounted rate. As a result of implementing CAST AI, the company started saving so much that the capacity covered by Savings Plans allows more flexibility.
“This is a good place to be in, but if you’re paying for certain reliability, you should use it. Now we can scale any feature, knowing that we have the ability to offset that growth with spot instances. And it’s cheaper growth – a third of the price. All the while, we keep our Savings Plans 100% utilized, not overutilized or underutilized. This gives us an extra lever that we didn’t have before to really control our spend.”
Jason Sanghi, Staff Software Engineer, SRE
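The lever described here, keeping the commitment fully used while pushing growth onto spot, can be illustrated with a short Python sketch. It is deliberately simplified: AWS Savings Plans are hourly spend commitments rather than node counts, so treating the commitment as a number of nodes is an assumption made purely for illustration, as are the function name and the figures.

```python
def split_demand(demand_nodes: int, committed_nodes: int):
    """Fill the committed capacity first and send any overflow to spot.

    Returns (nodes on the commitment, nodes on spot, commitment utilization).
    """
    if committed_nodes <= 0:
        raise ValueError("committed_nodes must be positive")
    on_commitment = min(demand_nodes, committed_nodes)
    on_spot = max(demand_nodes - committed_nodes, 0)
    utilization = on_commitment / committed_nodes
    return on_commitment, on_spot, utilization


# Hypothetical figures: a commitment sized for 100 nodes, demand growing to 130.
on_commitment, on_spot, utilization = split_demand(demand_nodes=130, committed_nodes=100)
print(f"{on_commitment} nodes on the commitment, {on_spot} on spot, "
      f"commitment {utilization:.0%} utilized")
```

When demand falls below the commitment, the same function reports utilization under 100%, the underutilized state the quote describes wanting to avoid.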
Incentivizing engineers to find cost savings
Another benefit of implementing CAST AI was increasing the accessibility of cost-effective Kubernetes solutions among engineers who would otherwise have no time to explore and set such tools up on their own.
“A lot of people want to use the advanced features of tools like Kubernetes, advanced features in the cloud, but they don’t have any access to them or it’s past the scope of their job. But when you bring stuff like this to them – spot instances, dynamic node types – in the form of just a service configuration value, they’re more likely to test them, to use them in deployments or A/B tests and to see what effect it has. They also feel more empowered in the ownership of their service and the infrastructure it runs on,” said Jason Sanghi.
The advanced reporting capabilities of CAST AI play a role in raising the awareness of costs as well. “We’re excited about the namespace and workload cost reporting because that will fully give engineers insight into what they’re spending,” he added.
“I’d recommend CAST AI to anybody who doesn’t have spot instances in their Kubernetes cluster. I guarantee there’s at least one service that can handle spot interruptions, with the 2-minute interval for that pod to be rescheduled on a new node without interrupting your availability. People running Kubernetes clusters that don’t utilize spot instances would benefit a lot from CAST AI – or anyone who feels like they’re overprovisioning their infra when they don’t need to be so. It’s a really unique solution.”
Jason Sanghi, Staff Software Engineer, SRE
CAST AI features used
- Spot instance automation
- Real-time autoscaling
- Instant Rebalancing
- Full cost visibility