Company
Bud Financial (“Bud”) enhances financial data by identifying merchant, category, location, and transaction frequency, providing actionable insights and comprehensible inputs for LLMs in the financial services sector. With tens of billions of transactions handled, Bud’s market-leading transactional enrichment, categorization, and analysis lets financial organizations fully leverage their customer data and identify new, data-driven growth possibilities.
Challenge
Bud initially aimed to gain more visibility into the allocation and spending of cloud costs. However, once the team started taking action to optimize these expenses, the focus shifted to preventing the need for manual checks in the future while increasing resource utilization. That led to the idea of using an automation solution.
Solution
Bud initially tested a combination of various open-source tools, but their management proved challenging, and the cost of maintaining that platform was high. The company turned to Cast AI as an automation solution that offers visibility into its mechanisms and various safeguards. Using Cast, Bud automatically scales resources up and down and sets workload requests and limits to boost resource utilization and eliminate cloud waste.
Results
- 47% cost savings thanks to node hibernation
- CPU and Memory utilization improved dramatically, to as much as 93%
- 100% utilization of Committed Use Discounts
Node hibernation
Cast lets users pause and resume a Kubernetes cluster on a predefined schedule. By shutting cluster nodes down on weekends and overnight, Bud saved 80 hours of operation per week, which translates to 47% cost savings.
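The arithmetic behind that figure is straightforward; a quick sketch, assuming nodes are fully off for the 80 paused hours of a 168-hour week and that cost scales with hours of operation:

```python
# Savings from pausing nodes on a schedule: the paused fraction of the week.
HOURS_PER_WEEK = 7 * 24   # 168 hours in a full week
PAUSED_HOURS = 80         # nights and weekends, per the hibernation schedule

savings = PAUSED_HOURS / HOURS_PER_WEEK
print(f"{savings:.1%}")   # → 47.6%, the ~47% reported above
```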
Rebalancing and bin packing nodes
Rebalancing helps your cluster reach its optimal, up-to-date state by automatically replacing inefficient nodes with new, more cost-effective ones that use the most recent configuration parameters.
By running regular rebalancing, Bud reduces the number of compute instances in use by efficiently packing all the workloads into fewer nodes.
Cast maintains the optimal cost by running a scheduled rebalancing job every day. This is how Cast helps Bud achieve better resource utilization across CPU and Memory.
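The bin-packing idea behind this can be illustrated with a classic first-fit-decreasing heuristic. This is a generic sketch of the technique, not Cast's actual scheduler, which also weighs memory, affinity rules, and instance pricing:

```python
def first_fit_decreasing(workload_cpus, node_capacity):
    """Pack workloads (CPU requests, in cores) onto as few nodes as possible.
    Illustrative only -- a real rebalancer considers far more dimensions."""
    nodes = []  # each entry is the free CPU remaining on that node
    for cpu in sorted(workload_cpus, reverse=True):
        for i, free in enumerate(nodes):
            if cpu <= free:
                nodes[i] -= cpu  # workload fits on an existing node
                break
        else:
            nodes.append(node_capacity - cpu)  # provision a new node
    return len(nodes)

# Ten 1.2-core workloads pack onto four 4-core nodes
# instead of one node per workload.
print(first_fit_decreasing([1.2] * 10, 4.0))  # → 4
```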
Boosting CPU and Memory utilization
Bud also uses the Workload Autoscaler to rightsize workloads and reduce CPU requests, unlocking further cost savings.
The image above shows the impact of Workload Autoscaler on the compute cluster cost.
The image above shows the drop in requested CPU per hour after integrating Cast.
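Rightsizing of this kind generally means deriving requests from observed usage rather than guesses. A minimal sketch of one common heuristic, percentile-plus-headroom; this illustrates the general idea, not necessarily Cast's exact policy:

```python
import math

def recommend_cpu_request(usage_samples_mcpu, percentile=0.95, headroom=1.25):
    """Recommend a CPU request (millicores) from observed usage:
    take a high percentile of the samples, then add 25% headroom."""
    ordered = sorted(usage_samples_mcpu)
    idx = min(len(ordered) - 1, math.ceil(percentile * len(ordered)) - 1)
    return math.ceil(ordered[idx] * headroom)

# A workload requesting 1000m but actually using ~200-300m
# gets a much smaller, usage-based request.
samples = [210, 250, 230, 280, 300, 260, 240, 220, 290, 270]
print(recommend_cpu_request(samples))  # → 375 (millicores), down from 1000m
```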
Nearly 100% Committed Use Discounts utilization
Since Bud uses Committed Use Discounts (CUDs), Cast also helps the team maximize its use of these resources to make the cloud setup more efficient, reaching an average utilization rate of almost 100%.
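CUD utilization is simply the share of committed capacity actually consumed each billing hour; a toy calculation with hypothetical numbers (the commitment size and usage figures below are invented for illustration):

```python
# Hypothetical commitment: 100 vCPUs per hour, with usage that
# autoscaling keeps pinned near the committed amount.
commitment = 100
hourly_usage = [98, 100, 100, 97, 100, 99, 100, 100]

covered = sum(min(u, commitment) for u in hourly_usage)
utilization = covered / (commitment * len(hourly_usage))
print(f"{utilization:.1%}")  # prints the average CUD utilization
```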

Enabling the Workload Autoscaler brought instant savings. Before Cast AI, I spent about six months trying to educate people on how to properly set their CPU, memory, and scaling configurations.
That involved setting up dashboards and automating insights, and even after giving teams the right settings, getting them actioned and tuned across all our environments takes time.
With the Workload Autoscaler, all of that happens automatically, so the development teams can focus on coding instead of tweaking configurations.
Dan Udell, Director of Foundations Engineering at Bud
Running a financial services application in the cloud
What is the core use case of Bud’s platform?
We’re a business-to-business financial data intelligence platform: we analyze transactions as they happen, categorize them, and enrich them with additional data, using machine learning to add insights.
It all starts with the enrichment process – taking in millions of transactions, enriching them quickly, and providing extra insights. With that data, we can help businesses understand their customers without needing to be experts in data analysis. That’s the core of what Bud does as a platform.
What are the biggest infrastructure challenges you face?
The challenges for infrastructure are quite varied.
First, there’s a large volume of data coming in that we need to process quickly, so having solid APIs is crucial. We also need to connect to services and handle different methods of ingesting transactions, whether it’s through open banking, first-party ingestion, or other means. The first challenge is figuring out how people connect to us and provide that data.
Then, we face the challenge of processing and storing that data. One of the considerations here is dealing with regulations. Since we work with banks across the world – not just in the UK – different regions have different storage requirements. Not all data can be stored on the same clusters. For example, we now have around 25 clusters because some need to be ring-fenced while others need to be in different zones or regions.
On top of that, there’s the data intelligence aspect, where we’re using machine learning and analytics. We have clusters running tools like Kubeflow for building and training our models. So, our infrastructure covers a wide range of needs – data storage, production APIs, processing, and discovery spaces. We’re working to manage all of this as efficiently as possible.
Controlling cloud costs became a challenge
How did you approach understanding and managing your company’s spending on the cloud?
Initially, the challenge was understanding where we were spending the money. It’s always difficult to get a bill and then figure out where the money has gone, especially when you have 25 different clusters with various product sets. Questions like “Is it all going to discovery, production, or staging?” arise, along with “Which domains are costing us the most?” and “How is the spending allocated across the business?”
The first step was identifying where the money was being spent. Once we knew that, the next question was, “What actions can we take to reduce this?” Are we being as efficient as possible? Do we really need 10 nodes for a certain product? Have we done a performance analysis to confirm we need all of that? Are we even using those resources efficiently?
When we started taking action to optimize, the focus shifted to preventing the need for manual checks in the future. How could we ensure we’re only using the resources we need instead of overprovisioning and later realizing we overspent for a month?
This led to the idea of automating these mitigations.
In my first two or three months at Bud, I manually went project by project, cluster by cluster, and managed to save tens of thousands per month. But this was more than a full-time job. That’s why we looked into solutions like Cast AI – to offload that work to something that could monitor and make smart decisions automatically, much faster than I could.
Looking for the right automation solution
How did you approach evaluating different Kubernetes automation solutions?
We followed the usual path of exploring open-source solutions and considering what we could piece together from offerings like Google Cloud’s tools and other open-source options to try and build something in-house. We also looked into Kubecost.
Our thought process was to use different technologies in combination – using a descheduler for bin packing, Google’s node auto-provisioning, and Kubecost for visibility. But when you start piecing together all these technologies, managing them becomes a real headache.
Additionally, the cost of maintaining that platform can be surprising. Even though it’s all open source and installed on our cluster, it takes up space, time, and resources.
That’s when one of my colleagues pointed me to Cast.
How does automation fit into your approach to managing cost and infrastructure?
As someone who’s been a software engineer for 10 to 15 years, I’ve come to love automation. We try to automate every aspect of software engineering as much as possible, like continuous integration and continuous delivery, so it didn’t feel foreign to think that cost management should be automated too.
When you look at it, what I was doing manually was identifying rule sets and applying them repeatedly. Having a system that can apply rule sets automatically made sense to me. The challenge, though, was proving to others that it was doing the right thing and that it was safe. That’s where Cast has been helpful – it offers a lot of visibility into what it’s doing, along with safeguards like “break glass” options and spending caps.
We take advantage of all those features to ensure we’re managing things safely. We’ve also been able to prove it out by testing it in pre-production environments and rolling it out gradually after seeing success.
Automation, especially in the SRE and platform engineering space, is something we strive for in every aspect. We adopted tools like Terraform quickly because navigating consoles is cumbersome. Similarly, we switched to CircleCI, GitHub Actions, and other such tools. Automating the cost management process just makes sense.
How was the onboarding process with Cast, and what kind of results did you see early on?
The process was really smooth. We started with an initial trial where we took one of our pre-production clusters to see if we could save money, just to prove everything out. We used the installation scripts – just ran a command line, and Cast was up and running. Then we adjusted the UI to get the performance we wanted and ensured all the settings were correct.
We repeated this on a few more pre-production clusters, and it was quick. We started seeing value right away. We had dashboards that showed us how much we were spending day to day, where we were saving, and where there were opportunities to save more.
Once we contracted with Cast, we integrated it into our workflow using Terraform and GitOps to automate the process and ensure everything running on our clusters was properly managed.
The onboarding process took around two to three weeks. In that time, we had all of our staging environments onboarded and even our first production cluster, where we were already starting to see benefits.
Overall, it was a quick process, and we gradually onboarded more clusters and adopted additional cost-saving features over time.
Getting $2 back for every dollar spent
What kind of return on investment have you seen from using Cast?
I find it tricky to measure results because we’ve been growing while using Cast AI. If I had to give a conservative estimate, for every dollar we’ve spent on Cast, we’ve probably gotten somewhere between $1.30 and $1.50 back.
And that’s without even considering features like hibernating clusters, which provide huge benefits. For example, no developers are working on weekends or at night, so we can turn off 10 pre-production clusters during those times. That alone has saved us a significant amount, so we’re likely getting closer to $2 back for every dollar spent, if not more.
I think there’s still plenty of room to push the savings even further.
What specific features of Cast have had the most impact on these cost savings?
Cluster Autoscaler
The core autoscaling and bin packing product is fantastic. When you compare it to the standard autoscaler on Google Kubernetes Engine, it’s so much more efficient and flexible.
For example, I wanted to enable image streaming on a node pool, and with GKE, you’d have to delete and recreate the entire node pool, meaning every single node would have to go offline and return. That’s a big process. But with Cast AI, I could swap out nodes one at a time, and it just did it for me. This not only saved costs and time but also allowed the nodes to come online much quicker.
We saw real improvements when we had sudden spikes in load from customers. Instead of taking 5-10 minutes to scale up, it was just a couple of minutes. That was awesome.
Workload Autoscaler
One of the first things I noticed when I logged into the Cast dashboard was how inefficiently we were using our clusters. We were provisioning over 100 cores because we requested them, but only using 30 or 40 in some clusters, sometimes even less. Enabling the workload autoscaler brought instant savings.
Before Cast, I spent about six months working with people to properly set their CPU, memory, and scaling configurations. That involved setting up dashboards and automating insights, and even after giving teams the right settings, it took time to get these set and tuned across our whole estate.
With the Workload Autoscaler, all of that happens automatically, so the development teams can focus on coding instead of tweaking configurations.
Node hibernation
Another major benefit was the node hibernation feature. We have nine or ten pre-production clusters, which might sound like a lot, but they need to be separate because some of our production clusters are so unique, and we have to test them thoroughly before going live.
With Cast hibernation and Spot nodes, these environments cost around 10% of what a production cluster does, dropping from thousands to hundreds of dollars a month. That’s been a huge win for us.
How has using Cast impacted your time, engineering focus, and overall cost management?
I actually enjoyed the challenge of managing costs manually, but it’s really hard to do. For the first three months, it was a full-time job, and it started to affect my sanity. Every night, I would go to bed checking Google’s cost billing page, and every weekend, I would be monitoring for spikes and running downstairs to fix things. Now, with Cast, I trust that it’s in control, so I get a much better night’s sleep. The time saved has been massive.
There’s also savings in our engineers’ time. We use the Workload Autoscaler on many of our smaller workloads and clusters where we don’t have as much performance data, or people aren’t constantly watching them. It takes over, so they can focus on developing instead of worrying about cost and performance.
A year ago, I would have needed a team of two or three people just to look after clusters, but now it’s pretty simple thanks to Cast. That’s been a huge shift.
In terms of money saved, I often joke that every dollar I save, my counterpart in development spends, but that’s the advantage. We’re seeing growth in our products and user base, so it’s great that the savings help fund that growth.
Time-wise, platform engineers often try to balance cost efficiency, reliability, and productivity. If I had to focus all my time on cost management, other areas like incident handling or development speed would suffer. Cast frees up my time, so I can work on incident response – where we’ve made significant improvements – and also build tools to help developers work faster and safer.
Building a FinOps culture at Bud
How has the ability to gain deeper insight into costs and control them affected your team’s approach to cost management?
I would say the time saved and cost reduced are equally important for us, both huge wins. But the third major benefit for me has been the ability to democratize cost management. By tagging things according to different workloads and team ownership, we’ve moved away from centralizing cost management. Now, each team can manage its own costs, see how their workloads are performing, and drill down to understand the specific details.
Teams can tag resources to their own work and see which teams are driving the most costs, where spikes happen, and more. This has empowered others to take control of their costs, not just my team. It’s been a game-changer for us!
It’s essentially created a culture where engineers are incentivized to be more cost-conscious, which is great for overall efficiency and accountability.
A real partnership
How has your experience with Cast evolved over time, and how has their responsiveness contributed to your success?
It’s definitely been a journey and one that Cast has been deeply involved in, which has been amazing. When we started, we focused on using committed use discounts to lower our costs. At that point, optimizing those costs wasn’t even something Cast supported. But just six months later, they launched a product that addressed that very use case, asking if we could try it out and whether it worked for us. That kind of responsiveness has been fantastic.
There have been other instances, too, like discussions around the workload autoscaler, where I mentioned it would be useful to support the Istio sidecar. A couple of weeks later, that feature was available for us to test. It’s clear that the product is constantly evolving, and Cast is really receptive to feedback.
The support has also been great. Whenever I’ve had a problem, whether a technical issue or a bug in Terraform, they’ve been quick to help, offering hands-on guidance.
Overall, it’s felt less like a typical vendor relationship and more like a partnership, which is why I’m happy to sit down and have these kinds of discussions. It’s been a great experience!
Next steps
Have you experimented with AI Enabler or any similar features? What potential do you see for them in your use case?
We’ve started exploring both the AI Enabler and some of the security components. So far, we’ve mainly just played around with them. We’re not using them for our core AI functions, which are already built on other models, but we’re exploring how AI can improve our dev productivity.
For example, before pushing a batch of changes to production, we’re looking into using generative AI to summarize the changes. By integrating this into the AI proxy, we can ensure that queries are sent to the right engine at the right cost, while also getting visibility on those queries. So, the current focus is on leveraging AI, particularly generative AI, to improve dev velocity and manage costs and load through Cast AI’s proxy.
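A cost-aware router of that kind can be sketched in a few lines. The model names, prices, and complexity scores below are entirely invented for illustration; this is a conceptual sketch, not Cast's AI Enabler API:

```python
# Hypothetical model catalog -- names and per-1k-token prices are made up.
MODELS = {
    "small-fast":  {"cost_per_1k_tokens": 0.0005, "max_complexity": 1},
    "mid-general": {"cost_per_1k_tokens": 0.0030, "max_complexity": 2},
    "large-smart": {"cost_per_1k_tokens": 0.0150, "max_complexity": 3},
}

def route(query_complexity):
    """Pick the cheapest model capable of handling the query's complexity."""
    eligible = [(name, spec) for name, spec in MODELS.items()
                if spec["max_complexity"] >= query_complexity]
    return min(eligible, key=lambda item: item[1]["cost_per_1k_tokens"])[0]

print(route(1))  # → small-fast: simple summaries go to the cheapest engine
print(route(3))  # → large-smart: only the big model qualifies
```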
It’s been an interesting experiment, and we’re excited to see where this could go!