Bede Gaming automatically optimizes K8s workloads with no risk to performance

Company

Bede Gaming (part of the Merkur Group) provides a leading digital platform for the online gaming industry, powering some of the market’s most well-known companies with a scalable and secure solution. The platform manages data for nearly 6 million players worldwide, processing over 8 billion transactions worth over £22 billion annually.

Challenge

Facing massive user traffic peaks regularly, development teams overprovisioned Kubernetes workloads to deliver a fantastic experience to end users. But that was costly, which prompted Bede Gaming to search for a solution to optimize resource utilization at the workload level without impacting performance or adding extra work to the platform team.

Solution

Bede Gaming turned to Cast AI’s Workload Autoscaler with its automated rightsizing capabilities. The solution automatically sets workload requests and limits to increase resource utilization, eliminate cloud waste, and balance cost and performance.
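To make the mechanism concrete: rightsizing means deriving a workload's requests from its observed usage rather than from a guess. The sketch below is not Cast AI's actual algorithm – it is a minimal illustration of the idea, with hypothetical function names and thresholds:

```python
# Minimal sketch of percentile-based workload rightsizing.
# Not Cast AI's algorithm; the percentile and headroom are illustrative.

def recommend_request(usage_samples_mcpu, percentile=0.95, headroom=1.10):
    """Recommend a CPU request (millicores) from observed usage samples."""
    ordered = sorted(usage_samples_mcpu)
    # Pick the sample at the chosen percentile of the observation window.
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    # Add headroom so normal variance doesn't cause throttling.
    return int(ordered[idx] * headroom)

# A workload provisioned at 2000m whose observed usage looks like this:
samples = [250, 280, 300, 310, 320, 900]  # millicores over a window
print(recommend_request(samples))  # recommends 990m instead of 2000m
```

An autoscaler applying this kind of recommendation continuously, per workload, is what replaces the manual "turn the dial down after the peak" routine described later in the interview.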

“Automation is important because in a perfect world – and I think any engineer would agree – you want to automate everything as much as possible. You want to remove the level of risk around human error. And ideally, you want something that’s going to run at a minimal cost.

Having the ability to automate what we do with rules and thresholding, I’m certain that the machine will work way faster than we possibly could with any human being, 24/7.

It’s all about striking that balance between cost and performance and, ultimately, becoming more efficient for our customers. Cast AI helps us achieve that.”

Dan Whiteley
Chief Technology Officer at Bede Gaming

Scalability and the risk of overprovisioning

What is the most prominent business use case your cloud infrastructure supports?

We provide backend platform services for iGaming companies, powering some of the sector’s biggest brands across lottery, casino, sports betting, and bingo. With those types of environments, you’re running a lot of high-availability, high-performance infrastructure that needs to be able to respond to very high traffic levels.

Imagine a scenario where you’ve got a major event coming up, like the Super Bowl. People want to make a sports bet, and suddenly, hundreds of thousands of users will potentially visit your site. Online casinos are a little more consistent in traffic, but you still get users who make a lot of transactions over a period of time. 

The infrastructure needs to respond to that demand. So, we get the scale and elasticity of the infrastructure, but without good cost control, it can get very expensive very quickly. 

E-commerce companies experience this as well during Black Friday. Maybe some of their teams are provisioning more cloud resources than necessary just to have a quieter life, but it comes at a very big cost to the business. 

Balancing cost and performance

What is your approach to cloud cost management? 

When I joined the company, one of the things I became aware of was the cost of running our platform. It’s a large and detailed system with lots of integrations, tools, and features, which added high operational running costs.

We had taken the first step by moving over to Kubernetes, but how do we ensure that we optimize that infrastructure?

My objective was twofold:

  1. Gaining better control and visibility over cost
  2. Bringing the cost down to make our platforms more cost-effective

And from the end-player perspective, all of this should make no difference. They should have a great, responsive experience. They shouldn’t care how the clock works, so to speak.

We’ve got a really powerful infrastructure, which we were scaling down manually after peak traffic periods ended, but a human being can only go so far versus a machine. It’s important to strike the right balance between maintaining a high quality of service while keeping our expenses down.

As it turns out, some algorithms can tune that and point out potential savings for us. We want to always drive efficiencies by leveraging technologies like Cast AI. 

What was your first step on the journey to optimizing your cloud costs?

When I first asked how much our cloud bill was, I instantly felt it was expensive based on my experience with the type and size of the organization. And it turned out I was right. 

Our procedure was to overprovision workloads during peak periods, ensuring good customer service by scaling resources beyond what was actually needed to cover exceptional or unexpected spikes.

I’m refreshing these procedures with a mindset shift toward FinOps. The idea is that you need to run lean without breaking the service.

That’s where Cast AI was a really good eye-opener. I’m not relying on a human being to turn a virtual dial in Azure, and we don’t risk human error in resource calculations impacting the customer experience. Instead, algorithms are making the best choice for me. It gives us visibility and the ability to rightsize our infrastructure at the workload level.

Automated rightsizing was the solution

How can a solution like Cast AI help companies that face peak times so often?

When dealing with high volumes of traffic, we need to have faith in the infrastructure and any technology we put behind it. We have to have faith that it’s not going to disrupt the service.

The experience with Cast AI was quite interesting because we could put it in a read-only mode and safely see how it could reduce our costs. Then, there are the levels of how aggressive we want to be in terms of optimizing workloads. 

For us, being less aggressive works well – we can still generate benefits by applying optimization without risking service stability. For example, I’m personally not comfortable putting our entire production environment on Spot VMs, but Cast AI’s features give us a lot of room to optimize – for example, picking the right virtual machine sizes or setting workload requests and limits that match actual resource requirements.

Overprovisioning is the easy way out. Anyone can do that. I’ll chuck some massive infrastructure at the problem, and the problem goes away. Yes, but it comes at a significant cost. So, it’s all about striking that balance between cost and performance, and, ultimately, becoming more efficient for our customers. Cast AI helps us achieve that.
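The cost of “chucking massive infrastructure at the problem” can be put in numbers by comparing what a workload requests with what it actually uses. A toy calculation, with made-up figures:

```python
# Hypothetical numbers: how much provisioned capacity sits idle
# when a workload is overprovisioned "just in case".

def reclaimable_slack(requested_mcpu, used_p95_mcpu):
    """Return (idle millicores, idle capacity as % of what was requested)."""
    slack = requested_mcpu - used_p95_mcpu
    return slack, round(100 * slack / requested_mcpu)

# A pod requesting 2000m CPU whose 95th-percentile usage is 900m:
slack, pct = reclaimable_slack(2000, 900)
print(f"{slack}m CPU ({pct}%) potentially reclaimable")
```

Multiplied across every workload in every cluster, slack of this size is the “significant cost” the quote refers to, and it is what automated rightsizing reclaims.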

Why can only automation solve this problem?

Automation is important because in a perfect world – and I think any engineer would agree – you want to automate everything as much as possible. You want to remove the level of risk around human error. And ideally, you want something that’s going to run at a minimal cost. Having the ability to automate what we do with rules and thresholding, I’m certain that the machine will work way faster than we possibly could with any human being, 24/7.

We have multiple environments for customers with full isolation between services, which is crucial for maintaining optimal stability. However, there’s considerable overhead in terms of environment management. 

So, if we can remove the people aspect from that equation and the potential mistakes they might make, that’s great. This is where Cast AI, with its level of automation, allows us to just point at these particular clusters and set the thresholds up. 

What level of savings were you able to achieve using the least aggressive settings from Cast AI?

We achieved cost savings in the 10% to 15% range, but we could easily go higher if we changed our thresholds. I’m quite happy with that order of magnitude, since 10% of a big number is still a big number.

My approach is to roll Cast AI out to our customers’ environments step by step to build our initial baseline and then, over time, start to fine-tune it.

Cast AI has been what I would call one of those low-hanging fruits to get our workloads under control, get the cost visibility, and drive efficiencies that are ultimately low-risk and low-cost for us to implement.

Anyone who’s got a fairly expensive cloud bill should consider using Cast AI.

If you’re running Kubernetes, you need to go beyond just running it from a scaling perspective because you’re probably not rightsizing the workloads and VMs. Any organization that gets high volumes of traffic across gaming, financial services, and e-commerce could use automation and avoid overprovisioning a lot of hardware. 

What was the impact of Cast AI on the engineering workload? 

We probably don’t measure that per se, but in my mind, it’s one less thing to worry about, so teams can be focused on other things of potentially higher value. 

Having Cast AI running in the background with a good level of confidence that we’re running as efficiently as we can, balancing the service we’re providing – that’s great. Then, periodically, someone can check on Cast AI to verify that it’s doing what it’s supposed to – which is something we see anyway in our cost usage profile and monthly billing.

A solid partnership

What was the implementation process of Cast AI like? 

I’m quite instinctive as a CTO. If I look at a product and think, “This looks like it can do what we need to do,” I won’t spend days or weeks looking at other options. This worked with Cast AI since we ran a great POC; that was my success measure. 

The integration took 8 weeks to set up, with Cast AI assisting us through the process. This product delivers on what it says it can do. It was a no-brainer. Additionally, it was relatively low-risk from an implementation point of view for us and relatively low-cost. 

The final rollout has been gradual. My hope is that all of our customers will opt to enable this integration for their environments, as it certainly provides benefits to the service we offer them. 

How does the Cast AI team support you during the process?

It’s quite interesting for Cast AI to work with us because we bring the team on calls with our customers. We bought the tool; we’re good at it, but we’re not experts. So, it’s a kind of joint partnership as the Cast AI team runs mini POCs for our customers. 

It’s all about bringing value back to our customers and making them run as efficiently as possible with the least amount of change and risk involved. The Cast AI team understands this goal and supports us in achieving it.

One of your architects is a genius. He knows both Kubernetes and Cast AI inside and out, which is why I’ve got a lot of confidence in this approach. The POC is a great tool to demonstrate the product’s value. When you’re pairing that with people who are competent in what they do and how they understand the platform and your business, that’s very good. 

What other Cast AI features are you planning to use?

We have just started exploring the Kubernetes security aspect of Cast AI, and our principal DevOps engineer found the vulnerability scanning and other security features useful. This was a kind of side benefit to getting the tool, but an important one.

Our primary goal was to achieve better cost control, but we were pleased to discover that there’s additional value we can take advantage of. 


Company size: 251-500

Industry: iGaming

Region: EMEA

Platform: AKS
