How OpenX Autoscales Spot VMs for Major Cost Savings

Serving more than 30,000 brands, OpenX is a leading real-time ad exchange that helps advertisers get the highest value for any trade. To build a platform that meets the demands of the fast-paced adtech sector, OpenX uses Google Cloud and its managed Kubernetes service, Google Kubernetes Engine (GKE), to run over 200,000 CPUs across a dozen regions.

To balance cost and capacity, OpenX used the native GKE cluster autoscaler, but as the infrastructure grew more sophisticated, that solution proved insufficient because of the native autoscaler's complexity and limitations.

I’d recommend Cast AI to any company. Because no matter how many clusters you have – one or a hundred – I think there is always room for improvement. Especially if your company does care about the cost of compute resources. Cast just simplifies so many things that would normally require special attention.
Ivan Gusev
Principal Cloud Architect at OpenX

By implementing Cast, OpenX benefits from accurate autoscaling that considers variable pricing and capacity availability, spot VM automation with a fallback to on-demand capacity, and real-time visibility into cloud costs.

A modern adtech application needs the cloud

To build a platform that meets the demands of the adtech industry, OpenX moved to Google Cloud services and started using its managed Kubernetes service, Google Kubernetes Engine. The company has a five-year agreement totaling more than $110 million, and the team manages over 200,000 GKE cores at peak, running across a dozen regions.

Before the cloud, capacity planning was challenging since we had to plan six months in advance. And with the adtech industry, that basically never works out because things are changing much faster than that. With the cloud, all the capacity is on-demand as soon as you need it. We are going through application modernization and have standardized on GCP managed services where possible, trying to eliminate any kind of legacy that we might have brought with us from the data center times. It has been more than three years in the cloud.
Ivan Gusev
Principal Cloud Architect at OpenX

OpenX soon encountered two challenges: cost and capacity

With the cloud come scale and ease of provisioning virtual machines, along with two related challenges: cost and capacity. “Striking the balance between how much your infrastructure costs versus how much revenue it brings is a big challenge,” said Ivan Gusev.

The second challenge, more directly related to scale, is capacity. The big secret of cloud providers is that they don’t offer infinite capacity, and large workloads like OpenX’s may encounter limits in network, compute, or storage capacity.

The primary capacity limit that we’re often touching is spot-preemptible compute capacity. And that’s why we’ve been developing ways of mitigating that. Often, you don’t have the ability to provision the most cost efficient capacity. So, mixing these two requirements – running highly efficient, using the cheapest possible compute and large capacity in the cloud – makes it probably the biggest challenge for us.
Ivan Gusev
Principal Cloud Architect at OpenX

OpenX addressed this issue with the native functionality that GKE provides, including the cluster autoscaler and a variety of node pools. Several dimensions need to be accounted for and often optimized simultaneously: cost, capacity, VM size, and location (zone). In addition, there is the state of the cluster and the workloads that need to run on it. Decisions that were optimal two hours ago become sub-optimal as the situation changes across these dimensions and needs to be reconciled.

Over time, the infrastructure and configurations have become increasingly complex and hard to understand and troubleshoot, with multiple pieces responsible for different aspects of capacity pricing, planning, acquisition, and continuous optimization. “So, we started looking into alternative options, and sure enough, Cast offered one of the best options that we considered,” said Ivan Gusev.

I think the assurance that Cast is performing better and will be outperforming going forward comes from the conversations with engineers and support. It felt like I had somebody backing me in our team struggles with capacity and cost challenges. I got the whole team supporting me and trying to deliver the best solution possible.
Ivan Gusev
Principal Cloud Architect at OpenX

Real-time cluster autoscaling got OpenX a quick win

Available on the Google Cloud Marketplace, Cast comes with autoscaling mechanisms and policies that are particularly important to OpenX. “When I look at the autoscaler settings in Cast, it checks many boxes I’ve been developing. So many of these policy features are achievable. Outside, they probably need three to five components that work together and are often unaware of each other. In Cast, it’s just a matter of configuring the policy and watching how it performs – necessary data is also presented to you,” said Ivan Gusev.

The advantage of using the Cast autoscaler vs. the native one provided by GKE is that it can handle variable pricing – which is especially important as almost 100% of compute at OpenX runs on spot VMs and each region has its own price structure for compute.

“GKE’s cluster autoscaler uses pricing for different node types, node families, or instance types. If you look deep into the code, it has fixed pricing, which is not really pricing. It’s ratios – it knows that n1 is cheaper than n2. With more dynamic pricing and variable spot pricing, it becomes very inaccurate. So, your cost optimization on GKE is really limited by whatever is hardcoded on the cluster native autoscaler,” said Ivan Gusev.

The Cast autoscaler, on the other hand, relies on real-time pricing information.

With Cast AI, you have very accurate and specific information that comes from the pricing API. It accounts for the region, the instance type, and so on, making decisions based on very accurate pricing.
Ivan Gusev
Principal Cloud Architect at OpenX
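
To make the difference concrete, here is a minimal Python sketch of the two selection strategies described above. It is not GKE's or Cast's actual code: the `Offer` type, the `STATIC_RANK` table, the region name, and all prices are hypothetical values invented for illustration. It only shows how a fixed ranking of node families can disagree with a choice based on current per-region spot prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    family: str                 # node family, e.g. "n1" or "n2"
    region: str                 # hypothetical region label
    vcpus: int
    spot_price_per_hour: float  # invented number, not a real quote


# Hypothetical fixed ratios: a lower value means "assumed cheaper".
STATIC_RANK = {"n1": 1.0, "n2": 1.2}


def pick_by_static_rank(offers):
    """Choose the offer whose family has the lowest hardcoded ratio."""
    return min(offers, key=lambda o: STATIC_RANK[o.family])


def pick_by_live_price(offers):
    """Choose the offer with the lowest current spot price per vCPU."""
    return min(offers, key=lambda o: o.spot_price_per_hour / o.vcpus)


# Assumed situation: in this region, n2 spot is currently cheaper than n1 spot.
offers = [
    Offer("n1", "region-a", 8, 0.120),
    Offer("n2", "region-a", 8, 0.096),
]

static_choice = pick_by_static_rank(offers)
live_choice = pick_by_live_price(offers)
print(static_choice.family, live_choice.family)
```

With these made-up numbers, the static ranking still selects n1 while the price-aware selection picks n2, which is the kind of inaccuracy the quote attributes to hardcoded ratios.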

Maximizing the benefits of committed cloud spend with spot VMs

OpenX runs its infrastructure solely on Google Cloud and has signed a multi-year contract with Google to ensure the most cost-effective access to cloud services.

“But this also puts pressure on the capacity because for processing all requests in a single cloud, you have less options to obtain capacity, and we’ve been pretty close to maximum capacity in many of the US regions,” added Ivan Gusev.

Since OpenX runs nearly 100% of its compute on spot VMs, it needs to balance these capacity requirements with cost savings. The company can do that thanks to Cast’s seamless automation and the spot fallback feature that ensures workloads always have a place to run by moving them to on-demand resources in case of a spot drought.

We certainly have spot fallback always enabled, and it’s a normal situation for us to be unable to obtain spot capacity at the moment. But the capacity situation at Google Cloud is very dynamic. If you can’t obtain the spot capacity now, you might be able to in 10 minutes. That’s why spot fallback works great for us – we can expect Cast to maintain the best possible cost for the cluster by constantly attempting to replace the on-demand capacity with spot.
Ivan Gusev
Principal Cloud Architect at OpenX
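
The fallback behavior described here can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Cast's implementation: the `Node` and `Cluster` types, the `provision` and `rebalance` functions, and the `spot_available` flag are all hypothetical names. Real spot availability would come from the cloud provider at provisioning time.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    lifecycle: str  # "spot" or "on-demand"


@dataclass
class Cluster:
    nodes: list = field(default_factory=list)


def provision(cluster, name, spot_available):
    """Add a node, preferring spot and falling back to on-demand."""
    lifecycle = "spot" if spot_available else "on-demand"
    node = Node(name, lifecycle)
    cluster.nodes.append(node)
    return node


def rebalance(cluster, spot_available):
    """Swap fallback nodes back to spot once capacity returns.

    Replacement is modeled by flipping the lifecycle in place.
    Returns the number of nodes replaced.
    """
    if not spot_available:
        return 0
    replaced = 0
    for node in cluster.nodes:
        if node.lifecycle == "on-demand":
            node.lifecycle = "spot"
            replaced += 1
    return replaced


cluster = Cluster()
provision(cluster, "node-1", spot_available=True)
provision(cluster, "node-2", spot_available=False)  # spot drought: fall back
before = [n.lifecycle for n in cluster.nodes]

# Later (for example, ten minutes on) spot capacity is available again.
replaced = rebalance(cluster, spot_available=True)
after = [n.lifecycle for n in cluster.nodes]
print(before, replaced, after)
```

In practice the rebalance step would run periodically, so on-demand capacity is held only for as long as the spot shortage lasts.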

Reduction of engineering workload

One of the key drivers behind OpenX’s decision to implement Cast was limiting the engineering workload around the configuration and provisioning of cloud resources.

“The number of components and different tweaks and settings that we had to maintain just continued to grow as there are more instance types, compute families, regions, capacity issues, and multiple ways of routing traffic. Basically, it came to the point where it was probably getting pretty close to one headcount to maintain and monitor all these variables across multiple clusters. Offloading that to Cast is certainly helping us to be able to spend time on different problems,” said Ivan Gusev.

Another aspect of Cast that made a real difference to OpenX was the increased visibility around the current capacity situation, how it translates into the cost of running infrastructure, and the centralization of the Kubernetes cost management toolkit.

I’m not saying that Cast AI is fully hands-free. There is still sometimes a need to oversee and tune up things. But centralization of all the knobs and cost visibility makes it much less time-consuming than managing multiple tools, aggregating information from multiple sources, monitoring logs, container errors, and so on. That’s probably the best way I can describe our savings in the number of people who we would need to watch for capacity.
Ivan Gusev
Principal Cloud Architect at OpenX

Next steps

OpenX looks forward to expanding its use of Cast with new cost monitoring and forecasting features that are continually released by the team. “I feel like Leon [Cast AI’s CTO] is driving lots of interesting product features forward. That was another reason why we signed up. While Cast is already valuable as a product, I’m even more excited about what’s to come,” said Ivan Gusev.

Cast is also set to contribute to improving the environmental footprint of OpenX. The company is the first and only ad exchange to achieve CarbonNeutral® certification, and it has signed the SBTi pledge to reduce carbon emissions by more than 90%. “Looking at our long-term vendor relationship, we feel like there could be ways of incorporating and improving our carbon footprint using Cast and the decisions it makes,” added Ivan Gusev.

Cast feels like a very modern approach both from the technical and UI perspective. Add to that excellent support team resources to answer any questions, and you’ve got the three biggest things that stand out for me. It feels like driving a Tesla.
Ivan Gusev
Principal Cloud Architect at OpenX