Company
Foretellix is the leading provider of data automation for AI-powered autonomy. Its Foretify™ toolchain offers a measurable, efficient, and reliable path to safe, scalable autonomy, empowering its customers to launch with confidence. Leveraging its industry-leading verification and validation technology, Foretellix is driving the AI autonomy revolution.
Challenge
As the Foretellix team and product grew, so did its use of cloud resources, generating significant costs that spurred the company to look for a way to optimize cloud resources, reduce cloud spend, and improve energy efficiency.
Solution
Foretellix adopted Cast AI to improve cloud resource utilization and increase energy efficiency, making its system more sustainable and environmentally friendly. Doing more with the same infrastructure benefits everyone and contributes to a greener, more efficient future.
By integrating Cast AI, Foretellix dramatically reduced its overprovisioning rate, resulting in energy savings of over 199,000 kWh, roughly equivalent to the CO₂ absorbed by 3,823 trees over one year.
Cast helps Foretellix optimize its cloud resources and cut costs in two ways. First, the company's Kubernetes workloads automatically fall back from Spot Instances to on-demand capacity when Spot availability tightens during high-demand periods. Second, autoscaling lets Foretellix scale its spiky CI workflows up and down with demand, reducing waste and keeping developer feedback fast without compromising performance.
Results
- 30% cloud cost savings
- Improved the energy efficiency of its cloud infrastructure by boosting utilization
- Seamless automated management of Spot Instances
- Automatic downscaling of compute resources for extra cost savings
With Cast in place, Foretellix runs its infrastructure at its lowest recorded cost. The company runs more workloads and tests new features without increasing the time or costs involved.
By boosting resource utilization and shutting down idle compute instances during low activity periods, Foretellix achieved its goal of improving the energy efficiency of its cloud infrastructure.
Node autoscaling
The Cast autoscaler efficiently manages the spiky usage patterns of short-lived jobs by provisioning compute capacity in less than two minutes. The image below shows the cluster rapidly scaling up from zero to 2,000 CPUs and back down to zero.
Here is a similar rapid node scaling pattern over a 30-day period, which resulted in improved resource efficiency and optimal compute costs.
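For readers who want to see the mechanics, here is a minimal Python sketch of demand-driven node scaling of this kind. It is an illustration only, not Cast's actual algorithm; the 64-vCPU node size and the job figures are assumptions chosen to mirror the 2,000-CPU burst above.

```python
import math

NODE_CPUS = 64  # hypothetical node size (vCPUs per provisioned node)

def nodes_needed(pending_cpu_requests: list[float]) -> int:
    """Number of nodes required to cover the CPU requested by pending jobs.

    A real autoscaler also weighs memory, pods-per-node limits, and headroom;
    this sketch models only the CPU dimension.
    """
    total_cpu = sum(pending_cpu_requests)
    return math.ceil(total_cpu / NODE_CPUS) if total_cpu > 0 else 0

# A burst of 500 short-lived CI jobs, each requesting 4 vCPUs (2,000 CPUs total).
spike = [4.0] * 500
print(nodes_needed(spike))  # -> 32 nodes of 64 vCPUs to absorb the burst

# Once the jobs finish, nothing is pending and the cluster can scale back to zero.
print(nodes_needed([]))     # -> 0
```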
Workload bin-packing
The node list below shows how the Cast autoscaler carries out workload bin-packing to effectively eliminate resource waste at the node level.
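As a rough illustration of what bin-packing does, the Python sketch below packs workload CPU requests onto as few nodes as possible using a simple first-fit-decreasing heuristic. It is a generic textbook version of the technique, not Cast's placement logic, and the request sizes and 16-vCPU node are made-up examples.

```python
def bin_pack(cpu_requests: list[float], node_cpus: float) -> list[list[float]]:
    """Place workloads onto as few nodes as possible (first-fit decreasing).

    Each 'node' is the list of CPU requests scheduled onto it. A production
    scheduler also weighs memory, affinity rules, and disruption budgets.
    """
    nodes: list[list[float]] = []
    for req in sorted(cpu_requests, reverse=True):  # largest workloads first
        for node in nodes:
            if sum(node) + req <= node_cpus:        # first node with room
                node.append(req)
                break
        else:
            nodes.append([req])                     # no room anywhere: add a node
    return nodes

# Hypothetical pod CPU requests packed onto 16-vCPU nodes.
requests = [8, 8, 6, 4, 4, 2, 2, 2, 1, 1]
packed = bin_pack(requests, node_cpus=16)
print(len(packed))                  # -> 3 nodes actually needed
print([sum(n) for n in packed])     # -> [16, 16, 6] CPUs used per node
```

Packing the largest requests first leaves fewer mostly-empty nodes, which is what lets the autoscaler retire stranded capacity instead of paying for it.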
Spot Instance automation
Cast eliminates the complexity of handling Spot Instance interruptions via an automatic fallback mechanism that moves workloads to on-demand instances when needed – all the while maintaining optimal cost and high service availability.
The image below shows the impact of Cast on a small part of Foretellix’s infrastructure.
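Conceptually, the fallback behaves like the Python sketch below. The request_spot_capacity and request_on_demand_capacity helpers are hypothetical stand-ins for a cloud provider's capacity APIs, so treat this as a simplified model rather than Cast's implementation.

```python
import random

class CapacityUnavailable(Exception):
    """Raised when a capacity pool cannot fulfil a request."""

# Hypothetical stand-ins for a cloud provider's capacity APIs.
def request_spot_capacity(instance_type: str) -> str:
    if random.random() < 0.3:                    # simulate an exhausted Spot pool
        raise CapacityUnavailable(instance_type)
    return f"spot:{instance_type}"

def request_on_demand_capacity(instance_type: str) -> str:
    return f"on-demand:{instance_type}"

def provision_node(instance_type: str) -> str:
    """Prefer cheaper Spot capacity; fall back to on-demand when Spot runs dry."""
    try:
        return request_spot_capacity(instance_type)
    except CapacityUnavailable:
        # Keep workloads running instead of queuing while the Spot pool recovers.
        return request_on_demand_capacity(instance_type)

print(provision_node("c5.4xlarge"))              # e.g. "spot:c5.4xlarge"
```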
We implemented Cast AI and managed to cut at least 30% of our compute costs, which was huge for us. Not all of our compute costs come from Kubernetes, so the reduction is even higher if we focus only on Kubernetes.
This was a direct win for me, as it allowed me to hire more engineers with the savings.
Ron Grosberg, VP, Research & Development at Foretellix
Running hyperscale simulations in the cloud
How does your infrastructure support the technical requirements of your business?
At Foretellix, we help companies develop autonomous vehicle functions and deploy them faster and more safely. We do this by creating scenarios at hyperscale, combining the analysis of driven miles, where customers record real-world drives, with KPIs extracted from millions of scenarios we automatically generate from abstract models.
Our technology requires running a huge number of workloads in the cloud at any given moment. That is where our cloud requirements come from, and as a result, both we and our customers use a lot of cloud resources.
When did cloud costs become an issue for Foretellix?
Our solution involves running millions of simulations at hyperscale. Each simulation requires the autonomous vehicle software, a simulator, and our software to run in the cloud – sometimes using many GPUs.
Like our customers, we also run many simulations internally as part of our CI testing. This comes with a high price tag, so we have a Kubernetes-based solution that orchestrates cloud workloads, and we used Spot Instances to reduce costs.
But as our team and solution continued to grow, we were using more and more resources. We noticed that costs were increasing and felt we weren’t running things optimally, so we wanted to find something that would benefit both us and our customers.
Beyond cost, we also wanted to increase energy efficiency to make the system more sustainable and environmentally friendly, using the same infrastructure to do more. This approach will benefit everyone and contribute to a greener, more efficient future.
Finding the right solution
Have you considered an open-source solution for optimizing cloud costs?
I think the natural approach is to always start with open-source solutions, as they help cut costs without adding new expenses. But by the time we looked into Cast AI, we had already done everything we could with open source.
Cost optimization isn’t really our main focus. Our primary goal is to help customers who develop autonomous vehicles.
I don’t want our company to become one that focuses on cloud resource utilization – just like car companies don’t want to build simulation tools themselves. I’d rather pay for something that does the job well so we don’t have to divert our developers to work on something that’s not part of our core competencies or technology.
Where did you start looking for a solution?
We experienced a bit of a perfect storm. One of our DevOps engineers noticed that we were spending a lot of money, and he had a gut feeling that we could reduce costs without reducing the workload.
At the same time, we were working with a company called Develeap on tailoring a DevOps course for our employees. As part of that process, they reviewed our infrastructure to assess where we were and to tailor the course accordingly.
They were the ones who suggested we try Cast AI. We decided to do a head-to-head comparison between Cast and a competitor, and Cast was the clear winner, so it was a no-brainer for us to proceed.
What was onboarding Cast like?
I have to say, this is where people might think I’m being paid to say this, but I was genuinely impressed by the onboarding process. As I mentioned, it takes time to convince us to fully commit to a third-party solution, so we began the onboarding with a simulation of Cast.
I was really impressed by how quickly it worked – it took maybe a day or two to get everything up and running. The ability to simulate exactly how the process would work when connecting to Cast was incredibly helpful because it gave us a clear view of how much we could save using Cast. And, by the way, the estimates were quite accurate.
Before we started working with Cast, I didn’t realize how easy it would be to see potential savings with almost zero effort on our part, and the results were spot on.
This gave us the confidence that Cast could really benefit us. After completing the simulation, it took another week or two to get everything running in production. Once live, we saw that the initial calculations were accurate and experienced a significant cost reduction. It was a great outcome.
Saving 30% on compute costs
What level of cost savings did you achieve thanks to Cast?
We implemented everything and managed to cut at least 30% of our compute costs, which was huge for us. Not all of our compute costs come from Kubernetes, so the reduction is even higher if we focus only on Kubernetes. This was a direct win for me, as it allowed me to hire more engineers with the savings.
Since we’re very data-driven, it was easy for us to see the potential savings without sacrificing performance. We were measuring everything, and when we introduced Cast AI, it was simple to look at the KPIs and see that we were performing at least as well as before, but with a massive reduction in costs. That was a key factor in our decision to move forward with Cast.
It all happened pretty quickly. It took about two or three weeks to fine-tune things – it’s never perfect right out of the box – but once we did, everything just worked. I remember we were in the middle of a workshop with the managers, and one of my engineers managed to set this up as a side project during the workshop. It was really low effort, and the Cast AI team was very helpful and engaged throughout, which made the setup process fast and smooth.
Could you share more about your use of Spot Instances?
We were always at the forefront of using Spot Instances – people were often impressed by how much we relied on them. But in the past, we also had Reserved Instances because some types of compute resources were harder to get, so we had to manage them carefully.
When we switched to Cast AI, we saw the feature that automatically reverts to Reserved Instances when no Spot Instances are available. We were actually able to get rid of all the Reserved Instances we had kept and just use Cast as is. We now know that whenever there are no Spot Instances available, it will automatically switch over without any effort on our part. It’s a no-brainer for us.
This was especially helpful at the end of last year when there’s typically a spike in demand for Spot Instances. Cast AI automatically switched to Reserved Instances on demand, without us having to do anything, which was a great benefit. We no longer had to move workloads from Spot Instances to Reserved Instances manually in preparation for a Spot drought.
Is resource autoscaling an impactful feature for your infrastructure?
We have our own job orchestration system. We operate at a high scale almost constantly, but our workload is very spiky. Unlike others with long-running workloads, ours typically take just a few minutes, but there are many of them. We need to either recycle the same machines to run more workloads or spin up additional machines.
Since we use this as part of our CI workflow, demand fluctuates depending on how many developers push code at the same time. For us, it’s crucial to provide feedback as quickly as possible, so when multiple developers push code simultaneously, we need to scale up and run many jobs in parallel to ensure fast feedback.
At the same time, during off-hours or at night – although we have employees globally, the majority are in certain regions – we scale down quickly. This cost-saving strategy ensures we’re not keeping machines running longer than needed while also making sure some machines are always ready so developers don’t have to wait too long when they need to run jobs.
Trusting automation to do the job
What was your approach to delegating resource management to an automation solution?
As I mentioned, we’re not easily convinced about giving up control. Still, we found that almost everything we asked for or needed to tweak was either already available or included in the next version. That was reassuring, and we’ve seen that Cast continues to progress. In fact, some features now give us control over things we didn’t even know we wanted before. We’ve been really happy with how things are evolving.
When we had to choose between Cast and a competitor, we found that Cast was always very engaged with us. Even a year later, when competitors approached us again to reconsider, we were satisfied enough to say, “I don’t think there’s a need.” So yes, we’ve been very happy with the level of control we’ve gained.
What was the support like throughout the process?
Cast has always been very responsive. We have a Slack channel where I regularly suggest ideas or ask for improvements and always get fast responses.
What I really appreciate about working with Cast is their hunger to constantly improve. Even if I don’t get an immediate answer, they get back to me after reviewing the data, showing me data points that explain whether I’m right or wrong and how we can optimize things further.
At one point, I felt things weren’t as optimized as they could be, but we quickly worked together to improve, and we were able to reduce costs again. Now, we’re at an all-time low in terms of costs – actually lower than when we started – and the company has grown significantly. It’s impressive to see how this helps us run more tests and improve the quality of our tools without increasing time or cost. It’s been really great.