ALLEN Digital dramatically increased GPU utilization, saving 71%

Company

ALLEN Digital is reinventing learning with a technology-first approach, powered by top engineering talent and its partnership with Bodhi Tree Systems. The company addresses two core challenges in education – the need for holistic learning and the limitations of one-size-fits-all classrooms – by developing an AI-driven platform that adapts to each student’s unique needs. Its mission is to deliver personalized, end-to-end learning experiences that enhance outcomes and unlock the full potential of every learner in the digital era.

Challenge

ALLEN Digital’s AI-powered platform relied on multiple machine learning models deployed on Amazon SageMaker. However, as the platform scaled, costs became a concern – GPU instances were underutilized, yet the team paid for full capacity even during idle periods.

This prompted a search for alternatives to SageMaker, but each option compromised on latency or performance. Already familiar with Cast AI through its Kubernetes optimization initiatives, the team discovered another Cast AI offering, AI Enabler, whose GPU time-slicing capability enables multiple models to run on the same instance while maintaining performance.

Solution

ALLEN Digital deployed several open-source and custom-built models using AI Enabler. The solution’s GPU time-slicing capability enabled the team to run multiple models on the same GPU instance, maximizing utilization and reducing costs with no impact on performance.

By letting ALLEN Digital run production workloads with a 50/50 split between on-demand and Spot Instances, AI Enabler ensures high availability for production while capturing significant savings. Node bin-packing further optimizes resource allocation by automatically selecting instance types and packing workloads onto fewer, right-sized nodes.

Results

  • Dramatic increase in GPU utilization
  • 71% cost savings compared to Amazon SageMaker

If your models are underutilized, or if you’re trying to achieve higher utilization and fully leverage GPU capacity while reducing costs, I think Cast AI Enabler is a great solution.

Karthik Bhat, DevOps Engineer 2 at ALLEN Digital

Running a well-optimized AI infrastructure

How does ALLEN Digital use AI and LLMs to drive customer value?

ALLEN Digital is an e-learning platform that provides educational solutions to students. As a tech-focused digital platform, we primarily use AI to resolve student doubts, responding to the questions they submit as images or text on our platform.

What prompted you to look for alternatives to Amazon SageMaker? What were your main pain points?

We initially used SageMaker because it provided a straightforward platform for deploying our models, and since we were already on AWS, it was our go-to option. It served our use case and helped us achieve what we wanted, so that was the main reason we went with it. However, we found that SageMaker was becoming inefficient. 

We weren’t able to fully utilize the underlying machine capacity to deploy all our models – most machines were underutilized, yet we were still paying for them even when they weren’t in use. 

We explored several other infra solutions on the market, searching for ways to reduce costs while still serving users with the same latencies and performance levels. Eventually, we discovered Cast AI’s Enabler and learned that we could host multiple models on the same machine, utilizing resources more efficiently while saving costs. That’s why we decided to go with Cast AI.

Since we were already using Cast AI for our EKS cluster optimization and had seen success with it, we were familiar with the platform. When AI Enabler launched around that time, it became a natural option for us – especially since the POCs with the other alternatives didn’t go well.

How did AI Enabler help solve your challenges around GPU utilization?

The ability to host multiple models within the same GPU instance was a major reason we decided to go with Cast AI Enabler. 
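The case study doesn’t show ALLEN’s actual configuration, but as a rough sketch of how GPU time slicing is typically exposed on Kubernetes, the NVIDIA device plugin can be configured to advertise each physical GPU as several schedulable replicas. The replica count below is an illustrative assumption, not ALLEN’s setting, and Cast AI may manage this configuration automatically:

```yaml
# Hypothetical sketch: NVIDIA device plugin time-slicing config.
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu
# replicas, so up to 4 model pods can share one card.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

Each model pod still requests `nvidia.com/gpu: 1` as usual; with time slicing enabled, four such requests can land on a single physical GPU instead of each claiming its own card.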

With SageMaker, we would upload or deploy the code for our custom models. We had around four custom models built on top of base models that we deployed directly, along with a few public models that we could also deploy without issue.

During our tests with X and Y, we achieved cost savings, but at the expense of latency and performance. Our goal was to maintain the same level of performance and latency while reducing costs, which is why we ultimately dropped those options. Additionally, since we were already on the AWS platform, we wanted to continue with that ecosystem.

Integrating AI Enabler

What was the onboarding and integration process like? 

The onboarding process went very smoothly. Since we had already enabled Cast AI’s EKS components – including the workflow, autoscaler, and other components – in our staging and production clusters, setting up the node templates and configurations was straightforward.

Installing the proxy component in our cluster was also seamless. The onboarding script provided through the console made it essentially a single-click process, and all components were installed while automatically checking for required dependencies.

Before onboarding our models, we were given context on how everything works, including GPU time slicing. Prior to creating the node templates and configurations, the team gave us a demo that walked us through how AI Enabler operates, what performance to expect, and how the optimizer proxy component helps achieve lower latencies. The demo was really helpful, and combined with the smooth onboarding process, it was a great experience overall.

How was the support throughout?

We had concerns about GPU time slicing, particularly with open-source models such as LlamaGuard and BGE-M3. Initially, BGE-M3 encountered issues, and the team needed to apply optimizations on the proxy side.

The Cast AI team released the fix within two days, and after that, the latency dropped significantly – to a level much lower than what we had with SageMaker. That was a great optimization to see.

With LlamaGuard, the Cast AI team provided extremely helpful support in properly configuring and segregating the workloads. Their knowledge about these models, time slicing, and the various configurations made a real difference.

At every step, they provided excellent support, and we successfully onboarded all our models within a short timeframe. It was a great experience overall – no complaints, and it was a pleasure to have such a team working with us.

How did you integrate AI Enabler with your existing GitOps/ArgoCD workflow?

We onboarded a total of seven models – three open-source and four custom-built. Since we already use a GitOps-based approach with Argo CD, the first step was to install the AI Enabler proxy component, which we accomplished through our existing Argo CD workflow.

For the three open-source models – BGE-M3, LlamaGuard, and Multilingual E5 Large – we onboarded them via the console. The other four custom models were deployed using our standard process for microservices. We built our Docker images and, via Helm charts synced through Argo CD, deployed them to the cluster using the node templates and configurations we had created.
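The interview doesn’t include ALLEN’s manifests, but a GitOps flow like the one described is commonly expressed as an Argo CD `Application` that syncs a model’s Helm chart into the cluster. The repo URL, chart path, and names below are illustrative assumptions, not ALLEN’s actual setup:

```yaml
# Hypothetical sketch of an Argo CD Application syncing one custom
# model's Helm chart; repoURL, path, and names are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: doubt-solver-model
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/allen/ml-deployments.git
    path: charts/doubt-solver
    targetRevision: main
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: models
  syncPolicy:
    automated:
      prune: true     # remove resources deleted from Git
      selfHeal: true  # revert manual drift in the cluster
```

With `automated` sync enabled, merging a chart change to `main` is enough to roll the new image out to the cluster.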

Currently, we’re using two different node templates. One is dedicated to LlamaGuard since it’s GPU-heavy, while our custom models don’t require as much GPU capacity, so we consolidated those onto a separate template.
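Pinning a workload to a specific node template is typically done with a `nodeSelector` on the pod template. Cast AI’s docs describe a node-template label for this purpose; treat the label key and the template name below as assumptions rather than ALLEN’s configuration. This is a fragment of a Deployment’s pod spec, not a complete manifest:

```yaml
# Hypothetical fragment: pinning the GPU-heavy LlamaGuard pods to a
# dedicated node template. Label key and template name are assumptions.
spec:
  nodeSelector:
    scheduling.cast.ai/node-template: gpu-llamaguard
  containers:
    - name: llamaguard
      resources:
        limits:
          nvidia.com/gpu: 1  # one time-sliced GPU replica
```

The lighter custom models would carry a different template selector, letting the autoscaler provision smaller, cheaper nodes for them.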

To monitor GPU utilization, we installed the GPU metrics tooling. The documentation was readily available in the Cast AI docs, and we were able to get that information into our console as well.

Initially, we deployed everything on our AWS EKS cluster. Later, since we had some credits remaining in our GCP account, we migrated to a GKE cluster. The Cast AI team provided outstanding support during that transition, and we were able to migrate from EKS to GKE within a short timeframe while maintaining the same performance levels.

Overall, the DevOps process was pretty smooth. There were a few minor hiccups here and there, but we resolved them quickly.

Reducing infrastructure costs by 70%+

What cost savings did you achieve compared to SageMaker, and how quickly did you see these benefits materialize? 

As I mentioned earlier, we initially wanted to deploy on our existing AWS EKS clusters. When we first deployed the models, we started with a basic deployment to see how everything worked. 

Right away, we saw around 20% savings just by using the GPU time-sharing configurations.

Once everything was stable – no crash reports or latency issues – we gradually started consolidating models onto the same instance with GPU time slicing enabled. With that configuration, we achieved savings of between 30% and 40% compared to SageMaker.

We then examined the utilization metrics for those models and noticed that we had lower CPU and memory utilization than expected. By reducing some of those resources, we were able to launch smaller nodes. That pushed us past 40% savings compared to SageMaker.
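The rightsizing step described above usually comes down to lowering the CPU and memory requests in the pod spec so the autoscaler can choose smaller instance types. The numbers below are purely illustrative, not ALLEN’s actual values:

```yaml
# Illustrative only: tightening requests after observing real
# utilization, so bin-packing can place pods on smaller nodes.
resources:
  requests:
    cpu: "1"      # reduced from an over-provisioned figure, e.g. 4
    memory: 4Gi   # reduced from e.g. 16Gi
  limits:
    nvidia.com/gpu: 1
```

Since the scheduler reserves capacity based on requests, shrinking them is what unlocks the smaller-node savings, provided observed usage stays safely below the new values.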

This was the optimization process we followed from start to finish. As of now, I’d say we’ve achieved more than 70% savings compared to what we were spending on SageMaker.

Were there any concerns about reliability when using Spot Instances? How did you address them?

In our staging environment, we use Spot Instances exclusively since availability isn’t a major concern. If something goes wrong with Spot, we can investigate the issue, spin up on-demand instances, and bring the models back up.

However, for production workloads, we were more focused on maintaining high availability. Cast AI provides a component – I believe it’s called the pod mutator – that we enabled in our cluster. This allows us to split our workloads evenly between on-demand and Spot Instances.

By keeping 50% of the pods on on-demand instances, we ensure they’re always available, while the other 50% runs on Spot Instances for cost savings. This is the approach we took to deploy all seven models, and we’ve achieved meaningful cost savings as a result.
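Cast AI’s pod mutator automates this split, but as a rough manual approximation of the same idea, one could run two Deployments of the same model and steer half the replicas to Spot nodes with a nodeSelector. The label key follows Cast AI’s documented Spot-node label, but treat it – and all names here – as assumptions for illustration:

```yaml
# Hedged sketch: manually approximating a 50/50 spot split with two
# Deployments of the same model. Cast AI's pod mutator automates this;
# label keys, names, and the image are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bge-m3-spot
spec:
  replicas: 2  # half the fleet on Spot capacity
  selector:
    matchLabels: {app: bge-m3, tier: spot}
  template:
    metadata:
      labels: {app: bge-m3, tier: spot}
    spec:
      nodeSelector:
        scheduling.cast.ai/spot: "true"  # schedule onto Spot nodes
      containers:
        - name: bge-m3
          image: registry.example.com/bge-m3:latest
---
# A twin Deployment (e.g. "bge-m3-ondemand") with the same spec minus
# the spot nodeSelector keeps the other half on on-demand capacity.
```

A shared `app: bge-m3` label on both Deployments lets a single Service load-balance across the Spot and on-demand halves.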

We’re currently running our microservices with this configuration as well, resulting in significant cost savings. Since we’re on AWS, we’ve been able to achieve 50% cost savings for workloads running on Spot Instances. This was a major cost impact we had already seen with our existing EKS clusters, and we wanted to apply the same approach to our models.

AI Enabler vs. Amazon SageMaker: GPU utilization, latency, and reliability

How did latency and GPU utilization compare between SageMaker and AI Enabler?

For the three open-source models – BGE-M3, LlamaGuard, and Multilingual E5 Large – the latency was either on par with or lower than with SageMaker.

We had a dashboard tracking the exact numbers, though I don’t have the specific figures on hand right now. But overall, we didn’t observe latency being significantly higher than SageMaker – it was either comparable or better. That was one of the positive outcomes we saw with these model deployments on Cast AI.

The performance was solid with AI Enabler, even with GPU time slicing enabled on those instances.

What are the three most impactful features or capabilities of Cast AI for your use case?

In terms of key features, there were a few that made a significant impact: 

  • The GPU time-sharing configuration.
  • The Spot split capability, which allowed us to easily achieve that 50/50 split between on-demand and Spot Instances – that alone saved us a considerable amount.
  • Node bin-packing, which Cast AI handles very well. We saw additional savings from that, along with better instance selection when deploying these models.

These were the major areas where we saw real impact, and they were all very helpful.

Aside from what you’re already doing, do you see any other opportunities to extend your AI optimization with Cast AI?

For these models, we’re still exploring options, but at this point, I don’t think we need any immediate optimizations. We’ve already bin-packed all the models, and all the pods and nodes are running with minimal configurations. 

Everything has been stable and reliable so far. Going forward, we may look at further optimizations based on usage patterns, but as of now, things are in a good place.

What advice would you give to other companies considering moving from SageMaker to a self-hosted solution?

AI Enabler is a really great product. Compared to SageMaker, we were able to achieve GPU time slicing, which wasn’t something we could easily do before.

If your models are underutilized, or if you’re trying to achieve higher utilization and fully leverage GPU capacity while reducing costs, I think Cast AI Enabler is a great solution.

It allows you to achieve time-slicing configurations for your models. Since most models can be containerized now, this is definitely worth exploring. I hope other customers will try it out and see significant savings on their bills as well.


Company size: 251-500 · Industry: E-learning · Location: India · Platform: EKS
