Why Cast AI Is Best for Running AI/LLM Workloads in Kubernetes

AI and LLM workloads demand powerful infrastructure. Cast AI automates GPU autoscaling, sharing, and cost control to help ML teams scale efficiently in Kubernetes.

Laurent Gil

AI and large language models (LLMs) require substantial infrastructure. Running such workloads in Kubernetes offers flexibility and scalability, but without automation, it can quickly become a resource management nightmare. 

That’s where Cast AI comes in. Purpose-built to optimize cloud-native environments, Cast automates everything from GPU autoscaling to choosing the best LLM for the job, making it the ideal platform for deploying and scaling AI workloads efficiently. 

Whether you’re fine-tuning models or serving inference at scale, Cast ensures your clusters are always right-sized, high-performing, and cost-effective. Let’s dive into Cast AI features that are game-changers for ML teams.

GPU autoscaling and bin-packing

Managed Kubernetes solutions from major cloud providers such as AWS, Google Cloud Platform, and Azure typically provide autoscaling for GPU node pools. However, GPU autoscaling in Kubernetes remains hard to manage: node pools require manual configuration, and GPU nodes can stay in place long after the workloads that needed them are gone, driving up cluster costs.

The Cast AI autoscaling and bin-packing engine provisions GPU instances on demand and scales them down as needed, utilizing Spot Instances and their pricing benefits to further reduce expenses.

The Cast AI autoscaler simplifies the management of GPU workloads while keeping spend in check: smart bin-packing reads each workload’s resource requirements and selects the instances that fit them best.
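To make the trigger concrete, here is a minimal sketch of a GPU workload submitted with the official Kubernetes Python client. The image, namespace, and spot toleration key are illustrative assumptions, not Cast AI requirements; what matters is the GPU request, which is how a pending pod signals the autoscaler that a GPU node is needed.

```python
from kubernetes import client, config

config.load_kube_config()

# A pod requesting one NVIDIA GPU. If no existing node can satisfy the request,
# the pod stays Pending, which is the signal a GPU-aware autoscaler reacts to
# by provisioning a suitable (ideally spot) instance.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-finetune", labels={"app": "llm-finetune"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="ghcr.io/example/llm-trainer:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
        # Assumed toleration marking the pod as safe to run on spot capacity;
        # the exact key depends on how your provisioner taints spot nodes.
        tolerations=[
            client.V1Toleration(
                key="scheduling.cast.ai/spot", operator="Exists", effect="NoSchedule"
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml", body=pod)  # "ml" is a placeholder namespace
```

Accurate requests are also what makes bin-packing effective: the scheduler can only consolidate workloads onto fewer, cheaper instances if it knows how much CPU, memory, and GPU each one actually needs.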

Real-life example: Fairgen

Fairgen combines advanced generative AI with established market research methods to provide sophisticated predictive models that enable accurate, scalable, and trustworthy insights into niche audiences. 

The company sought to optimize resource utilization across both the standard SaaS infrastructure behind its platform and its AI learning infrastructure, without compromising performance or user experience. Manual provisioning of cloud resources couldn’t scale, especially for AI workloads that required more powerful machines, larger clusters, and dynamically adjustable capacity.

Using Cast, Fairgen automated the management of its clusters, optimized resource allocation in real time, and reduced operational costs by 70%. The cost benefits were achieved without sacrificing performance, flexibility, or the user experience of Fairgen’s customers. 

“When I first heard about Cast AI, the main selling point that caught my attention was workload rightsizing. And yes, that’s exactly what Cast does—and it does it well. But honestly, the real power of Cast AI depends on how you use it and how you integrate it into your workflows.

For me, Cast AI is much more than just a rightsizing or cloud cost savings tool. It has become a key part of my diagnostic system and the core of my infrastructure management.

In the end, we saved around 70% on operations while still delivering the same experience to our clients.”

– Mati Konen, VP of Engineering at Fairgen

Learn more: Kubernetes GPU Autoscaling: How To Scale GPU Workloads With Cast AI

GPU sharing: GPU time-slicing and NVIDIA Multi-Instance GPU (MIG)

ML teams often face two related challenges: GPU capacity is hard to come by, and costs keep rising because the GPUs they do get aren’t fully utilized.

Two methods make GPU sharing possible: GPU time-slicing and Multi-Instance GPU (MIG).

GPU time-slicing allocates a single GPU’s processing capacity to multiple workloads by rapidly switching between them, so several processes can share the same GPU in bursts.
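As a rough illustration of what time-slicing configuration looks like, the sketch below creates the kind of ConfigMap the NVIDIA device plugin reads to advertise each physical GPU as several schedulable resources. The ConfigMap name and namespace are assumptions that depend on how the device plugin is deployed; with Cast AI, this setup is handled by the platform rather than by hand.

```python
from kubernetes import client, config

config.load_kube_config()

# NVIDIA device plugin time-slicing config: each physical GPU is advertised as
# four nvidia.com/gpu resources, so up to four pods can share one card in turns.
time_slicing = """\
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
"""

configmap = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(
        name="nvidia-device-plugin-config",  # assumed name; must match the plugin's config reference
        namespace="kube-system",
    ),
    data={"config.yaml": time_slicing},
)

client.CoreV1Api().create_namespaced_config_map(namespace="kube-system", body=configmap)
```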

Multi-Instance GPU (MIG) divides a physical GPU into several smaller, fully isolated instances, each with its own compute cores, memory, and cache. Several parallel tasks can thus run independently on a single GPU with predictable performance.
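For comparison, here is a minimal sketch of a pod that asks for a MIG slice rather than a whole GPU. The resource name assumes the NVIDIA GPU Operator’s mixed MIG strategy on an A100 partitioned into 1g.5gb instances; the image and namespace are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()

# Requesting a MIG slice: the pod gets an isolated 1g.5gb partition with its own
# compute cores and memory instead of time-shared access to the whole GPU.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="small-inference"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="server",
                image="ghcr.io/example/small-model-server:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/mig-1g.5gb": "1"},  # profile name depends on your MIG layout
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml", body=pod)  # "ml" is a placeholder namespace
```

In broad terms, time-slicing favors density for bursty, latency-tolerant jobs, while MIG favors isolation and predictable performance; which fits better depends on the workload.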

Cast provides an automated platform that simplifies the deployment and management of these GPU sharing methods. It integrates seamlessly with your autoscaler to ensure optimal resource allocation, and combining the two methods eliminates the need to choose between cost-efficiency and performance isolation. 

Cast enables teams to run more GPU workloads with fewer resources, resulting in significant cost savings without compromising the performance or reliability that your essential applications require.

Learn more: GPU Sharing in Kubernetes: How to Cut Costs and Boost GPU Utilization with Cast AI

AI Enabler

The number of open-source and commercial LLMs for generative AI is rapidly increasing. ML teams are often confused about which model best meets their requirements, and they don’t have the time to evaluate all available models. Furthermore, the cost of running these resource-intensive models presents a challenge for both large and small organizations.

Here’s why LLM costs are so hard to control:

  • Market complexity – Many organizations default to expensive, resource-heavy LLMs without exploring alternatives better suited to specific tasks, resulting in unnecessary spending and inefficiencies.
  • Poor cost visibility – MLOps and DevOps teams often lack tools to track real-time costs (e.g., compute, data, API usage), making it difficult to optimize and manage budgets effectively.
  • Cloud infrastructure challenges – The vast array of compute options, especially in Kubernetes environments, makes it hard to choose the best setup. Without automation or intelligent guidance, engineers may select inefficient configurations that increase costs and reduce performance.

Cast’s AI Enabler was designed to address these challenges. It dynamically sends requests to the most cost-effective LLM for each task, employing Cast AI’s Kubernetes infrastructure optimization capabilities.

With features such as a comprehensive cost monitoring dashboard, automatic selection of optimal LLMs (both OSS and commercial), and no additional configuration, AI Enabler significantly reduces costs and operational overhead, making it easier than ever for businesses to integrate AI into their applications at a fraction of the cost.
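Because the proxy sits between applications and the model providers, adopting it is usually a small change in calling code. The sketch below assumes the proxy exposes an OpenAI-compatible endpoint; the URL, API key, and model name are placeholders for illustration, not real values.

```python
from openai import OpenAI

# Point an existing OpenAI-compatible client at the routing proxy instead of a
# specific provider. base_url and api_key below are placeholders, not real values.
llm = OpenAI(
    base_url="https://llm-proxy.example.com/v1",
    api_key="YOUR_PROXY_API_KEY",
)

response = llm.chat.completions.create(
    # The model named here expresses a preference; a routing proxy may serve the
    # request with a cheaper model that still meets the task's quality bar.
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this support ticket in two sentences."}],
)

print(response.choices[0].message.content)
```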

How does the AI Enabler work?

  • Cost Monitoring for LLMs – The cost monitoring tool generates a comprehensive report comparing the cost of the default LLM with the savings customers could realize. A detailed dashboard collects data from multiple LLM providers, giving clear insight into spending patterns. Customers can also use the Cast AI playground to compare alternative LLMs and build benchmarks, helping them determine the configuration best suited to their requirements. 
  • Automated LLM Cost Optimization – The LLM proxy intelligently selects the most appropriate LLM model for user queries, ensuring that companies achieve the best performance at the lowest cost. This technique maximizes savings by selecting and implementing an optimized LLM with lower inference costs.
  • Running LLMs on Cost-Optimized Infrastructure – Hugging Face, the top open-source platform for AI developers, recognizes the importance of automation and leverages it to reduce the cost of deploying large language models (LLMs) on the cloud. Hugging Face and Cast AI have collaborated to automatically run Hugging Face customer LLMs on Kubernetes clusters optimized by Cast AI’s automation technology.

Learn more: LLM Cost Optimization: How To Run Gen AI Apps Cost-Efficiently

Conclusion

AI workloads are dynamic, compute-intensive, and unforgiving of inefficiencies. Cast AI addresses these challenges directly by automating the orchestration of Kubernetes clusters with intelligence tailored to the demands of AI and LLM workloads. From GPU-aware scheduling to real-time autoscaling and cost optimization, Cast AI removes the friction from running cutting-edge models in production. 
