Why Cast AI Is Best for Running AI/LLM Workloads in Kubernetes

AI and LLM workloads demand powerful infrastructure. Cast AI automates GPU autoscaling, sharing, and cost control to help ML teams scale efficiently in Kubernetes.

Laurent Gil

AI and large language models (LLMs) require substantial infrastructure. Running such workloads in Kubernetes offers flexibility and scalability, but without automation, it can quickly become a resource management nightmare. 

That’s where Cast AI comes in. Purpose-built to optimize cloud-native environments, Cast automates everything from GPU autoscaling to choosing the best LLM for the job, making it the ideal platform for deploying and scaling AI workloads efficiently. 

Whether you’re fine-tuning models or serving inference at scale, Cast ensures your clusters are always right-sized, high-performing, and cost-effective. Let’s dive into Cast AI features that are game-changers for ML teams.

GPU autoscaling and bin-packing

Managed Kubernetes solutions from major cloud providers such as AWS, Google Cloud Platform, and Azure typically provide autoscaling for GPU node pools. However, GPU autoscaling in Kubernetes remains hard to manage: node pools require manual configuration, and GPU nodes can stay in place long after the workloads that needed them are gone, driving up cluster costs.

The Cast AI autoscaling and bin-packing engine provisions GPU instances on demand and scales them down as needed, utilizing Spot Instances and their pricing benefits to further reduce expenses.

The Cast AI autoscaler simplifies the management of GPU workloads while keeping spend in check: smart bin-packing reads each workload’s resource requirements and selects the instances that fit them best.
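To make the trigger concrete, here is a minimal sketch of a GPU workload submitted with the official Kubernetes Python client. The image, namespace, and spot toleration key are illustrative assumptions, not Cast AI requirements; what matters is the GPU request, which is how a pending pod signals the autoscaler that a GPU node is needed.

```python
from kubernetes import client, config

config.load_kube_config()

# A pod requesting one NVIDIA GPU. If no existing node can satisfy the request,
# the pod stays Pending, which is the signal a GPU-aware autoscaler reacts to
# by provisioning a suitable (ideally spot) instance.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-finetune", labels={"app": "llm-finetune"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="ghcr.io/example/llm-trainer:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
        # Assumed toleration marking the pod as safe to run on spot capacity;
        # the exact key depends on how your provisioner taints spot nodes.
        tolerations=[
            client.V1Toleration(
                key="scheduling.cast.ai/spot", operator="Exists", effect="NoSchedule"
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml", body=pod)  # "ml" is a placeholder namespace
```

Accurate requests are also what makes bin-packing effective: the scheduler can only consolidate workloads onto fewer, cheaper instances if it knows how much CPU, memory, and GPU each one actually needs.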

Real-life example: Fairgen

Fairgen combines advanced generative AI with established market research methods to provide sophisticated predictive models that enable accurate, scalable, and trustworthy insights into niche audiences. 

The company sought to optimize resource utilization across both the standard SaaS infrastructure behind its platform and its AI learning infrastructure, without compromising performance or user experience. Manual provisioning of cloud resources couldn’t scale, especially for AI workloads that required more powerful machines, larger clusters, and dynamically adjustable capacity.

Using Cast, Fairgen automated the management of its clusters, optimized resource allocation in real time, and reduced operational costs by 70%. The cost benefits were achieved without sacrificing performance, flexibility, or the user experience of Fairgen’s customers. 

“When I first heard about Cast AI, the main selling point that caught my attention was workload rightsizing. And yes, that’s exactly what Cast does—and it does it well. But honestly, the real power of Cast AI depends on how you use it and how you integrate it into your workflows.

For me, Cast AI is much more than just a rightsizing or cloud cost savings tool. It has become a key part of my diagnostic system and the core of my infrastructure management.

In the end, we saved around 70% on operations while still delivering the same experience to our clients.”

– Mati Konen, VP of Engineering at Fairgen

Learn more: Kubernetes GPU Autoscaling: How To Scale GPU Workloads With Cast AI

GPU sharing: GPU time-slicing and NVIDIA Multi-Instance GPU (MIG)

ML teams often face two related challenges: GPU capacity is hard to come by, and costs keep rising because the GPUs they do get aren’t fully utilized.

Two methods make GPU sharing possible: GPU time-slicing and Multi-Instance GPU (MIG).

GPU time-slicing allocates a single GPU’s processing capacity to multiple workloads by rapidly switching between them, so several processes can share the same GPU in bursts.
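As a rough illustration of what time-slicing configuration looks like, the sketch below creates the kind of ConfigMap the NVIDIA device plugin reads to advertise each physical GPU as several schedulable resources. The ConfigMap name and namespace are assumptions that depend on how the device plugin is deployed; with Cast AI, this setup is handled by the platform rather than by hand.

```python
from kubernetes import client, config

config.load_kube_config()

# NVIDIA device plugin time-slicing config: each physical GPU is advertised as
# four nvidia.com/gpu resources, so up to four pods can share one card in turns.
time_slicing = """\
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
"""

configmap = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(
        name="nvidia-device-plugin-config",  # assumed name; must match the plugin's config reference
        namespace="kube-system",
    ),
    data={"config.yaml": time_slicing},
)

client.CoreV1Api().create_namespaced_config_map(namespace="kube-system", body=configmap)
```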

Multi-Instance GPU (MIG) divides a physical GPU into several smaller, fully isolated instances, each with its own compute cores, memory, and cache. Several parallel tasks can thus run independently on a single GPU with predictable performance.
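For comparison, here is a minimal sketch of a pod that asks for a MIG slice rather than a whole GPU. The resource name assumes the NVIDIA GPU Operator’s mixed MIG strategy on an A100 partitioned into 1g.5gb instances; the image and namespace are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()

# Requesting a MIG slice: the pod gets an isolated 1g.5gb partition with its own
# compute cores and memory instead of time-shared access to the whole GPU.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="small-inference"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="server",
                image="ghcr.io/example/small-model-server:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/mig-1g.5gb": "1"},  # profile name depends on your MIG layout
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml", body=pod)  # "ml" is a placeholder namespace
```

In broad terms, time-slicing favors density for bursty, latency-tolerant jobs, while MIG favors isolation and predictable performance; which fits better depends on the workload.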

Cast provides an automated platform that simplifies the deployment and management of these GPU sharing methods. It integrates seamlessly with your autoscaler to ensure optimal resource allocation, and combining the two methods eliminates the need to choose between cost-efficiency and performance isolation. 

Cast enables teams to run more GPU workloads with fewer resources, resulting in significant cost savings without compromising the performance or reliability that your essential applications require.

Learn more: GPU Sharing in Kubernetes: How to Cut Costs and Boost GPU Utilization with Cast AI

AI Enabler

The number of open-source and commercial LLMs for generative AI is rapidly increasing. ML teams are often confused about which model best meets their requirements, and they don’t have the time to evaluate all available models. Furthermore, the cost of running these resource-intensive models presents a challenge for both large and small organizations.

Here’s why LLM costs are so hard to control:

  • Market complexity – Many organizations default to expensive, resource-heavy LLMs without exploring alternatives better suited to specific tasks, resulting in unnecessary spending and inefficiencies.
  • Poor cost visibility – MLOps and DevOps teams often lack tools to track real-time costs (e.g., compute, data, API usage), making it difficult to optimize and manage budgets effectively.
  • Cloud infrastructure challenges – The vast array of compute options, especially in Kubernetes environments, makes it hard to choose the best setup. Without automation or intelligent guidance, engineers may select inefficient configurations that increase costs and reduce performance.

Cast’s AI Enabler was designed to address these challenges. It dynamically sends requests to the most cost-effective LLM for each task, employing Cast AI’s Kubernetes infrastructure optimization capabilities.

With features such as a comprehensive cost monitoring dashboard, automatic selection of optimal LLMs (both OSS and commercial), and no additional configuration, AI Enabler significantly reduces costs and operational overhead, making it easier than ever for businesses to integrate AI into their applications at a fraction of the cost.
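Because the proxy sits between applications and the model providers, adopting it is usually a small change in calling code. The sketch below assumes the proxy exposes an OpenAI-compatible endpoint; the URL, API key, and model name are placeholders for illustration, not real values.

```python
from openai import OpenAI

# Point an existing OpenAI-compatible client at the routing proxy instead of a
# specific provider. base_url and api_key below are placeholders, not real values.
llm = OpenAI(
    base_url="https://llm-proxy.example.com/v1",
    api_key="YOUR_PROXY_API_KEY",
)

response = llm.chat.completions.create(
    # The model named here expresses a preference; a routing proxy may serve the
    # request with a cheaper model that still meets the task's quality bar.
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this support ticket in two sentences."}],
)

print(response.choices[0].message.content)
```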

How does the AI Enabler work?

  • Cost Monitoring for LLMs – The cost monitoring tool generates a comprehensive report comparing the cost of the default LLM with the savings customers could realize. A detailed dashboard collects data from multiple LLM providers, giving clear insight into spending patterns. Customers can also use the Cast AI playground to compare alternative LLMs and build benchmarks, helping them determine the configuration best suited to their requirements. 
  • Automated LLM Cost Optimization – The LLM proxy intelligently selects the most appropriate LLM model for user queries, ensuring that companies achieve the best performance at the lowest cost. This technique maximizes savings by selecting and implementing an optimized LLM with lower inference costs.
  • Running LLMs on Cost-Optimized Infrastructure – Hugging Face, the top open-source platform for AI developers, recognizes the importance of automation and leverages it to reduce the cost of deploying large language models (LLMs) on the cloud. Hugging Face and Cast AI have collaborated to automatically run Hugging Face customer LLMs on Kubernetes clusters optimized by Cast AI’s automation technology.

Learn more: LLM Cost Optimization: How To Run Gen AI Apps Cost-Efficiently

Conclusion

AI workloads are dynamic, compute-intensive, and unforgiving of inefficiencies. Cast AI addresses these challenges directly by automating the orchestration of Kubernetes clusters with intelligence tailored to the demands of AI and LLM workloads. From GPU-aware scheduling to real-time autoscaling and cost optimization, Cast AI removes the friction from running cutting-edge models in production. 
