GPU Sharing in Kubernetes: Cut Costs & Boost Utilization

Picture your data science team: full of new ideas, but constantly short on GPUs and facing rising costs because the GPUs it does manage to get sit mostly idle. In fact, on average only 5% of GPU capacity is actually used.

This is a common issue when running machine learning and AI workloads on Kubernetes; high-performance GPUs are essential but expensive, and sharing them efficiently is challenging.

This article dives into two proven ways to share GPUs – Multi-Instance GPU (MIG) and GPU time-slicing – showing how Cast AI helps you use both to make your Kubernetes clusters highly efficient and cost-effective.

High GPU utilization is achievable

One cluster in our dataset, with 136 H200s sustaining 49% GPU utilization, shows that high utilization isn't just theoretical. The fleet average is 5%, so the gap is roughly 10x. That gap comes almost entirely from technique, not hardware.

[Figure: graph showing a sustained high level of GPU utilization]

GPU sharing in Kubernetes: why it matters now more than ever

GPUs power everything from complex AI training and real-time inference to scientific simulations, graphics rendering, and high-performance computing workloads. As AI and other GPU-accelerated applications rapidly grow, so does GPU demand, leading to skyrocketing cloud costs.

Low GPU utilization compounds this problem. Many workloads are intermittent, small, or bursty, and dedicating an entire physical GPU to a single job wastes compute power and money.

Our customers often share stories such as:

Machine learning teams are constrained by budget and forced to share a limited number of GPUs
Data science groups running bursty or overnight training jobs are stuck behind long queues from scarce GPU availability

Luckily, there is a way out of this. Sharing a single physical GPU across multiple workloads instead of dedicating one GPU per job unlocks tremendous cost savings, improves resource efficiency, and adapts dynamically to fluctuating workloads.

Let’s examine the cost impact of poor GPU utilization to see why GPU sharing techniques can make a big difference.

The hidden costs of underutilized GPUs

Consider a common scenario: an ML team running 10 inference jobs, each requiring only a fraction of a powerful A100 GPU. Without efficient sharing, they might provision 10 separate A100 GPUs, leading to massive overspending.

By contrast:

GPU time-slicing can run these 10 jobs on a single A100 GPU, delivering up to 90% cost savings on GPU infrastructure while maintaining acceptable latency for light workloads.
For workloads requiring guaranteed performance, Multi-Instance GPU (MIG) partitions a single GPU into multiple isolated instances, enabling several applications to run simultaneously on hardware that would otherwise be underutilized by a single, less demanding task.

For large-scale deployments, this intelligent GPU sharing can result in monthly savings of tens of thousands of dollars.
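As a rough worked version of the scenario above, assume every A100 costs the same hourly price p (a placeholder, not a quoted rate) and that all 10 jobs fit on one time-sliced GPU. Because MIG allows at most seven instances per GPU, the same 10 jobs would need two GPUs if each job takes its own MIG instance:

```latex
\begin{aligned}
C_{\text{dedicated}} &= 10p \\
C_{\text{time-slicing}} &= 1p &&\Rightarrow\ \text{savings} = 1 - \tfrac{1p}{10p} = 90\% \\
C_{\text{MIG}} &= \lceil 10/7 \rceil\, p = 2p &&\Rightarrow\ \text{savings} = 1 - \tfrac{2p}{10p} = 80\%
\end{aligned}
```

These figures cover GPU cost only and ignore any slowdown from sharing, so treat them as upper bounds rather than expected results.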

Deep dive: understanding GPU time-slicing vs. Multi-Instance GPU (MIG)

Cast AI supports both GPU time-slicing and NVIDIA Multi-Instance GPU (MIG). Here’s how each works, and when to choose one or both.

What is GPU time-slicing?

GPU time-slicing shares a single GPU's processing power among multiple workloads by rapidly switching the GPU between them, giving each workload short slices of execution time. Multiple jobs can therefore use the same GPU, taking turns in bursts.

Why it’s great: If your workloads don’t need 100% GPU capacity all the time, time-slicing significantly increases utilization.

GPU time-slicing is an ideal approach for sharing GPUs among light inference workloads. NVIDIA's benchmarks demonstrated that GPU time-slicing can increase GPU utilization approximately threefold for light inference workloads without impacting latency or throughput.

Applying GPU time-slicing to small batch jobs can increase GPU utilization to 100%, though NVIDIA's benchmarks show that this benefit comes at the cost of longer execution times.

Trade-offs: Time-slicing offers flexibility and broader GPU compatibility but lacks resource isolation, which can cause unstable performance. Workloads share GPU memory without enforced limits, so one workload can crowd out another, and context switching adds overhead that may lengthen execution time.
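Outside of Cast AI, time-slicing is typically switched on through the NVIDIA Kubernetes device plugin's configuration. The fragment below is a minimal sketch of that config format, assuming the NVIDIA device plugin (or GPU Operator) manages the node. The replica count of 4 is an illustrative choice, not a recommendation from this article.

```yaml
# Sketch of an NVIDIA device plugin config that enables time-slicing.
# Assumption: the node runs the NVIDIA k8s-device-plugin; the value 4 is illustrative.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu   # the resource name pods already request
        replicas: 4            # each physical GPU is advertised as 4 schedulable GPUs
```

With a config like this, a node with one physical GPU reports four nvidia.com/gpu resources, so four pods that each request one GPU can share it. Nothing in the config limits how much GPU memory each pod uses, which is the isolation gap described above.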

What is NVIDIA’s Multi-Instance GPU (MIG)?

Multi-Instance GPU (MIG) partitions a physical GPU into multiple smaller, fully isolated instances, each with dedicated compute cores, memory, and cache. This means multiple, parallel workloads can run independently on a single GPU with guaranteed performance.

MIG allows running up to 7 workloads on a single physical GPU, reducing underutilization when a workload cannot fully saturate the resources of the entire GPU.

Why it’s great: This approach is perfect for mixed workloads where predictable performance and fault isolation are critical, such as serving multiple AI inference requests simultaneously without performance degradation. MIG is well-suited for a wide range of applications, including light and heavy inference workloads, small GPU batch jobs, GPU-intensive tasks, and many others.

Trade-offs: MIG provides strong hardware-level isolation and predictable performance by partitioning a GPU into dedicated instances. However, it works only on MIG-capable GPUs from the NVIDIA Ampere generation or newer, such as the A100 and H100. MIG also caps the number of instances per GPU at seven, which can constrain scalability compared to time-slicing, and it is less flexible when partitions need to be resized at runtime.

The best of both worlds: combining GPU time-slicing with MIG

You can maximize GPU utilization by applying GPU time-slicing within MIG instances, enabling multiple workloads to share each MIG partition. This hybrid approach balances MIG’s strong isolation with increased workload capacity from GPU time-slicing.
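Here is a hedged sketch of how this combination can be expressed in the NVIDIA device plugin's config format. It assumes the plugin's mixed MIG strategy, where each MIG profile is exposed under its own resource name, and an A100 already split into seven 1g.5gb instances. The replica count is illustrative, and this is not Cast AI's own configuration.

```yaml
# Sketch: time-slicing applied on top of MIG instances (NVIDIA device plugin config).
# Assumptions: mixed MIG strategy; GPU already partitioned into 1g.5gb instances.
version: v1
flags:
  migStrategy: mixed
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/mig-1g.5gb   # MIG profile exposed as its own resource
        replicas: 2                   # each MIG instance is shared by up to 2 pods
# Seven 1g.5gb instances x 2 replicas = 14 schedulable slots on one GPU.
```

Pods that share one MIG instance remain isolated from pods on other instances, but not from each other, so it makes sense to group workloads that tolerate each other's bursts.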

Unlocking GPU optimization in Kubernetes with Cast AI: a practical guide

Cast AI makes GPU sharing simple – and fully automated:

For time-slicing, simply enable it in the GPU-sharing section of the Node Template. No workload changes are needed. When GPU sharing is enabled, one GPU is treated as multiple shared GPUs.
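To illustrate the "no workload changes" point, here is a minimal sketch of an ordinary GPU Deployment. It assumes a Node Template with GPU sharing enabled. The name gpu-test-timeslice and the replica count are hypothetical, and the image and command are borrowed from the MIG example in this guide. Depending on how your GPU nodes are tainted, a toleration may also be needed.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test-timeslice   # hypothetical name
spec:
  replicas: 4                # illustrative; these pods may share one physical GPU
  selector:
    matchLabels:
      app: gpu-test-timeslice
  template:
    metadata:
      labels:
        app: gpu-test-timeslice
    spec:
      containers:
        - name: gpu-test-timeslice
          image: nvidia/cuda:11.0.3-base-ubi7
          command:
            - bash
            - -c
            - |
              /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
          resources:
            limits:
              nvidia.com/gpu: 1   # same request as on a dedicated GPU
```

The pod spec is identical to what you would write for a dedicated GPU. The sharing happens at the node level, where each physical GPU is exposed as several schedulable ones.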

For MIG, add a toleration and node selector to your pod specs requesting specific MIG partitions. Cast AI manages provisioning and scaling seamlessly; for example:

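apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test-mig
spec:
  # Seven replicas: MIG allows up to seven instances per GPU
  replicas: 7
  selector:
    matchLabels:
      app: gpu-test-mig
  template:
    metadata:
      labels:
        app: gpu-test-mig
    spec:
      # Tolerate the nvidia.com/gpu.mig taint so the pods can be scheduled onto MIG nodes
      tolerations:
        - key: "nvidia.com/gpu.mig"
          operator: "Exists"
          effect: "NoSchedule"
      # Select nodes that offer the 1g.5gb MIG partition
      nodeSelector:
        nvidia.com/gpu.mig-partition-1g.5gb: "true"
      containers:
        - name: gpu-test-mig
          image: nvidia/cuda:11.0.3-base-ubi7
          # List the GPU devices visible to the container, then idle for 5 minutes
          command:
            - bash
            - -c
            - |
              /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
          resources:
            limits:
              # Each pod requests one GPU resource
              nvidia.com/gpu: 1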

GPU time-slicing & MIG with Cast AI

What happens if you enable GPU time-slicing in the Node Template and pods request MIG partitions? You can schedule even more GPU workloads since multiple containers or pods share each MIG partition using time-slicing inside it.

GPU sharing in real life: ALLEN Digital

ALLEN Digital was running 7 models on SageMaker: 3 open-source and 4 custom. GPU instances ran continuously but served an intermittent load.

After moving to Kubernetes with GPU time-slicing enabled, a 50/50 on-demand/Spot split, and node bin-packing, utilization improved dramatically. This led to 20% savings immediately from time-slicing, 30–40% after consolidating models onto shared instances, and more than 70% total savings versus SageMaker after rightsizing CPU and memory alongside the GPU changes. Latency held throughout.

Conclusion

In the dynamic world of AI, optimizing GPU costs in Kubernetes is a necessity. By applying advanced GPU sharing techniques like time-slicing and MIG, teams can achieve unprecedented levels of utilization and efficiency.

Cast stands at the forefront of this revolution, offering an automated platform that simplifies the deployment and management of these complex GPU strategies and integrates seamlessly with your autoscaler to ensure optimal resource allocation.

With Cast, you don’t have to choose between cost efficiency and performance isolation. Our platform empowers teams to run more GPU workloads on fewer resources, unlocking significant cost savings without sacrificing the performance or reliability your critical applications demand.

Explore how Cast can help you automatically optimize your Kubernetes GPU infrastructure with intelligent sharing techniques – check out the GPU Optimization Feature Page to learn more or request a demo to see Cast in action.
