Think about your data science team: full of new ideas but constantly facing challenges like insufficient GPU resources and rising costs because the GPUs you manage to get aren’t fully utilized. In fact, only 5% of GPU capacity ends up utilized, on average.
This is a common issue when running machine learning and AI workloads on Kubernetes; high-performance GPUs are essential but expensive, and sharing them efficiently is challenging.
This article dives into two proven ways to share GPUs – Multi-Instance GPU (MIG) and GPU time-slicing – showing how Cast AI helps you use both to make your Kubernetes clusters highly efficient and cost-effective.
High GPU utilization is achievable
One cluster in our dataset – 136 H200s sustaining 49% GPU utilization – shows the ceiling isn’t theoretical. The fleet average is 5%. The gap is 10x. That gap is almost entirely technique, not hardware.
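To put those two utilization figures in cost terms, here is a rough sketch of how many GPU-hours you pay for per GPU-hour of useful work. It assumes cost scales linearly with provisioned GPU time and that "utilization" means the share of that time doing useful work; the function name is illustrative only.

```python
def cost_multiplier(utilization_pct: float) -> float:
    """GPU-hours paid for per GPU-hour of useful work (assumes linear pricing)."""
    if not 0 < utilization_pct <= 100:
        raise ValueError("utilization must be in (0, 100]")
    return 100 / utilization_pct


fleet_average = cost_multiplier(5)   # 5% fleet average quoted above
best_cluster = cost_multiplier(49)   # 49% on the H200 cluster quoted above

print(f"fleet average: {fleet_average:.1f} paid GPU-hours per useful GPU-hour")
print(f"best cluster:  {best_cluster:.1f} paid GPU-hours per useful GPU-hour")
print(f"gap:           {fleet_average / best_cluster:.1f}x")
```

Under these assumptions, the fleet average pays for about 20 GPU-hours per useful hour, while the high-utilization cluster pays for about 2.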

GPU sharing in Kubernetes: why it matters now more than ever
GPUs power everything from complex AI training and real-time inference to scientific simulations, graphics rendering, and high-performance computing workloads. As AI and other GPU-accelerated applications rapidly grow, so does GPU demand – leading to skyrocketing cloud costs.
Low GPU utilization compounds this problem. Many workloads are intermittent, small, or bursty, and dedicating an entire physical GPU to a single job wastes compute power and money.
Our customers often share stories such as:
- Machine learning teams are constrained by budget and forced to share a limited number of GPUs
- Data science groups running bursty or overnight training jobs are stuck in long queues because GPUs are scarce
Luckily, there is a way out of this. Sharing a single physical GPU across multiple workloads instead of dedicating one GPU per job unlocks tremendous cost savings, improves resource efficiency, and adapts dynamically to fluctuating workloads.
Let’s examine the cost impact of poor GPU utilization to see why GPU sharing techniques can make a big difference.
The hidden costs of underutilized GPUs
Consider a common scenario: an ML team running 10 inference jobs, each requiring only a fraction of a powerful A100 GPU. Without efficient sharing, they might provision 10 separate A100 GPUs, leading to massive overspending.
By contrast:
- GPU time-slicing can run these 10 jobs on a single A100 GPU, delivering up to 90% cost savings on GPU infrastructure while maintaining acceptable latency for light workloads.
- For workloads requiring guaranteed performance, Multi-Instance GPU (MIG) partitions a single GPU into multiple isolated instances, enabling several applications to run simultaneously on hardware that would otherwise be underutilized by a single, less demanding task.
For large-scale deployments, this intelligent GPU sharing can result in monthly savings of tens of thousands of dollars.
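The arithmetic behind the 10-job scenario can be sketched as follows. The hourly price and hours-per-month values are hypothetical placeholders, not quoted cloud prices, and the sketch assumes every job fits within its share of the GPU.

```python
import math

HOURLY_PRICE_USD = 3.00   # hypothetical A100 price, for illustration only
HOURS_PER_MONTH = 730     # assumed average month length in hours


def gpus_needed(jobs: int, jobs_per_gpu: int) -> int:
    """Physical GPUs required when each GPU can host `jobs_per_gpu` jobs."""
    return math.ceil(jobs / jobs_per_gpu)


def savings_pct(jobs: int, jobs_per_gpu: int) -> float:
    """Percentage of GPUs saved versus one dedicated GPU per job."""
    dedicated = jobs
    shared = gpus_needed(jobs, jobs_per_gpu)
    return 100 * (dedicated - shared) / dedicated


scenarios = [
    ("dedicated (1 job per GPU)", 1),
    ("time-slicing (10 jobs per GPU)", 10),
    ("MIG (up to 7 instances per GPU)", 7),
]
for label, per_gpu in scenarios:
    gpus = gpus_needed(10, per_gpu)
    monthly = gpus * HOURLY_PRICE_USD * HOURS_PER_MONTH
    print(f"{label}: {gpus} GPU(s), ${monthly:,.0f}/month, "
          f"{savings_pct(10, per_gpu):.0f}% saved")
```

With 10 jobs, time-slicing onto one GPU gives the 90% figure above, while MIG at seven instances per GPU needs two GPUs and saves 80%.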
Deep dive: understanding GPU time-slicing vs. Multi-Instance GPU (MIG)
Cast AI supports both GPU time-slicing and NVIDIA Multi-Instance GPU (MIG). Here’s how each works, and when to choose one or both.
What is GPU time-slicing?
GPU time-slicing divides a single GPU’s processing time among multiple workloads by rapidly switching the GPU between them in short time slices. Multiple jobs can therefore use the same GPU, each running in bursts while the others wait their turn.
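As a toy illustration of the idea (not how the GPU driver actually schedules work), the sketch below runs several jobs round-robin on one simulated GPU. Job names, work units, and the slice length are made up, and context-switch overhead is ignored.

```python
from collections import deque


def round_robin(work: dict[str, int], slice_len: int = 1) -> dict[str, int]:
    """Return the clock time at which each job finishes on one time-sliced GPU.

    `work` maps a job name to the GPU time units it needs when running alone.
    """
    queue = deque(work.items())
    clock = 0
    finished: dict[str, int] = {}
    while queue:
        name, remaining = queue.popleft()
        run = min(slice_len, remaining)   # run for one slice, or less if nearly done
        clock += run
        remaining -= run
        if remaining:
            queue.append((name, remaining))  # not done: go to the back of the line
        else:
            finished[name] = clock
    return finished


jobs = {"a": 2, "b": 4, "c": 1}
for name, done_at in round_robin(jobs).items():
    print(f"job {name}: needs {jobs[name]} units alone, finishes at t={done_at}")
```

The GPU is busy for the whole run, yet each job finishes later than it would alone, which is the utilization-versus-latency trade-off in miniature.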

Why it’s great: If your workloads don’t need 100% GPU capacity all the time, time-slicing significantly increases utilization.
GPU time-slicing is an ideal approach for sharing GPUs among light inference workloads. NVIDIA’s benchmarks demonstrated that GPU time-slicing can increase GPU utilization by approximately threefold for light inference workloads, without impacting latency or throughput. See the diagrams below for details:

Applying GPU time-slicing to small batch jobs can increase GPU utilization to 100%, though this benefit comes at the cost of longer execution times, detailed in NVIDIA’s benchmarks:

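Trade-offs: Time-slicing offers flexibility and broader GPU compatibility but provides no resource isolation, which can make performance unpredictable. Workloads share the same GPU memory with no enforced per-workload limit, so one job can exhaust memory for the others, and context switching adds overhead that may lengthen execution time.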
What is NVIDIA’s Multi-Instance GPU (MIG)?
Multi-Instance GPU (MIG) partitions a physical GPU into multiple smaller, fully isolated instances, each with dedicated compute cores, memory, and cache. This means multiple, parallel workloads can run independently on a single GPU with guaranteed performance.
MIG allows running up to 7 workloads on a single physical GPU, reducing underutilization when a workload cannot fully saturate the resources of the entire GPU.

Why it’s great: This approach is perfect for mixed workloads where predictable performance and fault isolation are critical, such as serving multiple AI inference requests simultaneously without performance degradation. MIG is well-suited for a wide range of applications, including light and heavy inference workloads, small GPU batch jobs, GPU-intensive tasks, and many others.
Trade-offs: MIG provides strong hardware-level isolation and predictable, guaranteed performance by partitioning a GPU into dedicated instances. However, it is only available on supported NVIDIA data center GPUs from the Ampere generation onward. MIG also caps the number of instances per GPU at seven, which can constrain scalability compared to time-slicing, and it offers less flexibility for resizing partitions at runtime.
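To make the partitioning limits concrete, here is a minimal sketch that checks whether a set of requested MIG profiles could fit on one GPU. The slice counts are an assumption based on the published A100 40GB profiles (7 compute slices, 8 memory slices); other GPUs differ, and NVIDIA's real placement rules are stricter than this simple sum, so treat it as a necessary check rather than a sufficient one.

```python
# Assumed A100 40GB limits; other MIG-capable GPUs use different values.
COMPUTE_SLICES = 7
MEMORY_SLICES = 8

# profile name -> (compute slices, memory slices), assumed A100 40GB profiles
PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}


def fits_on_one_gpu(requested: list[str]) -> bool:
    """True if the profiles fit within the compute and memory slice budgets.

    Ignores physical placement constraints, so a True result is not a guarantee.
    """
    compute = sum(PROFILES[p][0] for p in requested)
    memory = sum(PROFILES[p][1] for p in requested)
    return compute <= COMPUTE_SLICES and memory <= MEMORY_SLICES


print(fits_on_one_gpu(["1g.5gb"] * 7))                    # seven smallest instances
print(fits_on_one_gpu(["4g.20gb", "2g.10gb", "1g.5gb"]))  # mixed sizes
print(fits_on_one_gpu(["3g.20gb", "3g.20gb", "1g.5gb"]))  # runs out of memory slices
```

The last case shows why partition sizes cannot be mixed freely: the compute slices add up to seven, but the memory slices do not fit.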
GPU sharing at a glance: GPU time-slicing vs. MIG

The best of both worlds: combining GPU time-slicing with MIG
You can maximize GPU utilization by applying GPU time-slicing within MIG instances, enabling multiple workloads to share each MIG partition. This hybrid approach balances MIG’s strong isolation with increased workload capacity from GPU time-slicing.
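A back-of-the-envelope sketch of the hybrid capacity follows. The replica count of 4 is a hypothetical setting, and the per-pod memory figure is only an average: time-slicing does not enforce memory limits inside a MIG instance.

```python
def hybrid_slots(mig_instances: int, replicas_per_instance: int) -> int:
    """Schedulable GPU slots when every MIG instance is time-sliced."""
    return mig_instances * replicas_per_instance


def average_memory_gb(instance_memory_gb: float, replicas_per_instance: int) -> float:
    """Average memory per pod within one MIG instance (not an enforced limit)."""
    return instance_memory_gb / replicas_per_instance


MIG_INSTANCES = 7         # the per-GPU maximum mentioned above
REPLICAS = 4              # hypothetical time-slicing replicas per instance
INSTANCE_MEMORY_GB = 5.0  # a 1g.5gb instance

print(f"slots per physical GPU: {hybrid_slots(MIG_INSTANCES, REPLICAS)}")
print(f"average memory per pod: {average_memory_gb(INSTANCE_MEMORY_GB, REPLICAS)} GB")
```

Isolation still holds between MIG instances, but pods that share one instance compete for its compute time and memory.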

Unlocking GPU optimization in Kubernetes with Cast AI: a practical guide
Cast AI makes GPU sharing simple – and fully automated:
For time-slicing, enable it in the GPU-sharing section of the Node Template. No workload changes are needed. Once GPU sharing is enabled, each physical GPU is exposed as multiple shared GPUs.

For MIG, add a toleration and a node selector to your pod specs to request specific MIG partitions; Cast AI then handles provisioning and scaling. For example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test-mig
spec:
  replicas: 7
  selector:
    matchLabels:
      app: gpu-test-mig
  template:
    metadata:
      labels:
        app: gpu-test-mig
    spec:
      tolerations:
        - key: "nvidia.com/gpu.mig"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu.mig-partition-1g.5gb: "true"
      containers:
        - name: gpu-test-mig
          image: nvidia/cuda:11.0.3-base-ubi7
          command:
            - bash
            - -c
            - |
              /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
          resources:
            limits:
              nvidia.com/gpu: 1

GPU time-slicing & MIG with Cast AI
What happens if you enable GPU time-slicing in the Node Template and pods request MIG partitions? You can schedule even more GPU workloads since multiple containers or pods share each MIG partition using time-slicing inside it.
GPU sharing in real life: ALLEN Digital
ALLEN Digital was running 7 models on SageMaker: 3 open-source and 4 custom. GPU instances ran continuously but served an intermittent load.
After moving to Kubernetes with GPU time-slicing enabled, a 50/50 on-demand/Spot split, and node bin-packing, utilization improved dramatically. This led to 20% savings immediately from time-slicing, 30–40% after consolidating models onto shared instances, and more than 70% total savings versus SageMaker after rightsizing CPU and memory alongside the GPU changes. Latency held throughout.
Conclusion
In the dynamic world of AI, optimizing GPU costs in Kubernetes is a necessity. By applying advanced GPU sharing techniques like time-slicing and MIG, teams can achieve unprecedented levels of utilization and efficiency.
Cast stands at the forefront of this revolution, offering an automated platform that simplifies the deployment and management of these complex GPU strategies and integrates seamlessly with your autoscaler to ensure optimal resource allocation.
With Cast, you don’t have to choose between cost efficiency and performance isolation. Our platform empowers teams to run more GPU workloads on fewer resources, unlocking significant cost savings without sacrificing the performance or reliability your critical applications demand.
Explore how Cast can help you automatically optimize your Kubernetes GPU infrastructure with intelligent sharing techniques – check out the GPU Optimization Feature Page to learn more or request a demo to see Cast in action.



