Kubernetes GPU Optimization: How to Cut GPU Waste Without Slowing Workloads

Learn how to optimize Kubernetes GPU utilization with proven strategies for MIG, time-slicing, and Dynamic Resource Allocation. This guide explains how to eliminate GPU waste, improve scheduling efficiency, and reduce GPU costs by up to 90% without sacrificing workload performance.

Laurent Gil

GPU optimization in Kubernetes means extracting maximum useful compute from each physical GPU through partitioning, sharing, intelligent scheduling, and node lifecycle automation – without degrading workload performance or reproducibility. In practice, most clusters are nowhere near that standard.

The Cast AI 2026 State of Kubernetes Optimization report, drawn from tens of thousands of production clusters, puts average GPU utilization at 5%. CPU averages 8%, memory 20%. GPU underperforms both by a wide margin. The well-tuned best case – a 136-node H200 LLM inference fleet – sustains 49% average utilization. Most clusters never get close. And with H200 Capacity Blocks up 15% in January 2026 (the first GPU price increase in two decades), the cost of that waste is compounding.

What This Post Covers

  • A diagnostic framework – The Four GPU Money Leaks – that maps each structural waste pattern to its specific technical fix
  • Clear distinction between MIG partitioning and GPU time-slicing, which are meaningfully different techniques
  • Copy-pasteable YAML for GPU Operator time-slicing config, DCGM alert rules, and a DRA ResourceClaimTemplate
  • A decision table: when to use MIG, time-slicing, or DRA-based scheduling
  • What changes with Dynamic Resource Allocation going GA in Kubernetes 1.34
  • How teams have achieved ~90% GPU cost reductions without degrading throughput

The Four GPU Money Leaks

GPU waste isn’t random. It clusters around four structural patterns. Fix all four and you’ve addressed most of the utilization gap. Miss any one and you’re leaving significant compute cost unrecovered.

Leak 1: Idle GPU Nodes (No Node Lifecycle Automation)

The most common GPU waste pattern: a developer spins up an H100 instance for an experiment, finishes the run, and moves on. The node keeps running until someone notices the bill. At AWS p5 pricing (~$6.88/GPU/hr as of mid-2025), a single idle H100 costs roughly $4,954/month. Scale that to a team of 20 data scientists with similar habits and you’re burning ~$99K/month before a single model trains.

Pricing reflects AWS P5 on-demand rates after the June 2025 price reduction. Check aws.amazon.com/ec2/pricing for current rates.
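
The arithmetic behind those figures is easy to script as a sanity check when rates change. A minimal Python sketch, assuming a 720-hour month (30 days) and the ~$6.88/GPU/hr rate quoted above; the function name is illustrative:

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month


def idle_gpu_cost_per_month(hourly_rate_usd: float, idle_gpus: int = 1) -> float:
    """Cost of GPUs that sit idle for a whole month at a fixed hourly rate."""
    return hourly_rate_usd * HOURS_PER_MONTH * idle_gpus


one_gpu = idle_gpu_cost_per_month(6.88)
team = idle_gpu_cost_per_month(6.88, idle_gpus=20)
print(f"1 idle H100:   ${one_gpu:,.0f}/month")
print(f"20 idle H100s: ${team:,.0f}/month")
```

Swapping in your own hourly rate and GPU count gives a quick estimate of what unattended nodes cost before any optimization work starts.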

The fix is scale-to-zero autoscaling with GPU node lifecycle management. When no pods are scheduled to a GPU node, it should be terminated, not drained and left running. This requires an autoscaler that understands GPU node boot latency (typically 3–8 minutes for a DL-optimized AMI with driver initialization) and can pre-provision nodes ahead of scheduled demand.

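# Surface GPU nodes by creation time (oldest first) to find long-running idle candidates.
# Assumes GPU nodes carry the label accelerator=gpu; adjust the selector to your labeling scheme.
kubectl get nodes -l accelerator=gpu \
  --sort-by=.metadata.creationTimestamp \
  -o custom-columns=\
NODE:.metadata.name,\
GPU_ALLOC:.status.allocatable."nvidia\.com/gpu",\
CREATED:.metadata.creationTimestamp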

Pair this with DCGM utilization metrics to confirm zero workload activity before triggering termination. A node with allocated GPU resources but zero utilization for 30 minutes is a strong scale-down candidate.
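
That rule can be expressed as a small predicate. A sketch in Python, assuming utilization samples have already been pulled from DCGM (for example DCGM_FI_DEV_GPU_UTIL via Prometheus) as (timestamp, percent) pairs. The function and its parameters are hypothetical, not part of any DCGM or Kubernetes API:

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, GPU utilization in percent)


def is_scale_down_candidate(
    samples: Sequence[Sample],
    now: float,
    window_s: float = 30 * 60,
    max_util_pct: float = 0.0,
) -> bool:
    """True if every sample in the trailing window is at or below max_util_pct
    and the samples span (almost) the whole window."""
    recent = [(t, u) for t, u in samples if now - window_s <= t <= now]
    if not recent:
        return False  # missing data is not evidence of idleness
    # Require history reaching back to the window start, allowing one 60 s scrape gap.
    covers_window = min(t for t, _ in recent) <= now - window_s + 60
    return covers_window and all(u <= max_util_pct for _, u in recent)


# One sample per minute for 31 minutes, all at 0% utilization
idle = [(t * 60.0, 0.0) for t in range(31)]
busy = idle[:-1] + [(30 * 60.0, 35.0)]
print(is_scale_down_candidate(idle, now=30 * 60.0))
print(is_scale_down_candidate(busy, now=30 * 60.0))
```

Requiring full-window coverage keeps a freshly booted node, which has only a few minutes of zero-utilization samples, from being flagged for termination.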

Leak 2: Oversized GPU Allocation (No Partitioning)

Naive single-request inference rarely exceeds 40% GPU utilization. vLLM with continuous batching and PagedAttention can sustain 70–85% – but this requires specific serving framework configuration, not just a GPU optimization tool. Training runs hit 85–95% during the forward/backward pass but average far less across a full epoch. Despite this, the default Kubernetes GPU model is binary: a pod requests nvidia.com/gpu: 1 and gets the entire physical GPU, regardless of whether it needs 10% or 90% of the device.

MIG (Multi-Instance GPU) is the hardware-level answer. Available on A100, A30, H100, and H200, MIG partitions a physical GPU into up to 7 isolated instances, each with dedicated compute cores, memory bandwidth, and L2 cache. No noisy-neighbor effects. One MIG instance cannot starve another of memory bandwidth. An H100 80GB can be partitioned as:

  • 1g.10gb: 7 instances × 10 GB each — embedding models, small classifiers, lightweight inference
  • 2g.20gb: 3 instances × 20 GB — 7B model inference, fine-tuning jobs
  • 3g.40gb: 2 instances × 40 GB — 13B model inference, parallel training
  • 7g.80gb: 1 instance, full GPU — large model training, no partitioning

Note on 7B model sizing: 7B parameter models with FP16 weights require ~14 GB for weights alone. Production serving adds KV cache on top: at 4K context and batch size 8, roughly 4–8 GB more. A 2g.20gb profile is tight for production 7B serving; prefer 3g.40gb for inference at batch size >4, and reserve 2g.20gb for dev/testing.
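
A back-of-the-envelope version of that sizing, as a sketch: it assumes FP16 weights and KV cache (2 bytes per value) and a hypothetical 7B-class model with 32 layers, 8 KV heads (grouped-query attention), and head dimension 128. These shape parameters are illustrative assumptions, not taken from a specific model card. A model with full multi-head attention (32 KV heads) needs 4× the KV cache, so check your model's config before choosing a profile.

```python
GIB = 1024 ** 3


def weights_bytes(params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights (FP16 = 2 bytes per parameter)."""
    return params * bytes_per_param


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """KV cache: one K and one V tensor per layer, per token, per sequence."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size


w = weights_bytes(7e9)  # 14e9 bytes, i.e. 14 GB decimal
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_len=4096, batch_size=8)
total_gib = (w + kv) / GIB
print(f"weights ~ {w / GIB:.1f} GiB, KV cache ~ {kv / GIB:.1f} GiB, total ~ {total_gib:.1f} GiB")
```

The total excludes activations and runtime overhead such as the CUDA context, which is why a nominal 20 GB slice leaves little margin under these assumptions.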

Configure MIG profiles via the NVIDIA GPU Operator:

apiVersion: v1
kind: ConfigMap
metadata:
  name: default-mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      all-balanced:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 2
            "2g.20gb": 1
            "3g.40gb": 1

Warning: Changing MIG profiles requires draining the node first, and all running pods will be evicted. Plan this during a maintenance window or ensure workloads have pod disruption budgets.

Drain the node before applying a profile change, then label it to activate – the GPU Operator’s MIG Manager DaemonSet picks up the change and reconfigures hardware:

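kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl label node <node-name> nvidia.com/mig.config=all-1g.10gb --overwrite
# Once MIG Manager sets nvidia.com/mig.config.state=success, make the node schedulable again
kubectl uncordon <node-name>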

Pods then request MIG instances directly: nvidia.com/mig-1g.10gb: 1. The node that previously ran one inference endpoint now runs seven.

Leak 3: One Workload Per Physical GPU (No Time-Slicing)

Time-slicing is a separate technique from MIG, and the distinction matters operationally. MIG partitions the physical GPU at the hardware level with full memory isolation. Time-slicing creates virtual GPU replicas through CUDA context switching, with no memory or fault isolation between replicas. Nothing enforces a per-replica memory limit, so a pod on a time-sliced virtual GPU can consume memory that other replicas on the same physical device need, and a fault in one workload can affect the others.

That limitation makes time-slicing the right tool for developer environments and light inference where memory isolation isn’t a requirement. Configure it via the GPU Operator:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
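
Apply the ConfigMap, then point the ClusterPolicy's device plugin at it. Setting "default": "any" applies the any entry to every GPU node; without a default, each node must be labeled with nvidia.com/device-plugin.config to opt in:

kubectl patch clusterpolicy/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'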

Note on context-switch latency: Time-slicing works via CUDA context switching. At 4–8 replicas for short-running inference requests, overhead is minimal. At 12+ replicas with long-running kernels, context-switch latency can add 5–20ms to p99 tail latency. For latency-sensitive inference, benchmark at your target replica count before enabling in production.

vLLM users note: vLLM pre-allocates 90% of GPU memory by default (--gpu-memory-utilization=0.90). On a time-sliced GPU, that leaves almost nothing for the other virtual replicas, which will hit out-of-memory errors as soon as they start. Set --gpu-memory-utilization slightly below 1/N, where N is the number of time-sliced replicas (e.g., for 4 replicas: 1/4 = 0.25, so --gpu-memory-utilization=0.22 leaves headroom). Adjust based on your model size and KV cache needs.
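
The 0.22 figure follows from splitting memory evenly and holding a little back. A sketch, assuming an even split and a 3-percentage-point headroom per replica; the headroom value is an assumption chosen to reproduce the example above, so tune it for your CUDA context and framework overhead:

```python
def vllm_memory_fraction(replicas: int, headroom: float = 0.03) -> float:
    """Per-replica value for vLLM's --gpu-memory-utilization on a time-sliced GPU."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    fraction = 1.0 / replicas - headroom
    if fraction <= 0:
        raise ValueError("headroom leaves no memory per replica")
    return round(fraction, 2)


for n in (2, 4, 6):
    print(f"{n} replicas: --gpu-memory-utilization={vllm_memory_fraction(n):.2f}")
```

An even split only makes sense when every replica runs a similarly sized model; mixed workloads need per-pod values that still sum to less than 1.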

With 4 replicas per GPU, four pods each requesting nvidia.com/gpu: 1 share one physical device. The per-developer economics: 4 developers sharing one H100 at $6.88/hr means $1.72/hr per developer — a 75% reduction before touching instance pricing.

Critical observability caveat: DCGM metrics on time-sliced clusters report at the physical GPU level, not per container. DCGM_FI_DEV_GPU_UTIL gives you aggregate utilization for the physical GPU. It cannot attribute compute usage to individual virtual replicas. Per-container cost attribution on time-sliced clusters requires a separate attribution layer on top of DCGM.
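
One simple form such an attribution layer could take is splitting each physical GPU's cost across the pods that held a replica on it, weighted by how long each pod ran. This is a hypothetical sketch of that idea, not how DCGM or any specific product attributes cost, and runtime-weighted splitting ignores how much compute each pod actually used:

```python
from typing import Dict


def attribute_gpu_cost(gpu_hourly_usd: float, hours: float,
                       pod_runtime_hours: Dict[str, float]) -> Dict[str, float]:
    """Split one physical GPU's cost across pods in proportion to their runtime on it."""
    total_cost = gpu_hourly_usd * hours
    total_runtime = sum(pod_runtime_hours.values())
    if total_runtime == 0:
        # Nothing ran: the whole cost is unattributed idle waste.
        return {pod: 0.0 for pod in pod_runtime_hours}
    return {pod: total_cost * rt / total_runtime for pod, rt in pod_runtime_hours.items()}


# Pod names below are placeholders for illustration.
shares = attribute_gpu_cost(6.88, hours=24, pod_runtime_hours={
    "notebook-a": 24.0, "notebook-b": 12.0, "batch-embed": 12.0,
})
for pod, usd in shares.items():
    print(f"{pod}: ${usd:.2f}")
```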

Leak 4: Everything On-Demand (No Spot Automation)

Fewer than 2% of GPUs in enterprise Kubernetes clusters run on Spot, according to the Cast AI 2026 report. AWS Spot saves 60–91% on GPU instances, GCP 60–80%, Azure 60–90%. For training jobs with checkpointing, batch inference, and data preprocessing pipelines — workloads that can tolerate interruption — that discount is available right now.

The barrier isn’t technical feasibility. It’s operational: Spot interruptions require handling. Most teams either don’t have interruption-aware scheduling wired up, or they had a training job fail mid-epoch on Spot and switched back to on-demand. The answer is infrastructure that treats Spot interruption handling as an atomic operation — interrupt signal, cordon, drain, replacement provision — not a collection of shell scripts.

Stack Spot on top of time-slicing and the math compounds. Four developers sharing an H100 on Spot (assuming a 60% discount): from $6.88/hr on-demand to roughly $2.75/hr Spot, split four ways, equals ~$0.69/hr per developer (~$497/month). Compare that to $6.88/hr per developer solo on-demand (~$4,954/month). That ~90% per-developer reduction requires no changes to the workload itself.
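
The stacked calculation, written out as a sketch. It assumes the same $6.88/hr on-demand rate, a 60% Spot discount, 4 time-sliced replicas, and a 720-hour month; the function name is illustrative:

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month


def per_developer_hourly(on_demand_usd: float, spot_discount: float, replicas: int) -> float:
    """Hourly cost per developer after a Spot discount and a time-slicing split."""
    return on_demand_usd * (1.0 - spot_discount) / replicas


baseline = 6.88
shared_spot = per_developer_hourly(baseline, spot_discount=0.60, replicas=4)
reduction = 1.0 - shared_spot / baseline
print(f"per developer: ${shared_spot:.2f}/hr, ${shared_spot * HOURS_PER_MONTH:,.0f}/month")
print(f"reduction vs. solo on-demand: {reduction:.0%}")
```

Without intermediate rounding the monthly figure comes out near $495; the ~$497 above results from rounding the hourly rate to $0.69 first. Either way the reduction is 1 − (0.40 / 4) = 90%.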

Decision Table: MIG vs. Time-Slicing vs. DRA

These three approaches serve different isolation and workload patterns. Use this to pick the right one:

| Criterion | MIG | Time-Slicing | DRA (K8s 1.34+) |
| --- | --- | --- | --- |
| Hardware support | A100, A30, H100, H200 | Any NVIDIA GPU | Any (via driver plugin) |
| Memory isolation | Yes, hardware-enforced | No, shared physical memory | Depends on underlying resource |
| Max density per GPU | 7 instances | Up to 48 virtual replicas | Dynamic, workload-declared |
| Noisy-neighbor risk | None | Yes, memory contention possible | Configurable |
| Best workload fit | Production inference, multi-tenant serving | Dev environments, light inference | Multi-model, multi-cloud, mixed workloads |
| Scheduling model | Label-based node selection | Label-based node selection | Attribute-based (architecture, memory, CUDA) |
| Operational complexity | High: profile management per node, requires node drain | Low | Moderate: new API objects, driver support required |
| Can combine with the others | MIG + time-slicing: 7 × 4 = 28 virtual devices | See MIG column | Not directly combined |
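
The table's main branches can be condensed into a small helper. This is an illustrative simplification with a hypothetical function name and inputs, not an exhaustive policy:

```python
MIG_CAPABLE = {"A100", "A30", "H100", "H200"}  # from the hardware-support row


def pick_sharing_mode(gpu_model: str, needs_memory_isolation: bool,
                      needs_attribute_scheduling: bool = False) -> str:
    """Map workload requirements onto a GPU sharing approach."""
    if needs_attribute_scheduling:
        return "DRA"  # requires Kubernetes 1.34+ and a DRA driver
    if needs_memory_isolation:
        if gpu_model in MIG_CAPABLE:
            return "MIG"
        return "dedicated GPU"  # time-slicing cannot provide isolation
    return "time-slicing"


print(pick_sharing_mode("H100", needs_memory_isolation=True))
print(pick_sharing_mode("T4", needs_memory_isolation=False))
print(pick_sharing_mode("T4", needs_memory_isolation=True))
```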

Observability: DCGM Metrics and Alert Rules

You can’t optimize what you can’t measure. Deploy the NVIDIA DCGM Exporter as a DaemonSet; it exposes metrics at :9400/metrics. The four metrics that matter most for GPU optimization decisions:

  • DCGM_FI_DEV_GPU_UTIL — overall GPU compute utilization percentage
  • DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE — framebuffer memory used and free
  • DCGM_FI_DEV_SM_CLOCK — streaming multiprocessor clock rate (useful for detecting thermal throttling)
  • DCGM_FI_DEV_POWER_USAGE — GPU power draw in watts

Start with Grafana dashboard ID 12239 (the NVIDIA DCGM Exporter Dashboard) for a baseline view. Then add these two alert rules to surface the most actionable waste conditions:

groups:
- name: gpu-optimization
  rules:
  - alert: GPUMemoryHigh
    expr: >
      DCGM_FI_DEV_FB_USED /
      (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} memory above 90% for 5 minutes"
      description: "Consider MIG partitioning or moving to a larger instance type"

  - alert: GPUUtilizationLow
    expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
    for: 30m
    labels:
      severity: info
    annotations:
      summary: "GPU {{ $labels.gpu }} averaging below 20% utilization for 30 minutes"
      description: "Candidate for time-slicing, MIG, or scale-down"

The GPUUtilizationLow alert is your primary waste signal. On a cluster averaging 5% GPU utilization, this alert fires constantly, which is the point. Let it surface which nodes and workloads are the strongest candidates for optimization before you change anything.
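
Before wiring the rules into Prometheus, it can help to replay the same thresholds against exported baseline samples. A sketch that mirrors the two expressions in plain Python; the function names are illustrative and the `for:` hold duration is not modeled:

```python
from statistics import mean
from typing import Sequence


def memory_high(fb_used_mib: float, fb_free_mib: float, threshold: float = 0.9) -> bool:
    """Mirror of the GPUMemoryHigh expression: used / (used + free) > threshold."""
    return fb_used_mib / (fb_used_mib + fb_free_mib) > threshold


def utilization_low(util_pct_window: Sequence[float], threshold: float = 20.0) -> bool:
    """Mirror of avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < threshold."""
    return mean(util_pct_window) < threshold


print(memory_high(fb_used_mib=75_000, fb_free_mib=6_000))
print(utilization_low([0, 5, 12, 40, 3]))
```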

The DRA Evolution: Attribute-Based GPU Scheduling

Dynamic Resource Allocation (DRA) graduated to GA in Kubernetes 1.34. The core architectural shift: instead of managing GPU resources through node labels and device plugin counters, DRA introduces structured API objects that let workloads declare what they need from a GPU at the attribute level.

The key DRA API objects:

  • ResourceSlice: driver-published inventory of available devices and their attributes
  • DeviceClass: cluster-wide policy for a class of devices (e.g., “any Hopper GPU with 80 GB memory”)
  • ResourceClaimTemplate: per-namespace template for requesting GPU resources with attribute selectors
  • ResourceClaim: the binding object created at pod scheduling time

Here’s a ResourceClaimTemplate requesting any GPU with Hopper architecture and at least 40 GiB of framebuffer memory (uses the GA resource.k8s.io/v1 API introduced in Kubernetes 1.34):

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: hopper-gpu-40gb
  namespace: ml-workloads
spec:
  spec:
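    devices:
      requests:
      - name: gpu
        # The v1 API nests a single-device request under "exactly"
        exactly:
          deviceClassName: gpu.nvidia.com
          selectors:
          - cel:
              # Attribute and capacity names are driver-defined; confirm them in your ResourceSlices
              expression: >
                device.attributes["gpu.nvidia.com"].architecture == "hopper" &&
                device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0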

Reference it from a Pod spec. In the GA API, resourceClaimTemplateName sits directly on the resourceClaims entry (the nested source: field belonged to the early alpha API and has been removed):

apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
  namespace: ml-workloads
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: hopper-gpu-40gb
  containers:
  - name: inference
    image: my-inference-image:latest
    resources:
      claims:
      - name: gpu

The practical implication for multi-cloud ML platforms: DRA eliminates the label taxonomy problem. Instead of maintaining per-cluster node label schemes like nvidia.com/gpu.memory=40960MiB across AWS, GCP, and Azure (where label formats differ by cloud provider), you write one ResourceClaimTemplate that works wherever the NVIDIA driver publishes the right device attributes. Your manifests become genuinely portable across clouds and GPU generations.

DRA also lets the Kubernetes scheduler make placement decisions based on actual device capabilities rather than proxy labels. For a mixed-generation cluster running A100s and H100s side by side, DRA routes workloads to the right device class without node affinity rules for every hardware combination.

One caveat: DRA driver support and ecosystem tooling are still maturing post-GA. For new multi-cloud deployments, build with DRA from the start. For existing clusters on 1.28–1.32, close the four leaks first and track DRA adoption in parallel.

The Coordination Problem: Four Techniques, Zero Integration

Here’s the real issue ML platform teams run into: MIG, time-slicing, Spot automation, and DRA are independently documented, independently deployed, and independently operated. Getting all four working together requires coordinating the GPU Operator, the cluster autoscaler, Spot interruption handlers, and DRA drivers – each with separate failure modes, upgrade paths, and day-2 operational concerns.

In practice, most teams implement one or two techniques and leave the rest on the table. Node lifecycle gets handled by Karpenter or the standard cluster autoscaler. MIG gets configured manually with nvidia-smi and breaks when nodes cycle. Time-slicing goes in for dev environments but never production inference. Spot stays off because nobody wants to debug an interrupted training job at 2am. These aren’t failures of understanding. They’re failures of operational capacity.

The 5% average utilization figure reflects this reality. The 49% best-case figure reflects what happens when all four techniques are operating together. The gap between them isn’t a single configuration change.

How Cast AI Addresses the Coordination Gap

The problem isn’t that these techniques are hard to understand in isolation. It’s that maintaining all four, at cluster scale, with automated lifecycle management, is a substantial engineering investment – and most ML platform teams are already stretched keeping training pipelines running and inference latency in budget.

Cast AI’s GPU optimization stack handles this coordination layer automatically – rather than requiring separate tools for each optimization, a single control plane manages the full lifecycle:

  • Managed GPU time-slicing: configures 1–48 virtual replicas per GPU and automates ConfigMap lifecycle as node pools scale. No per-node GPU Operator reconfiguration when new nodes join.
  • Automated MIG partitioning: manages MIG profile selection and lifecycle changes, including node draining, without requiring manual nvidia-smi -mig operations. Profile changes propagate automatically as workload patterns shift.
  • Spot GPU automation: interruption-aware scheduling that handles cordon, drain, and replacement provisioning as one operation. Supports per-cloud interruption signal handling on AWS, GCP, and Azure.
  • DRA-native autoscaling: reads ResourceClaims directly to make scaling decisions. When a pod declares its GPU requirements through DRA, the autoscaler understands those requirements and provisions the right device – not a generic GPU node that happens to match on label.
  • GPU-aware bin-packing: places workloads with awareness of GPU topology (NVLink, PCIe bandwidth), not just core and memory count. Standard bin-packing schedulers don’t model GPU topology.
  • GPU cost attribution: DCGM doesn’t provide per-container attribution on time-sliced clusters. Cast AI adds an attribution layer that allocates physical GPU costs across virtual replica consumers – closing the observability gap that exists at the DCGM level.

OMNI Compute also pools GPUs across clouds and regions, exposing remote capacity as native Kubernetes nodes. When your primary cloud region is out of H100 inventory – a real constraint in 2025–2026 – the autoscaler can pull from a secondary region or cloud without modifying workload manifests.

The outcome data: up to ~90% per-developer GPU cost reduction by combining time-slicing with Spot (the math from Leaks 3 and 4 — actual savings vary with Spot availability and workload pattern); 70%+ savings versus SageMaker for ML inference workloads; Akamai, a Cast AI customer, achieved 40–70% overall cloud cost reductions across their infrastructure after deploying Cast AI’s automated optimization. These aren’t single-technique results. They’re what happens when all four leaks are closed simultaneously under one control plane.

For a detailed technical walkthrough, the Cast AI GPU optimization documentation covers configuration steps for time-slicing, MIG, and DRA integration.

Where to Start: A Sequenced Approach

Don’t try to implement all four techniques at once. Sequence matters — each step builds on the previous one’s visibility and control.

  1. Instrument first. Deploy DCGM Exporter as a DaemonSet and import Grafana dashboard 12239. Collect two weeks of baseline utilization data before changing anything. The GPUUtilizationLow alert identifies your worst-performing nodes.
  2. Scale-to-zero autoscaling. Highest-impact, lowest-risk change. Set node scale-down thresholds from your utilization baseline. Idle GPU nodes cost the same as active ones.
  3. Time-slicing for dev and light inference. Start with your developer GPU pool. Use 4 replicas per GPU. Monitor for OOM kills. If you see memory pressure, reduce replicas or move to MIG. Benchmark p99 latency before expanding to production — context-switch overhead is measurable at high replica counts.
  4. MIG for production inference on H100/H200/A100. Match profile to model size: 3g.40gb for production 7B inference at batch size >4; 2g.20gb for dev/testing 7B models or batch size ≤4; 1g.10gb for embedding models and classifiers. Schedule profile changes during a maintenance window — node drain required.
  5. Spot for training and batch workloads. Add checkpointing to training jobs first, then enable Spot for those node pools. Don’t enable Spot for stateful inference endpoints without an interruption handler in place.

DRA is worth tracking and evaluating for new deployments, but don’t let it block closing the four leaks. The biggest utilization gains come from steps 2–5.

Frequently Asked Questions

Why are GPUs underutilized in Kubernetes?

The default Kubernetes GPU model allocates an entire physical GPU per pod, regardless of how much compute the workload actually uses. Most inference workloads consume 5–40% of a GPU’s compute capacity. Training jobs hit 85–95% during the forward/backward pass but average far less across a full epoch. The result: the Cast AI 2026 State of Kubernetes Optimization report shows average GPU utilization at just 5% across production clusters, lower than CPU (8%) and far below memory (20%). Whole-GPU allocation is the root cause. Until a workload requests less than a full GPU, the scheduler has no mechanism to co-locate a second workload on the same device.

What is MIG vs MPS vs time-slicing?

MIG (Multi-Instance GPU) partitions a physical GPU at the hardware level into up to 7 isolated instances, each with dedicated memory bandwidth and compute: full isolation, no noisy-neighbor effects. Time-slicing creates virtual GPU replicas via CUDA context switching with no memory or fault isolation between replicas (no per-replica memory limit is enforced, so one pod can exhaust the VRAM the others depend on). MPS (Multi-Process Service) is an older CUDA mechanism for sharing a GPU across multiple CPU processes on a single node, with shared memory. For Kubernetes production inference, prefer MIG where hardware isolation is required. Time-slicing suits developer environments and lightweight, short-running inference. MPS is rarely the right choice in multi-tenant Kubernetes clusters.

How do I share a GPU across pods?

Two practical methods: (1) Time-slicing: configure a time-slicing ConfigMap in the NVIDIA GPU Operator with a replica count (e.g., 4), apply it via ClusterPolicy patch, and pods requesting nvidia.com/gpu: 1 share the physical device. No memory isolation between replicas. (2) MIG partitioning: configure MIG profiles via the GPU Operator ConfigMap (e.g., all-1g.10gb), drain the target node, apply the label, and pods request nvidia.com/mig-1g.10gb: 1 for hardware-isolated slices. Time-slicing is faster to deploy; MIG gives full memory isolation but requires supported hardware (A100, A30, H100, H200) and a node drain for profile changes.

How do I reduce GPU cost in Kubernetes?

Close the four GPU money leaks in order: (1) Enable scale-to-zero autoscaling: idle GPU nodes cost the same as active ones. Terminate nodes when no pods are scheduled. (2) Enable time-slicing or MIG to share physical GPUs across multiple workloads. Time-slicing at 4 replicas per GPU cuts per-workload GPU cost by 75%. (3) Run training and batch jobs on Spot instances: AWS saves 60–91%, GCP 60–80%, Azure 60–90%. Add checkpointing to training jobs first. (4) Rightsize GPU requests by measuring actual utilization via DCGM Exporter before setting resource requests. Combining time-slicing with Spot can reduce per-developer GPU costs by ~90%.

What is a fractional GPU?

A fractional GPU is a portion of a physical GPU’s compute capacity and memory allocated to a single workload. Kubernetes does not support fractional GPU requests natively—nvidia.com/gpu is an integer resource. Fractional access is achieved through time-slicing (which divides compute time across virtual replicas sharing the same VRAM) or MIG (which creates hardware-isolated GPU partitions with a defined fraction of memory and compute). Neither is a true Kubernetes fractional resource in the CPU/memory sense, but both allow multiple pods to use a single physical GPU. Dynamic Resource Allocation (GA in Kubernetes 1.34) enables attribute-based GPU selection that moves closer to true fractional scheduling.

Can you autoscale GPUs and scale to zero?

Yes. GPU nodes can scale to zero when no pods are scheduled to them, provided your cluster autoscaler supports GPU node pool scale-down. The key operational requirement: account for GPU node boot latency (3–8 minutes for DL-optimized AMIs with driver initialization). A well-configured autoscaler pre-provisions nodes ahead of scheduled demand and terminates idle nodes after a configurable cooldown. Scale-to-zero is the highest-impact, lowest-risk GPU optimization change: an idle GPU node costs the same as an active one. For inference endpoints that require sub-second cold start, maintain a minimum node count; for batch and training workloads, full scale-to-zero is appropriate.

How do I monitor GPU cost?

Deploy DCGM Exporter as a DaemonSet and use Grafana dashboard 12239 for baseline utilization visibility. Key metrics: DCGM_FI_DEV_GPU_UTIL (compute utilization), DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE (memory). Set a GPUUtilizationLow alert (avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20) to surface idle GPU candidates. One caveat: DCGM reports at the physical GPU level. On time-sliced clusters, it cannot attribute compute to individual virtual replicas; per-container cost attribution requires a separate layer on top of DCGM metrics. For cross-team GPU cost allocation, track both GPU utilization and pod-level resource requests to build a chargeback model.
