For years, asking Kubernetes for a GPU meant writing `nvidia.com/gpu: 1` and hoping for the best. You get a GPU. Which GPU? Whatever happens to be available.
That’s exactly how the NVIDIA device plugin works. A pod requests one GPU, and Kubernetes schedules it onto any node with a free one. If the only free GPU is an expensive one, that’s the one you get – and your lightweight inference job, which would have run perfectly on a cheaper GPU, just burned money.
The common workaround was node labels, applied by additional tooling deployed in the cluster to discover GPU properties. A nodeSelector or nodeAffinity could then reference those labels – but every team did it differently, and there was no unified standard.
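Under that model, pinning a workload to a specific GPU type looked something like this (a sketch – the nvidia.com/gpu.product label is the one published by NVIDIA’s GPU feature discovery tooling; other setups used different keys):

resources:
  limits:
    nvidia.com/gpu: 1
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB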
Kubernetes 1.34 made Dynamic Resource Allocation (DRA) generally available, and it changes this entirely – finally letting you say what you actually need: “I want a GPU with Ampere architecture, at least 20 GB of memory, and compute capability 8.0.0.” The scheduler finds the right device. The autoscaler provisions the right node. No more guessing, no more labels.
This post walks through what DRA is, how it works, how to deploy a real GPU workload using DRA, and how to efficiently share that GPU across multiple replicas of the same app.
DRA in three minutes
DRA introduces four core objects. Two are device-facing, two are workload-facing.
ResourceSlice (device-facing) is created by the DRA driver – for GPUs, by the NVIDIA DRA driver. It describes what’s actually available on each node with real, structured attributes:
attributes:
  architecture: Ampere
  brand: Nvidia
  productName: NVIDIA A100-SXM4-40GB
  cudaComputeCapability: 8.0.0
capacity:
  memory: 40Gi

Instead of “this node has 1 GPU”, Kubernetes now has: “this node has an NVIDIA A100, Ampere architecture, 40 Gi memory, compute capability 8.0.0.”
That’s a completely different level of information.
DeviceClass (device-facing) groups devices into categories. The NVIDIA DRA driver creates standard classes like gpu.nvidia.com and mig.nvidia.com. Platform teams can add their own – high-memory-gpu, budget-inference-gpu.
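A custom class is just a named set of device selectors. Here’s a minimal sketch of what a platform team’s high-memory-gpu class could look like (the name and the 40Gi threshold are illustrative):

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: high-memory-gpu
spec:
  selectors:
    - cel:
        expression: |
          device.driver == "gpu.nvidia.com" &&
          device.capacity["gpu.nvidia.com"].memory.isGreaterThan(quantity("40Gi"))

Workloads can then reference high-memory-gpu as their deviceClassName instead of repeating the CEL expression in every claim.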
ResourceClaimTemplate (workload-facing) is the per-pod template. For each pod, Kubernetes creates a corresponding ResourceClaim from the template, and the claim’s lifecycle is tied to the pod.
ResourceClaim (workload-facing) is used when multiple pods share a single device – more on this below.
How scheduling works
When a pod with a ResourceClaimTemplate is deployed:
- Kubernetes creates a ResourceClaim from the template
- The Kubernetes scheduler reads all ResourceSlices and evaluates the device requirements in the ResourceClaim
- When a matching device is found, the ResourceClaim is allocated, and the pod is scheduled to that node
When no matching device is available in the cluster, the pod remains `Pending`.
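You can watch this flow with kubectl – ResourceSlices are regular API objects (output abbreviated and illustrative; names will differ in your cluster):

kubectl get resourceslices
# NAME                            NODE         DRIVER           POOL         AGE
# gpu-node-1-gpu.nvidia.com-abc   gpu-node-1   gpu.nvidia.com   gpu-node-1   2m
kubectl get events -n gpu-demo --field-selector reason=FailedScheduling
# ... cannot allocate all claims ...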
This is where CAST AI comes in – the autoscaler reads the structured requirements from the ResourceClaim, simulates the same DRA allocation checks the Kubernetes scheduler uses, finds the cheapest instance type that satisfies the ResourceClaim, and provisions it.
When the node joins and the NVIDIA DRA driver publishes a ResourceSlice, the scheduler completes the allocation, and the pod runs.
No custom logic. The CAST AI autoscaler and Kubernetes scheduler speak the same DRA language.
The demo workload
Let’s use a real GPU workload: GPU Demo App, a CUDA-powered Mandelbrot fractal renderer written in C++.
It generates a new image every 10 seconds with randomized zoom, center coordinates, and one of eight color schemes.
Example Mandelbrot fractal generated by GPU Demo App:

To test the scenarios described below, you’ll need a Kubernetes cluster with a GPU node available and the NVIDIA DRA driver installed.
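If you still need to install the driver, it ships as a Helm chart (a sketch – the chart name follows NVIDIA’s k8s-dra-driver-gpu install docs, and the required values differ between releases, with some versions gating GPU allocation support behind an explicit chart value):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver-gpu --create-namespace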
Deploying with DRA: precise GPU requirements
With DRA, GPU requirements are specified in a ResourceClaimTemplate.
Since our GPU Demo App generates Mandelbrot fractals and requires CUDA support, let’s request an NVIDIA GPU with Ampere architecture and at least 20 GB of memory. That’s more GPU than this small demo app strictly needs, but we’ll get to optimization later.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-demo-app
  namespace: gpu-demo
spec:
  spec:
    devices:
      requests:
        - name: gpu-0
          exactly:
            deviceClassName: gpu.nvidia.com
            selectors:
              - cel:
                  expression: |
                    device.attributes["gpu.nvidia.com"].architecture == "Ampere" &&
                    device.capacity["gpu.nvidia.com"].memory.isGreaterThan(quantity("20Gi"))

Unlike the device plugin model, where nvidia.com/gpu: 1 tells Kubernetes nothing about the GPU required by the workload, this ResourceClaimTemplate precisely describes the workload’s requirements, allowing the scheduler to find the right device rather than just any available one.
An A100 (Ampere, 40 Gi) would match. A T4 (Turing, 16 GB) would not. Kubernetes understands exactly what the workload needs.
The Deployment references the ResourceClaimTemplate:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo-app
  namespace: gpu-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-demo-app
  template:
    metadata:
      labels:
        app: gpu-demo-app
    spec:
      resourceClaims:
        - name: gpu-0
          resourceClaimTemplateName: gpu-demo-app
      containers:
        - name: gpu-demo-app
          image: ghcr.io/castai/gpu-demo-app:latest
          ports:
            - containerPort: 5000
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1"
            claims:
              - name: gpu-0
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-demo-service
  namespace: gpu-demo
spec:
  selector:
    app: gpu-demo-app
  ports:
    - port: 80
      targetPort: 5000

Apply and verify:
kubectl apply -f resourceclaimtemplate.yaml -f deployment.yaml
kubectl get resourceclaim -n gpu-demo -w
# NAME                      STATE                AGE
# gpu-demo-app-gpu-abc123   allocated,reserved   8s
kubectl port-forward -n gpu-demo svc/gpu-demo-service 8080:80
open http://localhost:8080

It works. But the single replica gets a dedicated GPU – and for a small app like this, that’s overkill. DRA also supports GPU sharing, which lets multiple pods share a single GPU and opens up interesting cost-saving options.
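Curious which physical device the claim was bound to? The allocation result is recorded in the claim’s status (a sketch – pool and device names depend on the node and driver):

kubectl get resourceclaim -n gpu-demo -o yaml
# status:
#   allocation:
#     devices:
#       results:
#       - request: gpu-0
#         driver: gpu.nvidia.com
#         pool: gpu-node-1
#         device: gpu-0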
Sharing the GPU
The GPU Demo App is small. It doesn’t need exclusive access to an expensive GPU. Running multiple replicas, each generating a different fractal simultaneously, is a much better idea. DRA gives us three ways to share a GPU, and choosing between them matters.
Time-slicing: multiple replicas, processes taking turns
Time-slicing lets multiple pods share a single GPU through rapid context switching, with each pod taking turns using it. With DRA, GPU time-slicing is as simple as pointing multiple pods at the same ResourceClaim:
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-demo-app
  namespace: gpu-demo
spec:
  devices:
    requests:
      - name: ts-gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          selectors:
            - cel:
                expression: |
                  device.attributes["gpu.nvidia.com"].architecture == "Ampere" &&
                  device.capacity["gpu.nvidia.com"].memory.isGreaterThan(quantity("20000Mi"))
    config:
      - requests: ["ts-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: TimeSlicing
              timeSlicingConfig:
                interval: Long

The Deployment references this shared ResourceClaim by name:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo-app
  namespace: gpu-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu-demo-app
  template:
    metadata:
      labels:
        app: gpu-demo-app
    spec:
      resourceClaims:
        - name: ts-gpu
          resourceClaimName: gpu-demo-app
      containers:
        - name: gpu-demo-app
          image: ghcr.io/castai/gpu-demo-app:latest
          ports:
            - containerPort: 5000
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1"
            claims:
              - name: ts-gpu
                request: ts-gpu
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-demo-service
  namespace: gpu-demo
spec:
  selector:
    app: gpu-demo-app
  ports:
    - port: 80
      targetPort: 5000

Three replicas, one GPU, one node.
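You can confirm it – all three pods land on the same node (output abbreviated; names are illustrative):

kubectl get pods -n gpu-demo -o wide
# NAME                            READY   STATUS    NODE
# gpu-demo-app-6d5f7c9b8d-4x2lp   1/1     Running   gpu-node-1
# gpu-demo-app-6d5f7c9b8d-9qk7w   1/1     Running   gpu-node-1
# gpu-demo-app-6d5f7c9b8d-hv5zm   1/1     Running   gpu-node-1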
But there’s a catch. Time-slicing is achieved through rapid context switches: at any given moment, only one process uses the GPU.
For our use case – generating multiple fractal images at the same time – this is not ideal. We’re not actually running concurrently, just interleaving. If the goal is to generate images faster by running more replicas, time-slicing is not the best choice here.
MPS: multiple processes access the GPU at the same time
MPS (Multi-Process Service) enables multiple CUDA processes to access the GPU simultaneously. Instead of taking turns, they run their workloads concurrently. The NVIDIA DRA driver supports MPS as a GPU sharing technique, and this is exactly what we need to generate multiple fractals at once.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-demo-app
  namespace: gpu-demo
spec:
  devices:
    requests:
      - name: mps-gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          selectors:
            - cel:
                expression: |
                  device.attributes["gpu.nvidia.com"].architecture == "Ampere" &&
                  device.capacity["gpu.nvidia.com"].memory.isGreaterThan(quantity("20000Mi"))
    config:
      - requests: ["mps-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: MPS
              mpsConfig:
                defaultActiveThreadPercentage: 33
                defaultPinnedDeviceMemoryLimit: 5Gi

With defaultActiveThreadPercentage: 33, each of three replicas gets roughly a third of the GPU’s compute.
When the NVIDIA DRA driver sees this claim, it automatically starts the MPS Control Daemon on the node – no manual setup required.
Now all three replicas generate their fractals concurrently.
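To see MPS at work, run nvidia-smi on the GPU node: the app processes appear as MPS clients next to the nvidia-cuda-mps-server process (a sketch – PIDs, names, and formatting will differ):

nvidia-smi
# | Processes:                                        |
# |  GPU   PID    Type   Process name                 |
# |   0    1201   C      nvidia-cuda-mps-server       |
# |   0    2314   M+C    gpu-demo-app                 |
# |   0    2397   M+C    gpu-demo-app                 |
# |   0    2451   M+C    gpu-demo-app                 |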
MIG: hardware isolation, seven images at once
For the strongest isolation – dedicated memory, dedicated compute – we can use MIG (Multi-Instance GPU), another GPU sharing technique. Unlike time-slicing and MPS, it works at the hardware level: the GPU is divided into independent partitions.
The A100 40GB can be sliced into up to seven 1g.5gb partitions, giving each one 5 GB of memory and a dedicated fraction of compute.
That means seven pods, each with a completely isolated MIG partition, all running on a single A100.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-demo-mig
  namespace: gpu-demo
spec:
  spec:
    devices:
      requests:
        - name: mig-1g-5gb
          exactly:
            deviceClassName: mig.nvidia.com
            selectors:
              - cel:
                  expression: |
                    device.attributes["gpu.nvidia.com"].profile == "1g.5gb"

This uses a ResourceClaimTemplate (not a shared ResourceClaim) – each pod gets its own dedicated MIG partition.
The Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo-mig
  namespace: gpu-demo
spec:
  replicas: 7
  selector:
    matchLabels:
      app: gpu-demo-mig
  template:
    metadata:
      labels:
        app: gpu-demo-mig
    spec:
      resourceClaims:
        - name: mig-1g-5gb
          resourceClaimTemplateName: gpu-demo-mig
      containers:
        - name: gpu-demo-app
          image: ghcr.io/castai/gpu-demo-app:latest
          ports:
            - containerPort: 5000
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
            claims:
              - name: mig-1g-5gb
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-demo-service
  namespace: gpu-demo
spec:
  selector:
    app: gpu-demo-mig
  ports:
    - port: 80
      targetPort: 5000

Seven replicas, each with a dedicated 1g.5gb MIG partition, all on one A100. All seven generate fractal images simultaneously and independently due to hardware isolation.
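As before, the claims generated from the template show up as allocated – one per replica (claim names are generated from the pod and template names, so yours will differ):

kubectl get resourceclaims -n gpu-demo
# NAME                                          STATE                AGE
# gpu-demo-mig-7c9d4b5f6-2lqtw-mig-1g-5gb-abc   allocated,reserved   12s
# gpu-demo-mig-7c9d4b5f6-5nv8d-mig-1g-5gb-def   allocated,reserved   12s
# ... (seven in total)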
What CAST AI adds
DRA makes the workload side of GPU management clean and portable. CAST AI closes the loop on the infrastructure side.
When a pod with a ResourceClaimTemplate or ResourceClaim is pending, CAST AI reads the related ResourceClaim and runs the exact same DRA allocation checks the Kubernetes scheduler uses. For each candidate instance type, it simulates whether the ResourceClaim would be satisfied. Then it picks the cheapest passing option and provisions it.
This means you don’t choose instance types manually. A claim requesting Ampere architecture and at least 20 GB of memory gets the cheapest GPU that satisfies it, without any additional configuration.
In the MIG example, the autoscaler determines that the workload needs a GPU that supports MIG, provisions one, and automatically creates MIG partitions.
Multiple fractal generators can run in parallel on a single GPU with full hardware isolation, at a fraction of the cost. You don’t need to choose between cost efficiency and isolation – you combine both.
From one GPU to seven MIG partitions
DRA changes how you express GPU requirements – from a count to a description. That description is what makes intelligent scheduling and cost optimization possible.
The fractal demo starts simple: one replica, one dedicated GPU. Then, by scaling to seven replicas and switching to MIG, we run seven isolated fractal generators concurrently on a single A100. The application code didn’t change – only the ResourceClaim did. And we went from generating one image at a time to seven, without adding a single GPU node.