For years, asking Kubernetes for a GPU meant writing `nvidia.com/gpu: 1` and hoping for the best. You get a GPU. Which GPU? Whatever happens to be available.
That’s exactly how the NVIDIA device plugin works. A pod requests one GPU, and Kubernetes schedules it onto any node with a free one. If the only free GPU is an expensive one, that’s the one you get – and your lightweight inference job, which would have run perfectly on a cheaper GPU, just burned money.
The common workaround was node labels, applied by additional tooling deployed in the cluster to discover GPU properties. A nodeSelector or nodeAffinity could then reference those labels – but every team did it differently, and there was no unified standard.
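Under that model, pinning a workload to a specific GPU type looked something like this (a sketch – the nvidia.com/gpu.product label is the one published by NVIDIA’s GPU feature discovery tooling; other setups used different keys):

resources:
  limits:
    nvidia.com/gpu: 1
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB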
Kubernetes 1.34 made Dynamic Resource Allocation (DRA) generally available, and it changes this entirely – finally letting you say what you actually need: “I want a GPU with Ampere architecture, at least 20 GB of memory, and compute capability 8.0.0.” The scheduler finds the right device. The autoscaler provisions the right node. No more guessing, no more labels.
This post walks through what DRA is, how it works, how to deploy a real GPU workload using DRA, and how to efficiently share that GPU across multiple replicas of the same app.
DRA in three minutes
DRA introduces four core objects. Two are device-facing, two are workload-facing.
ResourceSlice (device-facing) is created by the DRA driver – for GPUs, by the NVIDIA DRA driver. It describes what’s actually available on each node with real, structured attributes:
attributes:
  architecture: Ampere
  brand: Nvidia
  productName: NVIDIA A100-SXM4-40GB
  cudaComputeCapability: 8.0.0
capacity:
  memory: 40Gi

Instead of “this node has 1 GPU”, Kubernetes now has: “this node has an NVIDIA A100, Ampere architecture, 40 Gi memory, compute capability 8.0.0.”
That’s a completely different level of information.
DeviceClass (device-facing) groups devices into categories. The NVIDIA DRA driver creates standard classes like gpu.nvidia.com and mig.nvidia.com. Platform teams can add their own – high-memory-gpu, budget-inference-gpu.
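A custom class is just a named set of device selectors. Here’s a minimal sketch of what a platform team’s high-memory-gpu class could look like (the name and the 40Gi threshold are illustrative):

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: high-memory-gpu
spec:
  selectors:
    - cel:
        expression: |
          device.driver == "gpu.nvidia.com" &&
          device.capacity["gpu.nvidia.com"].memory.isGreaterThan(quantity("40Gi"))

Workloads can then reference high-memory-gpu as their deviceClassName instead of repeating the CEL expression in every claim.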
ResourceClaimTemplate (workload-facing) is the per-pod template. For each pod, Kubernetes creates a corresponding ResourceClaim from the template, and the claim’s lifecycle is tied to the pod.
ResourceClaim (workload-facing) is used when multiple pods share a single device – more on this below.
How scheduling works
When a pod with a ResourceClaimTemplate is deployed:
- Kubernetes creates a ResourceClaim from the template
- The Kubernetes scheduler reads all ResourceSlices and evaluates the device requirements in the ResourceClaim
- When a matching device is found, the ResourceClaim is allocated, and the pod is scheduled to that node
When no matching device is available in the cluster, the pod remains `Pending`.
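You can watch this flow with kubectl – ResourceSlices are regular API objects (output abbreviated and illustrative; names will differ in your cluster):

kubectl get resourceslices
# NAME                            NODE         DRIVER           POOL         AGE
# gpu-node-1-gpu.nvidia.com-abc   gpu-node-1   gpu.nvidia.com   gpu-node-1   2m
kubectl get events -n gpu-demo --field-selector reason=FailedScheduling
# ... cannot allocate all claims ...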
This is where CAST AI comes in – the autoscaler reads the structured requirements from the ResourceClaim, simulates the same DRA allocation checks the Kubernetes scheduler uses, finds the cheapest instance type that satisfies the ResourceClaim, and provisions it.
When the node joins and the NVIDIA DRA driver publishes a ResourceSlice, the scheduler completes the allocation, and the pod runs.
No custom logic. The CAST AI autoscaler and Kubernetes scheduler speak the same DRA language.
The demo workload
Let’s use a real GPU workload: GPU Demo App, a CUDA-powered Mandelbrot fractal renderer written in C++.
It generates a new image every 10 seconds with randomized zoom, center coordinates, and one of eight color schemes.
Example Mandelbrot fractal generated by GPU Demo App:

To test the scenarios described below, you’ll need a Kubernetes cluster with a GPU node available and the NVIDIA DRA driver installed.
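If you still need to install the driver, it ships as a Helm chart (a sketch – the chart name follows NVIDIA’s k8s-dra-driver-gpu install docs, and the required values differ between releases, with some versions gating GPU allocation support behind an explicit chart value):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver-gpu --create-namespace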
Deploying with DRA: precise GPU requirements
With DRA, GPU requirements are specified in a ResourceClaimTemplate.
Since our GPU Demo App generates Mandelbrot fractals and requires CUDA support, let’s request an NVIDIA GPU with Ampere architecture and at least 20 GB of memory. That’s more GPU than this small demo app strictly needs, but we’ll get to optimization later.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-demo-app
  namespace: gpu-demo
spec:
  spec:
    devices:
      requests:
        - name: gpu-0
          exactly:
            deviceClassName: gpu.nvidia.com
            selectors:
              - cel:
                  expression: |
                    device.attributes["gpu.nvidia.com"].architecture == "Ampere" &&
                    device.capacity["gpu.nvidia.com"].memory.isGreaterThan(quantity("20Gi"))

Unlike the device plugin model, where nvidia.com/gpu: 1 tells Kubernetes nothing about the GPU required by the workload, this ResourceClaimTemplate precisely describes the workload’s requirements, allowing the scheduler to find the right device rather than just any available one.
An A100 (Ampere, 40 Gi) would match. A T4 (Turing, 16 GB) would not. Kubernetes understands exactly what the workload needs.
The Deployment references the ResourceClaimTemplate:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo-app
  namespace: gpu-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-demo-app
  template:
    metadata:
      labels:
        app: gpu-demo-app
    spec:
      resourceClaims:
        - name: gpu-0
          resourceClaimTemplateName: gpu-demo-app
      containers:
        - name: gpu-demo-app
          image: ghcr.io/castai/gpu-demo-app:latest
          ports:
            - containerPort: 5000
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1"
            claims:
              - name: gpu-0
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-demo-service
  namespace: gpu-demo
spec:
  selector:
    app: gpu-demo-app
  ports:
    - port: 80
      targetPort: 5000

Apply and verify:
kubectl apply -f resourceclaimtemplate.yaml -f deployment.yaml
kubectl get resourceclaim -n gpu-demo -w
# NAME                      STATE                AGE
# gpu-demo-app-gpu-abc123   allocated,reserved   8s
kubectl port-forward -n gpu-demo svc/gpu-demo-service 8080:80
open http://localhost:8080

It works. But the single replica gets a dedicated GPU – and for a small app like this, that’s overkill. DRA also supports GPU sharing, which lets multiple pods share a single GPU and opens up interesting cost-saving options.
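Curious which physical device the claim was bound to? The allocation result is recorded in the claim’s status (a sketch – pool and device names depend on the node and driver):

kubectl get resourceclaim -n gpu-demo -o yaml
# status:
#   allocation:
#     devices:
#       results:
#       - request: gpu-0
#         driver: gpu.nvidia.com
#         pool: gpu-node-1
#         device: gpu-0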
Sharing the GPU
The GPU Demo App is small. It doesn’t need exclusive access to an expensive GPU. Running multiple replicas, each generating a different fractal simultaneously, is a much better idea. DRA gives us three ways to share a GPU, and choosing between them matters.
Time-slicing: multiple replicas, processes taking turns
Time-slicing lets multiple pods share a single GPU through rapid context switching, with each pod taking turns using it. With DRA, GPU time-slicing is as simple as pointing multiple pods at the same ResourceClaim:
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-demo-app
  namespace: gpu-demo
spec:
  devices:
    requests:
      - name: ts-gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          selectors:
            - cel:
                expression: |
                  device.attributes["gpu.nvidia.com"].architecture == "Ampere" &&
                  device.capacity["gpu.nvidia.com"].memory.isGreaterThan(quantity("20000Mi"))
    config:
      - requests: ["ts-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: TimeSlicing
              timeSlicingConfig:
                interval: Long

The Deployment references this shared ResourceClaim by name:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo-app
  namespace: gpu-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu-demo-app
  template:
    metadata:
      labels:
        app: gpu-demo-app
    spec:
      resourceClaims:
        - name: ts-gpu
          resourceClaimName: gpu-demo-app
      containers:
        - name: gpu-demo-app
          image: ghcr.io/castai/gpu-demo-app:latest
          ports:
            - containerPort: 5000
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1"
            claims:
              - name: ts-gpu
                request: ts-gpu
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-demo-service
  namespace: gpu-demo
spec:
  selector:
    app: gpu-demo-app
  ports:
    - port: 80
      targetPort: 5000

Three replicas, one GPU, one node.
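You can confirm it – all three pods land on the same node (output abbreviated; names are illustrative):

kubectl get pods -n gpu-demo -o wide
# NAME                            READY   STATUS    NODE
# gpu-demo-app-6d5f7c9b8d-4x2lp   1/1     Running   gpu-node-1
# gpu-demo-app-6d5f7c9b8d-9qk7w   1/1     Running   gpu-node-1
# gpu-demo-app-6d5f7c9b8d-hv5zm   1/1     Running   gpu-node-1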
But there’s a catch. Time-slicing is achieved through rapid context switches: at any given moment, only one process uses the GPU.
For our use case – generating multiple fractal images at the same time – this is not ideal. We’re not actually running concurrently, just interleaving. If the goal is to generate images faster by running more replicas, time-slicing is not the best choice here.
MPS: multiple processes access the GPU at the same time
MPS (Multi-Process Service) enables multiple CUDA processes to access the GPU simultaneously. Instead of taking turns, they run their workloads concurrently. The NVIDIA DRA driver supports MPS as a GPU sharing technique, and this is exactly what we need to generate multiple fractals at once.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-demo-app
  namespace: gpu-demo
spec:
  devices:
    requests:
      - name: mps-gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          selectors:
            - cel:
                expression: |
                  device.attributes["gpu.nvidia.com"].architecture == "Ampere" &&
                  device.capacity["gpu.nvidia.com"].memory.isGreaterThan(quantity("20000Mi"))
    config:
      - requests: ["mps-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: MPS
              mpsConfig:
                defaultActiveThreadPercentage: 33
                defaultPinnedDeviceMemoryLimit: 5Gi

With defaultActiveThreadPercentage: 33, each of three replicas gets roughly a third of the GPU’s compute.
When the NVIDIA DRA driver sees this claim, it automatically starts the MPS Control Daemon on the node – no manual setup required.
Now all three replicas generate their fractals concurrently.
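To see MPS at work, run nvidia-smi on the GPU node: the app processes appear as MPS clients next to the nvidia-cuda-mps-server process (a sketch – PIDs, names, and formatting will differ):

nvidia-smi
# | Processes:                                        |
# |  GPU   PID    Type   Process name                 |
# |   0    1201   C      nvidia-cuda-mps-server       |
# |   0    2314   M+C    gpu-demo-app                 |
# |   0    2397   M+C    gpu-demo-app                 |
# |   0    2451   M+C    gpu-demo-app                 |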
MIG: hardware isolation, seven images at once
For the strongest isolation – dedicated memory, dedicated compute – we can use MIG (Multi-Instance GPU), another GPU sharing technique. Unlike time-slicing and MPS, it works at the hardware level: the GPU is divided into independent partitions.
The A100 40GB can be sliced into up to seven 1g.5gb partitions, giving each one 5 GB of memory and a dedicated fraction of compute.
That means seven pods, each with a completely isolated MIG partition, all running on a single A100.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-demo-mig
  namespace: gpu-demo
spec:
  spec:
    devices:
      requests:
        - name: mig-1g-5gb
          exactly:
            deviceClassName: mig.nvidia.com
            selectors:
              - cel:
                  expression: |
                    device.attributes["gpu.nvidia.com"].profile == "1g.5gb"

This uses a ResourceClaimTemplate (not a shared ResourceClaim) – each pod gets its own dedicated MIG partition.
The Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo-mig
  namespace: gpu-demo
spec:
  replicas: 7
  selector:
    matchLabels:
      app: gpu-demo-mig
  template:
    metadata:
      labels:
        app: gpu-demo-mig
    spec:
      resourceClaims:
        - name: mig-1g-5gb
          resourceClaimTemplateName: gpu-demo-mig
      containers:
        - name: gpu-demo-app
          image: ghcr.io/castai/gpu-demo-app:latest
          ports:
            - containerPort: 5000
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
            claims:
              - name: mig-1g-5gb
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-demo-service
  namespace: gpu-demo
spec:
  selector:
    app: gpu-demo-mig
  ports:
    - port: 80
      targetPort: 5000

Seven replicas, each with a dedicated 1g.5gb MIG partition, all on one A100. All seven generate fractal images simultaneously and independently due to hardware isolation.
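As before, the claims generated from the template show up as allocated – one per replica (claim names are generated from the pod and template names, so yours will differ):

kubectl get resourceclaims -n gpu-demo
# NAME                                          STATE                AGE
# gpu-demo-mig-7c9d4b5f6-2lqtw-mig-1g-5gb-abc   allocated,reserved   12s
# gpu-demo-mig-7c9d4b5f6-5nv8d-mig-1g-5gb-def   allocated,reserved   12s
# ... (seven in total)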
What CAST AI adds
DRA makes the workload side of GPU management clean and portable. CAST AI closes the loop on the infrastructure side.
When a pod with a ResourceClaimTemplate or ResourceClaim is pending, CAST AI reads the related ResourceClaim and runs the exact same DRA allocation checks the Kubernetes scheduler uses. For each candidate instance type, it simulates whether the ResourceClaim would be satisfied. Then it picks the cheapest passing option and provisions it.
This means you don’t choose instance types manually. A claim requesting Ampere architecture and at least 20 GB of memory gets the cheapest GPU that satisfies it, without any additional configuration.
In the MIG example, the autoscaler determines that the workload needs a GPU that supports MIG, provisions one, and automatically creates MIG partitions.
Multiple fractal generators can run in parallel on a single GPU with full hardware isolation, at a fraction of the cost. You don’t need to choose between cost efficiency and isolation – you combine both.
From one GPU to seven MIG partitions
DRA changes how you express GPU requirements – from a count to a description. That description is what makes intelligent scheduling and cost optimization possible.
The fractal demo starts simple: one replica, one dedicated GPU. Then, by scaling to seven replicas and switching to MIG, we run seven isolated fractal generators concurrently on a single A100. The application code didn’t change – only the ResourceClaim did. And we went from generating one image at a time to seven, without adding a single GPU node.