Most Kubernetes GPU clusters show the same pattern. A small number of GPUs run hot, while the rest sit mostly idle. Teams spend weeks securing capacity, only to struggle with utilization once it is provisioned. Finance teams eventually notice the gap between spend and value.
Capacity itself is unpredictable. Availability varies by region and time. When capacity exists, making efficient use of it often requires complex configuration of GPU Sharing techniques such as MIG or time-slicing. Developers manually group workloads, adjust the underlying infrastructure, and revisit those decisions whenever requirements change.
As AI usage grows, this overhead compounds. More models, more regions, more inference endpoints. GPU management becomes an operational problem that scales faster than the workloads it supports.
Dynamic Resource Allocation support changes that.
Intent-based GPU allocation with DRA
Cast AI now supports Dynamic Resource Allocation for GKE and EKS clusters running Kubernetes 1.34 and above.
DRA is a Kubernetes-native way to request and share hardware accelerators, such as GPUs. Instead of specifying GPU counts directly in Pod specs, workloads reference resource claims that describe their needs. Teams define intent once, and Kubernetes handles allocation through a consistent abstraction.

If you are familiar with StorageClasses and PersistentVolumeClaims, the model will feel natural. Resource claims decouple workload requirements from infrastructure configuration.
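To make the model concrete, here is a minimal sketch of what a DRA request can look like on Kubernetes 1.34 (`resource.k8s.io/v1`). The names `single-gpu`, `inference`, and the image are illustrative; the device class (`gpu.example.com` here) is published by your GPU vendor's DRA driver, not something you define by hand.

```yaml
# A reusable claim template: "this workload needs exactly one GPU from this class."
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                        # illustrative name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.example.com  # published by your DRA driver
            count: 1
---
# The Pod references the claim instead of hardcoding a GPU count.
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: main
      image: my-inference-image:latest    # illustrative image
      resources:
        claims:
          - name: gpu                     # binds the container to the claim
```

When GPU requirements change, the template is updated once; the Pod specs that reference it stay untouched.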
Why DRA matters
DRA shifts GPU management from static configuration to workload intent.
- Workloads no longer need hardcoded GPU counts scattered across manifests: When requirements change, teams update a resource claim rather than touching every Pod spec.
- Device selection becomes more precise: Resource claims can filter on GPU attributes, such as memory or performance characteristics, rather than relying on node selectors and naming conventions.
- Sharing improves: DRA gives Kubernetes a clearer view of which workloads actually need resources, enabling better bin packing and reducing idle GPUs.
- Different workloads can express different priorities: Batch jobs can target cost-optimized GPUs, while latency-sensitive inference can request higher performance. Scheduling decisions align with workload intent rather than infrastructure assumptions.
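The attribute filtering described above uses CEL selectors inside the claim. As a hedged sketch: the capacity key shown here is modeled on the NVIDIA DRA driver's published attributes and may differ for your driver, so treat the domain and field names as assumptions.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: high-memory-gpu                   # illustrative name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com   # NVIDIA DRA driver's device class
            selectors:
              - cel:
                  # Match only GPUs advertising at least 40Gi of device memory.
                  # The capacity key is driver-defined and assumed here.
                  expression: >-
                    device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('40Gi')) >= 0
```

The same mechanism replaces node selectors and naming conventions: the claim states what the device must provide, and the scheduler matches it against what drivers advertise.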
How Cast AI closes the loop
DRA defines what workloads need. Cast AI turns that intent into running infrastructure.
The Cast AI Autoscaler reads DRA ResourceClaims and automatically provisions the right nodes. It selects instance types, finds available capacity, and scales infrastructure to match demand without manual node template management.
This works alongside existing GPU capabilities.
GPU Sharing strategies remain available where they make sense, now coordinated through a single Autoscaler. Spot optimization is applied by default, with automatic fallback to on-demand capacity when Spot is unavailable. OMNI expands the search for GPU capacity across regions and clouds when local supply is constrained.
Allocation, provisioning, placement, and cost are handled together.
From resource claim to running workload
When a workload references a DRA resource claim, the Autoscaler evaluates available capacity and automatically provisions the appropriate nodes.
If Spot capacity is available in the preferred region, it is used. If Spot is unavailable, the Autoscaler falls back to on-demand. If the region is constrained, OMNI provisions GPU capacity from another region or cloud and presents it as native Kubernetes nodes.
The workload runs without manual intervention, even as availability and pricing change.
Stop managing GPUs. Start using them.
Dynamic Resource Allocation removes the need for constant GPU tuning. Cast AI handles provisioning and placement automatically, OMNI secures capacity when local supply runs short, and Spot optimization keeps costs under control.
Together, this extends the Cast AI Autoscaler to manage GPUs end-to-end so teams can focus on models and applications instead of infrastructure mechanics.
DRA support is available now for GKE and EKS on Kubernetes 1.34 and above. AKS support is coming soon.
Want to explore how Cast can help you automatically optimize your Kubernetes GPU infrastructure with intelligent sharing techniques? Check out our documentation to learn more or request a demo to see Cast in action.