GPU lead times and capacity constraints have shifted the focus toward hardware diversification. For many teams, the question is no longer how to find more H100s, but how to move workloads to TPUs without rebuilding the entire operational stack.
TPU adoption has moved beyond internal Google projects. Organizations training foundation models increasingly rely on TPUs to bypass the global GPU supply bottleneck, and that shift is now reaching the broader enterprise market as teams realize that depending on a single hardware type introduces significant platform risk.
While TPUs are first-class citizens on GKE, the operational hurdle remains the manual effort required to provision and scale these specialized nodes. For most organizations, managing a diverse hardware fleet feels prohibitively complex.
Cast AI now automates the provisioning and scale-to-zero of TPU v5e and v5p slices. This lets you treat TPUs as a standard compute resource rather than an operational outlier.
Maximize Utilization Through Automated Hardware Lifecycle Management
You should not have to choose between keeping your researchers happy and keeping your cloud bill under control. This is how the integration changes your daily operations:
- Automated Lifecycle Management. TPU slices are high-performance resources that require precise handling. When a training job finishes at 2:00 AM, the infrastructure should not sit idle until a human intervenes at 9:00 AM. Cast AI detects the idle state and terminates the node immediately, so your cluster runs only what is necessary (see the Job sketch after this list).
- Self-Service Infrastructure. Cast AI removes the friction between model development and hardware execution by automating the entire lifecycle. Your teams no longer need to manage node pools or wait on manual provisioning tickets. Once your policies are defined, the platform ensures the right TPU resources are available exactly when the workload requires them, and gone the moment they are no longer needed.
- Operational Consistency. TPU support extends the same automation to GKE that was previously available for GPU and CPU clusters. Your platform provides the right compute for the right workload regardless of the underlying accelerator, managed through a single control plane.
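To make the scale-to-zero behavior concrete, here is a minimal sketch of a training workload expressed as a Kubernetes Job, assuming a single-host v5e 2x2 slice; the Job name and image are placeholders, and the selector and toleration details are explained in the How It Works section below.

apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-train-run                  # placeholder name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never             # the pod terminates when training exits
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
        cloud.google.com/gke-tpu-topology: 2x2
      tolerations:
        - key: "google.com/tpu"
          operator: Exists
          effect: NoSchedule
      containers:
        - name: trainer
          image: your-tpu-model:latest # placeholder image
          resources:
            requests:
              google.com/tpu: 4        # 2x2 topology = 4 chips
            limits:
              google.com/tpu: 4        # must equal the request

Once the Job completes, no pod holds a google.com/tpu request, so the Autoscaler can remove the node instead of leaving it running overnight.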
Unified Automation for Specialized AI Hardware
The addition of TPU support rounds out Cast AI’s coverage of the hardware required to run AI workloads at scale: standard CPUs, GPUs, Google Cloud TPUs on GKE, and AWS Trainium/Neuron on EKS.
By automating the selection and provisioning of these resources, you remove the manual toil that typically bogs down engineering teams. Instead of researching instance types or configuring complex node templates, your team can focus on model performance while the platform handles the underlying compute.
How It Works: One Manifest, No New Plugins
Unlike GPU nodes, TPUs on GKE do not require a separate device plugin. GKE installs the TPU drivers and exposes the google.com/tpu extended resource automatically on every TPU node, so there is nothing extra to deploy before Cast AI can autoscale.
When your pod requests google.com/tpu, the Autoscaler reads your existing node selectors and provisions the exact slice required.
Your pod spec drives the infrastructure:
spec:
  nodeSelector:
    # Cast AI reads these to pick the right ct5lp or ct5p machine type
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
    cloud.google.com/gke-tpu-topology: 2x2
  tolerations:
    - key: "google.com/tpu"
      operator: Exists
      effect: NoSchedule
  containers:
    - name: tpu-workload   # a container name is required by the API
      image: your-tpu-model:latest
      resources:
        requests:
          cpu: 4
          memory: 8Gi
          google.com/tpu: 4
        limits:
          cpu: 4
          memory: 8Gi
          google.com/tpu: 4
On taints: GKE automatically applies a google.com/tpu=true:NoSchedule taint to every TPU node. Your pod needs a matching toleration; without it, the pod won’t schedule, and Cast AI won’t provision a node. The google.com/tpu resource must appear in both requests and limits, with equal values.
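If you want to verify what GKE configured on a provisioned node, standard kubectl queries are enough; <tpu-node-name> is a placeholder for the node Cast AI created.

# Show the NoSchedule taint GKE applies to TPU nodes
kubectl get node <tpu-node-name> -o jsonpath='{.spec.taints}'

# Show the allocatable google.com/tpu extended resource (4 for a 2x2 v5e slice)
kubectl get node <tpu-node-name> -o jsonpath='{.status.allocatable.google\.com/tpu}'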
Technical Scope and Availability
We are launching with support for a specific set of configurations and will expand our coverage throughout the year to include all major TPU topologies and instance types.
- Single-host slices only. Support covers TPU v5e and v5p slices where all chips reside on a single host. Multi-host pod slices that require cross-node networking are not supported in this release; for those topologies, continue using manual configurations for now (see the v5p selector sketch after this list).
- GKE only. TPU support is a GKE-exclusive feature at launch. If you’re on EKS, Trainium/Neuron support is already available. See the AWS Neuron documentation.
- Why not the native GKE Cluster Autoscaler? GKE’s built-in autoscaler will provision TPU nodes, but Cast AI gives you more aggressive scale-down logic, unified cost visibility across your entire heterogeneous fleet, and a single control plane for every accelerator type. You don’t switch between cloud-native tools based on the hardware your workload runs on.
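For reference, here is what the single-host selectors look like for v5p. This is a sketch based on GKE's published node labels, so confirm the exact values against the GKE TPU documentation: on v5e, single-host topologies are 1x1, 2x2, and 2x4; on v5p, a 2x2x1 topology keeps all four chips on one ct5p host.

# Single-host TPU v5p slice (4 chips on one host)
nodeSelector:
  cloud.google.com/gke-tpu-accelerator: tpu-v5p-slice
  cloud.google.com/gke-tpu-topology: 2x2x1   # v5p topologies are three-dimensional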
Take the Toil Out of AI Infrastructure
Your AI platform should scale based on your workload, not your manual effort. By bringing TPUs into the Cast AI ecosystem, you ensure that your researchers have immediate access to the hardware they need while efficiency remains the default. You get the performance of specialized accelerators without the typical trade-off of a growing cloud bill or increased platform headcount.
Full configuration details, supported topologies, and additional examples are in the TPU documentation. Not a customer yet? To see scale-to-zero in action on your GKE environment, request a technical demo.