
Kubernetes Cost Optimization: How to Reduce Cluster Waste Without Hurting Reliability

Most Kubernetes clusters waste significant compute resources before optimization even begins. Learn a practical FinOps framework to reduce Kubernetes costs through measurement, allocation, rightsizing, autoscaling, governance, and continuous optimization – without sacrificing reliability.

Leon Kuperman

Kubernetes cost optimization means reducing waste across pods, nodes, autoscaling, purchasing, storage, networking, and governance without hurting performance or reliability. Teams over-request CPU and memory, autoscaling is left at defaults, and no one owns the cost of a namespace. This guide walks through a repeatable model: Measure, Allocate, Rightsize, Autoscale, Govern, Review.

Most Kubernetes clusters are paying for resources they never use. Our 2026 analysis of tens of thousands of clusters found CPU utilization averages 8% – down from 10% the year before – while 69% of clusters are actively over-provisioning CPU. That is not a billing problem. It is an engineering feedback-loop problem.

What you’ll take away from this guide

  • Why clusters waste money by default – and the six structural causes engineers rarely talk about
  • Where the money actually goes – compute, storage, networking, GPUs, and control-plane fees itemized
  • A repeatable 6-step loop – Measure → Allocate → Rightsize → Autoscale → Govern → Review
  • Copy-pasteable YAML and kubectl commands for LimitRanges, ResourceQuotas, and PVC audits
  • A 30/60/90-day roadmap from baseline visibility to full automated optimization
  • The six anti-patterns that silently kill your optimization efforts

What Kubernetes cost optimization is

Definition and scope

Kubernetes cost optimization is the continuous practice of aligning resource consumption with actual demand across every layer of the stack: pod resource requests and limits, node selection and purchasing strategy, autoscaling configuration, storage class and lifecycle, network traffic routing, and the governance model that keeps teams accountable. The scope is deliberately broad because savings at one layer often expose headroom at another. Cutting compute waste without fixing networking can mean the networking bill climbs as more traffic flows through unoptimized paths.

It is also a continuous practice, not a one-time audit. Workload patterns shift, new teams deploy services, and cloud pricing changes. A cluster optimized six months ago drifts back toward waste without a governance loop to catch it.

How it differs from generic cloud cost

Generic cloud cost optimization targets instance type selection, idle VMs, oversized databases, and reserved capacity purchasing. Kubernetes adds a scheduling and abstraction layer that breaks those assumptions. A single EC2 node might run 40 pods for 12 different teams. The “node is idle” signal is meaningless without per-pod attribution. Worse, many cost waste patterns in Kubernetes – overprovisioned requests, HPA thrashing, cross-AZ sidecar traffic – are invisible to standard cloud billing tools because they look like normal utilization at the infrastructure layer.

The FOCUS Specification v1.3, ratified December 4, 2025 by the FinOps Foundation, establishes a common billing data schema that starts to bridge this gap, but per-workload Kubernetes cost attribution still requires cluster-native tooling like OpenCost or Kubecost to surface meaningfully.

The cost-reliability trade-off

Every cost lever has a reliability floor. Cutting resource requests too aggressively causes OOMKilled pods. Disabling cluster autoscaler scale-down keeps nodes idle but prevents scheduling failures during surges. The engineering question is always: what is the reliability cost of this savings measure, and is there a safer path to the same result?

The answer almost always involves better data collection before acting. Rightsizing based on p95 usage over 7 days is much safer than rightsizing based on a single day’s peak. The optimization loop described in this guide is designed to force that data collection step before any action is taken.

Why Kubernetes costs spiral out of control

Overprovisioning

CPU utilization across 23,000+ clusters averages 8% in 2026 – down from 10% the prior year, meaning the gap is widening, not closing. Memory sits at 20% (down from 23%). These are request-weighted averages: clusters provision capacity in blocks tied to resource requests, not actual consumption. When an engineer sets requests.cpu: 2 on a pod that uses 160m on average, they’ve allocated 2 vCPUs of node space for something that needs 0.16. That pattern repeated across 500 pods fills nodes with ghost capacity.
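The arithmetic in that example can be sketched in a few lines. This is only an illustration: the 2-vCPU request and 160m usage come from the paragraph above, while the helper name and the assumption that all 500 pods look identical are hypothetical.

```python
def request_utilization(requested_millicores, used_millicores):
    """Fraction of the CPU request that the workload actually consumes."""
    return used_millicores / requested_millicores

requested = 2000  # requests.cpu: 2, expressed in millicores
used = 160        # average observed usage from the example above
pods = 500        # assumed: every pod in the fleet has the same profile

utilization = request_utilization(requested, used)
idle_vcpus = (requested - used) * pods / 1000

print(f"request utilization: {utilization:.0%}")     # 8%
print(f"reserved but idle vCPUs: {idle_vcpus:.0f}")  # 920
```

Under those assumptions the scheduler treats 1,000 vCPUs as occupied while only 80 are doing work.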

69% of clusters over-provision CPU today, up from 40% a year ago. The root cause is not negligence. Engineers set high requests because the consequence of running out (throttling, OOMKill) is immediate and painful, while the cost of overprovisioning accrues silently on a monthly bill no one in the team owns.

Default autoscaling

The Cluster Autoscaler’s default behavior adds nodes aggressively and removes them conservatively. The default scale-down-delay-after-add is 10 minutes; scale-down-unneeded-time is also 10 minutes. Most teams never touch these. The result: nodes scale up instantly under load and drain slowly after it, accumulating idle capacity in the gap. Add HPA with default thresholds that over-replicate pods, and the node count grows faster than it shrinks.

No allocation or ownership

Only ~14% of engineering teams implement chargeback for Kubernetes costs, according to the FinOps Foundation’s 2024 State of FinOps Report. Without cost ownership at the team or namespace level, there is no incentive to right-size requests or clean up unused services. Platform teams see one aggregated bill; application teams see nothing. That structure produces exactly the waste patterns described above.

Idle, orphaned, zombie resources

When a namespace is deleted or a service decommissioned, the associated PersistentVolumes, LoadBalancer Services, and static IPs frequently survive. A PersistentVolume left in the Released state (its claim is gone but its reclaim policy is Retain) still incurs storage charges, as does a Bound PVC that no pod mounts any more. LoadBalancers charge hourly whether traffic flows through them or not. A single unused NLB on AWS costs ~$16/month minimum; 30 of them is nearly $500/month for nothing. These orphans accumulate faster in large, multi-team clusters where no one has a complete picture of what’s running.

Bursty workloads

Workloads with high peak-to-baseline ratios – batch processing, CI pipelines, ML training jobs – force teams to provision for peak if autoscaling is too slow to respond. Node provisioning latency with Cluster Autoscaler averages 3–5 minutes. For a CI job that lasts 8 minutes, a 4-minute wait for a node adds half the job’s runtime again as queueing delay, so teams keep idle nodes warm instead. Event-driven autoscaling (KEDA) and faster node provisioning (Karpenter at ~30 seconds, Cast AI at comparable speed) dramatically change this calculus.

Multi-tenancy without chargeback

Shared clusters amplify every waste pattern. Team A provisions generous requests because they share node cost with Team B. Team B notices the cluster is “always almost full” and requests more capacity. The platform team adds nodes. Nobody tracks the resulting waste back to the requesting team. This cycle repeats until someone looks at the monthly bill and schedules an emergency cost-cutting sprint – which typically produces arbitrary limits rather than data-driven rightsizing.

The anatomy of a Kubernetes bill (where the money goes)

Compute and nodes (the pod-to-node gap)

Compute dominates. Datadog’s State of Cloud Costs 2024 found that 83% of container costs are idle resources – capacity reserved but not used. The pod-to-node gap is the core problem: Kubernetes schedules pods based on resource requests, not actual usage. A node that appears “full” based on requested CPU may run at 8% actual utilization. Of every dollar spent on that node, 92 cents is wasted.

Node instance type selection compounds this. General-purpose instances are convenient but rarely optimal. A workload running on an m5.4xlarge at 15% CPU utilization would run at about 60% utilization on an m5.xlarge, at roughly 75% lower node cost (a quarter of the vCPUs at a quarter of the on-demand price).

Storage

EBS volumes in AWS default to gp2 in many cluster configurations. Migrating to gp3 delivers 20% cost savings at equivalent or better performance (gp3 has higher baseline IOPS and throughput). Beyond storage class, PVC lifecycle management is the bigger issue: volumes sized for a peak that passed six months ago continue billing at full size indefinitely. Most teams have no process to audit PVC utilization.

Networking

Cross-AZ data transfer is one of the most underestimated Kubernetes cost drivers. AWS charges $0.01/GB for traffic crossing availability zones. A service mesh or sidecar-heavy architecture with pods scattered across three AZs can produce millions of gigabytes of cross-AZ traffic monthly. One real example: 991,980 GB/month at $0.01/GB = $9,919/month in network charges – fixed by a single route table update that kept traffic within-AZ. That is a six-figure annual saving (about $119,000) from a one-line config change.
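The cross-AZ figure is easy to reproduce. The sketch below uses the traffic volume and per-GB rate quoted above; the function name is illustrative, and the flat rate is an assumption that ignores any per-direction or regional pricing differences.

```python
CROSS_AZ_RATE_USD_PER_GB = 0.01  # rate quoted in the text; treated as flat here

def cross_az_cost(gb_per_month, rate=CROSS_AZ_RATE_USD_PER_GB):
    """Return (monthly, annual) cross-AZ transfer cost in USD."""
    monthly = gb_per_month * rate
    return monthly, monthly * 12

monthly, annual = cross_az_cost(991_980)
print(f"monthly: ${monthly:,.2f}")
print(f"annual:  ${annual:,.2f}")
```

Running the same function against your own cross-AZ byte counts (for example from VPC flow logs) gives a quick estimate of what topology-aware routing could recover.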

NAT Gateway charges ($0.045/GB processed) hit clusters that route egress through a central NAT. Service meshes and CNI plugins vary significantly in their cross-AZ traffic generation. Topology-aware routing, available in Kubernetes 1.27+, directs traffic to endpoints in the same zone when available and often eliminates the majority of cross-AZ charges without application changes.

Control-plane fees

EKS charges $0.10/hour per cluster (~$73/month). GKE charges a comparable per-cluster management fee, with a free-tier credit that offsets it for one zonal or Autopilot cluster per billing account. AKS has no control-plane fee on its Free tier; the Standard tier, which adds an uptime SLA, is billed per cluster-hour. These are minor individually but relevant at scale: 50 clusters on EKS = $3,650/month in control-plane fees alone, before a single workload runs. Cluster consolidation – merging underutilized clusters – is sometimes the highest-ROI action available.

Observability and logging

Logging and monitoring pipelines can represent 10–20% of total cluster costs at scale. Every container emits logs; large, verbose applications emit enormous volumes. Log ingestion at $0.50/GB (CloudWatch Logs standard) becomes significant fast. Filtering low-value logs at the agent (Fluent Bit, Vector) before they hit expensive backends, setting log retention policies, and sampling high-frequency debug logs are cost actions many teams overlook because they feel like “observability” rather than “cost” problems.

GPU and accelerators

An H100 on AWS costs approximately $3.90/GPU-hour (on-demand pricing following AWS’s June 2025 44% price reduction; Spot pricing is lower). At 5% average GPU utilization across clusters, that means $3.70 of every $3.90 is wasted each hour. GPU waste is structurally worse than CPU waste: GPUs can’t be time-shared by the OS scheduler the way CPUs can. One pod claiming a GPU blocks all others from using it, even at 2% utilization. This is addressable with GPU time-slicing and MIG partitioning, but most teams don’t configure either.

The Kubernetes Optimization Loop (the framework)

Measure, Allocate, Rightsize, Autoscale, Govern, Review

The Kubernetes Optimization Loop is a six-step cycle that turns cost reduction from a one-time sprint into a continuous capability:

Measure → Allocate → Rightsize → Autoscale → Govern → Review

  1. Measure: Get real utilization data – actual CPU and memory consumption per pod, per namespace, per cluster. Without this, every other step is guesswork.
  2. Allocate: Map spend to owners. Which team, product, or environment generated which costs? Attribution is the prerequisite to accountability.
  3. Rightsize: Adjust resource requests and limits to match observed usage. This single step typically recovers the most money in the shortest time.
  4. Autoscale: Configure HPA, VPA, KEDA, and node autoscalers to match capacity to real-time demand rather than worst-case estimates.
  5. Govern: Enforce LimitRanges, ResourceQuotas, and admission policies so new workloads deploy into guardrails, not an open field.
  6. Review: Monthly cost review against baseline KPIs. Catch new waste before it compounds. Update policies as workload patterns evolve.

Why it is a loop

Workloads change. Teams deploy new services. Cloud pricing changes. A static optimization produces diminishing returns within weeks because the cluster state it was optimized for no longer exists. The loop’s Review step generates the input for the next Measure step – new baselines, new waste patterns, new owners. Teams that treat cost optimization as a quarterly project see costs climb between sprints. Teams that run the loop continuously hold their gains and keep improving.

Measure – get cost and utilization visibility

You cannot optimize what you cannot see. The first step is deploying a cost visibility tool that ties Kubernetes workload identity (namespace, pod, label, deployment) to infrastructure cost. OpenCost is the main open-source option, with Kubecost and Cast AI Cost Monitoring as commercial alternatives; all three are covered below.

Before installing anything, kubectl top gives you a starting point:

# Actual CPU and memory usage per pod across all namespaces
kubectl top pods --all-namespaces --sort-by=cpu

# Node-level utilization
kubectl top nodes

This shows instantaneous usage only. For cost optimization you need time-series data – at minimum 7 days, ideally 30 – to understand p50, p95, and p99 usage patterns. A pod that averages 200m CPU but spikes to 2 vCPUs for 30 seconds per hour needs different treatment than one that consistently uses 1.8 vCPUs.
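To see why a single average is not enough, here is a small sketch that summarizes two made-up CPU series shaped like the pods described above. The sample values are hypothetical, and the nearest-rank percentile is just one simple definition; a metrics backend may interpolate differently.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical CPU samples in millicores, one per 30-second scrape over one hour.
spiky = [200] * 119 + [2000]  # a single 30-second spike to 2 vCPUs
steady = [1800] * 120         # consistently near 1.8 vCPUs

for name, samples in (("spiky", spiky), ("steady", steady)):
    mean = sum(samples) / len(samples)
    print(name, "mean", round(mean), "p50", percentile(samples, 50),
          "p95", percentile(samples, 95), "p99", percentile(samples, 99),
          "max", max(samples))
```

In this toy data the spike is rare enough that even p99 misses it, which is a reminder to look at the maximum alongside percentiles before lowering a limit.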

OpenCost (CNCF incubating project) provides per-namespace, per-deployment, per-pod cost allocation at no cost. Install it with a single Helm chart; it reads from your cloud billing API and Prometheus metrics to produce per-workload cost breakdowns. Kubecost builds on OpenCost with enterprise features including multi-cluster views, budget alerts, and savings recommendations.

Cast AI Cost Monitoring connects read-only to your cluster in minutes and provides real-time utilization visibility with a free tier – no agents to size, no Prometheus to manage. For teams that want immediate insight without operational overhead, it is the fastest path from zero to a cost dashboard.

Whichever tool you choose, establish baseline KPIs at week one. You cannot demonstrate savings without a documented starting point.

Allocate – attribute spend to owners

Cost visibility without attribution is just a number on a dashboard. Attribution creates the feedback loop: teams see what they spend, which creates the incentive to spend less. The mechanics are straightforward.

Start with namespace-level tagging. Every namespace should carry labels for team, cost center, environment, and product. These labels propagate into OpenCost/Kubecost reports and into cloud billing exports if your nodes carry matching EC2/GCE/Azure VM tags.

kubectl label namespace team-alpha \
  team=alpha \
  cost-center=platform \
  environment=production \
  product=api-gateway

Once labels are in place, generate a weekly or monthly report per namespace. Start with showback: share the data with teams without billing consequences. Teams that see their namespace costs change behavior faster than you’d expect, even without chargeback. After two to three months of showback, moving to chargeback – where teams are actually billed for their usage via internal allocation – has a larger and more durable effect on request hygiene.
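A showback report is essentially a group-by over labeled cost rows. The sketch below assumes a hypothetical list of per-namespace rows; the field names and dollar amounts are invented for illustration and do not reflect the export schema of OpenCost, Kubecost, or any other tool.

```python
from collections import defaultdict

# Hypothetical allocation rows, one per namespace.
rows = [
    {"namespace": "team-alpha", "team": "alpha", "monthly_cost": 4200.0},
    {"namespace": "team-alpha-staging", "team": "alpha", "monthly_cost": 900.0},
    {"namespace": "team-beta", "team": "beta", "monthly_cost": 2600.0},
    {"namespace": "scratch", "team": None, "monthly_cost": 300.0},  # unlabeled
]

def showback(rows, label="team"):
    """Sum monthly cost per label value, largest first; unlabeled spend is kept visible."""
    totals = defaultdict(float)
    for row in rows:
        owner = row.get(label) or "unallocated"
        totals[owner] += row["monthly_cost"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

report = showback(rows)
for owner, cost in report.items():
    print(f"{owner:12s} ${cost:,.0f}")
```

Keeping an explicit "unallocated" bucket is a deliberate choice here: it shows how much spend still lacks an owner, which is usually the first number to drive down.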

Only ~14% of teams implement chargeback today. The remaining 86% are leaving the most powerful behavioral lever in cost optimization unused.

Rightsize – match requests to real usage

Rightsizing is the highest-ROI action in the optimization loop for most clusters. It means setting requests and limits based on observed usage rather than guesses. The process:

  • Collect p50 and p95 usage data for each workload over at least 7 days (14–30 preferred)
  • Set requests at p50–p75 usage for stable workloads, p75–p90 for bursty ones
  • Set memory limits at 1.2–1.5x p99 memory usage
  • Leave CPU limits uncapped or set them generously – CPU throttling is almost always worse than the cost of leaving some headroom
  • Apply changes gradually, watching error rates and latency for at least 24 hours after each workload is resized
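The rules above can be sketched as a small function. The percentile choices (p75 for stable, p90 for bursty) and the 1.3x memory headroom are picked from inside the ranges listed; they are assumptions for illustration, as are the sample values and function names.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile over a list of numeric samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def recommend(cpu_millicores, memory_mib, bursty=False, memory_headroom=1.3):
    """Suggest requests and limits from observed usage samples."""
    request_pct = 90 if bursty else 75
    return {
        "cpu_request_m": percentile(cpu_millicores, request_pct),
        "memory_request_mib": percentile(memory_mib, request_pct),
        "memory_limit_mib": round(percentile(memory_mib, 99) * memory_headroom),
        "cpu_limit_m": None,  # left uncapped, per the guidance above
    }

# Hypothetical usage samples; in practice these would cover 7 to 30 days.
cpu = [120, 130, 150, 160, 160, 170, 180, 200, 240, 400]
mem = [300, 310, 310, 320, 320, 330, 340, 350, 360, 380]

print(recommend(cpu, mem))
print(recommend(cpu, mem, bursty=True))
```

The output is a starting point for a gradual rollout, not a value to apply blindly; the 24-hour observation window after each change still applies.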

Throttling vs OOMKilled

These are the two failure modes from incorrect limits, and they behave very differently. CPU throttling occurs when a container exceeds its CPU limit: the Linux CFS scheduler forces it to wait, increasing latency with no error signal. Your pods stay running but respond slowly. CPU throttling is silent, hard to detect without the right metrics (container_cpu_cfs_throttled_seconds_total), and is caused almost entirely by CPU limits that are set too close to actual peak usage.

OOMKilled is the opposite: the container exceeds its memory limit and the kernel kills it immediately. The pod restarts. This is loud and obvious – you’ll see it in events and logs – but it means you have a hard ceiling problem. OOMKills during rightsizing mean your memory limit is below the actual working set. The fix is always more memory headroom above p99, not aggressive limit reduction.

Practical rule: never set CPU limits equal to CPU requests. The requests == limits pattern (the shortcut to Guaranteed QoS) guarantees CPU throttling under any burst. Use Guaranteed QoS only when you genuinely need reserved CPU and the strongest protection against eviction, and are willing to pay for it. For most stateless services, set requests conservatively and leave limits uncapped or at 4–8x requests.
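For reference, the QoS class a pod ends up in follows from its requests and limits. The function below is a simplified sketch of the documented rules: it only looks at cpu and memory, and it compares quantity strings literally (so "500m" and "0.5" would not match), which the real API server does not do.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort from its containers' resources."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An omitted request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

equal = [{"requests": {"cpu": "500m", "memory": "256Mi"},
          "limits": {"cpu": "500m", "memory": "256Mi"}}]
no_cpu_limit = [{"requests": {"cpu": "100m", "memory": "256Mi"},
                 "limits": {"memory": "256Mi"}}]

print(qos_class(equal))         # Guaranteed
print(qos_class(no_cpu_limit))  # Burstable
print(qos_class([{}]))          # BestEffort
```

The second case, a memory limit with no CPU limit, is the shape most stateless services end up with when following the rule above.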

Cast AI Workload Autoscaler handles this continuously and without pod restarts – applying rightsized requests as workload patterns evolve. Bud (case study) saw a 93% improvement in CPU and memory utilization after enabling it, without a single application outage during the transition.

Autoscale – tune capacity to demand

Autoscaling in Kubernetes operates at two levels: pod scaling (how many replicas) and node scaling (how much cluster capacity). Getting both right simultaneously is where most teams struggle.

Here is a comparison of the main autoscaling tools:

| Tool | Scales | Trigger | Best for | Limitation |
| --- | --- | --- | --- | --- |
| HPA (Horizontal Pod Autoscaler) | Pod replicas | CPU, memory, custom metrics | Stateless, traffic-driven services | Slow to respond to sudden spikes; can thrash |
| VPA (Vertical Pod Autoscaler) | Pod resource requests | Historical usage analysis | Stateful workloads, jobs, singletons | Requires pod restart to apply; conflicts with HPA |
| KEDA | Pod replicas (to zero) | External events (queues, cron, HTTP) | Event-driven, batch, jobs, scale-to-zero | Requires ScaledObject config per workload |
| Cluster Autoscaler | Nodes | Pending pods / underutilized nodes | Any cluster; baseline node autoscaling | Slow (3–5 min); fixed node groups; less flexible |
| Karpenter | Nodes | Pending pods / NodePool rules | AWS EKS; flexible, fast node provisioning | Originated on AWS; Azure support added via AKS provider; multi-cloud coverage expanding |

Karpenter typically delivers 20–40% cost reduction vs Cluster Autoscaler in benchmark comparisons, primarily because it right-sizes node types to actual pod requirements rather than fitting pods into fixed node groups. It provisions nodes in roughly 30 seconds versus 3–5 minutes for Cluster Autoscaler, which eliminates the need to pre-provision capacity for burst workloads.

One hard rule: do not enable HPA and VPA simultaneously on the same Deployment targeting the same metric. They will fight each other — VPA increases requests, HPA scales down replicas, VPA recalculates, HPA scales back up. Use HPA for replica count, and either VPA in recommendation-only mode or Cast AI Workload Autoscaler for request rightsizing.
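The conflict is visible in the HPA scaling formula itself, which multiplies the current replica count by the ratio of observed to target utilization, where utilization is usage divided by the request. The sketch below is simplified: it omits the controller's tolerance band and stabilization windows, and the service numbers are hypothetical.

```python
import math

def hpa_desired_replicas(current_replicas, usage_per_pod_m, request_per_pod_m, target_utilization_pct):
    """desired = ceil(current * observed_utilization / target_utilization)."""
    ratio = (usage_per_pod_m * 100) / (request_per_pod_m * target_utilization_pct)
    return math.ceil(current_replicas * ratio)

# Hypothetical service: 10 replicas, each using 350m, HPA target of 70% CPU utilization.
before = hpa_desired_replicas(10, usage_per_pod_m=350, request_per_pod_m=500, target_utilization_pct=70)

# VPA raises the request from 500m to 1000m; actual usage has not changed.
after = hpa_desired_replicas(10, usage_per_pod_m=350, request_per_pod_m=1000, target_utilization_pct=70)

print(before, after)  # 10 5
```

Nothing about the traffic changed, yet the replica count halves purely because the denominator moved. The surviving pods then carry more load each, which feeds the next VPA recommendation.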

Node efficiency and purchasing

Once pods are right-sized, node selection becomes the next lever. Three purchasing strategies combine for large savings:

  • Spot/Preemptible Instances: 60–90% cheaper than on-demand for interruptible workloads. Stateless services, batch jobs, CI runners, and ML training are all good candidates. Interruption rates on AWS Spot average 5–10% depending on instance family and region.
  • Savings Plans / Reserved Instances: 30–60% savings for predictable baseline capacity. Most effective when combined with Spot for burst: reserve the floor, Spot the ceiling.
  • ARM/Graviton instances: 20–40% cheaper than equivalent x86 instances on AWS. Graviton4 (r8g, m8g families) supports most containerized workloads transparently with multi-arch images. GKE Arm (T2A) and Azure Cobalt 100 (Dpsv6 and Epsv6 series) offer similar savings.
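The "reserve the floor, Spot the ceiling" idea can be put into numbers with a short sketch. Every input here is an assumption chosen for illustration: the node counts, the $0.20 hourly rate, and discount levels picked from inside the ranges listed above.

```python
def blended_hourly_cost(baseline_nodes, burst_nodes, on_demand_rate, commit_discount, spot_discount):
    """Hourly cost when baseline nodes are covered by a commitment and burst nodes run on Spot."""
    committed = baseline_nodes * on_demand_rate * (1 - commit_discount)
    spot = burst_nodes * on_demand_rate * (1 - spot_discount)
    return committed + spot

on_demand_rate = 0.20  # hypothetical $/node-hour
baseline, burst = 10, 6

all_on_demand = (baseline + burst) * on_demand_rate
blended = blended_hourly_cost(baseline, burst, on_demand_rate,
                              commit_discount=0.40, spot_discount=0.70)
savings = 1 - blended / all_on_demand

print(f"on-demand: ${all_on_demand:.2f}/h, blended: ${blended:.2f}/h, savings: {savings:.0%}")
```

The sketch ignores interruption handling and unused commitment, both of which reduce realized savings, so treat the result as an upper bound for the chosen inputs.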

Instance family selection matters as much as pricing tier. Use memory-optimized instances for memory-heavy workloads, compute-optimized for CPU-bound jobs, and general-purpose for mixed workloads. Karpenter’s NodePool requirements can allow several instance families and sizes at once, enabling bin-packing across families rather than locking into one type.

Cast AI’s cluster autoscaler, KENT (its Karpenter integration layer), operates across AWS, GCP, and Azure, predicts Spot interruptions up to 30 minutes ahead, and proactively migrates workloads before interruptions occur. The Cast AI benchmark across production clusters shows 43% average compute cost reduction, measured across tens of thousands of clusters in Cast AI’s 2026 State of Kubernetes Optimization Report, when combining rightsizing, Spot optimization, and autonomous node management. Individual results depend on starting state, workload mix, and which optimization layers are enabled.

Storage, networking, and hidden costs

Storage quick wins:

# Find PVCs that are not in Bound state (Pending or Lost)
kubectl get pvc --all-namespaces | grep -v Bound
# Find PersistentVolumes whose claim was deleted but whose disk still exists
kubectl get pv | grep Released

Run these and you will almost certainly find Released volumes that are still incurring charges, along with Pending claims that point to broken provisioning. Delete them after confirming the data is either backed up or genuinely not needed. Then make gp3 the default StorageClass on AWS so that new volumes cost 20% less with no performance penalty (existing gp2 volumes have to be converted separately):

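# Stop gp2 from being the default StorageClass
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
# Mark gp3 as the default (assumes a StorageClass named gp3 already exists)
kubectl patch storageclass gp3 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'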

Right-size PVC claims when provisioning. Kubernetes does not automatically shrink a PVC if usage falls below the claim. Cap the size of any single claim with a LimitRange of type PersistentVolumeClaim, and cap total claimed storage per namespace with ResourceQuota (requests.storage).

Networking quick wins:

Enable topology-aware routing to keep traffic within availability zones. Audit services of type LoadBalancer — every unused LoadBalancer is a direct cost with no benefit. Use NodePort or Ingress where external LoadBalancers aren’t required. For NAT Gateway optimization, consider deploying VPC endpoints for S3 and DynamoDB (free for gateway endpoints) to eliminate NAT charges for those traffic patterns.

GPU and AI workload cost

GPU cost optimization deserves its own section because the economics are fundamentally different from CPU. A single H100 at $3.90/GPU-hour (on-demand; Spot pricing is lower) running at 5% utilization burns $3.70/hour for nothing. Scale that to a 32-GPU training cluster idle over a weekend and you have roughly $6,000 in wasted spend for two days.
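Both GPU figures follow from simple multiplication, sketched below with the $3.90/GPU-hour rate quoted above. The function names are illustrative, and the weekend is assumed to be exactly 48 hours with the GPUs fully idle.

```python
GPU_HOURLY_RATE = 3.90  # on-demand $/GPU-hour quoted in the text

def wasted_per_gpu_hour(utilization, rate=GPU_HOURLY_RATE):
    """Dollars per GPU-hour paid for capacity that does no work."""
    return rate * (1 - utilization)

def idle_cluster_cost(gpus, hours, rate=GPU_HOURLY_RATE):
    """Cost of leaving a GPU cluster provisioned but idle."""
    return gpus * hours * rate

print(f"wasted per GPU-hour at 5% utilization: ${wasted_per_gpu_hour(0.05):.3f}")
print(f"32 GPUs idle for 48 hours: ${idle_cluster_cost(32, 48):,.2f}")
```

Plugging in your own rate and idle window shows quickly whether scale-to-zero for training node pools is worth the engineering effort.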

Three levers apply:

  • GPU time-slicing: NVIDIA’s MPS and time-slicing APIs allow multiple pods to share a single GPU. Inference workloads with moderate throughput requirements rarely need a full GPU. Time-slicing lets 4–8 inference pods share one A10G, cutting GPU cost per inference pod by 75–87%.
  • Multi-Instance GPU (MIG): H100 and A100 GPUs support MIG partitioning into up to 7 independent GPU instances with guaranteed isolation. Use MIG when tenant isolation matters (multi-team inference APIs).
  • Spot for training: Training jobs that checkpoint regularly can resume after an interruption. Running them on Spot at 60–90% discount vs on-demand is the single largest GPU cost lever for most ML teams. ALLEN Digital (case study) moved GPU-heavy AI training to optimized Kubernetes-native scheduling and achieved 70%+ cost savings vs running equivalent workloads on SageMaker.
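The 75–87% range quoted for time-slicing comes from splitting one GPU's cost evenly across the pods that share it. A minimal sketch, assuming an even split and no throughput penalty from sharing (both simplifications):

```python
def per_pod_gpu_cost_reduction(pods_per_gpu):
    """Fractional cost reduction per pod when N pods share one GPU instead of each owning one."""
    return 1 - 1 / pods_per_gpu

for pods in (4, 8):
    print(f"{pods} pods per GPU: {per_pod_gpu_cost_reduction(pods):.1%} lower cost per pod")
```

In practice the achievable sharing ratio depends on each model's memory footprint and latency budget, so measure before committing to a ratio.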

Cast AI’s GPU optimization stack handles time-slicing configuration, MIG setup, and Spot-aware scheduling in a unified control plane, with visibility into per-GPU utilization that most monitoring stacks don’t provide by default.

Kubernetes FinOps and governance

Governance is what makes optimization stick. Without enforced guardrails, every new deployment resets progress. Two Kubernetes-native objects — LimitRange and ResourceQuota — are the foundation.

LimitRange sets default requests and limits for any container that doesn’t specify them, and enforces min/max bounds. This single object eliminates the “no requests set” anti-pattern for any new workload in the namespace:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-alpha
spec:
  limits:
  - type: Container
    default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    max:
      cpu: "4"
      memory: 8Gi
    min:
      cpu: 50m
      memory: 64Mi

ResourceQuota caps total resource consumption for the entire namespace, preventing any single team from consuming an outsized share of cluster resources:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
    persistentvolumeclaims: "20"
    requests.storage: 200Gi
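Conceptually, a quota check adds a new pod's requests to what the namespace already uses and rejects the pod if any total would exceed its hard cap. The sketch below models that idea with three of the caps from the manifest above, normalized to integers; the key names, the current usage, and the two sample pods are hypothetical, and this is not the API server's implementation.

```python
# Caps from the ResourceQuota above, normalized: CPU in millicores, memory in MiB.
HARD = {"requests.cpu_m": 20_000, "requests.memory_mi": 40 * 1024, "pods": 50}

def exceeded_quota(used, new_pod, hard=HARD):
    """Return the quota keys the new pod would push over the cap (empty list means it fits)."""
    return [key for key in hard if used.get(key, 0) + new_pod.get(key, 0) > hard[key]]

# Hypothetical current usage in the namespace.
used = {"requests.cpu_m": 19_000, "requests.memory_mi": 20 * 1024, "pods": 31}

small = {"requests.cpu_m": 500, "requests.memory_mi": 512, "pods": 1}
large = {"requests.cpu_m": 2_000, "requests.memory_mi": 4096, "pods": 1}

print(exceeded_quota(used, small))  # []
print(exceeded_quota(used, large))  # ['requests.cpu_m']
```

Note that the namespace here is nearly out of CPU quota while using half its memory quota, which is exactly the kind of imbalance a monthly review should surface.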

The FinOps Foundation’s three lifecycle phases (Inform, Optimize, Operate) map directly to this loop: visibility and allocation come first, then active optimization, then continuous governance. Its Crawl, Walk, Run maturity model describes how far along a team’s practice is within each phase.

Beyond these objects, OPA/Gatekeeper (or Kyverno) policies can enforce higher-level rules: require all Deployments to specify resource requests, block container images from unvetted registries, or prevent Spot-incompatible workloads from running on Spot nodes. These policies belong in CI/CD pipelines as well: a workload that fails a resource policy check before deployment is much cheaper to fix than one that deploys and drives a cost incident.

Manual vs automated cost optimization

Manual optimization works at small scale. One platform engineer can track 20 namespaces, review VPA recommendations weekly, and apply rightsizing changes with reasonable coverage. At 200 namespaces across 10 clusters, the math breaks: a single engineer would need to review thousands of pod-level recommendations, apply changes, validate stability, and repeat — constantly, because workloads change continuously.

Automation platforms shift the model. Instead of an engineer reviewing recommendations, the system applies changes within policy guardrails and alerts on anomalies. The engineer’s job becomes setting policy, reviewing monthly KPIs, and handling exceptions.

Comparison of approaches:

| Tool | Type | Cost model | Strengths | Best fit |
| --- | --- | --- | --- | --- |
| OpenCost | Open-source visibility | Free | CNCF-standard, vendor-neutral, integrates with Prometheus | Teams that want visibility and will act manually |
| Kubecost | Commercial visibility + recommendations | Free tier; paid for enterprise | Multi-cluster views, budget alerts, savings recommendations | Teams that want recommendations and some automation |
| Cast AI | Autonomous optimization platform | Free monitoring; optimization priced on savings | Automated rightsizing, Spot prediction, multi-cloud node autoscaling, GPU optimization | Teams at scale that need optimization without manual toil |

In a benchmark across production clusters, Cast AI’s autonomous optimization produced a 43% average compute cost reduction – measured across tens of thousands of clusters in Cast AI’s 2026 State of Kubernetes Optimization Report – compared to 10–15% typical for teams running manual optimization programs. Individual results depend on starting state, workload mix, and which optimization layers are enabled. The gap is not because manual optimization doesn’t work; it’s because manual processes can’t keep up with the rate of change in a live cluster.

Cost optimization by cloud

AWS (EKS): Karpenter is the preferred node autoscaler. KENT is Cast AI’s integration layer that adds rightsizing, Spot prediction, and multi-cloud portability on top of Karpenter without replacing it. Graviton4 (m8g, r8g) delivers 20–40% savings. Spot is well-supported with large instance diversity pools. Savings Plans (Compute) cover on-demand baseline. EBS gp3 migration is a quick win. The EKS control plane costs about $73/month per cluster, so consolidate clusters where feasible.

GCP (GKE): Autopilot mode is worth evaluating for variable or unpredictable workloads: it handles bin-packing automatically and charges only for requested pod resources, not entire node capacity. Spot VMs offer 60–91% discount. GKE’s built-in cost optimization recommendations surface in the Cloud Console. Committed Use Discounts cover predictable baseline.

Azure (AKS): No control-plane fees on the Free tier. Azure Spot VMs (eviction policy: Deallocate) work well for batch and CI workloads. Cobalt 100 (Dpsv6 and Epsv6 series) ARM instances deliver 20–40% savings on general workloads. Azure Reservations and Savings Plans cover baseline compute.

For multi-cloud environments, tools that abstract across clouds, including Cast AI’s KENT autoscaler, provide unified optimization without per-cloud tooling sprawl.

A 30/60/90-day cost optimization roadmap

Days 1–30: Establish visibility

  • Deploy OpenCost or connect Cast AI Cost Monitoring in read-only mode
  • Identify top 5 namespaces by waste (requested vs. used)
  • Tag all namespaces with team/cost-center/environment labels
  • Run kubectl get pvc --all-namespaces | grep -v Bound and delete confirmed orphans
  • Audit unused LoadBalancer services
  • Document baseline KPIs: CPU utilization rate, memory utilization rate, monthly cost per namespace

Days 31–60: Act on quick wins

  • Rightsize the top 10 workloads by waste, using 14+ days of p95 data
  • Enable Karpenter or Cast AI’s cluster autoscaler for one non-critical cluster
  • Enable Spot for stateless services and batch jobs in non-production first
  • Apply LimitRanges to all active namespaces
  • Migrate EBS gp2 volumes to gp3
  • Enable topology-aware routing to reduce cross-AZ traffic

Days 61–90: Systematize and automate

  • Enable automated rightsizing on all workloads (with policy guardrails)
  • Deploy Karpenter or Cast AI across all production clusters
  • Apply ResourceQuotas per namespace
  • Deploy OPA/Gatekeeper policies enforcing resource request requirements
  • Add cost gate checks to CI/CD pipelines
  • Conduct first monthly cost review against baseline KPIs
  • Begin Savings Plans or Committed Use Discount evaluation for baseline compute

Common mistakes and anti-patterns

These six patterns silently undermine every other optimization effort:

  • No resource requests set. Pods without requests get scheduled based on node capacity rather than actual need, creating unpredictable bin-packing and preventing meaningful cost attribution. LimitRanges solve this at the namespace level; admission policies enforce it cluster-wide.
  • requests == limits. Setting CPU requests equal to limits guarantees throttling under any burst load. This pattern, often copied from examples or Helm defaults, turns normal traffic spikes into latency events. Set limits higher than requests or leave CPU limits uncapped.
  • HPA and VPA both enabled on the same Deployment. They conflict. HPA scales replicas based on per-pod CPU; VPA changes per-pod CPU requests. When VPA increases requests, HPA sees lower per-pod utilization and scales down replicas, which then load each pod more, which triggers VPA to increase requests further. The result is thrashing. Use HPA for scaling and Cast AI Workload Autoscaler or VPA in recommendation-only mode for request sizing.
  • Cluster Autoscaler scale-down disabled. Many teams disable scale-down after a single bad experience with pod rescheduling. The correct fix is PodDisruptionBudgets and pod scheduling rules – not turning off scale-down globally. A cluster that never scales down accumulates idle nodes indefinitely.
  • PDBs set to maxUnavailable=0. A PodDisruptionBudget with maxUnavailable: 0 prevents the cluster autoscaler or Karpenter from ever draining a node that runs one of the covered pods. This blocks consolidation of every node hosting those pods. Use minAvailable instead of maxUnavailable: 0, and set it below the replica count so that at least one pod can be evicted during consolidation.
  • GPUs allocated 1-per-pod with no sharing. The default NVIDIA device plugin allocates one GPU per requesting container. Without time-slicing or MIG configured, a low-throughput inference pod claims an entire A100 and runs at 3% utilization. Configure the NVIDIA time-slicing ConfigMap or enable MIG for all GPU node pools handling inference workloads.
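
To make the PodDisruptionBudget anti-pattern above concrete, here is a minimal sketch of a budget that still allows consolidation. It assumes a hypothetical Deployment labeled app: web with 3 replicas in a hypothetical namespace named shop; adjust minAvailable to your own replica count.

```yaml
# Sketch only: with 3 replicas, minAvailable: 2 lets exactly one pod be
# evicted at a time, so a node can still be drained during consolidation.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb        # hypothetical name
  namespace: shop      # hypothetical namespace
spec:
  minAvailable: 2      # must stay below the replica count (assumed: 3)
  selector:
    matchLabels:
      app: web         # hypothetical label on the target pods
```

Note that minAvailable equal to the replica count has the same blocking effect as maxUnavailable: 0, so the value needs revisiting whenever the replica count changes.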

How to measure success (KPIs)

Track these metrics monthly. Compare against your documented baseline. Target ranges are guidelines – the right target depends on workload type and reliability requirements.

KPI | Formula / Source | Industry avg (2026) | Target range
CPU request utilization rate | Actual CPU used / CPU requested | 8% | 40–65%
Memory request utilization rate | Actual memory used / memory requested | 20% | 50–70%
Node bin-packing efficiency | Total pod requests / total node capacity | ~45–55% | 70–85%
Spot percentage of fleet | Spot node-hours / total node-hours | ~20% | 50–70% (stateless)
Cost per namespace | OpenCost / Kubecost allocation report | n/a | Trending down MoM
GPU utilization rate | DCGM exporter: GPU active / GPU allocated | 5% | 40–60%
P99 pod scheduling latency | kube-scheduler latency metric | n/a | Stable (not increasing)

The last KPI – P99 scheduling latency – is your reliability sentinel. If cost optimization measures are causing scheduling delays, it shows up here before it shows up in application SLOs. Track it alongside cost metrics so you can see cost savings and reliability impact in the same review.
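
The first three KPI formulas in the table can be approximated as Prometheus recording rules. This is a sketch under assumptions: it presumes cAdvisor and kube-state-metrics are being scraped, the group and record names are hypothetical, and the request totals are cluster-wide sums that do not filter out pending or completed pods.

```yaml
# Sketch only: Prometheus rule file approximating three KPIs from the table.
groups:
  - name: cost-kpis                                  # hypothetical group name
    rules:
      # CPU request utilization = actual CPU used / CPU requested
      - record: cluster:cpu_request_utilization:ratio
        expr: >-
          sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
          /
          sum(kube_pod_container_resource_requests{resource="cpu"})
      # Memory request utilization = actual memory used / memory requested
      - record: cluster:memory_request_utilization:ratio
        expr: >-
          sum(container_memory_working_set_bytes{container!=""})
          /
          sum(kube_pod_container_resource_requests{resource="memory"})
      # Bin-packing efficiency = total pod CPU requests / total node CPU capacity
      - record: cluster:cpu_binpacking_efficiency:ratio
        expr: >-
          sum(kube_pod_container_resource_requests{resource="cpu"})
          /
          sum(kube_node_status_capacity{resource="cpu"})
```

Recording the ratios rather than computing them ad hoc keeps the monthly review comparing the same definition against the baseline each time.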

FAQs

What is Kubernetes cost optimization?

Kubernetes cost optimization is the practice of reducing resource waste across every layer of a Kubernetes cluster – pods, nodes, autoscaling, storage, networking, purchasing, and governance – without degrading application performance or reliability. It treats overspend as a structural engineering problem, not a billing problem.

Why are my Kubernetes costs so high?

The most common causes are overprovisioned resource requests (CPU averages 8% utilization against requests), autoscaling configured at defaults that never scale down, no cost ownership per namespace, orphaned PVCs and LoadBalancers, cross-AZ traffic charges, and GPU nodes running at 5% average utilization. Datadog found 83% of container costs are idle resources.

How much can I save?

Savings depend on your starting state. Rightsizing alone typically recovers 20–40% of compute spend. Adding Spot Instances saves 60–90% on eligible workloads. Teams implementing the full optimization loop – Measure, Allocate, Rightsize, Autoscale, Govern, Review – routinely achieve 40–60% total compute cost reduction. Cast AI’s production benchmark shows 43% average reduction.

What is the difference between cost optimization and cost management?

Cost management is visibility and attribution: knowing what you spend, where it goes, and who owns it. Cost optimization is action: changing the system configuration to spend less for equivalent or better outcomes. You need cost management as the foundation before optimization actions are meaningful. Most teams have some cost management; far fewer have a repeatable optimization process.

How do I rightsize without causing OOM kills?

Collect p95 and p99 memory usage data over at least 7–14 days before changing any limits. Set memory limits at 1.2–1.5x the p99 observed peak, never below it. Apply changes to one workload at a time, monitor for OOMKilled events for 24 hours after each change, and roll back if triggered. For CPU, leave limits uncapped or set them at 4–8x requests – CPU throttling from tight limits is a worse outcome than a slightly higher request.
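
As a worked example of the sizing rule above, assume a hypothetical service whose observed p99 memory peak is 400Mi. A 1.3x multiplier, inside the 1.2–1.5x range, gives a limit of 520Mi. The workload name, image, and CPU and memory request values below are illustrative assumptions, not recommendations.

```yaml
# Sketch only: memory limit = 1.3 x p99 peak (400Mi) = 520Mi; CPU left uncapped.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                      # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 250m          # assumed value, taken from observed usage
              memory: 400Mi      # assumed: set at the observed p99 peak
            limits:
              memory: 520Mi      # 1.3 x 400Mi, never below the p99 peak
              # no CPU limit: avoids throttling during bursts
```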

HPA vs VPA vs Cluster Autoscaler vs Karpenter – which do I need?

Most production clusters need both pod-level and node-level autoscaling. For pods: HPA for horizontal scaling of stateless services; KEDA for event-driven and scale-to-zero workloads; VPA in recommendation-only mode (or Cast AI Workload Autoscaler) for request rightsizing. For nodes: Karpenter on EKS for flexible, fast provisioning; Cast AI for multi-cloud. Do not run HPA and VPA in active mode on the same Deployment.
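
One way to combine the two pod-level autoscalers without the conflict described above is to let HPA own the replica count while VPA only publishes recommendations. The sketch below assumes a hypothetical Deployment named web and that the VPA custom resource definitions are installed in the cluster; the replica bounds and the 70% target are illustrative.

```yaml
# Sketch only: HPA scales replicas; VPA runs in recommendation-only mode.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                  # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # percent of the CPU request, illustrative
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa                  # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"            # recommend only; never evict or resize pods
```

With updateMode set to "Off", the recommendations appear in the VPA object's status and can be applied to requests through the normal deployment process.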

Are Spot Instances safe?

Yes, for the right workloads. Stateless services, batch jobs, CI runners, and ML training jobs handle interruptions well with proper configuration: graceful termination, PodDisruptionBudgets, and multi-AZ deployment. The risk is interruption – typically 2 minutes notice on AWS. Cast AI predicts interruptions up to 30 minutes ahead and proactively rebalances. Never run stateful databases on Spot without a tested, automated failover process.
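
The "proper configuration" mentioned above might look roughly like the following for a stateless worker. This is a sketch under assumptions: the workload name and image are placeholders, the nodeSelector assumes nodes provisioned by Karpenter (which labels them with karpenter.sh/capacity-type), and the grace period is chosen to fit inside the 2-minute AWS interruption notice.

```yaml
# Sketch only: a Spot-tolerant stateless Deployment spread across zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker                       # hypothetical workload
spec:
  replicas: 4
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      terminationGracePeriodSeconds: 60    # finish well inside the 2-minute notice
      nodeSelector:
        karpenter.sh/capacity-type: spot   # assumes Karpenter-managed nodes
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # multi-AZ spread
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: worker
      containers:
        - name: worker
          image: registry.example.com/worker:1.0.0   # placeholder image
          lifecycle:
            preStop:
              exec:
                # brief pause so in-flight work can drain before SIGTERM
                command: ["sh", "-c", "sleep 10"]
```

Pair this with a PodDisruptionBudget so that a wave of interruptions cannot take every replica down at once.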

How do I allocate Kubernetes cost to teams?

Label namespaces with team, cost center, environment, and product tags. Deploy OpenCost or Kubecost for per-namespace cost attribution. Start with showback – sharing cost reports without billing consequences – before moving to chargeback. Only ~14% of teams implement chargeback today. Even showback creates meaningful behavioral change around request sizing within weeks.
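
A minimal sketch of the labeling step, using the four tags listed above. The namespace name and all label values are hypothetical; what matters is that every namespace carries the same label keys so the allocation tool can group by them.

```yaml
# Sketch only: ownership labels that cost-allocation reports can group by.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout                 # hypothetical namespace
  labels:
    team: payments               # hypothetical owning team
    cost-center: cc-1042         # hypothetical finance code
    environment: production
    product: storefront          # hypothetical product line
```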

What are the biggest Kubernetes cost drivers?

In rough order of impact: compute waste from overprovisioned requests (83% of container costs are idle per Datadog 2024), GPU underutilization (5% average), cross-AZ network transfer charges, oversized and orphaned PVCs, unused LoadBalancer services, and observability/logging pipeline costs at scale.

OpenCost vs Kubecost vs an automation platform?

OpenCost: free, CNCF-incubating, vendor-neutral visibility. Good for teams comfortable acting on data manually. Kubecost: adds enterprise features (multi-cluster, budget alerts, recommendations) on top of OpenCost. Cast AI: goes beyond visibility to autonomous action – rightsizing pods, replacing nodes, optimizing Spot purchasing – with a free monitoring tier and optimization priced on savings delivered. The choice depends on whether you need to see the problem or automate the fix.

How do I reduce GPU cost in Kubernetes?

Configure NVIDIA GPU time-slicing to share one GPU across multiple inference pods. Use MIG partitioning on H100/A100 for isolated multi-tenant inference. Run batch training on Spot at 60–90% discount. Right-size inference workloads – test on smaller GPU types (T4, A10G) before assuming you need an A100. ALLEN Digital (case study) achieved 70%+ cost savings vs SageMaker by moving AI training to optimized Kubernetes-native scheduling.
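
For the time-slicing option above, a configuration might look like the following. This is a sketch assuming the NVIDIA GPU Operator is installed in a gpu-operator namespace; the ConfigMap name, the data key, and the replica count of 4 are illustrative, and the operator has to be pointed at this ConfigMap separately before it takes effect.

```yaml
# Sketch only: advertise each physical GPU as 4 schedulable nvidia.com/gpu units.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config      # hypothetical name
  namespace: gpu-operator        # assumed operator namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4          # illustrative: 4 pods share one GPU
```

Time-slicing shares compute time but provides no memory isolation between pods, which is why the text recommends MIG when tenants need isolation.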

Can I automate Kubernetes cost optimization safely?

Yes. Start automation in recommendation-only mode. Review recommendations for two weeks to build confidence in the system’s judgment. Enable automated application on non-critical workloads first, then expand. Cast AI Workload Autoscaler applies rightsizing without pod restarts and supports workload-level rollout controls. The operational risk of well-configured automation is lower than manual changes applied inconsistently at scale.

What KPIs should I track?

CPU request utilization rate (target 40–65% vs current 8% industry average), memory request utilization rate (target 50–70% vs current 20%), node bin-packing efficiency (target 70–85%), Spot percentage of fleet, cost per namespace month-over-month, GPU utilization rate (target 40–60% vs current 5%), and P99 pod scheduling latency as your reliability sentinel. Track the reliability KPI alongside cost KPIs to confirm optimization is not coming at the expense of performance.
