Kubernetes HPA: How Horizontal Pod Autoscaling Saves Cost

Kubernetes HPA (Horizontal Pod Autoscaler) automatically scales pod replicas up or down based on real demand. Despite years of Kubernetes maturity, fleet-wide CPU utilization still averages just 8% across production clusters (Cast AI 2026 State of Kubernetes Optimization Report, across tens of thousands of clusters on AWS, GCP, and Azure). Overprovisioning is not a configuration problem. It is a structural one. HPA is one of the tools that helps close the gap. If your workloads run at fixed replica counts regardless of traffic, you’re almost certainly over-provisioning during off-peak hours and potentially under-provisioning during peaks.

This guide covers how HPA works, how to configure it with the current autoscaling/v2 API, which metrics to scale on, and how it connects to real cloud cost reduction. We also cover the common mistakes that cause HPA to either flap or silently fail to scale at all.

Key Takeaways

Kubernetes HPA automatically adjusts pod replica counts based on CPU, memory, or custom metrics, checking every 15 seconds.
The autoscaling/v2 API is the current standard; autoscaling/v1 is still served but limits you to a single CPU metric.
Fleet-wide average CPU utilization is just 8% across production clusters (Cast AI 2026 State of Kubernetes Optimization Report); HPA is the mechanism that reclaims that wasted spend.
Scale-down uses a 5-minute stabilization window by default; scale-up is immediate. Both are tunable.
HPA and VPA conflict when both target CPU or memory; understand the safe co-use rules before combining them.
KEDA extends HPA with over 60 external metric sources, including Prometheus, SQS, and Kafka, and enables scale-to-zero.

What Is Kubernetes HPA?

Kubernetes HPA (Horizontal Pod Autoscaler) is a built-in control-plane controller that automatically increases or decreases the number of pod replicas in a Deployment, StatefulSet, or ReplicaSet based on observed metrics. It checks metrics every 15 seconds and adjusts replica counts to keep average utilization near a configured target. For StatefulSets, make sure pods are interchangeable peers or that the application handles per-pod state itself (as sharded or quorum-aware services do); naive horizontal scaling of a stateful database can cause split-brain or quorum loss. HPA does not apply to DaemonSets, which run one pod per eligible node and so have no replica count to adjust. For a broader view of Kubernetes autoscaling strategies, see the Cast AI guide to Kubernetes autoscaling for cloud cost optimization.

HPA addresses a fundamental problem with static deployments: traffic is not constant. A fixed replica count either wastes money during quiet periods or hits a ceiling during spikes. HPA removes both failure modes by treating replica count as a continuous variable, not a static setting.

How Kubernetes HPA Works

Understanding the internal mechanics of HPA helps you configure it correctly and diagnose failures when scaling doesn’t behave as expected.

The HPA Control Loop

The HPA Control Loop is the core algorithm that drives all scaling decisions. Every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period on the controller manager), the HPA controller executes this sequence:

1. Poll the appropriate metrics API for current metric values across all target pods.
2. Compute the desired replica count using the formula: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)
3. Apply a tolerance threshold: if the ratio falls within ±10% of 1.0, no scaling action occurs.
4. Check the stabilization window to prevent thrashing: scale-up has a 0-second window (immediate) and scale-down has a 300-second window (5 minutes) by default.
5. Enforce minReplicas and maxReplicas bounds.
6. Issue the scale command to the target workload.

For example, if a Deployment has 4 replicas, average CPU utilization is 80%, and the target is 60%, HPA computes: ceil(4 × 80 / 60) = ceil(5.33) = 6. Two replicas are added. This formula is sourced from the Kubernetes official documentation on Horizontal Pod Autoscaling.

The tolerance band prevents unnecessary churn when metrics hover near the threshold. Without it, HPA would continuously add and remove replicas in response to minor fluctuations.
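The replica arithmetic, tolerance band, and bounds check can be sketched in a few lines of Python. This is a simplified illustration, not the controller's actual code: the function name desired_replicas and its parameters are hypothetical, and the sketch ignores stabilization windows, rate-limit policies, and unready pods.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.10):
    """Simplified HPA replica math: ratio, tolerance band, then bounds."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Inside the tolerance band: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * current_value / target_value)
    # Clamp to the configured minReplicas/maxReplicas.
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(4, 80, 60, 2, 10))   # worked example from the text: 6
print(desired_replicas(4, 63, 60, 2, 10))   # ratio 1.05 is inside the band: 4
print(desired_replicas(4, 200, 60, 2, 10))  # ceil(13.33) = 14, capped at 10
```

The second call shows why the tolerance band matters: a 5% overshoot leaves the replica count untouched instead of triggering a scale-up.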

Metrics Server and Metrics Availability

HPA pulls metrics from three APIs depending on the metric type:

metrics.k8s.io: CPU and memory (served by metrics-server)
custom.metrics.k8s.io: custom per-pod or per-object metrics (served by prometheus-adapter or similar)
external.metrics.k8s.io: metrics from outside the cluster (served by KEDA or a custom adapter)

For CPU and memory scaling, metrics-server must be deployed and healthy. Verify it with kubectl top pods. If that command fails, HPA will show <unknown> for utilization and no scaling will occur. Install it with:

# Pin to a specific version. Replace v0.7.2 with the latest stable from github.com/kubernetes-sigs/metrics-server/releases
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.7.2/components.yaml

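The controller also discounts CPU readings from pods that are still starting up. Two controller-manager flags govern this: --horizontal-pod-autoscaler-initial-readiness-delay (default 30 seconds) sets the short window after start in which an unready pod is treated as not yet ready, and --horizontal-pod-autoscaler-cpu-initialization-period (default 5 minutes) sets the longer window in which a pod's CPU samples are only counted once it is ready. Both behaviors keep start-up CPU spikes from distorting the average and triggering premature scaling.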

Configuring Kubernetes HPA

HPA configuration lives in a HorizontalPodAutoscaler manifest. All examples below use autoscaling/v2, which has been stable since Kubernetes 1.23 (the autoscaling/v2beta2 API it replaced was removed in 1.26). The older autoscaling/v1 is still served, but it supports only a single CPU metric via targetCPUUtilizationPercentage and lacks multi-metric support, stabilization windows, and behavior tuning. Migrate any existing v1 manifests to v2.

Before configuring HPA, make sure every container in the target Deployment has resources.requests.cpu set. Without it, HPA cannot compute a utilization percentage and will not scale.

HPA Metric Types

The autoscaling/v2 API supports five metric types. Understanding which one fits your workload determines whether HPA will scale accurately:

Metric Type	Source	Example Use Case
Resource	Metrics Server	Scale on CPU/memory usage
Pods	Custom metrics adapter	Requests per second per pod
Object	Custom metrics adapter	Queue depth in a message broker
External	External metrics adapter	SQS queue depth, Pub/Sub backlog
ContainerResource	Metrics Server	Scale on a specific container’s CPU

Multiple metrics can be combined in a single HPA spec. When you do, HPA scales to satisfy the most demanding metric, not the average across them. If CPU says scale to 5 replicas and a custom metric says scale to 8, HPA targets 8.
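The "most demanding metric wins" rule can be sketched as follows. This is an illustrative model under simplifying assumptions (no tolerance band, no bounds); the function names and the sample readings are hypothetical.

```python
import math

def replicas_for_metric(current_replicas, current_value, target_value):
    """Per-metric proposal using the same ratio formula as the control loop."""
    return math.ceil(current_replicas * current_value / target_value)

def combined_proposal(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs; the largest proposal wins."""
    return max(replicas_for_metric(current_replicas, cur, tgt) for cur, tgt in metrics)

# Hypothetical readings with 4 replicas running:
# CPU at 75% against a 60% target, and 200 req/s per pod against a 100 req/s target.
cpu = (75, 60)
rps = (200, 100)
print(replicas_for_metric(4, *cpu))       # 5
print(replicas_for_metric(4, *rps))       # 8
print(combined_proposal(4, [cpu, rps]))   # 8
```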

CPU-Based HPA (autoscaling/v2 YAML)

This is the most common configuration. Target 60% average CPU utilization, which leaves headroom for traffic spikes before pods saturate:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  # Explicit policies prevent silent application of defaults (max 4 pods or 100% per 15s)
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 30

A few notes on this config. minReplicas: 2 ensures you don’t drop to a single pod, which would be a single point of failure for any service with an availability SLA. scaleUp.stabilizationWindowSeconds: 60 controls how far back HPA looks before acting: it takes the minimum of all recommendations from the last 60 seconds, which prevents over-scaling on short-lived metric spikes. The actual rate limit comes from the policies block: type: Percent, value: 50, periodSeconds: 30 means HPA can add at most 50% more replicas every 30 seconds. On the way down, the Pods policy removes at most one pod every 60 seconds.
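How the scale-up stabilization window and the Percent policy interact can be sketched in Python. This is a simplified model, not the controller's implementation: the function names are hypothetical, and it assumes the replica count did not change earlier in the current policy period.

```python
import math

def stabilized_scale_up(recommendations):
    """Scale-up stabilization: act on the smallest recommendation seen in the window."""
    return min(recommendations)

def percent_policy_cap(current_replicas, percent):
    """Largest replica count a Percent scale-up policy allows within one period."""
    return math.ceil(current_replicas * (1 + percent / 100))

def next_replicas(current_replicas, recommendations, percent=50, max_replicas=10):
    target = stabilized_scale_up(recommendations)
    capped = min(target, percent_policy_cap(current_replicas, percent), max_replicas)
    # The scale-up path never reduces replicas; scale-down has its own window.
    return max(current_replicas, capped)

# 4 replicas running; recommendations gathered over the last 60 seconds.
print(next_replicas(4, [9, 9, 9]))  # sustained demand, capped by the 50% policy: 6
print(next_replicas(4, [9, 4, 9]))  # one dip inside the window holds the count: 4
```

The second call illustrates the trade-off of a non-zero scale-up window: a single low recommendation inside the window delays the scale-up until it ages out.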

Memory-Based HPA

Memory-based scaling uses the same structure, but targets memory rather than cpu. However, memory scaling requires care: unlike CPU, memory is not compressible. A pod that approaches its memory limit doesn’t slow down gradually; it gets OOM-killed. Memory-based HPA works best for workloads with predictable, proportional memory growth (such as in-memory caches or batch jobs loading data sets):

metrics:
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 70

For workloads where memory usage grows due to application-level leaks rather than real load increases, memory-based HPA will add replicas without solving the underlying problem. Profile memory usage patterns before relying on memory as the primary scaling signal.

Custom and External Metrics (KEDA ScaledObject Example)

CPU and memory metrics don’t tell the full story for every workload. A queue consumer should scale on queue depth, not CPU. A streaming processor should scale on consumer lag. For these cases, KEDA (Kubernetes Event-Driven Autoscaling) is the standard solution in 2026. KEDA deploys as an operator and extends the external.metrics.k8s.io API with over 60 built-in scalers. It also enables scale-to-zero (minReplicaCount: 0), which native HPA does not support.

The following ScaledObject scales a Deployment based on HTTP request rate from Prometheus. When the rate exceeds 100 requests/second per replica, KEDA adds pods:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-scaler
  namespace: default
spec:
  scaleTargetRef:
    name: my-app
    kind: Deployment
    apiVersion: apps/v1
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: http_requests_per_second
      threshold: "100"
      query: sum(rate(http_requests_total{job="my-app", namespace="default"}[2m]))
      # Replace job/namespace labels with your actual Prometheus labels. KEDA serves this trigger through the external metrics API; confirm it is registered with: kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .

KEDA creates the metrics API endpoint automatically; no custom adapter configuration is needed. The HPA controller reads the metric through the standard external metrics API, so HPA’s own scaling algorithm still runs. KEDA manages metric exposure; native HPA does the actual replica math.

Monitoring HPA After Deployment

After enabling HPA, track these key Prometheus metrics to confirm it is working correctly. They are exported by kube-state-metrics, a prerequisite separate from metrics-server. Install it from the manifests in the project's examples/standard directory (kubectl apply -f examples/standard from a checkout of kubernetes/kube-state-metrics) or via helm install ksm prometheus-community/kube-state-metrics.

# Current vs desired replicas: triggers investigation when current < desired for > 5m
kube_horizontalpodautoscaler_status_current_replicas{namespace="default", horizontalpodautoscaler="my-app-hpa"}
kube_horizontalpodautoscaler_status_desired_replicas{namespace="default", horizontalpodautoscaler="my-app-hpa"}

# Alert when HPA is capped at maxReplicas: means your maxReplicas is too low
kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas

Track these weekly after initial deployment. An HPA pegged at maxReplicas is the most common signal that your scaling ceiling needs adjustment.

HPA vs VPA: Which Should You Use?

Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) solve different problems. HPA adjusts the number of pods (horizontal scaling). VPA adjusts the CPU and memory requests allocated to each pod (vertical scaling).

The key distinction in practice: HPA is reactive to live traffic. Add more pods when load increases; remove them when it drops. VPA is based on historical usage patterns. Over time, it observes that a pod rarely uses more than 500m CPU, so it reduces requests from 2000m to 600m. Both solve real problems. However, running them together on the same metric creates a conflict that the Kubernetes autoscaler maintainers have confirmed (kubernetes/autoscaler issue #6060).

The conflict works like this: VPA raises a pod’s CPU request from 500m to 1000m. Suddenly, the pod reports 50% CPU utilization instead of 100%, even though actual usage is unchanged. HPA sees utilization drop below target and scales in, removing replicas. Now each remaining pod handles more traffic, CPU spikes, and VPA raises requests further. The cycle continues.
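The first step of that loop is plain arithmetic, sketched below. The numbers for replicas and target utilization are hypothetical, the helper names are illustrative, and the tolerance band is ignored for brevity.

```python
import math

def utilization_percent(usage_millicores, request_millicores):
    """CPU utilization as HPA computes it: usage relative to the request."""
    return 100 * usage_millicores / request_millicores

def hpa_desired(current_replicas, utilization, target_utilization):
    return math.ceil(current_replicas * utilization / target_utilization)

usage = 500     # actual per-pod CPU usage in millicores, unchanged throughout
replicas = 4    # hypothetical current replica count
target = 80     # hypothetical HPA target utilization (%)

before = utilization_percent(usage, 500)    # request is 500m
after = utilization_percent(usage, 1000)    # VPA raised the request to 1000m

print(before, hpa_desired(replicas, before, target))  # 100.0 5
print(after, hpa_desired(replicas, after, target))    # 50.0 3
```

Nothing about the workload changed between the two lines, yet HPA's answer moved from "add a pod" to "remove one" purely because the denominator changed.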

The safe co-use patterns:

HPA on CPU + VPA in recommendation-only mode (updateMode: "Off"): VPA surfaces request suggestions but doesn’t apply them automatically. You review and apply manually.
HPA on custom metrics + VPA in Auto mode: HPA scales on queue depth or request rate (not CPU), so VPA’s request changes don’t corrupt HPA’s signal.

Using both in Auto mode on the same resource metric is a misconfiguration that produces unpredictable scaling behavior.

How HPA Saves Cloud Cost

The Cast AI 2026 State of Kubernetes Optimization Report, which analyzed production clusters across tens of thousands of environments on AWS, GCP, and Azure, found that average CPU utilization across the fleet sits at just 8%. Additionally, 69% of requested CPU goes entirely unused across those production clusters (Cast AI 2026 State of Kubernetes Optimization Report, tens of thousands of clusters).

HPA addresses this directly: when traffic drops, it removes replicas. Fewer replicas means fewer pods requesting CPU and memory. Combined with Cluster Autoscaler, fewer pods means fewer active nodes. The cost savings compound: fewer node-hours billed, fewer resource reservations held idle.

However, there’s a subtlety that most teams miss. HPA’s utilization percentage is computed as actual_usage / requested_amount. If requests are inflated, say a pod requests 2000m CPU but actually uses 500m, HPA sees 25% utilization and concludes there is plenty of headroom. It can trim replicas down to minReplicas, but it never touches the requests themselves. The waste remains, expressed as idle capacity inside each running pod rather than as extra replica count.
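A quick calculation makes the hidden waste visible. The sketch below reuses the 2000m/500m figures from the paragraph above; the 600m rightsized request and the helper names are assumptions for illustration.

```python
def utilization_percent(usage_m, request_m):
    """CPU utilization as HPA sees it: usage divided by the request."""
    return 100 * usage_m / request_m

def idle_millicores(usage_m, request_m):
    """Requested CPU that the pod reserves but does not use."""
    return request_m - usage_m

usage = 500  # actual per-pod usage in millicores

for request in (2000, 600):  # inflated request vs. a hypothetical rightsized one
    print(request,
          round(utilization_percent(usage, request), 1),
          idle_millicores(usage, request))
```

With the inflated request, each pod reports 25% utilization while reserving 1500m it never uses. With a 600m request, the same pod reports roughly 83%, so the signal HPA scales on tracks real load.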

This is where request sizing becomes a prerequisite for effective HPA. Cast AI Workload Autoscaler observes actual CPU and memory consumption over time and corrects requests toward actual usage. Once requests reflect real consumption, HPA’s utilization percentage becomes accurate. The Workload Autoscaler adjusts the baseline; HPA manages the scaling response. Unlike VPA in Auto mode, this approach avoids the feedback loop described above because it operates on the requests that HPA uses as its denominator, not as a competing autoscaler.

HPA and Node Autoscaling Work Together

HPA scales pod replicas. When replicas increase beyond available node capacity, Cluster Autoscaler or Karpenter provisions new nodes. When replicas decrease, idle nodes may be removed. Getting the timing right between HPA and your node autoscaler matters.

For Cluster Autoscaler, set --scale-down-delay-after-add (default 10m) to be at least equal to HPA’s scaleDown.stabilizationWindowSeconds (default 5m). A common mistake: Cluster Autoscaler removes a node 3 minutes after HPA scaled up, before the stabilization window confirms the scale is stable, causing thrash.

For Karpenter, set consolidateAfter: 10m (or match your HPA scaleDown window) to avoid removing nodes while HPA is still observing the scale:

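spec:
  disruption:
    # Karpenter v1 NodePool field names; v1beta1 used WhenUnderutilized, which could not be combined with consolidateAfter
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 10m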

Common HPA Mistakes

Most HPA failures come down to a small set of repeatable misconfigurations. These are the ones worth checking first when HPA isn’t behaving as expected.

Missing resource requests

If resources.requests.cpu is not set on a container, HPA cannot compute utilization. The HPA status shows <unknown> and no scaling occurs. This is the most common silent failure. Every container in every targeted pod must have requests set.

Using autoscaling/v1

The older API only supports targetCPUUtilizationPercentage, a single-metric CPU config with no stabilization windows. Migrate to autoscaling/v2. The spec.metrics[] array replaces the old field and supports all five metric types.

Not tuning stabilization windows

The default scale-down window is 300 seconds. For workloads with highly variable traffic, replica counts can still oscillate at that setting, and a longer window such as 600 seconds is often a better starting point. Conversely, the default scale-up window of 0 seconds fires immediately, which can overshoot if multiple pods start simultaneously. A 60-second scale-up window prevents HPA from stacking replicas before the first wave finishes initializing.

maxReplicas set too low

When HPA hits the replica ceiling during a traffic spike, pods stay at maximum count but the workload continues to degrade. The symptom is sustained high latency with no further scaling events. Set maxReplicas based on measured peak load, not a conservative estimate. Monitor HPA events with kubectl describe hpa to see when the ceiling is hit.

Running HPA and VPA both targeting CPU in Auto mode

As described above, this creates a feedback loop. Check that you’re using one of the safe co-use patterns: HPA on CPU with VPA in recommendation-only mode (updateMode: "Off"), or HPA on custom metrics with VPA in Auto mode.

Oversized resource requests corrupting the signal

When requests are set far above actual usage (a pattern confirmed by the Cast AI finding that 69% of requested CPU goes unused), HPA sees artificially low utilization and delays scaling. The fix is accurate requests.

Wrong target type for Pod metrics

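Pods-type metrics only accept AverageValue as the target type, so HPA always compares the per-pod average against the target. The Value versus AverageValue choice applies to Object and External metrics: Value compares the raw metric to the target, while AverageValue first divides it by the number of pods. Pairing a fleet-wide total with a target that was sized per pod produces wildly incorrect replica counts as the fleet grows.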
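The difference between the two target types is easiest to see with numbers. This is a simplified sketch with hypothetical per-pod request rates and function names; it leaves out the tolerance band and replica bounds.

```python
import math

def desired_average_value(per_pod_values, target_average):
    """AverageValue: total metric divided by the per-pod target."""
    return math.ceil(sum(per_pod_values) / target_average)

def desired_value(current_replicas, metric_value, target_value):
    """Value: the raw metric is compared directly to the target, then scaled by replicas."""
    return math.ceil(current_replicas * metric_value / target_value)

per_pod_rps = [150, 120, 180]   # hypothetical req/s on each of 3 pods
target = 100                    # target sized per pod

print(desired_average_value(per_pod_rps, target))                  # 5
print(desired_value(len(per_pod_rps), sum(per_pod_rps), target))   # 14
```

Feeding the fleet-wide total (450 req/s) into a Value comparison against a per-pod target asks for 14 replicas where 5 would do, and the error grows with every pod added.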

minReplicas set to 1 for HA services

During scale-down, HPA will reduce to a single replica if minReplicas: 1. For any service that cannot tolerate a pod restart causing a brief outage, set minReplicas: 2 at minimum.

Not adding a PodDisruptionBudget

HPA’s minReplicas: 2 guarantees the autoscaler won’t scale below 2 replicas. But it does not prevent a voluntary disruption, such as a node drain during cluster maintenance, from evicting both replicas at the same time. Add a PDB alongside any HA workload:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app

With minAvailable: 1, Kubernetes refuses any voluntary eviction that would leave fewer than one available replica. With two replicas running, that means at most one can be disrupted at a time.
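The eviction budget a PDB grants can be modeled with one subtraction. This is a simplified sketch of the idea behind the PDB's status.disruptionsAllowed field, with a hypothetical function name; the real controller also accounts for pod health and expected replica counts.

```python
def disruptions_allowed(healthy_pods, min_available):
    """Voluntary evictions permitted right now under a minAvailable budget."""
    return max(0, healthy_pods - min_available)

print(disruptions_allowed(2, 1))   # 1: one of two replicas may be evicted
print(disruptions_allowed(1, 1))   # 0: the last replica is protected
print(disruptions_allowed(10, 1))  # 9: a large fleet gets little protection
```

The last line is the reason to revisit the budget as HPA scales out: a fixed minAvailable: 1 protects a two-pod service well but allows nearly the whole fleet to be drained at ten pods. A percentage-based minAvailable or a maxUnavailable setting scales with the replica count.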

Conclusion

Kubernetes HPA is one of the few platform primitives that directly reduces cloud spend without requiring application changes. It removes the choice between over-provisioning for peaks and under-provisioning for normal traffic by making replica count dynamic. The autoscaling/v2 API gives you multi-metric support, stabilization window tuning, and per-container resource targeting, tools that make HPA production-grade rather than a basic demo feature.

For HPA to work accurately, the requests it scales against must reflect actual usage. Oversized requests corrupt the utilization signal and leave idle capacity that HPA can’t see. Rightsizing those requests, then letting HPA respond to accurate utilization data, is how teams close the gap between the 8% average CPU utilization reported in production and something approaching efficient use of reserved capacity.

For a broader look at how HPA fits into a complete Kubernetes cost optimization strategy alongside Cluster Autoscaler, VPA, and node rightsizing, see the Cast AI guide to Kubernetes autoscaling for cloud cost optimization.


Frequently Asked Questions

What is Kubernetes HPA?

Kubernetes HPA (Horizontal Pod Autoscaler) is a built-in Kubernetes controller that automatically scales the number of pod replicas in a Deployment, StatefulSet, or ReplicaSet based on observed metrics. It checks metrics every 15 seconds and adjusts replica counts to keep average utilization near a configured target. HPA supports CPU, memory, custom per-pod metrics, object metrics, and external metrics from systems like Prometheus, SQS, or Kafka. It does not apply to DaemonSets. Source: Kubernetes documentation on Horizontal Pod Autoscaling.

How does HPA decide when to scale?

HPA computes the desired replica count using the formula: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). It applies a ±10% tolerance band; if the ratio is within that range, no action occurs. Scale-up is immediate by default (0-second stabilization window); scale-down waits 5 minutes (300 seconds) to avoid reacting to transient spikes. Both windows are configurable in the behavior block of the HPA spec.

What is the difference between HPA and VPA?

HPA adds or removes pod replicas (horizontal scaling). VPA adjusts the CPU and memory requests assigned to each pod (vertical scaling). They serve different purposes: HPA handles traffic variability by changing replica count; VPA handles overprovisioning by rightsizing each pod. Running both in Auto mode on the same CPU or memory metric creates a feedback loop: VPA changes requests, HPA reacts to the changed utilization percentage, and the two oscillate. Safe co-use requires either using HPA on custom metrics with VPA in Auto mode, or using HPA on CPU with VPA in Recommend-only mode.

Can HPA scale on custom metrics?

Yes. The autoscaling/v2 API supports five metric types: Resource (CPU/memory), Pods (custom per-pod averages), Object (metrics from a specific Kubernetes object), External (metrics from outside the cluster), and ContainerResource (per-container resource metrics). For external metrics like Prometheus query results, SQS queue depth, or Kafka consumer lag, KEDA provides over 60 built-in scalers and exposes them through the standard external metrics API that HPA reads. KEDA also enables scale-to-zero, which native HPA does not support.

Why is my HPA not scaling?

The most common causes are:
(1) resources.requests.cpu is not set on containers, so HPA cannot compute utilization and shows <unknown>;
(2) metrics-server is not installed or is unhealthy (verify with kubectl top pods);
(3) the HPA is at minReplicas during scale-down or at maxReplicas during scale-up;
(4) the metric is within the ±10% tolerance band and no action is needed;
(5) the stabilization window has not elapsed yet.

Use these commands to diagnose:

# Verify metrics-server is registered
kubectl get apiservices | grep metrics
# Verify custom metrics API is available (requires custom metrics adapter)
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2" | jq .
# Check HPA status and events
kubectl describe hpa my-app-hpa -n default | grep -A20 "Events:"

Run kubectl describe hpa <name> to see current metric values, replica bounds, and any scaling events.
