You deployed Karpenter, watched it scale your EKS cluster in seconds, and assumed the hard part was done. Then consolidation fired at 2 PM on a Tuesday, restarted pods your users were actively using, and your SLO alert fired. Or a Spot interruption hit before Karpenter could drain the node gracefully, because the SQS queue was never wired up. Or someone pushed a deployment with inflated resource requests and Karpenter provisioned 40 nodes in three minutes. These are not edge cases; they are the predictable result of near-default configuration in a real production environment.
Karpenter’s v1 API, which reached stable release in mid-2024 and is maintained under the kubernetes-sigs GitHub organization, has matured significantly. NodePool and EC2NodeClass replace the legacy Provisioner and AWSNodeTemplate CRDs, disruption budgets give you control over the pace and blast radius of voluntary disruptions, and the SQS interruption pipeline is well-tested when properly wired. These ten Karpenter best practices cover what the official documentation explains and what it leaves out, with production-ready v1 API YAML throughout. This guide covers the AWS/EKS implementation of Karpenter; while the project is expanding to other cloud providers, the YAML examples use the karpenter.k8s.aws API group, which is AWS-specific.
1. Run Karpenter on Dedicated Nodes
If Karpenter runs on nodes it manages, a consolidation event can evict the controller itself mid-operation, leaving your cluster unable to provision new nodes until the pod restarts on whatever capacity remains. This circular dependency is the most common cause of the “Karpenter went quiet” incident pattern in production clusters.
Run the controller on AWS Fargate using a Fargate profile scoped to the karpenter namespace, or on a dedicated managed node group carrying a taint of karpenter.sh/controller:NoSchedule. Add the matching toleration to the Karpenter deployment only. Either approach isolates the controller from the workloads it manages and ensures it survives any consolidation event affecting the rest of the cluster.
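As a minimal sketch of the dedicated node group option, the fragment below shows Helm values that keep the controller on the tainted nodes. It assumes the Karpenter chart's top-level tolerations and nodeSelector values, and the role: karpenter-controller label is a hypothetical name standing in for whatever label your dedicated node group actually carries.

```yaml
# values.yaml fragment for the Karpenter Helm chart (sketch, not a full values file)
tolerations:
  # Matches the karpenter.sh/controller:NoSchedule taint on the dedicated node group
  - key: karpenter.sh/controller
    operator: Exists
    effect: NoSchedule
nodeSelector:
  # Hypothetical label; replace with the label your dedicated node group applies
  role: karpenter-controller
```

The toleration only permits scheduling onto the tainted nodes; the nodeSelector is what keeps the controller off Karpenter-managed capacity. Overriding tolerations replaces any defaults the chart sets, so re-add those if you depend on them.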
2. Design Mutually Exclusive NodePools
When multiple NodePools can satisfy the same pod’s scheduling requirements, Karpenter selects the NodePool with the highest spec.weight value; if weights are equal, selection is non-deterministic. Without explicit isolation, workloads land in unpredictable pools, team-level cost attribution becomes impossible, and enforcing instance-type or capacity-type policies per workload tier requires constant manual auditing. At scale, this ambiguity compounds: 500-node clusters with overlapping pools produce scheduling behavior that is genuinely difficult to predict or debug.
Use distinct taints per NodePool to enforce isolation by default. Use a toleration matching the NodePool’s taint on workloads that must target a specific pool. Pods without that toleration will not land on those nodes. When pools need to overlap intentionally, use spec.weight to establish explicit priority.
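On the workload side, a pod template targeting a tainted pool might look like the sketch below. It assumes a NodePool named frontend that taints its nodes with team=frontend:NoSchedule, and it relies on the karpenter.sh/nodepool label that Karpenter applies to the nodes it launches.

```yaml
# Pod template fragment for a workload that should run in the frontend pool (sketch)
spec:
  template:
    spec:
      tolerations:
        # Allows the pod onto nodes tainted team=frontend:NoSchedule
        - key: team
          operator: Equal
          value: frontend
          effect: NoSchedule
      nodeSelector:
        # Pins the pod to nodes launched by the NodePool named "frontend"
        karpenter.sh/nodepool: frontend
```

A toleration alone only allows the pod onto the tainted nodes; without the nodeSelector it could still be scheduled into any untainted pool.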
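apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: frontend
spec:
  template:
    spec:
      # nodeClassRef is required in v1; this assumes an EC2NodeClass named "default" exists
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: team
          value: frontend
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
3. Set Resource Limits on Every NodePool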
Without resource limits on a NodePool, a misconfigured deployment, a runaway HPA, or an autoscaling bug can provision hundreds of nodes before any billing alert fires. NodePool limits are a provisioning guard that is independent of your cloud cost alerts. Limit checking is eventually consistent: during rapid scale-outs, provisioning can briefly exceed the configured threshold before Karpenter reconciles and halts new requests.
Set spec.limits.cpu and spec.limits.memory on every NodePool. When Karpenter reaches the limit, it stops provisioning and pending pods remain unscheduled until capacity frees up. Size limits to 110-120% of expected peak load to avoid creating artificial ceilings that block legitimate scaling events while still catching runaway provisioning.
spec:
  limits:
    cpu: "1000"
    memory: 4000Gi
4. Enable Spot Interruption Handling via SQS, Not Node Termination Handler
Spot interruption handling in Karpenter is not automatic. Without the SQS integration, Karpenter only learns of a Spot interruption when the node disappears from the Kubernetes API, which is often too late for workloads that require graceful connection draining, checkpoint saves, or coordinated shutdown sequences. When the SQS queue is configured via --interruption-queue, Karpenter receives the EC2 Spot Instance Interruption Warning through EventBridge up to two minutes before the node is reclaimed and uses that window to cordon and drain proactively. The Node Termination Handler (NTH) that predates Karpenter creates a worse outcome: it conflicts with Karpenter’s own interruption handling by racing to drain the same node, leaving pods in indeterminate states.
Create an SQS queue and five EventBridge rules that forward the following events to it: EC2 Spot Instance Interruption Warning, EC2 Instance Rebalance Recommendation, EC2 Instance State-change Notification, AWS Health Event (source: aws.health), and EC2 Capacity Reservation Instance Interruption Warning. Reference the queue in the Karpenter controller configuration via the --interruption-queue flag or the INTERRUPTION_QUEUE environment variable on the controller deployment. The queue must be in the same AWS region as your cluster. Do not run NTH alongside Karpenter for interruption handling; NTH can still be used alongside Karpenter if you specifically need to act on Spot Rebalance Recommendations, though this introduces additional node churn.
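The queue-and-rule wiring described above can be sketched in CloudFormation. This is an illustrative fragment, not the project's official template: the resource names are hypothetical, the queue name is chosen to match the IAM policy shown in this section, and only one of the five rules is written out.

```yaml
# CloudFormation fragment (sketch): queue, queue policy, and one of the five rules
Resources:
  KarpenterInterruptionQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: karpenter-interruption-queue
      MessageRetentionPeriod: 300   # seconds; interruption messages go stale quickly
      SqsManagedSseEnabled: true
  KarpenterInterruptionQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref KarpenterInterruptionQueue
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          # Without this statement EventBridge cannot deliver events to the queue
          - Effect: Allow
            Principal:
              Service:
                - events.amazonaws.com
                - sqs.amazonaws.com
            Action: sqs:SendMessage
            Resource: !GetAtt KarpenterInterruptionQueue.Arn
  SpotInterruptionRule:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Spot Instance Interruption Warning
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
```

The remaining four rules would follow the same shape with a different detail-type; the AWS Health rule uses source aws.health instead of aws.ec2.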
Instance status check failures (system and instance status checks) are handled separately via the ec2:DescribeInstanceStatus API. This is a polling mechanism, not an EventBridge event, and it does not require the interruption queue to be configured. If you are pre-creating the SQS queue manually, the Karpenter controller IAM role requires at minimum the following permissions to interact with it:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueUrl"
      ],
      "Resource": "arn:aws:sqs:REGION:ACCOUNT_ID:karpenter-interruption-queue"
    }
  ]
}
5. Pin AMIs in Production
Using @latest in your EC2NodeClass means every node Karpenter launches pulls the current AMI at that moment. During an incident when Karpenter is replacing nodes rapidly, @latest can introduce an untested kernel or kubelet version mid-recovery, turning a capacity problem into a software compatibility problem. AMI drift across a cluster also makes debugging subtle node-level issues significantly harder when nodes were provisioned days or weeks apart.
Pin to a specific alias version for production EC2NodeClasses, validate AMI upgrades in staging first, and rely on Karpenter’s built-in drift detection to roll nodes in a controlled way once you update the pinned version intentionally.
spec:
  amiSelectorTerms:
    - alias: al2023@v20240807 # Pin to tested version
    # NOT: alias: al2023@latest
6. Maximize Instance-Type Diversity for Spot
Spot availability is tied to individual instance types within specific availability zones. A NodePool restricted to m5.xlarge and m5.2xlarge is competing in exactly two Spot capacity pools. When those pools are constrained during a regional capacity event, Karpenter falls back to on-demand or fails to provision entirely, which is the leading reason Spot adoption stalls for teams running at scale. Instance-type diversity is also the prerequisite for Spot-to-Spot consolidation: when performing single-node (1-to-1) Spot-to-Spot consolidation, Karpenter requires at least 15 instance types priced lower than the current running Spot instance. With fewer cheaper alternatives, Karpenter emits an Unconsolidatable event for that candidate. This 15-instance-type minimum does not apply to multi-node consolidations (many nodes collapsing to one).
Use category-level requirements instead of explicit instance-type lists to keep your pool of available Spot capacity broad across instance families and generations. SpotToSpotConsolidation is disabled by default and must be enabled explicitly. Set it via the FEATURE_GATES environment variable on the controller deployment or via the --feature-gates CLI flag: --feature-gates SpotToSpotConsolidation=true.
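As a sketch of the environment-variable route, the fragment below shows the entry as it might appear in the controller Deployment's container spec. How the variable gets there (Helm values, a kustomize patch, or a direct edit) depends on how you installed Karpenter and is left as an assumption.

```yaml
# Fragment of the Karpenter controller Deployment's container spec (sketch)
env:
  - name: FEATURE_GATES
    # Comma-separated list of gates; only the one discussed here is shown
    value: "SpotToSpotConsolidation=true"
```

If your installation already sets other gates in this variable, append to the existing value rather than replacing it.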
requirements:
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["c", "m", "r"]
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt # Gt "2" selects generation 3 and newer
    values: ["2"]
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
7. Configure Consolidation and Disruption Budgets Correctly
Karpenter’s default disruption budget allows up to 10% of managed nodes to be voluntarily disrupted simultaneously, but it applies around the clock with no schedule awareness. Without custom budgets, consolidation runs during business hours when your traffic is peaking, and mid-day pod restarts from consolidation are indistinguishable from application errors until you trace the timeline. Disruption budgets let you define when voluntary disruptions are allowed and when they must pause.
Use WhenEmptyOrUnderutilized for active consolidation paired with a consolidateAfter delay of at least 1 minute to prevent thrashing on transient load dips. Set nodes: "0" in a budget scheduled for business hours to freeze all voluntary disruptions (including consolidation and drift) during your peak traffic window. Note: budget schedules are evaluated in UTC; adjust the cron expression for your local business hours accordingly.
At scale, consolidation frequency matters as much as timing. A 500-node cluster with many underutilized nodes can trigger sustained disruption cycles lasting hours if consolidateAfter is tuned too aggressively. Tune it to 5 minutes for batch workloads where pod restarts are cheap, and 10 minutes or longer for stateful services. Watch karpenter_voluntary_disruption_decisions_total to confirm that budget windows are actually suppressing disruptions during the protected hours you configured.
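To make the UTC adjustment concrete, here is a sketch that assumes business hours of 09:00 to 17:00 in a UTC-5 timezone, which shifts the cron start to 14:00 UTC. It also adds an always-on 10% budget; the percentage and the 5m delay are illustrative values, not recommendations from the Karpenter project.

```yaml
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
    budgets:
      # Always-on cap; once you define budgets yourself, the built-in 10% default is not added for you
      - nodes: "10%"
      # 09:00 local (UTC-5) is 14:00 UTC; freezes voluntary disruptions for 8 hours
      - schedule: "0 14 * * mon-fri"
        duration: 8h
        nodes: "0"
```

When several budgets are active at once, the most restrictive one applies, so the nodes: "0" window wins during business hours. Cron schedules have no daylight-saving awareness, so a fixed UTC offset drifts by an hour when local clocks change.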
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - schedule: "0 9 * * mon-fri" # Business hours start
        duration: 8h
        nodes: "0" # No disruptions during business hours
8. Set Node Expiry to Enforce AMI Freshness
Long-lived nodes accumulate kernel drift, unpatched CVEs, and configuration state that was never intended to persist. In a cluster running for months without rolling replacements, some nodes will carry materially different software stacks than freshly launched ones, which complicates both incident response and security audits. Node expiry is the production-safe mechanism for enforcing rolling replacement on a schedule rather than waiting for a manual process or a crisis to force it.
Set expireAfter to 720h (30 days) or less. Expiration is a forceful disruption method: Karpenter begins draining expired nodes immediately and cannot be rate-limited via disruption budgets. PDBs are respected during draining, but misconfigured PDBs or karpenter.sh/do-not-disrupt annotations can block draining indefinitely. Pair expireAfter with terminationGracePeriod to enforce a hard upper bound on node lifetime: the maximum lifetime equals the sum of the two values. Once expireAfter elapses, Karpenter begins draining the node and allows up to terminationGracePeriod for pods to exit before forceful termination.
In clusters with multiple NodePools, uniform expiry values create synchronized replacement waves that can spike provisioning load and EC2 API rate limits at the same time. Stagger expiry windows by assigning different values per pool: 720h for stable API workload pools and 360h for batch processing pools. This distributes replacement traffic across time and reduces the risk of RunInstances throttling during a replacement wave at cluster scale.
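A staggered setup might look like the following sketch. The pool names api and batch and the terminationGracePeriod values are assumptions chosen for illustration, and the fragments omit nodeClassRef and requirements for brevity.

```yaml
# NodePool fragment for stable API workloads: 30-day expiry
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: api
spec:
  template:
    spec:
      expireAfter: 720h
      terminationGracePeriod: 24h # assumed value; max node lifetime = 720h + 24h
---
# NodePool fragment for batch workloads: 15-day expiry
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch
spec:
  template:
    spec:
      expireAfter: 360h
      terminationGracePeriod: 2h # assumed value; max node lifetime = 360h + 2h
```

Because expiry is measured per node from its own creation time, different values per pool help most when many nodes in different pools were launched together, for example after a cluster-wide upgrade.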
spec:
  template:
    spec:
      expireAfter: 720h # 30-day rolling replacement
9. Use the do-not-disrupt Annotation Strategically
Karpenter’s consolidation and drift mechanisms will voluntarily disrupt any node unless configured otherwise. The do-not-disrupt annotation protects against voluntary disruption only, while expiration, Spot interruption, and manual node deletion bypass it entirely. Batch jobs that are mid-run, stateful workloads performing data writes, and services with 10-plus minute warmup times will restart if their node gets consolidated at an inopportune moment. The do-not-disrupt annotation instructs Karpenter to leave the hosting node alone until the annotated pod completes its work.
Apply karpenter.sh/do-not-disrupt: "true" to pods where mid-run disruption causes data loss or unacceptable cold-start latency, particularly ML training jobs and data pipeline workers. The annotation also accepts Go duration strings (e.g., "30m") to set a time-limited protection window instead of permanent protection, which is useful for jobs with a known maximum runtime.
One important edge case: if terminationGracePeriod is configured on the NodePool, Karpenter can still disrupt an annotated pod via drift once the grace period expires, even if a PDB would otherwise block it. Do not rely solely on do-not-disrupt for indefinite protection when terminationGracePeriod is set. Apply it selectively: annotating all production pods defeats consolidation and leads to node sprawl that grows silently over time.
Audit annotation coverage regularly with kubectl get pods -A -o json | jq '.items[] | select(.metadata.annotations["karpenter.sh/do-not-disrupt"] == "true") | .metadata.name'. When protected pods exceed 20 to 30 percent of your total workload, consolidation efficiency degrades enough to offset the Spot savings the NodePool was designed to capture. Review protected workloads each quarter and remove the annotation from jobs where application-level checkpointing now handles mid-run interruptions safely.
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
10. Monitor with Prometheus: The Metrics That Actually Matter
Karpenter exposes its operational state through Prometheus metrics on port 8080, but those metrics are only useful when you know which signals to watch. Without alerts on the relevant counters and histograms, provisioning bottlenecks and consolidation storms go undetected until they surface as SLO failures or billing surprises. Karpenter optimization starts with knowing what your cluster is actually doing.
Alert on karpenter_pods_provisioning_startup_duration_seconds at p95 above 60 seconds, which indicates a provisioning bottleneck. (This metric has ALPHA stability; karpenter_pods_startup_duration_seconds is the STABLE counterpart but is a Summary type and does not expose _bucket series required for histogram_quantile.) Track karpenter_voluntary_disruption_decisions_total for sudden spikes that signal a consolidation storm. Watch karpenter_nodeclaims_disrupted_total filtered by the interruption reason label to identify Spot instability patterns. Monitor the ratio of karpenter_nodepools_usage to karpenter_nodepools_limit and alert above 0.8 so you know before your NodePool hits its ceiling and pods start queuing unscheduled.
Define these as a PrometheusRule so the alert definitions are version-controlled alongside the rest of your cluster configuration:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-alerts
  namespace: karpenter
spec:
  groups:
    - name: karpenter
      rules:
        - alert: KarpenterProvisioningLatencyHigh
          expr: histogram_quantile(0.95, rate(karpenter_pods_provisioning_startup_duration_seconds_bucket[5m])) > 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Karpenter p95 provisioning latency exceeds 60s"
        - alert: KarpenterNodePoolNearLimit
          expr: karpenter_nodepools_usage / karpenter_nodepools_limit > 0.8
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "NodePool {{ $labels.nodepool }} {{ $labels.resource_type }} utilization above 80% of configured limit"
Beyond alerting, a Grafana dashboard combining karpenter_nodepools_usage, karpenter_nodeclaims_disrupted_total, and karpenter_pods_provisioning_startup_duration_seconds gives engineering teams a real-time view of provisioning throughput, disruption frequency, and capacity headroom in one place. Karpenter ships four importable Grafana dashboard JSON files in its GitHub repository. The signals recommended in this section are split across two of them: provisioning latency in karpenter-performance-dashboard.json and disruption and capacity headroom in karpenter-capacity-dashboard.json.
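The disruption counter recommended in this section can be alerted on in the same way. The rule below is a sketch meant to sit in the same rules list of a PrometheusRule; the alert name is hypothetical and the threshold of 20 decisions in 15 minutes is a placeholder to tune against your own baseline.

```yaml
# Additional rule for the karpenter group (sketch; threshold is a placeholder)
- alert: KarpenterDisruptionDecisionsSpike
  # Fires when more than 20 voluntary disruption decisions occur within 15 minutes, cluster-wide
  expr: sum(increase(karpenter_voluntary_disruption_decisions_total[15m])) > 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Karpenter voluntary disruption decisions are spiking"
```

Comparing this series against your disruption budget windows is also a quick way to confirm that a nodes: "0" schedule is actually suppressing consolidation during protected hours.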
These ten tips get you to a production-grade Karpenter setup. There is, however, a ceiling to what Karpenter can do on its own. Karpenter makes provisioning and consolidation decisions based entirely on what pods declare in their resource requests, not on what they actually consume. A pod requesting 4 CPU and using 0.4 CPU still occupies 4 CPU worth of node capacity from Karpenter’s perspective. Across hundreds of pods with over-provisioned requests, this creates invisible waste that consolidation cannot recover, because the inputs Karpenter receives are inaccurate from the start.
Taking Karpenter Further with Cast AI
Cast AI operates as an autonomous optimization layer alongside Karpenter, not as a replacement for it. When pod resource requests are set higher than actual usage, Karpenter bins pods onto nodes sized for the declared request rather than real consumption. Cast AI’s workload rightsizing continuously analyzes actual CPU and memory consumption and adjusts pod resource requests to match reality. Karpenter consolidation requires pod restarts to move workloads to cheaper nodes; Cast AI’s Container Live Migration moves running containers without restarts, enabling zero-downtime consolidation that Karpenter alone cannot achieve (currently available on AWS). Spot interruptions arrive with a 2-minute warning that is often insufficient for stateful or latency-sensitive workloads; Cast AI’s predictive Spot interruption handling uses operational patterns to anticipate interruptions before the warning fires, reducing Spot interruptions by up to 94% across production deployments (based on Cast AI operational data).
Akamai ran this combination across production clusters and saw between 40 and 70 percent in cloud cost savings through rightsized resource requests and higher sustained Spot utilization. Yotpo reduced cloud spend by 40 percent through automated Spot adoption at a scale that manual NodePool tuning alone could not reach. Cast AI connects to an existing Karpenter cluster with no changes required to your NodePool or EC2NodeClass configuration.
Running Karpenter Well Is an Ongoing Practice
Karpenter’s v1 API is stable, its community is active under kubernetes-sigs, and it genuinely solves the node provisioning problem better than Cluster Autoscaler for dynamic workloads. The practices above reflect where real production clusters have broken and what configuration changes prevent those failures from recurring. Running Karpenter on dedicated infrastructure, isolating NodePools with taints, wiring up the SQS interruption pipeline, pinning AMIs, and configuring disruption budgets correctly gets you to a cluster that scales reliably and consolidates safely without surprises. Prometheus-based observability closes the loop on whether the configuration is working as intended. Each of these practices compounds: a well-isolated NodePool with proper limits and disruption budgets behaves predictably at 50 nodes and at 500, and that predictability is what makes autonomous infrastructure possible.



