Key takeaways
- CrashLoopBackOff is a pod status (Waiting.Reason), not an error code: the container starts, crashes, and Kubernetes keeps retrying with exponential backoff.
- Backoff sequence: 10s → 20s → 40s → 80s → 160s → 300s (capped at 5 minutes); it resets after 10 minutes of successful operation.
- Six root causes: application/config errors, OOM kills, failing liveness probes, bad image or entrypoint, missing dependencies, and init container failures.
- The CrashLoopBackOff Diagnostic Loop: get pods → describe pod → logs --previous → get events → fix.
- Exit code 137 = SIGKILL (128 + signal 9), most often an OOM kill. Exit codes 1 or 2 = application or config error. Init container failures show status Init:CrashLoopBackOff.
- Cast AI Workload Autoscaler detects OOMKill events and applies corrected memory limits immediately: up to 240 changes per hour, no manual intervention.
What is CrashLoopBackOff?
CrashLoopBackOff is a Kubernetes pod status (Waiting.Reason: CrashLoopBackOff) indicating that a container is repeatedly starting, crashing, and being restarted by the kubelet. It is not an error code. Kubernetes applies exponential backoff between each restart attempt to avoid overwhelming the node. The sequence is: 10s → 20s → 40s → 80s → 160s → 300s, capped at 5 minutes. After 10 consecutive minutes of successful operation, the counter resets to zero.
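The schedule above can be reproduced with a few lines of shell. This is a minimal sketch, assuming only the 10s base, the doubling, and the 300s cap stated here; backoff_schedule is a hypothetical helper name, not part of kubectl or the kubelet.

```shell
# Print the first N restart delays in seconds: start at 10, double, cap at 300.
# backoff_schedule is an illustrative helper, not a Kubernetes tool.
backoff_schedule() (
  attempts="${1:-6}"
  delay=10
  cap=300
  out=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    out="${out:+$out }$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$out"
)
```

Calling backoff_schedule 7 prints 10 20 40 80 160 300 300, which shows the delay settling at the cap from the sixth restart onward.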
This differs from ImagePullBackOff, which happens before the container starts — Kubernetes cannot pull the image (wrong tag, bad credentials, network issue). With CrashLoopBackOff, the image pulled fine; the container exits with a non-zero code. Deleting and recreating the pod only resets the backoff timer. The same crash will recur until you fix the underlying cause.
The common causes
| Cause | Signal | Exit code hint | kubectl clue | Fix section |
|---|---|---|---|---|
| Misconfig (env vars, entrypoint, permissions) | Container exits immediately on start | 1 or 2 | logs --previous shows config error | Config errors |
| Out-of-memory (OOM) kill | Container terminated by kernel | 137 | describe pod shows Reason: OOMKilled | OOM restarts |
| Failing liveness probe | Pod restarts but app seems healthy | 143 (SIGTERM) or 137 (SIGKILL if grace period exceeded) | describe pod shows Liveness probe failed | Probe failures |
| Bad image / wrong entrypoint | Container exits in milliseconds | 127 or 1 | logs --previous empty or exec error | Config errors |
| Missing dependency (ConfigMap, Secret, upstream service) | App crashes waiting for dep | 1 or 2 | describe pod shows volume mount error or missing resource | Config errors |
| Init container failure | Pod stuck before main container starts | 1 or 2 (init-specific) | STATUS shows Init:CrashLoopBackOff; describe pod shows init container exit | Init container failures |
Init container failures
Init container failures look subtly different in kubectl get pods. Instead of CrashLoopBackOff, the STATUS column shows Init:CrashLoopBackOff or Init:Error. The main application container never starts: Kubernetes runs init containers sequentially, and if one exits with a non-zero code, the kubelet keeps retrying that init container (unless the pod’s restartPolicy is Never, in which case the pod is marked failed) before anything after it can run.
The diagnostic commands are slightly different too, because kubectl logs <pod> defaults to the main container (which hasn’t run yet). You need to name the init container explicitly:
# List init container names for the pod
kubectl get pod <pod-name> -o jsonpath='{.spec.initContainers[*].name}'
# Read logs from a specific init container's last run
kubectl logs <pod-name> -c <init-container-name> --previous
Common init container failures break down into two categories. The first is waiting for an external dependency: a database readiness check, a service endpoint that isn’t up yet, or a network policy that blocks the init container’s probe. The second is a missing Secret or ConfigMap: the init container tries to mount or read a resource that doesn’t exist in the namespace, exits non-zero, and triggers the loop. Check both with kubectl describe pod; the Events section will show FailedMount or connection timeout errors before you even pull logs.
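When a pod has several init containers, the two commands above can be combined into a loop. The following is a rough sketch rather than a definitive tool: init_logs is a hypothetical helper name, the namespace defaults to default as an assumption, and the fallback to current logs is an illustrative choice for init containers that have no terminated instance yet.

```shell
# Print logs for every init container of a pod.
# init_logs is an illustrative helper; usage: init_logs <pod-name> [namespace]
init_logs() (
  pod="$1"
  ns="${2:-default}"
  # Space-separated init container names, as in the jsonpath query above
  names=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.initContainers[*].name}')
  for name in $names; do
    echo "=== init container: $name ==="
    # Prefer the last terminated instance; fall back to the current one if none exists
    kubectl logs "$pod" -n "$ns" -c "$name" --previous 2>/dev/null ||
      kubectl logs "$pod" -n "$ns" -c "$name"
  done
)
```

The first init container whose output ends in an error is usually the one holding the pod in Init:CrashLoopBackOff, since later init containers never get to run.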
How to diagnose it step by step
Follow The CrashLoopBackOff Diagnostic Loop: get → describe → logs → events → fix. Each step narrows the cause. Don’t skip steps — what looks like an OOM kill can be a liveness probe timeout, and the fixes are different.
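As a rough sketch, the first four stages of the loop can be chained into one helper so that no stage gets skipped. crashloop_triage is a hypothetical function name and the default namespace is an assumption; the kubectl invocations themselves are the ones used in the steps of this guide.

```shell
# Run the get -> describe -> logs -> events sequence for a single pod.
# crashloop_triage is an illustrative helper; usage: crashloop_triage <pod-name> [namespace]
crashloop_triage() (
  pod="$1"
  ns="${2:-default}"
  echo "--- 1. status and restart count"
  kubectl get pod "$pod" -n "$ns"
  echo "--- 2. exit code, reason, and pod events"
  kubectl describe pod "$pod" -n "$ns"
  echo "--- 3. logs from the last terminated instance"
  # Do not abort the sequence if there is no previous instance to read
  kubectl logs "$pod" -n "$ns" --previous || true
  echo "--- 4. namespace events sorted by time"
  kubectl get events -n "$ns" --sort-by='.lastTimestamp'
)
```

The output is meant to be read top to bottom; the fix itself still depends on which cause the evidence points to.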
Step 1: Confirm status and restart count
# Check STATUS and RESTARTS columns
kubectl get pods -n <namespace>
# NAME READY STATUS RESTARTS AGE
# payments-api-7d9f4b-xkj2p 0/1 CrashLoopBackOff 14 47m
# db-migrations-9c3a1b-zzz9k 0/1 Init:CrashLoopBackOff 3 12m
Step 2: Describe the pod (exit code and events)
# Shows exit codes, OOMKilled flag, probe failures, volume mount errors
kubectl describe pod <pod-name> -n <namespace>
# Key fields:
# Last State: Terminated
# Reason: OOMKilled <-- OOM kill
# Exit Code: 137
# Liveness probe failed: ... <-- probe issue
# Warning FailedMount ... <-- missing ConfigMap or Secret
Init container callout: If STATUS is Init:CrashLoopBackOff, the describe output’s Init Containers section shows each init container’s last exit code and reason. Look for FailedMount events or connection errors there; the main container’s Last State block will be empty since it never ran.
Step 3: Read the previous container’s logs
# --previous retrieves logs from the last terminated instance
# Without this flag you get the current (empty) container
kubectl logs <pod-name> -n <namespace> --previous
# Multi-container pods: specify the container
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous
# Init containers require the -c flag (main container hasn't run)
kubectl logs <pod-name> -n <namespace> -c <init-container-name> --previous
Sidecars in multi-container pods: Check each container individually. Service mesh and agent sidecars (Istio’s istio-proxy, which runs Envoy, or a Datadog agent) can consume memory that pushes the pod’s total usage over node capacity, causing the main container to be OOM-killed even if its own usage looks fine. Use kubectl logs <pod> -c <sidecar-name> --previous to check each one.
Step 4: Check cluster events
# Events sorted by time — look for OOMKill, probe failures, scheduling issues
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Step 5: Check live resource usage
# Requires metrics-server
kubectl top pod <pod-name> -n <namespace> --containers
Step 6: Extract the exit code programmatically
# Read exit code from lastState — 137=OOMKilled, 1/2=app error, 127=command not found
kubectl get pod <pod-name> -n <namespace> \
-o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
Prometheus alert for proactive OOM detection: Track container_memory_working_set_bytes against memory limits. Alert at 80%; that’s your warning window before the kernel OOM killer fires.
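# Containers using >80% of their memory limit.
# Both sides are aggregated to (namespace, pod, container) so the cAdvisor and
# kube-state-metrics series match despite carrying different label sets.
(sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
/ sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})) > 0.80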
How to fix each cause
Application or config errors
When logs --previous shows Error: required environment variable DATABASE_URL is not set or exec: "myapp": executable file not found in $PATH, you have a config or packaging problem.
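- Missing env vars: Confirm the referenced ConfigMap or Secret exists in the same namespace. A required reference to a non-existent object usually keeps the container from starting at all (CreateContainerConfigError); a reference marked optional leaves the variable unset, and the application then exits on startup.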
- Wrong entrypoint: Verify spec.containers[].command and args. If you’re overriding the image’s ENTRYPOINT or CMD, confirm the binary path inside the image.
- Permission errors: Mounted volumes may not be writable by the container’s non-root UID. Check fsGroup and runAsUser in securityContext.
For live debugging, attach an ephemeral container without triggering another restart:
# Attach a debug container to inspect filesystem, env vars, and network
kubectl debug -it <pod-name> -n <namespace> \
--image=busybox \
--target=<container-name>
Out-of-memory restarts
Exit code 137 with Reason: OOMKilled in lastState means the Linux kernel terminated the container for exceeding its memory limit. See our deep-dive: OOMKilled in Kubernetes.
Setting limits correctly is harder than it looks. In our 2026 Kubernetes Optimization Report, clusters averaged around 20% memory utilization: heavily overprovisioned overall, yet individual pods were still undersized. In one representative cluster analyzed for the report, we recorded 40–50 OOM kills per hour, spiking above 80 during peak load. After deploying automated rightsizing, the rate dropped to near zero.
VPA helps automate this. Valid updateMode values: Off (recommendations only), Initial (set on pod creation), Recreate (apply via pod eviction), Auto (in-place on K8s 1.33+, beta with the InPlacePodVerticalScaling feature gate; clusters running 1.27–1.32 still use pod eviction).
JVM workloads: Set -Xmx to ~75% of the container memory limit. A 512Mi limit should have -Xmx384m. This leaves headroom for non-heap memory — metaspace, thread stacks, native libraries. Skipping this is one of the most common causes of Java OOM kills in Kubernetes.
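The 75% rule is simple arithmetic and can be scripted. The sketch below assumes the limit is given as a whole number of Mi; jvm_heap_flag is a hypothetical helper name, and the 75 default is just the ratio suggested above, not a JVM requirement.

```shell
# Derive an -Xmx flag from a container memory limit given in Mi.
# jvm_heap_flag is an illustrative helper; usage: jvm_heap_flag <limit-in-Mi> [percent]
jvm_heap_flag() (
  limit_mi="$1"
  percent="${2:-75}"
  # Integer arithmetic: any fractional MiB is rounded down
  printf '%s\n' "-Xmx$((limit_mi * percent / 100))m"
)
```

For a 512Mi limit, jvm_heap_flag 512 prints -Xmx384m, matching the example above. Workloads with heavy native or off-heap memory use may need a lower percentage.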
Cast AI Workload Autoscaler goes beyond VPA. Its OOM event handler detects a kill as it happens, generates a new recommendation with increased memory overhead, and applies it immediately — no manual intervention. It supports up to 240 changes per hour and uses in-place pod resizing on K8s 1.33+ (beta, InPlacePodVerticalScaling feature gate; 1.27–1.32 uses pod eviction instead).
# Check kernel OOM events on the node; do NOT use /var/log/syslog
# -i also matches the cgroup form "Memory cgroup out of memory"
journalctl -k | grep -i "out of memory"
LimitRange objects can silently impose memory limits on pods that don’t set them explicitly. If an operator deploys a pod without a resources.limits.memory field, a LimitRange default (say, 128Mi) applies automatically: the pod appears to be running without limits but gets OOM-killed at 128Mi. Check before you assume the pod has no limit:
# Check for LimitRange objects in the namespace
kubectl get limitrange -n <namespace>
# See the default limits applied
kubectl describe limitrange -n <namespace>
Liveness and readiness probe failures
These two probes do different things. Liveness probe failure → kubelet restarts the container (causes CrashLoopBackOff). Readiness probe failure → pod removed from Service endpoints (does NOT restart the container). Conflating them produces unnecessary restarts.
A note on exit codes for liveness-triggered restarts: The kubelet sends SIGTERM first — the container’s exit code will typically be 143 (128 + signal 15) or whatever exit code the application returns in its signal handler. SIGKILL (exit code 137) only follows if the container is still running after terminationGracePeriodSeconds expires. If you’re seeing 137 on a liveness probe failure, your application is ignoring SIGTERM and being forcibly killed — that’s a separate problem worth fixing.
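The 128 + signal arithmetic can be made explicit with a small decoder. This is a sketch under the usual shell convention that exit codes above 128 indicate death by signal; an application can also return such a value on its own, so treat the result as a hint. explain_exit_code is a hypothetical helper name.

```shell
# Interpret a container exit code using the 128 + signal-number convention.
# explain_exit_code is an illustrative helper; usage: explain_exit_code <code>
explain_exit_code() (
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9) name="SIGKILL" ;;
      15) name="SIGTERM" ;;
      *) name="signal $sig" ;;
    esac
    printf '%s: killed by %s (128 + %s)\n' "$code" "$name" "$sig"
  else
    printf '%s: process exited on its own with status %s\n' "$code" "$code"
  fi
)
```

Running it against 143 and 137 reports SIGTERM and SIGKILL respectively, which is the distinction that separates a graceful liveness restart from a forced kill or an OOM kill.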
For slow-starting containers, use startupProbe (Kubernetes 1.18+). It disables liveness and readiness until it passes, giving the application its full startup window without forcing absurdly high initialDelaySeconds on the liveness probe.
# startupProbe for slow starters — failureThreshold * periodSeconds = startup grace window
# 30 * 10s = 5 minutes before liveness/readiness activate
startupProbe:
httpGet:
path: /healthz
port: 8080
failureThreshold: 30 # 30 allowed failures
periodSeconds: 10 # checked every 10s
livenessProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 10
failureThreshold: 3
Two rules that eliminate most probe-related CrashLoopBackOff incidents: (1) never probe external dependencies in liveness checks, because a 30-second database blip should not restart every pod in your deployment; (2) use startupProbe for any app that takes more than 10–15 seconds to initialize.
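To size these values, it helps to compute the window each probe allows before the kubelet acts. The sketch below uses the failureThreshold * periodSeconds rule from the probe example, plus an optional initialDelaySeconds; it ignores timeoutSeconds, so the result is approximate. probe_window_seconds is a hypothetical helper name.

```shell
# Approximate seconds of consecutive failure a probe tolerates before it trips.
# probe_window_seconds is an illustrative helper.
# Usage: probe_window_seconds <failureThreshold> <periodSeconds> [initialDelaySeconds]
probe_window_seconds() (
  failure_threshold="$1"
  period_seconds="$2"
  initial_delay="${3:-0}"
  echo $((initial_delay + failure_threshold * period_seconds))
)
```

With the values from the example, probe_window_seconds 30 10 gives 300 for the startup probe and probe_window_seconds 3 10 gives 30 for the liveness probe: roughly five minutes to start, then about thirty seconds of sustained failure before a restart.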
How to prevent crash loops
Set accurate resource requests and limits. Requests drive scheduling; limits cap consumption. Profile under realistic load — don’t guess. An undersized memory limit produces OOM kills; no limit lets a leaking container consume the entire node.
Use startup probes for init-heavy workloads. For Java applications on Kubernetes, implementing a startupProbe with failureThreshold=30 and removing CPU limits allows the JVM class-loading phase to complete without being throttled or killed — this is the combination that eliminates init-phase CrashLoopBackOff for JVM workloads. CPU limits cause throttling that dramatically slows JVM class loading; rely on CPU requests for scheduling and monitor actual utilization instead of capping it hard.
Implement graceful shutdown. Handle SIGTERM, drain in-flight requests, and set terminationGracePeriodSeconds to match. A pod that crashes on shutdown can corrupt state that causes the next startup to fail.
Rollback safety. Before applying any memory limit or probe changes in production, make sure you can undo them quickly:
# Always check rollout history before making changes
kubectl rollout history deployment/<name> -n <namespace>
# Roll back if the change makes things worse
kubectl rollout undo deployment/<name> -n <namespace>
Automate memory rightsizing. Manual limits drift. Cast AI Workload Autoscaler tracks actual memory usage continuously and keeps limits calibrated. Its OOM event handler closes the feedback loop between a kill event and a corrected limit without requiring an engineer to notice, diagnose, and redeploy. At 240 changes per hour across a fleet, it handles scale that VPA alone cannot.
Use OpsPilot for real-time diagnosis. When a CrashLoopBackOff alert fires at 2am, OpsPilot gives you root cause in seconds: “payments-api has restarted 14 times in 2 hours. Root cause: OOMKilled — memory limit 256Mi, peak RSS 312Mi at v2.3.9 rollout.” That’s the full diagnostic loop compressed into one response, with the specific version and memory figures you need to act.
FAQ
What is CrashLoopBackOff in Kubernetes?
CrashLoopBackOff is a pod status (Waiting.Reason: CrashLoopBackOff) indicating a container is repeatedly starting and crashing. Kubernetes applies exponential backoff between restart attempts — starting at 10 seconds, capping at 5 minutes. It is not an error code; it describes the pod’s current waiting state. See the Kubernetes pod lifecycle documentation for the full specification.
How long does the backoff last?
The sequence is 10s, 20s, 40s, 80s, 160s, then 300s (5 minutes) where it caps. Each failed restart doubles the wait, up to 5 minutes. After 10 consecutive minutes of successful operation, the backoff counter resets to zero.
What is the difference between CrashLoopBackOff and ImagePullBackOff?
ImagePullBackOff means Kubernetes cannot pull the container image — wrong tag, missing credentials, or a network issue. The container never starts. CrashLoopBackOff means the image pulled successfully but the container exits with a non-zero code after starting. Both use exponential backoff, but they occur at different lifecycle stages and require different fixes.
What does exit code 137 mean in Kubernetes?
Exit code 137 means the container was killed by SIGKILL (128 + signal 9), either by the Linux kernel’s OOM killer or by the kubelet after terminationGracePeriodSeconds expired. For OOM kills, kubectl describe pod shows Reason: OOMKilled in the container’s Last State. The fix is to increase the memory limit using profiling data or an automated rightsizing tool. For liveness probe restarts, note that the kubelet sends SIGTERM first (exit code 143); 137 only appears if the container didn’t exit within the grace period.
How do I stop CrashLoopBackOff quickly?
Run kubectl logs <pod-name> --previous to read the crashed container’s output, then kubectl describe pod <pod-name> to check exit codes and events. Exit code 137 with Reason: OOMKilled: increase memory limits. Exit codes 1 or 2 with config errors: fix env vars or entrypoint. Liveness probe failures in events: add a startupProbe. If STATUS shows Init:CrashLoopBackOff, use kubectl logs <pod> -c <init-container-name> --previous, because the main container hasn’t run yet. Deleting the pod only resets the backoff timer; it does not fix the crash.
Can OOM kills cause CrashLoopBackOff?
Yes — OOM kills are one of the most common causes. The kernel terminates the container when it exceeds its memory limit (exit code 137). Kubernetes registers the crash and restarts the container. If the limit is still too low, the container hits it again, crashes again, and the loop continues. Cast AI Workload Autoscaler detects the OOMKill event and immediately applies a corrected memory limit, breaking the cycle without manual intervention.
What kubectl command shows CrashLoopBackOff?
kubectl get pods -n <namespace> shows the STATUS column where CrashLoopBackOff appears alongside the RESTARTS count. For root cause, use kubectl describe pod <name> -n <namespace> for exit codes and events, and kubectl logs <name> --previous for the last container’s output. If STATUS shows Init:CrashLoopBackOff, first list init container names with kubectl get pod <name> -o jsonpath='{.spec.initContainers[*].name}' then pull their logs with kubectl logs <name> -c <init-container-name> --previous.
Does Cast AI help with CrashLoopBackOff?
Yes, in two ways. OpsPilot diagnoses CrashLoopBackOff incidents in seconds, surfacing root cause and restart count without manual log triage. The Cast AI Workload Autoscaler handles OOM-driven crash loops by detecting kill events, generating updated memory recommendations, and applying them immediately — on Kubernetes 1.33+ via in-place pod resizing (beta, InPlacePodVerticalScaling feature gate), or on 1.27–1.32 via pod eviction.



