
Kubernetes Exit Codes Explained: 137, 139, 143 and How to Fix Them

Kubernetes exit codes reveal why containers fail. Learn the meaning of exit codes 137, 139, and 143, how to identify the root cause of pod crashes, and the fastest way to troubleshoot CrashLoopBackOff incidents in production.

Kunal Das

Key takeaways

  • Exit codes in containers follow the POSIX formula: values above 128 equal 128 + the signal number that killed the process.
  • Exit code 137 means the process received SIGKILL (signal 9). Most often that is an OOM kill: the kernel killed the container because it exceeded its memory limit, with no graceful shutdown and no cleanup. It can also mean the termination grace period expired; check the Reason field to tell them apart.
  • Exit code 139 is a segmentation fault (SIGSEGV, signal 11). Usually a null pointer dereference, buffer overflow, or a crash in a native extension like NumPy or TensorFlow.
  • Exit code 143 is SIGTERM (signal 15): Kubernetes is terminating the pod. If your app doesn’t shut down cleanly within terminationGracePeriodSeconds (default 30s), check your ENTRYPOINT form: a shell wrapper (sh -c) typically does not forward the signal to your app.
  • Exit code 127 means the binary wasn’t found at startup. Wrong CMD, wrong base image, or a PATH mismatch after switching from ubuntu to alpine.
  • Use the Exit Code Diagnostic Loop — Identify → Classify → Fix → Prevent — to move from CrashLoopBackOff to confirmed root cause without guessing.

Your pod crashed. STATUS shows CrashLoopBackOff, RESTARTS is climbing. The first question is always the same: what exit code? Exit codes are integers returned by a container process when it terminates — zero means success, anything above 128 means the process was killed by an OS signal (the formula is 128 + signal number, per the POSIX standard). Kubernetes surfaces these under Last State in kubectl describe pod. Three codes dominate production incidents: 137 (OOM killed), 139 (segfault), and 143 (SIGTERM). This post covers what each means, how to confirm it in under five minutes, and how to stop seeing it.

What container exit codes mean in Kubernetes

When a Linux process exits, it returns an integer between 0 and 255 to its parent process. Values above 128 indicate signal-based termination: if the process was killed by signal N, the shell or container runtime reports the exit status as 128 + N. This convention comes from POSIX shells; it is not specific to Linux, Kubernetes, or containers. Kubernetes just exposes it in a place you can query.
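As a minimal sketch of the 128 + N rule, the Python helper below maps an exit code back to a signal name using the standard library's signal module. The function name describe_exit_code is hypothetical, and the signal names depend on the platform running the code (Linux numbering is assumed).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable label."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128  # reverse the 128 + N convention
        try:
            return f"killed by {signal.Signals(signum).name} (signal {signum})"
        except ValueError:
            # Not a signal number known on this platform
            return f"killed by unknown signal {signum}"
    return f"application or runtime error ({code})"


for code in (0, 1, 137, 139, 143):
    print(code, "->", describe_exit_code(code))
```

Codes at or below 128 fall through to the generic label; the reference table later in the post covers what 1, 125, 126, and 127 usually mean.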

The container runtime (containerd or CRI-O) captures the exit code, and the kubelet records it in the pod status. Kubernetes then surfaces it in two places: kubectl describe pod under Last State → Exit Code, and the lastState.terminated object under status.containerStatuses (pod status, not pod spec). The Reason field alongside the exit code (OOMKilled, Error, or Completed) tells you whether the container was killed for exceeding its memory limit, exited with a failure, or exited cleanly.

The Exit Code Diagnostic Loop gives you a repeatable path through any crash:

  • Identify which pod and container crashed.
  • Classify the exit code and reason from kubectl describe.
  • Fix the immediate cause using the logs and events.
  • Prevent recurrence by addressing the root condition, whether that’s a memory limit, a signal handling bug, or a Dockerfile error.

The five commands below cover the Identify and Classify steps in full.

# Step 1: Identify — find pods with restarts or bad status
kubectl get pods -n <ns>  # check RESTARTS and STATUS

# Step 2: Classify — get the exit code and reason from Last State
kubectl describe pod <pod> -n <ns>  # Last State > Exit Code + Reason

# Step 3: Read the last run's output before the crash
kubectl logs <pod> -n <ns> --previous  # last container output

# Step 4: Machine-readable terminated state (pipe to jq for clarity)
# [0] selects the first container; adjust the index for multi-container pods
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}' | jq .

# Step 5: Cluster-level events sorted by time
kubectl get events --sort-by='.lastTimestamp' -n <ns>

Run these in order. By step 4 you’ll have the exit code, reason, started/finished timestamps, and container ID. By step 5 you’ll see whether a node memory pressure event, a deployment rollout, or an HPA scale-down preceded the crash.

Exit codes reference table

The table below covers every exit code you’ll encounter in day-to-day Kubernetes operations. Signal numbers follow the POSIX standard; see man7.org signal(7) for the full reference.

| Exit Code | Signal | Signal # | Kubernetes Reason | Common Cause | Fix Summary |
|---|---|---|---|---|---|
| 0 | - | - | Completed | Process exited cleanly | Expected for Jobs/batch. For long-running pods, check liveness probe or restart policy. |
| 1 | - | - | Error | Application runtime error or unhandled exception | Check kubectl logs --previous for stack trace or error message. |
| 125 | - | - | Error | Container runtime itself failed to start | Check image pull policy, OCI compatibility, and containerd/CRI-O logs on the node. |
| 126 | - | - | Error | Permission denied: entrypoint binary not executable | Add RUN chmod +x /app/binary in Dockerfile or fix file permissions in the image. |
| 127 | - | - | Error | Binary not found in PATH | Verify CMD/ENTRYPOINT path; check base image switch didn’t drop required utilities. |
| 137 | SIGKILL | 9 | OOMKilled / Error | Memory limit exceeded (OOMKilled) or grace period expired (Error) | Check Reason field: OOMKilled → raise memory limit; Error → raise terminationGracePeriodSeconds. |
| 139 | SIGSEGV | 11 | Error | Segmentation fault: invalid memory access | Reproduce locally with debugger; check native extensions and dependency versions. |
| 143 | SIGTERM | 15 | Error / Completed | Kubernetes terminated the pod; app didn’t handle shutdown gracefully | Use exec-form CMD; implement SIGTERM handler; tune terminationGracePeriodSeconds. |
| 255 | - | - | Error | Container runtime error or infrastructure issue (e.g., node failure, kubelet restart, underlying VM crash) | Check node status with kubectl get nodes and kubectl describe node <node>. |

The exit codes you will actually hit, and how to fix them

Exit code 137 (OOMKilled)

Exit code 137 = 128 + 9 (SIGKILL). The kernel’s OOM killer sent SIGKILL, a signal that cannot be caught or ignored, because the container tried to allocate memory beyond the cgroup limit set by Kubernetes. There is no warning, no signal handler, no cleanup: the process is gone mid-execution.

kubectl describe on a recently killed pod will show something like this under Last State:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Thu, 25 Jun 2026 08:12:04 +0000
  Finished:     Thu, 25 Jun 2026 08:14:31 +0000

Note the Reason: OOMKilled field: it is set by Kubernetes, not by the application. The OOM kill also appears in kernel logs on the node. On managed Kubernetes (EKS, GKE, AKS), SSH access to nodes is often unavailable. Use a node debug pod instead:

# Typically requires elevated RBAC (often cluster-admin); useful on managed K8s where node SSH is unavailable
# kubectl debug node mounts the node's root filesystem at /host;
# chroot /host runs the node's own dmesg binary instead of relying on the debug image
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host dmesg -T | grep -i oom

The output shows the process name, PID, and the memory usage that triggered the kill, which is useful for confirming whether the process that was killed is the one you expected. The chroot /host prefix matters: kubectl debug node sessions mount the node’s root filesystem at /host, so chroot runs the node’s own tools rather than whatever the debug image happens to ship. If dmesg reports a permission error, the debug pod likely lacks the privileges needed to read the kernel log.

Exit code 137: OOMKilled vs. grace period expiry

Exit code 137 does not always mean the kernel OOM killer. Two distinct failure modes produce the same exit code, and they have completely different fixes:

  • Exit 137 + Reason: OOMKilled — the container hit its memory limit. The kernel fired SIGKILL. Fix: raise the memory limit or use automated rightsizing.
  • Exit 137 + Reason: Error (and the pod was being terminated) — terminationGracePeriodSeconds expired before the container shut down. Kubernetes sent SIGKILL. Fix: increase terminationGracePeriodSeconds, fix a preStop hook that’s running too long, or make the application’s shutdown path faster.

Raising memory limits for a grace period problem wastes money and doesn’t fix anything. Check the Reason field first.
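The distinction above can be sketched as a small classifier over the lastState.terminated object. The sample JSON is hypothetical, shaped like the output of the jsonpath query shown earlier; exitCode and reason are the field names Kubernetes uses in container status, while classify_137 is an illustrative helper, not part of any tool.

```python
import json

# Hypothetical sample, shaped like .status.containerStatuses[].lastState.terminated
SAMPLE = """
{
  "exitCode": 137,
  "reason": "OOMKilled",
  "startedAt": "2026-06-25T08:12:04Z",
  "finishedAt": "2026-06-25T08:14:31Z"
}
"""


def classify_137(terminated: dict) -> str:
    """Suggest where to look first for an exit-137 termination."""
    if terminated.get("exitCode") != 137:
        return "not a SIGKILL exit; classify by its own exit code"
    if terminated.get("reason") == "OOMKilled":
        return "memory limit exceeded: review the container memory limit"
    # Any other reason with 137: SIGKILL came from outside the OOM killer,
    # most often because the termination grace period ran out.
    return "likely grace period expiry: review terminationGracePeriodSeconds and preStop"


print(classify_137(json.loads(SAMPLE)))
```

The second branch is a heuristic, not a certainty: confirm it against the pod events before changing the grace period.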

Why did my pod get OOMKilled when it wasn’t the biggest consumer?

This is a classic on-call question. Your pod is using 200Mi. Another pod is using 1.8Gi. Your pod dies first. The answer is Kubernetes Quality of Service (QoS) classes, which determine which pods the Linux OOM killer targets when a node is under memory pressure.

Kubernetes assigns one of three QoS classes based on how a pod’s resources are configured:

  • BestEffort: no requests or limits set on any container. The kubelet sets oom_score_adj=1000. These pods die first.
  • Burstable: requests are set but lower than limits, or only one is set. The kubelet sets oom_score_adj between 2 and 999, calculated as min(max(2, 1000 − (1000 × memory_request / node_total_memory)), 999); smaller requests produce a higher adjustment and die sooner. These die next.
  • Guaranteed: requests == limits on every container in the pod. The kubelet sets oom_score_adj=-997. These are protected and die last.

When the node needs to free memory, the OOM killer targets the process with the highest OOM score, and oom_score_adj is a large part of that score. A BestEffort pod using 200Mi will therefore typically be killed before a Guaranteed pod using 1.8Gi. Check your pod’s QoS class:

kubectl get pod <pod> -o jsonpath='{.status.qosClass}'

If you’re seeing OOMKills on pods that aren’t the largest consumers, check whether they’re BestEffort. Setting explicit requests and limits promotes them to at least Burstable and raises their survival priority on a pressured node.
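To make the Burstable formula concrete, here is a minimal sketch that evaluates it for a few request sizes. The function name and the 16 GiB node are hypothetical, and the integer division is an assumption; the kubelet's exact rounding and bounds may differ between versions.

```python
GIB = 1024 ** 3


def burstable_oom_score_adj(memory_request_bytes: int, node_memory_bytes: int) -> int:
    """Approximate oom_score_adj for a Burstable pod, clamped to 2..999."""
    raw = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
    return min(max(2, raw), 999)


# Hypothetical 16 GiB node: a larger request earns a lower (safer) adjustment.
for request_gib in (1, 4, 8):
    adj = burstable_oom_score_adj(request_gib * GIB, 16 * GIB)
    print(f"request={request_gib}GiB -> oom_score_adj={adj}")
```

The takeaway is the direction of the relationship: the more memory a Burstable pod requests relative to the node, the lower its adjustment and the later it is chosen.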

The fix depends on whether this is a one-off spike or a pattern. For a one-off: raise the memory limit and redeploy. For a recurring pattern: the workload’s memory footprint has grown since you set the original limit, and you’re in a static configuration that will keep killing pods until you change it. Bumping the limit is the right call; the question is how much to bump it by. If you’re guessing, you’re either over-provisioning (expensive) or under-provisioning (back to OOMKilled).

For a complete walkthrough of diagnosis and remediation, see the OOMKilled deep-dive.

A few patterns that reliably cause exit code 137: ML inference containers that load a model into RAM on startup (the model grows with new versions); Java applications where the JVM heap isn’t bounded by -XX:MaxRAMPercentage and grows into the container limit; data processing jobs that read a full dataset into memory and fail when the dataset grows beyond what was tested.

Exit code 139 (SIGSEGV)

Exit code 139 = 128 + 11 (SIGSEGV). The process touched memory it had no business touching — null pointer dereference, write past an array boundary, use-after-free. The kernel fires SIGSEGV and the process is gone. Unlike OOMKilled, there’s no special Reason to guide you; kubectl describe just says Error. You’re on your own.

The most common production source of exit code 139 that isn’t a direct code bug: native extensions. Python packages that wrap C or C++ libraries — NumPy, TensorFlow, certain database drivers — can segfault when there’s a version mismatch between the Python wrapper and the compiled binary, when a shared library is missing, or when the library has a known bug in a specific version. The Python interpreter itself won’t throw an exception; it just disappears with exit code 139.

Start with kubectl logs <pod> --previous. A C-level segfault often produces no output at all — that silence is itself a clue. If signal handlers were registered, you might see a partial stack trace. When logs are empty and you need to go deeper, you have three good options depending on the runtime:

  • C/C++: rebuild with AddressSanitizer: CFLAGS="-fsanitize=address -fno-omit-frame-pointer" make. ASAN will catch the memory error with a precise stack trace instead of a silent crash. Note: requires an ASAN-compiled binary, and in-cluster use may need SYS_PTRACE capability added to the pod’s securityContext (see below).
  • Python: add import faulthandler; faulthandler.enable() at module level. When the interpreter crashes, Python will print a minimal C-level traceback to stderr before dying — this often points directly at the offending extension.
  • JVM native crashes: the JVM writes a hs_err_pid<PID>.log file to the working directory on fatal errors. If the container has a writable filesystem, check /proc/1/cwd/hs_err_pid*.log or mount a volume to persist it.
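The faulthandler option can be tried locally with a sketch like the one below. It launches a child interpreter that enables faulthandler and then deliberately reads address 0 through ctypes, so the parent can inspect the result. The helper name run_crashing_child is hypothetical, and the deliberate crash stands in for a misbehaving native extension.

```python
import subprocess
import sys

# Child program: enable faulthandler, then force an invalid read at address 0.
CHILD = """
import ctypes
import faulthandler

faulthandler.enable()   # dump a traceback to stderr on fatal signals
ctypes.string_at(0)     # deliberately dereference a null pointer
"""


def run_crashing_child() -> tuple[int, str]:
    """Run the crashing child and return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


returncode, stderr = run_crashing_child()
# subprocess reports death-by-signal as a negative signal number;
# a shell or container runtime reports the same crash as 128 + N.
print("returncode:", returncode)
print(stderr)
```

In a real service, the faulthandler.enable() call would sit at the top of the entrypoint module, so the traceback lands in the container's stderr and shows up in kubectl logs --previous.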

For in-cluster debugging with a debugger or profiler, add SYS_PTRACE to the container’s security context — most default pod security profiles drop it:

securityContext:
  capabilities:
    add:
    - SYS_PTRACE

A crash that only appears in Kubernetes but not locally usually points to resource limits, CPU architecture differences (x86 vs ARM), or a missing system library in the image. Check whether the crash correlates with a dependency version change in the last deployment — native extension segfaults often surface immediately after a pip install --upgrade that pulled in a newer C extension.

Exit code 143 (SIGTERM)

Exit code 143 = 128 + 15 (SIGTERM). Kubernetes sends SIGTERM to the container process when a pod is being terminated — during a rolling deployment, a node drain, a scale-down, or a manual kubectl delete pod. This is expected behavior. What’s not expected is when your app exits dirty: it drops open database connections, cuts off in-flight requests, or fails to release a distributed lock.

The sequence Kubernetes follows on pod termination, per the Kubernetes pod lifecycle docs: the pod is marked as terminating and starts being removed from Service endpoints, the preStop hook runs (if defined), then SIGTERM is sent to the main container process. Endpoint removal proceeds in parallel with the hook rather than strictly before it. The process has terminationGracePeriodSeconds (default 30 seconds) to exit cleanly. After that, SIGKILL fires, which gives you exit code 137, not 143.

The most common bug behind exit code 143 problems is a shell-wrapped entrypoint. When your Dockerfile uses shell-form ENTRYPOINT ./myapp (which Docker runs as /bin/sh -c "./myapp") or an explicit ENTRYPOINT ["sh", "-c", "./myapp"], the shell becomes PID 1 and receives SIGTERM, but sh typically doesn’t forward signals to child processes. Your app never sees the signal, keeps running until the grace period expires, and is then SIGKILLed (exit code 137). The fix is exec-form: CMD ["./myapp"] so your application becomes PID 1 directly and receives signals itself.

# BAD: shell-form; Docker runs this as /bin/sh -c "./myapp", so sh is PID 1 and may not forward SIGTERM
ENTRYPOINT ./myapp

# BAD: same problem with an explicit sh -c wrapper
ENTRYPOINT ["sh", "-c", "./myapp"]

# GOOD: exec-form; myapp becomes PID 1 and receives SIGTERM directly
CMD ["./myapp"]
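Receiving the signal is only half of the fix: the application also has to act on it. A process running as PID 1 gets no default signal actions, so without a handler SIGTERM is effectively ignored. Below is a minimal sketch of a handler in Python; the serve loop and its placeholders are hypothetical, and the script sends SIGTERM to itself only to simulate what Kubernetes would do.

```python
import os
import signal
import threading

stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request; keep the handler itself tiny."""
    stop_requested.set()


def install_handler() -> None:
    # Must run in the main thread; Python only allows signal.signal() there.
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(poll_seconds: float = 0.5) -> None:
    """Hypothetical main loop: work until SIGTERM arrives, then clean up."""
    while not stop_requested.wait(timeout=poll_seconds):
        pass  # placeholder: handle one unit of work per iteration
    # placeholder: close connections, flush queues, release locks
    print("SIGTERM received, shutting down cleanly")


install_handler()
os.kill(os.getpid(), signal.SIGTERM)  # simulate Kubernetes sending SIGTERM
serve()
```

Setting a flag and letting the main loop finish its current unit of work keeps the handler safe; doing real cleanup inside the handler itself is easy to get wrong.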

If your app needs a longer shutdown window — flushing a queue, draining connections, finishing in-flight work — increase terminationGracePeriodSeconds in the pod spec. If you need an action before the signal arrives (for example, deregistering from a load balancer or waiting for iptables propagation to complete), use a preStop hook. The hook runs before SIGTERM is sent, giving you a clean sequencing primitive.

A common pattern for zero-downtime rolling deployments: add a short preStop sleep so iptables rule changes can propagate before the container begins shutting down. Without this, the load balancer may still route requests to a container that has already started shutting down, producing connection reset errors in clients.

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]  # allow iptables propagation before SIGTERM

The sleep 5 gives kube-proxy time to update iptables rules across nodes so no new requests are routed to the pod. After the preStop hook completes, Kubernetes sends SIGTERM to the container process. With a 5-second preStop and a grace period of 60 seconds, the application has up to 55 seconds of drain time.

⚠ Grace period timing: terminationGracePeriodSeconds starts at pod deletion, not at SIGTERM. preStop execution time counts against the grace period window. Formula: terminationGracePeriodSeconds = max_preStop_duration + expected_drain_time + 10s buffer. A 5-second preStop with a 30-second grace period leaves only 25 seconds for drain — often not enough for long-lived connections. Set the grace period explicitly based on your actual shutdown path, not the default.
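The arithmetic in that formula is simple enough to sketch directly. The function names below are hypothetical; the 10-second buffer is the one suggested in the formula above.

```python
def required_grace_period(prestop_seconds: int, drain_seconds: int,
                          buffer_seconds: int = 10) -> int:
    """terminationGracePeriodSeconds needed for a given shutdown path."""
    return prestop_seconds + drain_seconds + buffer_seconds


def drain_budget(grace_period_seconds: int, prestop_seconds: int) -> int:
    """Seconds left for the app to drain once the preStop hook has finished."""
    return max(0, grace_period_seconds - prestop_seconds)


# Default 30s grace period with a 5s preStop sleep
print("drain budget at default:", drain_budget(30, 5))
# Grace period needed for a 5s preStop and a 45s drain
print("required grace period:", required_grace_period(5, 45))
```

Measuring the real drain time under load, rather than estimating it, is what makes the resulting number trustworthy.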

Exit code 127 (command not found)

Exit code 127 is not signal-based. The shell returns it when CMD or ENTRYPOINT references a binary that does not exist at the specified path. The container starts, the shell tries to exec the command, and immediately returns 127 without ever running application code.

Three common sources: a typo in CMD (easy to catch); a path that was valid in the old base image but isn’t in the new one; and the ubuntu-to-alpine migration that drops GNU coreutils and bash in favor of BusyBox. If your entrypoint script used bash-specific syntax and you switched to an alpine base, it breaks with 127. Same story if your app relied on /usr/bin/python3 and the new image only has /usr/local/bin/python3.

Diagnosis is fast: kubectl logs <pod> --previous will usually print something like sh: /app/myapp: not found or exec: "python": executable file not found in $PATH. If the log is empty, the container exited before producing output — exec into a debug container in the same pod namespace using kubectl debug and check the filesystem directly.
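The 127 behavior is easy to reproduce outside a cluster. The sketch below runs a command line through sh and returns the shell's exit status; the helper name and the deliberately nonexistent command name are hypothetical, and a POSIX sh on the PATH is assumed.

```python
import subprocess


def shell_exit_code(command: str) -> int:
    """Run a command line through sh and return the shell's exit status."""
    return subprocess.run(["sh", "-c", command], capture_output=True).returncode


# A command that does not exist anywhere on PATH
print("missing binary:", shell_exit_code("hypothetical-missing-binary-xyz"))
# A command that exists and succeeds
print("true:", shell_exit_code("true"))
```

Running the same check inside the image (for example with docker run and the image's own shell) is a quick way to confirm whether the entrypoint path survived a base image change.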

How resource configuration drives the most common exit codes

Exit code 137 is the only one in this list that’s structural rather than accidental. OOMKilled usually doesn’t trace back to a single bug in your application (memory leaks aside); it happens because the memory limit you set at deployment time, based on a load test or a rough estimate, no longer matches the memory the workload actually uses. The gap between those two numbers grows over time: the model gets larger, the dataset grows, the traffic pattern changes, and the static limit stays the same.

According to the Cast AI 2026 State of Kubernetes Optimization Report, 79% of Kubernetes workloads are memory-overprovisioned and 69% are CPU-overprovisioned. The instinct to pad limits generously is understandable — no one wants an OOMKill in production — but overprovisioned limits don’t actually prevent OOMKills on workloads that spike. Even when operators set a container’s memory limit to 2Gi and its normal usage is only 800Mi, a spike to 2.1Gi will trigger the OOM killer. The problem isn’t that the limit is too low on average; it’s that the limit is static and workloads aren’t.

The same report documented a cluster running 40 to 50 OOM kills per 30-minute measurement window — with spikes above 80 — reduced to near zero after automated rightsizing with Cast AI’s Workload Autoscaler (Cast AI 2026 State of Kubernetes Optimization Report). The autoscaler detects when a container’s actual usage is 50% or more above the current recommendation and adjusts the limit before the OOM killer fires, rather than waiting for a human to notice the pattern in a post-mortem.

For teams managing this manually: Kubernetes Vertical Pod Autoscaler (VPA) in updateMode: Auto updates resource requests and limits by evicting pods and recreating them with new values, which prevents OOMKills but introduces restarts. ⚠️ Always pair VPA Auto mode with a PodDisruptionBudget (for example, minAvailable: 1). Without one, nothing stops the VPA Updater from evicting a workload’s replicas at the same time during a resize, which can cause a full service outage. updateMode: Off gives you recommendations without applying them. The gap between a good recommendation and zero OOMKills is still the apply latency: VPA doesn’t act until after a pod has been running long enough to establish a baseline. Cast AI’s Workload Autoscaler operates on real-time usage with anomaly detection, which closes the latency gap for workloads that spike unexpectedly rather than trending gradually.

OpsPilot, Cast AI’s AI-powered operations tool, surfaces OOMKilled diagnosis inline: given a pod name, it returns the root cause, the container’s configured limit versus peak RSS, and the rollout or event that triggered the spike — with a kubectl command to apply the fix. An example output: “Root cause: OOMKilled, container memory limit 256Mi, peak RSS 312Mi at v2.3.9 rollout.” That’s the full diagnostic loop completed in one query, without digging through describe output and logs manually.

Exit code 143 has a parallel resource story: if terminationGracePeriodSeconds is set too low for the workload’s actual shutdown time, pods get SIGKILLed after the grace period expires — which produces exit code 137, not 143. The two codes are connected. A pod that should exit cleanly on SIGTERM but has a 30-second grace period and a 45-second shutdown path will always show exit code 137 in production, even though the root cause is configuration, not an OOM event.

If you’re seeing persistent exit code 137 and the Reason field in kubectl describe is OOMKilled, the fix is resource limits. If the Reason is Error or the pod is terminating by schedule and the code is still 137, check terminationGracePeriodSeconds first.

Alerting on exit codes with PrometheusRule

Reactive debugging after an OOMKill is slower than getting alerted the moment it happens. If you’re running kube-state-metrics and Prometheus Operator, deploy this PrometheusRule to fire an alert within one minute of any pod being OOMKilled:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-exit-code-alerts
  labels:
    prometheus: kube-prometheus  # adjust to match your Prometheus Operator ruleSelector
spec:
  groups:
  - name: exit-codes
    rules:
    - alert: PodOOMKilled
      # kube_pod_container_status_last_terminated_reason is a gauge (0 or 1)
      # use changes() instead of increase() for gauge metrics
      expr: changes(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[5m]) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} OOMKilled in {{ $labels.namespace }}"
        runbook_url: "https://cast.ai/learn/oomkilled/"

The alert fires when kube_pod_container_status_last_terminated_reason changes value for reason="OOMKilled" (meaning at least one OOMKill was recorded in the past 5 minutes). A critical detail: this metric is a gauge (it holds a value of 0 or 1), not a counter, so increase() is the wrong function and produces unreliable results. Use changes(), which counts the number of value transitions in the window. The for: 1m clause prevents flapping on momentary scrape gaps. If your Prometheus also scrapes cAdvisor metrics from the kubelet and they include container_oom_events_total (a true counter), you can use rate(container_oom_events_total[5m]) > 0 instead; that expression is both semantically correct and more granular. Wire the alert to your on-call channel and you’ll know about OOMKills before the next deployment review.
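To see what counting transitions means in practice, here is a rough Python model of how changes() treats the samples inside a range window. It is a sketch of the semantics only, not Prometheus code, and the sample lists are hypothetical.

```python
def count_changes(samples: list[float]) -> int:
    """Count value transitions between consecutive samples in a window."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


# Gauge flips from 0 to 1 inside the window: one transition
print(count_changes([0, 0, 1, 1, 1]))
# Gauge is already 1 at every sample: no transitions
print(count_changes([1, 1, 1, 1]))
```

The second case is worth keeping in mind when testing the rule: a window in which the gauge holds the same value at every sample counts zero changes, so verify the alert against a real OOMKill in a non-production namespace before relying on it.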

See how Cast AI’s Workload Autoscaler eliminates OOMKills at scale without requiring manual limit tuning.

FAQ

What does exit code 137 mean in Kubernetes?

Exit code 137 means the container was killed by SIGKILL (signal 9). The most common cause is Reason: OOMKilled — the kernel’s OOM killer fired because the container exceeded its configured memory limit. It can also appear with Reason: Error when terminationGracePeriodSeconds expired and Kubernetes sent SIGKILL. Check the Reason field in kubectl describe pod under Last State to tell them apart. SIGKILL cannot be caught or handled; there is no graceful shutdown regardless of cause.

What is exit code 139 in a container?

Exit code 139 is a segmentation fault: the process received SIGSEGV (signal 11) because it attempted an invalid memory access. Common causes are null pointer dereferences, buffer overflows, and crashes in native C/C++ extensions (NumPy, TensorFlow, JNI-based libraries). kubectl logs <pod> --previous may return empty output if the crash happened at the C level before any logging occurred. Enable Python’s faulthandler or rebuild with AddressSanitizer to get a stack trace.

What is exit code 143 in Kubernetes?

Exit code 143 means the container received SIGTERM (signal 15) and exited. Kubernetes sends SIGTERM during pod termination — rolling deployments, node drains, scale-downs, or manual pod deletion. The application has terminationGracePeriodSeconds (default 30 seconds) to shut down before SIGKILL is sent. If your app isn’t receiving the signal, check whether you’re using shell-form ENTRYPOINT, which does not forward signals to child processes.

How do I find the exit code of a crashed Kubernetes pod?

Run kubectl describe pod <pod> -n <namespace> and look under Last State → Exit Code and Reason. For a machine-readable version: kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}' | jq . This returns the exit code, reason, and timestamps for the last container run.

What is the difference between exit code 137 and 143 in Kubernetes?

Both are signal-based exits. Exit code 143 (SIGTERM) is Kubernetes asking the pod to terminate gracefully; the application can catch this signal and shut down cleanly. Exit code 137 (SIGKILL) cannot be caught: the kernel’s OOM killer sends it when the container exceeds its memory limit (OOMKilled), and Kubernetes sends it when terminationGracePeriodSeconds expires after a SIGTERM that was not handled in time.

How do I fix exit code 127 in a Kubernetes pod?

Exit code 127 means the binary specified in CMD or ENTRYPOINT was not found. Check kubectl logs <pod> --previous for a “not found” error message. Common fixes: correct the path in your Dockerfile CMD; verify the binary exists in the image (kubectl debug with an ephemeral container is useful here); if you recently switched base images from ubuntu to alpine, check that required utilities are installed in the new image.

How does Cast AI prevent OOMKilled (exit code 137) in Kubernetes?

Cast AI’s Workload Autoscaler monitors real-time memory usage and adjusts container memory limits automatically before the OOM killer fires. It detects anomalous usage — spikes 50% or more above the current recommendation — and updates limits without waiting for a human to notice a pattern post-incident. According to the Cast AI 2026 State of Kubernetes Optimization Report, one customer cluster went from 40 to 50 OOM kills per 30-minute measurement window to near zero after enabling automated rightsizing.
