In the world of Kubernetes, not all resource challenges are visible at first glance.
Recently, I encountered a complex scenario in which over 100 workloads on a single node were experiencing intermittent startup failures, CrashLoopBackOffs, and performance degradation despite the node having sufficient resources based on pod requests.
The scenario
A production Kubernetes cluster with the following characteristics:
- 100+ workloads running on a single node
- Primarily Java-based applications
- Some workloads with exceptionally large Docker images
- Pods that appeared to have sufficient resources (based on requests)
- High failure rate during pod initialization phase
Upon investigation, we discovered two distinct but related problems:
Problem 1: CPU spikes during initialization
Java applications, in particular, were consuming significantly more CPU during initialization than their requested resources indicated. While these applications would eventually settle into a steady state of much lower resource usage, the startup phase created substantial resource contention, especially when multiple pods were starting simultaneously.
The technical impact:
- Failed liveness probes during extended startup times
- CrashLoopBackOff states triggered by initialization failures
- Resource competition creating a “noisy neighbor” problem
- Manual pod deletion/recreation sometimes resolved issues – but only when done one at a time
Problem 2: Large image size paralysis
Some workloads with extremely large container images (multiple GB) were facing an additional set of challenges:
- Extended image pull times, consuming network bandwidth
- High disk I/O during extraction
- Temporary storage pressure on nodes
- Component timeouts during extended initialization
What made this particularly puzzling was that even isolated manual restarts of these pods would sometimes fail.
The solution framework
After extensive testing and optimization, we developed a comprehensive approach to solving both issues without overprovisioning our infrastructure.
For CPU initialization spikes:
1. Configure startup probes
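    startupProbe:
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 30
      periodSeconds: 10

Startup probes give applications sufficient time to initialize before liveness checks begin, preventing premature restarts. With these values, the application has up to 300 seconds (30 attempts at 10-second intervals) to finish starting.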
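To show where the startup probe sits relative to a liveness probe, here is a rough sketch of a container spec fragment. The container name, image, and liveness probe values are placeholders chosen for illustration and are not taken from the original setup.

```yaml
# Hypothetical container spec fragment; name, image, and liveness values are assumptions.
containers:
  - name: java-app                      # placeholder name
    image: your-registry/your-app:1.0   # placeholder image
    ports:
      - containerPort: 8080
    startupProbe:                       # runs first; liveness checks wait for it
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:                      # only evaluated after the startup probe has succeeded
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```

Because the kubelet holds off liveness checks until the startup probe succeeds once, the liveness probe can stay strict for steady-state operation without penalizing a slow start.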
2. Remove CPU limits
    resources:
      requests:
        cpu: "1"
        memory: "1Gi"
      limits:
        memory: "2Gi"   # Keep a memory limit so one container cannot exhaust node memory
        # No CPU limit, so the container can burst during initialization

This avoids hard CPU throttling during initialization: the container can temporarily use more CPU than it requested, which improves startup success.
3. Use priorityClassName for smarter scheduling
    spec:
      priorityClassName: low-priority

Assigning priority classes helps Kubernetes decide which workloads are scheduled first when resources are tight. The named PriorityClass must already exist in the cluster.
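The `low-priority` name has to refer to a PriorityClass object. A minimal sketch of what such an object might look like follows; the numeric value, preemption policy, and description are assumptions for illustration, not the values used in the original cluster.

```yaml
# Hypothetical PriorityClass backing the "low-priority" name; all values are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000                # assumed value; higher numbers mean higher priority
globalDefault: false       # pods without a priorityClassName do not get this class
preemptionPolicy: Never    # pods in this class wait in the queue instead of evicting others
description: "Non-critical workloads that can wait for capacity."
```

Setting `preemptionPolicy: Never` is one possible choice here: it keeps low-priority pods from evicting anything while still letting the scheduler order them behind more important workloads.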
4. Use topologySpreadConstraints
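    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "kubernetes.io/hostname"
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: your-app

Spreading replicas across nodes reduces the chance that many pods of the same application initialize on one node at the same time.
For large image issues: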
1. Optimize kubelet image pull parameters
    imageGCHighThresholdPercent: 85
    imageGCLowThresholdPercent: 80
    registryPullQPS: 10
    registryBurst: 20

The image GC thresholds control when the kubelet reclaims disk space from unused images, while registryPullQPS and registryBurst cap the rate of image pulls, preventing network and I/O saturation.
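These fields belong in the kubelet's configuration file. The sketch below shows one way such a file might look. The last two fields are not part of the original settings; they are included only as an assumption about how pull concurrency could be bounded, and `maxParallelImagePulls` is only available in newer kubelet versions, so check it against yours.

```yaml
# Sketch of a kubelet config file; how it reaches the nodes depends on your distribution.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 85   # image garbage collection always runs above 85% disk usage
imageGCLowThresholdPercent: 80    # garbage collection frees space down to 80% disk usage
registryPullQPS: 10               # sustained limit on registry pulls per second
registryBurst: 20                 # short bursts allowed above the sustained rate
serializeImagePulls: false        # assumption: pull images in parallel rather than one at a time
maxParallelImagePulls: 3          # assumption: cap on concurrent pulls (newer kubelets only)
```

The kubelet has to be restarted to pick up changes to this file, so roll the change out node by node.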
2. Implement image optimization
Using multi-stage builds dramatically reduced our image sizes:
    # Build stage: compile with the full Maven/JDK image
    FROM maven:3.8-openjdk-11 AS builder
    WORKDIR /app
    COPY . .
    RUN mvn package -DskipTests
    # Runtime stage: ship only the JAR on a slim JRE image
    FROM openjdk:11-jre-slim
    COPY --from=builder /app/target/app.jar /app.jar
    ENTRYPOINT ["java", "-jar", "/app.jar"]
3. Pre-pull critical images
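    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: image-prepuller
    spec:
      # apps/v1 requires a selector that matches the pod template labels
      selector:
        matchLabels:
          app: image-prepuller
      template:
        metadata:
          labels:
            app: image-prepuller
        spec:
          initContainers:
            - name: prepull
              image: docker
              command: ["/bin/sh", "-c"]
              args:
                - |
                  docker pull your-large-image:latest
              volumeMounts:
                - name: docker-socket
                  mountPath: /var/run/docker.sock
          containers:
            - name: pause
              image: k8s.gcr.io/pause:3.5
          volumes:
            # Assumes nodes run the Docker runtime and expose its socket at this path
            - name: docker-socket
              hostPath:
                path: /var/run/docker.sock
Optional strategies worth considering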
For environments with specific constraints or requirements, these additional strategies may be helpful:
1. Implement staggered deployments
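    apiVersion: apps/v1
    kind: Deployment
    spec:
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 0

With maxSurge: 1 and maxUnavailable: 0, a rollout starts one new pod at a time and keeps each existing pod running until its replacement is ready.
2. Java-specific optimizations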
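    -XX:+UseContainerSupport
    -XX:InitialRAMPercentage=50.0
    -XX:TieredStopAtLevel=1
    -Djava.security.egd=file:/dev/./urandom
    -XX:+AlwaysPreTouch

These JVM flags significantly improve startup performance and container awareness. Note that -XX:+AlwaysPreTouch touches every heap page when the JVM launches, trading a somewhat longer launch for more predictable memory behavior afterwards, so measure it against your own workloads.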
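One common way to pass flags like these without rebuilding the image is the `JAVA_TOOL_OPTIONS` environment variable, which the JVM reads at launch. The fragment below is a sketch under that assumption; the container name and image are placeholders. `TieredStopAtLevel` takes a numeric value, so it is written without a `+`.

```yaml
# Hypothetical container spec fragment; name and image are placeholders.
containers:
  - name: java-app
    image: your-registry/your-app:1.0
    env:
      - name: JAVA_TOOL_OPTIONS   # picked up by the JVM at launch
        # ">-" folds the lines below into one space-separated string
        value: >-
          -XX:+UseContainerSupport
          -XX:InitialRAMPercentage=50.0
          -XX:TieredStopAtLevel=1
          -Djava.security.egd=file:/dev/./urandom
          -XX:+AlwaysPreTouch
```

Keeping the flags in the pod spec rather than in the image's ENTRYPOINT makes it possible to tune them per environment.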
Results and lessons learned
After implementing these optimizations:
- Pod startup success rate improved from ~65% to over 99%
- Initialization times for Java applications decreased by 40%
- Large image pull failures reduced by 90%
- Overall cluster stability significantly improved
Most importantly, we achieved these improvements without increasing our infrastructure footprint, proving that effective resource management often involves understanding application behavior patterns rather than simply adding more resources.
Key takeaways
- Understand the full lifecycle of your applications – not just their steady-state behavior
- Monitor initialization phases separately from normal operation
- Configure Kubernetes to match your specific workload patterns
- Remove CPU limits for initialization-heavy workloads when appropriate
- Java applications require special consideration in containerized environments
- Large images need infrastructure optimizations beyond application-level changes
Addressing these often-overlooked aspects of resource management can significantly enhance reliability and efficiency in Kubernetes. This approach allows for improvements without overprovisioning your infrastructure.
What optimization challenges have you encountered in your Kubernetes environments?