Kubernetes pod scheduling plays a critical role in how your applications perform and how much you pay to run them. Every time the scheduler decides where a pod will run, it balances factors such as cost efficiency, resource availability, fault tolerance, and workload priorities. For teams managing dynamic environments or production-grade clusters, configuring pod scheduling effectively is essential to maintaining resilience without overspending.
Check out the first part of this series for more insights into the three pod scheduling mechanisms and best practices.
In this part, I dive into resource optimization and resiliency best practices for Kubernetes clusters. Using real-world examples, we’ll explore how to fine-tune scheduling policies to improve availability, reduce waste, and keep workloads running smoothly even during failures or scaling events.
Whether you’re trying to reduce cloud costs or build a more resilient platform, understanding the subtleties, implementation patterns, and trade-offs is critical for creating high-performance, robust, and cost-effective Kubernetes infrastructures.
Resource Optimization Considerations
Scheduling policies significantly impact resource utilization and costs. In many real-world examples, optimized pod distribution has improved CPU utilization by 35-47% and memory utilization by 28-39%.
Bin-Packing vs. Spreading
While spreading workloads improves resilience, excessive spreading can lead to resource fragmentation:
# May lead to excessive spreading and poor bin-packing
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - app-name
    topologyKey: kubernetes.io/hostname

Balancing Cost and Resilience
For optimal resource efficiency with appropriate resilience:
1. Use node-level soft anti-affinity for non-critical services:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 80
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - service-name
      topologyKey: kubernetes.io/hostname

2. Reserve strict constraints for the zone/region level or critical services:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: critical-service
- maxSkew: 3 # More flexible at node level
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: critical-service

3. Group related non-critical services together:
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - related-service
        topologyKey: kubernetes.io/hostname

Resilience Engineering with Topology Controls
Proper distribution policies form the foundation of resilience engineering in Kubernetes. Here are a few best practices to help you boost the resilience of your Kubernetes clusters.
Multi-Level Resilience Strategy
For comprehensive resilience, implement constraints at multiple levels:
# Comprehensive resilience configuration
topologySpreadConstraints:
- maxSkew: 1 # Strict region balance
  topologyKey: topology.kubernetes.io/region
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: critical-service
- maxSkew: 1 # Strict zone balance within regions
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: critical-service
- maxSkew: 2 # More flexible node balance
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: critical-service

Cascading Constraint Patterns
Design constraints to cascade from strict to flexible:
- Hard constraints at broad topology levels (region, zone)
- Softer constraints at narrow levels (node, rack)
- Fallback provisions for scheduling when an ideal distribution isn’t possible
# Hard requirement at zone level
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: resilient-app

Combined with:
# Soft preference at node level
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 90
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - resilient-app
        topologyKey: kubernetes.io/hostname

Real-World Implementation Patterns
Different workload types require distinct distribution strategies:
Pattern 1: Global Service with Regional Presence
For services that need global distribution with a local presence:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: global-cache
spec:
  replicas: 12
  selector:
    matchLabels:
      app: global-cache
  template:
    metadata:
      labels:
        app: global-cache
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/region
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: global-cache
      - maxSkew: 2
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: global-cache

This ensures even distribution across all regions with reasonable zone distribution.
Pattern 2: Stateful Application with Cross-Zone Resilience
For database clusters and stateful applications:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: distributed-database
spec:
  serviceName: distributed-database
  replicas: 5
  selector:
    matchLabels:
      app: distributed-database
  template:
    metadata:
      labels:
        app: distributed-database
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - distributed-database
            topologyKey: topology.kubernetes.io/zone

This guarantees no two database instances share an availability zone, maximizing resilience against zone failures. Note that this requires at least as many zones as replicas; with fewer zones, the extra replicas will stay Pending.
Pattern 3: Performance-Sensitive Microservices
For microservices that benefit from proximity but need basic resilience:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - cache-service
              topologyKey: kubernetes.io/hostname
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - api-service
              topologyKey: kubernetes.io/hostname

This balances the need for API services to be near their caches while maintaining some separation between API instances.
Pattern 4: Cost-Optimized Non-Critical Services
For cost-sensitive workloads where some resilience is desired but not critical:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      topologySpreadConstraints:
      - maxSkew: 5 # Allow significant imbalance
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: batch-processor

This provides basic distribution while allowing for substantial bin-packing and resource efficiency.
Common Pitfalls and Misconfigurations
Even experienced Kubernetes engineers can fall into scheduling traps. Understanding common pitfalls can help avoid service disruptions and performance issues.
Pitfall 1: Overly Strict Anti-Affinity
Setting hard anti-affinity without considering cluster size can lead to scheduling failures:
# Problematic on small clusters
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - app-name
    topologyKey: kubernetes.io/hostname

Solution: Use preferred anti-affinity or topology spread constraints with ScheduleAnyway for smaller clusters or non-critical services.
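As a sketch of the first option, the same rule can be softened into a preference (reusing the placeholder app-name label):

# Softer alternative: prefer node separation instead of requiring it
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - app-name
      topologyKey: kubernetes.io/hostname

With a preference, the scheduler still spreads pods when it can but falls back to co-location on small clusters instead of leaving pods Pending.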
Pitfall 2: Conflicting Affinity Rules
Contradictory rules can create scheduling impossibilities:
# Conflicting rules
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - service-a
      topologyKey: kubernetes.io/hostname
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - service-a
      topologyKey: kubernetes.io/hostname

Solution: Carefully review affinity rules for logical consistency before deployment.
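One way to make the intent consistent is to attract the pod to a dependency while repelling only its own replicas; here, service-b is a hypothetical dependency label used for illustration:

# Consistent intent: co-locate with a dependency, spread own replicas
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - service-b # the dependency, not the pod's own label
        topologyKey: kubernetes.io/hostname
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - service-a
        topologyKey: kubernetes.io/hostname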
Pitfall 3: Excessive Node Specialization
Over-using node selectors and taints alongside affinity rules can severely restrict scheduling options:
# Too restrictive
nodeSelector:
  disktype: ssd
  cpu: highperf
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - database
      topologyKey: kubernetes.io/hostname

Solution: Minimize node specialization and use soft preferences where possible.
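For example, a hard nodeSelector can often become a weighted node-affinity preference; this is a sketch reusing the disktype label from above:

# Softer alternative: prefer SSD nodes instead of requiring them
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 70
      preference:
        matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd

The pod still lands on SSD nodes when capacity allows, but can schedule elsewhere rather than stay Pending when specialized nodes are full.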
Pitfall 4: Ignoring Scaling Implications
Distribution policies that work for small deployments may fail during scale-up:
# Works for 3 replicas, fails at 10+
spec:
  replicas: 3 # Later scaled to 10+
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-app
            topologyKey: kubernetes.io/hostname

Solution: Design distribution policies with maximum potential scale in mind.
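A scale-tolerant sketch replaces the hard anti-affinity with a topology spread constraint that degrades gracefully once replicas outnumber nodes:

# Spreads evenly while nodes are available, but never blocks new replicas
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: web-app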
Pitfall 5: Forgetting About Resource Constraints
Distribution policies can conflict with resource availability:
# May cause scheduling failures
resources:
  requests:
    memory: 16Gi
    cpu: 4
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: resource-heavy

Solution: Ensure your distribution strategy accounts for the resource profile of your workloads.
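One way to reconcile the two, sketched here, is to relax the constraint so large pods can still pack onto whichever nodes have capacity:

# Relaxed spread for resource-heavy pods
topologySpreadConstraints:
- maxSkew: 2 # tolerate some imbalance when nodes are tight
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway # fall back instead of leaving pods Pending
  labelSelector:
    matchLabels:
      app: resource-heavy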
Conclusion
When properly implemented, these three pod scheduling mechanisms lay the groundwork for high-performance applications that remain available despite infrastructure issues while optimizing resource use.
Here are the takeaways from our exploration:
Balance resilience and efficiency:
- Implement stricter constraints at broader topology levels
- Use more flexible constraints at narrow levels
- Consider resource implications of distribution strategies
Apply context-appropriate patterns:
- Consider service criticality, scale, and performance requirements
- Adapt strategies to your specific cluster topology and size
- Test distribution policies at target scale before production deployment
By following the best practices presented in this article, your Kubernetes infrastructure can join the ranks of top-performing environments.