Even mature cloud-native teams report challenges with Kubernetes implementation, particularly around complexity, scaling, and security. Resource utilization is another common pain point. According to the 2025 Kubernetes Benchmark Report, the average CPU utilization is just 10%, and memory utilization is only 23%.
This represents a significant opportunity for optimization.
This guide distills years of platform engineering experience into actionable enterprise Kubernetes best practices for implementations that are resilient to failures, secure by design, and optimized for resource utilization.
Whether you’re operating on AWS EKS, Azure AKS, Google GKE, or on-premises infrastructure, these principles will help you build enterprise-grade Kubernetes platforms that scale efficiently while controlling cloud costs.
Get the guide – Enterprise Kubernetes Best Practices
Resilience engineering in Kubernetes
Resilience—maintaining service availability despite infrastructure failures—requires deliberate design decisions in your Kubernetes architecture.
Here are four best practices for boosting your implementation’s resilience.
Multi-zone pod distribution with topology spread constraints
The foundation of resilient Kubernetes workloads is proper pod distribution across infrastructure failure domains. Topology spread constraints provide declarative control over how pods distribute across your cluster.
Here’s an example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resilient-application
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: resilient-application
        # Provide fallback for when zones have issues
        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: resilient-application

This configuration ensures your applications distribute evenly across availability zones while providing a fallback for node-level distribution. Our research shows that properly implemented topology constraints can reduce mean time to recovery (MTTR) by up to 43% during zone failures.
DevOps Pro Tip: For workloads with 2-3 replicas, use pod anti-affinity instead of topology spread constraints to ensure strict separation with minimal configuration.
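For small deployments, a required anti-affinity rule is enough to guarantee that replicas never share a zone. A minimal sketch, reusing the app: resilient-application label from the example above:

```yaml
# Pod template spec fragment: hard anti-affinity across zones.
# With 2-3 replicas this forces each pod into a different zone;
# scheduling fails if not enough zones are available.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: resilient-application
        topologyKey: topology.kubernetes.io/zone
```

Use preferredDuringSchedulingIgnoredDuringExecution instead if you would rather co-locate pods than leave them unscheduled when zones are constrained.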
Comprehensive health probes
Kubernetes offers three types of health probes, each serving a distinct purpose in your resilience strategy:
- Liveness probes – detect broken application states and trigger container restarts
- Readiness probes – control traffic routing so that only pods ready to serve requests receive them
- Startup probes – allow applications with lengthy initialization to avoid premature restarts
Here’s an example:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /started
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

Properly configured health probes can prevent many customer-impacting outages. Implementing them is not merely a best practice—it’s a critical component of site reliability engineering (SRE).
Pod disruption budgets: controlled maintenance
Pod disruption budgets (PDBs) protect application availability during voluntary disruptions like node upgrades or cluster scaling:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2  # or use maxUnavailable
  selector:
    matchLabels:
      app: critical-service

PDBs control how many pods of a workload can be down simultaneously during voluntary disruptions, ensuring service continuity. They work by:
- Defining the minimum number of pods that must remain available (minAvailable)
- Or defining the maximum number of pods that can be unavailable (maxUnavailable)
- Blocking voluntary disruptions when budget constraints are violated
PDB configuration guidelines:
- Align with replica count: Your PDB must be aligned with your deployment’s replica count. For example:
  - A replica count of 3 with maxUnavailable: 1 allows one pod to be disrupted, keeping two of three pods (about 67%) available
  - A replica count of 3 with minAvailable: 2 produces the identical outcome
  - A replica count of 1 with maxUnavailable: 1 permits the single pod to be disrupted, potentially causing a service outage
- Common misconfiguration: Setting maxUnavailable: 1 with a replica count of 1 allows the single pod to be evicted during node drains, causing service downtime. To prevent this, use minAvailable: 1 for single-replica workloads.
- Percentage-based configurations: You can also use percentages:
spec:
  minAvailable: "50%"  # or maxUnavailable: "50%"

This approach automatically adjusts as replica counts change.
- Multiple workloads: For applications with multiple components, ensure each deployment has its own PDB.
A survey revealed that 83% of organizations experienced Kubernetes-related outages, many of which were tied to upgrades due to improper planning or configuration.
GitOps-based disaster recovery
Modern disaster recovery leverages Infrastructure as Code (IaC) and GitOps patterns to enable rapid, consistent recovery:
- Store all infrastructure and application configurations in Git repositories
- Use declarative IaC tools like Terraform to manage cluster and cloud resources
- Implement automated recovery pipelines that can rebuild environments in minutes
- Test DR procedures regularly with chaos engineering practices
Organizations leveraging GitOps-based disaster recovery strategies see measurable improvements: 60% of users report faster repair times and 53% cite easier rollbacks, with automation reducing configuration errors.
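Tooling varies, but as one sketch, an Argo CD Application that continuously reconciles a cluster from Git looks roughly like this (the repository URL, path, and namespace are placeholders):

```yaml
# Argo CD Application: the cluster state is rebuilt from Git on demand,
# which is what makes GitOps-based recovery fast and repeatable.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-recovery
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/manifests.git  # placeholder repo
    targetRevision: main
    path: production  # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # revert out-of-band changes automatically
```

With prune and selfHeal enabled, pointing Argo CD at a fresh cluster is sufficient to restore the declared state, which is the core of the recovery pipelines described above.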
Get the one-pager: Resilience Engineering in Kubernetes
Resource optimization strategies
According to the 2025 Kubernetes Benchmark Report, the average CPU utilization was just 10%, while memory utilization averaged only 23%. This means nearly 90% of CPU and 77% of memory resources are wasted in typical Kubernetes deployments.
Here are five key areas you should improve to maximize your usage of compute resources.
Rightsizing workloads
The foundation of resource optimization is appropriate workload sizing:
resources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    memory: 512Mi

Here are a few best practices for production workloads:
- Set memory requests equal to limits to guarantee Quality of Service (QoS)
- Configure CPU requests based on 95th percentile usage in production
- Consider leaving CPU limits undefined to allow burst capacity
- Implement automated right-sizing with Vertical Pod Autoscaler
The report shows that automated memory request adjustments are particularly critical, as 5.7% of containers exceed their requested memory at some point in a 24-hour window, leading to instability and performance issues.
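A minimal Vertical Pod Autoscaler sketch, assuming the VPA components are installed in the cluster (the deployment name mirrors the examples in this guide):

```yaml
# VPA: observes actual usage and adjusts requests automatically.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment  # placeholder target
  updatePolicy:
    updateMode: "Auto"  # use "Off" to only surface recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 128Mi
```

Starting with updateMode: "Off" and reviewing the recommendations before enabling automatic updates is a common low-risk rollout path.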
Horizontal Pod Autoscaler for dynamic workloads
Horizontal Pod Autoscaler (HPA) enables automatic scaling based on metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 60

Here are four resource optimization best practices for HPA:
- Use custom metrics (requests-per-second, queue depth) over CPU when possible
- Configure appropriate scale-up/down behaviors to avoid thrashing
- Set minReplicas based on minimum acceptable availability, not minimum load
- Implement cluster autoscaler in conjunction with HPA for node-level scaling
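As a sketch of the custom-metrics approach, a Pods-type metric target can replace the CPU target in the HPA above. The http_requests_per_second metric name is an assumption and requires a metrics adapter (such as the Prometheus adapter) to be installed:

```yaml
# spec.metrics fragment for autoscaling/v2 HorizontalPodAutoscaler:
# scale so that each pod averages ~100 requests per second.
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second  # assumed metric exposed via an adapter
      target:
        type: AverageValue
        averageValue: "100"
```

Request-based targets track user-facing load directly, so they avoid the lag and noise of scaling on CPU for I/O-bound services.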
Cross-AZ traffic optimization with topology-aware routing
Topology-aware routing keeps service traffic within the same availability zone where possible, minimizing cross-zone networking costs in Kubernetes. Here’s how to configure it:
apiVersion: v1
kind: Service
metadata:
  name: app-service
  annotations:
    service.kubernetes.io/topology-aware-hints: "auto"
spec:
  selector:
    app: app-service  # illustrative label; match your pods' labels

This configuration enables topology-aware routing with the following behavior:
- Prefers endpoints in the same availability zone as the originating node.
- Falls back to cluster-wide (cross-zone) routing when zone-local endpoints are insufficient or the feature’s safeguards disable hints.
Depending on workload and cluster configuration, topology-aware routing can significantly reduce cross-AZ data transfer costs, potentially by 20-50% in well-optimized multi-zone Kubernetes deployments.
Ensure your cluster has topology-aware hints enabled and that nodes carry topology labels (e.g., topology.kubernetes.io/zone). Monitor traffic patterns with tools like Grafana or AWS CloudWatch to quantify the actual savings.
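Note that on Kubernetes 1.27 and later, the hints annotation has been superseded; the same behavior is enabled with the topology-mode annotation:

```yaml
# Service metadata fragment for Kubernetes 1.27+:
metadata:
  annotations:
    service.kubernetes.io/topology-mode: Auto  # replaces topology-aware-hints
```

Check which annotation your cluster version honors before rolling this out fleet-wide.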
Get the one-pager: Resource Optimization Strategies
Node selection and bin-packing
Advanced resource optimization requires strategic node selection:
- Consolidate workloads onto fewer, larger nodes rather than many small nodes
- Implement cluster autoscaler with bin-packing strategy
- Consider Spot/Preemptible Instances for stateless, fault-tolerant workloads
- Use node taints and tolerations judiciously, avoiding excessive node specialization
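For example, Spot node pools are commonly tainted so that only workloads that explicitly tolerate interruption are scheduled onto them. The lifecycle=spot taint key below is a common convention, not a Kubernetes standard:

```yaml
# Taint applied to the Spot node pool, e.g.:
#   kubectl taint nodes <node-name> lifecycle=spot:NoSchedule
# Pod spec fragment for stateless, fault-tolerant workloads:
tolerations:
  - key: lifecycle
    operator: Equal
    value: spot
    effect: NoSchedule
```

Stateful or latency-critical workloads simply omit the toleration and are kept off Spot capacity automatically.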
Best practice example
Different GPU instance types carry different Spot Instance prices, which opens up additional savings. Consider the AWS G5 family, where the GPU count varies with instance size.
Specifically, the g5.xlarge, g5.4xlarge, and g5.16xlarge instances (NVIDIA A10G Tensor Core GPU) are each equipped with one GPU, while the g5.24xlarge has four GPUs and the g5.48xlarge provides eight.
With the g5.16xlarge, users get a single GPU alongside substantial additional compute resources (CPU). Workloads that don’t require the GPU can consume those extra resources, making the computation more cost-effective.
On a price-per-GPU basis, the larger multi-GPU instances offer better value than several smaller single-GPU instances, and there is little per-GPU price difference between the 4-GPU (g5.24xlarge) and 8-GPU (g5.48xlarge) sizes. For tasks that use many GPUs, the larger instances are therefore the cheaper option.
Cost-effective compute selection
Our 2025 Kubernetes Benchmark Report shows that being flexible with compute options can yield significant savings:
- Consider both Arm and x86 architectures – Azure offers up to 65% savings with Arm CPUs.
- Leverage Spot Instances – organizations using a mix of On-Demand and Spot Instances realized 59% average savings, while Spot-only clusters achieved 77% savings.
- Be aware of regional price differences – for GPU workloads, selecting the optimal region and availability zone can reduce costs by 2-7x compared to average Spot prices.
What is the best region/AZ to run your AI workload?
When running GPU-heavy workloads, where you choose to run them can make a huge difference in cost.
The report analyzed prices for AWS p4d.24xlarge instances (equipped with 8 NVIDIA A100 GPUs) from January 2024 to February 2025, revealing significant cost variations across regions and AZs. Some regions and AZs are up to six times cheaper than the average.
If you can adjust your AI or GPU-intensive workloads to the most cost-efficient regions and AZs—rather than defaulting to higher-cost zones like us-east-1a—you could achieve massive savings:
🔹 2x-7x savings compared to the average Spot Instance price globally
🔹 3x-10x savings compared to the average On-Demand Instance price
GPU pricing and availability fluctuate frequently, so flexibility in choosing regions can be a powerful way to optimize costs and ensure you’re making the most of your cloud resources.
Security hardening for enterprise Kubernetes
Kubernetes security requires a defense-in-depth approach spanning multiple control points.
Here are several key areas teams should focus on when it comes to security:
Network policy implementation
Network policies provide Kubernetes-native microsegmentation:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Best practices for network security:
- Start with default-deny policies and explicitly allow required traffic
- Implement namespace isolation for multi-tenant clusters
- Monitor policy violations before enforcing in production
- Consider a service mesh for advanced traffic management and encryption
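With a default deny in place, traffic is then opened selectively. A sketch allowing only a hypothetical frontend to reach the backend on port 8080 (the app labels are illustrative):

```yaml
# Allow-list policy layered on top of default-deny-all.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend  # illustrative label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend  # illustrative label
      ports:
        - protocol: TCP
          port: 8080
```

Each additional flow gets its own narrowly scoped policy, keeping the default-deny posture intact.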
External secrets management
Decoupling secrets from application code and Kubernetes manifests is critical:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: database-credentials-k8s
  data:
    - secretKey: username
      remoteRef:
        key: database/credentials
        property: username
    - secretKey: password
      remoteRef:
        key: database/credentials
        property: password

Integrate with cloud provider secrets services (AWS Secrets Manager, Azure Key Vault, Google Secret Manager) or dedicated solutions like HashiCorp Vault.
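The vault-backend store referenced above must be defined separately. A sketch for HashiCorp Vault with Kubernetes auth, where the server address, role, and service account names are placeholders:

```yaml
# SecretStore backing the ExternalSecret's secretStoreRef.
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: https://vault.example.com  # placeholder Vault address
      path: secret                       # KV mount path
      version: v2                        # KV secrets engine version
      auth:
        kubernetes:
          mountPath: kubernetes
          role: external-secrets          # placeholder Vault role
          serviceAccountRef:
            name: external-secrets-sa     # placeholder service account
```

Cloud-provider backends (AWS Secrets Manager, Azure Key Vault, Google Secret Manager) follow the same pattern with a different provider block.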
Comprehensive container security
Container security requires a multi-layered approach:
- Static image scanning: Implement vulnerability scanning in CI/CD pipelines
- Container runtime security: Deploy runtime monitoring to detect behavioral anomalies
- CIS benchmark compliance: Regularly audit against industry security standards
Example Jenkins pipeline stage for container scanning:
stage('Security Scan') {
    steps {
        sh 'trivy image --severity HIGH,CRITICAL --exit-code 1 ${IMAGE_NAME}:${IMAGE_TAG}'
    }
}

Organizations with comprehensive container security programs experience significantly fewer Kubernetes security incidents. Studies show that advanced DevSecOps practices can reduce incident rates by up to 50% compared to organizations with inadequate security measures.
Research indicates that 89% of organizations face Kubernetes incidents annually, yet robust security approaches, including vulnerability management and runtime protection, substantially lower these risks.
Get the one-pager: Security Hardening for Kubernetes
Policy enforcement with OPA Gatekeeper
Enforce security and compliance requirements with policy-as-code:
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredprobes
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredProbes
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredprobes

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not container.livenessProbe
          msg := sprintf("Container %v must specify a livenessProbe", [container.name])
        }

Key policies to implement:
- Required resource requests/limits
- Restricted image repositories
- Enforcement of health probes
- Container security context requirements
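The ConstraintTemplate above only defines the policy; it takes effect once a matching constraint is created. For example, one that applies it to all Pods:

```yaml
# Constraint instance of the K8sRequiredProbes template.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredProbes
metadata:
  name: must-define-liveness-probe
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
```

Gatekeeper also supports enforcementAction: dryrun on the constraint, which logs violations without blocking admission, which is useful while rolling a policy out.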
Comprehensive audit logging
Enable and configure audit logging in your cloud provider’s Kubernetes service:
# AWS EKS example
aws eks update-cluster-config \
  --region us-west-2 \
  --name production-cluster \
  --logging '{"clusterLogging":[{"types":["audit"],"enabled":true}]}'

Audit log best practices:
- Export logs to a centralized SIEM solution
- Create alerts for suspicious administrative activities
- Implement appropriate retention policies based on compliance requirements
- Regularly review access patterns to detect potential security issues
Implementation roadmap
Implementing these best practices calls for a phased approach:
Phase 1: Assessment and Foundation (Weeks 1-4)
- Assess current Kubernetes implementation against best practices
- Implement resource requests/limits for all workloads
- Deploy basic health probes for critical services
- Establish baseline monitoring and observability
- Measure current CPU and memory utilization
Phase 2: Resilience Engineering (Weeks 5-8)
- Implement topology spread constraints for critical workloads
- Configure pod disruption budgets
- Deploy advanced health probes with all three probe types
- Test failover scenarios with controlled chaos engineering
Phase 3: Security Hardening (Weeks 9-12)
- Implement network policies in monitoring mode
- Deploy external secrets management
- Set up container image scanning in CI/CD pipelines
- Configure OPA Gatekeeper with critical policies
Phase 4: Resource Optimization (Weeks 13-16)
- Implement automated right-sizing for workloads
- Configure horizontal pod autoscalers with custom metrics
- Optimize node configurations and bin-packing
- Implement service topology for traffic optimization
- Consider Spot Instances for appropriate workloads
Phase 5: GitOps and Automation (Weeks 17-20)
- Deploy GitOps workflows for application deployment
- Implement infrastructure as code for all components
- Create automated disaster recovery procedures
- Develop continuous compliance monitoring
- Implement agentic autoscaling for dynamic resource management
Get the one-pager: Implementation Roadmap
Start implementing Kubernetes best practices today
Building resilient, secure, and cost-optimized Kubernetes infrastructures requires a deliberate approach spanning multiple disciplines—from platform engineering to site reliability engineering to security. By implementing the best practices outlined in this guide, organizations can achieve:
- Up to 99.99% availability for critical services through multi-zone pod distribution and comprehensive health probes
- 40-60% reduction in cloud infrastructure costs by right-sizing workloads and implementing Horizontal Pod Autoscaler (HPA)
- Up to 50% fewer security incidents by establishing comprehensive container security and policy enforcement
- Up to 70% reduction in downtime by using Pod Disruption Budgets (PDBs) for controlled maintenance and GitOps for disaster recovery
With average CPU utilization at just 10% and memory at 23%, most organizations have significant room for improvement. Organizations can dramatically reduce waste by applying these best practices—particularly around right-sizing, bin-packing, and strategically using Spot Instances—while maintaining or improving application performance and reliability.
The journey to Kubernetes excellence is continuous. Start with the highest-impact items—topology spread constraints, resource rightsizing, and network policies—and progressively implement the remaining best practices based on your organization’s priorities and resources.