Kubernetes Scheduler: How To Make It Work With Inter-Pod Affinity And Anti-Affinity

The Kubernetes scheduler decides on its own where to place your pods, and its decisions may not align with yours. Sometimes you can live with that. In other cases, though, this uncertainty can lead to degraded (or less than optimal) performance, increased costs, and availability problems.


Let’s dive into all those problems and find out how to solve them using inter-pod affinity and anti-affinity!

Before we begin, let’s briefly check what inter-pod affinity and anti-affinity are and how they are structured.

Inter-pod affinity

Affinity refers to attraction. So, inter-pod affinity means that a pod wants to run in the same topology domain as the pods it matches.

Here’s the pod spec for affinity:

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        topologyKey: topology.kubernetes.io/zone
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - backend
          topologyKey: topology.kubernetes.io/zone

Now let’s go over the details of the inter-pod affinity structure and how it impacts the Kubernetes scheduler:

  • requiredDuringSchedulingIgnoredDuringExecution – the conditions must be satisfied for the pod to be scheduled. This is also called a hard requirement.
  • preferredDuringSchedulingIgnoredDuringExecution – if a condition can be satisfied, it will be; if not, it is ignored. This is also called a soft requirement.
  • podAffinityTerm – the pod affinity term defines which pods we select with a label selector and which node topology key we target.
    • A soft requirement has podAffinityTerm as a separate property with an additional weight parameter that defines which term is more important.
    • A hard requirement has the affinity term directly as a root list item. For a hard affinity rule, all affinity terms and all expressions must be satisfied for the pod to be scheduled.

Inter-pod affinity requiredDuringSchedulingIgnoredDuringExecution terms and expressions are ANDed. This means everything must be satisfied for the pod to be scheduled, which makes the rule very strict when several conditions are used.
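
For example, in the hypothetical sketch below (the app=web and tier=cache labels are just placeholders), the pod is only scheduled into a zone that already runs both a pod matching app=web and a pod matching tier=cache:

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # both terms must be satisfied – matching only one of them is not enough
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        topologyKey: topology.kubernetes.io/zone
      - labelSelector:
          matchExpressions:
          - key: tier
            operator: In
            values:
            - cache
        topologyKey: topology.kubernetes.io/zone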

It’s easy to mix this up with node affinity, which is not as strict: its terms are ORed and only the expressions within a term are ANDed, so satisfying any single term is enough for the pod to be scheduled.
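
For comparison, here is a minimal node affinity sketch (the disktype label and the region value are made-up examples). The two nodeSelectorTerms are ORed, so a node only needs to match one of them:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # terms are ORed – a node matching either term is acceptable
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
        - matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - us-east1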

Inter-pod anti-affinity

Inter-pod anti-affinity is the opposite of affinity. It means pods don’t want to run in the same topology domain as the pods they match.

Here’s what the pod spec looks like:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        topologyKey: kubernetes.io/hostname
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - backend
          topologyKey: kubernetes.io/hostname

Let’s dive into the inter-pod anti-affinity structure:

  • requiredDuringSchedulingIgnoredDuringExecution – the conditions must be satisfied for the pod to be scheduled (hard requirement).
  • preferredDuringSchedulingIgnoredDuringExecution – if a condition can be satisfied, it will be; if not, it is ignored (soft requirement).
  • podAffinityTerm – the pod affinity term defines which pods we select with a label selector and which node topology key we target.
    • A soft requirement has podAffinityTerm as a separate property with an additional weight parameter that defines which term is more important.
    • A hard requirement has the affinity term directly as a root list item. For a hard anti-affinity rule, all affinity terms and all expressions must be satisfied for the pod to be scheduled.

Degraded performance problem – and how to solve it with an anti-affinity rule

When you run various workloads on your Kubernetes cluster, different workloads depend on different resources. Some workloads might be CPU-heavy or memory-heavy – these are easy to control by specifying the correct container resources.
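
For instance, here is a minimal sketch of such a resource specification (the container name and the request/limit values are arbitrary examples):

spec:
  containers:
  - name: cpu-heavy-app               # placeholder name
    image: registry.k8s.io/pause:2.0
    resources:
      requests:
        cpu: "500m"                   # reserved amount the scheduler uses for placement
        memory: "256Mi"
      limits:
        cpu: "1"                      # hard ceiling enforced at runtime
        memory: "512Mi"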

However, workloads can also make heavy use of the network or attached disks, which you can’t control directly through container resources. If you don’t give the Kubernetes scheduler any restrictions, several disk- or network-heavy workloads might land on the same node and overload its network or disks.

As a result, you’ll see degraded performance on disk- or network-dependent workloads once the node’s bandwidth limits are reached. You can control this, or at least minimize the risk of hitting those limits, by labeling pods and adding a pod anti-affinity rule.

Let’s say we have several workloads that use the network heavily. We can label all of them with a custom label like network-usage: high and define the following pod anti-affinity rule on them:

apiVersion: v1
kind: Pod
metadata:
  name: network-heavy-app
  labels:
    network-usage: high
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: network-usage
            operator: In
            values:
            - high
        topologyKey: kubernetes.io/hostname
  containers:
  - name: network-heavy-app
    image: registry.k8s.io/pause:2.0

By setting the label on a pod, we mark it as a heavy network user, and that label lets us identify all such pods.

The pod anti-affinity rule prevents pods with the label network-usage: high from being scheduled on the same node (topologyKey: kubernetes.io/hostname), isolating them from each other on different nodes.

High availability problem – and how to solve it with anti-affinity

Sometimes, the Kubernetes scheduler places several replicas of the same workload on the same node.

That creates a high availability problem – if that node goes down, all or a portion of the workload’s replicas go down with it, which can cause partial or full downtime of the application.

You can solve this problem using pod anti-affinity by targeting the application name and using the hostname topology key:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: highly-available-app
  labels:
    app: highly-available-app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: highly-available-app
  template:
    metadata:
      labels:
        app: highly-available-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - highly-available-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: highly-available-app
          image: registry.k8s.io/pause:2.0

The example above defines a deployment where each highly-available-app replica can only be scheduled on a separate node.
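
Keep in mind that this is a hard rule: with 10 replicas, the cluster needs at least 10 schedulable nodes, otherwise the remaining replicas stay Pending. If best-effort spreading is enough, the same rule can be expressed as a soft preference instead – a sketch of the affinity block that would replace the one in the pod template above:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - highly-available-app
                topologyKey: kubernetes.io/hostname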

Cost problem – reduce your network costs using an affinity rule

The cost of a Kubernetes cluster consists of the VM (CPU and RAM) price, storage price, network price, and the Kubernetes-as-a-service price.

Let’s talk about the network price. Cloud providers usually charge for traffic that leaves an availability zone. This means that network traffic between pods running in different availability zones costs money!

You can’t eliminate cross-zone traffic entirely, but you can still reduce costs significantly by placing pods that communicate heavily in the same availability zone. You can use inter-pod affinity with the zone topology key to achieve that:

apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - backend
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: web
    image: registry.k8s.io/pause:2.0

The example above defines a pod that can only be scheduled in the same zone as a pod matching app=backend. This affinity requirement can decrease the network costs between the web and backend pods.
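
This assumes the backend pods actually carry the app=backend label – something like the minimal sketch below (the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: backend
  labels:
    app: backend        # the label the web pod's affinity rule matches
spec:
  containers:
  - name: backend
    image: registry.k8s.io/pause:2.0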

Wrap up

Modifying the decision-making process of the Kubernetes scheduler with affinity and anti-affinity rules is a smart move. You can use such rules to avoid the kinds of problems that arise when pods are scheduled with the default behavior alone.

Affinities use labels for selecting targets, and it’s important to create a good labeling strategy for your Kubernetes ecosystem. Check out this post for essential labeling best practices: Kubernetes Labels: Expert Guide with 10 Best Practices
