Scaling with Karpenter and Empty Pods (a.k.a. Overprovisioning) - Part 1

youngjin

Theo Jung

Posted on September 28, 2023

Introduction

In this article, we aim to compare Cluster Autoscaler (CA) and Karpenter in the context of node provisioning within AWS's managed service, Elastic Kubernetes Service (EKS). Additionally, we would like to introduce the operational principles of Karpenter.

Recently, interest in Microservices Architecture (MSA) and Kubernetes has been growing, and many companies on AWS are moving from on-premises or EC2/ECS environments to Elastic Kubernetes Service (EKS).

To provide more reliable service in the EKS environment, fast pod provisioning is crucial, and as pods multiply, node provisioning becomes necessary. However, ensuring service stability through rapid pod and node provisioning while also optimizing costs can be a challenging task.

Today, we'd like to share our team's approach to fast provisioning using Karpenter and a scaling strategy that leverages empty pods.

In Part 1, we'll explain Karpenter, and in Part 2, we'll delve into the scaling strategy using empty pods. Be sure to read the next post for more details.

Why should we use Karpenter in an EKS environment?

There are two well-known tools for automatically scaling the nodes of an AWS EKS cluster: the Kubernetes Cluster Autoscaler (CA) and Karpenter. Let's discuss why Karpenter might be the preferred choice.

First, it's essential to understand how CA operates, which is based on Auto Scaling Groups (ASGs).
Pods are deployed on one or more EC2 nodes, and nodes are provisioned through node groups associated with Amazon EC2 ASGs. CA monitors the EKS cluster for unscheduled pods and provisions nodes through ASGs when there are unscheduled pods.

There are two primary operations: provisioning when additional nodes are needed and deprovisioning when nodes need to be removed.

  • Provisioning involves adding nodes to the EKS cluster through ASGs to accommodate new pods.
  • Deprovisioning, on the other hand, entails removing nodes when scaling down is required.

Now, let's discuss why Karpenter might be a better choice.

Provisioning

How CA works

  • There are pending pods due to resource shortages.
  • CA increases the Desired count in the ASG.
  • AWS ASG provisions new nodes.
  • The kube-scheduler assigns pending pods to the newly provisioned nodes.

When CA provisions nodes, it decides based on the presence of pending (unschedulable) pods rather than on node resource utilization. CA raises the Desired count in the ASG, the ASG launches new nodes, and the pending pods are then scheduled onto those nodes.
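The scale-up steps above can be sketched roughly as follows. This is a toy model, not CA's actual code: the `AutoScalingGroup` class, the `scale_up` function, and the assumption that every pod is the same size are all made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class AutoScalingGroup:
    """Hypothetical stand-in for the ASG behind one node group."""
    desired: int
    max_size: int
    pods_per_node: int  # simplification: all pods are the same size


def scale_up(asg: AutoScalingGroup, pending_pods: int) -> int:
    """Return the new Desired count after a CA-style scale-up.

    The trigger is the number of pending pods, not node utilization.
    """
    if pending_pods <= 0:
        return asg.desired
    extra_nodes = math.ceil(pending_pods / asg.pods_per_node)
    # The ASG never grows beyond its configured maximum.
    return min(asg.desired + extra_nodes, asg.max_size)


asg = AutoScalingGroup(desired=3, max_size=10, pods_per_node=4)
print(scale_up(asg, pending_pods=5))  # 3 + ceil(5 / 4) = 5
```

Note that the only lever in this model is the Desired count: which instance type gets launched is decided by the ASG configuration, not by the pods that are waiting.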

Deprovisioning

In the case of CA, deprovisioning is decided from the resources in use on each node. A node becomes a candidate for removal when its utilization, measured as the sum of pod resource requests relative to the node's capacity, is below 50%. CA then checks whether the pods running on that node can be moved elsewhere, and if so, terminates the node and lowers the Desired count in the ASG accordingly.
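As a rough illustration of that check, here is a minimal sketch. It only looks at CPU, and it treats the free capacity on the other nodes as one pool instead of placing pods one by one, so it is more optimistic than the real CA; the `Node` class and `removable_nodes` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

SCALE_DOWN_UTILIZATION_THRESHOLD = 0.5  # the 50% figure from the text


@dataclass
class Node:
    name: str
    cpu_allocatable: float
    cpu_requested: float  # sum of the CPU requests of the pods on the node

    @property
    def utilization(self) -> float:
        return self.cpu_requested / self.cpu_allocatable

    @property
    def free_cpu(self) -> float:
        return self.cpu_allocatable - self.cpu_requested


def removable_nodes(nodes: List[Node]) -> List[str]:
    """Names of nodes that are under-utilized and whose pods fit elsewhere.

    Each candidate is evaluated independently of the others.
    """
    result = []
    for node in nodes:
        if node.utilization >= SCALE_DOWN_UTILIZATION_THRESHOLD:
            continue
        spare = sum(other.free_cpu for other in nodes if other is not node)
        if node.cpu_requested <= spare:
            result.append(node.name)
    return result


nodes = [
    Node("a", cpu_allocatable=4, cpu_requested=1.0),  # 25% utilized
    Node("b", cpu_allocatable=4, cpu_requested=3.5),
    Node("c", cpu_allocatable=4, cpu_requested=3.0),
]
print(removable_nodes(nodes))  # ['a']
```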

In both of the above cases, CA has a constraint where the node types are limited by the ASG associated with the node group. This means that more node types than necessary may be created, making cost optimization challenging.

Karpenter has evolved to address cost optimization more effectively and operate independently of ASGs. When comparing the differences between Karpenter and CA, there are three key aspects to consider:

  1. No Grouping Constraint: CA sends requests to ASGs, requiring the setup of multiple node groups to use various instance types. Karpenter, on the other hand, allows specifying a list of different instance types and dynamically allocates the most cost-efficient instance type that meets the conditions at provisioning time, within the available availability zones in a region.

  2. Bypassing kube-scheduler: CA watches for pods that the kube-scheduler could not place and then raises the Desired count of an ASG, so node creation goes through the ASG and is not immediate. Karpenter, however, watches for pending pods itself and launches nodes directly, without an ASG in between, for faster operation. Early Karpenter releases also bound the pending pods to the new node themselves, bypassing the kube-scheduler; more recent releases leave the binding to the kube-scheduler.

  3. Cost Optimization: Karpenter evaluates currently provisioned on-demand node instances and compares their prices and resources to determine if they can be consolidated into a more suitable node type. This operation allows for cost optimization, although it doesn't optimize spot instances (they are not included in the optimization).
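To make the first point concrete, here is a minimal sketch of "pick the cheapest instance type that satisfies the request". The type names and prices are invented placeholders, not real AWS offerings, and `cheapest_fit` is a hypothetical helper rather than Karpenter's actual selection logic, which also weighs availability zones and many other constraints.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: float
    memory_gib: float
    hourly_price: float  # made-up numbers, not real AWS prices


CANDIDATES = [
    InstanceType("small", vcpu=2, memory_gib=4, hourly_price=0.05),
    InstanceType("medium", vcpu=4, memory_gib=8, hourly_price=0.10),
    InstanceType("large", vcpu=8, memory_gib=16, hourly_price=0.20),
]


def cheapest_fit(
    candidates: List[InstanceType], cpu_needed: float, mem_needed: float
) -> Optional[InstanceType]:
    """Return the cheapest type that can hold the pending pods, or None."""
    fitting = [
        t for t in candidates
        if t.vcpu >= cpu_needed and t.memory_gib >= mem_needed
    ]
    return min(fitting, key=lambda t: t.hourly_price, default=None)


print(cheapest_fit(CANDIDATES, cpu_needed=3, mem_needed=6).name)  # medium
```

With CA, getting the same behavior would require one node group per instance type; here the list of allowed types is just data that is evaluated at provisioning time.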

With these three perspectives in mind, our team adopted Karpenter as the node provisioning tool. Now that we've discussed how CA operates and why Karpenter is chosen, let's delve into how Karpenter works in the next part.

Deep Dive into Karpenter

Before deep diving into how Karpenter operates, it's essential to examine its components.

Karpenter uses two key components, namely Provisioner and NodeTemplate, to rapidly provision nodes that meet specific conditions.

First, the Provisioner configures the constraints applied to the nodes it creates, such as the instance family, availability zone, and weight, and it points to a NodeTemplate through spec.providerRef.

Next, the NodeTemplate (the AWSNodeTemplate resource on AWS) is the object referenced by the Provisioner's spec.providerRef. It can be viewed as a template that defines the node to be provisioned, such as which AMI to run or which security group to use.
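The relationship between the two objects can be sketched as a pair of manifests, written here as Python dictionaries and printed as JSON. This assumes the v1alpha5-era API that uses Provisioner and AWSNodeTemplate; the names "default" and "my-cluster", the instance families, and the zones are placeholder values, not settings from our cluster.

```python
import json

# Assumption: v1alpha5-era Karpenter API. All names and values are placeholders.
node_template = {
    "apiVersion": "karpenter.k8s.aws/v1alpha1",
    "kind": "AWSNodeTemplate",
    "metadata": {"name": "default"},
    "spec": {
        # Which AMI family to run and which subnets / security groups to use,
        # selected by tag.
        "amiFamily": "AL2",
        "subnetSelector": {"karpenter.sh/discovery": "my-cluster"},
        "securityGroupSelector": {"karpenter.sh/discovery": "my-cluster"},
    },
}

provisioner = {
    "apiVersion": "karpenter.sh/v1alpha5",
    "kind": "Provisioner",
    "metadata": {"name": "default"},
    "spec": {
        "weight": 10,
        "requirements": [
            {"key": "karpenter.k8s.aws/instance-family",
             "operator": "In", "values": ["m5", "c5"]},
            {"key": "topology.kubernetes.io/zone",
             "operator": "In", "values": ["us-east-1a", "us-east-1b"]},
            {"key": "karpenter.sh/capacity-type",
             "operator": "In", "values": ["on-demand"]},
        ],
        # spec.providerRef ties the Provisioner to the node template above.
        "providerRef": {"name": node_template["metadata"]["name"]},
    },
}

print(json.dumps(provisioner, indent=2))
```

The Provisioner answers "what kind of capacity is allowed", while the node template answers "how is an instance of that capacity configured".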

How it works

First, let's look at the illustration of Karpenter's operation process below. As the picture shows, unlike CA, Karpenter operates independently of ASGs. Because of this, when there are not enough nodes to place pods on, nodes are provisioned just in time (JIT), so pods can be scheduled more quickly.

Figure: How Karpenter works

Provisioning

Figure: Karpenter log

The log of the Karpenter pod shows three pods in the Pending state and three new On-Demand nodes being created for them. How does this work? Provisioning only happens when certain conditions of the Provisioner are satisfied, so let's go through these conditions one by one.

Condition 1 - Resource Request: The Provisioner operates when a pending pod requests more resources than the existing nodes can provide.

Condition 2 - Node Selection: You can set a nodeSelector on the pod, using labels, to choose which Provisioner should handle it.

Condition 3 - NodeAffinity: The Provisioner operates when the pod's NodeAffinity conditions are met. NodeAffinity has two kinds of rules: requiredDuringSchedulingIgnoredDuringExecution (which must be satisfied) and preferredDuringSchedulingIgnoredDuringExecution (which should be satisfied whenever possible). In nodeSelectorTerms, you can select the desired Provisioner through key/value labels or through the requirements defined in the Provisioner.

Condition 4 - Topology Distribution: The Provisioner operates when the conditions specified in topologySpreadConstraints are met. These constraints spread pods across multiple nodes or zones and keep replicas of the same workload from piling up on a single node. Currently supported topologyKeys include topology.kubernetes.io/zone, kubernetes.io/hostname, and karpenter.sh/capacity-type.

Condition 5 - Pod Affinity/Anti-affinity: Pods can be assigned to nodes based on PodAffinity and PodAntiAffinity conditions. If a pod needs to be placed but no existing node satisfies its affinity conditions, the Provisioner runs and provisions a new node.
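Conditions 2 and 3 boil down to matching the pod's label constraints against what each Provisioner allows. The sketch below shows that idea for a plain nodeSelector. It is a simplification with hypothetical names: only the "In" operator is modeled, and a label key that a Provisioner does not mention is treated as unconstrained, which is not exactly how Karpenter handles custom labels.

```python
from typing import Dict, List, Set

# Hypothetical Provisioners, each reduced to: label key -> allowed values.
PROVISIONERS: Dict[str, Dict[str, Set[str]]] = {
    "on-demand": {"karpenter.sh/capacity-type": {"on-demand"}},
    "spot": {"karpenter.sh/capacity-type": {"spot"}},
}


def provisioner_matches(
    node_selector: Dict[str, str], requirements: Dict[str, Set[str]]
) -> bool:
    """True if every nodeSelector entry is allowed by the requirements."""
    for key, value in node_selector.items():
        allowed = requirements.get(key)
        # Assumption: a key the Provisioner does not mention is unconstrained.
        if allowed is not None and value not in allowed:
            return False
    return True


def matching_provisioners(node_selector: Dict[str, str]) -> List[str]:
    return [
        name for name, reqs in PROVISIONERS.items()
        if provisioner_matches(node_selector, reqs)
    ]


pod_node_selector = {"karpenter.sh/capacity-type": "spot"}
print(matching_provisioners(pod_node_selector))  # ['spot']
```

A pod without any nodeSelector matches both Provisioners in this model, which is where a setting such as weight can be used to express a preference between them.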

Deprovisioning

To optimize costs, Karpenter deprovisions nodes that were provisioned but are no longer needed. Deprovisioning is triggered by five conditions, each of which we will explore:

  • Provisioner Deletion

Nodes created by a provisioner are considered to be owned by that provisioner. Therefore, when the provisioner is deleted, the nodes it created are terminated, which initiates deprovisioning.

  • Empty

A node is deprovisioned once the ttlSecondsAfterEmpty specified in the provisioner has passed since its last non-DaemonSet pod disappeared.

  • Interrupt

Deprovisioning is triggered when node-related interruption events, such as Spot interruption notices or node terminations, are received through EventBridge and queued in SQS.

  • Expire

Nodes are terminated and deprovisioned when the ttlSecondsUntilExpired specified in the provisioner has elapsed since the node was provisioned.

  • Consolidation

To optimize costs, Karpenter compares the cost of one or more currently provisioned nodes and consolidates them into a single, more suitable node. This operation only applies to on-demand nodes and does not work with Spot instances.
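The Empty, Expire, and Consolidation conditions above can be sketched as two small checks. This is a toy model under stated assumptions: `NodeState`, `deprovision_reason`, and `consolidation_target` are hypothetical names, only CPU is considered, and the prices are invented.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NodeState:
    age_seconds: int
    empty_seconds: Optional[int]  # None while non-DaemonSet pods are running


def deprovision_reason(
    node: NodeState, ttl_after_empty: int, ttl_until_expired: int
) -> Optional[str]:
    """Return 'expired', 'empty', or None, mirroring the two TTL settings."""
    if node.age_seconds >= ttl_until_expired:
        return "expired"
    if node.empty_seconds is not None and node.empty_seconds >= ttl_after_empty:
        return "empty"
    return None


def consolidation_target(
    nodes: List[Tuple[float, float]],        # (hourly_price, cpu_requested)
    catalog: List[Tuple[str, float, float]],  # (name, cpu, hourly_price)
) -> Optional[Tuple[str, float, float]]:
    """Cheapest single type that holds all pods and costs less than today."""
    total_price = sum(price for price, _ in nodes)
    total_cpu = sum(cpu for _, cpu in nodes)
    fitting = [t for t in catalog if t[1] >= total_cpu and t[2] < total_price]
    return min(fitting, key=lambda t: t[2], default=None)


print(deprovision_reason(NodeState(age_seconds=600, empty_seconds=45), 30, 86400))
# empty
print(consolidation_target(
    nodes=[(0.10, 1.5), (0.10, 2.0)],
    catalog=[("medium", 4, 0.10), ("large", 8, 0.20)],
))
# ('medium', 4, 0.1)
```

In the second call, two nodes costing 0.20 per hour in total are replaced by one node costing 0.10 that still fits the 3.5 CPUs requested, which is the kind of saving consolidation is after.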

So far, we have delved into the components of Karpenter and how they operate. Do you notice the differences in how CA and Karpenter work? If it still seems unclear, why not try building it yourself?

Concluding the article

In this article, we didn't mention Empty Pods. To satisfy your curiosity, you'll have to read the next post.

In the upcoming article, we will focus on PriorityClass, Empty Pods, and the scaling strategy using Karpenter that we described today.
