Blazing fast Kubernetes scaling with ASG warm pools
Ole Markus With
Posted on April 19, 2021
Last week, AWS launched warm pools for auto scaling groups (ASG). In short, this feature allows you to create a pool of pre-initialised EC2 instances. When the ASG needs to scale out, it will pull in Nodes from the warm pool if available. Since these are already pre-initialised, the scale-out time is reduced significantly.
The warm pool is also virtually for free. You pay for the warm pool instances as you would for any other stopped instance. You also have to pay for the time the instances spend on initialising when entering the warm pool, but this is more or less cancelled out by the reduced time spent on initialising when entering the ASG itself.
What does pre-initialised mean, you wonder? Took a while for me to understand what that means too. What happens is that the EC2 instance boots, runs for a while, and then shuts down. We'll get back to this in a bit.
Now, what intrigued me is this bit from the warm pool documentation:
This is a very interesting claim. Why on earth would this feature not be possible for Kubernetes regardless of what shape and form self-managed could Kubernetes comes in?
AWS may have made interesting choices with their Elastic Kubernetes Service precluding them from taking advantage of warm pools, but there are alternatives!
Over the last year or so I have regularly contributed to kOps, which is my preferred way of deploying and maintaining production-ready clusters on AWS. I know well how it boots a plain Ubuntu instance and configures it to become a Kubernetes node, and I could not imagine implementing warm pool support would be any challenge. And turns out it was not either.
The results
This post will describe in detail some of the inner workings of kOps, how ASG behaves, and how to observe various time spans between when a scale-out is triggered and the node is ready for Kubernetes workloads.
If you are here only for the results, here is the TL;DR:
- Time between a scale-out is triggered and Pods start improved at least 50%.
- It should be possible to improve this even further.
- Most, if not all, of the functionality below will be available in kOps 1.21.
Summary
This table shows the number of seconds between CAS being triggered and Pods starting.
Configuration | First Pod started | Last Pod started |
---|---|---|
No warm pool | 149 | 190 |
Warm pool | 79 | 149 |
Warm pool plus life cycle hook | 76 | 98 |
Warm pool plus life cycle hook + warm-pull images | 70 | 79 |
And now for the details!
Initialising a Kubernetes node
First, let us have a look at what the process of converting a brand new EC2 instance to a Kubernetes Node means.
On a new EC2 instance, this happens on first boot:
- cloud-init installs a configuration service called
nodeup
. -
nodeup
takes the cluster configuration and installscontainerd
,kubelet
, and the necessary distro packages with their correct configurations. -
nodeup
establishes trust with the API server (control plane). -
nodeup
creates and installs asystemd
service forkubelet
. -
nodeup
starts thekubelet
service, which is the process on each node that manages the Kubernetes workloads. -
kubelet
pulls down all the images the control plane tells it to run and starts them as defined by Pod specs and similar.
When a kOps-provisioned instance reboots, nodeup
runs through all of the above again to ensure the instance is in the expected state. nodeup
is smart enough not to redo already performed tasks though, so the second run is quite fast.
Doing nothing at all
The most naïve way of implementing support for warm pools is to do nothing more than creating the warm pool. Unfortunately, this would start kubelet
, which will register the Node with the cluster. Since the AWS cloud provider does not remove instances in stopped
state, the control plane marks the Node NotReady
, but keep it around in case it comes back up.
It is not a catastrophe to have a large amount of NotReady
Nodes in the cluster, but any sane monitoring would not be too happy, and detecting actual bad Nodes would be harder.
Making nodeup
aware of the warm pool
The only thing I had to do to support warm pools gracefully was to make nodeup
conscious of its ASG lifecycle state.
Instances entering the warm pool have a slightly different lifecycle than instances that go directly into the ASG. nodeup
needs to do is to:
- check if the current instance ASG lifecycle state has a
warming:
prefix. - if it does not, install and start the
kubelet
service.
This way kubelet
does not start and join the cluster on first boot, but since we enabled the service, systemd
will start it on the second boot.
Comparing the difference
In this section I will take you through comparing the time it takes to scale out a Kubernetes Deployment with and without a warm pool enabled. The acid test is the interval between Cluster Autoscaler (CAS) reacts to the scale-out demand and all the Pods starting.
Configuration
In the kOps Cluster spec I ensured I had the following snippet:
spec:
clusterAutoscaler:
enabled: true
balanceSimilarNodeGroups: true
This enables the cluster autoscaler addon.
The number of new instances per ASG influences the scale-out time of the ASG itself. Adding 9 instances to one ASG is significantly slower than adding 3 instances to 3 ASGs. So, to ensure fair comparisons, we tell CAS to balance the ASGs.
On each of the InstanceGroups with the Node
role, I set the following capacity:
spec:
machineType: t3.medium
maxSize: 10
minSize: 1
The cluster will launch and with the with the minimum capacity.
I then created a Deployment that has resource.requirement.cpu
set:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
namespace: default
spec:
selector:
matchLabels:
app: nginx
replicas: 0
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:stable-alpine
ports:
- containerPort: 80
resources:
requests:
cpu: 1
I used the nginx:stable-alpine
image here as it is fairly small. I did not want image pull time to significantly impact the scale-out time.
Scaling out
To scale the Deployment, I executed the following:
kubectl scale deployment.v1.apps/nginx-deployment --replicas=10
A t3.medium
instance only has 2 cpus, some of which are reserved by other Pods already. So, increasing the replicas as above causes CAS to scale out one instance per replica.
Obtaining the test results
A good way of getting the details is to list the events for a Pod:
$ kubectl get events -o custom-columns=FirstSeen:.firstTimestamp,LastSeen:.lastTimestamp,Count:.count,From:.source.component,Type:.type,Reason:.reason,Message:.message --field-selector involvedObject.kind=Pod,involvedObject.name=nginx-deployment-123abc
For all the configurations in this post, I ran the following command on the first and last Pod to enter the Running
state.
For the first Pod, I got:
FirstSeen LastSeen Count From Type Reason Message
2021-04-17T08:56:30Z 2021-04-17T08:56:30Z 1 cluster-autoscaler Normal TriggeredScaleUp Pod triggered scale-up: [{Nodes-eu-central-1a 1->5 (max: 10)} {Nodes-eu-central-1b 1->4 (max: 10)}]
<snip>
2021-04-17T08:58:59Z 2021-04-17T08:58:59Z 1 `kubelet` Normal Started Started container nginx
For the last Pod I got:
FirstSeen LastSeen Count From Type Reason Message
2021-04-17T08:56:30Z 2021-04-17T08:56:30Z 1 cluster-autoscaler Normal TriggeredScaleUp Pod triggered scale-up: [{Nodes-eu-central-1a->5 (max: 10)} {Nodes-eu-central-1b 1->4 (max: 10)}]
<snip>
2021-04-17T08:59:40Z 2021-04-17T08:59:40Z 1 `kubelet` Normal Started Started container nginx
This means the first Pod started after 149s and the last Pod after 190s. These are the numbers I'll be comparing across all configurations. I also found it interesting to compare the difference between the first and last Pod start time.
Hunting for delays
This part may not be all that interesting. Here I just try to show which time spans can be interesting for improving the warming process, including explaining what may cause a 41 second difference between those two Pods.
ASG reaction time
If I look at the ASG activity, I see the following message on all instances the ASG launched:
At 2021-04-17T08:56:30Z a user request explicitly set group desired capacity changing the desired capacity from 1 to 5. At 2021-04-17T08:56:32Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 5.
However, if I look at the actual boot of the instances, I see these two lines for the first and last instance to boot respectively:
-- Logs begin at Sat 2021-04-17 08:56:51 UTC, end at Sat 2021-04-17 09:10:23 UTC. --
-- Logs begin at Sat 2021-04-17 08:57:13 UTC, end at Sat 2021-04-17 09:09:19 UTC. --
So even if the ASG launched the instances at the same time, they do not actually boot at the same time.
It looks like we can generally assume 20-30 seconds response time on an ASG scale-out.
Nodeup run time
We can see that the nodeup
spends a fairly consistent amount of time initialising the node. The times below show when nodeup
finished on the two Nodes.
Apr 17 08:58:27 ip-172-20-42-5 `systemd`[1]: kops-configuration.service: Succeeded.
Apr 17 08:58:50 ip-172-20-90-140 `systemd`[1]: kops-configuration.service: Succeeded.
The delay is almost the same. From boot until the instance has been configured takes about 95-100 seconds.
Kubelet becoming ready
The last part is the time it takes before node enters Ready
state.
kubelet
becomes ready once it has registered with the control plane and verified storage, CPU, memory, and networking is working properly.
Interestingly, this part further skewed the difference between our two Nodes:
Ready True Sat, 17 Apr 2021 11:19:08 +0200 Sat, 17 Apr 2021 10:58:53 +0200 KubeletReady `kubelet` is posting ready status. AppArmor enabled
Ready True Sat, 17 Apr 2021 11:14:56 +0200 Sat, 17 Apr 2021 10:59:25 +0200 KubeletReady `kubelet` is posting ready status. AppArmor enabled
This last leg took 26s and 35s for those two instances.
Enter the warm pool
So how much does adding a warm pool improve the scale-out time?
Wrap up this test by scaling the deployment back to 0 so CAS can scale down the ASGs again.
kubectl scale deployment.v1.apps/nginx-deployment --replicas=0
Adding a warm pool
This feature has not been released to a kOps beta at the time of the writing; the field names below may change.
Configuration
The following will make kOps create warm pools for our InstanceGroups.
warmPool:
minSize: 10
maxSize: 10
Specifically, this will configure a warm pool with 10 Nodes. I only do this to ensure I have a known amount of warm instances to make the the tests comparable. When using warm pools under normal operations, I would just use AWS defaults.
Apply the configuration and watch the warm pool instances appear:
$ kops get instances
ID NODE-NAME STATUS ROLES STATE INTERNAL-IP INSTANCE-GROUP MACHINE-TYPE
i-01ade8dad4c7ce0cd ip-172-20-114-104.eu-central-1.compute.internal UpToDate node 172.20.114.104 Nodes-eu-central-1c t3.medium
i-01f169e730f88e016 ip-172-20-118-255.eu-central-1.compute.internal UpToDate master 172.20.118.255 master-eu-central-1c.masters t3.medium
i-069e09e5873042cd7 ip-172-20-93-29.eu-central-1.compute.internal UpToDate master 172.20.93.29 master-eu-central-1b.masters t3.medium
i-09b42c88fbd3399ab ip-172-20-58-68.eu-central-1.compute.internal UpToDate node 172.20.58.68 Nodes-eu-central-1a t3.medium
i-0a85aed8869a30432 ip-172-20-59-147.eu-central-1.compute.internal UpToDate master 172.20.59.147 master-eu-central-1a.masters t3.medium
i-0b37f6a258d9c7775 ip-172-20-75-150.eu-central-1.compute.internal UpToDate node 172.20.75.150 Nodes-eu-central-1b t3.medium
i-0c16b3c668615f259 UpToDate node WarmPool 172.20.50.226 Nodes-eu-central-1a t3.medium
i-0cdf1c334d452c9a6 UpToDate node WarmPool 172.20.126.45 Nodes-eu-central-1c t3.medium
i-0d00f4debb586d17f UpToDate node WarmPool 172.20.116.222 Nodes-eu-central-1c t3.medium
i-0d04d04cb7f4be2d7 UpToDate node WarmPool 172.20.58.177 Nodes-eu-central-1a t3.medium
i-0d49db4113702292c UpToDate node WarmPool 172.20.59.66 Nodes-eu-central-1a t3.medium
i-0e07b6cc361d7a909 UpToDate node WarmPool 172.20.89.193 Nodes-eu-central-1b t3.medium
i-0f3451adef13005e7 UpToDate node WarmPool 172.20.114.9 Nodes-eu-central-1c t3.medium
<snip>
The kOps output only shows that the instance is in the warm pool, not if it has finished pre-initialisation. But if you go into the Instance Management part of the ASG in the AWS console I can see something like this:
As you can see, not all instances have entered warmed:stopped
state yet. But after waiting a bit longer, they are all ready and I can scale out.
kubectl scale deployment.v1.apps/nginx-deployment --replicas=10
Once all the Pods have entered the Running
state, let's look if this has improved.
Again, find the first and last Pod that entered the Running
state and list their events. The method is the same as last time, and it shows that the first Pod starts after 79s and the last one starts after 149s.
The first Pod starts 70 seconds faster than the first node without a warm pool. The last Pod starts 41 seconds faster than the last Pod without warm instances. That is a pretty decent improvement.
Without the warm pool, the difference between the first Pod and last Pod was 41s. This time it was a whopping 70s; half a minute more. We cannot seem to blame this on ASG response time or the time between the kubelet
starting and becoming ready.
So, what is happening here?
Turns out the time an instance runs before it is shut down is completely arbitrary. Some instances stay running for seconds, others for minutes. On the Node the first Pod is running on, nodeup
was allowed run to until completion, while on the Node of the last Pod, it was barely allowed to run at all. Luckily nodeup
is fairly good at knowing the current state of the Node and can pick up from where it left of regardless of when it was interrupted.
Enter lifecycle hooks
But we do not want nodeup
to be interrupted. Nor do we want the instance to stay running long after nodeup
has finished.
What AWS did to solve this is to let warming instances run through the same lifecycle as other instances. So, if you have a lifecycle hook for EC2_INSTANCE_LAUNCHING
it will trigger on warming Nodes too.
Amend the InstanceGroup spec as follows:
warmPool:
useLifecycleHook: true
This will make kOps provision a lifecycle hook that can be used by nodeup
to signal that it has completed its configuration.
Do the scale down/scale up dance again and observe the first and last Pod creation time.
First Pod 76s; last Pod 98s. That's 22s difference between the first and last Pod. Down from 40s without warm pool and certainly down from the 70s for warm pool without the instance lifecycle hook.
The effect of warm pool alone
With warm pool and some fairly easy changes to kOps, we did the impossible. We created warm pool support for self-managed Kubernetes.
The result is significantly faster response times to Pod scale-out. From up to 190 seconds without warm pool, to up to 98 seconds with warm pool and lifecycle hooks. The best case went from 149 seconds to 76 seconds.
That is 50% faster. Not bad!
Exploiting warm pools further
So far we have focused just on running a regular nodeup
run. But can we do better? Can we exploit warm pool to make Pod scale out even faster?
Most likely we cannot do much about what happens before nodeup
runs. Nodeup already runs fairly fast; roughly 10 seconds. There could be some optimisations there, but we would not be able to shave off many seconds.
But what about those 26-35 seconds between nodeup
completing and the Pods becoming ready?
All Nodes are running additional system containers such as kube-proxy
a CNI, in this case Cilium. All of these need to be present on the machine before much else happens. And those images are not necessarily small.
In my case, the time it took was the following:
FirstSeen LastSeen Count From Type Reason Message
2021-04-18T06:10:34Z 2021-04-18T06:10:34Z 1 `kubelet` Normal Pulled Successfully pulled image "k8s.gcr.io/kube-proxy:v1.20.1" in 19.120641887s
2021-04-18T06:10:55Z 2021-04-18T06:10:55Z 1 `kubelet` Normal Pulled Successfully pulled image "docker.io/cilium/cilium:v1.9.4" in 8.270050914s
So not only did it take a while for each of these images to be pulled, but they are pulled in sequence, adding up to 25+ seconds.
But during warming, nodeup
already knows about these images. What if it could just pull them so they were already present? Let's try this and see if this changes anything.
After another round of the scale-in/scale-out dance we describing the node and see the image has indeed been pulled during warming.
2021-04-18T06:35:37Z 2021-04-18T06:35:37Z 1 `kubelet` Normal Pulled Container image "k8s.gcr.io/kube-proxy:v1.20.1" already present on machine
2021-04-18T06:35:37Z 2021-04-18T06:35:37Z 1 `kubelet` Normal Started Started container kube-proxy
2021-04-18T06:35:37Z 2021-04-18T06:35:37Z 1 `kubelet` Normal Pulled Container image "docker.io/cilium/cilium:v1.9.4" already present on machine
2021-04-18T06:35:38Z 2021-04-18T06:35:38Z 1 `kubelet` Normal Started Started container cilium-agent
We also see that the containers have started at roughly the same time.
Comparing this configuration with previous configuration shows this did not bring down the starting time of the first Pod by many seconds. This time the first Pod started 70 seconds after CAS triggered. But the last Pod consistently trails only a few seconds behind, at about 79 seconds. This makes the improvement worthwhile.
Pulling in images during warming could also be used for any other containers that may run on a given Node. This is certainly valuable for any DaemonSet, but also for any other Pod that have a chance of being deployed to the Node. Worst-case is that you waste a bit of disk space should the Pods not be scheduled on the Node.
Wrap up
The feature is not even two weeks old, so I certainly have not had the time to explore all the ways this can be exploited. nodeup
typically does not expect instances to reboot so there may be optimisations that can be done there as well. For example, also on second boot, kubelet
is triggered by nodeup
, which may not be necessary. If nodeup
successfully created the kubelet
service on the first run, there should be zero changes done to the system on the second run. The only reason there would be a change is if the cluster configuration changed, and those should only be applied through an instance rotation.
I hope you are as excited about this as I am. And if you wonder when this all will be available to you, the answer is "shortly". Some of the functionality has already been merged to Kubernetes/kops
, and I obviously have the code for the rest more or less ready. I hope for all of this to be available in kOps 1.21 expected sometime in May.
Posted on April 19, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.