Stop Messing with Kubernetes Finalizers


Martin Heinz

Posted on June 1, 2022


We've all been there - it's frustrating to watch the deletion of a Kubernetes resource get stuck, hang, or take a very long time. You might have "solved" this using the terrible advice of removing finalizers or running kubectl delete ... --force --grace-period=0 to force immediate deletion. 99% of the time this is a horrible idea, and in this article I will show you why.

Finalizers

Before we get into why force-deletion is a bad idea, we first need to talk about finalizers.

Finalizers are values in resource metadata that signal required pre-delete operations - they tell the resource's controller what operations need to be performed before the object can be deleted.

The most common one would be:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
  finalizers:
  - kubernetes.io/pvc-protection
...

Their purpose is to stop a resource from being deleted while a controller or Kubernetes Operator cleanly and gracefully cleans up any dependent objects, such as underlying storage devices.

When you delete an object that has a finalizer, a deletionTimestamp is added to its metadata, making the object read-only. The only exception to the read-only rule is that finalizers can still be removed. Once all finalizers are gone, the object is queued for deletion.
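You can check for this state yourself. As a hypothetical example with a PVC named my-pvc (any resource works the same way):

# If deletionTimestamp is set, Kubernetes is waiting for the listed
# finalizers to be removed before actually deleting the object.
kubectl get pvc my-pvc -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'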

It's important to understand that finalizers are just items/keys in resource metadata. They don't specify the code to execute - they have to be added and removed by the resource's controller.

Also, don't confuse finalizers with Owner References. The .metadata.ownerReferences field specifies parent/child relations between objects, such as Deployment -> ReplicaSet -> Pod. When you delete an object such as a Deployment, the whole tree of child objects can be deleted. This process (deletion) is automatic, unlike with finalizers, where the controller needs to take some action and remove the finalizer entry.
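You can see such a relation on any ReplicaSet created by a Deployment - a quick sketch, with an illustrative ReplicaSet name:

# Prints the kind and name of the owning object, e.g. "Deployment/my-deploy"
kubectl get replicaset my-deploy-7d4b9c -o jsonpath='{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}'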

What Could Go Wrong?

As mentioned earlier, the most common finalizers you might encounter are the ones attached to Persistent Volumes (PV) or Persistent Volume Claims (PVC). They protect the storage from being deleted while it's in use by a Pod. Therefore, if a PV or PVC refuses to delete, it most likely means it's still mounted by a Pod. If you decide to force-delete a PV, be aware that the backing storage in the cloud or other infrastructure might not get deleted with it, so you might leave behind a dangling resource that still costs you money.
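To find the Pod that is still holding on to the claim, you can search Pod volumes in the claim's namespace - a sketch assuming jq is installed and the claim is called my-pvc:

# List Pods whose volumes reference the claim
kubectl get pods -o json | jq -r '.items[]
  | select(.spec.volumes[]?.persistentVolumeClaim.claimName == "my-pvc")
  | .metadata.name'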

Another example is a Namespace getting stuck in the Terminating state because resources still exist in it that the namespace controller is unable to remove. Forcing deletion of a namespace can leave dangling resources in your cluster - for example a cloud provider's load balancer - which might be very hard to track down later.
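To see what's actually blocking a Terminating namespace, you can enumerate everything that still exists in it - a sketch using the illustrative namespace name my-namespace (this can be slow on clusters with many resource types):

# List all namespaced resource types, then list every object of each
# type that still lives in the stuck namespace
kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -n my-namespace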

While not necessarily related to finalizers, it's worth mentioning that resources can get stuck for many reasons other than waiting for finalizers:

The simplest example would be a Pod stuck in the Terminating state, which usually signals an issue with the Node on which the Pod runs. "Solving" this with kubectl delete pod --grace-period=0 --force ... will remove the Pod from the API server (etcd), but it might still be running on the Node, which is definitely not desirable.
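In that situation it's worth inspecting the Node before reaching for --force (the Pod and Node names below are illustrative):

# Find out which Node the stuck Pod runs on, then check that Node's state
kubectl get pod my-pod -o wide
kubectl describe node some-node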

Another example would be a StatefulSet, where force-deleting a Pod can create problems because Pods have fixed identities (pod-0, pod-1). A distributed system might depend on these names/identities - if the Pod is force-deleted but still runs on the Node, you can end up with two Pods with the same identity when the StatefulSet controller replaces the original "deleted" Pod. These two Pods might then attempt to access the same storage, which can lead to corrupted data. More on this in the docs.

Finalizers in The Wild

We now know that we shouldn't mess with resources that have finalizers attached to them, but which resources are these?

The three most common ones you will encounter in "vanilla" Kubernetes are kubernetes.io/pv-protection and kubernetes.io/pvc-protection, attached to Persistent Volumes and Persistent Volume Claims respectively (plus a couple more introduced in v1.23), as well as the kubernetes finalizer present on Namespaces. The last one, however, doesn't live in the .metadata.finalizers field but rather in .spec.finalizers - this special case is described in the architecture document.
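You can verify this special case yourself - note the path under .spec rather than .metadata (the namespace name is illustrative):

# Typically prints ["kubernetes"]
kubectl get namespace my-namespace -o jsonpath='{.spec.finalizers}{"\n"}'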

Besides these "vanilla" finalizers, you might encounter many more if you install Kubernetes Operators, which often perform pre-deletion logic on their custom resources. A quick search through the code of some popular projects turns up plenty of them.

If you want to find all the finalizers present in your cluster, you will have to run the following command against each resource type:

kubectl get some-resource -o custom-columns=Kind:.kind,Name:.metadata.name,Finalizers:.metadata.finalizers

You can use kubectl api-resources to get a list of resource types available in your cluster.
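Putting the two together, a small (and somewhat slow) shell loop can scan the whole cluster for finalizers:

# Iterate over every listable resource type and print name plus finalizers;
# errors for types you can't list can be safely ignored
for resource in $(kubectl api-resources --verbs=list -o name); do
  kubectl get "$resource" --all-namespaces --ignore-not-found \
    -o custom-columns=Kind:.kind,Name:.metadata.name,Finalizers:.metadata.finalizers
done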

Regardless of which finalizer is stopping deletion of your resources, the negative effects of force-deleting them are generally the same: something gets left behind, be it storage, a load balancer, or a simple Pod.

Likewise, the proper solution is generally the same: find the finalizer that's blocking the deletion, figure out its purpose - possibly by looking at the source code of the controller/operator - and resolve whatever is preventing the controller from removing the finalizer.

If you decide to force-delete the problematic resource anyway, this is how you'd remove its finalizers:

kubectl patch some-resource/some-name \
    --type json \
    --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'

One exception is the Namespace, which has a finalize API method that is normally called once all resources in the Namespace have been cleaned up. If the Namespace refuses to delete even when there is nothing left in it, you can call the method yourself:
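Note that the URL below assumes an API server proxy listening locally on port 12345, which you can start with:

kubectl proxy --port=12345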

cat <<EOF | curl -X PUT \
  localhost:12345/api/v1/namespaces/my-namespace/finalize \
  -H "Content-Type: application/json" \
  --data-binary @-
{
  "kind": "Namespace",
  "apiVersion": "v1",
  "metadata": {
    "name": "my-namespace"
  },
  "spec": {
    "finalizers": null,
  }
}
EOF

Building Your Own

Now that we know what they are and how they work, it should be clear that they're quite useful, so let's see how we can apply them to our own resources and workloads.

The Kubernetes ecosystem is built around Go, but for simplicity's sake I will use Python here. If you're not familiar with the Python Kubernetes client library, consider reading my previous article first - Automate All the Boring Kubernetes Operations with Python.
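For completeness, here is a minimal client setup that the following snippets assume - a sketch that loads credentials from your local kubeconfig:

import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException

# Build an ApiClient from ~/.kube/config; the snippets below reuse it
api_client = config.new_client_from_config()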

Before we start using finalizers, we first need to create a resource in the cluster - in this case a Deployment:

# initialize the client library...

deployment_name = "my-deploy"
ns = "default"

v1 = client.AppsV1Api(api_client)

deployment_manifest = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name=deployment_name),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={
            "app": "nginx"
        }),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "nginx"}),
            spec=client.V1PodSpec(
                containers=[client.V1Container(name="nginx",
                                               image="nginx:1.21.6",
                                               ports=[client.V1ContainerPort(container_port=80)]
                                               )]))))

response = v1.create_namespaced_deployment(body=deployment_manifest, namespace=ns)

The above code creates a sample Deployment called my-deploy, at this point without any finalizers. To add a couple of finalizers, we will use the following patch:

finalizers = ["test/finalizer1", "test/finalizer2"]

v1.patch_namespaced_deployment(deployment_name, ns, {"metadata": {"finalizers": finalizers}})

while True:
    try:
        response = v1.read_namespaced_deployment_status(name=deployment_name, namespace=ns)
        if response.status.available_replicas != 3:
            print("Waiting for Deployment to become ready...")
            time.sleep(5)
        else:
            break
    except ApiException as e:
        print(f"Exception when calling AppsV1Api -> read_namespaced_deployment_status: {e}\n")

The important part here is the call to patch_namespaced_deployment, which sets .metadata.finalizers to the list of finalizers we defined. Each of these must be fully qualified, meaning it must contain a /, with the prefix adhering to the DNS-1123 subdomain specification. Ideally, to make them more understandable, you should use a format like kubernetes.io/pvc-protection, prefixing the name with the domain of the service or controller responsible for the finalizer.

The rest of the code in the above snippet simply waits until all replicas of the Deployment are available, after which we can proceed with managing the finalizers:

from functools import partial

from kubernetes import client, watch

def finalize(deployment, namespace, finalizer):
    print(f"Do some pre-deletion task related to the {finalizer} present in {namespace}/{deployment}")
    ...

v1 = client.AppsV1Api(api_client)
w = watch.Watch()
# partial() binds the namespace argument of the list function the watch will call
for deploy in w.stream(partial(v1.list_namespaced_deployment, namespace=ns)):
    print(f"Deploy - Message: Event type: {deploy['type']}, Deployment {deploy['object']['metadata']['name']} was changed.")
    if deploy['type'] == "MODIFIED" and "deletionTimestamp" in deploy['object']['metadata']:
        # The object is pending deletion - process one finalizer per event
        fins = deploy['object']['metadata']['finalizers']
        f = fins[0]
        finalize(deploy['object']['metadata']['name'], ns, f)
        new_fins = list(set(fins) - {f})
        # A JSON Patch replacing the finalizer list with all entries except
        # the one we just processed (a list body makes the client send it
        # as "application/json-patch+json")
        body = [{
            "op": "replace",
            "path": "/metadata/finalizers",
            "value": new_fins
        }]
        resp = v1.patch_namespaced_deployment(name=deploy['object']['metadata']['name'],
                                              namespace=ns,
                                              body=body)
    elif deploy['type'] == "DELETED":
        print(f"{deploy['object']['metadata']['name']} successfully deleted.")
print("Finished namespace stream.")

The general sequence here is as follows:

1. We start by watching the desired resource - in this case a Deployment - for any changes/events.
2. We look for events that signal a modification of the resource and specifically check whether a deletionTimestamp is present in its metadata.
3. If it is, we grab the list of finalizers from the resource's metadata and start processing the first one.
4. We perform all necessary pre-deletion tasks with the finalize function.
5. Finally, we patch the resource with the original list of finalizers minus the one we just processed.

If the patch in Python looks complicated to you, then just know that it's equivalent to the following kubectl command:

kubectl patch deployment/my-deploy \
  --type json \
  --patch='[ { "op": "replace", "path": "/metadata/finalizers", "value": ["test/finalizer2"] } ]'

If the patch is accepted, we will receive another modification event, at which point we process the next finalizer. We repeat this until all finalizers are gone, at which point the resource gets automatically deleted.

Be aware that you might receive events more than once, so it's important to make the pre-deletion logic idempotent.

If you run the above code snippets and then execute kubectl delete deployment my-deploy, you should see logs like:

# Finalizers added to Deployment
Deploy - Message: Event type: ADDED, Deployment my-deploy was changed.
# "kubectl delete" gets executed, "deletionTimestamp" is added
Deploy - Message: Event type: MODIFIED, Deployment my-deploy was changed.
# First finalizer is removed...
Do some pre-deletion task related to the test/finalizer1 present in default/my-deploy
# Another "MODIFIED" event comes in, Second finalizer is removed...
Deploy - Message: Event type: MODIFIED, Deployment my-deploy was changed.
Do some pre-deletion task related to the test/finalizer2 present in default/my-deploy
# Finalizers are gone "DELETED" event comes - Deployment is gone.
Deploy - Message: Event type: DELETED, Deployment my-deploy was changed.
my-deploy successfully deleted.

The above demonstration using Python works, but isn't exactly robust. In a real-world scenario you'd most likely want to use an Operator framework - kopf in the case of Python, or more commonly Kubebuilder for Go. The Kubebuilder docs also include a whole page on how to use finalizers, including sample code.
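To give a taste of the difference, here is a minimal, hypothetical kopf sketch (assuming kopf is installed with pip install kopf and started with kopf run handlers.py) - kopf manages its own finalizer automatically whenever a delete handler is registered:

import kopf

# kopf adds a finalizer to matching objects and only removes it after
# this handler succeeds, blocking deletion until then
@kopf.on.delete('apps', 'v1', 'deployments')
def cleanup(name, namespace, **kwargs):
    print(f"Running pre-deletion cleanup for {namespace}/{name}")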

If you don't want to implement a whole Kubernetes Operator, you can also build a Mutating Webhook, which is described in the Dynamic Admission Control docs. The process would be the same - receive the event, process your business logic, and remove the finalizer.

Conclusion

One thing you should take away from this article is that you should think twice before using --force --grace-period=0 or removing finalizers from resources. There might be situations where it's OK to ignore a finalizer, but for your own sake, investigate before using the nuclear option, and be aware of the possible consequences - force-deleting might hide a systemic problem in your cluster.
