Escaping the OOM Killer
David Gries
Posted on February 4, 2024
Ever wondered why certain Pods face the Kubernetes OOM killer despite ample available resources? Or encountered applications that try to exceed their configured memory limits, even though the same workload runs smoothly on a low-memory VM?
Resource Limits in Kubernetes
Containerizing applications has solidified its position as the go-to standard for modern infrastructure. Whether operating in a virtual environment or a bare-metal cluster, understanding the ins and outs of effective resource management is crucial.
In contrast to CPU limits, which merely throttle a Pod's processes once their CPU time slice is used up, hitting a memory limit is destructive: the offending process is killed by the underlying system's out-of-memory (OOM) killer. This shows up as the process exiting with code 137 (128 + SIGKILL). The process is not shut down gracefully, which already has to be considered during development.
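The 128-plus-signal-number convention behind exit code 137 can be reproduced in any shell; a quick sketch:

# Quick sketch: a process killed by SIGKILL (signal 9) exits with 128 + 9 = 137.
sleep 600 &
kill -KILL $!
wait $!
echo $?        # prints 137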
Problem 1: Cgroup Awareness
Unlike containerless environments, Kubernetes relies on cgroups to constrain system resources, which introduces some subtle challenges that may not be immediately apparent.
Let's dive into the first of these issues. Consider the following scenario, executed within a Pod whose memory limit is set to 4GiB:
...
resources:
  limits:
    cpu: "1"
    memory: 4Gi
  requests:
    cpu: 50m
    memory: 64Mi
...
When inspecting /proc/meminfo, the reported numbers have little to do with the configured limit:
bash-5.1$ cat /proc/meminfo
MemTotal: 8124016 kB
MemFree: 346844 kB
MemAvailable: 3358232 kB
Buffers: 999768 kB
Cached: 1690344 kB
...
This discrepancy arises because not all content in /proc is namespace-aware. The metrics shown are actually those of the Node the Pod is scheduled on. Tools like free, which pre-date cgroups, read this file to collect memory metrics.
So, what's the solution? While there is no namespace-aware drop-in replacement that provides the same metrics, there are ways to obtain similar information from within the container. A straightforward approach is to examine the files under /sys/fs/cgroup/memory/. In the example above, this yields:
bash-5.1$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
4294967296
This value matches the configured 4GiB limit exactly. It's vital to keep this in mind when developing applications meant to run in a containerized environment.
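Note that the exact path depends on the cgroup version used by the Node; the example above assumes cgroup v1. A minimal shell sketch that covers both hierarchies could look like this:

# Minimal sketch: read the container's memory limit in bytes.
# cgroup v2 exposes it as memory.max ("max" means unlimited),
# cgroup v1 as memory/memory.limit_in_bytes.
if [ -f /sys/fs/cgroup/memory.max ]; then
    cat /sys/fs/cgroup/memory.max
else
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes
fi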
Problem 2: Page Cache
Another, less obvious issue arises from Linux's page cache, because it factors into the memory.available metric. cAdvisor includes the page cache in the calculated used memory, which creates unnecessary memory pressure on the Kubelet. This poses a challenge, given that the page cache is supposed to be evictable memory: in scenarios where multiple applications rely heavily on the cache, the Node experiences heightened memory pressure, which can lead to the eviction of Pods.
The issue can be mitigated by aligning memory requests with limits, as shown in the sketch below. This guarantees that sufficient memory is reserved for all Pods scheduled on the Node. However, it remains more of a workaround than a resolution, as the underlying problem persists: the Kubelet does not evict caches; instead, it evicts the entire Pod.
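Building on the resources block from the first example, a sketch of this workaround could look as follows (only the memory request changes; the CPU values are kept as before):

resources:
  limits:
    cpu: "1"
    memory: 4Gi
  requests:
    cpu: 50m
    memory: 4Gi    # request raised to match the limit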
Node-pressure eviction is especially bad because configurations like PodDisruptionBudget and terminationGracePeriodSeconds are not considered in this scenario!
Given the intricacies of this subject, the information provided here only offers a high-level overview. For a more in-depth understanding, consider exploring the details presented in this GitHub issue; this specific comment in particular contains a concise summary of the matter.
Problem 3: Invisible OOM Kills
By default, Kubernetes enforces process separation through namespaces. This ensures that the container's main process is assigned PID 1, a crucial identifier in the Linux process hierarchy. It's responsible for the lifecycle of all sub-processes and is the only process Kubernetes monitors by default. Let's examine this with a quick look at a running container using ps:
root@ubuntu:/# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 2796 1408 ? S<s 18:25 0:00 sleep 604800
root 20 0.0 0.0 4636 3840 pts/0 S<s 19:12 0:00 bash
root 41 0.0 0.0 7068 3072 pts/0 R<+ 19:13 0:00 ps -aux
In the output, PID 1 corresponds to the main process, showcasing the default isolation within namespaces.
The OOM killer selects the process whose termination will free up the most memory, factoring in the oom_score of each process. The killed process therefore isn't always the main process of a container! Kubernetes, however, only reflects an OOM kill in its metrics when PID 1 is affected. Without other sufficient health checks, this invisibility can lead to a mismatch between the container's reported status and its actual state.
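To get a feeling for which process would be picked, the per-process scores can be inspected from within the container; a minimal sketch:

# Minimal sketch: list PID, oom_score, and command name for every process
# in the container. A higher score makes a process a more likely OOM victim.
for pid in /proc/[0-9]*; do
    printf '%s\t%s\t%s\n' "${pid##*/}" "$(cat "$pid/oom_score")" "$(cat "$pid/comm")"
done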
This leaves the lifecycle of such a child process unmanaged by Kubernetes. That may not pose significant issues if the main process functions as a proper init system, but it becomes problematic when terminated child processes are not handled correctly by the container's init process, leaving the Pod appearing to run without any apparent issues.
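One way to close this gap is a liveness probe that checks the critical child process directly. A minimal sketch, assuming a hypothetical child process named worker and that pgrep is available in the image:

livenessProbe:
  exec:
    # Fails once no "worker" process is running anymore,
    # e.g. after an invisible OOM kill, and restarts the container.
    command: ["pgrep", "worker"]
  periodSeconds: 10
  failureThreshold: 3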
Understanding these details of process isolation and OOM handling is crucial for a predictable and stable environment.
Conclusion
In summary, effective management of Kubernetes memory constraints requires some understanding of namespaces and related challenges. Discrepancies in cgroup awareness, issues with Linux's page cache metrics in cAdvisor, and the invisibility of certain OOM kills underscore the need for a nuanced approach.
Mastering these intricacies is important to maintain a reliable Kubernetes infrastructure, optimizing resource utilization for containerized applications and preventing unexpected disruptions.