Kubernetes: Can We Fix It With Insulating Tape? 👷
Roman Belshevitz
Posted on April 10, 2022
Readers asked me to write "something about Pods": closer to the surface of the sea, and simpler. Well, OK, I tried. Enjoy!
There are a few common incidents that can occur in a Kubernetes deployment or service. Let's discuss how to respond to them, assuming that our knowledge base and "toolbox" are modest.
🚧 1. Pod crashed
Uncover the cause of the crash and take corrective action. You can use the kubectl get pods command to get information about the crashed Pod.
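For illustration, the listing might look like this (the Pod name, restart count, and age here are made up):

$ kubectl get pods
NAME        READY   STATUS             RESTARTS   AGE
myapp-pod   0/1     CrashLoopBackOff   5          3m42s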
CrashLoopBackOff is a common error indicating a Pod constantly crashing in an endless loop. This error can be caused by a variety of issues, including:
- Insufficient resources: lack of resources prevents the container from loading
- Locked file: a file was already locked by another container
- Locked database: the database is being used and locked by other pods
- Failed reference: reference to scripts or binaries that are not present in the container
- Setup error: an issue with the init container setup in Kubernetes
- Config loading error: a server cannot load the configuration file (check your YAMLs twice!)
- Misconfigurations: a general file system misconfiguration
- Connection issues: DNS or kube-dns is not able to connect to a third-party service
- Deploying failed services: an attempt to deploy services/applications that have already failed (e.g. due to a lack of access to other services)
There are a few unobvious ways to manually troubleshoot the CrashLoopBackOff error.
🔬 Look at the logs of the failed Pod deployment
To look at the relevant logs, use this command:
$ kubectl logs [podname] -p
The -p flag tells kubectl to retrieve the logs of the previous failed instance, which will let you see what's happening at the application level. For instance, an important file may already be locked by a different container because it's in use.
🔬 Examine logs from preceding containers
If the deployment logs can't pinpoint the problem, try looking at logs from preceding instances. You can run this command to look at previous Pod logs:
$ kubectl logs [podname] -n [namespace] --previous
You can run this command to retrieve the last 20 lines of the preceding Pod's logs:
$ kubectl logs [podname] --previous --tail=20
Look through the log to see why the Pod is constantly starting and crashing.
🔬 List the events
If the logs don't tell you anything, try looking for errors in the event stream, where Kubernetes records everything that happened before your Pod crashed. You can run this command:
$ kubectl get events --sort-by=.metadata.creationTimestamp
Add a --namespace [mynamespace] as needed. You will then be able to see what caused the crash.
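To narrow the stream down to a single Pod, you can filter the events by the involved object (the Pod name is a placeholder):

$ kubectl get events --field-selector involvedObject.name=[podname]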
🔬 Look for "Back-off restarting failed container"
You may be able to surface errors that you can't find otherwise by running this command:
$ kubectl describe pod [name]
If you get "Back-off restarting failed container", this means your container suddenly terminated after Kubernetes started it.
Often, this is the result of resource overload caused by increased activity. Kubernetes provides liveness probes to detect and remedy such situations. As such, you need to manage resources for your containers and specify the right limits. You should also consider changing initialDelaySeconds so the software has more time to respond.
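For reference, here is a minimal liveness probe sketch with a longer initial delay; the /healthz path and port 8080 are assumptions, adjust them to your application:

livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint of your app
    port: 8080              # assumed container port
  initialDelaySeconds: 15   # give the app time to start before the first probe
  periodSeconds: 10         # probe every 10 seconds afterwards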
🔬 Increase memory resources
Finally, you may be experiencing CrashLoopBackOff errors due to insufficient memory resources. You can increase the memory limit by changing resources.limits in the container's resource manifest:
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "200Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
We're limiting the containerized stress tool (the polinux/stress image by Przemyslaw Ozgo) here. What an irony! 🙃
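If you want to try it yourself, apply the manifest and watch the Pod (assuming you saved it as memory-demo.yaml and the mem-example namespace exists):

$ kubectl apply -f memory-demo.yaml
$ kubectl get pod memory-demo -n mem-example --watch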
🚧 2. Cluster is unhealthy or overloaded
Take action to relieve the pressure. You can use the console tools, metrics or Lens GUI to get information about the CPU and memory usage of the cluster. See my article about resource management.
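If the metrics-server add-on is installed in your cluster, kubectl top gives a quick overview of where the pressure is:

$ kubectl top nodes
$ kubectl top pods --all-namespaces --sort-by=memory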
🚧 3. Services are unavailable
Investigate the cause of the outage and take corrective action. You can use the kubectl get svc command to get information about the unavailable service.
A common problem with a malfunctioning service is that of missing or mismatched endpoints. For example, it's important to ensure that a service actually connects to all the appropriate Pods: the service's selector must match the Pods' labels, and its targetPort must match the Pods' containerPort. Some other troubleshooting practices for services include (see the commands after this list):
- Verifying that the service works by DNS name
- Verifying that it works by IP address
- Ensuring that kube-proxy is functioning as intended
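A quick way to check the endpoint matching and the DNS name (the service and Pod names are placeholders, and nslookup assumes the container image ships it):

$ kubectl get endpoints [servicename]
$ kubectl exec -it [podname] -- nslookup [servicename]

An empty ENDPOINTS column in the first command means the service's selector matches no running Pods.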
🚧 4. Pod is stuck in the Pending state
You may want to restart your pods. Some possible reasons are:
- Resource requests or limits aren’t stated, or the software behaves in an unforeseen way. Check your resource limits or auto-scaling rules.
- A pod is stuck in a terminating state
- Mistaken deployments
- Requesting persistent volumes that are not available
Determine the cause of the problem and take corrective action. There are at least four methods to restart Pods, sketched below.
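For completeness, here is a sketch of those four methods; the deployment and Pod names, and the RESTARTED_AT variable, are placeholders:

$ kubectl rollout restart deployment [deploymentname]
$ kubectl scale deployment [deploymentname] --replicas=0
$ kubectl scale deployment [deploymentname] --replicas=1
$ kubectl delete pod [podname]
$ kubectl set env deployment [deploymentname] RESTARTED_AT=[timestamp]

The first performs a rolling restart with no downtime; scaling to zero and back recreates all Pods at once; deleting a Pod lets its controller recreate it; and changing an environment variable forces new Pods with a fresh spec.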
As you may see, the kubectl command will help you a lot. Consider that you have it instead of insulating tape!
Thanks to Ben Hirschberg from 🐦ArmoSec and Patrick Londa from 🐦BlinkOps for the inspiration. Healthy clusters to you!