Kubernetes gone bust. Now what?
Ricardo Castro
Posted on January 17, 2021
Originally published on mccricardo.com.
We've been operating a few Kubernetes clusters. Someone trips, falls on a keyboard, and deletes several services. We need to (quickly!) get those back online.
We have several options to get things back to how they were:
- we have everything in version control - pipelines or GitOps reconcilers will take care of it (see the sketch after this list);
- restore an etcd backup - all Kubernetes objects are stored in etcd, and periodically backing up the etcd cluster data can be a lifesaver in disaster scenarios;
- use a dedicated Kubernetes backup tool - for example, Velero.
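If everything lives in version control, recovery can be as simple as re-applying the manifests and letting the cluster converge (the repository layout here is hypothetical):
kubectl apply --recursive -f manifests/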
A tool like Velero is great since it backs up Kubernetes objects and can also instruct your cloud provider to snapshot PersistentVolumes. That said, it has a ramp-up cost and we need something now.
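For reference, once Velero is set up, a backup and restore look roughly like this (the backup name and namespace are placeholders):
velero backup create apps-backup --include-namespaces my-apps
velero restore create --from-backup apps-backup
Backing up our etcd cluster, on the other hand, is always a safe bet and there are ways of doing that.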
For a while now I've been a fan of Henrik Kniberg's Earliest Testable/Usable/Lovable as an "opposition" to MVP.
With this in mind, what we want is a fast way to get a safety net (the skateboard) in case something goes wrong. Fortunately, etcd comes equipped with built-in snapshot capabilities.
Back up etcd
We need to identify a few things from the etcd deployment in order to make a backup.
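Assuming a kubeadm-style cluster where etcd runs as a static pod named etcd-<node-name> (an assumption about your setup), one quick way to pull up its manifest is:
kubectl -n kube-system get pod etcd-backup-control-plane -o yaml
The relevant excerpt: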
spec:
containers:
- command:
- etcd
- --advertise-client-urls=https://172.23.0.3:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --initial-advertise-peer-urls=https://172.23.0.3:2380
- --initial-cluster=backup-control-plane=https://172.23.0.3:2380
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://172.23.0.3:2379
- --listen-metrics-urls=http://127.0.0.1:2381
- --listen-peer-urls=https://172.23.0.3:2380
- --name=backup-control-plane
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
Armed with the advertise-client-urls, cert-file, key-file, and trusted-ca-file values, we can take a snapshot:
ETCDCTL_API=3 etcdctl --endpoints https://172.23.0.3:2379 \
--cacert="/etc/kubernetes/pki/etcd/ca.crt" \
--cert="/etc/kubernetes/pki/etcd/server.crt" \
--key="/etc/kubernetes/pki/etcd/server.key" \
snapshot save snapshotdb
{"level":"info","ts":1610913776.2521563,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"snapshotdb.part"}
{"level":"info","ts":"2021-01-17T20:02:56.256Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1610913776.2563014,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"https://172.23.0.3:2379"}
{"level":"info","ts":"2021-01-17T20:02:56.273Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1610913776.2887816,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"https://172.23.0.3:2379","size":"3.6 MB","took":0.036583317}
{"level":"info","ts":1610913776.2891474,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"snapshotdb"}
Snapshot saved at snapshotdb
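If etcdctl isn't installed on the host, the same snapshot can be taken from inside the etcd pod, assuming the stock kubeadm etcd image, which ships etcdctl (with etcd v3.4+ the v3 API is already the default). Since /var/lib/etcd is a hostPath mount, the snapshot file lands on the node:
kubectl -n kube-system exec etcd-backup-control-plane -- etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/lib/etcd/snapshotdb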
To be safe, we can verify the backup is OK:
ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshotdb
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 9b193bf0 | 1996 | 2009 | 2.7 MB |
+----------+----------+------------+------------+
Restore etcd
kube-apiserver uses etcd to store and retrieve cluster state and, as such, we need to stop it first. How to do that depends on how you have kube-apiserver configured.
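If kube-apiserver runs as a static pod (the kubeadm default, and an assumption here), moving its manifest out of the manifests directory makes the kubelet stop it:
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
Next, we restore etcd: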
ETCDCTL_API=3 etcdctl snapshot restore snapshotdb --data-dir="/var/lib/etcd-restore"
{"level":"info","ts":1610913810.5761065,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"snapshotdb","wal-dir":"/var/lib/etcd-restore/member/wal","data-dir":"/var/lib/etcd-restore","snap-dir":"/var/lib/etcd-restore/member/snap"}
{"level":"info","ts":1610913810.599168,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":7655}
{"level":"info","ts":1610913810.60404,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1610913810.6153672,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"snapshotdb","wal-dir":"/var/lib/etcd-restore/member/wal","data-dir":"/var/lib/etcd-restore","snap-dir":"/var/lib/etcd-restore/member/snap"}
We need to point etcd at this data directory and, once it's up and running, bring kube-apiserver back online. In a kubeadm setup, that means updating the etcd-data hostPath volume in the etcd static pod manifest:
volumes:
- hostPath:
path: /var/lib/etcd-restore
type: DirectoryOrCreate
name: etcd-data
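The kubelet picks up the manifest change and restarts etcd on the restored data. Once etcd is healthy, move the kube-apiserver manifest back into place (assuming the static pod setup from above) and confirm the deleted services are back:
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
kubectl get services --all-namespaces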
Although this looks a bit clunky, it's an easy way (the skateboard again) to ensure a safety net in case of disaster while buying time to work on a more capable solution (scooter -> bicycle -> motorcycle -> car). It might even come to the point where, for example, the bicycle is good enough.