Kubernetes on ARM: a case study

At KrakenSystems we're working with various IoT devices. They are our main infrastructure for collecting data and sending them to further aggregating pipelines. For now, they are implemented as Beaglebone black devices, armv7l hard float CPU, AM335x 1GHz ARM® Cortex-A8 and only 512MB RAM. In this blogpost we're covering use case and rationale for using kubernetes on such underpowered devices.

Those devices perform simple services. Reading Modbus registers, or xbee protocol, or attaching to OBD (On-Board Diagnostic for vehicles), parsing the data, serializing in the protobuf format and sending on the message bus.

Design criteria & implementation

Deployment format

We want to deploy software as an immutable binary/container. Due to C++ build process arcane setup at KrakenSystems, and plethora of shared library dependencies container setup makes the most sense for this use case. The static binary is also a viable alternative, but that would require refactoring C++ our current build system written as bash/Makefile script collection running for about 15min from 0, and about a couple of minutes on CI after caching.

Another solution was deploying services bare metal. In this legacy setup there was a dedicated shared library folder per service and we did LD_LIBRARY_PATH trickery for shared library version management, defeating having shared libraries in the first place. Yet due to build system current state making static binary was (dev) time-consuming.

Kuberentes with its container management solution fits perfectly to our use-case. Nomad or plain old docker/cri-o/rkt would also satisfy this design criterion. Static binaries with systemd are also a satisfactory choice if it were simple to do in the present codebase state.

Monitoring

Node & service aliveness monitoring is critical. We require some agent running on the node and sending I'm alive to some system, together with a mature alerting pipeline. Consul is one solution. Kubernetes has this out-of-the-box, and together with Prometheus alerting rules seemed like a natural fit. We're also using Prometheus/grafana/alertmanager throughout our infrastructure which made this option more appealing.

Additionally, liveness and readiness health check aren't particularly useful for the edge devices since the process crashing signals the issue. They are not server component requiring accepting client connections.

Nevertheless, in the future, we plan to introduce liveness checks on the services as a failsafe mechanism in case service isn't sending data on the message bus -- its main purpose.

The remaining Ascalia infrastructure is on the kubernetes, thus it made sense reusing those same tools and setup for our edge devices. Less different moving parts is always better and leads to operational simplicity despite kubernetes being not simple to operate.

Updates

The edge devices aren't static islands forever resting int the Pacific Ocean. The code changes often, and configuration even more frequently.

The services are designed for simplicity. Their configuration is saved as YAML file under inotify watch for changes. Thus any update mechanism is possible in the future as sidecar, but keeps the development complexity in check. Furthermore, it's easier to debug.

Per edge device configuration is stored in the RDBMS, Postgres in this instance. Having 100s or 1000s edge devices polling the RDBMS for simple key/value pairs wouldn't end up nicely. Furthermore, there's no push style notification from the RDBMS on key update. Thus we need some additional layer in between.

We're reusing the kubernetes API server and it's backing key/value etcd store. We've defined each edge device as CRD (custom resource definition) object supporting rich and domain-specific information. The kubernetes also server as primitive inventory management supplementing the real Django backed for the operations (i.e. I don't care what Django does as long as updates the right REST endpoints in the kubernetes API)

In the future it's possible edge services shall watch the backing key/value store itself, whether it's kube api server, etcd, consul, riak, redis, or any other common key/value implementation.

Finally, we require async updates. The devices could be offline at the update application time. This rules out all non-agent based configuration management solutions. Ansible, our favorite configuration management tool for its simplicity and power is only used for initial setup, not update procedure (service update that is).

Wireguard VPN setup

Since we're using wireguard VPN solution we need to keep client server IP/public key list in sync asynchronously. This entails having an additional agent on the edge device you have to monitor, track and make sure it's alive.

We also need storing the offline device's public key and easy inspection for those keys/settings. The kubernetes CRDs are the natural fit for this role. We reuse the etcd backing store, have nice RBAC on those object and we've defined custom printer columns for easier VPN node management.

We used the following open-source inhouse tools:

Long story short we bootstrapped the wireguard VPN with wg-cni ansible role. This also installed wireguard based CNI for use in our kubernetes cluster.

The wg-cni role created our custom CRD manifests representing client/servers in the wireguard VPN topology.

After applying the manifests we started the wireguard operator daemonset keeping nodes in sync with further additions/removals.

Initial deployment

It wasn't without issue. We used kubespray as mature kubernetes deployment solution. It's the only complete solution for bare metal deployment. Being ansible based we're familiar with it and can easily extend it if necessary....and it was necessary.

We encountered myriad of problems:

missing support on ARM
Default pause image not supporting arm
missing cpuset (kernel update to 4.19 LTS solved it)
Run into space issues a few times
Flannel missing multi-arch support in kubespray (( before we transitioned to wireguard CNI for good ))
...

Most of these are tracked in the following issue/PRs:

Issues:

PRs:

After successfully applying the default container runtime, docker, it was time for basic performance analysis.

Initial performance analysis

Basic checklist

eMMC is mounted without atime
using armhf binaries ( readelf -A $(which kubelet | grep Tag_ABI_VFP_args)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

USE

debian@bbb-test:~$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0   6088  19916 271064    0    0    25    20   17   32 18 12 69  0  0
 0  0      0   6120  19916 271064    0    0     0     0 1334 4098 13 20 67  0  0
 0  0      0   6120  19916 271064    0    0     0     0 1554 4046 13 19 68  0  0
 0  0      0   6120  19924 271056    0    0     0    16  929 2443 10  8 81  1  0
 0  0      0   6120  19924 271064    0    0     0     0 1611 4128 24 20 56  0  0
 0  0      0   6120  19924 271064    0    0     0     0  919 2443  6 11 83  0  0
 0  0      0   5996  19924 271064    0    0     0     0 1240 3312 29 28 42  0  0
 0  0      0   5996  19924 271064    0    0     0     0  958 2417 13  9 77  0  0
 3  0      0   5996  19924 271064    0    0     0     0 1915 5693 28 25 46  0  0
 0  0      0   5996  19924 271064    0    0     0     0 1089 3296 12 18 70  0  0

debian@bbb-test:~$ pidstat 30 1
Linux 4.19.9-ti-r5 (bbb-test)   02/25/2019      _armv7l_        (1 CPU)

04:59:26 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command

04:59:56 PM     0     26749    3.54    1.62    0.00    5.16     0  dockerd
04:59:56 PM     0     26754    0.44    0.37    0.00    0.81     0  docker-containe
04:59:56 PM     0     26784    0.00    0.07    0.00    0.07     0  kworker/u2:2-flush-179:0
04:59:56 PM     0     28814   10.08   10.79    0.00   20.88     0  kubelet
04:59:56 PM     0     29338    0.51    1.15    0.00    1.65     0  kube-proxy
04:59:56 PM   997     29734    1.42    0.37    0.00    1.79     0  consul
04:59:56 PM     0     30867    0.03    0.00    0.00    0.03     0  docker-containe
04:59:56 PM     0     30885    0.47    0.67    0.00    1.15     0  flanneld
04:59:56 PM  1000     31776    0.30    0.07    0.00    0.37     0  mosh-server

About 30% CPU is on the kubernetes without any meaningful work.

iostat -xz 1
sar -n DEV 1
sar -n TCP,ETCP 1

Don't show significant network pressure. Speedtest-cli shows 30MBit download/upload speeds which are more than sufficient for our use case.

In summary, there's high CPU usage with low disk, memory and network usage.

Performance analysis

Stracing kubelet shows about 66% is spent in the locks:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 65.66    1.671983        5006       334        60 futex
 11.77    0.299775         967       310           epoll_wait
  9.24    0.235263         364       647           nanosleep
  2.58    0.065766          31      2136           clock_gettime
  1.75    0.044623          38      1180        68 read
...
------ ----------- ----------- --------- --------- ----------------
100.00    2.546516                 10290       356 total

Though using pprof and tracing profiles showed more useful information:

debian@bbb-test:~$ wget http://127.0.0.1:10248/debug/pprof/profile?seconds=120
debian@bbb-test:~$ wget http://127.0.0.1:10248/debug/pprof/trace?seconds=120

From there we concluded:

25% time is spent in housekeeping
Changing --housekeeping-interval=10m from default 10s
Increasing node update period didn’t considerably affect CPU usage

This housekeeping is mostly for container metrics, which we don't really need them every 10s, once in a while is perfectly fine for our use case.

GODEBUG=gctrace=1,schedtrace=1000
gc 80 @1195.750s 0%: 0.070+217+0.19 ms clock, 0.070+57/63/59+0.19 ms cpu, 24->24->12 MB, 25 MB goal, 1 P

SCHED 1196345ms: gomaxprocs=1 idleprocs=1 threads=18 spinningthreads=0 idlethreads=6 runqueue=0 [0]

There are no big issues with go's GC nor scheduler in the kubelet process, thus haven't analyzed this further.
o

debian@bbb-test:~$ sudo perf stat -e task-clock,cycles,instructions,branches,branch-misses,instructions,cache-misses,cache-references
^C
 Performance counter stats for 'system wide':

         16,203.12 msec task-clock                #    1.000 CPUs utilized
     4,332,572,389      cycles                    # 267393.223 GHz                    (71.42%)
       911,023,486      instructions              #    0.21  insn per cycle           (71.40%)
        98,098,648      branches                  # 6054350.923 M/sec                 (71.41%)
        30,116,184      branch-misses             #   30.70% of all branches          (71.44%)
       885,259,275      instructions              #    0.20  insn per cycle           (71.45%)
         6,967,361      cache-misses              #    1.836 % of all cache refs      (57.16%)
       379,417,471      cache-references          # 23416495.155 M/sec                (57.14%)

      16.202385758 seconds time elapsed

We observe 30+% branch misprediction rate in the kubelet process. After further analysis this is system-wide. This cheap ARM processor has horrible branch prediction algorithms.

Improvements

We performed the following improvements:

nicked the docker and replaced it with CRI plugin. Concretely we used containerd
increased the housekeeping interval from 10s to 10m
throw away flannel for wireguard CNI (that is native routing mostly)

Average:        0         9    0.00    0.14    0.00    0.24    0.14     -  ksoftirqd/0
Average:        0        10    0.00    0.17    0.00    0.35    0.17     -  rcu_preempt
Average:        0       530    0.00    0.03    0.00    0.00    0.03     -  jbd2/mmcblk1p1-
Average:        0       785    0.00    0.07    0.00    0.21    0.07     -  haveged
Average:        0       818    0.03    0.00    0.00    0.00    0.03     -  connmand
Average:        0       821    4.64    4.47    0.00    0.00    9.12     -  kubelet
Average:        0      1416    0.14    0.07    0.00    0.03    0.21     -  fail2ban-server
Average:        0      1760    0.42    0.69    0.00    0.35    1.11     -  kube-proxy
Average:        0      3436    1.70    0.90    0.00    0.00    2.60     -  containerd
Average:        0      4274    0.07    0.03    0.00    0.07    0.10     -  systemd-journal
Average:        0     17442    0.00    0.38    0.00    0.17    0.38     -  kworker/u2:2-events_unbound
Average:        0     19070    0.00    0.03    0.00    0.00    0.03     -  kworker/0:2H-kblockd
Average:        0     26772    0.00    0.24    0.00    0.28    0.24     -  kworker/0:1-wg-crypt-wg0
Average:        0     28212    0.00    0.31    0.00    0.28    0.31     -  kworker/0:3-events_power_efficient

And in the steady state, we have ~15% CPU usage overhard for the monitoring benefits.
Still, quite a bit, though livable. Maybe cri-o would have lower overhead, though containerd's is pretty slim too.
We'll investigate how can we optimize the kubelet for even lower resource consumption by turning off unneeded features.

Summary

To summarize everything, is running kubernetes on the edge devices sane choice? Maybe.

For us so far so good, everything works with some considerable, though livable overhead.

Trying to only install Prometheus node_exporter, for example, shoots your CPU every scrape, and slows everything to a crawl for those few 100s milliseconds.

This hardware is quite underpowered and with bad branch prediction makes any software running on it weaker than on comparable armv8 or x86_64 architectures.

In the future we'll try to optimize things even further, hopefully reducing kubelet CPU overhead to a more reasonable percentage. We've tried rancher's k3s without a big difference (( actually worse performance since we couldn't change housekeeping interval ))

There's also KubeEdge project which looks promising for kubernetes on IoT.

Blog