Understand how Linux containers work with practical examples

ivanmoreno

Iván Moreno

Posted on April 18, 2021


Nowadays the vast majority of server workloads run in Linux containers because they are flexible and lightweight, but have you ever wondered how Linux containers work? In this tutorial we will demystify how Linux containers work with some practical examples. Linux containers work thanks to two kernel features: namespaces and cgroups.


Linux Namespaces

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers. [1]

Currently the Linux kernel has 8 types of namespaces:

| Namespace | Isolates |
| --- | --- |
| cgroup | Cgroup root directory |
| IPC | System V IPC, POSIX message queues |
| Network | Network devices, stacks, ports, etc. |
| Mount | Mount points |
| PID | Process IDs |
| Time | Boot and monotonic clocks |
| User | User and group IDs |
| UTS | Hostname and NIS domain name |
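
On a running system, the namespaces a process belongs to are exposed as symbolic links under /proc/<pid>/ns, with targets that look like `pid:[4026531836]`. Here is a minimal sketch that lists them for the current shell. The helper name `ns_inode` is an assumption made for illustration and is not part of any tool:

```shell
# ns_inode: extract the inode number from a namespace link target such as
# "pid:[4026531836]". Hypothetical helper, defined only for this example.
ns_inode() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)]$/\1/p'
}

# Print one line per namespace type of the current shell, if /proc exposes them.
if [ -d "/proc/$$/ns" ]; then
    for link in /proc/$$/ns/*; do
        target=$(readlink "$link") || continue
        printf '%-18s %s\n' "$(basename "$link")" "$(ns_inode "$target")"
    done
fi
```

Two processes are in the same namespace of a given type when these inode numbers match.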

Linux control groups (cgroups)

Cgroups allow you to allocate resources — such as CPU time, system memory, network bandwidth, or combinations of these resources — among user-defined groups of tasks (processes) running on a system. You can monitor the cgroups you configure, deny cgroups access to certain resources, and even reconfigure your cgroups dynamically on a running system. [2]

Container Fundamentals (key technologies)

In this section we will practice with the following key technologies that make containers possible on Linux:

NOTE: This tutorial was written on a VM with 1 GB of RAM and 1 vCPU, running Debian 10 (buster) with kernel 4.19.0-16-amd64. All commands were executed with root privileges.

Process namespace fundamentals

A process (PID) namespace isolates a running command and its children from the host's process tree. Let's see how to create a PID namespace in Linux.

List process namespaces

$ lsns -t pid

Get the PID of the current terminal

$ echo $$ # parent PID

Launch a new zsh shell in its own PID namespace and start some background processes

$ unshare --fork --pid --mount-proc zsh
$ sleep 300 &
$ sleep 300 &
$ sleep 300 &
$ sleep 300 &
$ sleep 300 &
$ top

See the process tree from the parent

$ ps f -g <PPID>

List namespaces

$ lsns -t pid
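
To confirm that a process really lives in a different PID namespace, you can compare the namespace links of two processes. This is a small sketch: the function name `same_ns` is illustrative, and the PIDs you pass are whatever you observed in the previous steps.

```shell
# same_ns PID_A PID_B [KIND]: succeed when both processes are in the same
# namespace of the given kind (default: pid). Returns 2 if a link cannot be
# read. Illustrative helper; inspecting other users' processes may need root.
same_ns() {
    kind=${3:-pid}
    a=$(readlink "/proc/$1/ns/$kind") || return 2
    b=$(readlink "/proc/$2/ns/$kind") || return 2
    [ "$a" = "$b" ]
}

# A process always shares its namespace with itself.
if same_ns "$$" "$$"; then
    echo "PID $$ shares a PID namespace with itself"
fi
```

Run from the host, `same_ns $$ <host PID of the zsh started by unshare>` is expected to return 1, because that shell was forked into the new PID namespace.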

Filesystem Overlay FS fundamentals

Containers need a filesystem. One of the most widely used filesystems for containers is OverlayFS, which stacks several layers and merges them into a single directory; the lower layers are read-only and all changes are written to the upper layer. Let's see how OverlayFS works.

Create directories

$ cd /tmp
$ mkdir {lower1,lower2,upper,work,merged}

Create some files in lower directories

$ echo "Lower 1 - original" > lower1/file1.txt
$ echo "Lower 2 - original" > lower2/file2.txt

Create overlay FS

$ mount -t overlay -o lowerdir=/tmp/lower1:/tmp/lower2,upperdir=/tmp/upper,workdir=/tmp/work none /tmp/merged

Create, modify files

$ cd /tmp/merged
$ echo "file created in merged directory" > file_created.txt
$ echo "file 1 modified" > file1.txt

Umount overlay fs

$ cd /tmp
$ umount /tmp/merged

Inspect lower and upper dirs

$ find lower1 lower2 upper -name '*.txt' -type f 2>/dev/null | while read -r fn; do echo ">> cat $fn"; cat "$fn"; done
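
The lookup order OverlayFS applies when the same path exists in several layers can be imitated in plain shell, without mounting anything. The sketch below only simulates the precedence rule (upper layer first, then the lower layers from left to right as listed in lowerdir). It is not how the kernel implements it, and it ignores whiteouts and directory merging. The function name and the scratch directory are assumptions made for the example:

```shell
# overlay_lookup NAME LAYER...: print the path of NAME in the first layer that
# contains it. Pass the layers from highest to lowest priority.
overlay_lookup() {
    name=$1
    shift
    for layer in "$@"; do
        if [ -e "$layer/$name" ]; then
            printf '%s\n' "$layer/$name"
            return 0
        fi
    done
    return 1
}

# Recreate the layout used above in a throwaway directory.
demo=$(mktemp -d)
mkdir "$demo/lower1" "$demo/lower2" "$demo/upper"
echo "Lower 1 - original" > "$demo/lower1/file1.txt"
echo "Lower 2 - original" > "$demo/lower2/file2.txt"
echo "file 1 modified"    > "$demo/upper/file1.txt"   # the modified copy

# file1.txt resolves to the upper layer, file2.txt to lower2.
overlay_lookup file1.txt "$demo/upper" "$demo/lower1" "$demo/lower2"
overlay_lookup file2.txt "$demo/upper" "$demo/lower1" "$demo/lower2"
```

This matches what the inspection step shows: the modified file1.txt sits in upper while lower1/file1.txt keeps its original content.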

Networking Linux bridge fundamentals

Linux containers use network namespaces to isolate their network from the host. This is done with a bridge interface that acts like a network switch: every container connects to that bridge with its own IP address. Let's see how Linux bridges and network namespaces work.

Create a Network Virtual bridge

$ ip link add br-net type bridge

List Network Interfaces

$ ip link

Assign an IP Address to bridge interface

$ ip addr add 192.168.55.1/24 brd + dev br-net

Bring UP the bridge interface

$ ip link set br-net up

Create 2 Network Namespaces

$ ip netns add ns1
$ ip netns add ns2

Create a Virtual Ethernet cable pair

$ ip link add veth-ns1 type veth peer name br-ns1
$ ip link add veth-ns2 type veth peer name br-ns2

Assign veth to namespaces

$ ip link set veth-ns1 netns ns1
$ ip link set veth-ns2 netns ns2
$ ip link set br-ns1 master br-net
$ ip link set br-ns2 master br-net

Assign IP address to veth within namespaces

$ ip -n ns1 addr add 192.168.55.2/24 dev veth-ns1
$ ip -n ns2 addr add 192.168.55.3/24 dev veth-ns2

Bring UP veth interfaces within Namespaces

$ ip -n ns1 link set lo up
$ ip -n ns2 link set lo up
$ ip -n ns1 link set veth-ns1 up
$ ip -n ns2 link set veth-ns2 up

Bring UP bridge veth in the local host

$ ip link set br-ns1 up
$ ip link set br-ns2 up

Configure default route within namespaces

$ ip -n ns1 route add default via 192.168.55.1 dev veth-ns1 
$ ip -n ns2 route add default via 192.168.55.1 dev veth-ns2 

Enable IP forward in the host

$ sysctl -w net.ipv4.ip_forward=1

Configure MASQUERADE in the host for 192.168.55.0/24 subnet

$ iptables -t nat -A POSTROUTING -s 192.168.55.0/24 ! -o br-net -j MASQUERADE

Check connectivity within namespaces

$ ip netns exec ns1 ping -c 3 192.168.55.3 # ping ns2
$ ip netns exec ns2 ping -c 3 192.168.55.2 # ping ns1
$ ip netns exec ns1 ping -c 3 192.168.55.1 # ping br-net gateway
$ ip netns exec ns2 ping -c 3 192.168.55.1 # ping br-net gateway
$ ip netns exec ns1 ping -c 3 1.1.1.1 # ping internet
$ ip netns exec ns2 ping -c 3 1.1.1.1 # ping internet
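
Why does ns1 reach 192.168.55.3 directly but send traffic for 1.1.1.1 to the gateway? Before falling back to the default route, the kernel checks whether the destination lies inside a directly connected subnet. Below is a rough sketch of that check in shell arithmetic. The function names are made up for this example, and the logic is a simplification of real route lookup, which does longest-prefix matching over the whole routing table:

```shell
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
)

# in_subnet ADDR NETWORK PREFIXLEN: succeed if ADDR lies in NETWORK/PREFIXLEN.
in_subnet() {
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}

for dst in 192.168.55.3 1.1.1.1; do
    if in_subnet "$dst" 192.168.55.0 24; then
        echo "$dst: on-link, delivered through the bridge"
    else
        echo "$dst: sent to the default gateway 192.168.55.1"
    fi
done
```

Traffic that takes the second branch is the traffic that needs IP forwarding and the MASQUERADE rule on the host.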

Control groups (cgroups) fundamentals

Control groups, or cgroups, are used by containers to limit how much of the host's resources they can consume. Let's see how cgroups work.

Create cgroups directory

$ mkdir -p /mycg/{memory,cpusets,cpu}

Mount cgroups directory

$ mount -t cgroup -o memory none /mycg/memory
$ mount -t cgroup -o cpu,cpuacct none /mycg/cpu
$ mount -t cgroup -o cpuset none /mycg/cpusets

Create new directories under CPU controller

$ mkdir -p /mycg/cpu/user{1..3}

Assign CPU shares to every user (This example uses 1vCPU)

# 2048 / (2048 + 512 + 80) ≈ 77.6%
$ echo 2048 > /mycg/cpu/user1/cpu.shares
# 512 / (2048 + 512 + 80) ≈ 19.4%
$ echo 512 > /mycg/cpu/user2/cpu.shares
# 80 / (2048 + 512 + 80) ≈ 3.0%
$ echo 80 > /mycg/cpu/user3/cpu.shares
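
The percentages in the comments above come from dividing each group's shares by the sum of all shares competing for the same CPU. Here is a short sketch of that arithmetic. The function name is invented for the example, and shares only matter while the groups are actually competing for CPU time:

```shell
# share_pct SHARES...: print each value as a percentage of the total.
share_pct() {
    printf '%s\n' "$@" | awk '
        { shares[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%d shares -> %.1f%%\n", shares[i], 100 * shares[i] / total
        }'
}

share_pct 2048 512 80
```

With the three busy processes started in the next step, top should show CPU usage roughly in these proportions on a single vCPU.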

Create artificial load

$ cat /dev/urandom &> /dev/null &
$ PID1=$!
$ cat /dev/urandom &> /dev/null &
$ PID2=$!
$ cat /dev/urandom &> /dev/null &
$ PID3=$!

Assign one process to every user

$ echo $PID1 > /mycg/cpu/user1/tasks
$ echo $PID2 > /mycg/cpu/user2/tasks
$ echo $PID3 > /mycg/cpu/user3/tasks
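
You can check that a process really landed in the intended group by reading /proc/<pid>/cgroup, where each cgroup v1 line has the form hierarchy-ID:controller-list:path. The sketch below parses that format. `cgroup_path` is a hypothetical helper, and the sample file content is made up to resemble what the steps above would produce:

```shell
# cgroup_path CONTROLLER FILE: print the cgroup path recorded for CONTROLLER
# in a file formatted like /proc/<pid>/cgroup (cgroup v1).
cgroup_path() {
    awk -F: -v want="$1" '
        {
            n = split($2, ctrl, ",")
            for (i = 1; i <= n; i++)
                if (ctrl[i] == want) { print $3; exit }
        }' "$2"
}

# Made-up sample; on a real system pass /proc/$PID1/cgroup instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
5:memory:/
3:cpu,cpuacct:/user1
2:cpuset:/
EOF

cgroup_path cpu "$sample"
```

The path is relative to the root of the hierarchy, so a process written to /mycg/cpu/user1/tasks is expected to report /user1 for the cpu controller.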

Monitor the processes

$ top

Create a container from scratch

So far we have seen how Linux namespaces, OverlayFS, bridges and cgroups work. Now let's combine overlayfs, network namespaces, cgroups and process namespaces to create a container from scratch.

Download and extract the Debian container filesystem with Docker (Docker must already be installed; see "Install docker CE" below)

$ docker pull debian
$ docker save debian -o debian.tar
$ mkdir debian_layer
$ mkdir -p fs/{lower,upper,work,merged}
$ tar xf debian.tar -C debian_layer
$ find debian_layer -name 'layer.tar' -exec tar xf {} -C fs/lower \;

Create bridge interface

$ ip netns add cnt
$ ip link add br-cnt type bridge
$ ip addr add 192.168.22.1/24 brd + dev br-cnt
$ ip link set br-cnt up
$ sysctl -w net.ipv4.ip_forward=1
$ iptables -t nat -I POSTROUTING 1 -s 192.168.22.0/24 ! -o br-cnt -j MASQUERADE

Create the overlay filesystem from the Debian container filesystem

$ mount -vt overlay -o lowerdir=./fs/lower,upperdir=./fs/upper,workdir=./fs/work none ./fs/merged

Mounting Virtual File Systems

$ mount -v --bind /dev ./fs/merged/dev

Launch process namespace within fs/merged fs

$ unshare --fork --pid --net=/var/run/netns/cnt chroot ./fs/merged \
    /usr/bin/env -i PATH=/bin:/usr/bin:/sbin:/usr/sbin TERM="$TERM" \
    /bin/bash --login +h
# Mount proc within container
$ mount -vt proc proc /proc

Connect the container to br-cnt (run on the host, from a second terminal)

$ ip link add veth-cnt type veth peer name br-veth-cnt
$ ip link set veth-cnt netns cnt
$ ip link set br-veth-cnt master br-cnt
$ ip link set br-veth-cnt up
$ ip -n cnt addr add 192.168.22.2/24 dev veth-cnt
$ ip -n cnt link set lo up
$ ip -n cnt link set veth-cnt up
$ ip -n cnt route add default via 192.168.22.1 dev veth-cnt
$ ip netns exec cnt ping -c 3 1.1.1.1

Create a memory cgroup for the container (on the host)

$ mkdir /sys/fs/cgroup/memory/cnt
$ echo 10000000 > /sys/fs/cgroup/memory/cnt/memory.limit_in_bytes
$ echo 0 > /sys/fs/cgroup/memory/cnt/memory.swappiness
$ CHILD_PID=$(lsns -t pid | grep "[/]bin/bash --login +h" | awk '{print $4}')
$ echo $CHILD_PID > /sys/fs/cgroup/memory/cnt/tasks

Run commands within the container

$ apt update
$ apt install nginx procps curl -y
$ nginx
$ curl 127.0.0.1:80
$ curl 192.168.22.2:80 # from host
$ cat <( </dev/zero head -c 15m) <(sleep 15) | tail
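
The last command above makes tail buffer 15 MiB of zeros, which is more than the 10000000 bytes written to memory.limit_in_bytes, so the kernel is expected to kill it. Here is a quick sketch of the comparison. `to_bytes` is an illustrative helper that only understands the k, m and g suffixes, treated as powers of 1024 on the assumption that this matches how GNU head reads them:

```shell
# to_bytes SIZE: convert sizes such as 15m or 10k to bytes (1024-based).
to_bytes() {
    num=${1%[kmgKMG]}
    case $1 in
        *[kK]) echo $(( num * 1024 )) ;;
        *[mM]) echo $(( num * 1024 * 1024 )) ;;
        *[gG]) echo $(( num * 1024 * 1024 * 1024 )) ;;
        *)     echo "$num" ;;
    esac
}

limit=10000000            # value written to memory.limit_in_bytes above
request=$(to_bytes 15m)   # amount of data tail has to hold in memory

if [ "$request" -gt "$limit" ]; then
    echo "$request bytes > $limit bytes: expect the OOM killer to step in"
fi
```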

Clean up

$ umount /proc # within container
$ exit # within container
$ umount -R ./fs/merged
$ ip link del br-veth-cnt
$ ip link del br-cnt
$ ip netns del cnt # grep cnt /proc/mounts

Inspect Namespaces within a docker container

Fortunately, there are programs that simplify the use of containers. Here we use Docker, which manages the life cycle of a running container. Let's see how Docker uses namespaces when it runs a container.

Install docker CE

Install Docker Community Edition using the official script from get.docker.com

$ curl -fsSL https://get.docker.com -o install_docker.sh
$ less install_docker.sh # optional
$ sh install_docker.sh
$ usermod -aG docker $USER
$ newgrp docker # Or logout and login

Inspect Docker Network

Create a bridge network using docker

$ docker network create mynet

Inspect bridge network, see subnet using IP

# By default Docker names the bridge br-<first 12 characters of the network ID>
$ BR_NAME="br-$(docker network inspect -f '{{.Id}}' mynet | cut -c1-12)"
$ ip addr show "${BR_NAME}"

Inspect Docker bridge network, see subnet using docker

$ docker network inspect mynet | grep Subnet

Run an nginx web server

$ docker run --name nginx --net mynet -d --rm -p 8080:80 nginx

Inspect network namespace from nginx container

Create symlink from /proc to /var/run/netns

$ CONTAINER_ID=$(docker container ps | awk '/nginx/{print $1}')
$ CONTAINER_PID=$(docker inspect -f '{{.State.Pid}}' ${CONTAINER_ID})
$ mkdir -p /var/run/netns/
$ ln -sfT /proc/${CONTAINER_PID}/ns/net /var/run/netns/${CONTAINER_ID}

Check network interface within namespace

$ ip netns list
$ ip -n ${CONTAINER_ID} link show eth0

Check IP address of nginx container

$ ip -n ${CONTAINER_ID} addr show eth0
$ docker container inspect nginx | grep IPAddress

Check port forwarding from 8080 to 80

$ iptables -t nat -nvL

Inspect cgroups in a docker container

Run an Ubuntu container with limited resources

$ docker run --name test_cg --memory=10m --cpus=.1 -it --rm ubuntu

See the cgroup filesystem hierarchy (on the host)

$ CONTAINER_ID=$(docker container ps --no-trunc | awk '/test_cg/{print $1}')
$ tree /sys/fs/cgroup/{memory,cpu}/docker/${CONTAINER_ID}
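
The --cpus flag ends up as plain numbers in those cgroup files. According to the Docker documentation, --cpus=N is equivalent to setting a CFS quota of N times the CFS period, which is 100000 microseconds by default. Here is a sketch of that conversion, with an invented function name. Under cgroup v1 you can compare the result with cpu.cfs_quota_us in the cpu directory listed above:

```shell
# cpus_to_quota CPUS [PERIOD_US]: print the CFS quota in microseconds that
# corresponds to a fractional CPU count (default period: 100000 us).
cpus_to_quota() {
    awk -v cpus="$1" -v period="${2:-100000}" \
        'BEGIN { printf "%.0f\n", cpus * period }'
}

cpus_to_quota .1     # the value used for test_cg
cpus_to_quota 1.5
```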

See the tasks attached to the container cgroup

$ docker container top test_cg | tail -n 1 | awk '{print $2}' # container parent PID
$ cat /sys/fs/cgroup/{memory,cpu}/docker/${CONTAINER_ID}/tasks # the same as container parent PID

Monitor the container

$ docker container stats test_cg

Generate CPU load (inside the container)

$ cat /dev/urandom &> /dev/null

Generate memory load (inside the container)

$ cat <( </dev/zero head -c 50m) <(sleep 30) | tail

Inspect overlay fs in a docker container

Run a Debian container

$ docker run --name test_overlayfs -it --rm debian

NOTE: The merged layer is the actual container Filesystem

Inspect lower layers with tree and less

$ docker container inspect test_overlayfs -f '{{.GraphDriver.Data.LowerDir}}' | awk 'BEGIN{FS=":"}{for (i=1; i<= NF; i++) print $i}' | while read low; do tree -L 2 $low; done | less

Inspect upper layer (It's empty)

$ docker container inspect test_overlayfs -f '{{.GraphDriver.Data.UpperDir}}' | while read upper; do tree $upper; done | less

Run a command within the container

$ apt update && apt install nmap -y

Inspect (again) upper layer (now it's not empty)

$ docker container inspect test_overlayfs -f '{{.GraphDriver.Data.UpperDir}}' | while read upper; do tree $upper; done | less

Inspect docker process namespace

Run docker container

$ docker run --name test_ps -it --rm ubuntu

Launch processes within the container

$ sleep 600 &
$ sleep 600 &
$ sleep 600 &
$ sleep 600 &
$ sleep 600 &
$ top

See the container's process tree from the host

$ CONTAINER_PID=$(docker container top test_ps | sed -n '2p' | awk '{print $2}')
$ ps f -g ${CONTAINER_PID}
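
The same process has a different PID depending on the namespace you look from. On kernels that expose it (4.1 and later), the NSpid line of /proc/<pid>/status lists the PID in each nested PID namespace, outermost first. Here is a sketch that extracts both ends. The helper name and the sample values are made up for illustration:

```shell
# nspid FILE: print the outermost and innermost PID from the NSpid line of a
# file formatted like /proc/<pid>/status. "host" here means the namespace of
# the procfs mount being read; "container" is the innermost namespace.
nspid() {
    awk '$1 == "NSpid:" { printf "host=%s container=%s\n", $2, $NF }' "$1"
}

# Made-up sample; on the host you would pass /proc/${CONTAINER_PID}/status.
sample=$(mktemp)
printf 'Name:\tbash\nPid:\t4242\nNSpid:\t4242\t1\n' > "$sample"

nspid "$sample"
```

For the container's first process, the innermost value is expected to be 1, which is the PID that top shows inside the container.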

List PID namespaces

$ lsns -t pid

See the processes using Docker

$ docker container top test_ps

Conclusion

In this tutorial we created our first container from scratch and saw what happens behind the scenes when we run a container. I hope this tutorial helps you understand the technologies behind Linux containers.

Source code
