How Does a Docker Container Work Internally?
Eduardo Zepeda
Posted on May 19, 2024
Containers, especially Docker containers, are used everywhere. We tend to see them as small, isolated operating systems living inside our system. Using the Docker commands we can create them, modify them, delete them and even get inside them and run commands, but have you ever wondered how they work internally?
A container is a linux process with several characteristics:
- A linux process, or group of processes, executed by a user.
- It is isolated from the operating system that hosts it (Namespaces).
- It has a limited amount of resources (Cgroups).
- It has a file system independent of the operating system in which it runs (Chroot).
To achieve this, Docker and other container technologies take advantage of some features of GNU/Linux (from now on only linux):
- Processes
- Namespaces
- Cgroups
- Chroot
I am going to explain them briefly but you can go deeper on your own if you want to.
Processes, namespaces and cgroups on linux
Process
In simple words, a process is an instance of a running program. What is important here is that each process in linux has a PID, which is a number used to identify the process.
As you know, you can view the processes using the ps, top, htop commands, etc.
A container is a process, or a group of processes, isolated from the rest of the operating system, by means of a namespace.
Namespace
A namespace limits what we can see.
Namespaces are a Linux abstraction layer that isolates system resources. Processes inside a namespace are aware of other processes inside that namespace, but they cannot see or interact with the resources isolated in other namespaces. A process belongs to exactly one namespace of each type.
A namespace is what makes a container feel like another operating system.
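In linux, a namespace is destroyed automatically when its last process has finished running, unless something else keeps it alive, such as a bind mount or an open file descriptor on its /proc/[pid]/ns entry.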
Types of namespaces in linux
There are different types of namespaces that control the resources to which a process has access:
- UTS (Unix Time Sharing) namespace: Isolates hostname and domain.
- PID namespace: Isolates process identifiers.
- Mounts namespace: Isolates mount points.
- IPC namespace: Isolates communication resources between processes.
- Network namespace: Isolates network resources.
- User namespace: Isolates user and group identifiers.
- Cgroup namespace: Isolates the view of the cgroup hierarchy that a process sees in /proc/[pid]/cgroup and /proc/[pid]/mountinfo.
For example, if we use a namespace of type UTS, the changes we make to the hostname from our namespace will not affect the hostname of the main operating system.
Each namespace has its own hostname and domainname
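If you want to see which namespaces a process belongs to, linux exposes them as symbolic links under /proc/[pid]/ns. The following sketch reads those links for the current process. parseNsLink is a helper invented for this example, and it assumes the usual link format type:[identifier].

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseNsLink splits a link target such as "uts:[4026531838]" into the
// namespace type and its numeric identifier.
func parseNsLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	id, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return kind, id, nil
}

func main() {
	// Every entry in /proc/self/ns is a link to one namespace of this process.
	links, _ := filepath.Glob("/proc/self/ns/*")
	for _, link := range links {
		target, err := os.Readlink(link)
		if err != nil {
			continue
		}
		if kind, id, err := parseNsLink(target); err == nil {
			fmt.Printf("%-20s %-8s %d\n", filepath.Base(link), kind, id)
		}
	}
}
```

Two processes that print the same identifier for a given type share that namespace. A process started in a new UTS namespace would show a different number in the uts line.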
cgroup
In linux cgroups limit what we can use.
The cgroups, or control groups, provided by the linux kernel allow us to organize our processes into groups and limit the CPU, memory, input/output (I/O), number of processes and network packets generated by each of these groups.
Linux exposes this configuration as a series of files under the path /sys/fs/cgroup/. We can create new cgroups, or modify the existing ones, by creating directories and writing to the files inside this location.
For example, using cgroups we can tell linux: “let this process run on only one CPU, let it use at most 20% of that CPU's capacity, and give it a maximum of 1GB of RAM”.
cgroups allow you to limit system resources
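To make the example above a bit more concrete, here is a small sketch that translates those three limits into the values we would write to cgroup control files. It assumes the cgroups version 1 file names (in version 1 these files live in separate cpuset, cpu and memory directories), it reads "1GB" as 1 GiB, and the limits type and cgroupV1Values function are invented for the illustration. It only computes the values and does not touch /sys/fs/cgroup.

```go
package main

import "fmt"

// limits holds the example restrictions from the text.
type limits struct {
	cpus       string // CPUs the group may run on, e.g. "0" for the first one
	cpuPercent int    // share of CPU time, in percent
	memBytes   int64  // maximum memory, in bytes
}

// cgroupV1Values maps cgroup v1 control files to the values to write to them.
// periodUs is the CPU accounting period in microseconds.
func cgroupV1Values(l limits, periodUs int) map[string]string {
	// The quota is the CPU time allowed per period: 20% of 100000us is 20000us.
	quota := periodUs * l.cpuPercent / 100
	return map[string]string{
		"cpuset.cpus":           l.cpus,
		"cpu.cfs_period_us":     fmt.Sprint(periodUs),
		"cpu.cfs_quota_us":      fmt.Sprint(quota),
		"memory.limit_in_bytes": fmt.Sprint(l.memBytes),
	}
}

func main() {
	values := cgroupV1Values(limits{cpus: "0", cpuPercent: 20, memBytes: 1 << 30}, 100000)
	for _, file := range []string{"cpuset.cpus", "cpu.cfs_period_us", "cpu.cfs_quota_us", "memory.limit_in_bytes"} {
		fmt.Printf("%-24s %s\n", file, values[file])
	}
}
```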
Create a container from scratch with Go
To sum up the above, we need:
- Namespaces: to isolate the processes of our container from the main operating system.
- Chroot: to provide our container with a file system different from that of the main operating system.
- Cgroups: to limit the system resources that our container can access.
Now let’s create the base of the container in Go, the same programming language Docker is written in.
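package main

import (
	"fmt"
	"os"
	"os/exec"
)

// Usage: ./container run <command> <arguments>
func main() {
	// Without this check, os.Args[1] panics when no arguments are given.
	if len(os.Args) < 3 {
		panic("Usage: container run <command> <arguments>")
	}
	switch os.Args[1] {
	case "run":
		run()
	default:
		panic("This command doesn't exist")
	}
}

// run executes the requested command as a child process attached to our terminal.
func run() {
	fmt.Printf("Code executing %v with the process id (PID): %d \n", os.Args[2:], os.Getpid())
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}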
Let me explain the code.
Inside the main function, os.Args[1] holds the first argument passed to the program. If that argument is run, the run function is executed. Easy, isn’t it?
./container run <cmd> <args>
exec.Command takes whatever we pass after run as the command to execute, along with its arguments. This can be an echo, a bash, an ls, or whatever you want.
./container run echo "Hello world"
Code executing [echo Hello world] with the process id (PID): 292753
Hello world
The following lines with the cmd prefix can be summarized as follows.
- Connect the standard input of the command to the standard input of our program.
- Connect the standard output of the command to the standard output of our program.
- Connect the error output of the command to the error output of our program.
What does this mean? It means that everything we type in our terminal goes directly to the standard input of the command stored in cmd, and everything the command prints shows up in our terminal.
To conclude:
- cmd.Run executes the command that we created with exec.Command and waits for it to finish.
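cmd.Run also returns an error, which is easy to overlook. As a minimal sketch (runCommand is a name chosen for this example), this is one way to tell apart a command that ran but failed from a command that could not be started at all, using the standard exec.ExitError type. It assumes sh is available on the system.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// runCommand runs name with args attached to our terminal and returns the
// exit code of the command. A non-nil error means it could not be run at all.
func runCommand(name string, args ...string) (int, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// The command started and finished with a non-zero status.
		return exitErr.ExitCode(), nil
	}
	if err != nil {
		// For example, the binary was not found.
		return -1, err
	}
	return 0, nil
}

func main() {
	code, err := runCommand("sh", "-c", "exit 3")
	if err != nil {
		fmt.Println("could not run command:", err)
		return
	}
	fmt.Println("exit code:", code)
}
```

A container runtime would typically pass that exit code on to its own caller, so that scripts can react to a failing container.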
Containers and namespaces
So far we have a program that creates a process from the arguments we pass to it.
That works, but we have a problem: we are not using namespaces, so our program is not isolated from the rest of the system. We can see all the processes of the main operating system and we are using its file system, instead of our own file system for the container.
To assign a namespace to our program, we will set the SysProcAttr field of the command and pass it the CLONE_NEWUTS flag, which creates a new namespace of type UTS. Creating this namespace requires root privileges (the CAP_SYS_ADMIN capability), so the program has to be run with sudo.
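// This version also needs "syscall" in the import block.
func run() {
	fmt.Printf("Code executing %v with the process id (PID): %d \n", os.Args[2:], os.Getpid())
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// CLONE_NEWUTS starts the command in a new UTS namespace.
		Cloneflags: syscall.CLONE_NEWUTS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}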
As you read in the list of namespaces, UTS is the namespace for isolating hostname and domain names.
Namespace UTS
After setting the Cloneflags, any changes we make to the hostname will be made only within the namespace. In other words, changes inside our container will not affect anything outside of it.
# original hostname
hostname
OriginalHostname
# renaming hostname inside the container
sudo ./container run /bin/bash
hostname anotherName
# Hostname changed inside the container
hostname
anotherName
# Exiting container
exit
# The original hostname didn't change
hostname
OriginalHostname
Isolating processes with the PID namespace
Now that we have seen how a namespace works, let’s use one for the main job of a container: isolating processes.
We will make the following changes to the main code.
- The run function no longer runs our command directly. It always passes child as the first argument, so the function of the same name is executed inside the new namespaces.
- exec.Command("/proc/self/exe", args...) will be used to re-execute our own program (/proc/self/exe is a link to the running executable) with our commands.
- CLONE_NEWPID will be used to create a new namespace to isolate the processes in our container.
- The syscall.Sethostname function will be in charge of setting the hostname automatically, which is useful to know that we are inside the container.
The rest of the code works exactly as before.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// go run container.go run <cmd> <args>
func main() {
	if len(os.Args) < 3 {
		panic("Usage: container run <cmd> <args>")
	}
	switch os.Args[1] {
	case "run":
		run()
	case "child":
		child()
	default:
		panic("This command doesn't exist")
	}
}

// run re-executes this program as "child" inside new UTS and PID namespaces.
func run() {
	args := append([]string{"child"}, os.Args[2:]...)
	cmd := exec.Command("/proc/self/exe", args...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}

// child already runs inside the new namespaces and starts the requested command.
func child() {
	fmt.Printf("Code executing %v with the process id (PID): %d \n", os.Args[2:], os.Getpid())
	if err := syscall.Sethostname([]byte("container")); err != nil {
		panic(err)
	}
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
Now, if we run the code we will see that the PID is 1, the first process. We have already isolated the processes! However, as we have not changed the file system, commands such as ps will still show the processes of our main operating system.
Remember that the ps command gets the processes from the /proc directory of the file system you are using. In other words, we need another file system.
Set up a new file system for the container
To use a separate file system for the container, different from the file system of our operating system, we will use chroot, which linux offers both as a command and as a system call (syscall.Chroot in Go).
Chroot changes the root directory of a process to a directory of your choice.
ls /another_file_system
bin dev home lib ... proc
This new file system can have other libraries and configurations installed and be designed to our liking. It can be a copy of the one you are using or a completely different one.
To isolate the processes of our container we are going to:
- Change the file system to the new one with Chroot
- Move to the root directory
- Mount the proc virtual file system on the proc directory
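func child() {
	fmt.Printf("Executing code %v with Process Id (PID): %d \n", os.Args[2:], os.Getpid())
	syscall.Sethostname([]byte("container"))
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	// Switch the root directory, then move into it so no path points outside.
	if err := syscall.Chroot("/another_file_system"); err != nil {
		panic(err)
	}
	if err := os.Chdir("/"); err != nil {
		panic(err)
	}
	// Mount a fresh proc so that ps only sees the processes of this PID namespace.
	if err := syscall.Mount("proc", "proc", "proc", 0, ""); err != nil {
		panic(err)
	}
	defer syscall.Unmount("proc", 0)
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}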
Now our container is going to read the processes from our new file system, instead of the file system of the main operating system.
Limiting container resources with cgroups
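Finally, we are going to limit the resources that our container can access using the linux cgroups.
The cgroups are located inside the path /sys/fs/cgroup/ and we can create a new one by creating a new folder inside the directory of the resource type we want to limit.
In this case we will limit the memory, so our cgroup will be inside /sys/fs/cgroup/memory/. This path corresponds to cgroups version 1; on systems that use cgroups version 2 there is a single hierarchy directly under /sys/fs/cgroup/ and the file names are different (for example, memory.max instead of memory.limit_in_bytes). Remember I told you that cgroups worked by reading a series of directories and files?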
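func child() {
	// ...
	// Call this before syscall.Chroot, while /sys/fs/cgroup is still visible.
	setcgroup()
	// ...
}

// setcgroup needs "path/filepath" and "strconv" in the import block.
func setcgroup() {
	cgPath := filepath.Join("/sys/fs/cgroup/memory", "new_c_group")
	if err := os.MkdirAll(cgPath, 0755); err != nil {
		panic(err)
	}
	// os.WriteFile replaces ioutil.WriteFile, deprecated since Go 1.16.
	if err := os.WriteFile(filepath.Join(cgPath, "memory.limit_in_bytes"), []byte("100000000"), 0700); err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(cgPath, "tasks"), []byte(strconv.Itoa(os.Getpid())), 0700); err != nil {
		panic(err)
	}
}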
We create a directory for our cgroup with the linux permissions 0755. The kernel automatically fills that directory with the control files of the cgroup.
We write to two of those files to set the limits we want to apply:
- memory.limit_in_bytes, to limit the maximum memory to 100 MB (100000000 bytes).
- tasks, to tell linux that this cgroup configuration applies to the process ID (PID) of our container, which we obtain with the os.Getpid function.
And that’s it. With that we have a process that has its own file system, is isolated from the main operating system and can access only a part of the resources.
Summary
It is possible to create a container using namespaces, cgroups and chroot, to isolate from the outside, limit resources, and provide its own file system, respectively.
The code in this post is based on a talk by Liz Rice at Container Camp.