What is a containerd snapshotter?
Nicola Apicella
Posted on February 14, 2023
I have recently invested some time in understanding how containerd works, in particular containerd snapshotters. During the process, I took various notes that helped me along the way. Shortly after, I realized my notes could make for a boring article for whoever is curious about containers, containerd and containerd snapshotters and would like to know more. The main questions this article tries to answer are: what is a containerd snapshotter? And how does it work?
The article is a bit long, so make sure you are comfortable before diving in.
Disclaimer: I make no promise that these notes are 100% accurate or complete. Quite the opposite, I am sure they are not, but they should be an OK account of how things work.
ToC
- Concepts
- What's a snapshotter?
- Snapshot key and parent strings
- Snapshot ParentIDs
- Mounts
- containerd client.Pull
- ctr run
- What's a remote snapshotter?
- Conclusions
Concepts
This section contains concepts mentioned in the containerd documentation and code that we will use in the rest of this document. I found that once I understood them better, everything else was much easier to digest.
Serialized Mount
It's a representation of the parameters of a Linux mount command/syscall. For example, to mount a device on your system you would type something like this:
mount -t <type> -o <options if any> <device> <mount-point>
The serialization of the mount is represented in containerd with a go struct:
// Mount is the lingua franca of containerd. A mount represents a
// serialized mount syscall. Components either emit or consume mounts.
type Mount struct {
	// Type specifies the host-specific mount type.
	Type string
	// Source specifies where to mount from. Depending on the host system, this
	// can be a source path or device.
	Source string
	// Options contains zero or more fstab-style mount options. Typically,
	// these are platform specific.
	Options []string
}

func All(mounts []Mount, target string) error {
	// omitted
}
Note that the target is missing from the struct because it's up to the caller to decide the target dir of the mount.
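To make this concrete, here is a minimal sketch of how a consumer of mounts uses one; the source path and the target are made up for the example:

// A bind mount, as a snapshotter might emit it (hypothetical source path).
m := mount.Mount{
	Type:    "bind",
	Source:  "/var/lib/my-snapshotter/snapshots/1/fs",
	Options: []string{"rbind", "rw"},
}
// The consumer, not the snapshotter, decides the target and performs the
// actual mount (mount.All lives in github.com/containerd/containerd/mount).
if err := mount.All([]mount.Mount{m}, "/mnt/rootfs"); err != nil {
	log.Fatal(err)
}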
Side note: the Target field was recently added (circa Feb 2023) as an optional field of the struct. The snapshotters currently baked into containerd do not use or need it, so for the rest of this document we will just say that the field was added because some future snapshotters might need it. The GitHub issue and PR contain more details on the reasoning.
Snapshotter plugins
containerd is a daemon that can be interacted with over a socket via gRPC calls. What we will focus on in this document are the Snapshots Service and the Snapshotter module.
containerd supports some default snapshotters (overlay, native, btrfs, etc.) but also allows the implementation of new snapshotters without having to recompile the containerd binary, by building a snapshotter plugin.
containerd uses a plugin architecture for snapshotters. To build one, you need to build a gRPC service, exposed via a socket, that can receive requests from containerd. If you build yours in Go this is much easier, since containerd provides gRPC bindings: "all that's needed" is to implement an interface and pass it to the gRPC server binder.
To make it more concrete, this is what creating a snapshotter looks like in code (error handling omitted for clarity; the complete example is available in the containerd repo):
package main

import (
	"net"

	"google.golang.org/grpc"

	snapshotsapi "github.com/containerd/containerd/api/services/snapshots/v1"
	"github.com/containerd/containerd/contrib/snapshotservice"
)

func main() {
	// Create a gRPC server
	rpc := grpc.NewServer()

	// Configure your custom snapshotter
	sn, _ := NewMyCoolSnapshotter()

	// Convert the snapshotter to a gRPC service
	service := snapshotservice.FromSnapshotter(sn)

	// Register the service with the gRPC server
	snapshotsapi.RegisterSnapshotsServer(rpc, service)

	// Listen and serve
	l, _ := net.Listen("unix", "/var/my-cool-snapshotter-socket")
	_ = rpc.Serve(l)
}
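For containerd to find the plugin, the socket is then registered as a proxy plugin in containerd's config.toml; the plugin name chosen here (mysnapshotter) is what you pass to the --snapshotter flags below:

[proxy_plugins]
  [proxy_plugins.mysnapshotter]
    type = "snapshot"
    address = "/var/my-cool-snapshotter-socket"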
Note: when using a custom plugin snapshotter, ctr snapshot list does not show the snapshots that were created by the custom snapshotter. To see the snapshots created by the plugin, use the --snapshotter flag, e.g. ctr snapshot --snapshotter mysnapshotter list.
The Snapshots Service is the component that, among other things, takes care of redirecting requests to the right snapshotter.
NewMyCoolSnapshotter needs to return a type that implements the Snapshotter interface. How this interface works is the focus of the rest of the document:
type Snapshotter interface {
	// Stat returns the info for an active or committed snapshot by name or
	// key.
	//
	// Should be used for parent resolution, existence checks and to discern
	// the kind of snapshot.
	Stat(ctx context.Context, key string) (Info, error)
	// Update updates the info for a snapshot.
	//
	// Only mutable properties of a snapshot may be updated.
	Update(ctx context.Context, info Info, fieldpaths ...string) (Info, error)
	// Usage returns the resource usage of an active or committed snapshot
	// excluding the usage of parent snapshots.
	//
	// The running time of this call for active snapshots is dependent on
	// implementation, but may be proportional to the size of the resource.
	// Callers should take this into consideration. Implementations should
	// attempt to honor context cancellation and avoid taking locks when making
	// the calculation.
	Usage(ctx context.Context, key string) (Usage, error)
	// Mounts returns the mounts for the active snapshot transaction identified
	// by key. Can be called on a read-write or readonly transaction. This is
	// available only for active snapshots.
	//
	// This can be used to recover mounts after calling View or Prepare.
	Mounts(ctx context.Context, key string) ([]mount.Mount, error)
	// Prepare creates an active snapshot identified by key descending from the
	// provided parent. The returned mounts can be used to mount the snapshot
	// to capture changes.
	//
	// If a parent is provided, after performing the mounts, the destination
	// will start with the content of the parent. The parent must be a
	// committed snapshot. Changes to the mounted destination will be captured
	// in relation to the parent. The default parent, "", is an empty
	// directory.
	//
	// The changes may be saved to a committed snapshot by calling Commit. When
	// one is done with the transaction, Remove should be called on the key.
	//
	// Multiple calls to Prepare or View with the same key should fail.
	Prepare(ctx context.Context, key, parent string, opts ...Opt) ([]mount.Mount, error)
	// View behaves identically to Prepare except the result may not be
	// committed back to the snapshotter. View returns a readonly view on
	// the parent, with the active snapshot being tracked by the given key.
	//
	// This method operates identically to Prepare, except that Mounts returned
	// may have the readonly flag set. Any modifications to the underlying
	// filesystem will be ignored. Implementations may perform this in a more
	// efficient manner that differs from what would be attempted with
	// `Prepare`.
	//
	// Commit may not be called on the provided key and will return an error.
	// To collect the resources associated with key, Remove must be called with
	// key as the argument.
	View(ctx context.Context, key, parent string, opts ...Opt) ([]mount.Mount, error)
	// Commit captures the changes between key and its parent into a snapshot
	// identified by name. The name can then be used with the snapshotter's other
	// methods to create subsequent snapshots.
	//
	// A committed snapshot will be created under name with the parent of the
	// active snapshot.
	//
	// After commit, the snapshot identified by key is removed.
	Commit(ctx context.Context, name, key string, opts ...Opt) error
	// Remove the committed or active snapshot by the provided key.
	//
	// All resources associated with the key will be removed.
	//
	// If the snapshot is a parent of another snapshot, its children must be
	// removed before proceeding.
	Remove(ctx context.Context, key string) error
	// Walk will call the provided function for each snapshot in the
	// snapshotter which matches the provided filters. If no filters are
	// given all items will be walked.
	// Filters:
	//  name
	//  parent
	//  kind (active,view,committed)
	//  labels.(label)
	Walk(ctx context.Context, fn WalkFunc, filters ...string) error
	// Close releases the internal resources.
	//
	// Close is expected to be called at the end of the lifecycle of the
	// snapshotter, but it is not mandatory.
	//
	// Close returns nil when it is already closed.
	Close() error
}
containerd "smart" client
containerd provides an opinionated client (written in Go) which abstracts the gRPC calls to containerd and provides various utilities; that's why it's called a "smart" client. The ctr command line uses the client and its utilities to interact with containerd. Other tools that want to build on top of containerd, e.g. nerdctl, leverage the smart client utilities to perform operations that they would otherwise need to implement themselves.
The main point, and the reason why I am mentioning the smart client, is that I was expecting to find some of the snapshot-related operations in containerd (the daemon), but they were not there. In particular, at the beginning of my research I thought snapshotters (or containerd) would perform the download of the layers, before discovering that the download is implemented as a utility in the "smart client".
DiffService
DiffService, also referred to as DiffApplierService or simply DiffApplier, is a service registered in containerd. The service untars (applies) the content of a layer into a mount.
This will become much clearer later in the doc.
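For reference, the applier side boils down to this interface from containerd's diff package (comments abbreviated): desc identifies the layer blob in the content store, and mounts is what the snapshotter returned.

type Applier interface {
	// Apply unpacks the layer described by desc into the filesystem
	// assembled from the given mounts, returning the descriptor of
	// the applied content.
	Apply(ctx context.Context, desc ocispec.Descriptor, mounts []mount.Mount, opts ...ApplyOpt) (ocispec.Descriptor, error)
}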
What's a snapshotter?
The main job of a snapshotter is to create a folder that can be used by containerd to unpack a layer. The folder must be prepared so that the act of unpacking layer n into it (applying, in containerd terminology) results in a folder with the content of layer n plus all the layers that precede it (n-1...1). The resulting folder is then a filesystem snapshot for layers 1...n.
During the process of creating the filesystem snapshot, the snapshotter also stores some metadata in the containerd metadata storage to reflect the status of the snapshot.
Let's give an example. The client calls the snapshotter for the first time to "Prepare" a snapshot for the first layer. Since this is the first layer, it does not have a parent snapshot. So the snapshotter will create an empty dir, store the key and parent for this snapshot in the containerd metadata storage, and return the dir to the caller (it does not return "just" a dir, but a Mount array; bear with me a little longer). The client will then call the DiffApplier, which untars the layer and merges it with the dir returned by the snapshotter (which was an empty dir). After this step, the dir contains the untarred first layer.
At this point, the client is ready for the second layer. It calls the snapshotter, this time providing a parent. The snapshotter will create an empty dir, then store the key and parent for this snapshot (exactly as before).
This time, since the parent is not empty, the snapshotter takes an additional action: it copies all the files from the parent dir (the one that contains the uncompressed layer) into the empty dir. Finally, it returns the dir to the caller.
The client will then call the DiffApplier, which uncompresses the layer and merges it with the dir returned by the snapshotter. This dir now contains the untarred first layer merged with the untarred second layer.
The process goes on for all the layers, and the result is going to be a snapshot that contains all the layers untarred.
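In Go terms, the per-layer flow looks roughly like this. This is a sketch built on the interfaces introduced above, not the actual client code; the key and name schemes are made up (the real client, for instance, names snapshots by the layer chain ID):

import (
	"context"
	"fmt"

	"github.com/containerd/containerd/diff"
	"github.com/containerd/containerd/snapshots"
	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
)

// unpackLayers sketches the per-layer Prepare/Apply/Commit flow.
func unpackLayers(ctx context.Context, sn snapshots.Snapshotter, applier diff.Applier, layers []ocispec.Descriptor) error {
	parent := ""
	for i, layer := range layers {
		key := fmt.Sprintf("extract-%d", i) // hypothetical key scheme
		mounts, err := sn.Prepare(ctx, key, parent)
		if err != nil {
			// the real client treats ErrAlreadyExists specially
			// (see the remote snapshotter section)
			return err
		}
		// The DiffApplier untars the layer into the prepared mounts.
		if _, err := applier.Apply(ctx, layer, mounts); err != nil {
			return err
		}
		// Commit turns the active snapshot into a committed one,
		// which becomes the parent of the next layer's snapshot.
		name := fmt.Sprintf("layer-%d", i) // hypothetical name scheme
		if err := sn.Commit(ctx, name, key); err != nil {
			return err
		}
		parent = name
	}
	return nil
}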
Note how in the process we created N snapshots, with N being the number of layers. The first snapshot contains one untarred layer, the second snapshot contains the first and the second, etc.
This seems a little space inefficient. If you have an image with 4 layers, each one 10MB uncompressed, the snapshotter will create 4 snapshots (10MB, 20MB, 30MB and 40MB respectively) for a total of 100MB. In other words, we started with an image that is 40MB uncompressed and booked 100MB worth of space on the device.
What we have just described is the "native" snapshotter. Other snapshotters utilize different strategies that will mitigate or completely eliminate space inefficiency. We are going to look into one of them later, the overlay snapshotter.
Before going a little deeper, there is an easy way to see this space inefficiency. Create a Dockerfile with 4 layers:
# the alpine image is made of a single layer
FROM alpine:latest
RUN dd if=/dev/zero of=file_a bs=1024 count=10240 # 10MB of rubbish
RUN dd if=/dev/zero of=file_b bs=1024 count=10240 # 10MB of rubbish
RUN dd if=/dev/zero of=file_c bs=1024 count=10240 # 10MB of rubbish
After building and pushing the image, we can ask ctr to pull it using the native snapshotter (e.g. ctr image pull --snapshotter native <image-ref>). Then we can look at the size of the snapshots:
# snapshot with alpine layer
> ls -lh /tmp/root-dir-example-snapshot/snapshots/1 | head -n 1
total 68K
# snapshot with alpine layer + 10 MB
> ls -lh /tmp/root-dir-example-snapshot/snapshots/16 | head -n 1
total 11M
# snapshot with alpine layer + 10 MB + 10 MB
> ls -lh /tmp/root-dir-example-snapshot/snapshots/17 | head -n 1
total 21M
# snapshot with alpine layer + 10 MB + 10 MB + 10MB
> ls -lh /tmp/root-dir-example-snapshot/snapshots/18 | head -n 1
total 31M
And if we look at the folders, we will see exactly what we expect:
ls -la /tmp/root-dir-example-snapshot/snapshots/1
total 10316
drwxr-xr-x 19 root root 4096 Feb 10 18:48 .
drwx------ 16 nicola domain^users 4096 Feb 10 18:49 ..
drwxr-xr-x 2 root root 4096 Feb 10 18:47 bin
drwxr-xr-x 2 root root 4096 Jan 9 13:46 dev
drwxr-xr-x 17 root root 4096 Feb 10 18:27 etc
... omitted
ls -la /tmp/root-dir-example-snapshot/snapshots/16
total 10316
drwxr-xr-x 19 root root 4096 Feb 10 18:49 .
drwx------ 16 nicola domain^users 4096 Feb 10 18:49 ..
drwxr-xr-x 2 root root 4096 Feb 10 18:49 bin
drwxr-xr-x 2 root root 4096 Jan 9 13:46 dev
drwxr-xr-x 17 root root 4096 Feb 10 18:49 etc
-rw-r--r-- 1 root root 10485760 Feb 10 18:27 file_a
... omitted
ls -la /tmp/root-dir-example-snapshot/snapshots/17
total 20556
drwxr-xr-x 19 root root 4096 Feb 10 18:49 .
drwx------ 16 nicola domain^users 4096 Feb 10 18:49 ..
drwxr-xr-x 2 root root 4096 Feb 10 18:49 bin
drwxr-xr-x 2 root root 4096 Jan 9 13:46 dev
drwxr-xr-x 17 root root 4096 Feb 10 18:27 etc
-rw-r--r-- 1 root root 10485760 Feb 10 18:27 file_a
-rw-r--r-- 1 root root 10485760 Feb 10 18:27 file_b
... omitted
ls -la /tmp/root-dir-example-snapshot/snapshots/18
total 30796
drwxr-xr-x 19 root root 4096 Feb 10 18:52 .
drwx------ 17 nicola domain^users 4096 Feb 10 18:51 ..
drwxr-xr-x 2 root root 4096 Feb 10 18:51 bin
drwxr-xr-x 2 root root 4096 Jan 9 13:46 dev
drwxr-xr-x 17 root root 4096 Feb 10 18:27 etc
-rw-r--r-- 1 root root 10485760 Feb 10 18:27 file_a
-rw-r--r-- 1 root root 10485760 Feb 10 18:27 file_b
-rw-r--r-- 1 root root 10485760 Feb 10 18:27 file_c
... omitted
Snapshot key and parent strings
Looking at the Snapshotter interface, we see that most methods receive a key and a parent string. From the snapshotter's point of view, these strings are completely opaque and together make up a token to be exchanged for an object. The snapshotter uses storage.CreateSnapshot(ctx, kind, key, parent, opts...), which returns a Snapshot record object: the method creates one if it does not exist, otherwise it returns the existing one. One of the fields of the Snapshot object is ID, which identifies the snapshot.
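To make this concrete, here is a rough sketch of how a snapshotter's Prepare typically exchanges the key/parent pair for a Snapshot record, loosely modeled on the native snapshotter (the ms metastore field, the root field and the bind-mount choice are assumptions of the sketch):

func (s *mySnapshotter) Prepare(ctx context.Context, key, parent string, opts ...snapshots.Opt) ([]mount.Mount, error) {
	// Open a transaction on the metadata store (s.ms is a *storage.MetaStore
	// from github.com/containerd/containerd/snapshots/storage).
	ctx, t, err := s.ms.TransactionContext(ctx, true)
	if err != nil {
		return nil, err
	}
	// Exchange the opaque key/parent strings for a Snapshot record.
	snap, err := storage.CreateSnapshot(ctx, snapshots.KindActive, key, parent, opts...)
	if err != nil {
		t.Rollback()
		return nil, err
	}
	// Create the directory backing the snapshot. A native-style snapshotter
	// would also copy the parent snapshot's content into it at this point.
	dir := filepath.Join(s.root, "snapshots", snap.ID)
	if err := os.MkdirAll(dir, 0755); err != nil {
		t.Rollback()
		return nil, err
	}
	if err := t.Commit(); err != nil {
		return nil, err
	}
	// Return the serialized mount; the caller decides the target.
	return []mount.Mount{{
		Type:    "bind",
		Source:  dir,
		Options: []string{"rbind", "rw"},
	}}, nil
}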
Snapshot ParentIDs
The Snapshot object returned by the storage contains an array called ParentIDs. The first element in the array is always the parent ID of the current snapshot (the one being Prepared, Committed, etc.).
For example, a snapshot with ID 3, whose parent is snapshot 2 (itself a child of snapshot 1), will have the following ParentIDs list, with 2 being its immediate parent:
ParentIDs
0 = {string} "2"
1 = {string} "1"
Mounts
Earlier I mentioned that the snapshotter does not return just a dir: the Snapshotter.Prepare API returns an array of mounts.
Looking at some of the existing snapshotters (like native and overlay), I would assume that most snapshotters only ever need a single mount. The Prepare return type is probably an array to make the API generic enough for snapshotters that might need multiple mounts.
Something obvious, but still worth mentioning: the Type in the Mount returned by the snapshotter needs to be a filesystem type supported by the underlying kernel. For example, assume someone wants to write a snapshotter that exposes the container filesystem as ReiserFS. The snapshotter would need to return "reiserfs" as the mount type; the client will call the Unix mount, and that will succeed only if ReiserFS is enabled in the kernel.
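For example, containerd's overlay snapshotter returns a single mount of type overlay, with the layer directories carried in the options; roughly like this (the paths are illustrative, in the style of the earlier experiment):

mount.Mount{
	Type:   "overlay",
	Source: "overlay",
	Options: []string{
		"workdir=/tmp/root-dir-example-snapshot/snapshots/24/work",
		"upperdir=/tmp/root-dir-example-snapshot/snapshots/24/fs",
		"lowerdir=/tmp/root-dir-example-snapshot/snapshots/18/fs:/tmp/root-dir-example-snapshot/snapshots/17/fs",
	},
}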
containerd smart client: client.Pull
This section answers the question - what happens when calling the smart client "client.Pull"?
image, err := client.Pull(ctx, "docker.io/library/redis:alpine", containerd.WithPullUnpack)
if err != nil {
return err
}
Following the chain of calls from client.Pull, we see: unpacker.Unpack -> diffRemote -> over gRPC -> diffService -> Applier, which on Linux eventually calls the apply function in the apply_linux.go file. That calls the Apply function in the archive/tar.go file.
Note that containerd only performs the mount for the first layer (see code). That's because the overlay snapshotter returns a bind mount for the first layer, so the apply function ends up in that branch. All the mounts returned by snapshotter.Prepare after that are of type overlay and are not mounted by containerd.
Also note that tar.go does not just untar the content of the layer, but also converts the OCI whiteouts and opaque whiteouts to overlay whiteouts. This is a topic for a different article, but in short: suppose a file called my-file was removed in layer N. This is recorded in the layer as .wh.my-file. During untar for that layer, when the function encounters the OCI whiteout .wh.my-file, it stores it in the snapshot folder as an overlay whiteout (a character device file with device number 0 0 and the same name as the original file, e.g. my-file).
Simplifying the chain of calls and leaving out details, we can summarize the process of pulling and creating snapshots with the following pseudocode:
for each layer in the image manifest
    m, err = snapshotter.Prepare(key, parent)
    if Prepare returned ErrAlreadyExists: continue
        // this check allows remote snapshots.
        // In reality, the client will also call
        // snapshotter.Stat before skipping the creation of the snapshot.
        // This will be discussed in the section about remote snapshotters.
    download layer
    load layer into the content store
    if m is an overlay mount
        untar + convert from OCI spec to overlay in the overlay upperdir
    else if m is a bind mount
        create temp mount
        untar + convert from OCI spec to overlay in the temp mount
    snapshotter.Commit(name, key)
end
Note 1: ctr pull uses the client differently, which results in slightly different behavior. In particular, all the layers are first downloaded and loaded into the containerd content store before the snapshotter is asked to create new snapshots.
Note 2: in the example we use containerd.WithPullUnpack, so that we not only fetch and download the content into containerd's content store but also unpack it into a snapshotter for use as a root filesystem. If you do not pass that option, the snapshotter is not called.
ctr run
This section answers the question: what happens when running ctr run, from a snapshotter's point of view?
During container start, the client is going to call the snapshotter again to create an active snapshot that serves as a read-write layer for the container. The active snapshot is mounted before the container starts and unmounted when the container exits. If we start a long-running container in one terminal, we can hop into another terminal and see the mounts. For example, for an overlay snapshotter we will see:
mount | grep overlay
overlay on /run/containerd/io.containerd.runtime.v2.task/default/example/rootfs type overlay (rw,relatime,lowerdir=/tmp/root-dir-example-snapshot/snapshots/22/fs:/tmp/root-dir-example-snapshot/snapshots/21/fs:/tmp/root-dir-example-snapshot/snapshots/20/fs:/tmp/root-dir-example-snapshot/snapshots/19/fs,upperdir=/tmp/root-dir-example-snapshot/snapshots/24/fs,workdir=/tmp/root-dir-example-snapshot/snapshots/24/work,xino=off)
for a native snapshotter we will see:
mount
....
/dev/mapper/vg0--aed748-root on /run/containerd/io.containerd.runtime.v2.task/default/example/rootfs type ext4 (rw,relatime,errors=remount-ro)
If you add something to the mounted folder, the container is also going to see it.
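For reference, with the smart client the active snapshot is requested through a container option; this is roughly what ctr does under the hood (standard client options, with made-up IDs):

container, err := client.NewContainer(ctx, "example",
	// WithNewSnapshot calls Snapshotter.Prepare to create the active
	// read-write snapshot on top of the image's topmost committed layer.
	containerd.WithNewSnapshot("example-snapshot", image),
	containerd.WithNewSpec(oci.WithImageConfig(image)),
)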
At this point, we are in a much better position to understand the documentation of the Snapshotter interface.
Besides the Prepare, Commit, Stat and Mounts methods that we have already mentioned above, the interface also contains Remove, which is called when deleting a snapshot (e.g. ctr snapshot rm <key>) and during container clean-up (to remove the active snapshot).
What's a remote snapshotter?
The containerd docs' description of remote snapshotters is really clear:
Containerd allows snapshotters to reuse snapshots existing somewhere managed by them.
Remote Snapshotter is a snapshotter that leverages this functionality and reuses snapshots that are stored in a remotely shared place. These remotely shared snapshots are called remote snapshots. Remote snapshotter allows containerd to prepare these remote snapshots without pulling layers from registries, which hopefully shortens the time to take for image pull.
image, err := client.Pull(ctx, ref,
containerd.WithPullUnpack,
containerd.WithPullSnapshotter("my-remote-snapshotter"),
)
client.Pull with the option WithPullUnpack informs the client that it needs to call Snapshotter.Prepare before downloading the layers. If the call returns ErrAlreadyExists, the client calls Snapshotter.Stat to confirm; if that returns no error, it skips downloading the layer and skips loading it into the content store.
If Snapshotter.Prepare instead returns no error (a fresh active snapshot was created), the client downloads the layer and loads it into the content store before calling the DiffApplierService.
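On the snapshotter side, this means a remote snapshotter signals "layer already present" from Prepare itself, roughly like this (a sketch; both helpers are hypothetical, and errdefs is github.com/containerd/containerd/errdefs):

func (s *myRemoteSnapshotter) Prepare(ctx context.Context, key, parent string, opts ...snapshots.Opt) ([]mount.Mount, error) {
	// If the layer (identified via snapshot labels set by the client) is
	// already available remotely, set up the remote snapshot and tell the
	// client not to download the layer.
	if s.remoteSnapshotAvailable(ctx, key, opts...) { // hypothetical helper
		return nil, errdefs.ErrAlreadyExists
	}
	// Otherwise prepare a normal, local snapshot.
	return s.prepareLocal(ctx, key, parent, opts...) // hypothetical helper
}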
This behavior allows us to set up snapshots out of band, that is, outside the "normal" workflow. One such example is the soci snapshotter, which allows image lazy loading. The snapshotter ships with an "rpull" command which performs this out-of-band prep. During rpull, the command calls the soci snapshotter, which creates FUSE mounts for each layer that has an index (remote snapshots). Layers that do not have an index are downloaded as usual (local snapshots).
The local snapshots are created with overlay mounts. This is just a detail, though; the important bit is that the folder created for a local snapshot contains only that layer, which is exactly how overlay works.
For example, after rpull we'll see something like the following:
> mount
soci on /var/lib/soci-snapshotter-grpc/snapshotter/snapshots/1/fs type fuse.rawBridge (rw,nodev,relatime,user_id=0,group_id=0,allow_other)
soci on /var/lib/soci-snapshotter-grpc/snapshotter/snapshots/4/fs type fuse.rawBridge (rw,nodev,relatime,user_id=0,group_id=0,allow_other)
soci on /var/lib/soci-snapshotter-grpc/snapshotter/snapshots/6/fs type fuse.rawBridge (rw,nodev,relatime,user_id=0,group_id=0,allow_other)
... omitted
During ctr run, the client asks the snapshotter (Prepare) for one more active snapshot (which is going to be used as the read-write layer for the container). The snapshotter returns an overlay mount that overlays the FUSE mounts and the local snapshots together:
> mount
soci on /var/lib/soci-snapshotter-grpc/snapshotter/snapshots/1/fs type fuse.rawBridge (rw,nodev,relatime,user_id=0,group_id=0,allow_other)
soci on /var/lib/soci-snapshotter-grpc/snapshotter/snapshots/4/fs type fuse.rawBridge (rw,nodev,relatime,user_id=0,group_id=0,allow_other)
soci on /var/lib/soci-snapshotter-grpc/snapshotter/snapshots/6/fs type fuse.rawBridge (rw,nodev,relatime,user_id=0,group_id=0,allow_other)
overlay on /run/containerd/io.containerd.runtime.v2.task/default/sociExample/rootfs type overlay (rw,relatime,lowerdir=/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/10/fs:/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/9/fs:/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/8/fs:/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/7/fs:/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/6/fs:/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/5/fs:/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/4/fs:/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/3/fs:/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/2/fs:/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/1/fs,upperdir=/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/11/fs,workdir=/var/lib/soci-snapshotter-grpc/snapshotter/snapshots/11/work,xino=off)
Conclusions
A special thanks to Majd for reviewing and helping me come up with the right questions.
Kudos to Jin, who works on the soci-snapshotter and has been so kind to clarify some of the doubts I had.
Thanks to qiutongs for the comment, which accurately highlighted that containerd performs only one mount operation when performing a pull operation. I have updated the article to reflect that.
If you liked the article you might want to also check out How are docker images built?
For some history on containerd snapshotters, see the Moby project blog.