docker from scratch: understanding how docker, dockerd, containerd and runc work together
I’ve always used a FROM _base_image_ in my Docker images, and I started wondering how those base images were created. This led me down a rabbit hole of how containers work and the isolation gained by using them.
In this post we will see a little bit about what is inside a container and how to create some pretty simple ones without using base images (or even docker build). Then we start a journey through all the projects that take part from the moment you run docker run until the container is up and running.
Creating a Hello World
Docker already has a guide about base images available here, so the first thing I did was write a Hello World C program: just a main function printing a message.
#include <stdio.h>
int main() {
printf("Hello container!\n");
return 0;
}
I compiled it with the -static flag so that all dependencies are included in the binary, which means no dynamic linking. This made the binary go from 17kB to 756kB on disk, and running objdump on the binary compiled without the -static flag shows the following:
$ objdump -T bin/hello
bin/hello: file format elf64-x86-64
DYNAMIC SYMBOL TABLE:
0000000000000000 w D *UND* 0000000000000000 _ITM_deregisterTMCloneTable
0000000000000000 DF *UND* 0000000000000000 GLIBC_2.2.5 puts
0000000000000000 DF *UND* 0000000000000000 GLIBC_2.2.5 __libc_start_main
0000000000000000 w D *UND* 0000000000000000 __gmon_start__
0000000000000000 w D *UND* 0000000000000000 _ITM_registerTMCloneTable
0000000000000000 w DF *UND* 0000000000000000 GLIBC_2.2.5 __cxa_finalize
I have a call to printf in my code, so the libc implementation available on my system is linked, and since I am using Debian, that is glibc. If I compile the same code on Alpine, which uses musl, the static binary is 83kB, close to 10% of what it was with glibc. I had heard before that musl was supposed to be lighter (but not as complete), so I wanted to do a quick test, but that is a subject for another post. The point here is: to build the Docker image we need to pack everything, so -static is used in this example.
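For reference, this is roughly what the two builds look like (the exact gcc invocation and paths here are my own assumption about the setup, not something from the original):
# dynamically linked build (the default): relies on the system libc at runtime
gcc -o bin/hello hello.c
# statically linked build: everything gets baked into the binary
gcc -static -o bin/hello hello.c
# compare sizes and check whether the result is statically linked
ls -lh bin/hello
file bin/hello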
The Docker hello-world image avoids linking libc by calling the syscall directly:
syscall(SYS_write, STDOUT_FILENO, message, sizeof(message) - 1);
Building and then running the following Dockerfile on my Debian VM, where the image was built, succeeds!
# syntax=docker/dockerfile:1
FROM scratch
ADD bin /
CMD ["/hello"]
And running sudo docker images indicates that the image has the same size as the binary we saw above (756kB). If we use docker save to generate a tar of our image and extract that file, we can see our compiled “hello” binary (and also some metadata).
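A quick sketch of what that looks like (image and directory names follow the example above; the exact layout of the tar depends on your Docker version):
# export the image to a tarball and unpack it
sudo docker save -o hello.tar hello
mkdir hello-image && tar -C hello-image -xf hello.tar
# somewhere under the extracted layers we find our binary, next to manifest and config metadata
find hello-image -type f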
Another way to create this image is simply importing a tar. Using the command from the docker page about base images, we write:
sudo tar -C bin -c . | docker import - hellotar
But this creates an image without a CMD instruction, so we need to tell docker run which binary to execute:
sudo docker run hellotar /hello
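If we want the imported image to behave like the one built from the Dockerfile, docker import accepts a --change flag that applies Dockerfile instructions to the imported image (a sketch, using the same bin directory as above):
sudo tar -C bin -c . | docker import --change 'CMD ["/hello"]' - hellotar
sudo docker run hellotar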
What if we build the image with the binary compiled without the -static flag?
standard_init_linux.go:228: exec user process caused: no such file or directory
What is more, since the image we just created contains only this one binary, if we add a command like the following to our Dockerfile:
RUN ls
We receive this other error:
OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
Let’s look into those files!
Digging deeper
First of all, let’s understand a little better what is going on. We have only used docker commands up until now, but while studying how containers work I got to understand a bit more of what happens under the hood when we call docker run.
Docker and Containerd
From the Docker website, “containers virtualize the operating system instead of the hardware”, so the kernel is shared, and that is why we are able to run a single, simple hello world binary in a container. That would be much harder on a VM. Docker provides a layer just above containerd, which is a container runtime. There are other runtimes as well; Kubernetes, for example, supports this list.
Containerd and runc
containerd uses runc, which implements the OCI runtime specification, to actually run each container. The files that appeared in the errors when we ran the non-static binary or asked for a binary that was not present are defined in this repository: container_linux.go here and standard_init_linux.go here (pay attention to the tag used, so the code lines match the ones in our error messages).
The code that triggered the errors above.
On standard_init_linux.go we see the following code:
if err := system.Exec(name, l.config.Args[0:], os.Environ()); err != nil {
	return newSystemErrorWithCause(err, "exec user process")
}
Here, system is a package inside runc, and the Exec function is defined here. Diving into this function, it calls unix.Exec, where unix comes from the Go sys/unix package, and that exec simply replaces the current process with the target one. Since we received this error when trying to execute a dynamically linked binary whose required libraries were not present, it fails with a return code of 127, which means “command not found”.
In container_linux.go the relevant code is this:
if err := parent.start(); err != nil {
	return newSystemErrorWithCause(err, "starting container process")
}
Here, parent is created by c.newParentProcess, where c is a pointer to a linuxContainer struct defined in the same file, and newParentProcess creates an initProcess. The start method of initProcess is defined here, and it basically calls exec.Cmd.Start from the Go os/exec library.
Understanding the role of both functions
Both errors boil down to a similar problem: some kind of exec being called where it is not valid. I suspected the difference was simply due to the commands, since one error came from CMD in the Dockerfile and the other from RUN, but if I replace CMD ["/hello"] with CMD ["ls"], I still receive an error from container_linux.go:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "ls": executable file not found in $PATH: unknown.
ERRO[0000] error waiting for container: context canceled
That makes me wonder what roles container_linux and standard_init_linux play.
Following references from standard_init_linux leads us to the definition of initCommand, which is exposed in the runc CLI with the following documentation:
init initialize the namespaces and launch the process (do not call it outside of runc)
Calling runc init from bash has no effect because, as the docs say, it must be called inside a runc context; we can see in init.go that it expects a _LIBCONTAINER_LOGLEVEL environment variable to be defined, for example. This is where things start to connect: searching for where this variable is defined led me to container_linux.go, in the commandTemplate method. This is a method of the same linuxContainer struct, and it is called from newParentProcess.
Calls to the start method in container_linux can come from a few runc CLI commands:
create create a container
exec execute new process inside the container
restore restore a container from a previous checkpoint
run create and run a container
Ok, so standard_init_linux is called after container_linux, but what roles do they play here? From our experiments, we received an error from standard_init_linux when the binary was present but could not actually be exec’d (the dynamically linked build), and from container_linux when the binary we asked for was not there at all.
Let’s study a little bit more about runc before we explore what in the runc code makes the errors different.
Diving into runc
Following, with a few changes, the section in runc’s README about running a container, we can do:
# hello is the name of the image we just built, this creates a container and exports its filesystem
mkdir rootfs
docker export $(docker create hello) | tar -C rootfs -xvf -
# creates a default runc config.json, which describes how to launch the container
runc spec
# the default process is sh, but there is no sh in our hello container, so set it to our binary
jq '.process.args[0] = "/hello"' config.json | sponge config.json
# to call runc create (which sets up the container but does not run the binary) we need to provide a tty
# recvtty implements the tty needed by the create command, and can be built from the runc repo with "make recvtty"
/path/to/recvtty hello.sock & # we send it to the background, you could run it in another terminal to see the output of our code!
# creates the container
sudo runc create --console-socket hello.sock rawhello
# Should list "runc init"
ps aux | grep runc
sudo runc start rawhello # Should print to recvtty - try it in the other terminal! :)
sudo runc delete rawhello # Clean up
We should see something like this:
carlos 6070 0.0 0.3 1078640 9052 pts/1 Tl 19:05 0:00 /home/carlos/projects/dfs/runc/contrib/cmd/recvtty/recvtty hello.sock
root 6086 0.0 0.5 1159560 17300 pts/0 Ssl+ 19:05 0:00 runc init
carlos 6143 0.0 0.0 6076 884 pts/1 R+ 19:05 0:00 grep runc
and the output of the “hello” image printed on the terminal running recvtty.
There is our runc init! What happens is that the isolation of a container is obtained by using a set of kernel features, such as namespaces, cgroups and LSMs, and you can read more about that here. This is what gives us the isolation, and the way runc works, it creates the isolated container and starts a runc init process inside it, which is replaced by our command when we execute runc start. init acts as a placeholder until start is called.
Image from unit42.paloaltonetworks.com. The post linked above helped me understand better the inner workings of runc.
There are a lot of interesting things going on here; the config.json generated by runc spec has a lot of information! I want to study more about that later :)
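For instance, a quick peek at a few fields of the generated spec (the field names come from the OCI runtime spec; the exact contents depend on your runc version):
# which namespaces the container will be placed in
jq '.linux.namespaces' config.json
# the process that runc init will be replaced with
jq '.process.args' config.json
# where the root filesystem is expected to live
jq '.root' config.json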
The whole model
Taking what we just discussed in the last section, the whole process from docker run to the actual container running would be something like this:
But it is a bit more complicated than that, and we will see that by running strace, a tracing tool that lets us see the syscalls made by a program. In order to capture the whole execution flow, we will need three terminals:
Terminal #1
sudo strace -ff -o containerd.txt -p $(pgrep containerd)
Terminal #2
sudo strace -s 80 -ff -o dockerd.txt -p $(pgrep dockerd)
Terminal #3
sudo strace -ff -o docker.txt docker run hello
Docker to dockerd
This will generate a list of files, one for each thread spawned by each of those processes, and we can inspect those files to see what the threads were up to.
$ grep connect docker.txt*
...
docker.txt.13457:connect(7, {sa_family=AF_UNIX, sun_path="/var/run/docker.sock"}, 23) = 0
Here we can see docker communicating with dockerd through a unix domain socket.
(Note: your output will be different; the point here is that those threads open docker.sock to communicate with dockerd.)
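We can also poke at that socket directly, since dockerd exposes its HTTP API over it (a quick sketch, assuming curl is available):
# talk to dockerd over the same unix socket the docker CLI uses
sudo curl --unix-socket /var/run/docker.sock http://localhost/version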
Dockerd to containerd
Next, we can check dockerd talking to containerd. This one is trickier since the connection to containerd.sock is not opened on demand like the one we saw above for docker.sock. We can still confirm that there is a connection from dockerd to containerd.sock by running:
# This lists files opened by containerd and specifies that we want endpoint information on UNIX sockets
$ sudo lsof +E -aUc containerd
...
container 15851 root 10u unix 0x000000004fe13ce6 0t0 8035230 /run/containerd/containerd.sock type=STREAM ->INO=8035882 15861,dockerd,10u
container 15851 root 11u unix 0x000000004979028a 0t0 8035883 /run/containerd/containerd.sock type=STREAM ->INO=8044007 15861,dockerd,11u
dockerd 15861 root 10u unix 0x00000000f58b2dcb 0t0 8035882 type=STREAM ->INO=8035230 15851,container,10u
dockerd 15861 root 11u unix 0x00000000872ba96d 0t0 8044007 type=STREAM ->INO=8035883 15851,container,11u
With the numbers of the file descriptors (10 and 11, listed as 10u and 11u above) we can look for calls to write in the generated txt files, using grep 'write(10' dockerd.txt*, and we will find things like this:
dockerd.txt.15864:write(10, "\0\0\10\6\1\0\0\0\0\2\4\20\20\t\16\7\7\0\0\4\10\0\0\0\0\0\0\0$\345\0\0\10\6\0\0\0\0\0\2\4\20\20\t\16\7\7", 47) = 47
dockerd.txt.15864:write(10, "\0\0\10\1\4\0\0\0\211\203\206\315\323\322\321\320\317\0\0G\0\1\0\0\0\211\0\0\0\0B\n@056f171aa8aaf7e32871c732f3bb98f44047ec34bb69744"..., 97) = 97
If we search for the string 056f171aa8aaf7e32871c732f3bb98f44047ec34bb69744 shown above in the syscall logs generated by strace, we find that it appears many times in the path /var/lib/docker/containers/056f171aa8aaf7e32.
If we run sudo docker ps -a --filter "id=056f171aa8aaf7e32871c732f3bb98f44047ec34bb69744"
we can find our hello container with this ID there:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
056f171aa8aa hello "./hello" 2 hours ago Exited (0) 2 hours ago loving_mccarthy
Tracing dockerd communications with containerd was a little bit tricky, but with the help of lsof and some poking around it was possible to connect the dots.
The meaning of the bytes we saw in the write calls is out of scope here; I imagine that in those first writes dockerd is instructing containerd to create a container with the given ID, but we would have to check the docs (the traffic on that socket is containerd's gRPC API).
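We can also ask containerd directly about the containers it manages for Docker, which live in the moby namespace (a sketch, assuming the ctr CLI that ships with containerd is installed):
# list the containers containerd is managing on Docker's behalf
sudo ctr --namespace moby containers list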
Containerd to runc
The last step is: how does containerd talk to runc? This one is a bit easier and shows us what we already saw with runc init. Searching for execve in the txt files generated by our trace on containerd gives us what we need:
(just some lines of the output are shown)
$ grep execve containerd.txt*
containerd.txt.17327:execve("/usr/bin/runc", ["runc", "--root", "/var/run/docker/runtime-runc/mob"..., "--log", "/run/containerd/io.containerd.ru"..., "--log-format", "json", "create", "--bundle", "/run/containerd/io.containerd.ru"..., "--pid-file", "/run/containerd/io.containerd.ru"..., "2f879386efacc25f502353dbbf6e5d60"...], 0xc000236180 /* 7 vars */) = 0
containerd.txt.17335:execve("/proc/self/exe", ["runc", "init"], 0xc000096140 /* 7 vars */) = 0
containerd.txt.17337:execve("./hello", ["./hello"], 0xc0001c3a40 /* 3 vars */
Here we can see containerd starting runc, runc re-executing itself as runc init (note the /proc/self/exe), and finally our hello being executed.
In the end, there is one more step in our flow above:
So why are the errors different?
Now, let’s use all the tooling we got to know to get to the bottom of the errors we saw before. We were able to create and run a container using runc; let’s change config.json to start ls instead of /hello and trace the error from there.
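Mirroring the jq edit we used earlier, something like this should do it:
# point the container's process at ls instead of our binary
jq '.process.args[0] = "ls"' config.json | sponge config.json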
Running a binary that is not present
Running and tracing gives us the following output, as expected:
$ sudo strace -s 120 -ff -o runc.txt runc run hellols
ERRO[0000] container_linux.go:380: starting container process caused: exec: "ls": executable file not found in $PATH
So container_linux appears again; now let’s look into the trace. As we can see below, in one thread runc init is called successfully, but in another we see that it was not possible to locate the ls executable.
runc.txt.18743:execve("/proc/self/exe", ["runc", "init"], 0xc0002ae000 /* 8 vars */) = 0
runc.txt.18738:write(9, "{\"type\":\"procRun\"}", 18) = 18
runc.txt.18738:read(9, "{\"type\":\"procError\"}", 512) = 20
runc.txt.18738:read(9, "{\"Timestamp\":\"2021-08-15T19:13:07.442446078Z\",\"ECode\":10,\"Cause\":\"\",\"Message\":\"exec: \\\"ls\\\": executable file not found i"..., 512) = 512
Ok, what is this file descriptor 9?
Let’s run strace again with -yy to get more information about file descriptors (you may need to delete the old hellols container we used above, or use another name).
$ sudo strace -yy -s 120 -ff -o runc_fd.txt runc run hellols
With this we get some more details about fd
number 9:
runc_fd.txt.18928:write(9<UNIX:[8126329->8126328]>, "{\"type\":\"procRun\"}", 18) = 18
runc_fd.txt.18928:read(9<UNIX:[8126329->8126328]>, "{\"type\":\"procError\"}", 512) = 20
runc_fd.txt.18928:read(9<UNIX:[8126329->8126328]>, "{\"Timestamp\":\"2021-08-15T19:42:17.814404294Z\",\"ECode\":10,\"Cause\":\"\",\"Message\":\"exec: \\\"ls\\\": executable file not found i"..., 512) = 512
Ok, what are those numbers in <UNIX:[8126329->8126328]>? After some digging, searching for socket in the logs, I found something:
runc_fd.txt.19077:socketpair(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, [8<UNIX:[8253085->8253086]>, 9<UNIX:[8253086->8253085]>]) = 0
At this point I believe what happens is that runc init communicates with the original runc process through this socket pair and reports the error it hit when trying to locate the binary for the designated process.
Looking for procError in the code, we find it inside the StartInitialization method in factory_linux.go, before Init from standard_init_linux.go is called:
defer func() {
	// We have an error during the initialization of the container's init,
	// send it back to the parent process in the form of an initError.
	if werr := utils.WriteJSON(pipe, syncT{procError}); werr != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	if werr := utils.WriteJSON(pipe, newSystemError(err)); werr != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
}()
And to answer my last question, why the defer unconditionally sends an error, we have a nicely documented line just below:
// If Init succeeds, syscall.Exec will not return, hence none of the defers will be called.
return i.Init()
So, when we are able to initialize our container successfully, we replace the original process, and that defer is never called. But in the trace files there was no call to execve asking to start ls, so it must have failed before line 228.
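We can double check that in the trace files (file names follow the strace run above):
# runc init is exec'd, but there is no execve for ls anywhere in the trace
grep execve runc.txt* | grep '"ls"'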
And, there it is:
// Check for the arg before waiting to make sure it exists and it is
// returned as a create time error.
name, err := exec.LookPath(l.config.Args[0])
if err != nil {
	return err
}
runc outsmarts me by looking for my binary before actually trying to execute it; that is why the last execve is never called, and that is why the reported line is different.
That was a long run to understand why the two commands gave different errors, and the reason behind it is pretty reasonable: the runtime checks for the binary before trying to start it, and fails early when it does not exist.
The failure of the dynamically linked binary is an interesting one, because execve itself returns a failure, so the hello binary is not even started. From a quick search it seems to be related to how the dynamic linker and the kernel work: since our container has neither libc nor the dynamic linker itself, the kernel cannot even load hello. But that is a subject for another post :)
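We can get a hint by inspecting the dynamically linked build of the binary (my own check; the interpreter path is what I would expect on an x86-64 glibc system, your output may differ):
# show the program interpreter the kernel will try to load on execve
readelf -l bin/hello | grep interpreter
# list the shared libraries the binary needs at runtime
ldd bin/hello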
Conclusion
My initial intention was to build a base image with LFS, but after understanding a little more about the way containers are isolated I saw it did not make much sense. Since there is no “boot”, we miss the fun part of LFS, which is seeing your kernel come to life. Here we would simply be building a very light “toolkit”.
I’ve always heard about how container isolation was different from and lighter than VMs, but trying to create a base image and poking around really helped me learn more, not only about containers but also about Linux. All the tooling used to inspect the calls from docker down to runc was very interesting.
I’m curious now about how namespaces, SELinux and cgroups work; I’ll investigate and post some more Linux poking around here soon!