Testcontainers' reaper problem

By akash_rawal, created: 2024-08-18, last modified: 2024-08-18

How do you run external dependencies in containers to support your unit tests, and clean them up reliably, without Docker socket access or additional privileges?

How it all started

I was once building a service that checks a MariaDB cluster for common problems and does some basic repair work automatically. So I needed a MariaDB instance to test my project. For that, I used Testcontainers. Testcontainers is a library for running external dependencies in ephemeral containers, so there is no need to set up environment variables or manage test servers. Go check out their homepage.

I set up a quick test to check out how it works... and we have a problem.

2024/08/18 21:26:46 🐳 Creating container for image testcontainers/ryuk:0.7.0
2024/08/18 21:26:46 ✅ Container created: 912f450a9bc3
2024/08/18 21:26:46 🐳 Starting container: 912f450a9bc3
2024/08/18 21:26:46 ✅ Container started: 912f450a9bc3
2024/08/18 21:26:46 ⏳ Waiting for container id 912f450a9bc3 image: testcontainers/ryuk:0.7.0. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2024/08/18 21:26:46 failed accessing container logs: Error response from daemon: can not get logs from container which is dead or marked for removal
--- FAIL: TestMain (0.41s)

I ended up wasting the rest of the day troubleshooting: recognizing that the testcontainers/ryuk container was unable to do its work, trying some workarounds, and then grudgingly disabling the user namespace feature that I had enabled in the Docker daemon.

Testcontainers' implementation detail: its reaper

When you start containers within your tests, it is desirable to reliably terminate them.

Many programming languages offer deterministic destructors, or similar features such as Go's defer statement or Rust's Drop trait. Some test runners can run a user-defined cleanup function at the end of each test, or at the end of all tests.

None of these are reliable. If the test crashes or terminates abnormally, these containers will not be cleaned up.
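To make this concrete, here is a minimal Python sketch. It has nothing to do with Testcontainers itself, and the names CHILD and cleanup_ran are made up for illustration. A child process stands in for a test run whose cleanup lives in a finally block, and a marker file stands in for "the container was stopped". The marker appears when the child exits normally, but not when the child is killed.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a test process: its cleanup lives in `finally`.
CHILD = """
import sys, time
marker = sys.argv[1]
try:
    print("ready", flush=True)      # tell the parent we are inside the try block
    time.sleep(float(sys.argv[2]))  # the "test" is running
finally:
    open(marker, "w").close()       # stands in for "stop the container"
"""


def cleanup_ran(kill):
    """Run the child and report whether its cleanup code got to run.

    If `kill` is true, the child is killed while it is still inside the
    try block, which is what an abnormal termination looks like.
    """
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "cleaned-up")
        sleep_s = "30" if kill else "0"
        cmd = [sys.executable, "-c", CHILD, marker, sleep_s]
        with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
            proc.stdout.readline()  # wait until the child prints "ready"
            if kill:
                proc.kill()         # SIGKILL on POSIX: no handlers, no finally
            proc.wait()
        return os.path.exists(marker)
```

Calling cleanup_ran(kill=False) finds the marker file, while cleanup_ran(kill=True) does not. The same applies to destructors, defer and test-runner hooks, because all of them run inside the process that just died.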

Testcontainers' solution is docker.io/testcontainers/ryuk. According to its Docker Hub description, using it involves only 6 simple steps.

1. Start it:

 $ docker run -v /var/run/docker.sock:/var/run/docker.sock -e RYUK_PORT=8080 -p 8080:8080 docker.io/testcontainers/ryuk

2. Connect via TCP:

 $ nc localhost 8080

3. Send some filters:

 label=testing=true&health=unhealthy
 ACK
 label=something
 ACK

4. Close the connection

5. Send more filters with "one-off" style:

 printf "label=something_else" | nc localhost 8080

6. See containers/networks/volumes deleted after 10s.
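The same conversation can be scripted. Below is a rough Python sketch of a client for the steps above. It is not Testcontainers' own client code. The helper names encode_filters and register_filters are hypothetical, and the sketch assumes that Ryuk expects each filter line to end with a newline and answers with a line reading ACK, as the nc session suggests.

```python
import socket


def encode_filters(pairs):
    """Join (key, value) pairs into one filter line.

    Example: [("label", "testing=true"), ("health", "unhealthy")]
    becomes "label=testing=true&health=unhealthy".
    """
    return "&".join("{}={}".format(key, value) for key, value in pairs)


def register_filters(host, port, pairs, timeout=5.0):
    """Send one filter line and wait for the ACK.

    Returns the open socket. The caller keeps it open for as long as the
    matching resources should stay alive, and closes it to let the reaper
    start its countdown.
    """
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        # Assumption: one newline-terminated line per set of filters.
        sock.sendall(encode_filters(pairs).encode("utf-8") + b"\n")
        reply = sock.makefile("rb").readline().strip()
        if reply != b"ACK":
            raise RuntimeError("unexpected reply from reaper: {!r}".format(reply))
    except BaseException:
        sock.close()
        raise
    return sock
```

A test suite would call register_filters("localhost", 8080, [("label", "testing=true")]) once at startup and simply never close the returned socket. When the test process dies, the operating system closes it.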

Its weakness

It requires access to the docker socket.

Access to the docker socket usually implies root access on the system the docker daemon runs on.

This also means that ryuk cannot function if it is running in a user namespace or a similarly unprivileged situation.

I am not a fan of tests requiring root access. If anything, I want software development work to be as contained as I can make it. If a test goes haywire, I don't want to waste the next few hours of my life restoring my PC from backup. This is not a far-fetched concern: I once encountered a buggy test case that performed rm -rf $HOME because of hardcoded Windows paths which didn't work on Linux.

Podman is a very attractive alternative to Docker. It requires no root access, needs no daemons, and the only setup it requires is configuring /etc/subuid and /etc/subgid, which your operating system likely already does.

But testcontainers requires a docker socket, and at the time of writing, its podman support is experimental.

How do we reliably clean up dependencies running in containers?

Just replace the entrypoint

For each dependency container, we can replace its entrypoint with one that includes cleanup functionality.

This is what it needs to do:

  1. Fork off the prior entrypoint.
  2. Open a TCP socket and wait for a connection.
  3. Wait for the connection to close.
  4. When the connection closes, simply quit.
  5. Because the new entrypoint is PID 1 of the container, the container terminates when it quits.

The following was the first iteration of such an entrypoint.

#!/bin/sh

# Run the base container's entrypoint in the background
"$@" &

# Exit after the keep-alive connection is closed, or after 15 seconds
# if nobody connects at all
exec socat TCP-LISTEN:4,accept-timeout=15 EXEC:cat
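For readers who don't speak socat, the same logic can be spelled out as a Python sketch. This is for illustration only and is not what runs in the container image. The names wait_for_keepalive and run_entrypoint are hypothetical, and unlike EXEC:cat this version does not echo data back to the client.

```python
import socket
import subprocess
import sys


def wait_for_keepalive(listener, accept_timeout_s=15.0):
    """Wait for one client to connect and then disconnect.

    Returns True once the client has gone away, or False if nobody
    connected within accept_timeout_s seconds.
    """
    listener.settimeout(accept_timeout_s)
    try:
        conn, _addr = listener.accept()
    except socket.timeout:
        return False
    with conn:
        conn.settimeout(None)  # block for as long as the tests keep running
        try:
            while conn.recv(4096):  # b"" means the peer closed the connection
                pass
        except OSError:
            pass  # a reset, e.g. a killed test process, also counts as closed
    return True


def run_entrypoint(argv, port=4):
    """Start the original entrypoint, then exit when the keep-alive ends."""
    subprocess.Popen(argv)  # step 1: run the prior entrypoint in the background
    with socket.create_server(("", port)) as listener:  # step 2: listen
        wait_for_keepalive(listener)  # steps 3 and 4: wait for connect and close
    sys.exit(0)  # step 5: PID 1 exits, so the container stops
```

A script built on this would end with run_entrypoint(sys.argv[1:]), mirroring how the shell version passes "$@" along.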

This is what the test fixture does:

  1. Start the container with the replaced entrypoint.
  2. Connect to TCP port 4 and keep the connection open.
  3. At the end of the test suite, close the TCP connection. If the test crashes, the operating system closes the connection instead.
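Steps 2 and 3 can be sketched in a few lines of Python. The helper name open_keepalive is hypothetical, and the sketch assumes that container port 4 has been published on some host port that the fixture knows about. It retries the connection because the container may need a moment before the entrypoint starts listening.

```python
import socket
import time


def open_keepalive(host, port, deadline_s=15.0, retry_s=0.2):
    """Connect to the entrypoint's keep-alive port and return the socket.

    Retries until deadline_s seconds have passed, then re-raises the last
    connection error. The returned socket must stay open for the whole
    test run; closing it (or dying) tells the container to exit.
    """
    give_up_at = time.monotonic() + deadline_s
    while True:
        try:
            sock = socket.create_connection((host, port), timeout=1.0)
            sock.settimeout(None)  # no timeout once connected
            return sock
        except OSError:
            if time.monotonic() >= give_up_at:
                raise
            time.sleep(retry_s)
```

Since Python sockets are context managers, a fixture could wrap the test run in a with open_keepalive(host, port) block, so the connection is closed on normal exit. On a crash there is nothing to do: the operating system closes the socket when the process goes away.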

I have a more robust version of the entrypoint at https://gitlab.com/akash_rawal/selfterm/-/tree/master/test_entrypoint. In the GitLab project you can also find prebuilt MariaDB and Postgres container images.

Conclusion

Testcontainers is an awesome project and I like the idea, but I think its implementation for cleaning up containers is too complicated and a bit flawed. Its reliance on the Docker daemon is an ongoing problem, but with a dash of trickery that reliance can be avoided. I hope we get more testing tools that are compatible with user namespaces or Podman, or that similarly require fewer privileges.

What else? Test safely, folks!