For each test we run, we clone the worker process, then make a bunch of Linux syscalls to set everything up for the container, then exec the test. We use the trick of having the child process share the virtual memory of the parent until the test is exec'ed.
We also use a technique where we build up a "program" of simple operations (each operation more or less maps to a syscall) in the parent before cloning, then evaluate the program in the child. This gives us the same performance benefits as using posix_spawn or vfork, but lets us configure all of the namespace stuff while we're spawning.
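For readers who want to see the shape of that technique, here is a minimal sketch in C. It is not the poster's actual code: the names `struct op`, `struct program`, `push`, `eval_program`, and `spawn` are hypothetical, and the set of operations is cut down to chdir, dup2, and exec so that it runs unprivileged. The parent builds the whole op list up front, then calls clone() with CLONE_VM | CLONE_VFORK so the child borrows the parent's memory and only makes syscalls until it execs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One step of the child-side "program"; each kind maps to one syscall. */
enum op_kind { OP_CHDIR, OP_DUP2, OP_EXEC };

struct op {
    enum op_kind kind;
    int a, b;           /* OP_DUP2: old and new file descriptors */
    const char *path;   /* OP_CHDIR target or OP_EXEC executable */
    char *const *argv;  /* OP_EXEC arguments, NULL-terminated */
};

#define MAX_OPS 16
struct program { struct op ops[MAX_OPS]; size_t len; };

/* Parent side: append one operation to the program. */
void push(struct program *p, struct op o) {
    assert(p->len < MAX_OPS);
    p->ops[p->len++] = o;
}

/* Child side. The child shares the parent's memory, so it only walks
 * the prebuilt list and makes syscalls: no malloc, no stdio. */
int eval_program(void *arg) {
    const struct program *p = arg;
    for (size_t i = 0; i < p->len; i++) {
        const struct op *o = &p->ops[i];
        int rc = 0;
        switch (o->kind) {
        case OP_CHDIR: rc = chdir(o->path); break;
        case OP_DUP2:  rc = dup2(o->a, o->b) < 0 ? -1 : 0; break;
        case OP_EXEC:  execv(o->path, o->argv); rc = -1; break;
        }
        if (rc != 0) _exit(127);  /* a step failed */
    }
    _exit(126);                   /* program ended without an exec */
}

/* Spawn a child that evaluates `p`. Returns its wait status, or -1. */
int spawn(const struct program *p) {
    size_t stack_size = 64 * 1024;
    char *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED) return -1;
    /* CLONE_VM: share memory. CLONE_VFORK: suspend the parent until the
     * child execs or exits. The stack grows down, so pass its top. */
    pid_t pid = clone(eval_program, stack + stack_size,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, (void *)p);
    int status = -1;
    if (pid > 0 && waitpid(pid, &status, 0) < 0) status = -1;
    munmap(stack, stack_size);
    return status;
}
```

A caller would fill a `struct program` with `push(...)` calls ending in an `OP_EXEC` and then call `spawn(&p)`. Namespace setup would presumably show up as extra CLONE_NEW* flags on the clone() call plus more op kinds (mounts, uid maps); that part is left out here as an assumption rather than guessed at.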
It’s very easy to write one. I’ve done it in half an hour in bash. (Most of the half hour was spent cursing at various versions of util-linux that were broken in creative ways.)
If anyone is doing it from scratch, in a real programming language (which, for better or for worse, seems to currently mean C or Go or futzing with raw syscalls over FFI), one shouldn’t use chroot or the old mount syscall. The new mount API (fsopen, fsconfig, fsmount, move_mount and friends) is much better.
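To make the new mount API concrete, here is a rough sketch in C of mounting a tmpfs with it. The helper name `mount_tmpfs` is hypothetical, and the sketch assumes kernel headers from Linux 5.2 or later; it goes through raw syscall() so it does not depend on the libc having wrappers. It needs CAP_SYS_ADMIN in the relevant user namespace, so outside a container setup it will normally just fail with EPERM.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/mount.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: mount a fresh tmpfs of the given size at `target`.
 * Returns 0 on success, -1 with errno set on failure. */
int mount_tmpfs(const char *target, const char *size) {
    int saved, mfd = -1, rc = -1;

    /* 1. Open a filesystem context. Nothing is mounted yet. */
    int fsfd = (int)syscall(SYS_fsopen, "tmpfs", FSOPEN_CLOEXEC);
    if (fsfd < 0) return -1;

    /* 2. Set options one key at a time, then create the superblock. */
    if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "size", size, 0) < 0 ||
        syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
        goto out;

    /* 3. Turn it into a detached mount, held as a file descriptor. */
    mfd = (int)syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC,
                       MOUNT_ATTR_NOSUID | MOUNT_ATTR_NODEV);
    if (mfd < 0) goto out;

    /* 4. Attach the detached mount at `target`. */
    if (syscall(SYS_move_mount, mfd, "", AT_FDCWD, target,
                MOVE_MOUNT_F_EMPTY_PATH) == 0)
        rc = 0;

out:
    saved = errno;            /* close() may clobber errno */
    if (mfd >= 0) close(mfd);
    close(fsfd);
    errno = saved;
    return rc;
}
```

The appeal over mount(2) is that each step is a separate call with its own error, and the mount exists as a file descriptor before it is attached anywhere, so a container builder can assemble a tree of detached mounts and only then move it into place.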
Cgroups are nice and add some fun features, but they’re just icing on the cake. They aren’t necessary, even for a very functional and nicely secure container, unless the stuff inside the container needs cgroup delegation.
Using iptables to make a container is IMO pathetic, and I’m hoping to find time at some point to work out something better.