A container implementation that depends neither on Docker *nor on runc* is at th...

nfachan · 2024-07-10T04:09:05 1720584545

If you're interested, the main part of the container implementation is here: https://github.com/maelstrom-software/maelstrom/blob/main/cr...

For each test we run, we clone the worker process, then make a bunch of Linux syscalls to set everything up for the container, then exec the test. We use the trick of having the child process share the virtual memory of the parent until the test is exec'ed.

We also use a technique where we build up a "program" of simple operations (each operation more or less maps to a syscall) in the parent before cloning, then evaluate the program in the child. This gives us the same performance benefits as using posix_spawn or vfork, but lets us configure all of the namespace stuff while we're spawning.

The code that's run in the child can be found here: https://github.com/maelstrom-software/maelstrom/blob/main/cr...
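To make the "build a program in the parent, evaluate it in the child" idea concrete, here is a minimal sketch in Python. It is not Maelstrom's code: the names `OPS`, `run_program`, and `spawn` are hypothetical, and the operation set is a stand-in for the namespace and mount calls a real container would issue. Note also that `os.fork` gives the child a copy-on-write address space rather than sharing the parent's memory as a clone with shared VM or vfork would, so this only illustrates the structure, not the performance property.

```python
import os

# Hypothetical operation table: each entry maps a name to one syscall-like
# call. A real container runtime would include unshare, mount, pivot_root, etc.
OPS = {
    "chdir": os.chdir,
    "dup2": os.dup2,
    "close": os.close,
    "setsid": os.setsid,
}


def run_program(program):
    """Evaluate a prebuilt list of (op_name, args) pairs, in order."""
    for name, args in program:
        OPS[name](*args)


def spawn(program, argv):
    """Fork, evaluate `program` in the child, exec `argv`, return exit code.

    The program is fully built by the caller before the fork, so the child
    only walks a list and makes calls. If any step (or the exec) fails, the
    child exits with status 127.
    """
    pid = os.fork()
    if pid == 0:
        try:
            run_program(program)
            os.execv(argv[0], argv)
        finally:
            # Reached only on failure; never return into the parent's code.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

For example, `spawn([("chdir", ("/",))], ["/bin/sh", "-c", "pwd -P"])` runs the shell with its working directory already set by the child-side program. The point of keeping the operations this simple is that the child does as little as possible between the fork and the exec.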

amluto · 2024-07-10T02:51:04 1720579864

It’s very easy to write one. I’ve done it in half an hour in bash. (Most of the half hour was spent cursing at various versions of util-linux that were broken in creative ways.)

Doing it well is a different story.

nine_k · 2024-07-10T03:16:59 1720581419

Well, yes, chroot, cgroups, mount --bind, and some ipfw / iptables stuff are enough to create a makeshift container.

I hope these guys are into doing it well, else runc would be more than adequate for low-level stuff.

amluto · 2024-07-10T04:30:22 1720585822

If anyone is doing it from scratch, in a real programming language (which, for better or for worse, currently seems to mean C or Go or futzing with raw syscalls through an FFI), one shouldn’t use chroot or the mount syscall. The new mount API is much better.

Cgroups are nice and add some fun features, but they’re just icing on the cake and are also not necessary, even for a very functional and nicely secure container, unless the stuff inside the container needs cgroup delegation.

Using iptables to make a container is IMO pathetic, and I’m hoping to find time at some point to work out something better.

Joker_vD · 2024-07-10T14:18:21 1720621101

> The new mount API

Could you please tell me what exactly this API is? I'd like to try and use it.

amluto · 2024-07-11T02:38:39 1720665519

open_tree() and related APIs. I’m not sure why the manpages never seem to have been applied, but they’re available from old posts:

https://lwn.net/Articles/829496/

And here’s an article about an old version of the syscalls:

https://lwn.net/Articles/759499/

nfachan · 2024-07-11T19:19:25 1720725565

We use our own small wrappers for these syscalls, built on top of Rust's libc crate. All our wrappers live here:

https://github.com/maelstrom-software/maelstrom/blob/main/cr...

For bind mounts, you want to look at open_tree and move_mount. For "regular" mounts, you want to look at fsopen, fsconfig, fsmount, and move_mount.
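As a rough illustration of those two call sequences, here is a sketch that invokes the syscalls directly through Python's ctypes. It is not Maelstrom's wrapper code: the helper names (`_syscall`, `bind_mount`, `mount_fs`) are made up for this example, and the syscall numbers and flag values are the Linux 5.2+ ones for x86-64 and arm64, so check your own architecture's table. Actually mounting anything requires the usual privileges (CAP_SYS_ADMIN in the relevant namespace).

```python
import ctypes
import os

# Syscall numbers on x86-64 and arm64 (assumption: verify for other arches).
SYS_OPEN_TREE, SYS_MOVE_MOUNT = 428, 429
SYS_FSOPEN, SYS_FSCONFIG, SYS_FSMOUNT = 430, 431, 432

AT_FDCWD = -100
OPEN_TREE_CLONE = 1
OPEN_TREE_CLOEXEC = os.O_CLOEXEC
MOVE_MOUNT_F_EMPTY_PATH = 0x4
FSOPEN_CLOEXEC = 1
FSMOUNT_CLOEXEC = 1
FSCONFIG_SET_STRING = 1
FSCONFIG_CMD_CREATE = 6

_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def _syscall(number, *args):
    """Call a raw syscall; bytes/None become char pointers, ints become longs."""
    converted = [
        ctypes.c_char_p(a) if a is None or isinstance(a, bytes) else ctypes.c_long(a)
        for a in args
    ]
    result = _libc.syscall(ctypes.c_long(number), *converted)
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def bind_mount(src, dst):
    """Bind mount: open_tree clones the source tree, move_mount attaches it."""
    tree = _syscall(SYS_OPEN_TREE, AT_FDCWD, src.encode(),
                    OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC)
    try:
        _syscall(SYS_MOVE_MOUNT, tree, b"", AT_FDCWD, dst.encode(),
                 MOVE_MOUNT_F_EMPTY_PATH)
    finally:
        os.close(tree)


def mount_fs(fstype, dst, options=None):
    """Regular mount: fsopen, fsconfig (options, then create), fsmount, move_mount."""
    ctx = _syscall(SYS_FSOPEN, fstype.encode(), FSOPEN_CLOEXEC)
    try:
        for key, value in (options or {}).items():
            _syscall(SYS_FSCONFIG, ctx, FSCONFIG_SET_STRING,
                     key.encode(), value.encode(), 0)
        _syscall(SYS_FSCONFIG, ctx, FSCONFIG_CMD_CREATE, None, None, 0)
        mnt = _syscall(SYS_FSMOUNT, ctx, FSMOUNT_CLOEXEC, 0)
    finally:
        os.close(ctx)
    try:
        _syscall(SYS_MOVE_MOUNT, mnt, b"", AT_FDCWD, dst.encode(),
                 MOVE_MOUNT_F_EMPTY_PATH)
    finally:
        os.close(mnt)
```

With sufficient privileges, something like `mount_fs("tmpfs", "/mnt/scratch", {"size": "16m"})` or `bind_mount("/srv/data", "/mnt/data")` would exercise each path (both target paths are placeholders). Failures surface as `OSError` with the kernel's errno.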

I found this video very useful: https://www.youtube.com/watch?v=gMWKFPnmJSc