Remote Code Execution as a Service (earthly.dev)
127 points by dijit on March 7, 2023 | 42 comments

Thanks for sharing this. One of the authors here.

We built a service that executes arbitrary user-submitted code. An RCE service. It's the thing you're not supposed to build, but we had to do it.

Running arbitrary code means containers weren't a good fit (container breakouts happen), so we are spinning EC2 instances up and down. This means we have actual infrastructure as code (i.e., not just piles of Terraform but Go code running in a service that spins VMs up and down based on API calls).

The service spins up and down EC2 instances based on user requests and executes user-submitted build scripts inside them.

It's not the standard web service we were used to building, so we thought we'd write it up and share it with anyone interested.

One cool thing we learned was how quickly you can Hibernate and wake up x86 EC2 instances. That ended up being a game-changer for us.

Corey and Brandon did the building, I'm mainly just the person who wrote things down, but hopefully, people find this interesting.

Container breakouts are rare, and they typically require the attacker to control the container creation parameters and/or the actual image being run. If you control those things and apply process isolation best practices (seccomp, cap drops, etc.), then you are in pretty good shape.

Source: ran a container-based RCE service that ran millions of arbitrary workloads per month. We had sophisticated network and system anomaly detection, high priced pentesters, etc., and never had a breakout.

> ... never had a breakout.

Would "never detected a breakout" be better wording? :)

> We had sophisticated network and system anomaly detection, high priced pentesters

I assume GP wrote that in order to say that they have a high confidence that they never actually had a breakout.

You are technically correct. But your logic applies to everything. Is the isolation provided by VMs good enough? Is airgapping enough to prevent breakout?

There are many things that factor in when you decide what's reasonable. Some are first principle arguments (containers use the same kernel as the host, the kernel has a large surface area, ...). Others are statistical arguments: there have been past breakouts with this stack, it's thus reasonable to expect more in the future, ...

Interesting! What was the service? In our case we control the container, which is BuildkitD, but it has to be run privileged, which means lots of solutions are off the table.

Rather not say. Yeah, building and then running containers where users get to pick the base image is a risk.

We found that privileged is a pretty big hammer. We thought we needed it too, but we found ways to get the functionality we needed without all the extra stuff that privileged brings in.

Have you used things like gVisor?

I would also suggest looking at https://github.com/bazelbuild/remote-apis. It's essentially a standard API for remote (any binary) execution as a service, and there are several reference implementations of it (Buildgrid, BuildBarn, Google's own service, etc.).

And you can consider using gVisor to minimize container breakouts to a great extent.

I'll check out that remote-apis link.

gVisor was considered, but so far it looks like the next iteration will be using Firecracker VMs. Our backend is buildkit and it can't run in gVisor containers without some work.

Firecracker looks great but it requires bare metal instances or nested virtualization (which is not supported by EC2 instances IIRC).

How do you run firecracker?

EC2 metal instances are expensive but they let you run FC.

> One cool thing we learned was how quickly you can Hibernate and wake up x86 EC2 instances. That ended up being a game-changer for us.

Could you talk more about it? Are you keeping a cache of hibernated EC2 instances and re-launching them per request? What sort of relaunch latency profile do you see as a function of instance memory size?

A specific EC2 instance is always serving one customer max. And builds are highly cacheable, so the EC2 instance has an EBS volume on it with a big cache that Earthly uses to prevent rework.

That instance is just sitting around waiting for gRPC requests that tell it to run another build. If it's idle for 30 minutes, it hibernates, and then if another call comes in, a gRPC proxy wakes it back up.

I don't know if the wake-up time increases with the size of the cache in memory. I can check with Brandon, but it's much faster than starting up an instance cold, mainly because buildkit is designed for throughput and not a quick startup.

There are more details in the blog.

how do you prevent build 1 from modifying the VM in a way that impacts build 2?

if a build 1 happens to install a specific libc, do you un-install that libc before running build 2?

if you just say that stuff is the responsibility of the user, okay, but then the artifacts produced by this system aren't deterministic, which seems like a problem?

Good question, so the builds are specified in Earthfiles, which are run by our buildkit backend.

Buildkit runs the builds in runC, so basically containers are used to keep things deterministic, but the buildkit backend isn't shared, each is in its own EC2 instance.

if your security model assumes user code is able to break out of containers, then it also has to assume user code will be able to modify the VM

if user code can modify the VM, and VMs are stateful, then how do you ensure hermetic builds?

So if you could break out or access cache entries that weren't correct you would have found a way to break reproducibility, but not access anything you shouldn't.

ah, okay, so you're using containers, you're running those containers on VMs that are per-customer rather than shared/multi-tenant, and the post is about optimizing how those VMs are managed -- got it

> but the buildkit backend isn't shared, each is in its own EC2 instance

it isn't shared between different customers, but it is shared between different builds for the same customer, right?

While container breakouts do happen, they're pretty rare. I'd be more concerned about any potential injection vectors in the Go code, which could lead to a cloud breach if you're not careful ;)

There have been a bunch of Linux kernel privesc vulns that can be converted to container breakouts from standard Linux containers, just look at bounties from Google's kCTF (AFAIK they've had 10 different kernel vulns in 2 years)

It's possible to mitigate/reduce them for sure, with appropriate hardening, but the Linux kernel is still quite a big attack surface.

What about kernels like seL4? I think everyone will abandon monolithic kernels one day because they have too much attack surface.

Is anyone running normal workloads (node/java/php/python/whatever) on seL4 without sticking Linux in the middle?

Oh interesting. What were you imagining as the injection vector?

The Earthly backend runs on a modified buildkit, so it is running the arbitrary code in a container, but it's also in its own VM. This was simpler than Firecracker to get started with, but it turned out to have pretty good performance and alright cost once we started suspending things.

More if you're running `provision --vm-name "$UserSuppliedData"` or similar. I don't know how you've built your wrapping tool, so I can't comment on how likely it would be, but I've seen such breakages IRL (I break things for a living ;) )
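One common guard against that kind of injection is to validate the value against a strict allowlist and build an argument vector rather than a shell string. A minimal sketch, assuming a hypothetical `provision` tool: the name pattern and the `ProvisionArgs` helper are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// vmNamePattern is an assumed allowlist: lowercase letters, digits, and
// hyphens, starting with a letter or digit, at most 63 characters.
var vmNamePattern = regexp.MustCompile(`^[a-z0-9][a-z0-9-]{0,62}$`)

// ProvisionArgs builds the argument vector for the hypothetical `provision`
// tool. Rejecting a leading '-' keeps user input from being read as a flag,
// and returning a slice (not a joined string) means no shell parses it.
func ProvisionArgs(vmName string) ([]string, error) {
	if !vmNamePattern.MatchString(vmName) {
		return nil, fmt.Errorf("invalid VM name %q", vmName)
	}
	return []string{"provision", "--vm-name", vmName}, nil
}

func main() {
	for _, name := range []string{"build-42", `x"; rm -rf / #`, "--help"} {
		args, err := ProvisionArgs(name)
		fmt.Println(args, err)
	}
}
```

The resulting slice can be handed to `exec.Command(args[0], args[1:]...)` from `os/exec`, which runs the program directly without invoking a shell, so quotes and semicolons in the input are never interpreted.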

Good point. We do have things locked down pretty well in our Go code, though. The instances can only be provisioned using an API, and that API doesn't allow for arbitrary user-supplied input.

I had the impression that's what fly.io was doing.

Converting Docker images to run in VMs instead of containers.

Fly uses Firecracker like AWS Lambda, not full-fledged VMs like EC2.

That’s not accurate. Fly creates a FC VM from a Dockerfile but injects a pid1 supervisor and behaves more like EC2 than Lambda (no 15-minute execution limit, for instance).

Did you consider using Firecracker?

Yeah, we totally did, and Corey has actually been playing around with a POC for backing this with Firecracker. It's likely the next revision of this will use Firecracker.

So it wasn't so much disqualified as deferred: we decided it wouldn't be the v1 solution. We wanted to get something out, get people on it, and get feedback. So we weren't afraid to spend some more compute dollars to do so.

Perhaps a dumb question but container breakouts are a problem for all sorts of services which have been addressed in different ways. Since your goal is not to prevent container breakouts but rather securely run third-party code, why did you choose to use EC2 over something like ECS or Lambda or Google Cloud Run which is already dealing with the security aspect on your behalf? Virtual machines seem less secure and less convenient.

Good question. Our goal is not just to run arbitrary code but to run it fast and cache rework. We are a CI service and speed is important. Brandon may be able to jump in on why we didn't go with the various other options, but it's hard to beat giving users powerful cloud machines to run their builds on.

I myself did try to run buildkit in a Lambda, as I thought that would be a low-cost option. But I found that you couldn't make gRPC calls against a Lambda, and that is a hard requirement for us.

Some of the reason we went with EC2 over something like ECS is that we would need to run the container in privileged mode for some of our features to work. We also considered options like gVisor, but ultimately the EC2 route was a simple enough implementation that made it easy to manage the user's cache volumes, etc. We're also hoping to use Firecracker VMs in the near future.

Are you concerned about AWS launching a competing service if you get significant traction?

This is an interesting question. Originally we were, and had a BSL license. We changed our mind on that and went with the Mozilla Public License, though.

Remote Code Execution as a Service (AKA sandboxing distributed executable code remotely) -- is a brilliant business idea in my opinion, but IF AND ONLY IF such code execution can be guaranteed secure on end user's machines -- and my thoughts are that given all of the (lack of) security related issues that we read about in this day and age -- that it would be difficult, bordering on impossible, to create, much less rigorously prove such a security guarantee...

Still, if one were to attempt this exercise -- one path might be via extremely small emulated instruction set in an extremely tiny and open source (much easier to audit by end users) virtual machine where the emulated instruction set is NOT the same as the one on the host machine (i.e., RISC-V emulated instructions on x86 or ARM host...).

Would such an emulator be slower than native host execution?

Probably, possibly many times slower...

But, what might be lost in speed -- might be made up for in the ability to audit (at least better) for security concerns...

So in summation, is it a good idea?

Possibly -- but it needs that extra rigorous security guarantee which is very difficult to get exactly right in this day and age...

Also -- supposing that this business model was completely viable from a security perspective -- the next step to building such a system might be to build an auction site for CPU cycles (or compute ability/compute time) -- sort of like an "EBay for highly distributed but sandboxed code execution" (clients bid on renting compute resources, end users bid on selling theirs) which would be an interesting business model, IMHO...

Related: SETI@home: https://en.wikipedia.org/wiki/SETI@home

> Also -- supposing that this business model was completely viable from a security perspective -- the next step to building such a system might be to build an auction site for CPU cycles (or compute ability/compute time) -- sort of like an "EBay for highly distributed but sandboxed code execution" (clients bid on renting compute resources, end users bid on selling theirs) which would be an interesting business model, IMHO...

In the Commonwealth Saga by Peter F. Hamilton, people have access to nearby computing nodes, much as we currently have access to free wifi, for accelerating any tasks they need (like personal assistants, called e-butlers). That's not 100% secure and is sometimes used for nefarious purposes (and plot devices), but it's just accepted as a nuisance, like other things that can be weaponised in our own reality (like cars).

I also allow remote code execution, though interactively [0]. Though I'm not using any VMs or anything like that for isolation.

Note: All the interesting stuff is in /opt/appfs

[0] https://rkeene.dev/js-repl/?arg=bash

Reminds me of the time in 2016 when I added a single line to one of our internal node modules to demonstrate how you can take over our entire Mesos cluster with a single compromised dependency.

Spawned a program on every single node that wrote "yes" into the file /etc/am_i_compromised..

You should try LXD containers. They are secured by default and very fast to spin up and down. You can also snapshot them.

Aren’t servers remote execution services?

You mean a VPS? Yes. An HTTP server, no. At least not by default / design.

