More

xjia · 2024-02-22T20:24:50 1708633490

Before this we were using https://github.com/actions/actions-runner-controller but that's running on K8s instead of VMs. So along with common limitations of running CI jos in K8s/container, it cannot have exactly the same environment as the official GitHub runners. Maintaining a K8s cluster was also very difficult.

xjia · 2024-02-22T20:22:53 1708633373

Persistent disks are implemented as EBS snapshots, so the process is something like:

1. Create EC2 instance for runner #1. Find out there is no existing snapshot, so an empty volume is created and attached.

2. Runner #1 runs exactly 1 job and shuts down. A snapshot is taken for the persistent volume. That's going to be used by later runners.

3. Create EC2 instance for runner #2. Create a new volume based on the last snapshot.

4. Assuming #2 is still running while a new job comes in. Create EC2 instance for runner #3. Create volume based on the same last snapshot.

5. Whenever a runner finishes, its persistent volume gets a snapshot taken. Outdated snapshots are automatically removed.

And yes we manage the AMI that the runner uses. We try out best to follow https://github.com/actions/runner-images and will automate this process very soon so it's always up-to-date.

Edit: formatting

cswilliams · 2024-02-23T03:20:30 1708658430

Thanks for answering! Unless I'm misunderstanding, one issue with this method is since you're creating a new EBS volume from a snapshot every time the runner starts, the volume will be cold and there will be additional latency on the first reads from the volume. Seems like you could run into this penalty fairly often if you were constantly spinning up and down runners due to inactivity. Maybe something worth considering for v3 (spot instances would be nice to have too).

xjia · 2024-02-23T03:22:34 1708658554

Indeed. We are already optimizing this. And spot instances are coming soon as well :-)

xjia · 2024-01-11T19:13:53 1705000433

We are working on an alternative that's 100% compatible with GHA but much cheaper and faster. Check out https://dime.run/

xjia · 2023-12-31T03:23:33 1703993013

From [0]:

> gShoe is an internal joke about a product [Google] shouldn't make, and is used in many internal trainings as a placeholder for any Google product

[0]: https://www.reddit.com/r/google/s/fHCBJipqdP

xjia · 2023-12-24T17:22:26 1703438546

What is the benefit of using a remote cache instead of a local ~/.cache directory? Is it only for sharing build results among team members? How do you make sure the build results are not spoofed?

aseipp · 2023-12-24T18:08:46 1703441326

Not just team members; if you make your cache publicly readable, contributors to e.g. your GitHub/GitLab/Whatever project can also use them and get really fast builds, the first time they try to contribute. So a remote cache is nice to have, if it's seamless.

Nix works this way by default (and much of the community operates caches like this) and it can be a massive, massive time saver.

> How do you make sure the build results are not spoofed?

What do you mean "spoofed?" As in, someone put an evil artifact in the cache? Or overwrote an existing artifact with a new one? Or someone just stole your developers access and started shoving shit in there? There's a whole bunch of small details here that really matter to understand what security/integrity properties you want the cache to uphold.

FWIW, I've been looking into this in Buck2/Bazel land, and my understanding is that most large orgs just use some kind of terminating auth proxy that the underlying connection/flow/build artifacts can be correlated back to. So you know this cache artifact was first inserted by build B, done by user X, who authenticated with their key K, etc etc.

sgammon · 2023-12-24T19:03:06 1703444586

Exactly — just like Git, everything is ultimately identified with a key which can tie back to a stable identity thru OIDC or similar mechanisms. At least that’s how we did it.

yjftsjthsd-h · 2023-12-24T20:20:40 1703449240

Nix only caches at the package level, doesn't it?

sgammon · 2023-12-24T20:25:11 1703449511

Nix is different, yeah, and it won’t wire together a build cache for you. Nix is great for many things of course, it’s just not a replacement for sccache per se

Nix + sccache would probably be pretty great for preserving paths and environment, which is really healthy for build caching in general.

mikepurvis · 2023-12-24T23:22:00 1703460120

Properly handling cargo and bazel builds in nix is very much not a solved problem, and Nix's resistance to allowing ccache into the sandbox for purity reasons definitely exacerbates the problem.

aseipp · 2023-12-25T00:10:55 1703463055

sccache should not be allowed in the sandbox, at least not just bolted on; realistically such cases are probably better handled in the long run by Recursive Nix, so that you can build new derivations inside of existing ones, and the results are cached in the same (outer) store. This means there won't be duplicate caches; /nix/store and where-ever sccache puts results.

Cargo is a good example. For a number of (practical) reasons, cargo -> nix translators are a lot of effort and often have bugs, so for "simplicity" all upstream crates just compile every crate dependency every time. That means if two crates use the same dependency, it gets compiled twice. It's important to understand this is no worse than the way Cargo works already for most people. Cargo does not have a content address storage model, unfortunately. But it's pretty annoying for Nix users and costs a lot.

In theory, we could wrap rustc with a recursive-nix enabled wrapper so that rlib's etc each get a granular derivation and then get put into the host store. So assuming a crate gets built with the same flags, between two "outer" expressions, they'll get to share the work and it will go into the store. A working example of this for C++ code is nix-ccache, but a fully robust implementation is a bit of work.[1]

Recursive Nix is still experimental but there is some use of it (privately and publicly) that I know of for these purposes.

[1] https://github.com/edolstra/nix-ccache

sgammon · 2023-12-25T04:05:44 1703477144

It sounds like you’re really experienced with Nix or perhaps a contributor. We’re new to Nix at Elide (makers of Buildless) but we’d love to collaborate :)

I just tried it for the first time the other day and although I’m not ready to move to it yet, I can already see the brilliance.

mikepurvis · 2023-12-26T04:16:27 1703564187

As a user (and occasional recompiler) of the tensorflow derivation in NixOS, I'd love to see Nix able to somehow do a poetry2nix style transformation on bazel dependencies so they would be properly cached in individual store paths.

aseipp · 2023-12-25T00:04:42 1703462682

It can cache at the level of any set of files, technically speaking (a single file, directory of files, and so on); so you could even use it as a Makefile replacement or something for example. But most people don't do that; the ecosystem is broadly much more coarse-grained and designed around "packages", yes.

I wasn't really referring to the coarseness though; just that a lot of Nix projects provide build caches to speed things up for contributors. It's not just something for internal teams. And it really does help.

sgammon · 2023-12-24T17:26:48 1703438808

Sharing with team members, sharing with CI, and the ability to pull from more than just what's on your machine (i.e. a larger addressable cache than you are willing to keep on disk). Cache objects also compound across projects, so it's nice to ship them up somewhere and have them nearby when you need them.

Re/spoofing, obviously it's all protected with API keys and tokens, and we're working on mechanisms to perform end-to-end encryption. In general, build cache objects are usually addressed by a content-addressable-hash, so that also helps because your build typically knows the content it's looking for and can verify.

That isn't true for all tools, though, so we're working to understand where the gaps are and fix them.

xjia · 2023-12-24T19:34:27 1703446467

IIUC the actual computation (e.g. compiling, linking, ...) happens on client (CI or developer) machines and the results are written to the server-side cache.

By spoofing I meant to say that an authenticated but malicious client (intentionally or not, e.g. a clueless intern) may be able to write malicious contents to the cache. For example, their build toolchain could be contaminated and the resulting build outputs are contaminated. The "action" per se and its hash is still legit, but the hash is only used as the lookup key -- their corresponding value is "spoofed."

The only safe way I can imagine to use such a remote cache is for CI to publish its build results so that they could be reused by developers. The direction from developers to developers or even to CI seems difficult to handle and has less value. But I might be missing some important insights here so my conclusion could be wrong.

But if that's the case, is the most valuable use case to just configure the CI to read from / write to the remote cache, and developers to only read from the remote cache? And given such an assumption, is it much easier to design/implememt a remote cache product?

sgammon · 2023-12-24T20:03:59 1703448239

All great points but in practice, tools like Bazel and sccache are incredibly conservative about hashes matching, to include file path on disk and even env var state.

One goal of these tools is to guarantee that such misconfiguration results in a cache key mismatch, rather than a hit and a bug.

There are tons of challenges designing a remote build cache product, like anything, but that one has turned out to be a reliable truth.

Some other interesting insights:

- transmitting large objects is often not profitable, so we found that setting reasonable caps on what’s shared with the cache can be really effective for keeping transmissions small and hits fast

- deferring uploads is important because you can’t penalize individual devs for contributing to the cache, and not everybody has a fast upload link. making this part smooth is important so that everyone can benefit from every compile.

- build caching is ancient, Make does its own simple form of build caching, but the protocols for it vary in robustness greatly, from WebDAV in ccache to Bazel’s gRPC interface

- most GitHub Actions builds occur in a small physical area, so accelerating build artifacts is an easier problem than, say, full blown CDN serving

The assumptions that definitely help:

- it’s a cache, not a database; things can be missing, it doesn’t need strong consistency

- replication lag is okay because a build cache entry is typically not requested multiple times in a short window of time; the client that created it has it locally

- it’s much better to give a fast miss than a slow hit, since the compiler is quite fast

- it’s much better to give a fast miss than an error. You can NEVER break a build; at worst it should just not be accelerated.

It’s an interesting problem to work on for sure.

Thorrez · 2023-12-24T18:21:36 1703442096

>In general, build cache objects are usually addressed by a content-addressable-hash

How does that work? I would think the simplest case of a build object that needs to be cached is a .o file created from a .c file. The compiler sees the .c file and can determine its hash, but how can the compiler determine the hash of the .o file to know what to look up in the cache? I think the compiler would need to perform the lookup using the hash of the .c file, which isn't a hash of the data in the cache.

aseipp · 2023-12-24T19:08:07 1703444887

In the case of the Remote Execution/Cache API used by Bazel among others[1] at least, it's a bit more detailed. There's an "ActionCache" and an actual content-addressed cache that just stores blobs ("ContentAddressableStorage"). When you run a `gcc -O2 foo.c -o foo.o` command (locally or remotely; doesn't matter), you upload an "Action" into the action cache, which basically said "This command was run. As a result it had this stderr, stdout, error code, and these input files read and output files written." The input and output files are referenced by the hash of their contents, in this case, and they get uploaded into the CAS system.

Most importantly you can look up an action in the ActionCache without actually running it, provided you have the inputs at hand. So now when another person comes by and runs the same build command, they say "Has this Action, with these inputs, been run before?" and the server can say "Yes, and the output is a file identified by hash XYZ" where XYZ is the hash of foo.o, so you can just instantly download it from the CAS.

So there are a few more moving parts to make it all work. But the system really is ultimately content-addressed, for the most part.

[1] https://github.com/bazelbuild/remote-apis/blob/main/build/ba...

bgmeister · 2023-12-25T13:53:43 1703512423

If you're only using remote caching (ie no remote execution) then all cache clients need to trust each other, because a malicious client can upload any result it wants to a given ActionCache key, and there's no way to verify the ActionCache entries are correct unless the actions are reproducible. (And verifying ActionCache entries by rerunning the actions kind of defeats the purpose of using a build cache.)

If you don't want clients to have to trust each other, then you can block ActionCache write access to the clients and add remote execution. In this setup clients upload an action to the CAS, remote executors run the action and then upload the result to the ActionCache, using the hash of the action as the key. This way malicious clients can't spoof cache results for other clients, because other clients won't ever look for the malicious action's key in the ActionCache.

sgammon · 2023-12-24T18:57:15 1703444235

In Bazel’s case and other cases, build cache objects are held in CAS and then referenced from other keys. I believe BuildXL from Microsoft also works this way.

Of course one other advantage to build caches is they are verifiable: the intent is to produce the exact same output as a normal call, and that’s easily checked on the client side.

No question that build caching poses inherent supply chain risks though and that’s part of what we want to solve. I think people are hesitant to trust build caching for good reason until there are safer mechanisms and better cryptographic patterns applied.

sgammon · 2023-12-24T19:12:24 1703445144

Yep, aseipp, and we support the full gRPC interface for remote caching offered by Bazel, including the newer APIs.

Explained better than I could for sure. I find it very interesting how BuildXL and Bazel ended up at similar models for this problem. I don’t yet know the history of which informed which.

(As compared to, say, Gradle, which works based on input hashes instead.)

krupan · 2023-12-24T19:03:20 1703444600

When a .o is stored in the cache it is associated with the hash of the .c file

sgammon · 2023-12-24T17:27:40 1703438860

(Fwiw, group conversation encryption tech like MLS is somewhat applicable, and that's the sort of pattern we're looking at, but it would be cool to know if that's moving to you on the problem of safety w.r.t. builds.)

throwawaaarrgh · 2023-12-24T19:39:37 1703446777

It's for sharing and aggregating. Ccache is useful locally, but really shines when combined with Distcc, a distributed compiler. Every host contributes a cache object that other hosts can use, and every host can use the cache object contributed by other hosts. So you don't even have to built it once yourself to benefit from the cache of everyone else. It therefore speeds up multiple hosts/users builds, distributed builds and the dev experience of individuals.

mgaunard · 2023-12-24T18:41:27 1703443287

I built my own build system that does something similar.

I've set it up at work with two S3 buckets: trusted and untrusted. CI/CD read/write from trusted only. Developers read/write from untrusted, and read-only from trusted.

sgammon · 2023-12-24T19:30:03 1703446203

We decided to back our main cache with in-memory storage for spicier performance. I’m curious how well S3 has worked for you here? Is it fast enough?

Or, maybe the blobs you’re dealing with are on the bigger end? That would also make sense

mgaunard · 2023-12-24T19:58:55 1703447935

Each object file (.o) has a unique hash and is stored as thehash.o.

It's certainly much faster to download the .o than it is to build it. Once it's downloaded it stays on the local filesystem until it's garbage-collected.

rurban · 2023-12-25T08:12:17 1703491937

> It's certainly much faster to download the .o than it is to build it.

That depends on your network latency and throughput, plus the performance of your compiler.

With fast compilers and less bloat (not C++ nor Rust) such a distcc is certainly slower.

sgammon · 2023-12-24T20:23:31 1703449411

Hm, interesting. Our free tier is planned to be this plus R2, so I’m happy to hear S3-style data exchange is working for people. Thanks for sharing

mgaunard · 2023-12-24T22:05:54 1703455554

The whole point of S3 is that it is inexpensive. You don't want to pay premium money for terabytes of data that are usually invalidated everyone someone makes a significant change.

sgammon · 2023-12-25T01:56:49 1703469409

Our blobs end up quite a bit smaller than that, and I can see using S3 for many build cache problems. In our case in-memory worked better with overflow to disk for large or infrequently used objects.

R2 is S3 to me, btw, or at least it’s the same API. Object storage as a model is really what I’m asking about. I’m genuinely glad to hear S3 was enough to be impactful.

mgaunard · 2023-12-27T16:49:08 1703695748

I mostly saw R2 as a faster alternative to S3 with a compatible API.

But it looks like it could be competitive pricing-wise as well, especially if you need to access the same data across regions.

xjia · 2023-12-18T02:33:17 1702866797

If you build it from source it's actually a bunch of regular executables. Running as a container is for the purpose of making the environments hermetic/reproducible so that it's easier to support users.

1over137 · 2023-12-18T02:58:02 1702868282

How do you conclude that? The instructions here:

https://github.com/naivesystems/analyze/wiki/How-to-run-on-m...

say docker is required.

xjia · 2023-12-18T03:02:21 1702868541

That's running the prebuilt container image. The README has instructions on how to build from source.

1over137 · 2023-12-18T03:26:38 1702869998

The README says "To build from source, follow the steps below on Fedora 36 or 37. Other versions may also work but are not officially supported" So it seems on Mac and Windows docker is the only way. :(

xjia · 2023-12-18T04:22:56 1702873376

Yes, the code of the analyzer itself cannot be built on macOS or Windows directly. In those cases you will have to use either podman or docker to use the binaries prebuilt on/for Linux. The whole system depends on a lot of other stuff on Linux so it won't easily run on other OSes.

1over137 · 2023-12-18T14:55:39 1702911339

Ah, that's too bad. For me at least, that makes the barrier to entry too high. Having to learn some new software (docker) just to try some other software is just too much.

xjia · 2023-12-18T02:30:53 1702866653

No. Using GPL software has nothing to do with the code processed by it. Think about the GCC compiler :)

mcqueenjordan · 2023-12-18T05:32:15 1702877535

I think just using it locally in manual or ad-hoc senses is fine, but IIRC it can become tricky or at least a legal grey area if you commit code that automates using GPL developer tooling and/or pulls it into your development toolchain, e.g. via GitHub action or some other CI automation.

Disclaimer: Not a lawyer, this isn't legal advice.

girvo · 2023-12-18T04:27:24 1702873644

Totally fair :) I'm going to have a play with it at work!

xjia · 2023-12-18T05:18:54 1702876734

Awesome! If there is any problem just post it in GitHub issues. We will also be releasing 2023.4 very soon.

xjia · 2023-12-17T20:23:35 1702844615

Clang has its own limitations. And it takes more effort than just writing the checkers. We open sourced our previously proprietary static analyzer (mostly based on Clang but also integrated other useful tools) but the commercial/enterprise edition still has its own value in stability, quality assurance, and technical support. It's more like building a Linux distro (e.g. RHEL) from various FOSS components.

xjia · 2023-12-17T18:00:47 1702836047

If you are interested in this, take a look at NaiveSystems Analyze [0] which is a free and open source static analyzer for checking MISRA compliance etc.

Disclaimer: I'm the founder.

It has been battle tested with real customers in automotive, medical devices, and semiconductors. AFAIK this is the first FOSS tool that achieves commercial use standards (extensive ruleset coverage, low false positives based symbolic execution (which Coverity relies heavily on) and SMT solver, ...)

[0]: https://github.com/naivesystems/analyze

ska · 2023-12-17T18:09:17 1702836557

It doesn’t support this version yet, right?

xjia · 2023-12-17T18:14:40 1702836880

2023 is supported in the enterprise edition but not in the community edition yet. We gradually move features from EE to CE as new features added to EE. So you can expect 2023 support in CE in the future :-)

hsaliak · 2023-12-18T01:53:50 1702864430

When will that be? What’s a good way to contact for enterprise support that covers misra 2023?

xjia · 2023-12-18T02:35:23 1702866923

I'm not sure about the timeline because that depends on a lot of things. In the meantime I guess you could start with AUTOSAR C++14. MISRA C++:2023 is essentially built on top of that.

For enterprise edition simply email to hello[AT]naivesystems.com as noted in the README on GitHub

ska · 2023-12-17T19:11:32 1702840292

Good stuff!

xjia · 2023-12-09T16:11:20 1702138280

I had a similar experience with ARC (actions-runner-controller).

One of the machines in the fleet failed to sync its clock via NTP. Once a job X got scheduled to it, the runner pod failed authentication due to incorrect clock time, and then the whole ARC system started to behave incorrectly: job X was stuck without runners, until another workflow job Y was created, and then X got run but Y became stuck. There were also other wierd behaviors like this so I eventually rebuilt everything based on VMs and stopped using ARC.

Using VMs also allowed me to support the use of the official runner images [0], which is good for compatibility.

I feel more people would benefit from managed "self-hosted" runners, so I started DimeRun [1] to provide cheaper GHA runners for people who don't have the time/willingness to troubleshoot low-level infra issues.

[0]: https://github.com/actions/runner-images [1]: https://dime.run

eez0 · 2023-12-09T16:23:56 1702139036

Exactly what you're are describing is what I explained to my colleagues as "stealing runners" :)

If something fails and you don't have idle runners (hence wasting unnecessary resources), things start to snowball.

ta988 · 2023-12-10T04:58:41 1702184321

The module posted here as a way to avoid that where only runners requested by a job can be used by it (or idle runners if you have those)

neoromantique · 2023-12-09T16:31:50 1702139510

It's only really usable for anything that doesn't involve secrets, I'd be very concerned using anything third party in CI, let alone the runner itself. Supply chain attack senses tingling :).

xjia · 2023-12-09T16:47:03 1702140423

Yes I totally understand the concern. We are actively working on SOC 2 and other compliance stuff to help with this. But honestly I feel the compliance requirements are weaker than what we actually implemented. For example proper secure boot and whole disk encryption (without sacrificing performance) are mandatory in our mindset but these specific things don't get reflected in compliance.

Instead of being a service, I'm also open to sell the software+hardware solution behind it, so you can have it on-prem. Do you think that's something you would consider given the constraints on supply chain security?

neoromantique · 2023-12-09T18:03:09 1702144989

We're too small for on-prem services, so not your target market, just shared my 2c as someone who had been burned by self-hosting github runners too many a time.