cfallin's comments | Hacker News

> It'll be interesting to see what the second non-browser-based WASM runtime to fully support 3.0 will be (I'm guessing wasmtime will be first; ...)

Wasmtime already supports every major feature in the Wasm 3.0 release, I believe. Of the big ones: garbage collection was implemented by my colleague Nick Fitzgerald a few years ago; tail calls by Jamey Sharp and Trevor Elliott last year (with full generality, any signature to any signature, no trampolines required!); and I built our exceptions support, which merged last month and goes out in Wasmtime 37 in 3 days.

The "3.0" release of the Wasm spec is meant to show progress and provide a shorthand for a level of features, I think, but the individual proposals have been in progress for a long time so all the engine maintainers have known about them, given their feedback, and built their implementations for the most part already.

(Obligatory: I'm a core maintainer of Wasmtime and its compiler Cranelift)


> garbage collection was implemented by my colleague Nick Fitzgerald a few years ago

The wasm features page says it is still behind a flag on wasmtime (--wasm=gc). Is that page out of date?


No, it's still behind a flag (and so transitively, exceptions are too, because we built exception objects on top of GC).

Our docs (https://docs.wasmtime.dev/stability-tiers.html) put GC at tier 2 with reason "production quality", and I believe the remaining concern there is that we eventually want a semi-space copying implementation rather than the current DRC (deferred reference counting) one. Nick could say more. But we're spec-compliant as-is, and the question was whether we've implemented these features -- which we have :-)
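
For embedders, turning it on is a one-line Config change. A minimal sketch, going from memory of the wasmtime crate's API (double-check the method names, and the crate's gc cargo feature, against the version you're using):

    use wasmtime::{Config, Engine, Instance, Module, Store};

    fn main() -> anyhow::Result<()> {
        let mut config = Config::new();
        config.wasm_gc(true);        // Wasm GC proposal (off by default)
        config.wasm_tail_call(true); // tail calls
        let engine = Engine::new(&config)?;
        let module = Module::new(&engine, r#"(module (func (export "f")))"#)?;
        let mut store = Store::new(&engine, ());
        let instance = Instance::new(&mut store, &module, &[])?;
        let f = instance.get_typed_func::<(), ()>(&mut store, "f")?;
        f.call(&mut store, ())?;
        Ok(())
    }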


Great, thanks for the info!


They're hitting another design point on the compile-time vs. code-quality tradeoff curve, which is interesting. They compile 4.27x faster than Cranelift with its default (higher-quality) regalloc, but Cranelift produces code that runs 1.64x faster (section 6.2.2).

This isn't too surprising to me, as the person who wrote Cranelift's current regalloc (hi!) -- regalloc is super important to run-time perf, so for Wasmtime's use-case at least, we've judged that it's worth the compile time.

TPDE is pretty cool and it's great to see more exploration in compiler architectures!


Indeed; the tl;dr is "yes it works on SpiderMonkey, and was designed to do so". Blog post in a few months, I promise!


> That is, if x and y are determined only at runtime for power(x, y) then I don't see what can be optimized.

Yes, the example in Max's post is specifically assuming one wants to generate a specialized version of `power` where `y` is fixed.

To take it back to weval: we can know what the bytecode input to the interpreter is; we provide an intrinsic (part of the "wevaling" request) to indicate that some function argument is a pointer to memory with constant, guaranteed-not-to-change content. That, together with context specialization on PC (another intrinsic), allows us to unroll the interpreter loop and branch-fold it so we get the equivalent of a template method compiler that reconstitutes the CFG embedded in the bytecode.
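
A hedged sketch of the shape this takes in an interpreter; the intrinsics appear as comments with illustrative names, not weval's actual API:

    // Tiny stack-machine bytecode: each op is (opcode, operand).
    const PUSH: u8 = 0;
    const MUL: u8 = 1;
    const RET: u8 = 2;

    fn interp(bytecode: &[(u8, i64)]) -> i64 {
        // weval_assume_const_memory(bytecode): promise the specializer
        // that `bytecode` points at constant, never-changing memory.
        let mut stack: Vec<i64> = Vec::new();
        let mut pc = 0usize;
        loop {
            // weval_specialize_on(pc): context-specialize on the PC, so
            // the loop unrolls and each `match` branch-folds to one arm.
            let (op, arg) = bytecode[pc];
            match op {
                PUSH => stack.push(arg),
                MUL => {
                    let b = stack.pop().unwrap();
                    let a = stack.pop().unwrap();
                    stack.push(a * b);
                }
                RET => return stack.pop().unwrap(),
                _ => unreachable!(),
            }
            pc += 1;
        }
    }

With both facts in hand -- constant bytecode, known PC per context -- the specializer can unroll the loop and fold each dispatch down to the single taken arm, leaving straight-line code.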


Thanks, I think I see now. So `y` is the bytecode, in the analogy. Makes sense.

(For me at least a concrete example would have helped, something like showing the specialized output of running on the bytecode for `power` with that interpreter. But maybe that would be too verbose...)
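
Something like this, say (illustrative only, not the post's actual interpreter or its residual), with `y` static and `x` dynamic:

    // Generic version: y known only at runtime.
    fn power(x: i64, y: u32) -> i64 {
        let mut acc = 1;
        for _ in 0..y {
            acc *= x;
        }
        acc
    }

    // Specialized on y = 3: the loop unrolls into straight-line code
    // over x alone.
    fn power_3(x: i64) -> i64 {
        let mut acc = 1;
        acc *= x;
        acc *= x;
        acc *= x;
        acc
    }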


This is indeed a good point and something I want to write about when I eventually do a blog post on weval.

A few counterpoints that I'd offer (and what led me to still take this approach):

- If the target has sub-par debugging infrastructure, it can be easier to debug an interpreter (which is portable) and then apply the semantics-preserving PE. In particular, when targeting Wasm outside the browser, there is... not really a good debug experience, anywhere, for that. It was way easier to get an interpreter right by developing on native with gdb/rr/whatever, and then separately ensure weval preserves semantics (which I tested with lockstep differential execution).

- Maintenance: if one is going to have an interpreter and a compiler anyway (and one often wants or needs this, e.g. to handle eval()), it's easier for them both to come from the same source.

- Amortized cost: in the Wasm world we want AOT compilers for many languages eventually; there are interpreter ports with no Wasm backends; developing weval was a one-time cost and we can eventually apply it multiple times.

- If the semantics of the existing interpreter are quite nontrivial, that can push the balance the other way. I designed weval as part of my work on SpiderMonkey: an extremely nontrivial interpreter with all sorts of edge cases; replicating that in a from-scratch compiler seemed a far harder path. (It's since been done by someone else, and you can find the "wasm32 codegen" patchset in Bugzilla, but there are other phasing issues with it from our use-case's PoV; it's not true AOT, it requires codegen at runtime.)

I don't think the tradeoff is always clear and if one is building a language from scratch, and targeting a simple ISA, by all means write a direct compiler! But other interesting use-cases do exist.


The divergent-semantics risk between the interpreter and the compiler is a really big deal. It's genuinely difficult to get a language implementation to behave exactly as specified, even when the spec is "do the same as some other implementation". Treating "compiled code" as specialising the interpreter with respect to the program is a great solution to that, since bugs in the optimiser/partial-evaluator (they're kind of the same thing) are unlikely to be of the same class as bugs from independent implementations.

Wasm is a really solid target for heroic compiler optimisations. It's relatively precisely specified, user-facing semantic diagnostics live in some language frontend out of sight, aliasing is limited, and syscalls are finite with known semantics. Pretty much because it was designed by compiler people. You've picked a good target for this technique.


One problem with Wasm is that in practice there's not as much optimization work to do as you might expect, since the higher-level compiler that produced it already did a lot of the work. Of course you still need to lower the stack machine + locals to actual registers/spills/fills, but still, a chunk of the work is already done for you.
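
Concretely, that lowering amounts to symbolically executing the operand stack and emitting three-address code over virtual registers, which the register allocator then maps to machine registers or spill slots. A toy sketch (nothing like a real engine's actual code):

    #[derive(Debug)]
    enum WasmOp {
        LocalGet(usize),
        I32Add,
    }

    #[derive(Debug)]
    enum Inst {
        Move { dst: u32, local: usize },
        Add { dst: u32, a: u32, b: u32 },
    }

    fn lower(ops: &[WasmOp]) -> Vec<Inst> {
        let mut stack: Vec<u32> = Vec::new(); // abstract operand stack of vregs
        let mut next_vreg = 0u32;
        let mut out = Vec::new();
        for op in ops {
            match op {
                WasmOp::LocalGet(i) => {
                    let dst = next_vreg;
                    next_vreg += 1;
                    out.push(Inst::Move { dst, local: *i });
                    stack.push(dst);
                }
                WasmOp::I32Add => {
                    let b = stack.pop().unwrap();
                    let a = stack.pop().unwrap();
                    let dst = next_vreg;
                    next_vreg += 1;
                    out.push(Inst::Add { dst, a, b });
                    stack.push(dst);
                }
            }
        }
        out
    }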


weval author here (thanks Max for the blog post!). Also AMA!

The talk about weval that Max mentions was at NEU and also CMU; the latter was recorded and is here: https://vimeo.com/940568191

I also plan to write up a blog post on weval in depth, plus its application to SpiderMonkey-on-Wasm, in a few months; it's pretty exciting, though: we're currently getting 4-5x speedups on some benchmarks on a decidedly nontrivial interpreter!


This is amazing! I really do look forward to that writeup.

Would investigating the mapping from interpreters down to the assembly that Wasmtime generates be something worth looking into? I got distracted by the Vimeo presentation -- does the weval blog post cover this? How high can the interpreter/compiler tower go?

Is Weval looking to become the PyPy of Wasm?


The approach that Cranelift uses is what we call the "aegraph" (talk I gave about it: slides https://cfallin.org/pubs/egraphs2023_aegraphs_slides.pdf, video https://vimeo.com/843540328). The basic idea is that eclasses hold sets of operator expressions (think sea-of-nodes IR), and we keep that alongside the CFG with a "skeleton" of side-effecting ops. We build that, do rewrites, then extract out of it back to a conventional CFG of basic blocks for lowering. The "equivalence" part comes in when doing rewrites: the main difference between an egraph and a conventional sea-of-nodes IR with rewrite rules is that one keeps all representations in the equivalence class around, then chooses the best one later.
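
A toy illustration of that "keep every form, choose later" idea (nothing like Cranelift's real data structures, which use union-find and hash-consing):

    #[derive(Debug)]
    enum Node {
        Param(u32),        // a function input
        Const(i64),
        Mul(usize, usize), // usize = eclass id of each operand
        Shl(usize, usize),
    }

    struct EGraph {
        classes: Vec<Vec<Node>>, // each eclass holds all equivalent forms
    }

    impl EGraph {
        fn add(&mut self, n: Node) -> usize {
            self.classes.push(vec![n]);
            self.classes.len() - 1
        }
        // A rewrite doesn't replace a node; it records another member
        // of the same eclass, so both forms stay available.
        fn union_in(&mut self, class: usize, n: Node) {
            self.classes[class].push(n);
        }
        // Extraction: pick the cheapest member per class (toy costs).
        fn best(&self, class: usize) -> &Node {
            self.classes[class]
                .iter()
                .min_by_key(|n| match n {
                    Node::Param(_) | Node::Const(_) => 0,
                    Node::Shl(..) => 1,
                    Node::Mul(..) => 3,
                })
                .unwrap()
        }
    }

    fn main() {
        let mut eg = EGraph { classes: vec![] };
        let x = eg.add(Node::Param(0));
        let two = eg.add(Node::Const(2));
        let one = eg.add(Node::Const(1));
        let m = eg.add(Node::Mul(x, two)); // x * 2
        eg.union_in(m, Node::Shl(x, one)); // rewrite: x * 2 == x << 1
        println!("{:?}", eg.best(m));      // extraction picks Shl
    }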

We had to solve a few novel problems in working out how to handle control flow, and we're still polishing off some rough edges (search recent issues in the repo for egraphs); but we're mostly happy with how it turned out!

(Disclosure: tech lead of CL for a while; the e-graphs optimizer is "my fault")


We actually take a fairly unconventional approach to e-graphs: we have a few linear passes and we do all rewrites eagerly, so the e-graph gives us a general framework for the fixpoint problem, into which we plug all our rewrites, but we don't have the usual "apply as much CPU time as you want to get better results" property of conventional equality saturation.
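
A toy sketch of what "eager" means here (illustrative, not Cranelift's actual code): rules fire once, at node-construction time, rather than saturating in a loop over the whole function.

    #[derive(Debug)]
    enum Expr {
        Const(i64),
        Add(Box<Expr>, Box<Expr>),
    }

    // Smart constructor: simplify as the node is built, in one
    // bottom-up pass.
    fn mk_add(a: Expr, b: Expr) -> Expr {
        match (a, b) {
            (Expr::Const(x), Expr::Const(y)) => Expr::Const(x + y), // fold
            (a, Expr::Const(0)) => a,                               // x + 0 => x
            (a, b) => Expr::Add(Box::new(a), Box::new(b)),
        }
    }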

I gave a talk about this approach, aegraphs (acyclic e-graphs), here: slides (https://cfallin.org/pubs/egraphs2023_aegraphs_slides.pdf), video (https://vimeo.com/843540328)

(disclosure: Cranelift tech lead 2020-2022 and main author of the e-graphs mid-end, as well as regalloc, isel and its custom DSL, and other bits, along with the excellent team)


I found the video to be great! Informative and understandable even though I'm just casually interested in compilers, with not a lot of real-world experience writing them.


Right, it's about algorithmic tradeoffs throughout. A good example I wrote about is here: https://cfallin.org/blog/2021/01/22/cranelift-isel-2/ where we use a single-pass algorithm to solve a problem that LLVM solves with a multi-part fixpoint loop.

Most CPU time during compilation is in the register allocator, and I took a really careful approach to optimization when I rewrote it a few years ago (more details: https://cfallin.org/blog/2022/06/09/cranelift-regalloc2/). We generally try to pay close attention to algorithmic efficiency and avoid altogether the fixpoint loops, etc. that one often finds elsewhere. (RA2 does backtrack and has a worklist loop, though it's pretty minimal, and we're also planning to add a single-pass mode.)

(disclosure: I was Cranelift tech lead in 2020-2022)


There are some benchmarks of Cranelift-based Wasm VMs (Wasmtime) vs. LLVM-based Wasm VMs here: https://00f.net/2023/01/04/webassembly-benchmark-2023/

The (perhaps slightly exaggerated but encouraging to me at least!) money quote there is:

> That’s right. The cranelift code generator has become as fast as LLVM. This is extremely impressive considering the fact that cranelift is a relatively young project, written from scratch by a very small (but obviously very talented) team.

In practice, anywhere from 10-30% slower is probably reasonable to expect. Compiler microbenchmarks are interesting because they're very "quantized": for any particular benchmark, often either you get the right transforms and achieve the correct optimized inner loop, or you don't. So the game is about getting more and more cases right, and we're slowly getting there.

(disclosure: I was tech lead of Cranelift in 2020-2022)

