A high-throughput parser for the Zig programming language (github.com/validark)
145 points by jedisct1 6 days ago | 16 comments


This is very cool. An extremely fast lexical tokenizer is the basis of a fast compiler, and Zig has good integration and support for SIMD operations, which is perfect for this kind of thing. It's definitely doable. I did a proof of concept a while back on using SIMD to operate on 32-byte chunks to parse identifiers.

https://github.com/williamw520/misc_zig/blob/main/identifier...
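
The gist, as a fresh minimal sketch (the names and the test are mine, not the linked code): compare all 32 bytes against each character-class boundary at once, bitcast each lane-mask to a u32, and the identifier's length falls out of a count-trailing-zeros.

    const std = @import("std");

    const Chunk = @Vector(32, u8);

    // Movemask idiom: bitcast a lane-mask to a u32 (lane i -> bit i).
    fn mask(m: @Vector(32, bool)) u32 {
        return @bitCast(m);
    }

    fn inRange(chunk: Chunk, lo: u8, hi: u8) u32 {
        return mask(chunk >= @as(Chunk, @splat(lo))) &
            mask(chunk <= @as(Chunk, @splat(hi)));
    }

    /// Length of the run of identifier chars ([A-Za-z0-9_]) at the start
    /// of the chunk; 32 means the identifier continues into the next chunk.
    fn identifierLen(chunk: Chunk) u6 {
        const ident = inRange(chunk, 'a', 'z') | inRange(chunk, 'A', 'Z') |
            inRange(chunk, '0', '9') | mask(chunk == @as(Chunk, @splat('_')));
        return @ctz(~ident);
    }

    test identifierLen {
        const src: [32]u8 = "hello_world42(the rest is junk!)".*;
        try std.testing.expectEqual(13, identifierLen(src));
    }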


When I run a profiler on a compiler I wrote (which parses at somewhere between 500K and 1M lines per second without a separate lexer), parsing barely shows up. I'd be very surprised if the Zig compiler is spending more than 5% of its time tokenizing.

I assume there is some other use case that is motivating this work.


I imagine it would be quite useful for building a responsive language server, where parsing is a more significant portion of the work.

No, the problem for a language server is incremental performance, not batch performance, although there are a lot of bad implementations out there that just reparse the entire buffer on each edit (and forgo the error-recovery benefits an incremental parser would give them).

> No, the problem for a language server is incremental performance, not batch performance

"When something is fast enough, people start to use it differently" - Linus Torvalds.

Make your parser able to reparse the current file at 30 FPS and you no longer need incremental parsing or error recovery. That is probably part of the idea here.


Here it can go both ways: SIMD parsing can handle arbitrary changes in reasonable time for files below maybe 100 MB (i.e. everything non-insane), whereas incremental parsing can handle small changes in truly-arbitrary-size files in microseconds. It's a trade-off between better average-case and better worst-case time. (Of course the ideal would be both, but that's even more non-trivial.)

Absolutely.

Quite a long time ago I was working on a business application's reporting facility.

It used to take about an hour, and my work reduced that to a ballpark of one or two seconds.

This was HUGE, and it changed the way users created these reports forever.


It’s not good enough. Incremental parsers can save trees across edits, and you can hang type information off of those trees, so you aren't just saving parsing time, you are saving type-checking time as well. Even if you have a super-fast batch parser, you are screwing yourself in other areas that are actually much more expensive.

Agreed. But all things considered:

The runtime cost of type checking is highly dependent on the type system / meta-programming complexity of your language.

For simple languages (Golang?) with a pretty well-designed module system, it should be doable to reach ~500 KLOC/sec (probably even 1 MLOC/sec in some cases), which is more than enough for interactive usage.

And complex languages with meta-programming capabilities are indeed slow to type check, but they are also a giant pain in the butt to cache without side effects for incremental parsing. It is 2025, and clangd / IntelliSense still fail to do that reliably for C++ codebases that rely heavily on templates.

So it does not seem such a crazy approach to me: it trades a complexity problem for a performance one.


The talks that Niles gave at the Utah Zig meetups (linked in the repo) were great; I just wish the AV setup had been a little smoother. It seemed like some really neat visualizations that Niles prepared flopped. Either way, I recommend them. They inspired me to read a lot more machine code these days.

I am glad to hear you have been inspired to read more machine code! Was the audio good for the talk on the ideas behind this specific tokenizer [1]? I'm going to tryhard on the next talk in July on this subject and hopefully get as many people as possible to thoroughly understand how it all works. Link:

[1] https://www.youtube.com/watch?v=NM1FNB5nagk


Very interesting project!

I wonder if there's a way to make this set of techniques less brittle and more applicable to other languages. I guess you'd be looking at a new backend or some enhancements to one of the parser-generator tools.


I have applied a subset of these techniques in a C++ tokenizer that parses a language syntactically similar to Swift: no inline assembly, no intrinsics, no SWAR, but reduced branching, cache optimization, and SIMD parsing with explicit vectorization.

I get:

- ~4 MLOC/sec/core on a laptop

- ~8-9 MLOC/sec/core on a modern AMD server-grade CPU with AVX-512.

So yes, it is definitely possible.
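
For flavor, the branch-reduction part can be as simple as a comptime-built class table. A minimal sketch of the idea (in Zig, to match the thread's subject; not my actual C++ code):

    // 256-entry character-class table: the hot loop does one L1-resident
    // load per byte instead of a chain of data-dependent branches.
    const CharClass = enum(u8) { other, ident, digit, space, quote };

    const class_table: [256]CharClass = blk: {
        var t = [_]CharClass{.other} ** 256;
        for ('a'..'z' + 1) |c| t[c] = .ident;
        for ('A'..'Z' + 1) |c| t[c] = .ident;
        for ('0'..'9' + 1) |c| t[c] = .digit;
        t['_'] = .ident;
        for ([_]u8{ ' ', '\t', '\r', '\n' }) |c| t[c] = .space;
        t['"'] = .quote;
        break :blk t;
    };

    fn classify(byte: u8) CharClass {
        return class_table[byte]; // one load, zero branches
    }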


It would be very cool if, once this is finished, the techniques were applied to user-schedulable languages: https://www.hytradboi.com/2025/7d2e91c8-aced-415d-b993-f6f85....

I guess they are too tailored to the actual memory layout, and the corresponding memory-access latencies, of the architecture, but I would like to be shown that I am wrong and that it is feasible.


Classic Matu3ba, at it again, coaxing me down new rabbit holes. Hahaha. I hope you know I became a SWAR wizard for you.

That video is very interesting to me. I don't yet know enough about all those languages to speak intelligently on how they could be used to simplify my code, but I do agree it would be amazing if we could generalize my techniques so that every project could easily achieve what I have through a lot of hard work.

I'm not sure the specific thing demoed in that video is helpful for my use case. That workload, like the other examples on the site, seems to get a ton of mileage because its data is two-dimensional: there are a lot of different choices that could be made about the order in which things are done, so being able to switch those orders quickly and test how well vectorization works for each choice is a major benefit for those examples.

The tokenizer, on the other hand, is one-dimensional. The best way to traverse a source file is obvious: just go forward.

Automatic scaling to bigger or smaller vectors could be easier, though. How I want that to be achieved at the moment is for LLVM to have:

1. u512 operations happening in AVX-512 vectors

2. ARM backend support for automatically preferring interleaved vectors over regularly-ordered vectors. Maybe even better would be a type at the language level.

3. A SWAR emitter for LLVM vectors, for machines without vector registers (a sketch of the kind of code such an emitter would produce follows this list)

4. Movemask support in the PowerPC, WASM, and MIPS backends. I don't want to write intrinsics for every little bitcast.
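
To show the flavor of what such a SWAR emitter would produce, here is the classic byte-match trick (a minimal standalone sketch, not code from this parser): a vector `==` plus movemask, done entirely in a scalar register.

    const std = @import("std");

    /// Return a mask with the high bit set in every byte of `word` that
    /// equals `needle` -- the classic "has-zero-byte" trick, no vector
    /// registers required.
    fn swarEqMask(word: u64, needle: u8) u64 {
        const ones: u64 = 0x0101010101010101;
        const x = word ^ (ones *% needle); // bytes equal to `needle` become 0x00
        // (x -% ones) & ~x & 0x80... sets the high bit of exactly the zero bytes.
        return (x -% ones) & ~x & (ones * 0x80);
    }

    test swarEqMask {
        const word = std.mem.readInt(u64, "ab\ncd\nef", .little);
        // '\n' sits at byte offsets 2 and 5.
        try std.testing.expectEqual(@as(u64, 0x0000_8000_0080_0000), swarEqMask(word, '\n'));
    }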

I'm also just skeptical in general that offloading to the compiler is a good idea in the current state of the technology. There are just too many routines it doesn't know about, it misses canonicalizations too often, it doesn't seem to be architected correctly for targets with several register files, it doesn't seem to be architected correctly for supporting bit manipulation efficiently, and the fact that SWAR support hasn't arrived at this point in history doesn't inspire confidence either.

In cases like the one the video shows, where you're just coercing the compiler into doing basic auto-vectorization, I'm sure that works perfectly.

But in this work, my hunch is that the optimal solution could not be trivially written in a nice functional style. This code does not use the "easy solution" of being able to reach forwards and backwards for data. That is, the easy solution, expressed as pattern matching over `{ file[p-1], file[p], file[p+1], ... }`, imposes (I believe) the constraint that we have to have read in the next vector and produced its bitstrings before we can process the current vector. For simple languages this is fine, but for something like Zig we have to do a lot of work, and you just don't have enough registers to go around.

That's why my solution uses pseudo-starts (putting a fake start position at the beginning of a chunk due to carried-over values) and a more manual carry-over system. The logic here is NOT position-independent, solely for performance reasons.

Do I think the necessary transformation between position-agnostic logic and the logic I'm talking about could be done automatically? Maybe. But my suspicion is that a parser generator makes more sense as the sort of tool to solve this.
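
A heavily simplified toy of that pseudo-start/carry shape (not the actual implementation): per-chunk logic computes starts as "ident bit whose left neighbor is not an ident bit"; the shift naturally plants a start at bit 0 whenever a token straddles the boundary, and a one-bit carry from the previous chunk records whether that start is real or a pseudo-start.

    // One 64-byte chunk reduced to a single classification bitstring.
    const Carry = struct { in_ident: bool = false };

    fn identStarts(ident_mask: u64, carry: *Carry) struct { starts: u64, bit0_is_pseudo: bool } {
        // Start = ident bit with a non-ident left neighbor; the shift
        // brings in a zero, so a straddling token "starts" at bit 0.
        const starts = ident_mask & ~(ident_mask << 1);
        // If the previous chunk ended mid-identifier, that bit-0 start
        // is a continuation, not a new token.
        const pseudo = carry.in_ident and (ident_mask & 1) != 0;
        carry.in_ident = (ident_mask >> 63) != 0;
        return .{ .starts = starts, .bit0_is_pseudo = pseudo };
    }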


This really moves Zig


