Performance of coroutine-style lexers in Go (thegreenplace.net)
79 points by picture on June 7, 2022 | 26 comments


Honestly, while that talk is cool, I don't think you should actually write lexers or parsers that way. The best way is probably the good ol' hand-written approach, using something like recursive descent. It's ugly, tedious and repetitive, but if you don't want to accept any compromises, it's difficult to beat. All of the JS parsers I've looked at do this, from V8 to SpiderMonkey, to modern software like esbuild (also written in Go and blindingly fast).

The coroutine style of lexer is probably bound to have a less severe slowdown on larger inputs, but it depends. The contention and locking introduced by putting a channel in such a hot path is probably not going to be worth it, especially if you are likely to run many parsing tasks in parallel and can simply move the parallelism boundary further down the line.

edit: Somehow, while writing this, I managed to forget that I also wrote a JS parser, though I never finished it. It uses a typical manually written lexer + parser too, although it's probably pretty slow from being written naively. It can't deal with a couple of constructs due to ambiguity (most notably, I didn't get past the part where you parse fat-arrow lambdas; it is ridiculously hard to tell them apart from other grammar elements).

https://github.com/jchv/cleansheets / https://cleansheets.io/parser/


You don't have to use a channel for coordination. Here is a lexer implementation (in C!) that very closely follows Rob Pike's talk, uses a ring buffer for coordination instead, and is plenty fast.

https://github.com/lpereira/lwan/blob/master/src/lib/lwan-te...

If you watch the talk carefully, Rob Pike himself mentions this near the end of the talk.
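
For illustration, a minimal Go sketch of that kind of single-threaded ring-buffer handoff. The type and function names here are made up; the lwan implementation linked above is in C:

  package main

  import "fmt"

  // Token is an illustrative token type, not taken from the lwan source.
  type Token struct {
      Kind, Text string
  }

  // ring is a tiny fixed-capacity FIFO. The lexer and parser run on the same
  // goroutine (as in lwan and the final version of Pike's lexer), so no
  // locking or atomics are needed.
  type ring struct {
      buf        [16]Token
      head, tail int
      n          int
  }

  func (r *ring) push(t Token) bool {
      if r.n == len(r.buf) {
          return false // full; the consumer should drain first
      }
      r.buf[r.tail] = t
      r.tail = (r.tail + 1) % len(r.buf)
      r.n++
      return true
  }

  func (r *ring) pop() (Token, bool) {
      if r.n == 0 {
          return Token{}, false
      }
      t := r.buf[r.head]
      r.head = (r.head + 1) % len(r.buf)
      r.n--
      return t, true
  }

  func main() {
      var r ring
      r.push(Token{"ident", "foo"})
      r.push(Token{"lparen", "("})
      for t, ok := r.pop(); ok; t, ok = r.pop() {
          fmt.Println(t.Kind, t.Text)
      }
  }

Because everything stays on one goroutine, the handoff is just an index update and a copy, with no locking at all.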


Even if you do this, the CPU still ultimately has to synchronize if you spread the workflow across threads, so you take a contention hit somewhere. (Lock-free data structures still hit locks in the silicon, after all.) I could be wrong, but when it comes to performance it seems like your best option is moving the concurrency elsewhere. For design? I dunno. I find the design to be elegant for sure, but I also don’t mind handwriting parsers in general.


The lexer is entirely single-threaded, as in the original Rob Pike talk.

>I find the design to be elegant for sure, but I also don’t mind handwriting parsers in general.

But this is a hand-written parser! Just using function pointers to store the state instead of using an enum and switch statements.
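
For anyone who hasn't seen the talk, the pattern looks roughly like this (a condensed sketch of the idea, not the actual text/template code):

  package main

  import "fmt"

  // stateFn is the core of the design from Pike's talk: the lexer state is a
  // function that returns the next state, instead of an enum consulted by a
  // switch statement.
  type stateFn func(*lexer) stateFn

  type lexer struct {
      input string
      pos   int
      items []string // emitted tokens (a channel in the goroutine version)
  }

  func (l *lexer) emit(s string) { l.items = append(l.items, s) }

  // lexText scans non-digits, then hands off to lexNumber.
  func lexText(l *lexer) stateFn {
      start := l.pos
      for l.pos < len(l.input) && !isDigit(l.input[l.pos]) {
          l.pos++
      }
      if l.pos > start {
          l.emit("TEXT:" + l.input[start:l.pos])
      }
      if l.pos >= len(l.input) {
          return nil // done
      }
      return lexNumber
  }

  // lexNumber scans a run of digits, then hands control back to lexText.
  func lexNumber(l *lexer) stateFn {
      start := l.pos
      for l.pos < len(l.input) && isDigit(l.input[l.pos]) {
          l.pos++
      }
      l.emit("NUM:" + l.input[start:l.pos])
      return lexText
  }

  func isDigit(b byte) bool { return b >= '0' && b <= '9' }

  func main() {
      l := &lexer{input: "abc123def45"}
      for state := lexText; state != nil; {
          state = state(l)
      }
      fmt.Println(l.items) // [TEXT:abc NUM:123 TEXT:def NUM:45]
  }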


Yep, fair enough. As a single-threaded design, I have no particular problems with it.


i prefer lexers as nothing more than case statements. i recall no less than brian kernighan discussing this back in the day. keep the lexers simple. that is why we have backus-naur.


While I don't have a ton of experience when it comes to writing lexers and parsers, I have in fact written a couple small ones based on that specific Rob Pike talk. I honestly found the talk itself enlightening and have recommended it to many people.

I regularly use one of them to parse multi-gigabyte files for work and have been mildly disappointed with the results.

I'd be curious about the performance implications of straight-up serializing the process, with maybe an RWMutex wrapping the token handoff, to see how much of the cost is the channel itself. (A rough sketch of that experiment follows the link below.)

One of the lexer/parsers based on the talk:

- https://github.com/donatj/sqlread
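
One way to run the experiment described above is to swap the channel for a mutex-guarded queue and benchmark both. A rough sketch, using a plain Mutex plus a Cond since both ends mutate the queue (nothing here is taken from sqlread):

  package main

  import (
      "fmt"
      "sync"
  )

  // tokenQueue is a mutex-guarded queue standing in for the channel in the
  // lexer/parser handoff. Both push and pop mutate the queue, so it uses a
  // plain Mutex plus a Cond rather than an RWMutex.
  type tokenQueue struct {
      mu     sync.Mutex
      cond   *sync.Cond
      items  []string
      closed bool
  }

  func newTokenQueue() *tokenQueue {
      q := &tokenQueue{}
      q.cond = sync.NewCond(&q.mu)
      return q
  }

  func (q *tokenQueue) push(tok string) {
      q.mu.Lock()
      q.items = append(q.items, tok)
      q.mu.Unlock()
      q.cond.Signal()
  }

  func (q *tokenQueue) close() {
      q.mu.Lock()
      q.closed = true
      q.mu.Unlock()
      q.cond.Broadcast()
  }

  // pop blocks until a token is available or the queue is closed and drained.
  func (q *tokenQueue) pop() (string, bool) {
      q.mu.Lock()
      defer q.mu.Unlock()
      for len(q.items) == 0 && !q.closed {
          q.cond.Wait()
      }
      if len(q.items) == 0 {
          return "", false
      }
      tok := q.items[0]
      q.items = q.items[1:]
      return tok, true
  }

  func main() {
      q := newTokenQueue()
      go func() {
          for _, t := range []string{"SELECT", "*", "FROM", "t"} {
              q.push(t)
          }
          q.close()
      }()
      for tok, ok := q.pop(); ok; tok, ok = q.pop() {
          fmt.Println(tok)
      }
  }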


This may be a daft suggestion, but have you profiled yours?


The advantage of coroutines isn't inside the lexer itself, but in synchronizing the lexer (or let's say parser, which can handle nested structure) with whatever consumes its output. Look at SAX-style XML parsers like expat or libxml2: they recursively parse the XML and deliver events (tags etc.) to the client through user-supplied callbacks, an annoying control inversion. An alternative could be to reify the parse stack inside some data structure, but recursion is way more natural. Coroutines solve the problem: you call the parser, and it returns tags as it encounters them, from arbitrarily deep inside its own parse tree.
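
A toy Go illustration of the push-vs-pull difference; the names are invented and have nothing to do with the expat/libxml2 APIs:

  package main

  import "fmt"

  // Push style (SAX-like): the parser drives and calls back into the client,
  // so any client state has to live outside the call stack.
  func parsePush(tags []string, onTag func(string)) {
      for _, t := range tags {
          onTag(t)
      }
  }

  // Pull style (coroutine-like, here a goroutine + channel): the client
  // drives, and the parser keeps its own state, including its recursion,
  // between the tags it hands out.
  func parsePull(tags []string) <-chan string {
      out := make(chan string)
      go func() {
          defer close(out)
          for _, t := range tags {
              out <- t
          }
      }()
      return out
  }

  func main() {
      doc := []string{"<a>", "<b>", "</b>", "</a>"}

      parsePush(doc, func(tag string) { fmt.Println("callback:", tag) })

      for tag := range parsePull(doc) {
          fmt.Println("pulled:", tag)
      }
  }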

I think goroutines were intended more for i/o concurrency than as a control flow mechanism, so the performance issues aren't surprising. I wonder how the benchmark would look with C++20 coroutines. I was going to say protothreads but I think those are not allowed to be recursive.

As for listening to your professors, well, Knuth vol 1 all the way back in 1968 advocated coroutines as a more powerful generalization of subroutines ;).

The ease of implementing very lightweight and fast multitaskers is one of the traditional attractions of Forth, if that matters.


Lexing and parsing doesn't strike me as a task that would benefit significantly from multithreading.

I always use golang.org/x/tools/cmd/goyacc to generate my lexers and parsers. It's a Go port of Lex/Yacc (Flex/Bison) that works pretty well and is very fast.

It basically works by writing your lexer and parser in a domain-specific language (DSL) and compiling that into Go.


Multithreading and coroutines are different concepts, though in practice they may end up being the same thing.

Coroutines were invented, surprisingly enough, for lexing and parsing [0].

[0] http://melconway.com/Home/pdf/compiler.pdf

Edit: added source. I'm surprised Melvin Conway is still alive.


Interesting information. Thank you for the link, I'll have to take a look.

Though, the "in practice" part is the thing that gets me. I can see how the code might be easier to read, but performance-wise it seems like a lot of overhead, since the coroutines are unlikely to be doing things like waiting for I/O.


Lexing is more-or-less regular expression searching. Most regex engines are a mix of CPU- and memory-bound when searching for complicated REs (as one would have with a lexer; lex just combines patterns into a giant hardcoded DFA).

So, there’s room on the CPU for multithreading to improve performance, but it must be done carefully to minimize both contention and cache evictions.
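
As a toy illustration of the "giant hardcoded DFA" mentioned above (a hand-rolled sketch, nothing generated by lex):

  package main

  import "fmt"

  // A toy table-driven DFA in the spirit of what lex generates: states are
  // rows, input classes are columns. This one just splits a string into runs
  // of digits and runs of everything else; a real generated DFA has the same
  // shape, only enormously bigger.
  const (
      stText = iota
      stDigits
      numStates
  )

  // class maps a byte to an input class: 0 = non-digit, 1 = digit.
  func class(b byte) int {
      if b >= '0' && b <= '9' {
          return 1
      }
      return 0
  }

  // next[state][class] gives the next state.
  var next = [numStates][2]int{
      stText:   {stText, stDigits},
      stDigits: {stText, stDigits},
  }

  func main() {
      input := "sum + 42*3"
      state, start := stText, 0
      for i := 0; i <= len(input); i++ {
          newState := stText
          if i < len(input) {
              newState = next[state][class(input[i])]
          }
          // Emit a token whenever the state changes or the input ends.
          if (newState != state || i == len(input)) && i > start {
              fmt.Printf("%q (state %d)\n", input[start:i], state)
              start = i
          }
          state = newState
      }
  }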


I don't see the advantage either. Since tokenization is usually sequential, it only seems to add overhead. If you can identify chunks in your input at a much lower cost than actually parsing those chunks, it might be of use.


I'm glad I went to the university and learned about how to do things like lexing and parsing from my professors. I feel for people who have to substitute their education with blog posts and Rob Pike talks.

Those people will get things done, but often in a roundabout way, reinventing the wheel multiple times along the way. Meanwhile the fortunate ones, like you, will pick the right tool, apply it, get state-of-the-art results on their first try, and move on.


I also went to university, where they also over-focused on lex and yacc and similar. However, most people who work on parsers where the user or developer experience matters don't use them, because they are quite restrictive and cumbersome and typically don't give nice error messages.

Academia doesn't necessarily teach you the most useful way to do things.


Would you painstakingly handcraft a recursive descent parser because the generator doesn't give you nice error messages?


That's a fairly normal approach, yes [1]. It's not so painstaking. There are parser generators other than lex and yacc that do a better job, but making most generators produce good errors and handle your context-sensitive tokenisation is more effort overall than making an equivalent hand-written parser (be it near-Pratt, PEG, general recursive descent or anything else).

[1] https://notes.eatonphil.com/parser-generators-vs-handwritten...


personally I think the literature and academia contain great treasures that are ignored.

but your argument that the academy is the center of knowledge about the pragmatics of software development falls a bit flat. if you've been around enough you know 'grad student code' when you see it.


The problem is that it's not really benchmarking coroutines; it's benchmarking goroutines. Goroutines support preemptive parallelism, so channels require atomic operations. Atomic operations are performance killers.

IIRC/IIUC, Go in 2011 only supported a single logical processor (P) in its runtime model of (G)oroutines, logical (P)rocessors, and (M)achine threads. It's possible channels didn't require any atomic operations at all. I'd be surprised if a coroutine-style implementation would have been faster, as resuming and yielding coroutines typically involves more operations than entering and returning from a function[1], but the overhead might have been low enough to be negligible relative to the elegance of the lexer code. (Inverting consumer/producer calling patterns is one of the areas at which coroutines excel, especially stackful coroutines, where you can make full use of the stack as an implicit data structure.)

[1] Coroutines in Lua are relatively cheap, but because they have to do exception-handling bookkeeping (i.e. setjmp) they're still more expensive than a regular function call. This is specific to Lua and not a necessary part of implementing coroutines. However, I can't imagine a situation where coroutines would be faster than a regular function call, and it's difficult to imagine them having the same cost unless a language uses a completely discontiguous stack for all call frames. In general, coroutines cost as much as a function call plus the need to save and restore at least some additional stack information.


The funny thing is, the final version of the lexer from Rob Pike's talk did not involve any goroutines at all.


Ah, from slide #39

  A problem
  Can't run a goroutine to completion during initialization.
  Forbidden by the language specification.
  (Raises awful issues about order of init, best avoided.)

  That means we can't lex & parse a template during init.

  The goroutine is a problem....
https://talks.golang.org/2011/lex.slide#39

Removing the use of goroutines seems to have been simple. Not too surprising, as lexers don't usually need very deep call stacks anyhow. (A single loop and switch can suffice in most situations, or a single level of calls when one is inclined to break things out into a function dispatch table.) In Lua I've found coroutines more useful for writing iterators over trees, where there's mutual recursion.
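
Roughly, the goroutine-free variant keeps the state functions but lets the caller drive them: nextItem runs states until one of them emits a token. A condensed sketch, not the actual text/template source:

  package main

  import "fmt"

  // stateFn as before; the difference is that no goroutine drives it.
  type stateFn func(*lexer) stateFn

  type lexer struct {
      input string
      pos   int
      state stateFn
      item  string // last emitted token
  }

  func (l *lexer) emit(s string) { l.item = s }

  // lexWords emits one whitespace-separated word per call.
  func lexWords(l *lexer) stateFn {
      for l.pos < len(l.input) && l.input[l.pos] == ' ' {
          l.pos++
      }
      if l.pos >= len(l.input) {
          return nil
      }
      start := l.pos
      for l.pos < len(l.input) && l.input[l.pos] != ' ' {
          l.pos++
      }
      l.emit(l.input[start:l.pos])
      return lexWords
  }

  // nextItem drives the state machine until a token is produced or input ends.
  // This is the caller-driven replacement for "goroutine writes to a channel".
  func (l *lexer) nextItem() (string, bool) {
      l.item = ""
      for l.state != nil && l.item == "" {
          l.state = l.state(l)
      }
      return l.item, l.item != ""
  }

  func main() {
      l := &lexer{input: "lex parse emit", state: lexWords}
      for tok, ok := l.nextItem(); ok; tok, ok = l.nextItem() {
          fmt.Println(tok)
      }
  }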


> Channels in Go are fully synchronized, which implies locking inside the inner loop.

That smells strongly like an optimization opportunity waiting to happen, and here we have a profile that can be used to test it. chanrecv seems to have no fast-path for buffered channels with data and always locks. Has nobody thought of making this at least partially lock-free?
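
If someone wants to poke at this, here is a crude way to see the per-token chanrecv cost next to a plain method call. This is a micro-benchmark sketch, not the article's benchmark:

  // A micro-benchmark sketch: isolate the per-token cost of a buffered-channel
  // handoff versus a plain method call.
  // Save as chanrecv_bench_test.go and run with: go test -bench=.
  package main

  import "testing"

  type fakeLexer struct{ n int }

  // nextToken stands in for a direct, same-goroutine token handoff.
  func (l *fakeLexer) nextToken() int {
      l.n++
      return l.n
  }

  func BenchmarkDirectCall(b *testing.B) {
      l := &fakeLexer{}
      sum := 0
      for i := 0; i < b.N; i++ {
          sum += l.nextToken()
      }
      _ = sum
  }

  func BenchmarkBufferedChannel(b *testing.B) {
      ch := make(chan int, 512)
      // The producer goroutine is deliberately left blocked on the full
      // channel when the benchmark ends; fine for a throwaway measurement.
      go func() {
          for i := 0; ; i++ {
              ch <- i
          }
      }()
      b.ResetTimer()
      sum := 0
      for i := 0; i < b.N; i++ {
          sum += <-ch
      }
      _ = sum
  }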


Pike’s talks emphasize that concurrency is not the same as parallelism. Unfortunately for Go, CPUs are designed to exploit parallelism, in many ways that stack on top of each other, so CSP-style concurrency comes with massive overhead when used on fine-grained problems like this.

If you want performance, find parallelism and exploit it.


Trying this on my own lexer, increasing the channel buffer size from 10 (which is what it is now) to 512 almost doubles the runtime(!). I tried some other values, and it seems that about 5 to 15 is the "optimal size".

The input file the author uses is about 1M in size, and most of my inputs are much smaller (which is realistic for my use case); my feeling is that it's due to this, but I didn't verify; I'll have to check later and maybe set a more dynamic size based on the size of the input.

The lesson here is what we should (hopefully!) all know already: benchmark these changes, because what works brilliantly in situation A may be detrimental in situation B.
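
For anyone who wants to repeat that sweep, something along these lines works. It runs against a toy word lexer, not the lexer described above:

  // A sketch of the buffer-size sweep against a toy word lexer.
  // Save as bufsize_bench_test.go and run with: go test -bench=BufferSize
  package main

  import (
      "fmt"
      "strings"
      "testing"
  )

  // lexToChan sends one token per word over a channel with the given buffer size.
  func lexToChan(input string, bufSize int) <-chan string {
      ch := make(chan string, bufSize)
      go func() {
          defer close(ch)
          for _, w := range strings.Fields(input) {
              ch <- w
          }
      }()
      return ch
  }

  func BenchmarkBufferSize(b *testing.B) {
      input := strings.Repeat("foo bar baz 42 ", 10000)
      for _, size := range []int{1, 5, 10, 15, 64, 512} {
          b.Run(fmt.Sprintf("buf=%d", size), func(b *testing.B) {
              for i := 0; i < b.N; i++ {
                  for range lexToChan(input, size) {
                  }
              }
          })
      }
  }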


Thank you for writing this. When I saw Pike’s talk I had the same concern: I thought it was a brilliant way to make the programmer’s job easier, but I questioned the performance cost. (Not that performance is always important.)

I actually added an item to my “some day” list to benchmark both versions, as you did, but I never got around to it.

Now that I’ve read your article I can remove that from my list.

My hope is that some motivated smart person uses your benchmarks to optimize the Go channel system. Often, having a specific use case is exactly what’s needed.





