
The textbook approach here would be a goto-based DFA state machine (Ragel can generate one).
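For a sense of what that looks like, here's a hand-written sketch of a goto-based DFA recognizing just two keywords, "if" and "int". A generator like Ragel would emit the states for you; everything below (function and label names included) is made up for illustration:

```c
#include <stddef.h>

enum kw { KW_NONE = 0, KW_IF, KW_INT };

/* Each labeled block is a DFA state; transitions are gotos. */
enum kw classify(const char *p, size_t n)
{
    const char *end = p + n;

    /* start state: expect 'i' */
    if (p == end || *p++ != 'i') return KW_NONE;
    /* state "i": branch on the next byte */
    if (p == end) return KW_NONE;
    switch (*p++) {
    case 'f': goto s_if;
    case 'n': goto s_in;
    default:  return KW_NONE;
    }
s_if:   /* accept "if" only if the token ends here */
    return (p == end) ? KW_IF : KW_NONE;
s_in:   /* state "in": expect 't' */
    if (p == end || *p++ != 't') return KW_NONE;
    /* state "int": accept only at end of token */
    return (p == end) ? KW_INT : KW_NONE;
}
```

Each input byte is examined exactly once, and no hash or final string comparison is needed.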



If that would be faster, why not do that then?


Because they're using Flex & Bison, not Ragel. The latter is specialized for and excels at defining and applying finite state machines. But while the design of Ragel makes it relatively trivial to build ASTs during the parse, you still have to do it manually. Flex & Bison automate much more of the process of transforming an input stream to a tree. Dropping them and moving to Ragel would be a serious refactor.

FWIW, Flex also builds DFAs. It's just that the design of Ragel is so elegant that Ragel DFA machines are more flexible, easier to compose (both in terms of combining abstract machines and in terms of language-level integration), and yet still incomparably faster. While a straight DFA match should in principle be at least as efficient as a perfect hash[1], this is really only realized with something like Ragel or Intel Hyperscan. As awesome as RE2 is, the performance isn't in the same league as those two; ditto for JIT'd PCREs and re2c--neither are comparable to Ragel or Hyperscan.

[1] At least until the input set grows into the thousands or millions of items--at some point the size of the generated DFA grows faster than and overtakes the size of the generated perfect hash function.


Wouldn't a DFA be faster than perfect hashing here if only because non-keywords can collide with keyword hash values, so a strcmp is always going to be necessary regardless of the actual keyword differentiation?

I tried looking at the code to see if they were excluding non-keywords at some other stage - because then this would seem redundant?


Yeah, conceptually for smaller sets a DFA should easily meet or beat perfect hashing for the reason you describe.

But because of pipelining, branch prediction, etc., it depends on the details of how the generated parser scans the input stream. A simple hash function is more likely to execute in a tight loop with minimal data dependencies because it's more natural to code it that way; it might be able to consume the stream faster even if performing more work simply because of how it slices the work. I'm not positive, but AFAICT Flex only generates table-driven machines, so at the very minimum the scanner is already handicapped. A tiny table that fits in cache might not be much of a handicap, but I'd be surprised if Flex does an especially good job of that. And why would it--it's designed to work in tandem with Bison to solve more complex problems, plus it's hemmed in by having to support historical program structures and interfaces, such as liberal use of functions and function pointers for composition.
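To make the tight-loop shape concrete, here's an illustrative hash-then-verify lookup (not any project's actual code; the hash, table size, and slot indices are made up, with the slots precomputed offline for this particular hash). The hash loop is one multiply-add per byte with no branches, and the final memcmp exists precisely because non-keywords can land on a keyword's slot:

```c
#include <stdint.h>
#include <string.h>

/* Slot indices are where (h & 7) lands for each keyword
 * under the hash below, computed offline. */
static const char *keywords[8] = {
    [0] = "return", [5] = "if", [7] = "int",
};

int is_keyword(const char *tok, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)   /* tight loop, minimal data */
        h = h * 31 + (uint8_t)tok[i];  /* dependencies: mul + add  */
    const char *cand = keywords[h & 7];
    /* The hash alone can't reject arbitrary identifiers that
     * collide with a keyword's slot, hence the full comparison. */
    return cand && len == strlen(cand) && memcmp(tok, cand, len) == 0;
}
```

The whole loop body compiles down to a handful of instructions with no table lookups or indirect jumps, which is the contrast being drawn with a table-driven DFA.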


No: too branchy, and byte-wise only. This perfect hash is mostly word-wise, and the keywords are longer, ~4-5 characters.


Still, even the best strcmp is going to involve some byte checking unless the string is exactly divisible into large words.

Wouldn't the efficiency between the two methods mainly depend on the relative odds of non-keywords occurring?
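A word-wise comparison in the spirit being discussed might look like the sketch below (illustrative only): compare eight bytes at a time, then fall back to a byte loop for whatever tail is left, which is exactly the residual byte checking mentioned above.

```c
#include <stdint.h>
#include <string.h>

/* Word-wise equality: 8 bytes per step, byte-wise tail. */
int eq_wordwise(const char *a, const char *b, size_t n)
{
    while (n >= 8) {
        uint64_t wa, wb;
        memcpy(&wa, a, 8);   /* memcpy avoids unaligned-access UB; */
        memcpy(&wb, b, 8);   /* compilers turn it into plain loads */
        if (wa != wb) return 0;
        a += 8; b += 8; n -= 8;
    }
    while (n--)              /* the unavoidable byte-wise tail */
        if (*a++ != *b++) return 0;
    return 1;
}
```

For ~4-5 character keywords the word loop rarely runs more than once, so the tail handling dominates; whether that beats a DFA would indeed hinge on how often the input is a non-keyword that gets rejected early.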



