Some guy called Ilan Schnell wrote a Python code generator that uses the very same algorithm. It comes with language templates for Python, C and C++ (and is easy to extend to others), and a dot-file generator so you can visualize the graph:
Ilan Schnell is not "some guy". He's the original primary author of the Anaconda distribution. One of the main reasons that so many data scientists use Python.
I think I'm missing something conceptually as to why perfect hashing is even needed in this case. Since postgres only has only about 450 or so keywords, shouldn't it suffice to just use a standard string hashing algorithm like:
unsigned int Hash(const std::string &x)
{
    unsigned int h = 31;
    for (size_t i = 0; i < x.length(); i++)
    {
        h = (h * 76991) ^ (x[i] * 77003);
    }
    return h;
}
Which I tested, and it seems to work fine with no collisions for the ~440 keywords Postgres has. It's definitely not as theoretically satisfying as a constructive way to generate perfect hashes, but unless they plan to add hundreds of new keywords in the future, shouldn't standard hash functions work (after tweaking the constants to ensure that no collisions occur for their particular set of words)?
Your function returns 32 bit values, which is far from minimal. CHM92 can generate a minimal function i.e. the numbers are all between 0 and 441.
Furthermore, if you have a function for each keyword, you can create a CHM92 function that returns the address of the correct function to be called. (NB: This approach assumes the input does not contain invalid keywords)
Alternatively, you can order your keywords according to some kind of priority and number them from 0 to 439. So you generate a CHM92 function that generates those numbers. When you parse the input, you apply the hash function. You then look up the number in a table of keywords and do a strcmp() check that it's valid. Then you can compare the priority of it with any other keyword without any further lookups.
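To make that concrete, here is a minimal sketch of the lookup side of that scheme (the table and keyword_hash() are placeholders; a real CHM92 generator would emit the hash function and the keyword ordering for you):

    /* Hypothetical sketch: keyword lookup via a generated minimal perfect hash.
       keyword_hash() stands in for the generated CHM92 function and is assumed
       to return a value in 0..NUM_KEYWORDS-1; the table order matches the
       priority numbering described above. */
    #include <string.h>

    #define NUM_KEYWORDS 440

    extern const char *const keywords[NUM_KEYWORDS];  /* ordered by priority */
    extern unsigned int keyword_hash(const char *s);  /* generated function */

    /* Returns the keyword's index (its priority), or -1 for non-keywords. */
    int lookup_keyword(const char *s)
    {
        unsigned int idx = keyword_hash(s);
        if (idx < NUM_KEYWORDS && strcmp(s, keywords[idx]) == 0)
            return (int) idx;
        return -1;
    }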
Binary search is a pretty bad algorithm. Sure, it runs in O(log n) time and it's more or less the best you can do to solve the general problem of finding something in a sorted list, but that does not mean it's a good way to create an immutable list of things that can be searched using comparisons.
In particular, binary search has awful cache locality. Until you get very close to your target, you are essentially hitting a random cache line (and maybe even a random page for large lists) for every comparison. This is bad. A B-tree-like structure that is never modified is very easy to make, and it will be much faster.
This came up with the ORC unwinder in Linux. We got something like a 4x speedup by changing from binary search to something a bit more clever to find unwind entries in the table.
> In particular, binary search has awful cache locality.
That makes me think of an idea which I'm certain can't be original, but I can't think of the jargon for what it's called.
Suppose you have a fixed-length sorted array, and its binary search can be visualized as a tree. Traverse that tree in level order and pop it into a helper array, arranged much like a heap: the midpoint comes first, and all left and right children follow at offsets you can predict.
So the original array [A,B,C,D,E,F,G] has a data-structure for searching of [(D,3), (B,1), (F,5), (A,0), (C,2), (E,4), (G,6)]
While the high-indexed items will get spread out, the most common and sequential activity occurs all together near the lower indexes, minimizing large jumps. Since each entry has the original index stored, you can always switch back to the initial array when a different strategy (e.g. linear scan) becomes preferable.
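For reference, this layout is usually called an Eytzinger (breadth-first / level-order) layout. A minimal sketch of searching it, using the usual 2i+1 / 2i+2 child indexing (the struct and names are made up for illustration):

    /* Sketch: search a level-order ("Eytzinger") layout of a sorted array.
       Each entry stores the key plus its index in the original sorted array. */
    #include <string.h>

    struct entry {
        const char *key;
        int orig_index;   /* position in the original sorted array */
    };

    /* Returns the original index of the key, or -1 if absent. */
    int levelorder_search(const struct entry *t, int n, const char *key)
    {
        int i = 0;
        while (i < n) {
            int cmp = strcmp(key, t[i].key);
            if (cmp == 0)
                return t[i].orig_index;
            i = (cmp < 0) ? 2 * i + 1 : 2 * i + 2;   /* descend to left/right child */
        }
        return -1;
    }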
Yes this is a nice trick. You can use all kinds of implicit search trees for this. See also my comment to a sibling with a link to a paper evaluating the performance of this, including the time it takes to compute that new layout: https://news.ycombinator.com/item?id=18881938
Actually cache locality might not be quite as bad as you think: the first few levels of the "pivot tree" will definitely be cached (although wasting a whole cache line on a single element is certainly not optimal use of the cache). To get better locality one could use a breadth-first layout of the implicit binary search tree, like a binary heap. One compromise could be to duplicate the first few levels of pivots at the beginning of the array, in breadth-first order, while storing the array in the usual sorted order. See "Array Layouts for Comparison-Based Search" for more details.
I came across this last night (http://dtrace.org/blogs/bmc/2018/09/28/the-relative-performa...) and went down the rabbit hole of Rust and its BTreeSet (http://cglab.ca/~abeinges/blah/rust-btree-case/). They came to a very similar conclusion. It all makes me wonder if there's low hanging fruit in a lot of languages/libraries to simply swap out red black trees for B-trees (though I'm guessing the higher constant time cost of B-trees would make it necessary to swap between tree implementations for a truly general case).
> It all makes me wonder if there's low hanging fruit in a lot of languages/libraries to simply swap out red black trees for B-trees
I think the short answer is "yes." Although maybe not quite as low hanging as we'd hope.
The nice thing about RB trees and Splay trees is that they are embeddable at relatively low memory bloat (3 pointers and a color for RB, and 2 pointers for splay) to the object they link into an ordered index.
Embedding list and tree pointers has a variety of advantages for operating system and similar code:
1. The memory is already allocated; you can avoid allocating memory in paths with locks held, which prevents deadlocks / hangs.
2. You can put a single object on multiple lists or indices. I.e., maybe your cached file page is on an eligible-for-reclamation linked list but also indexed by offset into the file (to simplify somewhat). A rough sketch of what such embedding looks like follows this list.
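Here's the rough sketch of the embedding idea (type and field names are made up, not from any particular kernel; real code would use something like container_of() to get from a link field back to its enclosing object):

    /* Sketch: one object embedded in both a reclaim list and an offset-keyed
       red-black tree. No extra allocation is needed to put the object on
       either structure, because the link fields live inside the object. */
    struct list_link { struct list_link *prev, *next; };
    struct rb_link   { struct rb_link *parent, *left, *right; int color; };

    struct cached_page {
        unsigned long     file_offset;
        struct list_link  reclaim_link;  /* on the eligible-for-reclamation list */
        struct rb_link    offset_link;   /* indexed by offset into the file */
        /* ... page contents, flags, etc. ... */
    };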
B-trees are not as trivially embeddable in the same way. As wikipedia points out[0], RB-trees are a kind of low-order B-tree, and it's possible to increase the width to something more like a traditional database B-tree. But this increases the overhead of embedding tree metadata in your object.
I suspect that you're right — dropping in a B-tree of branching factor >2 in place of existing BSTs would improve performance for some indices on some platforms, at the cost of additional memory. But I suspect evaluating where it makes sense would require some profiling or insight into a particular use case. I don't think universal replacement is a slam dunk, at least in low level systems code.
It makes a lot of sense for Rust because their lifetime checker basically prevents the kind of multi-list ownership I've described above. It may make sense for other high level languages for the same reason.
My friend is working on a paper that shows how to do intrusive data structures (I think that's what you meant by multi-list ownership) in Rust. I've seen the draft, and it's surprising and beautiful. Stay tuned...
Yes, intrusive data structure is exactly the correct word for it. I was trying to remember it but couldn't recall it for my comment. Thanks! I'm curious to see how it will work in Rust.
Yes, you probably shouldn't use any kind of binary tree. This has been a sort of open secret for years. I first learned about it from phk's (Poul-Henning Kamp's) 2010 article, "You're doing it wrong". See http://phk.freebsd.dk/B-Heap/queue.html.
What would you recommend as a good book on virtual memory? I learned about the insides of computers during my school days more than 20 years ago, and I would like to refresh/update my knowledge.
Yes, I don't think the article is very good, but it was where I first learned about the whole I/O-complexity thing, and I think it was widely read and effective in popularizing the idea.
That's exactly what phk wrote about: theoretically perfect algorithms such as binary search are never practically perfect, because they never account for real-life things like disk latency and memory pressure. This has led developers to come up with inefficient software.
Very true. Here's a paper from last year that measures the performance of searches on several implicit search tree layouts on CPU and GPU: https://algoparc.ics.hawaii.edu/pubs/18-ipdps.pdf. You can see the results in figures 7 and 9 (pages 9 and 10).
tl;dr: in an array of 500 million elements, if you're doing more than ~5 million queries (1%), it pays off to spend the time to construct a more efficient representation of the data and search on that. On a GPU, you need ~15 million queries (3%).
SQLite already implements such a mechanism[0] to determine whether or not a given identifier is really an SQL keyword. The result is a single static hash table (a C array of chars) with O(1) lookup. This is the fastest known implementation for keyword lookup.
I'm not clear as to why that'd be faster than what we/PG have done. On a quick skim, that allows for conflicts in the hash table, which our new implementation doesn't? Ours is also a C array (albeit of offsets into one blob of strings rather than the strings directly; the higher data density achievable due to that was worth it).
The way I've seen most languages do it is something like this:
switch (c[0]) {
case 'c':
    if (strcmp(c, "case") == 0) return CASE;
    if (strcmp(c, "continue") == 0) return CONTINUE;
    ...
    break;
case 'f':
    if (strcmp(c, "for") == 0) return FOR;
    if (strcmp(c, "friend") == 0) return FRIEND;
    ...
    break;
}
I suspect that's good enough if you have a typical keyword list of say 20-100.
I just looked at Clang and it appears to build the list at runtime, using a regular hash table. I think this is because it has to handle C, C++, and Objective C, and they have different sets of keywords.
It seems like they could change it to use something like perfect hashing, and if that's even a 1% win, it would be worth it for the extra build complexity, since C and C++ compilation take up so many CPU cycles in the world.
See lib/Basic/IdentifierTable.cpp.
Although another problem is that the list of keywords depends on whether you're compiling C89, C99, C++98, C++11, etc. But that is a finite list, so it seems like it could still be done with code generation ahead of time, like Postgres does.
In Oil [1] I use re2c to recognize keywords, along with the rest of the complex lexing process, and it works well. It builds a DFA expressed in C code. The slower parts are elsewhere :)
EDIT: In theory, the DFA seems better than perfect hashing because you can bail on the second character if you get "xzoo" and no keyword starts with "xz". I think with perfect hashing you still have to look at every byte.
The switch statement above also avoids looking at every byte in certain cases.
You don't necessarily have to look at every byte to compute the hash - gperf will generate hash functions that use only a few characters at seemingly random offsets. You do, however, have to do a full strcmp() to rule out a hash collision (with words outside of your set), so the DFA wins there.
Imho though, the most significant difference between a DFA and a PHF is that the DFA would likely generate code that introduces a lot of branches and control flow, which your CPU's branch predictor might have a harder time with than a series of arithmetic ops and a single table lookup.
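To illustrate the "few characters at fixed offsets" idea, here is a sketch in the style of a gperf-generated hash (the sampled positions and the weight table are made up, not actual gperf output):

    /* Sketch in the gperf style: combine the key length with per-character
       weights from a small table, sampling only a couple of fixed positions.
       The weights and sampled positions here are purely illustrative. */
    static unsigned int keyword_hash(const char *s, unsigned int len)
    {
        static const unsigned char asso_values[256] = {
            0 /* ... generated weights, one per possible byte value ... */
        };
        return len + asso_values[(unsigned char) s[0]]
                   + asso_values[(unsigned char) s[len - 1]];
    }

A full strcmp() against the candidate keyword is still needed afterwards, as noted above.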
By the way, thanks for oilshell. I don't use it, but your blog is a superb resource on parsing techniques.
Nobody is really using Oil yet because I made it more of a programming language than an interactive shell, but that's changing right now :) There should be something interesting for people to use within a few months.
I would like to see a comparison between DFA and perfect hashing sometime. I wanted to do it myself, but I couldn't justify it because there are other slow parts of the parser. I am surprised that postgres got such a large speedup, to be honest. I guess it has to do with the large number of keywords.
I match thousands of fixed strings "in parallel", and compare different regex engines, and compilation vs. interpretation.
The biggest surprise is that interpreted regexes like grep and ripgrep are faster than compiled DFAs with re2c for 13 strings. When you search for thousands of strings in parallel, re2c takes over.
This could be compared to perfect hashing, I think.
This will be mentioned on my blog, but I'm not sure if I'll have time to do a detailed breakdown.
The motivation for this was somewhat different. I was trying to understand if the Aho-Corasick algorithm to generate a DFA from fixed strings is "important", given that you can already generate a DFA from a regex. In other words, Aho-Corasick seems like more or less a special case of the typical regex -> DFA compilation.
Whether an identifier is a type name or not affects the grammar.
x * y;
Is this a multiplication expression or a variable pointer to a value of type x?
(x) * y
Is this a typecast of *y to the x type, or is it x multiplied by y?
This is what people mean when they say that typedefs introduce keywords in C. An easy way to solve it is to look up type names from the lexer and return a different token type (i.e. not identifier). This puts type names into a different category of lexeme from identifiers.
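This is the classic "lexer hack". A minimal sketch of the idea (symbol_table_is_typedef_name() is a made-up helper standing in for whatever symbol-table query the parser maintains):

    /* Sketch of the "lexer hack": the lexer consults the symbol table so the
       parser sees TOKEN_TYPE_NAME vs TOKEN_IDENTIFIER as distinct token kinds.
       symbol_table_is_typedef_name() is hypothetical. */
    enum token_kind { TOKEN_IDENTIFIER, TOKEN_TYPE_NAME /* , ... */ };

    extern int symbol_table_is_typedef_name(const char *name);

    static enum token_kind classify_identifier(const char *name)
    {
        if (symbol_table_is_typedef_name(name))
            return TOKEN_TYPE_NAME;
        return TOKEN_IDENTIFIER;
    }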
I think you are confused about the concept of keywords. Yes, x*y can mean two things, but neither interpretation has anything to do with keywords. According to the C99 standard, the set of keywords is exactly and exhaustively defined in Section 6.4.1. Doing something like `typedef unsigned mynum;` doesn't make `mynum` a keyword.
I doubt barrkel is confused, as (s)he writes: "This is what people mean when they say that typedefs introduce keywords in C." (It's also essentially why caf included it in their comment.)
Just as you cannot have a variable or function named int, you cannot have one with the same name as a typedef, for example. Also, typedef names legally can appear in the same places as the built-in types int, long, etc.
That's why, when you write a parser for C, you often need to include typedef names in the set of things to check for.
Given that there typically are way more typedef names than keywords, and you often have to check “is this a built-in type or typedef name”, keeping the two checks separate and optimizing the latter doesn’t make much sense.
Normally keywords are reserved words that can't be used as identifiers because they'd make the grammar ambiguous. C typedef names aren't like this; by definition, you use them as an identifier, it's how they come into being.
Fortran is an oddball example, though: whitespace is optional (Fortran 77), and keywords can be used as identifiers, making parsing difficult. There are reasons more modern languages make different design choices.
Why is this relevant to keyword lookup? Anyway, you're not supposed to use reserved keywords as either variable or type names, except for the primitive types (which are obviously types).
Because it happens in the lexer, and depending on how you design your symbol table lookups, you may want to use the same hash function for all table lookups, instead of calculating different hashes for each table. Symbol table lookups can compare hash value first (store the full hash code per entry) to skip the expensive string comparison too.
Perfect hashing isn't always what you want in any case. For many languages, especially symbolically flavoured ones like C, most strings found by the lexer aren't keywords. A hash table which is mostly empty and is statically verified not to contain keyword collisions is ideal. Keep the hash code calculation separate, and do it only once. Little point in repeatedly calculating it.
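A minimal sketch of the "store the full hash, compare it before strcmp" idea (the entry layout is illustrative):

    /* Sketch: store the full hash per entry so most non-matching probes are
       rejected by an integer compare instead of a string compare. */
    #include <string.h>

    struct sym_entry {
        unsigned int  hash;   /* full hash of name, computed once by the lexer */
        const char   *name;
        /* ... payload ... */
    };

    static int entry_matches(const struct sym_entry *e,
                             unsigned int hash, const char *name)
    {
        return e->hash == hash && strcmp(e->name, name) == 0;
    }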
There's a lot of overlap with those keywords, so perhaps the best bet is to create a single superset and use an attribute returned by the lookup to see if it's needed by the target language.
We recently started using perfect hashing in V8's parser, also with a measurable performance improvement (even over what we had previously, a first-character jump table followed by linear search). It's always nice to apply that computer science education to a real-world problem.
They replaced one textbook algorithm with another. It turns out to be faster, but it could just as well have been slower for certain inputs (which might be common). Also, when they increase the number of keywords in the future, they might have to reconsider the implementation.
1) Yes, a 20% improvement is impressive, especially in such a major component of such an important and long-lived piece of software.
2) What do you mean "certain inputs"? There is a fixed set of keywords. The set of inputs is known. It's a great application for perfect hashing. Or do you mean that some keys are going to be slower to search for than others? The best case of binary search is that you search for the key in the exact middle of the table, which requires comparing each character of the search key. (Searching for any other key will require cache misses, possibly, and additional scans of the key's characters.) By contrast, for perfect hashing, you scan the key once to compute the hash value and go directly to the correct slot. So the hash lookup for any key is pretty close to identical to the best case of the binary search.
3) When they add new keywords, they generate a different hash function. That's easy. The implementation does not have to be reconsidered.
I think it's impressive, but maybe not for the same reasons as the OP: that means that keyword lookup accounted for at least 20% of total SQL parsing time. That's a surprise to me!
I'm having trouble coming up with a scenario where perfect hashing is slower than the binary search.
Also, yes, the very nature of perfect hashing means it will likely need to be regenerated when additional keywords are added. However, that doesn't happen often, and it's a small amount of work compared to everything else a new keyword entails.
Reading the discussion thread is fantastic. The postgres community culture definitely feels a lot different than some of the other open source projects I've seen.
let token_raw = "FROM";
let index = keywords.binary_search(&token_raw).unwrap();
let token = tokens[index];
As a binary search, that's O(log N) string comparisons. In the worst case, it may try comparing with INSERT, then AS, then FROM before deciding the index is 1, but with more keywords you have more comparisons.
With perfect hashing, you devise a single hash function that hashes all possible values uniquely. Now we have something like this:
… with all the numbers calculated by hashing the strings at compile time. Then, instead of O(log N) string comparisons, you only need one hash and one hash table lookup:
let token_raw = "FROM";
let token = tokens[hash(token_raw)];
The lexer knows it has a token, but it still has to match the token identifier to the corresponding keyword. Whether this happens at the lexing or parsing stage doesn't matter much, it's still expensive when strcmp is called a bunch of times for every keyword token.
Lexers don't do strcmps; they use a finite state machine to determine tokens. For instance, if we have the tokens ALTER, AND, ANY, SELECT, FROM, it will compile into a state machine like this:
switch(str[i++]){
case 'A':
    switch(str[i++]){
    case 'L': ...
    case 'N':
        switch(str[i++]){
        case 'D':
            found token!
        case 'Y':
            found token!
        default:
            syntax error!
        }
    ...
    }
case 'S': ...
case 'F': ...
...
}
Lexers already do this, and in each place where a token is found, it can know which token it was based on the DFA state that it's in.
If you have a lot of states (PostgreSQL has > 400 keywords to identify), the assembly encoding the switch will be several KB in size, so you'll have cache misses.
Perfect hashing trades more computation for fewer cache misses.
[edit] Also, you shouldn't forget that sparse switch/case statements aren't O(1).
That's the advantage of the state machine, but in such a table the lengths are already known at compile time, and therefore the final check will be optimized to use a word-wise aligned memcmp, which beats the branchy lexer code by miles.
I don't understand what you mean, can you elaborate? Are you talking about perfect hashing? Or are you talking about doing multiple strcmps? Or about doing the state machine until you reach a unique alternative, and match the remainder using strcmp?
What I was trying to say is that the lexer is already doing a state machine, and the keyword can be determined from the state of the state machine, rather than re-determining it from the token substring itself.
A perfect hash is one in which there are no collisions for a given data set, which guarantees O(1), or constant, lookup time. This patch introduces a Perl script which computes a perfect hash function at build time, given the current set of SQL keywords.
The old keyword lookup method uses binary search, which compares O(log n) entries in the worst case before finding a match.
Making 1 keyword comparison instead of log n keyword comparisons yields the 20% speedup.
> Making 1 keyword comparison instead of log n keyword comparisons yields the 20% speedup.
Worthwhile to note that that speedup is for the raw parsing, without parse-analysis (i.e. looking up the tables referenced and such), planning (figuring out how to execute the query) and execution. So end-user impact will be much smaller than 20%. But for some cases, e.g. when many columns are referenced, it'll be measurable.
Using a perfect hash table is definitely cool, but it's hard to imagine a query with so many keywords that this would have a measurable impact on total query execution time.
Due to the desire to have keywords that aren't fully reserved (we have keyword classes that are fully reserved, can appear as type names, can appear as column names, or can appear nearly everywhere) we/PG need to look up whether a specific string is a keyword or an identifier in a number of places in the scanner/parser.
Addendum:
And that indeed means that column references, unless quoted with "", used to go through the binary search and now go through the perfect hash.
Because they're using Flex & Bison, not Ragel. The latter is specialized for and excels at defining and applying finite state machines. But while the design of Ragel makes it relatively trivial to build ASTs during the parse, you still have to do it manually. Flex & Bison automate much more of the process of transforming an input stream to a tree. Dropping them and moving to Ragel would be a serious refactor.
FWIW, Flex also builds DFAs. It's just that the design of Ragel is so elegant that Ragel DFA machines are more flexible, easier to compose (both in terms of combining abstract machines and in terms of language-level integration), and yet still incomparably faster. While a straight DFA match should in principle be at least as efficient as a perfect hash[1], this is really only realized with something like Ragel or Intel Hyperscan. As awesome as RE2 is, the performance isn't in the same league as those two; ditto for JIT'd PCREs and re2c--neither are comparable to Ragel or Hyperscan.
[1] At least until the input set grows into the thousands or millions of items--at some point the size of the generated DFA grows faster than and overtakes the size of the generated perfect hash function.
Wouldn't a DFA be faster than perfect hashing here if only because non-keywords can collide with keyword hash values, so a strcmp is always going to be necessary regardless of the actual keyword differentiation?
I tried looking at the code to see if they were excluding non-keywords at some other stage - because then this would seem redundant?
Yeah, conceptually for smaller sets a DFA should easily meet or beat perfect hashing for the reason you describe.
But because of pipelining, branch prediction, etc, it depends on the details of how the generated parser scans the input stream. A simple hash function is more likely to execute in a tight loop with minimal data dependencies because it's more natural to code it that way; it might be able to consume the stream faster even if performing more work simply because of how it slices the work. I'm not positive, but AFAICT Flex only generates table-driven machines, so at the very minimum the scanner is already handicapped. A tiny table that fits in cache might not be much of a handicap, but I'd be surprised if Flex does an especially good job of that. And why would it--it's designed to work in tandem with Bison to solve more complex problems, plus it's hemmed in by having to support historical program structures and interfaces, such as liberal use of functions and function pointers for composition.
It seems that this is a special case of hashing, indicated by the word 'perfect', where it somehow (I did not read the article and I don't know how this is done) generates a guaranteed collision-free hash table. While you can do the same with cryptographic hash functions, I'm sure those are much slower than what is implemented as a non-cryptographically-secure hash function in databases.
That reminds me: some years ago, I created a pull request for Redis to use perfect hashing with gperf to parse commands. IIRC, the author turned it down because he was wary of adding the build-time dependency on gperf.
That's awkward, because the gperf dependency is only needed every 2 years or so, when a new keyword is added. The generated C code needs to be added to git, and only needs a regen if changed. I do that all the time via gperf or via some perl-generated perfect hash tables, e.g. for static keywords, config tables, Unicode tables, ...
The memory win is always dramatic, like 20x better typically.
I also use some standard gperf-fixup scripts for easier inclusion into my code.
Yes, you can, but this approach is better. Pearson would beat CHM on very small CPUs (8-16 bit), but it is not guaranteed to find a good permutation. With CHM it's much easier.
The most important thing is to store the keywords for the false-positive verification in a contiguous const char *const array.
Since there is a lot of discussion about alternative solutions to the keyword lookup problem, I was wondering why tries haven't been mentioned yet.
Since the current implementation has to do a string compare after the hash, wouldn't a trie be at least as fast? Am I missing something? Are there other disadvantages?
For the trie, you still would have to do multiple character comparisons (on level 1, on average about 13, I guess, given that I read there are 400 keywords, and 26 characters in the alphabet).
Presumably, hashing the string is faster. I think that's partly because it introduces fewer cache misses than walking a trie. Both this approach and a trie have to walk the input string, but code using a trie jumps around the trie, while this code only compares against the keyword, which is stored in contiguous bytes.
Another reason might be that trie-based code is worse for branch prediction logic.
Regarding your first point: one possible implementation of tries uses an array of length "size of alphabet" to store the children of a node. So at each node it's just an offset into an array, not multiple character comparisons.
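For illustration, a minimal sketch of that array-children representation (assuming keywords restricted to lowercase a-z; the names are made up):

    /* Sketch: trie node with children indexed directly by letter, so
       descending one level is a single array index, not a comparison loop. */
    #define ALPHABET_SIZE 26

    struct trie_node {
        struct trie_node *children[ALPHABET_SIZE];  /* child for 'a' + i, or NULL */
        int keyword_id;                             /* -1 if no keyword ends here */
    };

    /* Returns the keyword id, or -1 if word is not in the trie. */
    static int trie_lookup(const struct trie_node *node, const char *word)
    {
        for (; *word && node; word++)
            node = node->children[*word - 'a'];
        return node ? node->keyword_id : -1;
    }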
Of course your other points are valid. Thank you for your answer. So I gather the differences are rather subtle. I was wondering if I was missing something bigger.
Even assuming just a-z, that array will be mostly empty at lower levels, and most nodes are at lower levels. That’s not good for avoiding cache misses.
There are tricks to counteract that (you don’t store pointers to lower levels, for example, but offsets, because at typical sizes of a trie you can store them in way fewer bits), and I haven’t timed any implementation for decades, but it would surprise me if such an approach would improve performance on lower levels.
You could use it at the top level (basically replacing the trie with separate tries for each starting letter).
Also, the 'search for a character in an array' code can be sped up by using vector instructions nowadays (can't find a link for the exact code, but http://0x80.pl/articles/simd-strfind.html may give ideas). Whether you want to do that because of power-usage concerns is a completely different can of worms.
Someone else can probably comment better, but hashing seems more efficient memory-wise than holding the trie in memory, and hashing is probably also better cache-wise compared to jumping between the nodes of the trie.
Building and peeling a graph should be identical, but the assign step is different. I use the non-minimal, non-order-preserving BDZ scheme, while Joerg's code is based on the original CHM scheme.
Given keys A, B, C and D, CHM always maps them to values 0, 1, 2 and 3. BDZ, on the other hand, can generate values in a range [0, 2.1 * 4) and it doesn't preserve an order. For instance, one possible mapping in the BDZ scheme is 7, 2, 5, 1.
I think this is one of the classes of problems that Jonathan Blow identifies. Instead of having a Turing-complete macro system, you have regular functions that evaluate at compile time. You'd generate your perfect hash in the same language you're working in.
D does this, although it's not necessarily one or the other. C++ has the C preprocessor, templates, and constexpr. Rust has macros, but is looking at adding const function evaluation too.
It can be made good. I used it for parsing /proc/PID/status keywords at a very high rate to support "top" and "ps". You'll have to change the generated code, which likely means that you use gperf during development instead of running it during every build.
The first step was to put only the needed keywords in a file, without the colons. I ran gperf with options to choose 3 specific character offsets and to use the length in the calculation; this was best after I tried a few choices. I then cleaned up the code and optimized it. I rounded the table size up to a power of 2, padding it correctly, and changed the length check to a mask. I switched the code over to using gcc's computed goto extension. I provided the keyword lengths, letting me use memcmp instead of strcmp. I inlined everything.
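A rough sketch of the shape that kind of hand-tuned lookup can end up taking (this is illustrative only, not the commenter's actual code; the table, constants, and keyword_hash() are assumptions):

    /* Sketch: perfect-hash lookup with a power-of-two table indexed by a mask,
       and per-entry lengths so memcmp() can be used instead of strcmp(). */
    #include <string.h>

    #define TABLE_SIZE 64                      /* rounded up to a power of 2 */
    #define TABLE_MASK (TABLE_SIZE - 1)

    struct kw { const char *name; unsigned char len; int id; };
    extern const struct kw table[TABLE_SIZE];  /* padded; empty slots have len 0 */
    extern unsigned int keyword_hash(const char *s, unsigned int len);

    static int lookup(const char *s, unsigned int len)
    {
        const struct kw *e = &table[keyword_hash(s, len) & TABLE_MASK];
        if (e->len == len && memcmp(s, e->name, len) == 0)
            return e->id;
        return -1;
    }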
It's a reasonable choice for the use-case at hand. I used a similar approach to generate C code using Python with Jinja2. Ironically, I generated a binary hash-based lookup, because the hash values ought to stay constant even when new keys are added in future versions.
Devel::Tokenizer::C generates horrible code, even worse than gperf. Much better is my Perfect::Hash module, this PostgreSQL module, or the new perl5 mph script.
Haven't checked re2c, but it looks like a naive lexer implementation generating a static DFA bytewise, optimized for neither speed nor memory.
http://ilan.schnell-web.net/prog/perfect-hash/
He also has a superb illustrated explanation of the algorithm, a unique property of which is that you get to choose the hash values:
http://ilan.schnell-web.net/prog/perfect-hash/algo.html
I've been using it for years.