Some guy called Ilan Schnell wrote a Python code generator that uses the very same algorithm. It comes with language templates for Python, C and C++ (and is easy to extend to others), and a dot-file generator so you can visualize the graph:
Ilan Schnell is not "some guy". He's the original primary author of the Anaconda distribution. One of the main reasons that so many data scientists use Python.
I think I'm missing something conceptually as to why perfect hashing is even needed in this case. Since postgres only has only about 450 or so keywords, shouldn't it suffice to just use a standard string hashing algorithm like:
unsigned int Hash(const std::string &x)
{
    unsigned int h = 31;
    for (size_t i = 0; i < x.length(); i++)
    {
        h = (h * 76991) ^ (x[i] * 77003);
    }
    return h;
}
Which I tested, and it seems to work fine with no collisions for the ~440 keywords Postgres has. It's definitely not as theoretically satisfying as a constructive way to generate perfect hashes, but unless they plan to add hundreds of new keywords in the future, shouldn't standard hash functions work (after tweaking the constants to ensure that no collisions occur for their particular set of words)?
Your function returns 32 bit values, which is far from minimal. CHM92 can generate a minimal function i.e. the numbers are all between 0 and 441.
Furthermore, if you have a function for each keyword, you can create a CHM92 function that returns the address of the correct function to be called. (NB: This approach assumes the input does not contain invalid keywords)
Alternatively, you can order your keywords according to some kind of priority and number them from 0 to 439. So you generate a CHM92 function that generates those numbers. When you parse the input, you apply the hash function. You then look up the number in a table of keywords and do a strcmp() check that it's valid. Then you can compare the priority of it with any other keyword without any further lookups.
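To make that concrete, here is a minimal sketch of the lookup side of that scheme (the table and keyword_hash() are placeholders; a real CHM92 generator would emit the hash function and the keyword ordering for you):

    /* Hypothetical sketch: keyword lookup via a generated minimal perfect hash.
       keyword_hash() stands in for the generated CHM92 function and is assumed
       to return a value in 0..NUM_KEYWORDS-1; the table order matches the
       priority numbering described above. */
    #include <string.h>

    #define NUM_KEYWORDS 440

    extern const char *const keywords[NUM_KEYWORDS];  /* ordered by priority */
    extern unsigned int keyword_hash(const char *s);  /* generated function */

    /* Returns the keyword's index (its priority), or -1 for non-keywords. */
    int lookup_keyword(const char *s)
    {
        unsigned int idx = keyword_hash(s);
        if (idx < NUM_KEYWORDS && strcmp(s, keywords[idx]) == 0)
            return (int) idx;
        return -1;
    }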
Binary search is a pretty bad algorithm. Sure, it runs in O(log n) time and it's more or less the best you can do to solve the general problem of finding something in a sorted list, but that does not mean it's a good way to create an immutable list of things that can be searched using comparisons.
In particular, binary search has awful cache locality. Until you get very close to your target, you are essentially hitting a random cache line (and maybe even a random page for large lists) for every comparison. This is bad. A B-tree-like structure that is never modified is very easy to make, and it will be much faster.
This came up with the ORC unwinder in Linux. We got something like a 4x speedup by changing from binary search to something a bit more clever to find unwind entries in the table.
> In particular, binary search has awful cache locality.
That makes me think of an idea which I'm certain can't be original, but I can't think of the jargon for what it's called.
Suppose you have a fixed-length sorted array, and its binary search can be visualized as a tree. Traverse that tree in level order and pop it into a helper array, arranged much like a heap: the midpoint comes first, and all left and right children follow at offsets you can predict.
So the original array [A,B,C,D,E,F,G] has a data-structure for searching of [(D,3), (B,1), (F,5), (A,0), (C,2), (E,4), (G,6)]
While the high-indexed items will get spread out, the most common and sequential activity occurs all together near the lower indexes, minimizing large jumps. Since each entry has the original index stored, you can always switch back to the initial array when a different strategy (e.g. linear scan) becomes preferable.
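For reference, this layout is usually called an Eytzinger (breadth-first / level-order) layout. A minimal sketch of searching it, using the usual 2i+1 / 2i+2 child indexing (the struct and names are made up for illustration):

    /* Sketch: search a level-order ("Eytzinger") layout of a sorted array.
       Each entry stores the key plus its index in the original sorted array. */
    #include <string.h>

    struct entry {
        const char *key;
        int orig_index;   /* position in the original sorted array */
    };

    /* Returns the original index of the key, or -1 if absent. */
    int levelorder_search(const struct entry *t, int n, const char *key)
    {
        int i = 0;
        while (i < n) {
            int cmp = strcmp(key, t[i].key);
            if (cmp == 0)
                return t[i].orig_index;
            i = (cmp < 0) ? 2 * i + 1 : 2 * i + 2;   /* descend to left/right child */
        }
        return -1;
    }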
Yes this is a nice trick. You can use all kinds of implicit search trees for this. See also my comment to a sibling with a link to a paper evaluating the performance of this, including the time it takes to compute that new layout: https://news.ycombinator.com/item?id=18881938
Actually cache locality might not be quite as bad as you think: the first few levels of the "pivot tree" will definitely be cached (although wasting a whole cache line on a single element is certainly not optimal use of the cache). To get better locality one could use a breadth-first layout of the implicit binary search tree, like a binary heap. One compromise could be to duplicate the first few levels of pivots at the beginning of the array, in breadth-first order, while storing the array in the usual sorted order. See "Array Layouts for Comparison-Based Search" for more details.
I came across this last night (http://dtrace.org/blogs/bmc/2018/09/28/the-relative-performa...) and went down the rabbit hole of Rust and its BTreeSet (http://cglab.ca/~abeinges/blah/rust-btree-case/). They came to a very similar conclusion. It all makes me wonder if there's low hanging fruit in a lot of languages/libraries to simply swap out red black trees for B-trees (though I'm guessing the higher constant time cost of B-trees would make it necessary to swap between tree implementations for a truly general case).
> It all makes me wonder if there's low hanging fruit in a lot of languages/libraries to simply swap out red black trees for B-trees
I think the short answer is "yes." Although maybe not quite as low hanging as we'd hope.
The nice thing about RB trees and Splay trees is that they are embeddable at relatively low memory bloat (3 pointers and a color for RB, and 2 pointers for splay) to the object they link into an ordered index.
Embedding list and tree pointers has a variety of advantages for operating system and similar code:
1. The memory is already allocated; you can avoid allocating memory in paths with locks held, which prevents deadlocks / hangs.
2. You can put a single object on multiple lists or indices. I.e., maybe your cached file page is on an eligible-for-reclamation linked list but also indexed by offset into the file (to simplify somewhat). A rough sketch of what such embedding looks like follows this list.
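Here's the rough sketch of the embedding idea (type and field names are made up, not from any particular kernel; real code would use something like container_of() to get from a link field back to its enclosing object):

    /* Sketch: one object embedded in both a reclaim list and an offset-keyed
       red-black tree. No extra allocation is needed to put the object on
       either structure, because the link fields live inside the object. */
    struct list_link { struct list_link *prev, *next; };
    struct rb_link   { struct rb_link *parent, *left, *right; int color; };

    struct cached_page {
        unsigned long     file_offset;
        struct list_link  reclaim_link;  /* on the eligible-for-reclamation list */
        struct rb_link    offset_link;   /* indexed by offset into the file */
        /* ... page contents, flags, etc. ... */
    };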
B-trees are not as trivially embeddable in the same way. As wikipedia points out[0], RB-trees are a kind of low-order B-tree, and it's possible to increase the width to something more like a traditional database B-tree. But this increases the overhead of embedding tree metadata in your object.
I suspect that you're right — dropping in a B-tree of branching factor >2 in place of existing BSTs would improve performance for some indices on some platforms, at the cost of additional memory. But I suspect evaluating where it makes sense would require some profiling or insight into a particular use case. I don't think universal replacement is a slam dunk, at least in low level systems code.
It makes a lot of sense for Rust because their lifetime checker basically prevents the kind of multi-list ownership I've described above. It may make sense for other high level languages for the same reason.
My friend is working on a paper that shows how to do intrusive data structures (I think that's what you meant by multi-list ownership) in Rust. I've seen the draft, and it's surprising and beautiful. Stay tuned...
Yes, intrusive data structure is exactly the correct word for it. I was trying to remember it but couldn't recall it for my comment. Thanks! I'm curious to see how it will work in Rust.
Yes, you probably shouldn't use any kind of binary tree. This has been a sort of open secret for years. I first learned about it from phk's (Poul-Henning Kamp's) 2010 article, "You're doing it wrong". See http://phk.freebsd.dk/B-Heap/queue.html.
What would you recommend as a good book on virtual memory? I learned about the insides of computers during my school days more than 20 years ago, and I would like to refresh/update my knowledge.
Yes, I don't think the article is very good, but it was where I first learned about the whole I/O-complexity thing, and I think it was widely read and effective in popularizing the idea.
That's exactly what phk wrote about: theoretically perfect algorithms such as binary search are never practically perfect, because they never account for real-life things like disk latency and memory pressure. This has led developers to come up with inefficient software.
Very true. Here's a paper from last year that measures the performance of searches on several implicit search tree layouts on CPU and GPU: https://algoparc.ics.hawaii.edu/pubs/18-ipdps.pdf. You can see the results in figures 7 and 9 (pages 9 and 10).
tl;dr: in an array of 500 million elements, if you're doing more than ~5 million queries (1%), it pays off to spend the time to construct a more efficient representation of the data and search on that. On a GPU, you need ~15 million queries (3%).
SQLite already implements such a mechanism[0] to determine whether or not a given identifier is really an SQL keyword. The result is a single static hash table (a C array of chars) with O(1) lookup. This is the fastest known implementation for keyword lookup.
I'm not clear as to why that'd be faster than what we/PG have done. On a quick skim, that allows for conflicts in the hash table, which our new implementation doesn't? Ours is also a C array (albeit of offsets into one blob of strings rather than the strings directly; the higher data density achievable due to that was worth it).
The way I've seen most languages do it is something like this:
switch (c[0]) {
case 'c':
    if (strcmp(c, "case") == 0) return CASE;
    if (strcmp(c, "continue") == 0) return CONTINUE;
    ...
    break;
case 'f':
    if (strcmp(c, "for") == 0) return FOR;
    if (strcmp(c, "friend") == 0) return FRIEND;
    ...
    break;
}
I suspect that's good enough if you have a typical keyword list of say 20-100.
I just looked at Clang and it appears to build the list at runtime, using a regular hash table. I think this is because it has to handle C, C++, and Objective C, and they have different sets of keywords.
It seems like they could change it to use something like perfect hashing, and if that's even a 1% win, it would be worth it for the extra build complexity, since C and C++ compilation take up so many CPU cycles in the world.
See lib/Basic/IdentifierTable.cpp.
Although another problem is that the list of keywords depends on whether you're compiling C89, C99, C++98, C++11, etc. But that is a finite list, so it seems like it could still be done with code generation ahead of time, like Postgres does.
In Oil [1] I use re2c to recognize keywords, along with the rest of the complex lexing process, and it works well. It builds a DFA expressed in C code. The slower parts are elsewhere :)
EDIT: In theory, the DFA seems better than perfect hashing because you can bail on the second character if you get "xzoo" and no keyword starts with "xz". I think with perfect hashing you still have to look at every byte.
The switch statement above also avoids looking at every byte in certain cases.
You don't necessarily have to look at every byte to compute the hash - gperf will generate hash functions that use only a few characters at seemingly random offsets. You do, however, have to do a full strcmp() to rule out a hash collision (with words outside of your set), so the DFA wins there.
Imho though, the most significant difference between a DFA and a PHF is that the DFA would likely generate code that introduces a lot of branches and control flow, which your CPU's branch predictor might have a harder time with than a series of arithmetic ops and a single table lookup.
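To illustrate the "few characters at fixed offsets" idea, here is a sketch in the style of a gperf-generated hash (the sampled positions and the weight table are made up, not actual gperf output):

    /* Sketch in the gperf style: combine the key length with per-character
       weights from a small table, sampling only a couple of fixed positions.
       The weights and sampled positions here are purely illustrative. */
    static unsigned int keyword_hash(const char *s, unsigned int len)
    {
        static const unsigned char asso_values[256] = {
            0 /* ... generated weights, one per possible byte value ... */
        };
        return len + asso_values[(unsigned char) s[0]]
                   + asso_values[(unsigned char) s[len - 1]];
    }

A full strcmp() against the candidate keyword is still needed afterwards, as noted above.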
By the way, thanks for oilshell. I don't use it, but your blog is a superb resource on parsing techniques.
Nobody is really using Oil yet because I made it more of a programming language than an interactive shell, but that's changing right now :) There should be something interesting for people to use within a few months.
I would like to see a comparison between DFA and perfect hashing sometime. I wanted to do it myself, but I couldn't justify it because there are other slow parts of the parser. I am surprised that postgres got such a large speedup, to be honest. I guess it has to do with the large number of keywords.
I match thousands of fixed strings "in parallel", and compare different regex engines, and compilation vs. interpretation.
The biggest surprise is that interpreted regexes like grep and ripgrep are faster than compiled DFAs with re2c for 13 strings. When you search for thousands of strings in parallel, re2c takes over.
This could be compared to perfect hashing, I think.
This will be mentioned on my blog, but I'm not sure if I'll have time to do a detailed breakdown.
The motivation for this was somewhat different. I was trying to understand if the Aho-Corasick algorithm to generate a DFA from fixed strings is "important", given that you can already generate a DFA from a regex. In other words, Aho-Corasick seems like more or less a special case of the typical regex -> DFA compilation.
Whether an identifier is a type name or not affects the grammar.
x * y;
Is this a multiplication expression or a variable pointer to a value of type x?
(x) * y
Is this a typecast of *y to the x type, or is it x multiplied by y?
This is what people mean when they say that typedefs introduce keywords in C. An easy way to solve it is to look up type names from the lexer and return a different token type (i.e. not identifier). This puts type names into a different category of lexeme from identifiers.
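This is the classic "lexer hack". A minimal sketch of the idea (symbol_table_is_typedef_name() is a made-up helper standing in for whatever symbol-table query the parser maintains):

    /* Sketch of the "lexer hack": the lexer consults the symbol table so the
       parser sees TOKEN_TYPE_NAME vs TOKEN_IDENTIFIER as distinct token kinds.
       symbol_table_is_typedef_name() is hypothetical. */
    enum token_kind { TOKEN_IDENTIFIER, TOKEN_TYPE_NAME /* , ... */ };

    extern int symbol_table_is_typedef_name(const char *name);

    static enum token_kind classify_identifier(const char *name)
    {
        if (symbol_table_is_typedef_name(name))
            return TOKEN_TYPE_NAME;
        return TOKEN_IDENTIFIER;
    }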
I think you are confused about the concept of keywords. Yes, x*y can mean two things, but neither interpretation has anything to do with keywords. According to the C99 standard, the set of keywords is exactly and exhaustively defined in Section 6.4.1. Doing something like `typedef unsigned mynum;` doesn't make `mynum` a keyword.
I doubt barrkel is confused, as (s)he writes: "This is what people mean when they say that typedefs introduce keywords in C." (It's also essentially why caf included it in their comment.)
Just as you cannot have a variable or function named int, you cannot have one with the same name as a typedef, for example. Also, typedef names legally can appear in the same places as the built-in types int, long, etc.
That's why, when you write a parser for C, you often need to include typedef names in the set of things to check for.
Given that there typically are way more typedef names than keywords, and you often have to check “is this a built-in type or typedef name”, keeping the two checks separate and optimizing the latter doesn’t make much sense.
Normally keywords are reserved words that can't be used as identifiers because they'd make the grammar ambiguous. C typedef names aren't like this; by definition, you use them as an identifier, it's how they come into being.
Fortran is an oddball example, though: whitespace is optional (Fortran 77), and keywords can be used as identifiers, making parsing difficult. There are reasons more modern languages make different design choices.
Why is this relevant to keyword lookup? Anyway, you're not supposed to use reserved keywords as either variable or type names, except for the primitive types (which are obviously types).
Because it happens in the lexer, and depending on how you design your symbol table lookups, you may want to use the same hash function for all table lookups, instead of calculating different hashes for each table. Symbol table lookups can compare hash value first (store the full hash code per entry) to skip the expensive string comparison too.
Perfect hashing isn't always what you want in any case. For many languages, especially symbolically flavoured ones like C, most strings found by the lexer aren't keywords. A hash table which is mostly empty and is statically verified not to contain keyword collisions is ideal. Keep the hash code calculation separate, and do it only once. Little point in repeatedly calculating it.
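A minimal sketch of the "store the full hash, compare it before strcmp" idea (the entry layout is illustrative):

    /* Sketch: store the full hash per entry so most non-matching probes are
       rejected by an integer compare instead of a string compare. */
    #include <string.h>

    struct sym_entry {
        unsigned int  hash;   /* full hash of name, computed once by the lexer */
        const char   *name;
        /* ... payload ... */
    };

    static int entry_matches(const struct sym_entry *e,
                             unsigned int hash, const char *name)
    {
        return e->hash == hash && strcmp(e->name, name) == 0;
    }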
There's a lot of overlap with those keywords, so perhaps the best bet is to create a single superset and use an attribute returned by the lookup to see if it's needed by the target language.
We recently started using perfect hashing in V8's parser, also with a measurable performance improvement (even over what we had previously, a first-character jump table followed by linear search). It's always nice to apply that computer science education to a real-world problem.
They replaced one textbook algorithm with another. It turns out to be faster, but it could just as well have been slower for certain inputs (which might be common). Also, when they increase the number of keywords in the future, they might have to reconsider the implementation.
1) Yes, a 20% improvement is impressive, especially in such a major component of such an important and long-lived piece of software.
2) What do you mean "certain inputs"? There is a fixed set of keywords. The set of inputs is known. It's a great application for perfect hashing. Or do you mean that some keys are going to be slower to search for than others? The best case of binary search is that you search for the key in the exact middle of the table, which requires comparing each character of the search key. (Searching for any other key will require cache misses, possibly, and additional scans of the key's characters.) By contrast, for perfect hashing, you scan the key once to compute the hash value and go directly to the correct slot. So the hash lookup for any key is pretty close to identical to the best case of the binary search.
3) When they add new keywords, they generate a different hash function. That's easy. The implementation does not have to be reconsidered.
I think it's impressive, but maybe not for the same reasons as the OP: that means that keyword lookup accounted for at least 20% of total SQL parsing time. That's a surprise to me!
I'm having trouble coming up with a scenario where perfect hashing is slower than the binary search.
Also, yes, the very nature of perfect hashing means it will likely need to be regenerated when additional keywords are added. However, that doesn't happen often, and it's a small amount of work compared to everything else a new keyword entails.
Reading the discussion thread is fantastic. The postgres community culture definitely feels a lot different than some of the other open source projects I've seen.
let token_raw = "FROM";
let index = keywords.binary_search(&token_raw).unwrap();
let token = tokens[index];
As a binary search, that's O(log N) string comparisons. In the worst case, it may try comparing with INSERT, then AS, then FROM before deciding the index is 1, but with more keywords you have more comparisons.
With perfect hashing, you devise a single hash function that hashes all possible values uniquely. Now we have something like this:
… with all the numbers calculated by hashing the strings at compile time. Then, instead of O(log N) string comparisons, you only need one hash and one hash table lookup:
let token_raw = "FROM";
let token = tokens[hash(token_raw)];
The lexer knows it has a token, but it still has to match the token identifier to the corresponding keyword. Whether this happens at the lexing or parsing stage doesn't matter much, it's still expensive when strcmp is called a bunch of times for every keyword token.
Lexers don't do strcmps; they use a finite state machine to determine tokens. For instance, if we have the tokens ALTER, AND, ANY, SELECT, FROM, it will compile into a state machine like this:
switch(str[i++]){
case 'A':
    switch(str[i++]){
    case 'L': ...
    case 'N':
        switch(str[i++]){
        case 'D':
            found token!
        case 'Y':
            found token!
        default:
            syntax error!
        }
    ...
    }
case 'S': ...
case 'F': ...
...
}
Lexers already do this, and in each place where a token is found, it can know which token it was based on the DFA state that it's in.
If you have a lot of states (PostgreSQL has > 400 keywords to identify), the assembly encoding the switch will be several KB in size, so you'll have cache misses.
Perfect hashing trades more computation for fewer cache misses.
[edit] Also, you shouldn't forget that sparse switch/case statements aren't O(1).
That's the advantage of the state machine, but in such a table the lengths are already known at compile time, and therefore the final check will be optimized to use a word-wise aligned memcmp, which beats the branchy lexer code by miles.
I don't understand what you mean, can you elaborate? Are you talking about perfect hashing? Or are you talking about doing multiple strcmps? Or about doing the state machine until you reach a unique alternative, and match the remainder using strcmp?
What I was trying to say is that the lexer is already doing a state machine, and the keyword can be determined from the state of the state machine, rather than re-determining it from the token substring itself.
A perfect hash is one in which there are no collisions for a given data set, which guarantees O(1), or constant, lookup time. This patch introduces a Perl script which computes a perfect hash function at build time, given the current set of SQL keywords.
The old keyword lookup method uses binary search, which compares O(log n) entries in the worst case before finding a match.
Making 1 keyword comparison instead of log n keyword comparisons yields the 20% speedup.
> Making 1 keyword comparison instead of log n keyword comparisons yields the 20% speedup.
Worthwhile to note that that speedup is for the raw parsing, without parse-analysis (i.e. looking up the tables referenced and such), planning (figuring out how to execute the query) and execution. So end-user impact will be much smaller than 20%. But for some cases, e.g. when many columns are referenced, it'll be measurable.
Using a perfect hash table is definitely cool, but it's hard to imagine a query with so many keywords that this would have a measurable impact on total query execution time.
Due to the desire to have keywords that aren't fully reserved (we have keyword classes that are fully reserved, can appear as type names, can appear as column names, or can appear nearly everywhere) we/PG need to look up whether a specific string is a keyword or an identifier in a number of places in the scanner/parser.
Addendum:
And that indeed means that column references, unless quoted with "", used to go through the binary search and now go through the perfect hash.
Because they're using Flex & Bison, not Ragel. The latter is specialized for and excels at defining and applying finite state machines. But while the design of Ragel makes it relatively trivial to build ASTs during the parse, you still have to do it manually. Flex & Bison automate much more of the process of transforming an input stream to a tree. Dropping them and moving to Ragel would be a serious refactor.
FWIW, Flex also builds DFAs. It's just that the design of Ragel is so elegant that Ragel DFA machines are more flexible, easier to compose (both in terms of combining abstract machines and in terms of language-level integration), and yet still incomparably faster. While a straight DFA match should in principle be at least as efficient as a perfect hash[1], this is really only realized with something like Ragel or Intel Hyperscan. As awesome as RE2 is, the performance isn't in the same league as those two; ditto for JIT'd PCREs and re2c--neither are comparable to Ragel or Hyperscan.
[1] At least until the input set grows into the thousands or millions of items--at some point the size of the generated DFA grows faster than and overtakes the size of the generated perfect hash function.
Wouldn't a DFA be faster than perfect hashing here if only because non-keywords can collide with keyword hash values, so a strcmp is always going to be necessary regardless of the actual keyword differentiation?
I tried looking at the code to see if they were excluding non-keywords at some other stage - because then this would seem redundant?
Yeah, conceptually for smaller sets a DFA should easily meet or beat perfect hashing for the reason you describe.
But because of pipelining, branch prediction, etc, it depends on the details of how the generated parser scans the input stream. A simple hash function is more likely to execute in a tight loop with minimal data dependencies because it's more natural to code it that way; it might be able to consume the stream faster even if performing more work simply because of how it slices the work. I'm not positive, but AFAICT Flex only generates table-driven machines, so at the very minimum the scanner is already handicapped. A tiny table that fits in cache might not be much of a handicap, but I'd be surprised if Flex does an especially good job of that. And why would it--it's designed to work in tandem with Bison to solve more complex problems, plus it's hemmed in by having to support historical program structures and interfaces, such as liberal use of functions and function pointers for composition.
It seems that this is a special case of hashing, indicated by the word 'perfect', where it somehow (I did not read the article and I don't know how this is done) generates a guaranteed collision-free hash table. While you can do the same with cryptographic hash functions, I'm sure those are much slower than what is implemented as a non-cryptographically-secure hash function in databases.
That reminds me: some years ago, I created a pull request for Redis to use perfect hashing with gperf to parse commands. IIRC, the author turned it down because he was wary of adding the build-time dependency on gperf.
That's awkward, because the gperf dependency is only needed every 2 years or so, when a new keyword is added. The generated C code needs to be added to git, and only needs a regen if changed. I do that all the time via gperf or via some perl-generated perfect hash tables, e.g. for static keywords, config tables, Unicode tables, ...
The memory win is always dramatic, like 20x better typically.
I also use some standard gperf-fixup scripts for easier inclusion into my code.
Yes, you can, but this approach is better. Pearson would beat CHM on very small CPUs (8-16 bit), but it is not guaranteed to find a good permutation. With CHM it's much easier.
The most important thing is to store the keywords for the false-positive verification in a contiguous const char *const array.
Since there is a lot of discussion about alternative solutions to the keyword lookup problem, I was wondering why tries haven't been mentioned yet.
Since the current implementation has to do a string compare after the hash, wouldn't a trie be at least as fast? Am I missing something? Are there other disadvantages?
For the trie, you still would have to do multiple character comparisons (on level 1, on average about 13, I guess, given that I read there are 400 keywords, and 26 characters in the alphabet).
Presumably, hashing the string is faster. I think that's partly because it introduces fewer cache misses than walking a trie. Both this approach and a trie have to walk the input string, but code using a trie jumps around the trie, while this code only compares against the keyword, which is stored in contiguous bytes.
Another reason might be that trie-based code is worse for branch prediction logic.
Regarding your first point: one possible implementation of tries uses an array of length "size of alphabet" to store the children of a node. So at each node it's just an offset into an array, not multiple character comparisons.
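For illustration, a minimal sketch of that array-children representation (assuming keywords restricted to lowercase a-z; the names are made up):

    /* Sketch: trie node with children indexed directly by letter, so
       descending one level is a single array index, not a comparison loop. */
    #define ALPHABET_SIZE 26

    struct trie_node {
        struct trie_node *children[ALPHABET_SIZE];  /* child for 'a' + i, or NULL */
        int keyword_id;                             /* -1 if no keyword ends here */
    };

    /* Returns the keyword id, or -1 if word is not in the trie. */
    static int trie_lookup(const struct trie_node *node, const char *word)
    {
        for (; *word && node; word++)
            node = node->children[*word - 'a'];
        return node ? node->keyword_id : -1;
    }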
Of course your other points are valid. Thank you for your answer. So I gather the differences are rather subtle. I was wondering if I was missing something bigger.
Even assuming just a-z, that array will be mostly empty at lower levels, and most nodes are at lower levels. That’s not good for avoiding cache misses.
There are tricks to counteract that (you don’t store pointers to lower levels, for example, but offsets, because at typical sizes of a trie you can store them in way fewer bits), and I haven’t timed any implementation for decades, but it would surprise me if such an approach would improve performance on lower levels.
You could use it at the top level (basically replacing the trie with separate tries for each starting letter).
Also, the 'search for a character in an array' code can be sped up by using vector instructions nowadays (can't find a link for the exact code, but http://0x80.pl/articles/simd-strfind.html may give ideas). Whether you want to do that because of power-usage concerns is a completely different can of worms.
Someone else can probably comment better, but hashing seems more efficient memory-wise than holding the trie in memory, and hashing is probably also better cache-wise compared to jumping between the nodes of the trie.
Building and peeling a graph should be identical, but the assign step is different. I use the non-minimal, non-order-preserving BDZ scheme, while Joerg's code is based on the original CHM scheme.
Given keys A, B, C and D, CHM always maps them to values 0, 1, 2 and 3. BDZ, on the other hand, can generate values in a range [0, 2.1 * 4) and it doesn't preserve an order. For instance, one possible mapping in the BDZ scheme is 7, 2, 5, 1.
I think this is one of the classes of problems that Jonathan Blow identifies. Instead of having a Turing-complete macro system, you have regular functions that evaluate at compile time. You'd generate your perfect hash in the same language you're working in.
D does this, although it's not necessarily one or the other. C++ has the C preprocessor, templates, and constexpr. Rust has macros, but is looking at adding const function evaluation too.
It can be made good. I used it for parsing /proc/PID/status keywords at a very high rate to support "top" and "ps". You'll have to change the generated code, which likely means that you use gperf during development instead of running it during every build.
The first step was to put only the needed keywords in a file, without the colons. I ran gperf with options to choose 3 specific character offsets and to use the length in the calculation; this was best after I tried a few choices. I then cleaned up the code and optimized it. I rounded the table size up to a power of 2, padding it correctly, and changed the length check to a mask. I switched the code over to using gcc's computed goto extension. I provided the keyword lengths, letting me use memcmp instead of strcmp. I inlined everything.
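A rough sketch of the shape that kind of hand-tuned lookup can end up taking (this is illustrative only, not the commenter's actual code; the table, constants, and keyword_hash() are assumptions):

    /* Sketch: perfect-hash lookup with a power-of-two table indexed by a mask,
       and per-entry lengths so memcmp() can be used instead of strcmp(). */
    #include <string.h>

    #define TABLE_SIZE 64                      /* rounded up to a power of 2 */
    #define TABLE_MASK (TABLE_SIZE - 1)

    struct kw { const char *name; unsigned char len; int id; };
    extern const struct kw table[TABLE_SIZE];  /* padded; empty slots have len 0 */
    extern unsigned int keyword_hash(const char *s, unsigned int len);

    static int lookup(const char *s, unsigned int len)
    {
        const struct kw *e = &table[keyword_hash(s, len) & TABLE_MASK];
        if (e->len == len && memcmp(s, e->name, len) == 0)
            return e->id;
        return -1;
    }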
It's a reasonable choice for the use-case at hand. I used a similar approach to generate C code using Python with Jinja2. Ironically, I generated a binary hash-based lookup, because the hash values ought to stay constant even when new keys are added in future versions.
Devel::Tokenizer::C generates horrible code, even worse than gperf. Much better is my Perfect::Hash module, this PostgreSQL module, or the new perl5 mph script.
Haven't checked re2c, but it looks like a naive lexer implementation generating a static DFA bytewise, optimized for neither speed nor memory.
http://ilan.schnell-web.net/prog/perfect-hash/
He also has a superb illustrated explanation of the algorithm, a unique property of which is that you get to choose the hash values:
http://ilan.schnell-web.net/prog/perfect-hash/algo.html
I've been using it for years.