Designing a fast hash table (ilikebigbits.com)
67 points by vasili111 on Aug 29, 2016 | 34 comments



I recommend the Robin Hood open addressing scheme. It's particularly trivial to implement, and because it relies on linear probing, it's memory-cache friendly.

See more here: http://www.sebastiansylvan.com/post/robin-hood-hashing-shoul...
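
Roughly, the insert loop looks like this (a sketch only, with made-up names and a placeholder hash mix; fixed power-of-two capacity, no resizing or erasing):

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct RobinHoodSet {
        struct Slot { uint64_t key = 0; bool used = false; };

        std::vector<Slot> slots;
        size_t mask;  // capacity - 1 (capacity must be a power of two)

        explicit RobinHoodSet(size_t capacity_pow2)
            : slots(capacity_pow2), mask(capacity_pow2 - 1) {}

        static uint64_t hash(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }

        // How far a key currently sits from the slot it hashes to.
        size_t probe_distance(uint64_t key, size_t index) const {
            return (index + slots.size() - (hash(key) & mask)) & mask;
        }

        void insert(uint64_t key) {
            size_t index = hash(key) & mask;
            size_t dist = 0;  // probe distance of the key we are carrying
            for (;;) {
                Slot& s = slots[index];
                if (!s.used) { s.key = key; s.used = true; return; }
                if (s.key == key) return;  // already present
                // Robin Hood rule: if the resident is closer to its ideal slot
                // than the key being inserted, evict it and carry it forward.
                size_t resident = probe_distance(s.key, index);
                if (resident < dist) {
                    std::swap(s.key, key);
                    dist = resident;
                }
                index = (index + 1) & mask;
                ++dist;
            }
        }
    };

A real table also needs to grow when the load factor climbs, and an erase (backward-shift deletion or tombstones), but the loop above is the core of the scheme: all probing is a linear scan over a flat array.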

Jeff Preshing has written many interesting blog posts about the subject. For example, see his blog post on Leapfrog Probing (http://preshing.com/20160314/leapfrog-probing/).


Yeah, most (all?) implementations of unordered_set are really slow. My understanding is that the iterator invalidation behavior specified by the standard forces implementations to use chaining, which means allocating constantly on insertions and chasing pointers on every lookup.
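
Roughly, the two layouts being contrasted look like this (an illustrative sketch, not any particular library's internals):

    #include <cstdint>
    #include <vector>

    // Chaining: every insert heap-allocates a node, and every lookup walks a
    // linked list whose nodes are scattered across the heap.
    struct ChainNode {
        uint64_t key;
        uint64_t value;
        ChainNode* next;
    };
    struct ChainedMap {
        std::vector<ChainNode*> buckets;  // one list head per bucket
    };

    // Open addressing: entries live in one flat array, so probing touches
    // contiguous memory and inserting needs no per-element allocation.
    struct FlatMap {
        struct Entry { uint64_t key; uint64_t value; bool used; };
        std::vector<Entry> entries;
    };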


Yes, I guess the designers of the standard library anticipated that programmers could easily make mistakes with the lifetime of iterators, and this was their best way to safeguard against it. Rust enforces constraints on iterator usage at compile time, so this is a clear example where Rust would be a superior language.


My guess is that the hash_map of the original STL had those iterator guarantees. While hash_map was never standardized, it was de facto available in many standard libraries. When the committee standardized unordered_map, they tried to make it as much of a drop-in replacement for hash_map as they could, and subtly different invalidation rules would have been extremely hard to catch.


I've heard it was due to std::unordered_map trying to be a drop-in replacement for std::map.


> The advantage of using power-of-two is that we don't need to look for prime numbers, and we don't need to do an expensive modulo operation (we just need a dirt-cheap bit mask).

I'd heard that point about using a mask instead of modulo to fit the hash into the table range before, but never paid much attention to it. I figured that with all of the other work going on in a hash table, the difference of one or two arithmetic operations would be negligible.

Right now, I'm working on a book on programming language interpreters. That includes implementing a hash table from scratch. I used "%" because I felt it made the behavior easier for readers to understand.

After I got it all up and running, I discovered that some microbenchmarks were spending something like 10% of their total execution time on that "%". I changed it to use a mask and "&", and it made a dramatic performance difference.

So I think I'll be explaining both approaches in the book. The former for clarity and the latter as an example of an optimization.
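
The two index computations look roughly like this (a sketch with made-up names; the mask trick only works when the capacity is a power of two):

    #include <cstddef>
    #include <cstdint>

    // Works for any table capacity, but 64-bit integer division/modulo is
    // relatively slow.
    size_t bucket_index_mod(uint64_t hash, size_t capacity) {
        return static_cast<size_t>(hash % capacity);
    }

    // Requires the capacity to be a power of two; the mask is a single cheap AND.
    size_t bucket_index_mask(uint64_t hash, size_t capacity_pow2) {
        return static_cast<size_t>(hash & (capacity_pow2 - 1));
    }

On typical x86 hardware a 64-bit integer division/modulo costs on the order of tens of cycles, while the AND is a single cycle, which is why it shows up so clearly in a tight hash-table loop.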


The post is very light on details about the measurements, and only compares to libc++'s std::unordered_{map,set} (no version number provided). There is no mention of space usage.

It's well known that std::unordered_map is slow (see panic's comment). Some more worthy opponents would be Google's dense_hash_map [1] or sparsepp [2].

[1] https://github.com/sparsehash/sparsehash

[2] https://github.com/greg7mdp/sparsepp


Some questions:

- Which CPU was used? Which compiler optimization flags? (2 million inserts per second seems slow for a std::unordered_map using 64-bit data.)

- How do HashSet read operations perform vs. std::unordered_map?

- How does external fragmentation handling in HashSet compare to std::unordered_map? (e.g. continuously writing and deleting random keys)


Amendment: I meant internal fragmentation, not external fragmentation (allocated hash buckets left empty after deleting elements).


I agree. Two million inserts per second is very low. It should be close to an order of magnitude more.


A bit off-topic, but I have wondered lately: if a new language were started today, what would be the very best (overall) foundations for it?

For example, I read here (https://cr.yp.to/critbit.html) that crit-bit tries could be a better base structure for a language (i.e. the way dict and list are for Python).

So, if we were starting today, what hash should we use? What would be a better default structure (my understanding is that hash maps are/were the way to go)?


I'm generally a big fan of DJB, but I see a few problems with this scheme.

First, it seems like the prefix-free requirement means that the strings can't contain internal NUL bytes. In Python you can store 'a\0' as well as 'a\0b' in a hash table -- how do you handle this with crit-bit trees?

Second, in Python, unlike JavaScript, arbitrary types can be hashable. Tuples are hashable. How do you store ('a', 0) and ('b', 1) as keys in a crit-bit tree?


Just store a bit (for a set) at each internal node, denoting whether a string ends there. For a map you also need a value pointer.

Hashing keys is also easy: add a pointer to a linked list of key-value pairs to each leaf node. To save space you could even use a flag bit and then a pointer to a single key-value pair, or a linked list in case of collision.

Note that you can store tuples directly as the concatenation of their fields; no need to use hashing.
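
If I'm reading you right, the node layout would be something like this (my own sketch of that suggestion, with invented field names, not DJB's implementation):

    #include <cstddef>
    #include <cstdint>

    struct Value;  // whatever the map stores

    // Internal node: the usual crit-bit byte/bit position, plus an "a key ends
    // here" marker (and a value pointer for map semantics).
    struct CritbitInternal {
        uint32_t crit_byte;      // byte position where the two subtrees differ
        uint8_t  crit_mask;      // bit within that byte
        bool     key_ends_here;  // set semantics: a stored key terminates here
        Value*   value;          // map semantics: non-null when a key ends here
        void*    child[2];       // tagged pointers to internal nodes or leaves
    };

    // Leaf node: the full key bytes, plus its value.
    struct CritbitLeaf {
        const uint8_t* key_bytes;
        size_t         key_len;
        Value*         value;
    };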


What are you talking about? How do tuples preclude the need for a way to hash?

That's a very naive way to map keys, using the delimiter of the tuple parts. You might have interior elements that are 10k long.


And strings can also be long.


I'm using Cedar to store arbitrary order preserving tuples.

After a lot of perf testing, that library, along with others, had the same issue of not allowing null bytes.

Initially I thought it was a deal breaker, especially since I needed to preserve order.

I ended up doing a possibly mutating prepass on any keys that might contain nulls (which is generally very fast, O(len(key))) before storing, to rewrite them without nulls.

It was still worth it.
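
For anyone wanting to do something similar, here is one escaping scheme that removes NUL bytes while keeping byte-wise lexicographic order (my guess at the kind of prepass meant here, not necessarily the exact one used):

    #include <string>

    // Map 0x00 -> 0x01 0x02 and 0x01 -> 0x01 0x03; every other byte is copied
    // unchanged. The output contains no NULs and compares in the same
    // lexicographic order as the input. Runs in O(len(key)).
    std::string escape_nulls(const std::string& key) {
        std::string out;
        out.reserve(key.size());
        for (unsigned char c : key) {
            if (c == 0x00)      { out.push_back('\x01'); out.push_back('\x02'); }
            else if (c == 0x01) { out.push_back('\x01'); out.push_back('\x03'); }
            else                { out.push_back(static_cast<char>(c)); }
        }
        return out;
    }

Decoding is just the reverse mapping, and since the escape only touches 0x00 and 0x01, most keys pass through unchanged.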


There's another design decision not considered in this article: using a separate bucket state table, or using sentinel values inside the buckets to represent 'empty', etc. I've measured sentinel values to be faster.
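
Roughly, the trade-off looks like this (an illustrative sketch, not the article's code or my measured code):

    #include <cstdint>
    #include <vector>

    // Option A: a separate state table. Costs one extra byte (or two bits) per
    // bucket, but any key value can be stored.
    enum class BucketState : uint8_t { Empty, Occupied, Tombstone };
    struct TableWithStates {
        std::vector<uint64_t>    keys;
        std::vector<BucketState> states;
    };

    // Option B: sentinel values inside the buckets. No extra array and one
    // fewer memory access per probe, but two key values become reserved.
    struct TableWithSentinels {
        static constexpr uint64_t kEmpty     = UINT64_MAX;
        static constexpr uint64_t kTombstone = UINT64_MAX - 1;
        std::vector<uint64_t> keys;  // initialized to kEmpty
    };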


It is an interesting idea.

There just seems to be a lot wrong with the presentation of the benchmark comparisons...

Specifically, where is a full suite of comparisons against other libraries for time and space, insert and retrieval? Hell, I make those graphs for other people's code, let alone my own.

Usually when you come up with a new algorithm you're so over the moon that you show how it compares all around. I'm not seeing that here, and that is a red flag.

I'll definitely give it a whirl, though, as it sounds promising, although I'm personally more interested in order-preserving hashes.


The author states:

"They allow you to insert, erase and look up things in amortized O(1) operations (O(√N) time)." Are they staying that the average time is (O(√N) and that is amortized to ~ O(1)?

I assume O(√N) is the same as .5 multiplied by N? Is that correct? I don't think I've seen a square root in Big O notation before.


Nah, that O(√N) comment in the article is a link to another blog post where the author claims that random memory accesses take O(√N) time.

I say "claim" because everything can be made to look like a fit on a log-log-plot. It's just memory hierarchy. (TLB misses do incur a logarithmic delay, because the page table usually uses a tree data structure, albeit with a high branching factor).


.5 multiplied by N is still O(N) time (.5 is a constant factor).

O(√N) means there will be at most C*√N operations (where C is some constant factor > 0).


No, O(√N) is O(N^0.5).


Sorry, yes, of course, that's what I meant. But O(1) is the amortization, with the average being O(N^0.5).


The amortization is the average and is O(1). The claim is that the worst case (when the table becomes relatively saturated and a resize is required) is O(N^0.5). I don't really buy that and the supplied link doesn't really make that claim either. By the argument in that link, it should be O(N^1.5) for rehashing. (Usually we think that copying N items is O(N), but he makes an argument from physics which I haven't thought through.)


If you read the linked blog post about cache speed (http://www.ilikebigbits.com/blog/2014/4/21/the-myth-of-ram-p...) it'll make a lot more sense.


That article is currently being discussed at https://news.ycombinator.com/item?id=12383012 - suffice it to say, there are numerous issues with the author's understanding of Big-O notation.


Oh thanks, also a good read. The author in this post, however, states:

"That I use Big O to analyze time and not operations is important."

This strikes me as an odd statement, though, as Big O is generally meant to describe running time or space, not operations. What am I missing?


Big O is really just saying that a function f(n) belongs to O(g(n)) if 0 <= f(n) <= c*g(n) for some c >= 0.

When he says he's using big O for time and not operations, he's saying that the quantity being bounded is time (as opposed to a count of operations).


You're making the same mistake that the author of that article is making. That condition only has to hold for sufficiently large values of n - i.e., n ≥ n₀ for some n₀.
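
For reference, the full textbook definition with that threshold spelled out:

    f(n) \in O(g(n)) \iff \exists\, c > 0,\ \exists\, n_0 \ge 0:\ 0 \le f(n) \le c \cdot g(n) \ \text{ for all } n \ge n_0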


Sure, I just thought it was an odd statement to be explicit about, as it's the usual case to have it express time and not operations.


Not quite. You're usually measuring time in a specific model. Most often, that's the RAM model, where memory access and arithmetic operations take constant time (defined as one time step). The author is using a different model, but doesn't formally define it and instead tries to make it fit wall-clock time. That requires far more elaborate modelling, ends up being extremely complicated, and usually doesn't yield all that much additional insight. That's why computer scientists usually use the RAM model - it captures the most important aspects of computers, but omits others as they're hard to model accurately. The memory hierarchy is one of these.

Some models where such things are studied in more detail include the external memory model (which has a two-level hierarchy - cache and main memory, or main memory and disk) and cache-oblivious models.

Most of the time, you don't need that extra information, though, and the RAM model suffices.


Thanks for clearing that up. I'm catching up on the other discussion now : )


The time being amortized O(1) means not only that the average is O(1), but that your use case is guaranteed to average out to O(1) if you do enough operations. There is no chance you'll get an O(n^0.5) run just because you got bad data or were unlucky.


The only hash table worth competing with is Cedar. It could be even better, but the standard distribution is pretty damn awesome as it is, IMHO, and I've tested a lot.



