Designing a fast hash table (ilikebigbits.com)
67 points by vasili111 on Aug 29, 2016 | 34 comments



I recommend the Robin Hood open addressing scheme. It's particularly trivial to implement, and because it relies on linear probing, it's memory-cache friendly.

See more here: http://www.sebastiansylvan.com/post/robin-hood-hashing-shoul...
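
Roughly, the insert loop looks like this (a sketch only, with made-up names and a placeholder hash mix; fixed power-of-two capacity, no resizing or erasing):

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct RobinHoodSet {
        struct Slot { uint64_t key = 0; bool used = false; };

        std::vector<Slot> slots;
        size_t mask;  // capacity - 1 (capacity must be a power of two)

        explicit RobinHoodSet(size_t capacity_pow2)
            : slots(capacity_pow2), mask(capacity_pow2 - 1) {}

        static uint64_t hash(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }

        // How far a key currently sits from the slot it hashes to.
        size_t probe_distance(uint64_t key, size_t index) const {
            return (index + slots.size() - (hash(key) & mask)) & mask;
        }

        void insert(uint64_t key) {
            size_t index = hash(key) & mask;
            size_t dist = 0;  // probe distance of the key we are carrying
            for (;;) {
                Slot& s = slots[index];
                if (!s.used) { s.key = key; s.used = true; return; }
                if (s.key == key) return;  // already present
                // Robin Hood rule: if the resident is closer to its ideal slot
                // than the key being inserted, evict it and carry it forward.
                size_t resident = probe_distance(s.key, index);
                if (resident < dist) {
                    std::swap(s.key, key);
                    dist = resident;
                }
                index = (index + 1) & mask;
                ++dist;
            }
        }
    };

A real table also needs to grow when the load factor climbs, and an erase (backward-shift deletion or tombstones), but the loop above is the core of the scheme: all probing is a linear scan over a flat array.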

Jeff Preshing has written many interesting blog posts about the subject. For example, see his blog post on Leapfrog Probing (http://preshing.com/20160314/leapfrog-probing/).


Yeah, most (all?) implementations of unordered_set are really slow. My understanding is that the iterator invalidation behavior specified by the standard forces implementations to use chaining, which means allocating constantly on insertions and chasing pointers on every lookup.
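
Roughly, the two layouts being contrasted look like this (an illustrative sketch, not any particular library's internals):

    #include <cstdint>
    #include <vector>

    // Chaining: every insert heap-allocates a node, and every lookup walks a
    // linked list whose nodes are scattered across the heap.
    struct ChainNode {
        uint64_t key;
        uint64_t value;
        ChainNode* next;
    };
    struct ChainedMap {
        std::vector<ChainNode*> buckets;  // one list head per bucket
    };

    // Open addressing: entries live in one flat array, so probing touches
    // contiguous memory and inserting needs no per-element allocation.
    struct FlatMap {
        struct Entry { uint64_t key; uint64_t value; bool used; };
        std::vector<Entry> entries;
    };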


Yes, I guess the designers of the standard library anticipated that programmers could easily make mistakes with the lifetime of iterators, and this was their best way to safeguard against it. Rust enforces constraints on iterator usage at compile time, so this is a clear example where Rust would be a superior language.


My guess is that the hash_map of the original STL had those iterator guarantees. While hash_map was never standardized, it was de facto available in many standard libraries. When the committee standardized unordered_map, they tried to make it as much of a drop-in replacement for hash_map as they could, and subtly different invalidation rules would have been extremely hard to catch.


I've heard it was due to std::unordered_map trying to be a drop-in replacement for std::map.


> The advantage of using power-of-two is that we don't need to look for prime numbers, and we don't need to do an expensive modulo operation (we just need a dirt-cheap bit mask).

I'd heard that point about using a mask instead of modulo to fit the hash into the table range before, but never paid much attention to it. I figured that with all of the other work going on in a hash table, the difference of one or two arithmetic operations would be negligible.

Right now, I'm working on a book on programming language interpreters. That includes implementing a hash table from scratch. I used "%" because I felt it made the behavior easier for readers to understand.

After I got it all up and running, I discovered that some microbenchmarks were spending something like 10% of their total execution time on that "%". I changed it to use a mask and "&", and it made a dramatic performance difference.

So I think I'll be explaining both approaches in the book. The former for clarity and the latter as an example of an optimization.
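
The two index computations look roughly like this (a sketch with made-up names; the mask trick only works when the capacity is a power of two):

    #include <cstddef>
    #include <cstdint>

    // Works for any table capacity, but 64-bit integer division/modulo is
    // relatively slow.
    size_t bucket_index_mod(uint64_t hash, size_t capacity) {
        return static_cast<size_t>(hash % capacity);
    }

    // Requires the capacity to be a power of two; the mask is a single cheap AND.
    size_t bucket_index_mask(uint64_t hash, size_t capacity_pow2) {
        return static_cast<size_t>(hash & (capacity_pow2 - 1));
    }

On typical x86 hardware a 64-bit integer division/modulo costs on the order of tens of cycles, while the AND is a single cycle, which is why it shows up so clearly in a tight hash-table loop.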


The post is very light on details about the measurements, and only compares to libc++'s std::unordered_{map,set} (no version number provided). There is no mention of space usage.

It's well known that std::unordered_map is slow (see panic's comment). Some more worthy opponents would be Google's dense_hash_map [1] or sparsepp [2].

[1] https://github.com/sparsehash/sparsehash

[2] https://github.com/greg7mdp/sparsepp


Some questions:

- Which CPU was used? Which compiler optimization flags? (2 million inserts per second seems slow for a std::unordered_map using 64-bit data.)

- How do HashSet read operations perform vs. std::unordered_map?

- How does external fragmentation handling in HashSet compare to std::unordered_map? (e.g. continuously writing and deleting random keys)


Amendment: I meant internal fragmentation, not external fragmentation (allocated hash buckets left empty after deleting elements).


I agree. Two million inserts per second is very low. It should be close to an order of magnitude more.


A bit off-topic, but I have wondered lately: if a new language were started today, what would be the very best (overall) foundations for it?

For example, I read here (https://cr.yp.to/critbit.html) that crit-bit tries could be a better base structure for a language (i.e. the way dict and list are for Python).

So, if we were starting today, what hash should we use? What would be a better default structure (my understanding is that hash maps are/were the way to go)?


I'm generally a big fan of DJB, but I see a few problems with this scheme.

First, it seems like the prefix-free requirement means that the strings can't contain internal NUL bytes. In Python you can store 'a\0' as well as 'a\0b' in a hash table -- how do you handle this with crit-bit trees?

Second, in Python, unlike JavaScript, arbitrary types can be hashable. Tuples are hashable. How do you store ('a', 0) and ('b', 1) as keys in a crit-bit tree?


Just store a bit (for a set) at each internal node, denoting whether a string ends there. For a map you also need a value pointer.

Hashing keys is also easy: add a pointer to a linked list of key-value pairs to each leaf node. To save space you could even use a flag bit and then a pointer to a single key-value pair, or a linked list in case of collision.

Note that you can store tuples directly as the concatenation of their fields; no need to use hashing.
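
If I'm reading you right, the node layout would be something like this (my own sketch of that suggestion, with invented field names, not DJB's implementation):

    #include <cstddef>
    #include <cstdint>

    struct Value;  // whatever the map stores

    // Internal node: the usual crit-bit byte/bit position, plus an "a key ends
    // here" marker (and a value pointer for map semantics).
    struct CritbitInternal {
        uint32_t crit_byte;      // byte position where the two subtrees differ
        uint8_t  crit_mask;      // bit within that byte
        bool     key_ends_here;  // set semantics: a stored key terminates here
        Value*   value;          // map semantics: non-null when a key ends here
        void*    child[2];       // tagged pointers to internal nodes or leaves
    };

    // Leaf node: the full key bytes, plus its value.
    struct CritbitLeaf {
        const uint8_t* key_bytes;
        size_t         key_len;
        Value*         value;
    };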


What are you talking about? How do tuples preclude the need for a way to hash?

That's a very naive way to map keys, using the delimiter of the tuple parts. You might have interior elements that are 10k long.


And strings can also be long.


I'm using Cedar to store arbitrary order preserving tuples.

After a lot of perf testing, that library, along with others, had the same issue of not allowing null bytes.

Initially I thought it was a deal breaker, especially since I needed to preserve order.

I ended up doing a possibly mutating prepass on any keys that might contain nulls (which is generally very fast, O(len(key))) before storing, to rewrite them without nulls.

It was still worth it.
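
For anyone wanting to do something similar, here is one escaping scheme that removes NUL bytes while keeping byte-wise lexicographic order (my guess at the kind of prepass meant here, not necessarily the exact one used):

    #include <string>

    // Map 0x00 -> 0x01 0x02 and 0x01 -> 0x01 0x03; every other byte is copied
    // unchanged. The output contains no NULs and compares in the same
    // lexicographic order as the input. Runs in O(len(key)).
    std::string escape_nulls(const std::string& key) {
        std::string out;
        out.reserve(key.size());
        for (unsigned char c : key) {
            if (c == 0x00)      { out.push_back('\x01'); out.push_back('\x02'); }
            else if (c == 0x01) { out.push_back('\x01'); out.push_back('\x03'); }
            else                { out.push_back(static_cast<char>(c)); }
        }
        return out;
    }

Decoding is just the reverse mapping, and since the escape only touches 0x00 and 0x01, most keys pass through unchanged.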


There's another design decision not considered in this article: using a separate bucket state table, or using sentinel values inside the buckets to represent 'empty', etc. I've measured sentinel values to be faster.
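
Roughly, the trade-off looks like this (an illustrative sketch, not the article's code or my measured code):

    #include <cstdint>
    #include <vector>

    // Option A: a separate state table. Costs one extra byte (or two bits) per
    // bucket, but any key value can be stored.
    enum class BucketState : uint8_t { Empty, Occupied, Tombstone };
    struct TableWithStates {
        std::vector<uint64_t>    keys;
        std::vector<BucketState> states;
    };

    // Option B: sentinel values inside the buckets. No extra array and one
    // fewer memory access per probe, but two key values become reserved.
    struct TableWithSentinels {
        static constexpr uint64_t kEmpty     = UINT64_MAX;
        static constexpr uint64_t kTombstone = UINT64_MAX - 1;
        std::vector<uint64_t> keys;  // initialized to kEmpty
    };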


It is an interesting idea.

There just seems to be a lot wrong with the presentation of the benchmark comparisons...

Specifically, where is a full suite of comparisons against other libraries for time and space, insert and retrieval? Hell, I make those graphs for other people's code, let alone my own.

Usually when you come up with a new algorithm you're so over the moon that you show how it compares all around. I'm not seeing that here, and that is a red flag.

I'll definitely give it a whirl, though, as it sounds promising, although I'm personally more interested in order-preserving hashes.


The author states:

"They allow you to insert, erase and look up things in amortized O(1) operations (O(√N) time)." Are they staying that the average time is (O(√N) and that is amortized to ~ O(1)?

I assume O(√N) is the same as .5 multiplied by N? Is that correct? I don't think I've seen a square root in Big O notation before.


Nah, that O(√N) comment in the article is a link to another blog post where the author claims that random memory accesses take O(√N) time.

I say "claim" because everything can be made to look like a fit on a log-log-plot. It's just memory hierarchy. (TLB misses do incur a logarithmic delay, because the page table usually uses a tree data structure, albeit with a high branching factor).


.5 multiplied by N is still O(N) time (.5 is a constant factor).

O(√N) means there will be at most C*√N operations (where C is some constant factor > 0).


No, O(√N) is O(N^0.5).


Sorry, yes, of course, that's what I meant. But O(1) is the amortization, with the average being O(N^0.5).


The amortization is the average and is O(1). The claim is that the worst case (when the table becomes relatively saturated and a resize is required) is O(N^0.5). I don't really buy that and the supplied link doesn't really make that claim either. By the argument in that link, it should be O(N^1.5) for rehashing. (Usually we think that copying N items is O(N), but he makes an argument from physics which I haven't thought through.)


If you read the linked blog post about cache speed (http://www.ilikebigbits.com/blog/2014/4/21/the-myth-of-ram-p...) it'll make a lot more sense.


That article is currently being discussed at https://news.ycombinator.com/item?id=12383012 - suffice it to say, there are numerous issues with the author's understanding of Big-O notation.


Oh thanks, also a good read. The author in this post, however, states:

"That I use Big O to analyze time and not operations is important."

This strikes me as an odd statement, though, as Big O is generally meant to describe running time or space, not operations. What am I missing?


Big O is really just saying that a function f(n) belongs to O(g(n)) if 0 <= f(n) <= c*g(n) for some c >= 0.

When he says he's using big O for time and not operations, he's saying that the quantity being bounded is time (as opposed to a count of operations).


You're making the same mistake that the author of that article is making. That condition only has to hold for sufficiently large values of n - i.e., n ≥ n₀ for some n₀.
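
For reference, the full textbook definition with that threshold spelled out:

    f(n) \in O(g(n)) \iff \exists\, c > 0,\ \exists\, n_0 \ge 0:\ 0 \le f(n) \le c \cdot g(n) \ \text{ for all } n \ge n_0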


Sure, I just thought it was an odd statement to be explicit about, as it's the usual case to have it express time and not operations.


Not quite. You're usually measuring time in a specific model. Most often, that's the RAM model, where memory access and arithmetic operations take constant time (defined as one time step). The author is using a different model, but doesn't formally define it and instead tries to make it fit wall-clock time. That requires far more elaborate modelling, ends up being extremely complicated, and usually doesn't yield all that much additional insight. That's why computer scientists usually use the RAM model - it captures the most important aspects of computers, but omits others as they're hard to model accurately. The memory hierarchy is one of these.

Some models where such things are studied in more detail include the external memory model (which has a two-level hierarchy - cache and main memory, or main memory and disk) and cache-oblivious models.

Most of the time, you don't need that extra information, though, and the RAM model suffices.


Thanks for clearing that up. I'm catching up on the other discussion now : )


The time being amortized O(1) means not only that the average is O(1), but that your use case is guaranteed to average out to O(1) if you do enough operations. There is no chance you'll get an O(n^0.5) run just because you got bad data or were unlucky.


The only hash table worth competing with is Cedar. It could be even better, but the standard distribution is pretty damn awesome as it is, IMHO, and I've tested a lot.



