
This misses the point which is two-part:

1) A lot of hash tables use modulo to map a large range to a small range, and this is slow.

2) Fibonacci hashing is, according to the author, both faster and, as a side effect, more robust to bad inputs.

Yes, you should improve your hash function, but assuming the author is right, you should still prefer this over integer modulo as the default way to reduce the input hash value to the size of the hash table, because it's faster, and the extra mixing is just a bonus.

Also, for library implementations where the implementer has no control over which hash function the user provides, being resistant to pathological inputs is good. Especially if you get it as a free side effect of a performance improvement.
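To make the speed claim concrete, here is a minimal sketch (Rust, with names of my own choosing) of the two reductions: plain integer modulo compiles down to a hardware divide, which typically costs tens of cycles, while the Fibonacci multiply-shift is a single multiply plus a shift.

    const FIB: u64 = 11400714819323198485; // roughly 2^64 / golden ratio

    fn slot_modulo(hash: u64, num_slots: u64) -> u64 {
        hash % num_slots // works for any table size, but pays for a division
    }

    fn slot_fibonacci(hash: u64, bits: u32) -> u64 {
        // table size is 2^bits; the top bits of the product pick the slot
        hash.wrapping_mul(FIB) >> (64 - bits)
    }

    fn main() {
        assert!(slot_modulo(12345, 1000) < 1000);
        assert!(slot_fibonacci(12345, 10) < (1u64 << 10));
    }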



> Yes, you should improve your hash function, but assuming the author is right, you should still prefer this over integer modulo as the default way to reduce the input hash value to the size of the hash table, because it's faster, and the extra mixing is just a bonus.

What we're comparing here are these two:

    // We're only dealing with table sizes of num_slots = 2^b.
    // hash() returns a 64-bit value; k is an odd 64-bit multiplier
    // (roughly 2^64 / golden ratio in the Fibonacci hashing case).

    uint64_t key_to_slot1(Key key) {
        return hash(key) & (num_slots - 1);
    }

    uint64_t key_to_slot2(Key key) {
        return (hash(key) * k) >> (64 - b);
    }
The only difference is: (1) we're doing an extra multiplication, and (2) we're using the high bits instead of the low bits.

This should _only_ help if you use a terrible hash function. I'm also not sure why they claim that integer modulo is slow when they show how to make it fast by using bitwise AND.

> Also, for library implementations where the implementer has no control over which hash function the user provides, being resistant to pathological inputs is good. Especially if you get it as a free side effect of a performance improvement.

But we're designing the hash table! There's no reason for us to make it possible for the user to provide a custom hash function. We've learnt this before: Instead of letting the user provide `int hashCode(Key key)` (as in Java), we use `void hash(Hash state, Key key)` (as in Rust). The user will tell us which sources of data need to be hashed, but we can provide the hash function and mixing functionality.
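For illustration, this is roughly what that split looks like with Rust's actual std::hash traits (the Key type here is made up): the key type only describes which data feeds the hasher, while whoever owns the table supplies the hasher and any final mixing.

    use std::hash::{Hash, Hasher};

    struct Key {
        id: u64,
        name: String,
    }

    // The key type only says *what* gets hashed; it never picks the algorithm.
    impl Hash for Key {
        fn hash<H: Hasher>(&self, state: &mut H) {
            self.id.hash(state);
            self.name.hash(state);
        }
    }

    fn main() {
        use std::collections::hash_map::DefaultHasher;
        // The table (here, the caller) chooses the hasher implementation.
        let mut state = DefaultHasher::new();
        Key { id: 7, name: "x".into() }.hash(&mut state);
        println!("{:x}", state.finish());
    }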


> This should _only_ help if you use a terrible hash function.

So in other words it helps. The point being that when you're writing a library where you have no control over what people will pass you, you can either punt on doing the best you can and leave it as the user's problem, or, when (like in this case) there is a trivially simple option that makes things better, you can make it better.

> I'm also not sure why they claim that integer modulo is slow when they show how to make it fast by using bitwise AND.

They show that integer modulo is slow because a lot of code uses an actual modulo rather than a bitwise AND (granted, often because they assume they need a prime-sized hash table to do well).

EDIT: Note that the article specifically also addresses the bitwise AND by pointing out that Dinkumware, in using it, 1) had to compensate by using a more expensive hash for integers, and 2) created a situation where implementers providing a custom hash had to be aware that this specific implementation would just throw away the high bits, and the author notes he'd been bitten by that issue several times. It also mentions how Google's google::dense_hash_map did the same thing, but without even providing its own integer hash replacement to compensate.
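To make the "throws away the high bits" failure mode concrete, here's a tiny illustration (the hash is hypothetical): a custom hash whose entropy lives only in the high bits collapses to a single slot under low-bit masking, while a multiply-shift that reads the top bits of the product would still spread it out.

    // Hypothetical user hash: values are distinct, but only in the high bits.
    fn user_hash(x: u64) -> u64 {
        x << 48
    }

    fn main() {
        let mask = (1u64 << 10) - 1; // power-of-two table, Dinkumware-style masking
        for x in 0..4u64 {
            // Every key lands in slot 0: the masked low bits are all zero.
            assert_eq!(user_hash(x) & mask, 0);
        }
    }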

> But we're designing the hash table! There's no reason for us to make it possible for the user to provide a custom hash function.

He's comparing it to the C++ stdlib. An implementation that is intended to be a drop-in replacement for that needs to abide by similar guarantees, so providing a custom hash function is possible under the contracts of the libraries he compares against, and it's fairly common to do so.

The logical conclusion is that there is a reason for him to make it possible for the user to provide a custom hash function, because it's what the clients he wants to support expect.

Sure, when you have control over the API and you are free to refuse to let users of the API set custom hash functions, just provide a better hash function.

But that is explicitly not the case this article is addressing, so you're arguing against a strawman.

The argument that another solution to this issue is to just refuse to let users provide the hash function would be entirely reasonable, and maybe it'd have been nice if the article started by addressing that.


> But we're designing the hash table! There's no reason for us to make it possible for the user to provide a custom hash function.

Technically Rust users are also welcome to provide a custom hash function; it's just defined in terms of an API (the Hasher trait) that is called by everything that wants to be hashed. You can get a drop-in FNV1a hash, for example: https://crates.io/crates/fnv

The mechanics to make all this work in Rust involve a proc macro, so people who want to be able to put a Foozle in a HashMap (or an FNV HashMap, or their own custom hashtable) just write #[derive(Hash,Eq,PartialEq)] above the definition of Foozle, telling Rust: yeah, just look at the constituent parts of this type and use them to derive functions for equality and hashing the whole type. C++ macros aren't powerful enough to do that, so you'd need a more powerful reflection mechanism and templates to do it, I think.
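As a small usage sketch (assuming the fnv crate linked above; Foozle's fields are made up), the derive plus a different map alias is all it takes to swap the hash function:

    use fnv::FnvHashMap;

    // The derive generates hashing and equality from the struct's fields.
    #[derive(Hash, Eq, PartialEq)]
    struct Foozle {
        id: u32,
        label: String,
    }

    fn main() {
        // Same HashMap API, different hasher: the table owns the hash function,
        // the key type only describes what gets fed into it.
        let mut scores: FnvHashMap<Foozle, u64> = FnvHashMap::default();
        scores.insert(Foozle { id: 1, label: "a".into() }, 10);
        assert_eq!(scores[&Foozle { id: 1, label: "a".into() }], 10);
    }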


In Stockfish, a multiplication is used to index the transposition table, and only the lowest 16 bits of the key are stored in each entry to detect collisions. This is because storing the full hash, or god forbid the board state, would at least double the overhead per entry, which is itself 64 bits, not counting the 16-bit key.
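A generic sketch of that kind of scheme (not Stockfish's actual code; the names and the multiplier are illustrative) would be: index by a multiply-shift of the 64-bit position key, and keep only the low 16 bits of the key in the entry to catch most, but not all, collisions.

    struct TtEntry {
        key16: u16,     // low 16 bits of the position key
        value: i16,     // cached evaluation
        best_move: u16, // packed move
    }

    struct TranspositionTable {
        entries: Vec<TtEntry>, // length is 2^bits
        bits: u32,
    }

    impl TranspositionTable {
        fn probe(&self, key: u64) -> Option<&TtEntry> {
            // Multiply-shift maps the 64-bit key to an index; any such mapping works.
            let idx = (key.wrapping_mul(11400714819323198485) >> (64 - self.bits)) as usize;
            let entry = &self.entries[idx];
            // Two positions can share both the index and the low 16 bits,
            // so a "hit" can still be a false positive.
            if entry.key16 == key as u16 { Some(entry) } else { None }
        }
    }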


So does this mean Stockfish can have situations where it can't tell the difference between board position A and board position B because they have the same key? This seems like an opportunity for some (presumably small) fraction of completely spurious losses where Stockfish is pursuing position X (a clear winner) by taking steps towards position Y (a loss which may look very obviously different to a human) and only realises when it's too late.


Yes, but this has much less of an impact than you might think. The transposition table is used to cache the evaluation and best move of previously examined positions, which can speed up the search by guiding it in the right direction. But probably the worst thing that can happen if it gets data from the wrong position is that it wastes some time looking at a bad move (also quite likely is that the move is illegal and is simply skipped), and then has to re-search the position because the result made no sense. Even if in theory it could lead to a misevaluation, ultimately Stockfish's move is decided by a vote between different search threads, so there would have to be a conspiracy between multiple threads as well. It's pretty unlikely they'll all see the same collision, because they're set up to emergently end up in different parts of the search space. And the transposition table is constantly being updated.


AFAIK most high-performance modern hash tables use power-of-two sizes anyway.



