This is because hash tables resize when they are filled with too many elements. ...

eru · on Oct 29, 2019

If you are really paranoid, you can even use a salted cryptographic hash function. But the constant factors for those are usually worse than for simpler salted hashes, so they are not worth it for hashtables.

Your fallback for collisions could also be something with O(log size_of_bucket) runtime, instead of linked lists. But again, when you don't have many collisions, that's going to be slower than something simpler.

(I half remember a result that if you tune your parameters right, you can get away with a hashtable that only has enough size for storing one element in each slot; and if a collision happens, you replace that element with a tomb stone, and store put the elements involved in a single global linked list.

Basically, for that to work you need to keep your total number of collision smaller than a constant with high probability.)

LgWoodenBadger · on Oct 29, 2019

Once your hashes collide, you have no other mechanism to use besides equality. How can you use something other than a list in order to find the actual value? This being for the general case where items don't have an "order/rank/sort" property between them.

eru · on Oct 29, 2019

Most of the time, when you talk about collisions we mean that two different keys get hashed to the same bucket.

What mechanisms you have available depends on your implementation.

For example Cuckoo Hashing (https://en.wikipedia.org/wiki/Cuckoo_hashing) relies on having two different hash functions available.

And yes, for having something like a balanced tree, having a comparison would be useful.

In general, most hash functions work by consuming something like a stream of bits. So for your implementation it makes a lot of sense for the datatypes you want to use as keys to export a way to convert themselves into that stream, and leave the actual hashing into a fixed size as a detail of your hashtable.

That way you can eg do fall-back comparisons directly on stream of bits (including for equality). Or you can transparently support multiple hash functions.

Even in languages where the hashable interface works by giving you a method to spit out eg a 64 bit number only, you still have to map that number to one of your buckets. So for your fall-back, you can choose a different mapping.

LgWoodenBadger · on Oct 29, 2019

My point was "what else can you use besides a linked list to store hash-colliding values?"

If you have a second (or a third, or a fourth...) hashing algorithm, then make it hash tables all the way down. At the end, you still need some data structure to store hash-colliding values. And if so, what other structure could you possibly use besides a list (linked-, array-, etc.) ?

eru · on Oct 30, 2019

> If you have a second (or a third, or a fourth...) hashing algorithm, then make it hash tables all the way down. At the end, you still need some data structure to store hash-colliding values.

Why? You can have just two layers: the primary hash table and your buckets are made up of one small secondary hashtable each. If there's a collision in the bucket hashtable, pick a now random hash function for that bucket and re-hash everything in the bucket.

If that fails after trying a few times, pick a new random hash for the primary hash table and consider resizing.

I bet you can make that scheme workable in O(1) expected amortised time for inserts and lookups.

Cuckoo hashing (https://en.wikipedia.org/wiki/Cuckoo_hashing) is a related idea: you just have to hash functions. If you had three elements that collide for both hash functions each, you just re-hash your entire table.

(From Wikipedia:)

> When a new key is inserted, and one of its two cells is empty, it may be placed in that cell. However, when both cells are already full, it will be necessary to move other keys to their second locations (or back to their first locations) to make room for the new key. A greedy algorithm is used: The new key is inserted in one of its two possible locations, "kicking out", that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there. The process continues in the same way until an empty position is found, completing the algorithm. However, it is possible for this insertion process to fail, by entering an infinite loop or by finding a very long chain (longer than a preset threshold that is logarithmic in the table size). In this case, the hash table is rebuilt in-place using new hash functions: [...]

You say:

> And if so, what other structure could you possibly use besides a list (linked-, array-, etc.) ?

Any datastructure you feel like. You can also use a balanced search tree, if you want to. See eg https://en.wikipedia.org/wiki/Hash_table#Separate_chaining_w...

hhmc · on Oct 29, 2019

But in the pathological case where all the inputs collide to the same hash, I don't see how you avoid degrading to at least O(N log N), even with resizing -- you must account for the resizing as an O( g(N) ) operation.