The proof of almost-universality won't apply to subsets of bits, and many applications of hashes only use a subset of bits. The construction techniques used to get almost-universality might even make some subset of bits very poor in their performance: or at least my experience with designing linear codes for error detection/correction has been that strong codes often have some bits that are surprisingly weak.
So hopefully people have tested this empirically that the collision rates for the top and bottom n bits are all reasonable looking for varrious values of n (and might be useful to know which bits are the strongest, assuming you get a choice). Assuming there are no hidden gotchas like that, this looks pretty great.
I am working on a new hash that uses more memory (aiming around 4KiB that's shared in a process), but is a whole lot faster. The finalizer for that hash is a universal hash which I believe is 2^(2-k) almost-delta universal mod 2^k for any k <= 64. So the bottom bits are always high quality.
PolymurHash has no such guarantee, the claims are only over the full hash. That said, I don't expect problems in practice due to the Murmur style bit mixing at the end.
> I am working on a new hash that uses more memory (aiming around 4KiB that's shared in a process), but is a whole lot faster.
Though it's hard to imagine anything with 4kib of state being very fast! is it just a precomputed power table or something where the whole thing isn't even accessed for short inputs?
> The finalizer for that hash is a universal hash which I believe is 2^(2-k) almost-delta universal mod 2^k for any k <= 64. So the bottom bits are always high quality.
That's interesting!
> PolymurHash has no such guarantee, the claims are only over the full hash. That said, I don't expect problems in practice due to the Murmur style bit mixing at the end.
That mix at the end is just an addition with a secret constant, right? I'd expect then the least significant bits to not be improved much by it.
> Though it's hard to imagine anything with 4kib of state being very fast! is it just a precomputed power table or something where the whole thing isn't even accessed for short inputs?
Correct, it's a pregenerated table of random bits that, in essence, gets multiplied with the input. If the input is small, only a small portion gets accessed.
> That mix at the end is just an addition with a secret constant, right? I'd expect then the least significant bits to not be improved much by it.
No, the mix at the end is as follows:
x ^= x >> 32;
x *= 0xe9846af9b1a615dULL;
x ^= x >> 32;
x *= 0xe9846af9b1a615dULL;
x ^= x >> 28;
x += secret;
The addition of secret at the end ensures that the permutation is not trivially invertible by an adversary. The rest is a Murmur-style permutation, which PolymurHash gets part of its name from.
So what is the value of the "proof of universality"? I'm a bit confused as to what the value pitch is when compared to xxhash3, for example, which runs ~twice as fast and has a large user base (it's used when hashing in dotnet now).
From a dev perspective though, it seems very nice. Short and clean code with decent comments. Although I'm struggling to understand what `polymur_hash_poly611` is doing.
You can construct sets of keys that make xxhash collide much more often than average. With provable universal hashing, you know you get a good distribution for all sets of input keys.
I wish people would stop writing "header-only" stuff. I get it, but it distorts the point of a header file, which should be an interface for an accompanying implementation. What's so wrong about providing 2 files?
cryptography noob here. what is the potential application for this function? I just read about Universal hash today, and from what I understand the results is not deterministic? So it can't be used for password hashing? Would appreciate explanation for this
You can use it to make hash tables or other fast hash based algorithms that do not suffer from adversarial inputs. With most other fast hash functions it's easy to find colliding strings, letting an attacker mess up your algorithm.