> We’ve been big fans of the Postgres HyperLogLog extension for many years and are excited to be taking on responsibility to maintain HyperLogLog going forward.
Conceptually, HyperLogLog should be really simple (count leading zeroes), so I wasn't sure what there was to "maintain".
The basic idea is that for each element X in your data, you compute a function f(X) which is equal to the number of leading zeros in a bitstring generated by hashing X. Then you take the maximum of this value across all elements in the dataset (which can be efficiently done on distributed or streaming datasets).
For any given value of X, the probability that its hash has at least N leading zeros is 1/2^N. We're taking the maximum, so if any one of the values has at least N leading zeros, then max(f(X)) >= N.
If the number of distinct values observed is significantly greater than 2^N, it's very likely that at least one of them has N leading zeros; conversely, if the number of distinct values is much less than 2^N, it's relatively unlikely. So max(f(X)) grows roughly like log2(number of distinct values), and 2^max(f(X)), multiplied by a correction constant that can be derived, gives an estimate of the count.
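To make this concrete, here is a minimal sketch of the single-counter version in Python; the hash choice and helper names are just illustrative, not how any particular library does it:

```python
import hashlib


def leading_zeros_64(value):
    """Count leading zeros in a 64-bit hash of the value (hash choice is arbitrary)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")
    return 64 - h.bit_length()  # 64 if the hash happens to be all zeros


def naive_estimate(values):
    """Single-counter estimate: 2 ** (max leading zeros seen across the data)."""
    max_zeros = max(leading_zeros_64(v) for v in values)
    return 2 ** max_zeros  # very noisy: one lucky hash skews it badly
```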
This is the basic "probabilistic counting" algorithm of Flajolet and Martin (1985). Its major flaw is that it produces an estimate with very high variance; it only takes a single unusual bitstring to throw the result way off. You can get a much more reliable result by partitioning the input into a bunch of buckets (by slicing off a few bits of the hash), keeping a separate counter for each bucket, and then combining the results at the end (HyperLogLog specifically uses a harmonic mean, which damps outliers). This improvement, plus a few other optimizations, gives you HyperLogLog.
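And a toy bucketed version, roughly the shape of HyperLogLog itself; the bias constant follows the published algorithm, but this sketch deliberately omits the small- and large-range corrections a real implementation (like postgresql-hll) needs:

```python
import hashlib


def hash64(value):
    """Deterministic 64-bit hash (any good uniform hash would do)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyHLL:
    """2**p registers; each remembers the max rank (leading zeros + 1) it has seen."""

    def __init__(self, p=12):
        self.p = p                      # bits of the hash used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        h = hash64(value)
        j = h >> (64 - self.p)          # first p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[j]:
            self.registers[j] = rank

    def estimate(self):
        # Harmonic mean of the per-register values with the standard bias constant.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)


sketch = ToyHLL(p=12)
for i in range(100_000):
    sketch.add(f"user-{i}")
print(round(sketch.estimate()))         # typically within a couple percent of 100000
```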
There are some good answers here already, but I'm going to take a crack at it too =)
Say you have a random hash function that produces (unsigned) int values. So, for all inputs, hash(x) will be in the range [0, 2^32).
- How many different inputs would you need to feed it before you would expect to see hash(x) == 0?
... uh, like, billions?
- How many different inputs would you need to feed it before you would expect to see hash(x) < 2?
... well, still in the billions, but maybe half as many as finding hash(x) == 0..
- How many different inputs would you need to feed it before you would expect to see hash(x) < 4? .. 8? .. 16? .. 32? ... all the way up to < 2^31 (typically you'll see this within two tries) and 2^32 (all of them).
... and so on and so forth; math does the rest. The "count the leading zeroes" check is the same thing as "the hashcode is less than some threshold 2^X": every extra leading zero halves the threshold. To prevent a single "lucky" hash from throwing off your estimate, you split your inputs into many buckets (using a few bits of the hash) and average the buckets' estimates.
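A quick, purely illustrative simulation of that intuition: after n distinct inputs, the smallest 32-bit hash you've seen sits around 2^32 / n, so inverting it recovers roughly n:

```python
import random


def average_min_hash(n, trials=20):
    """Average of the smallest 32-bit hash value seen across n random inputs."""
    total = 0
    for _ in range(trials):
        total += min(random.getrandbits(32) for _ in range(n))
    return total / trials


# The smallest hash shrinks like 2**32 / n, so 2**32 / min is a (noisy) count estimate.
for n in (1_000, 10_000, 100_000):
    print(n, round(2 ** 32 / average_min_hash(n)))
```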
Find a nonce such that hash(x + nonce) has n leading zeroes, and you've invented Proof of Work :)
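For what it's worth, a toy version of that nonce search (purely illustrative; real Proof of Work schemes differ in the details):

```python
import hashlib


def leading_zero_bits(data):
    """Leading zero bits of the SHA-256 hash of the data."""
    h = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return 256 - h.bit_length()


def find_nonce(message, difficulty_bits):
    """Search for a nonce so that hash(message + nonce) has enough leading zeros."""
    nonce = 0
    while leading_zero_bits(f"{message}{nonce}".encode()) < difficulty_bits:
        nonce += 1
    return nonce


print(find_nonce("hello", 16))  # expect on the order of 2**16 tries
```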
It is amazing how different techniques add up, yet almost all of them build on a novel, mathematically sound, fundamental piece of computer science.
ELI5 version: you have a bunch of things and no way to count them all. So you try to measure the ‘space’ between them and guess how many there are by how far apart they are.
In the case of HLL, it's the "space" around the smallest or largest value.
> Conceptually, HyperLogLog should be really simple (count leading zeroes), so I wasn't sure what there was to "maintain".
Looking at the Citus repo, it seems like they made a lot of interesting improvements to the original algorithm: https://github.com/citusdata/postgresql-hll
Interesting how making a simple theory work in the real world takes 3.3k lines of code.