> We’ve been big fans of the Postgres HyperLogLog extension for many years and are excited to be taking on responsibility to maintain HyperLogLog going forward.
Conceptually, HyperLogLog should be really simple (count leading zeroes), so I wasn't sure what there was to "maintain".
The basic idea is that for each element X in your data, you compute a function f(X) which is equal to the number of leading zeros in a bitstring generated by hashing X. Then you take the maximum of this value across all elements in the dataset (which can be efficiently done on distributed or streaming datasets).
For any given value of X, the probability that its hash has at least N leading zeros is 1/2^N. We're taking the maximum, so if any one of the values has at least N leading zeros, then max(f(X)) >= N.
If the number of distinct values observed is significantly greater than 2^N, it's very likely that at least one of them has N leading zeros; conversely, if the number of distinct values is much less than 2^N, it's relatively unlikely. So max(f(X)) grows roughly like log2(number of distinct values), and 2^max(f(X)), multiplied by a correction constant that can be derived, gives an estimate of the count.
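To make this concrete, here is a minimal sketch of the single-counter version in Python; the hash choice and helper names are just illustrative, not how any particular library does it:

```python
import hashlib


def leading_zeros_64(value):
    """Count leading zeros in a 64-bit hash of the value (hash choice is arbitrary)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")
    return 64 - h.bit_length()  # 64 if the hash happens to be all zeros


def naive_estimate(values):
    """Single-counter estimate: 2 ** (max leading zeros seen across the data)."""
    max_zeros = max(leading_zeros_64(v) for v in values)
    return 2 ** max_zeros  # very noisy: one lucky hash skews it badly
```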
This is the basic "probabilistic counting" algorithm of Flajolet and Martin (1985). Its major flaw is that it produces an estimate with very high variance; it only takes a single unusual bitstring to throw the result way off. You can get a much more reliable result by partitioning the input into a bunch of buckets (by slicing off a few bits of the hash), keeping a separate counter for each bucket, and then combining the results at the end (HyperLogLog specifically uses a harmonic mean, which damps outliers). This improvement, plus a few other optimizations, gives you HyperLogLog.
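And a toy bucketed version, roughly the shape of HyperLogLog itself; the bias constant follows the published algorithm, but this sketch deliberately omits the small- and large-range corrections a real implementation (like postgresql-hll) needs:

```python
import hashlib


def hash64(value):
    """Deterministic 64-bit hash (any good uniform hash would do)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyHLL:
    """2**p registers; each remembers the max rank (leading zeros + 1) it has seen."""

    def __init__(self, p=12):
        self.p = p                      # bits of the hash used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        h = hash64(value)
        j = h >> (64 - self.p)          # first p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[j]:
            self.registers[j] = rank

    def estimate(self):
        # Harmonic mean of the per-register values with the standard bias constant.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)


sketch = ToyHLL(p=12)
for i in range(100_000):
    sketch.add(f"user-{i}")
print(round(sketch.estimate()))         # typically within a couple percent of 100000
```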
There are some good answers here already, but I'm going to take a crack at it too =)
Say you have a random hash function that produces (unsigned) int values. So, for all inputs, hash(x) will be in the range [0, 2^32).
- How many different inputs would you need to feed it before you would expect to see hash(x) == 0?
... uh, like, billions?
- How many different inputs would you need to feed it before you would expect to see hash(x) < 2?
... well, still in the billions, but maybe half as many as finding hash(x) == 0..
- How many different inputs would you need to feed it before you would expect to see hash(x) < 4? .. 8? .. 16? .. 32? ... all the way up to < 2^31 (typically you'll see this within two tries) and 2^32 (all of them).
... and so on and so forth; math does the rest. The "count the leading zeroes" check is the same thing as "the hashcode is less than some threshold 2^X": every extra leading zero halves the threshold. To prevent a single "lucky" hash from throwing off your estimate, you split your inputs into many buckets (using a few bits of the hash) and average the buckets' estimates.
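A quick, purely illustrative simulation of that intuition: after n distinct inputs, the smallest 32-bit hash you've seen sits around 2^32 / n, so inverting it recovers roughly n:

```python
import random


def average_min_hash(n, trials=20):
    """Average of the smallest 32-bit hash value seen across n random inputs."""
    total = 0
    for _ in range(trials):
        total += min(random.getrandbits(32) for _ in range(n))
    return total / trials


# The smallest hash shrinks like 2**32 / n, so 2**32 / min is a (noisy) count estimate.
for n in (1_000, 10_000, 100_000):
    print(n, round(2 ** 32 / average_min_hash(n)))
```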
Find a nonce such that hash(x + nonce) has n leading zeroes, and you've invented Proof of Work :)
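For what it's worth, a toy version of that nonce search (purely illustrative; real Proof of Work schemes differ in the details):

```python
import hashlib


def leading_zero_bits(data):
    """Leading zero bits of the SHA-256 hash of the data."""
    h = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return 256 - h.bit_length()


def find_nonce(message, difficulty_bits):
    """Search for a nonce so that hash(message + nonce) has enough leading zeros."""
    nonce = 0
    while leading_zero_bits(f"{message}{nonce}".encode()) < difficulty_bits:
        nonce += 1
    return nonce


print(find_nonce("hello", 16))  # expect on the order of 2**16 tries
```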
It is amazing how different techniques add up, yet almost all of them build on a novel, mathematically sound, fundamental piece of computer science.
ELI5 version: you have a bunch of things and no way to count them all. So you try to measure the ‘space’ between them and guess how many there are by how far apart they are.
In the case of HLL, it's the "space" around the smallest or largest value.
> Conceptually, HyperLogLog should be really simple (count leading zeroes), so I wasn't sure what there was to "maintain".
Looking at the Citus repo, it seems like they made a lot of interesting improvements to the original algorithm: https://github.com/citusdata/postgresql-hll
Interesting how making a simple theory work in the real world takes 3.3k lines of code.