Bloom Filters by Example (llimllib.github.io)
194 points by gvenzl on July 10, 2017 | 37 comments



I disagree with the standard dogma around bloom filters that you need multiple hash functions. Just use a simple incrementing salt value to modify the input so you can hash the resulting salted input as many times as you need to, using a different salt value each time.

Say you want to hash the string "abc" 8 times. Instead of having 8 hash functions, just take the hash of, say, "abc-0", "abc-1", ... "abc-7". As long as your hash function is of good quality and you're treating the output properly, you don't have to worry about the results from these different inputs having some significant relationship. For cryptographic types maybe I'm using the term "salt" loosely. Whatever, think of another term if you like. Anyway in practice this has worked fine for me.
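
A minimal sketch of that salting trick in Python (the function names and the choice of SHA-256 are mine, just for illustration):

    import hashlib

    def salted_indices(item, k, m):
        # Derive k bit positions by hashing k salted variants of the input.
        for salt in range(k):
            digest = hashlib.sha256(f"{item}-{salt}".encode()).digest()
            # Treat the digest as a big integer and reduce it to the bit-array size.
            yield int.from_bytes(digest, "big") % m

    # e.g. eight "hash functions" over a 2**20-bit array
    positions = list(salted_indices("abc", 8, 1 << 20))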


See "Less Hashing, Same Performance: Building a Better Bloom Filter": https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.p.... This is pretty standard practice now.

There've been a couple of other analyses (at least) of the importance of independence among hash functions in probabilistic data structures --- which if I remember correctly was "not as much as you'd think" --- but I don't remember the titles of the papers offhand.


Standard practice? Someone should tell all the fresh CS grads who keep blogging about their multiple hash algorithms. I've been doing this as standard practice since the 1990s. I'm glad if Computer Science academics are catching up to the working practitioner.


> catching up

Not quite. The n-hashing methods described in that paper are rather different from yours. They're significantly faster, for one thing, which matters if you're building e.g. high-performance networking products.

An equally important contribution of the paper though is providing a framework for proving the reliability of such methods. There are areas of software development where "in practice this has worked fine for me" isn't good enough.


I see, I misunderstood your first comment. I took it as you posting a paper about the technique I described, and when you said "This is pretty standard practice now." I thought you meant "this, what natch just said in his comment," as opposed to "this, what the paper said."

So they're still doing multiple hash functions as standard practice, you're saying? Wow.

Yes their methods are rather different I see.

Indeed using multiple salts is slower than two hashes.

What I do now instead of that (the salts method was easier to explain) is just one hash, currently SHA-1 but may move to BLAKE2 soon, and chop the result of that up into 20-bit pieces that I use as hash values, modulo the bit array size. How do you think this would compare to what they do?
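
Roughly, that chop-one-digest approach looks like this (SHA-1 gives 160 bits, so at most eight non-overlapping 20-bit pieces; the names and prime-sized array are illustrative):

    import hashlib

    def chopped_indices(item, k, m, piece_bits=20):
        # One SHA-1 digest (160 bits) split into k pieces of piece_bits each,
        # so k can be at most 160 // piece_bits = 8 without overlap.
        digest = int.from_bytes(hashlib.sha1(item).digest(), "big")
        mask = (1 << piece_bits) - 1
        return [((digest >> (i * piece_bits)) & mask) % m for i in range(k)]

    indices = chopped_indices(b"abc", 8, 1000003)  # 1000003: a prime-sized bit array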


> What I do now

Same here (and if you're interested in trying out some new hash functions for performance-related reasons, I'll recommend fast non-cryptographic options like xxHash and MurmurHash). You can still use your chopped-up hashes to generate additional values if needed, depending on the desired error rate.

Do watch out for modulo bias when mapping the chopped-up hashes to the array, though. For max hash value >>> array size the bias is going to be vanishingly small, but if you're clipping the hashes to the nearest power of two greater than the array size, things could get ugly.
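
A toy illustration of that bias (the numbers are mine):

    # A 4-bit "hash" (0..15) reduced mod 10: residues 0-5 occur twice,
    # 6-9 only once. The skew shrinks as the hash range dwarfs the array size.
    from collections import Counter
    print(Counter(h % 10 for h in range(16)))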


That's just an easy way to create new hash functions out of an existing one. Nobody sensible would insist that your hashes must have different underlying algorithms, just that they produce uncorrelated outputs.


You haven't been reading the blog posts on bloom filters. Or maybe you're saying they're not sensible, in which case I agree.


Haven't been reading is definitely true. Not sensible sounds like the case.

If you're using a cryptographic hash, your method clearly works. If you're using a fast and crappy hash function, you might need more care. But I suspect not.

As always, test with your real-world data. If it does what you need it to, theoretical qualms are likely irrelevant.


> I disagree with the standard dogma around bloom filters that you need multiple hash functions.

There is no dogma there, just a confusion of terminology. Your "abc-0", "abc-1", ... "abc-7" could be considered a family of hash functions, with the suffix as a parameter. "Use this algorithm on the input + a suffix of '-0'" is a different hash function than "Use this algorithm on the input + a suffix of '-1'".

Similarly, splitting a long or infinite length output of a hash function into smaller pieces is a perfectly acceptable way of constructing multiple independent hash functions.

When a paper talks about "multiple hash functions", that means exactly what it says, but no one is saying that you need multiple hash algorithms. It might mean making multiple hash functions using universal hashing, or through techniques like the "Less Hashing, Same Performance" paper people have linked to, or just using a prefix/suffix on the input like you suggest. No one is using CRC32 and MD5 and SHA1 together as their multiple hash functions, that's silly, unnecessary, complicated and doesn't scale.


The only problem is this assumes that the hash unpredictably changes on input modification. Which is true for cryptographic hashes but not others.


It assumes nothing of the sort. [Edit: well I mean it RELIES on it yes but it's not an assumption; it's verified by knowing the properties of the hash function used.] But it perhaps hinges on your definition of "unpredictably."

If you're taking the modulus of a large integer with respect to a very much smaller bit array with a length that is a prime number, there is plenty of unpredictability with most decent hash functions (note I did have "good quality" as a caveat above) even non-cryptographic ones. That being said, I do use cryptographic hashes.


basically, ideally you would want all of your hash functions to belong to the same universal hash family[1].

[1]: https://en.wikipedia.org/wiki/Universal_hashing
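
For integer keys, the textbook Carter-Wegman construction is one such family; a sketch, with the prime and ranges chosen arbitrarily here:

    import random

    P = (1 << 61) - 1  # a Mersenne prime larger than any key we expect

    def make_hash(m):
        # h_{a,b}(x) = ((a*x + b) mod P) mod m, drawn at random from the family
        a = random.randrange(1, P)
        b = random.randrange(0, P)
        return lambda x: ((a * x + b) % P) % m

    hashes = [make_hash(1 << 20) for _ in range(8)]  # eight members of the family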


I think I actually wrote a section on doing this and then wrote over it at some point :(

PRs welcome, I'm on vacation and unlikely to get to it immediately. http://github.com/llimllib/bloomfilter-tutorial


That seems a roundabout way of sizing a bloom filter. Here's what I came up with when I first implemented them for Newzbin: https://hur.st/bloomfilter

    // n = expected items, p = false-positive rate, m = bits, k = hash count
    m = ceil((n * log(p)) / log(1.0 / (pow(2.0, log(2.0)))));
    k = round(log(2.0) * m / n);
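
Written more directly, with a worked example (the numbers are mine):

    from math import ceil, log

    n = 1_000_000          # expected items
    p = 0.01               # target false-positive rate
    m = ceil(-n * log(p) / (log(2) ** 2))   # ~9,585,059 bits (~1.14 MiB)
    k = round(log(2) * m / n)               # 7 hash functions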


Thank you for this!

I actually used this site today. Found it on google...


Glad you found it useful :)


Unrelated, but I've always had this question about bloom filters...

If testing for membership in the set is unreliable, but testing for non-membership in the set works every time, then why can't you simply invert the boolean for the non-membership test and call that a membership test?

E.g., if it's not not in the group, then it's in the group...


If you want to invert the boolean check, then you need to invert the input space as well, which is not possible.

For example, consider a Bloom filter checking the availability of a username during sign up. If you want an inverse Bloom filter that checks if it is not not in the group, then you need to load it with all possible usernames.


Makes sense. OP's link is interactive enough that I was beginning to see that, though couldn't articulate it. Thanks!


If I was going to explain a bloom filter like you're 5... a bloom filter is like a savant who never forgets a face -- maybe he's got a job in passport control in Arstotzka -- if you show him someone's face or picture once, he'll never forget it: if any time later you show him the same picture and ask him "have you seen this face before?" he'll say "yes" without fail. If he replies "no way", you can be 100% sure he's never seen it. But his memory isn't photographic: he confuses people's faces after a while, recognizing faces he's never seen, and that becomes worse the more people he's seen. So to be safe he doesn't reply just "yes" but "hmm, I guess so".

In computing terms the interface has 2 methods:

- takeALookAtThisFace(x)

- yaSeenThisFaceBefore?(x) which returns either "no way" or "hmm, I guess so".

"no way" means P(x is a member) = 0. Its negation is not "hell yes" (P(x is a member) = 1), it's "maybe" (P(x is a member) > 0).


Thanks for this explanation it was really helpful!


Glad to help. But I noticed the analogy is a bit flawed; testing for membership does not add anything to the set, but the analogy might imply asking yaSeenThisFaceBefore(x) will make the savant remember x. I should change the story and the 2 functions to "remember this terrorist" and "is this a terrorist?" or something like that.


For implementations that use cryptographic hash functions, do you actually need more than one hash function invocation per item?

For instance, suppose you were implementing a Bloom filter with 2^20 bits, and you want to use 10 bits per item. Instead of hashing the item 10 times, could you hash once with a 256-bit cryptographic hash, then take the lower 200 bits, divide them into ten bit strings of 20 bits each, and use those?


Yes, you can do that. Let me describe something similar that I implemented recently.

To use your example, let's say that the Bloom filter has a size of 2²⁰ bits (128 KB) and that we are using 10 bits per item. In other words, for each new item we need to calculate ten positions in the range [0,2²⁰).

We start by using a cryptographic hash function on the item just once. For example, SHA-256, which will give us a 256-bit value.

Now we need to extract ten blocks from these 256 bits. We can do as you suggest and make each block 20 bits long, but I don't know of a reason why we should not make them a bit longer. A length of 32 bits would be nice, so each block could be neatly mapped to a long.

To get ten blocks of 32 bits out of a 256-bit value, we just calculate a step such that block i starts at position i * step. With a step of 24 bits ((256 - 32) / 9, rounded down), the first block is in positions [0,31], the second in [24,55]… and the tenth in [216,247].

At this point, we have ten blocks of 32 bits and we need ten values in the range [0,2²⁰). So we just turn each block into a long and calculate modulo 2²⁰ for each of them.

And that's pretty much it.

To enter the item in the filter, we set to true the positions corresponding to each of those 10 values.

To check whether an item may be in the filter, we do the same procedure and check that all of the ten positions have been set to true. If any of them is false, the item is not in the filter.
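
Putting the whole procedure together in Python (step of 24 bits as above; the structure and names are just one way to write it):

    import hashlib

    M = 1 << 20        # 2**20 bit positions
    K = 10             # bits set per item
    BLOCK_BITS = 32
    STEP = (256 - BLOCK_BITS) // (K - 1)   # 24

    def positions(item):
        digest = int.from_bytes(hashlib.sha256(item).digest(), "big")
        for i in range(K):
            start = i * STEP
            # bits [start, start + 31] of the 256-bit digest, MSB first
            block = (digest >> (256 - BLOCK_BITS - start)) & 0xFFFFFFFF
            yield block % M    # 2**32 is a multiple of 2**20, so no modulo bias

    bits = bytearray(M // 8)

    def add(item):
        for p in positions(item):
            bits[p // 8] |= 1 << (p % 8)

    def might_contain(item):
        return all(bits[p // 8] & (1 << (p % 8)) for p in positions(item))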


What is the tradeoff? Cryptographically secure hashes usually require dozens of rounds and take a lot of time. Could two less secure hashes do the same thing in less time? What about when the crypto hash is hardware accelerated?

I haven't run the numbers, but I'd bet that two easy hashes still beat the hardware accelerated one by quite a lot (taking time is part of what makes cryptographic hashes secure).


I remember I played with this specific scheme a while ago, and from what I remember it was doing the work well (I did not run a comparison though).


I am confused by:

    > cryptographic hashes such as sha1,
    > though widely used therefore are not
    > very good choices
I thought SHA1 was fast, and that was a reason to not use it in applications where brute-forcing might be an issue.

    > [the more times you hash it the fewer
    > false positives]
That doesn't match my understanding of how any even slightly reasonable hash function should work, doubly so if it's uniformly distributed.


SHA1 is fast compared to key-derivation functions so you do not want to use it to hash passwords.

However SHA1 is slow compared to most hash functions (especially but not solely non-cryptographic hashes) so you don't want to use it for hash-based collections like hash tables or bloom filters.

edit: here's a bit I saved from tptacek (sadly I didn't keep the link, only the content):

* If you need random fixed-sized URLs, generate UUIDs; don't tie them to content, which can (a) change and (b) be predicted.

* For error detection, CRC schemes aren't weak. Against adversaries, MD5 is weak. For offline file integrity checking, or user-timescale online checking, use SHA256; at the very minimum, use an algorithm that hasn't been broken.

* Do not ever use the MD5(password) password scheme. MD5 is much faster than Unix crypt; even conventional Unix crypt is at least salt'ed to defend against rainbow table attacks, and modern adaptive hashing can be tuned to make dictionary attacks infeasible.

* MD5 is too slow for in-memory hash tables; I cringe when I see people use it. You're probably just hashing a string: use Torek's 31/37 hash. Otherwise, use Jenkins.

* PRNG design is hard. Just running MD5 over trivially small internal state doesn't yield a secure PRNG. Again, a problem other people solved that you have no business hacking on yourself, at least if your code matters.

* If you are concerned about collision attacks on your cryptosystem, which you should be if you're this guy and you're using MD5, use an algorithm that hasn't been broken; don't just jumble up one that already has. Kerckhoffs's principle: look it up.

The bits you want are 3 and 4, but everything is good. Just sed s/MD5/SHA1/



The article is too informal and it might suggest that hash functions are iterated (taking the hash of the hash of the hash...), like in applications that need very expensive hash functions.

A Bloom filter needs multiple cheap and decently independent hashes of the input, which at least in principle could be computed in parallel. For example, an FNV-family hash function, which has no arbitrary parameters nor a key input, could be applied to the concatenation of k different fixed integers with the input data to get k different, and approximately independent, hash values. The hash function would also have to be adjusted to obtain a uniform distribution over m distinct values rather than the usual much larger range.
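
A sketch of that FNV idea, prefixing each of k fixed integers to the input (64-bit FNV-1a constants; the plain modulo at the end is a rough stand-in for the range adjustment mentioned above):

    FNV_OFFSET = 0xcbf29ce484222325
    FNV_PRIME = 0x100000001b3

    def fnv1a_64(data):
        h = FNV_OFFSET
        for byte in data:
            h ^= byte
            h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
        return h

    def k_indices(data, k, m):
        # k roughly independent values from one keyless hash function
        return [fnv1a_64(i.to_bytes(4, "big") + data) % m for i in range(k)]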


I had added a paragraph with this idea but have since lost it accidentally :(


Bloom filters are a data structure, in the same category as hash tables and trees.

They need a "hash" function to select an index for an item. A hash function in that context does not need to be cryptographic; it only needs to be quick and give a good distribution of the data.

SHA1 is slow compared to general-purpose hash functions (typically MurmurHash for Bloom filters).


SHA1 lives in an inbetween world, it's not cryptographically secure, but it is slow when you don't actually care about cryptographic security. It has increasingly few modern uses.

Your second point I agree with, sounds like someone is using a very bad hash function there.


"Simply hash it a few times"?


Definitely not clear from the article alone. You can use more than one hash algorithm to reduce the chance of collisions. When adding an item you set the bit in the array for each hash, and when testing you check each of those bits.

Practical, contrived example: the string "w" gives us "2" from fnv and "11" from murmur. When you add that to the filter with both hashes, bits 2 and 11 are set.

The string "h" gives us 2 from fnv and 10 from murmur. If you were using only the fnv hash, you'd get a "maybe exists" result. But since the murmur hash is different for "h" than it is for "w", you get a "definitely not" result.

Of course, you still have collisions, you just cut them down. Both fnv and murmur return the same hashes for "w" and "woot", so adding "w" then checking "woot" still gives you a "maybe" result, but at least checking "h" does not.
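
A runnable version of that two-hash example, with hashlib digests standing in for fnv and murmur (so the exact bit positions differ from the contrived 2/11 above):

    import hashlib

    M = 16
    bits = [False] * M

    def h1(s):  # stand-in for fnv
        return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big") % M

    def h2(s):  # stand-in for murmur
        return int.from_bytes(hashlib.md5(s.encode()).digest(), "big") % M

    def add(s):
        bits[h1(s)] = bits[h2(s)] = True

    def maybe_contains(s):
        # "maybe" only if every bit is set; any unset bit means "definitely not"
        return bits[h1(s)] and bits[h2(s)]

    add("w")
    print(maybe_contains("w"), maybe_contains("h"))  # True, and very likely False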


sha1 is very, very fast. This article seems to confuse cryptographically useful hash functions with key stretching, adaptive work hashing, and the like.



