That's only if you're sending a bloom filter for the entire hash table though. If you just use the existing buckets you could reduce the API response sizes considerably.
The real problem is that the current API returns a count of the number of times each particular password has appeared, and AFAIK there's no good way to do that with a bloom filter.
In order to increase the count for a given key, you expand the key into your probe sequence, find the minimum counter across all of the probed locations, and then increment all of the values equal to that minimum value. One generally uses saturating addition to avoid overflow. If you want to support removing items, you actually increment the values at all of the probe locations, at a cost of increasing the expected deviation from correct counts.
To read the count for a given key, you generate your probe sequence and return the minimum of the values stored at the probed locations.
Note that a regular Bloom filter is just a counting Bloom filter using 1-bit counters.
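The increment/read rules above sketch out to something like this. This is a toy Python illustration under my own assumptions (the `CountingBloom` name, the SHA-256 probe derivation, and the parameter defaults are all made up for the example, not anything HIBP uses):

```python
import hashlib

class CountingBloom:
    """Counting Bloom filter with conservative update and saturating counters."""

    def __init__(self, num_counters=1024, num_hashes=4, max_count=255):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes
        self.max_count = max_count  # cap for saturating addition

    def _probes(self, key):
        # Expand the key into a probe sequence by slicing one digest
        # into num_hashes 4-byte chunks (an illustrative choice).
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i : 4 * i + 4]
            yield int.from_bytes(chunk, "big") % len(self.counters)

    def increment(self, key):
        probes = list(self._probes(key))
        lo = min(self.counters[p] for p in probes)
        if lo >= self.max_count:
            return  # saturated: don't overflow
        # Conservative update: only bump the counters equal to the minimum.
        # (To support removal you'd bump *all* probed counters instead,
        # at the cost of larger expected overcounts.)
        for p in probes:
            if self.counters[p] == lo:
                self.counters[p] = lo + 1

    def count(self, key):
        # The count is the minimum over the probed locations.
        return min(self.counters[p] for p in self._probes(key))
```

Since reads take the minimum, collisions can only inflate a count, never deflate it, which is why the conservative update (bumping only the minimum counters) keeps the error one-sided and small.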
In this case, I presume one would generate a counting bloom filter and then store either the logarithm or a quantile of the count, since there isn't a common use case for distinguishing between 4,096 and 4,095 occurrences of a given password.
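For the logarithm variant, the bucketing could be as simple as this (an illustrative base-2 sketch; the function name is mine):

```python
def log2_bucket(count: int) -> int:
    # Collapse a count to its base-2 order of magnitude: all counts in
    # [2**(k-1), 2**k - 1] share bucket k, so e.g. 3,000 and 4,095 are
    # stored identically. bit_length() avoids floating-point log edge cases.
    return count.bit_length()
```

A byte-sized bucket then covers counts up to 2^255, versus a byte-sized exact counter saturating at 255.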
I don't understand why everyone's obsessed with using a Bloom Filter for this. The median response size on the API is 12.2KB with 305 entries. A Bloom Filter with 305 entries and a 0.000001 false positive rate would be about 1KB. It seems to me you're introducing a lot of complexity (Bloom Filter vs. simple string match) and possibility of false positives to save 11KB.
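For anyone who wants to check the arithmetic, the standard sizing formula for a Bloom filter is m = -n·ln(p)/(ln 2)², with optimal hash count k = (m/n)·ln 2. A quick sketch (function names are mine):

```python
import math

def bloom_size_bytes(n: int, p: float) -> int:
    # Optimal number of bits for n items at false-positive rate p.
    m_bits = -n * math.log(p) / (math.log(2) ** 2)
    return math.ceil(m_bits / 8)

def optimal_hashes(n: int, p: float) -> int:
    # Optimal number of hash functions for the same parameters.
    return round(-math.log(p) / math.log(2))

# bloom_size_bytes(305, 1e-6) → about 1.1 KB, matching the estimate above
# (though note it also needs ~20 hash functions at that false-positive rate)
```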
A bloom filter can be pushed to clients. Everything is static and could be served from a CDN. If there is a hit you could then do a secondary request to perform an actual lookup.
For passwords which have not been pwned, there'd be a nearly 100% savings on CPU (everything except the false positives).
True, although there's an interesting side effect. With the false positive rate tuned low any call to the actual API would likely be saying "My password is one of the ones you already know about" with quite high probability.
Presumably the use case is to reject either common or compromised passwords, and they'd like to eliminate a point of failure by not depending upon Troy's service being up / maintained forever.
Also, a Bloom filter has a smaller page cache footprint and disk space usage than using a precise set.
You'd have to either send the password, or something very easily derived from it, for this to work. Or have every client download the bloom filter data, which is quite a lot of bandwidth by comparison.
True, I didn't see that they allow querying on ranges. I guess you might be able to build something on a Bloom Filter Trie for that, but that might be pushing it.
But for one of the calls (checking a whole password against what's stored), which probably accounts for most of the usage of the API, a bloom filter seems like a perfect fit.
Doesn't seem particularly necessary though yeah? Bloom filters are super useful when the cost of doing the computation for the actual lookup is high, but in this case it's a single key-value lookup. That's about as cheap as an operation can possibly get.