That's only if you're sending a bloom filter for the entire hash table though. If you just use the existing buckets you could reduce the API response sizes considerably.
The real problem is that the current API returns a count of the number of times each particular password has appeared, and AFAIK there's no good way to do that with a bloom filter.
In order to increase the count for a given key, you expand the key into your probe sequence, find the minimum counter across all of the probed locations, and then increment all of the values equal to that minimum value. One generally uses saturating addition to avoid overflow. If you want to support removing items, you actually increment the values at all of the probe locations, at a cost of increasing the expected deviation from correct counts.
To read the count for a given key, you generate your probe sequence and return the minimum of the values stored at the probed locations.
Note that a regular Bloom filter is just a counting Bloom filter using 1-bit counters.
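The increment/read rules above sketch out to something like this. This is a toy Python illustration under my own assumptions (the `CountingBloom` name, the SHA-256 probe derivation, and the parameter defaults are all made up for the example, not anything HIBP uses):

```python
import hashlib

class CountingBloom:
    """Counting Bloom filter with conservative update and saturating counters."""

    def __init__(self, num_counters=1024, num_hashes=4, max_count=255):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes
        self.max_count = max_count  # cap for saturating addition

    def _probes(self, key):
        # Expand the key into a probe sequence by slicing one digest
        # into num_hashes 4-byte chunks (an illustrative choice).
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i : 4 * i + 4]
            yield int.from_bytes(chunk, "big") % len(self.counters)

    def increment(self, key):
        probes = list(self._probes(key))
        lo = min(self.counters[p] for p in probes)
        if lo >= self.max_count:
            return  # saturated: don't overflow
        # Conservative update: only bump the counters equal to the minimum.
        # (To support removal you'd bump *all* probed counters instead,
        # at the cost of larger expected overcounts.)
        for p in probes:
            if self.counters[p] == lo:
                self.counters[p] = lo + 1

    def count(self, key):
        # The count is the minimum over the probed locations.
        return min(self.counters[p] for p in self._probes(key))
```

Since reads take the minimum, collisions can only inflate a count, never deflate it, which is why the conservative update (bumping only the minimum counters) keeps the error one-sided and small.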
In this case, I presume one would generate a counting bloom filter and then store either the logarithm or a quantile of the count, since there isn't a common use case for distinguishing between 4,096 and 4,095 occurrences of a given password.
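For the logarithm variant, the bucketing could be as simple as this (an illustrative base-2 sketch; the function name is mine):

```python
def log2_bucket(count: int) -> int:
    # Collapse a count to its base-2 order of magnitude: all counts in
    # [2**(k-1), 2**k - 1] share bucket k, so e.g. 3,000 and 4,095 are
    # stored identically. bit_length() avoids floating-point log edge cases.
    return count.bit_length()
```

A byte-sized bucket then covers counts up to 2^255, versus a byte-sized exact counter saturating at 255.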
I don't understand why everyone's obsessed with using a Bloom Filter for this. The median response size on the API is 12.2KB with 305 entries. A Bloom Filter with 305 entries and a 0.000001 false positive rate would be about 1KB. It seems to me you're introducing a lot of complexity (Bloom Filter vs. simple string match) and possibility of false positives to save 11KB.
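For anyone who wants to check the arithmetic, the standard sizing formula for a Bloom filter is m = -n·ln(p)/(ln 2)², with optimal hash count k = (m/n)·ln 2. A quick sketch (function names are mine):

```python
import math

def bloom_size_bytes(n: int, p: float) -> int:
    # Optimal number of bits for n items at false-positive rate p.
    m_bits = -n * math.log(p) / (math.log(2) ** 2)
    return math.ceil(m_bits / 8)

def optimal_hashes(n: int, p: float) -> int:
    # Optimal number of hash functions for the same parameters.
    return round(-math.log(p) / math.log(2))

# bloom_size_bytes(305, 1e-6) → about 1.1 KB, matching the estimate above
# (though note it also needs ~20 hash functions at that false-positive rate)
```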
A bloom filter can be pushed to clients. Everything is static and could be served from a CDN. If there is a hit you could then do a secondary request to perform an actual lookup.
For passwords which have not been pwned, there'd be a nearly 100% savings on CPU (everything except the false positives).
True, although there's an interesting side effect. With the false positive rate tuned low any call to the actual API would likely be saying "My password is one of the ones you already know about" with quite high probability.
Presumably the use case is to reject either common or compromised passwords, and they'd like to eliminate a point of failure by not depending upon Troy's service being up / maintained forever.
Also, a Bloom filter has a smaller page cache footprint and disk space usage than using a precise set.
You'd have to either send the password, or something very easily derived from it, for this to work. Or have every client download the bloom filter data, which is quite a lot of bandwidth by comparison.
True, I didn't see that they allow querying on ranges. I guess you might be able to build something on a Bloom Filter Trie for that, but that might be pushing it.
But for one of the calls (checking a whole password against what's stored), which probably accounts for most of the usage of the API, a bloom filter seems like a perfect fit.
Doesn't seem particularly necessary though yeah? Bloom filters are super useful when the cost of doing the computation for the actual lookup is high, but in this case it's a single key-value lookup. That's about as cheap as an operation can possibly get.