reivalc's comments

reivalc · on April 13, 2021

Neat! I wonder if you quantified the performance of your search somehow? I guess labelling data for that would be laborious.

wolfgarbe · on April 13, 2021

You mean performance in terms of latency or relevancy?

reivalc · on April 13, 2021

Relevancy - I'd guess it's difficult to measure, due to the large dataset, the large number of possible search terms, and large number of possible results.

reivalc · on April 11, 2021

thanks for the post, it inspired this naive code:

    class BloomFilter:
        def __init__(self, size):
            self.f = [0] * size

        def contains(self, s):
            h1, h2, h3 = self.hashes(s)
            if self.f[h1] * self.f[h2] * self.f[h3] == 1:
                return 'Value might be in the set.'
            else:
                return 'Value is definitely not in the set.'

        def hashes(self, s):
            h1 = hash(s) % len(self.f)
            h2 = hash(s + 'salt') % len(self.f)
            h3 = hash(s + 'more salt') % len(self.f)
            return h1, h2, h3

        def insert(self, s):
            h1, h2, h3 = self.hashes(s)
            self.f[h1] = self.f[h2] = self.f[h3] = 1

    bf = BloomFilter(64)
    bf.insert('bill')

    print(f"{bf.contains('bill') = }")
    print(f"{bf.contains('bob') = }")

    Out:
    bf.contains('bill') = 'Value might be in the set.'
    bf.contains('bob') = 'Value is definitely not in the set.'

polyrand · on April 11, 2021

This looks fantastic! I also implemented one from scratch[0], and I was doing something very similar at first. Then I found the Python hash() function returns different outputs for the same input when you restart the interpreter. I really like your implementation, it's functional and easy to understand!

[0] https://ricardoanderegg.com/posts/understanding-bloom-filter...

reivalc · on April 12, 2021

Nice!

> found the Python hash() function returns different outputs for the same input when you restart the interpreter

I wasn't aware of this >>