Hacker News new | past | comments | ask | show | jobs | submit | reivalc's comments login

Neat! I wonder if you quantified the performance of your search somehow? I guess labelling data for that would be laborious.


You mean performance in terms of latency or relevancy?


Relevancy - I'd guess it's difficult to measure, due to the large dataset, the large number of possible search terms, and large number of possible results.


thanks for the post, it inspired this naive code:

    class BloomFilter:
        def __init__(self, size):
            self.f = [0] * size

        def contains(self, s):
            h1, h2, h3 = self.hashes(s)
            if self.f[h1] * self.f[h2] * self.f[h3] == 1:
                return 'Value might be in the set.'
            else:
                return 'Value is definitely not in the set.'

        def hashes(self, s):
            h1 = hash(s) % len(self.f)
            h2 = hash(s + 'salt') % len(self.f)
            h3 = hash(s + 'more salt') % len(self.f)
            return h1, h2, h3

        def insert(self, s):
            h1, h2, h3 = self.hashes(s)
            self.f[h1] = self.f[h2] = self.f[h3] = 1

    bf = BloomFilter(64)
    bf.insert('bill')

    print(f"{bf.contains('bill') = }")
    print(f"{bf.contains('bob') = }")

    Out:
    bf.contains('bill') = 'Value might be in the set.'
    bf.contains('bob') = 'Value is definitely not in the set.'


This looks fantastic! I also implemented one from scratch[0], and I was doing something very similar at first. Then I found the Python hash() function returns different outputs for the same input when you restart the interpreter. I really like your implementation, it's functional and easy to understand!

[0] https://ricardoanderegg.com/posts/understanding-bloom-filter...


Nice!

> found the Python hash() function returns different outputs for the same input when you restart the interpreter

I wasn't aware of this >>


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: