Sorry if this is a bit negative but from a cursory look I wouldn't use this.
An efficient Bloom Filter should only allocate on heap exactly once: when allocating the initial bitmap. This implementation seems to be not very efficient as it allocates even during a simple query.
The usage of `static` variables makes the BloomFilter code unnecessarily dangerous to use when multiple threads are used even if every thread just wants to allocate and use totally separate bloom filters. Actually it could even cause issues with two bloom filters created after each other since even `Initialize` just returns a pointer to a single static `BloomFilter` :o I can't come up with a reason why you'd consider making anything in there `static`.
Oh and btw I believe bloom_path can cause memory safety issues because it is not terminated with a null after strncopy.
I would additionally suggest to use some code formatting tool as it's a bit all over the place. But the variable/function names mix various styles as well.
I noticed some other weirdness like functions taking a double pointer argument and then freeing the memory and replacing it with a newly allocated piece of memory. Seems rather dangerous to do that.
And zeroing out data by freeing the memory and callocing new memory. Why not just write zeros into the memory instead of reallocating it from scratch only to overwrite it with zeros anyway?
If you want a networked version (note: not based on Fleur): https://github.com/armon/bloomd In the source there is a "libbloom" as well that you could copy directly into your codebase if you wanted an embedded version. The networked version has a simple Redis-like text protocol so you can just use `telnet` to play around with it, too.
Also written in C, with great performance (no comparison to this one, I haven't done it), and has been used in production by many companies for many years (since ~2012 or 2013).
Just pointing out other implementations if anyone is curious!
[3]: I have a vague recollection of somebody telling me why combining two hashes in the way bloomd does for k >= 4 is a good idea but I can't remember - anybody have a good reference for me to link to? (edit: nvm, I already link the paper on my page! sheesh)
I vaguely remember having read about multiple hashes for bloom filters in the past, but I have trouble extracting the actual approach from the linked paper.
Could you give a summary? The paper is quite mathematical and seems to lack a clear description of how to actually use the two hashes without reading in depth.
The basic idea of a Bloom filter is that you represent each item in the filter by setting k bits, and you set and query these bits using k hash functions. The bits need to be independent and uniformly distributed, so the hash functions need to output suitably random values for the same input.
Even fast hash functions like Murmur add overhead, though, and the lower the desired false positive rate, the more hash functions you need (x hash functions for 2^-x false positive rate).
The conclusion of the paper is roughly that you can create new hash functions by recombining the outputs of two initial hash functions without compromising the statistical integrity of the filter, and this makes querying the filter a lot faster.
To be clear, even the two initial hash functions can be halves or quarters of a single hash function with an output larger than the filter needs, e.g. a filter that needs 64-bit hashes can run entirely on Murmur-128 using its bottom and top halves as the two hash functions.
hah yeah I noticed a minute later, the link is right in the comments, and I'd already linked the paper in my article! I feel like this is one of those things I re-learn every 5 years
I appreciate the Go-style docstring comments in the code!
For anyone trying to understand Bloom filters who may be interested in checking out an alternative C implementation (designed to be compiled to WebAssembly), I'll link mine below.
I used it to build a Bloom filter of every Hacker News submission ever for a privacy-preserving browser extension that tells you if the page you're currently on has been submitted before.
Could be faster still if it didn't calloc/free a "Fingerprint" buffer each time you add or query an entry. Since its individual BloomFilters are not thread-safe anyway, it could allocate the buffer once at BloomFilter creation time and be done with it.
Sorry to be pedant, but using good prefix, or even suffix for C library is a must, and come up with something unique (and "fleur" is kind of cool), probably good idea to prefix with it, and then it can be used without any current or future symbol clashes.
An efficient Bloom Filter should only allocate on heap exactly once: when allocating the initial bitmap. This implementation seems to be not very efficient as it allocates even during a simple query.
The usage of `static` variables makes the BloomFilter code unnecessarily dangerous to use when multiple threads are used even if every thread just wants to allocate and use totally separate bloom filters. Actually it could even cause issues with two bloom filters created after each other since even `Initialize` just returns a pointer to a single static `BloomFilter` :o I can't come up with a reason why you'd consider making anything in there `static`.
There is also strange code like this loop https://github.com/hashlookup/fleur/blob/4ee2644a850381d928a... that jumped into my eye.
Oh and btw I believe bloom_path can cause memory safety issues because it is not terminated with a null after strncopy.
I would additionally suggest to use some code formatting tool as it's a bit all over the place. But the variable/function names mix various styles as well.