Fleur – A bloom filter implementation in C

eis · on July 28, 2022

Sorry if this is a bit negative but from a cursory look I wouldn't use this.

An efficient Bloom Filter should only allocate on heap exactly once: when allocating the initial bitmap. This implementation seems to be not very efficient as it allocates even during a simple query.

The usage of `static` variables makes the BloomFilter code unnecessarily dangerous to use when multiple threads are used even if every thread just wants to allocate and use totally separate bloom filters. Actually it could even cause issues with two bloom filters created after each other since even `Initialize` just returns a pointer to a single static `BloomFilter` :o I can't come up with a reason why you'd consider making anything in there `static`.

There is also strange code like this loop https://github.com/hashlookup/fleur/blob/4ee2644a850381d928a... that jumped into my eye.

Oh and btw I believe bloom_path can cause memory safety issues because it is not terminated with a null after strncopy.

I would additionally suggest to use some code formatting tool as it's a bit all over the place. But the variable/function names mix various styles as well.

shric · on July 28, 2022

> There is also strange code like this loop https://github.com/hashlookup/fleur/blob/4ee2644a850381d928a... that jumped into my eye.

The line below that is worse:

strncpy(bloom_path , argv[optind], 128);

If you pass something >= 128 chars then bloom_path won't be null terminated. In general strncpy should never be used for copying strings.

For expansion on that: https://ramblings.implicit.net/c/2014/05/02/c-functions-that...

shric · on July 28, 2022

And I just realised the GP already mentioned the bloom_path issue. Sorry, too late to edit.

metadat · on July 28, 2022

tl;dr: use strlcpy instead.

tpankaj · on July 28, 2022

I noticed some other weirdness like functions taking a double pointer argument and then freeing the memory and replacing it with a newly allocated piece of memory. Seems rather dangerous to do that.

And zeroing out data by freeing the memory and callocing new memory. Why not just write zeros into the memory instead of reallocating it from scratch only to overwrite it with zeros anyway?

mitchellh · on July 27, 2022

If you want a networked version (note: not based on Fleur): https://github.com/armon/bloomd In the source there is a "libbloom" as well that you could copy directly into your codebase if you wanted an embedded version. The networked version has a simple Redis-like text protocol so you can just use `telnet` to play around with it, too.

Also written in C, with great performance (no comparison to this one, I haven't done it), and has been used in production by many companies for many years (since ~2012 or 2013).

Just pointing out other implementations if anyone is curious!

llimllib · on July 27, 2022

Neat! Because it's a thing I do[1], the hash algorithms in use:

Fleur (and DCSO/bloom and DCSO/flor): fnv

bloomd: a combination of SpookyHash and murmur[2]

[1]: https://llimllib.github.io/bloomfilter-tutorial/ (I'll update it to add bloomd and spooky)

[2]: https://github.com/armon/bloomd/blob/23c19a7f5cbb35d7c3d970b...

[3]: I have a vague recollection of somebody telling me why combining two hashes in the way bloomd does for k >= 4 is a good idea but I can't remember - anybody have a good reference for me to link to? (edit: nvm, I already link the paper on my page! sheesh)

FreakLegion · on July 27, 2022

It's in the source, Less Hashing, Same Performance: Building a Better Bloom Filter: https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.p...

I've used this trick at scale on network gear and it works great.

ascar · on July 27, 2022

I vaguely remember having read about multiple hashes for bloom filters in the past, but I have trouble extracting the actual approach from the linked paper.

Could you give a summary? The paper is quite mathematical and seems to lack a clear description of how to actually use the two hashes without reading in depth.

FreakLegion · on July 27, 2022

The basic idea of a Bloom filter is that you represent each item in the filter by setting k bits, and you set and query these bits using k hash functions. The bits need to be independent and uniformly distributed, so the hash functions need to output suitably random values for the same input.

Even fast hash functions like Murmur add overhead, though, and the lower the desired false positive rate, the more hash functions you need (x hash functions for 2^-x false positive rate).

The conclusion of the paper is roughly that you can create new hash functions by recombining the outputs of two initial hash functions without compromising the statistical integrity of the filter, and this makes querying the filter a lot faster.

To be clear, even the two initial hash functions can be halves or quarters of a single hash function with an output larger than the filter needs, e.g. a filter that needs 64-bit hashes can run entirely on Murmur-128 using its bottom and top halves as the two hash functions.

llimllib · on July 27, 2022

The bloomd source I linked above provides an excellent simple practical implementation

llimllib · on July 27, 2022

hah yeah I noticed a minute later, the link is right in the comments, and I'd already linked the paper in my article! I feel like this is one of those things I re-learn every 5 years

gallypette · on July 28, 2022

Ho wow, I did not expect an hacker news front page code review today. As you can see I am not a seasoned C developer so that was very welcome :)

jstrieb · on July 27, 2022

I appreciate the Go-style docstring comments in the code!

For anyone trying to understand Bloom filters who may be interested in checking out an alternative C implementation (designed to be compiled to WebAssembly), I'll link mine below.

I used it to build a Bloom filter of every Hacker News submission ever for a privacy-preserving browser extension that tells you if the page you're currently on has been submitted before.

https://github.com/jstrieb/hackernews-button/blob/master/blo...

drfuchs · on July 27, 2022

Could be faster still if it didn't calloc/free a "Fingerprint" buffer each time you add or query an entry. Since its individual BloomFilters are not thread-safe anyway, it could allocate the buffer once at BloomFilter creation time and be done with it.

malkia · on July 27, 2022

Sorry to be pedant, but using good prefix, or even suffix for C library is a must, and come up with something unique (and "fleur" is kind of cool), probably good idea to prefix with it, and then it can be used without any current or future symbol clashes.

lambdaloop · on July 27, 2022

I love that name! Based on the metrics at the bottom, is it right that it's 3 times faster than the Go implementation? What makes it so fast?

FreakLegion · on July 27, 2022

Without looking at the implementations I'd guess it's Go's bounds checking.

khoobid_shoma · on July 27, 2022

Was it necessary to use Ninja?

glouwbug · on July 27, 2022

The best approach is bash + cmake + ninja + github actions with a single main.c /s