Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: A portable hash map in C (github.com/e-dant)
176 points by e-dant 40 days ago | hide | past | favorite | 90 comments



I recently coded a linear-probing hash map in C and I highly recommend you to use fuzz tests. These evolve naturally from unit tests: next step table unit tests, next step property test, next step fuzzing, all steps are incremental, hence easy.

Having a fuzz test was in-va-lu-ab-le. I caught several bugs right there. The delete function was particularly tricky.

I ended up with two fuzz tests as my hash table has a key feature: convergence. Having same contents, it would have exactly same bits in the buffer. In other words, it is independent of the insertion/deletion order. For this, I added another fuzz test. I would add a third one if I realize there is an important invariant I did not fuzz test. That is not much work, but so much useful!

https://github.com/gritzko/librdx/blob/master/abc/fuzz/HASH....

https://github.com/gritzko/librdx/blob/master/abc/fuzz/HASHd...

P.S. In your case, I recommend clang's libfuzzer with ASAN.

https://llvm.org/docs/LibFuzzer.html#fuzzer-usage


Here's another example of unit tests for hash tables (and RB-trees and skiplists and…)

https://github.com/FRRouting/frr/blob/master/tests/lib/test_...

(this file is #include'd from another, with some #define set up:)

https://github.com/FRRouting/frr/blob/master/tests/lib/test_...

The "tribe" of C developers with distaste for the preprocessor will absolutely hate how this is written. But it works and caught real issues. It doesn't use an external fuzzing tool but rather a deterministic PRNG (see prng_* calls) to get a good mix of call combinations. Coverage analysis was done to make sure it reaches everywhere.

(disclaimer: I wrote this. And it did catch bugs. Also it really should have a lot of comments.)


If you are in a tribe not hating on pp-macros, you might find this approach for testing/debugging data structures (or even the surrounding pp-based "CTL"-like C templating idea or abstract ways of dealing with the BST zoo) interesting:

    https://github.com/c-blake/bst/blob/main/test/bst-interact.c
Essentially, you just write a very simple (some might say trivial) interactive command-shell that can include an auto-checking upon edit of (in that case tree) invariants keyed-off an environment variable. The only missing piece is some little tool to generate random inserts/deletes/etc.

If the micro-shell has a pretty printer then it is easy to "replay" any failed series of edits with some "p" commands before & after invariant failures.


This is incredibly funny to me, because that's (almost) exactly how some of our other tests work :D

(Almost: we have a full-featured command shell and just use that for testing)

Shell bindings: https://github.com/FRRouting/frr/blob/master/tests/ospf6d/te...

Input: https://github.com/FRRouting/frr/blob/master/tests/ospf6d/te...

Expected output: https://github.com/FRRouting/frr/blob/master/tests/ospf6d/te...

Absolutely agree this is a great option to have for some kinds of tests.


One reason to have a razor thin custom command shell (perhaps obeying similar conventions across data structures..) is that the parsing for such can be so fast/consistent that you can also use it for benchmarking/perf regression testing with a workload generator. You might also find & record "anomalous" workloads that way or write a little "translator" (some might say "compiler") from a log to such a workload or etc.

I have done this for data structures even as fast as integer-keyed hash tables (though in that hyper-fast case you might need to try to measure & subtract off parser-loop/IO dispatch overhead and/or need statistical testing for "actually even different at all", perhaps along the lines of https://github.com/c-blake/bu/blob/main/doc/tim.md).


fuzz testing vs HN-popular-post testing, go!


A bug, I believe: If you "put" three colliding strings A and then B and then C, and then "delete" B, you won't be able to find C anymore.


Good catch! Yes, that looks like a bug :)


You are implementing a closed hash table with linear probing. You need tombstones to mark deleted items, or better, move other items to replace deleted items [1]. Currently your library doesn't have either mechanism.

[1] https://en.wikipedia.org/wiki/Linear_probing#Deletion


If it helps, my book Crafting Interpreters walks through a linear probing hash table implementation in C including handling deletion:

https://craftinginterpreters.com/hash-tables.html#deleting-e...


Here's some food-for-thought questions:

First, a correctness question:

- if a struct contains padding, the value of these padding bytes is undefined. What happens if you use such a struct as a key?

And then some integration questions:

- how do you make this typesafe against mixing up 2 distinct uses of hashmaps in the application?

- how do you make this typesafe for its keys and values, i.e. remove accident-prone casts on read access?

- how do you not call the compare and hash functions through function pointers? (Calling through function pointers is both performance relevant as well as a target for exploitation by overwriting the pointer)

(None of these have "nice" answers. But thinking about them is IMHO extremely valuable since these are the most challenging in organizing and engineering an actual codebase itself.)


> - if a struct contains padding, the value of these padding bytes is undefined. What happens if you use such a struct as a key?

It's undefined for uninitialized objects. If I understand correctly, when an initializer is provided the padding bits are set to 0. Additionally, setting the whole chunk of memory to 0 w/ memset would work.

https://stackoverflow.com/a/37642061/24042444

> - how do you make this typesafe against mixing up 2 distinct uses of hashmaps in the application?

Lacking templates, the user of the library could create slim wrapper methods for each type (and coerce to void* to call the underlying hashmap lib); You could still accidentally call the wrong method but it would help a bit. I suppose you could wrap the real hashmap in a new type to help further (so each set of wrapper methods for a given type would be forced to use the same wrapper struct).

Alternatively, the library could provide helper macros to accomplish the same.


> If I understand correctly, when an initializer is provided the padding bits are set to 0.

https://interrupt.memfault.com/blog/c-struct-padding-initial... looks at this (albeit a bit outdated; clang 13 and GCC 11) and claims to see undefined values even when initializers are given, at higher optimization levels

> Additionally, setting the whole chunk of memory to 0 w/ memset would work.

This seems to be the only thing that can be relied on to work (… sadly)

An alternate answer is: don't treat structs as chunks of bytes when the chunk crosses padding (i.e. create functions to hash specific structs)

Another alternate answer: remove the padding and add "_pad12[34]" members instead. But then you need to figure out where you actually have padding, which is architecture dependent…


Thanks for sharing. Simplicity is a good thing.

In order to speed it up by reducing the number of malloc() calls, it may be worth adding a simple arena memory allocation measure: by allocating one larger block (e.g. 1 MB) initially and then doubling the memory size each time you run out, all malloc()/calloc() calls can become local salmagundi_alloc() calls that are just macro invocations that return an arena buffer pointer and also increment said pointer as a side effect.

I also recommend you have a look at Chris Hanson's book "C: Interfaces and Implementations", which has a few neat C API tricks that your code could benefit from (e.g. for reducing name space pollution, for avoiding null pointer argument errors, API method naming etc.).


If you double a memory block with realloc, the memory may be relocated to a different address (at least on MacOS). Then all pointers pointing to the old block will be invalid. This can be addressed by adding new blocks to a list. It will be more complex than a minimal arena allocator.

Furthermore, a typical arena allocator only grows but doesn't shrink. A deleted item from the memory block will still hold the memory and will not be released. You will need a more sophisticated allocator for deletion. For fixed sized items, memory pool may be an option.


Row 100 of hm_put() clears all contents of the item struct, thus losing the pointer to where the previous value was stored. But then row 107 needs this pointer to pass it to realloc().

Additionally, in case of overwrite, the memory previously allocated for the key is leaked.

All in all, I don't think row 100 is really needed, and removing it wouldn't do any harm.

Finally, deletes are hard in hash maps. Either you decide that deletes are not allowed, or you implement them with tombstones or re-insertions. A half-backed solution is not an option.

Hope these suggestions help you.


I agree. How to deal with deletions is a 101 of hash table implementation. Many CS majors could get it right or at least know about the caveat. It is amusing that many high-quality posts on hash tables are buried in the "new" section of HN while this half-baked toy floats to the front page, with 83 github stars at the moment. People don't know what they are upvoting for.


> Finally, deletes are hard in hash maps.

Deletes are hard in open hash maps like here; they're trivial in chained hash maps.

(I'm only posting this because this seems to be an exercise/learning-ish HN post and I think it's a relevant point to not forget.)


Right.

Thinking about it, I don't grasp why the author has opted for linear probing here. Given the amount of malloc(), they might have used linked lists as well.


Well, chained hash maps are significantly slower in the face of collisions due to the memory fetch latency hit to grab the next item... even ignoring the academic O(n) lookup, memory is often the primary perf constraint these days...

And for an exercise, it might be worth implementing linear probing just to get more learning experience out of it(?)


But in this specific implementation the keys are stored separately, each one at its own malloc()ed address. So its performance is limited by memory latency, cache (non-)locality, etc. just like the chained-lists variety.


Considering there seems to be at least a one memory bug (hm_put/key from another comment), I would strongly recommend running the tests in valgrind or similar if you haven't already. Doing C without it usually ends in some kind of disaster for me.


What kind of disaster? Wouldn't OS security prevent userspace program do destructive things?


It depends.

You can corrupt memory in such a way that allows executing arbitrary code, in which you could do anything that the process has privilege to do, which is probably reading and writing files as the current user and a lot more.


A userspace process can be capable of pretty destructive things, and they're all on the menu in C, especially without Valgrind.


This is neat, and remarkably small. I personally need more comments in order to follow, say, the growth logic. I see it rehashes but I don't see how it doesn't potentially overwrite entries on the way.

Algorithm for hash collisions is just to find the next empty slot (linear probing)? What happens when the original entry is removed, are the ones in subsequent slots cycled backwards? I don't see that...

Also the name is new to me! TIL 'salmagundi' is an English word (not even a loan-word) for a type of salad made of many ingredients: an excellent name for a hash map that can contain many effectively random items.


I like the exercise of trying to find the simplest reasonable solution to some problem.

Many of my toy and hobby projects are exactly that. Some make allowances for the sake of performance and generality.

The hash map, though, is up there with sorting algorithms in the breadth of wild implementations. Salmagundi is unlikely to be the fastest, or the smartest. But it’s cute and portable.


It's neat. I like it!

I edited my comment with some more questions, btw, likely as you were answering - apologies if that ended up confusing.


There’s no slot cycling logic. When a k,v pair at some index is deleted, and a different k,v pair has a matching hash, there’s a (small?) penalty in the hm_get logic that will eventually the right k,v pair.

I need a bit more clarification on your first question. There should be no mutation of the underlying map when it grows — we’re just allocating more room for entries and adjusting the old map’s item “ownership” — moving pointers->items around.


Nevermind — there are ways for deleting colliding items to cause other colliding items to be “forgotten”, and one solution would be to implement slot cycling on deletion


Another solution could be tombstones: When a slot gets cleared and the next slot has an item, insert a placeholder value. During linear probing, treat the placeholder as a normal item (that doesn't match); during insertion & deletion treat the placeholder as an empty slot.


The etymology of 'salmagundi' seems pretty obscure. It might be a loan word?

https://www.nytimes.com/1978/01/16/archives/the-salmagundi-d...

https://en.m.wiktionary.org/wiki/salmagundi


I'm curious about the #ifdef guard, which is of a form I've never encountered.

  #ifndef BD9DF82A4540BB19368E48E4747C0706
  #define BD9DF82A4540BB19368E48E4747C0706
Does some IDE do that for you? Did you just pick a random sequence? Why not just a variant of #ifndef _salmagundi_h that is more common?

Other than that, I strongly echo other's advice about avoiding so many mallocs.


Amazing how hard some people try to avoid the obviously correct and portable "#pragma once".


It is neither portable nor obviously correct.


Yes, it's both. It works fine in real world million line codebases. I have never, not once, seen a bug or portability issue caused by it. I have seen multiple header guard macro names collide.

Eschewing pragma once these days is pure ignorance and superstition --- then again, so is writing new code in C at all.


This good for you, other people reported problems with pragma once.

And I think C is one of the better languages you could use to write code today. Also from the programs I use daily, the more stable and reliable ones tend to be written in C.


Yep, header guard name collisions, and also typos in header guards so they don’t work as intended.


for me it's mostly muscle memory that takes a few seconds at the beginning of a file.

i#ifndef __foo_h <escape>yyp:s/ifndef/define/<enter>o<enter>#endif<escape>


What's wrong with `#pragma once`? (honest question, i only do C and C++ as a hobby)


You can't target ancient compilers.


It is not supported everywhere and it uses some heuristic to determine that the file is actually the same one as used before, and this heuristic sometimes fails in complicated scenarios, e.g. with multiple shared directories, link, or because a file was copied.


Nothing is wrong with that.


One thing is wrong with it: a large part of C developers either doesn't know it exists, or doesn't know what it does.

… which I'd fix by using it more :D


It doesn't exist in C. It's solely a compiler extension. Include guards get the same performance benefit as #pragma once on compilers that support it. There's no point in embracing laziness with no upside.


There is an upside: complex header files often contain a bunch of #ifdef … #endif blocks, especially when dealing with cross-platform support. Since indentation is generally not used there, they can be rather hard to read, especially when nested.

Not having the include guard use the same mechanism makes these a bit easier to read and removes a little bit of cognitive load. (And every bit of cognitive load counts since that's a very hard limit of people's brains.)

Personally (and for the projects I work on), I couldn't care less whether it exists in ISO C or not. I have a list of compilers and platforms I want/need to support. But of course your project may be different.


> Since indentation is generally not used there,

Then indent your macros for readability. clang-format can do it for you if you don't want to maintain them.


Thanks but I'm not asking for a better way to handle #ifdef's, indenting them is extra work¹ and there's still an unnecessary outermost block for the include guard. You've also not made any further argument against the use of "#pragma once".

¹ especially across an open source project with more than a hundred contributors from a multitude of companies.


Pedants don’t like it.


It's the length of an MD5 hash, which I'd wager isn't a coincidence. It's not "salmagundi", "Salmagundi", or "salmagundi.h" though.


My C is quite rusty, so apologies for stupid questions.

In the hm_put function, when you overwrite, why do you malloc and copy the key again, and what happens with the old key pointer and the old value pointer? (no free for the key and the memset zeros everything?)


Nice catch :)

Looks like the key logic is missing an overwrite case, the value is reallocated on overwrite.


In the code example, I guess the variable `k_sz` is undefined in the call `hm_get(map, k, k_sz)`. Hope that helps.


In reviewing hm_put, a few things stand out in the area of handling cases where memory allocation fails. The first problems are at lines 113-116 where memory is allocated and written to, then the allocations are validated in line 117 after they are written to with memcpy. An allocation failure would cause a crash. Secondly, more about semantics, is at line 105 where item->v is free’d after it is found to be NULL, to me it is counter intuitive and unnecessary. Reading further (and back to the first area of concern) at lines 118-119, after an allocation failure, both are free’d which, to me, is acceptable and convenient. In light of that, it seems that line 105 is there for the sake of the reviewer who may not like the apparent asymmetry of the two sites.


I want to thank everyone here who pointed out deficiencies in this little library.

Good catches :)


Not getting all defensive is a superpower!


I followed the link in the docs to the page on the hash function. There's a DJB-2 algorithm on that page, but no DJB-1


Huh — I must have written djb1 in error


Purely out of interest, and probably my bad, but what is:

    (void)k_sz;
doing? I've seen fussy people doing:

     (void) printf( "hello" );
but not what you have written. As I say, probably ignorance on my part.


It's an idiom for telling the compiler “don't give a warning on this argument not being used, I know about it”. Not all compilers will heed it. Other choices include e.g. __attribute__((unused)) (GNU), [[maybe_unused]] (C++17/C23) or just not giving it a name (nonstandard, but very commonly supported).


> or just not giving it a name (nonstandard, but very commonly supported)

It's only supported by gcc 11+ and clang 11+ (both from ~2021) because it was a suggested C2x extension. It's been accepted in C23 so other compilers will eventually support it, but I wouldn't hold my breath regarding when that happens.


Ah, of course, in C mode it's different. (I nearly always use C++ mode, where at least GCC 10 happily takes it.)


OK, thank you. But why do the functions have those unused parameters?


I would assume that is because the hash table takes in a “hash this key” function pointer and a “compare these keys” function pointer, and those must contain the size for variable-length keys. So even if your user-supplied functions know the length, or only care about the first byte, they have to conform to that API, and then you get an unused parameter.


Have functions with different names and different numbers of parameters?


And then have a union between the two different kinds of pointers, and a tag to tell which one is in use, then test that tag everywhere? Why? What good does it bring for that extra complexity?


Doesn't the name of the function select for the types and use of the parameters? But I haven't written in C for a long time, and this I think could be done more clearly in C++.


You're saying you would write the entire hash table four times, with everything duplicated and no reuse? Again: Why? What do you gain compared to one line to suppress a warning?


Clarity? And in C++ at least, I would re-write the functions.


The point is that you can pass any of the hash functions (or your own compatible) to the datastructure, so it'll obviously call of them with the same set of parameters. And hash functions that need all of them will ignore them.


That would be a bad API design.


That would be one way to silence an unused variable warning


This is to suppress a compiler warning that the k_sz argument is unused.



It silences 'unused variable' warnings.


If you’re interested in this, I wrote a blogpost on a simple c hash table without deletion.

https://www.davidpriver.com/c-hash-table.html


I suggest improving the probing a little. Something like

  idx = (idx + 1) % map->cap;
becomes

  iterCount++;
  idx = (idx + iterCount) % map->cap;


It could be improved even more, performance wise. The potentially expensive modulo could be avoided entirely with an if statement. Or, only use powers of 2 for the capacity, and then you can also use bit wise ops


This web site is amazing. We have, on one hand, people explaining machine sentience and quantum breakthroughs, and on the other hand, we have people refusing to use "#pragma once" in C because it's still too novel and doesn't work in every compiler yet.

Never change, HN.


[flagged]


My auto-formatter always changes it like that and I just don't understand why `char *foo` is better than `char* foo`.

I consider `char*` to be the type of `foo`. What's the rationale of having `char *foo`?


Dunno how this is supposed to be worded, but it's because the "pointerhood" is tied to the variable being declared, not the type itself. This becomes obvious when you declare multiple variables at once.

  char* cat, dog;  /* one char pointer, one char */
  char *cat, *dog; /* two char pointers */


The minor problem is that a typedef pointer breaks this pattern:

  typedef foo * FooP;
  FooP a, b; // Both are pointers
Pointer typedefs are misguided but there is no denying that C is inconsistent on whether the '*' is part of the type or not.


Right. It would be nice if C allowed types and variables to be separated, we could even defines arrays like this:

  char[8] buffer;
Alas, it's not the syntax Dennis Ritchie settled upon.


The trade-off is one vs. two operator sub-syntaxes or in other words "sub-syntax economy". As it is, C has just one operator expression syntax (and one set of operator precedences/associativities). The "forward applicative" expression syntax is "reused" in the "backward" or "inverse" type declarations.


It's purely cultural. "char* foo" or "char * foo" does make more sense than "char *foo". But culture is a very strong and useful thing; if it "looks weird", that will create issues in people's brain when thinking about the code.

Also, case in point: how do you format "char * const foo"?


Redundant cues are also strong. So, for example, C is largely whitespace-insensitive and operator precedence can matter a lot. Consider "x+y*z". If you always spaced this "x+y * z" it would be the same to the parser but less clear to a human than spacing it "x + y*z" where you have used whitespace as a redundant cue to operator precedence.

Similarly, what makes C type declarators tricky for many is their "inverse nature" - "char *foo" means, applying the operator "*" to the identifier "foo" yields a base type "char". So, the "char* foo" form is literally creating a cue in the opposite direction of operator precedence just like "x+y * z" would. For just one indirection it's no big deal, but for things like "char *foo[]" and beyond it becomes more important. So, the "mis"-spacing "char* foo" sets people up for later conceptual failure.

This is in addition to the above point by @onre about multiple identifiers and @anacrolix's initial rather, ahem, pithy take. There is also another "where would ()s be redundant/unneeded" rationale (though this largely recapitulates the 2nd paragraph here). If you are looking for a catchphrase then maybe it could be "Don't space-against-syntax".


I do not think this analysis is correct. "char *foo" is not a cue in the wrong direction, it indicates that in "char *foo, bar" only foo is a pointer. Whether the syntax for declarations in C is a good or idea or not is a different question, but it is not misleading in the same way "x+y * z" is.


You seem to have misinterpreted my comment since we agree that the "no-space-before 'foo' way" is truer to the syntax, and to repeat myself, I had said "char* foo" is the bad cue and even mentioned @onre's multiple identifiers supporting point (which you reiterate). It can be hard to see whitespace on HN comments, though unless you do the 4space indents to get a monospaced font. You can in-page search, though.

The only difference with the spacing against syntax in "x+y * z" is that there are fewer type-relevant operators (e.g. '+' does not work in a type expression). One does NOT hear people advocating for more space around '*' than '+' (though I have sometimes heard "equal space"). Your non-disagreement has not persuaded me that this is a bad way to explain the problem, although I agree I could cut down on the double negatives. :-) { Even grade school kids have some sense of precedence of '+' & '*' which is so broadly adopted that even hand calculators do it. }

"How misleading" spacing against syntax can be is admittedly more subjective. I think whitespace can really matter to human readers. There are many other examples. gcc added -Wmisleading-indentation a while back. Off-side rule languages like Python embrace things that sometimes prevent that problem. The Nim compiler was once sensitive to exactly this kind of operator spacing where it would parse "x+y * z" { space around '*' } as "(x+y)*z" in an attempt to have the compiler "Do what I mean".


Ah sorry, I then I have indeed misread your comment.


I always get confused with `const` so I avoid using it hah. But If I had to, I'd either use `const char * foo`, `char * const foo` or `const char * const foo`.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: