Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: libcodr7 – fundamental collections in the spirit of C (github.com/codr7)
67 points by codr7 on Dec 30, 2019 | hide | past | favorite | 51 comments



I didn't realize you could implement a red-black tree with this little code.

The code shown here seems to indicate that insert has at least four cases, and delete has six: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree

But the code I see here is very small: https://github.com/codr7/libcodr7/blob/master/source/codr7/r...

I guess it's not handling delete, which probably accounts for a lot of the difference.

EDIT: I see that it's implementing a left-leaning red/black tree. I see some caution about that approach here? Would love to get the author's opinion about this: http://read.seas.harvard.edu/~kohler/notes/llrb.html


Red-black tree presentations are often more complicated than necessary, mostly because of (premature?) optimizations. A simple presentation can be found here: https://www.cs.tufts.edu/~nr/cs257/archive/chris-okasaki/red...

I haven't checked how this relates to the code in this library.


You'll find delete on top of the todo list, it's more complex but not horribly so.

I skimmed the link, but from what I can see there are no issues there that outweigh the significantly simpler code for me. This may very well change once I get more experience with them, time will tell.


Anyone who does too much C would know that manually implementing associative array (dictionary) is not fun. I started to use two implementations for my work, both worked pretty well for the things they are good at (and all of them are as "generic" as you can get from C):

khash: https://github.com/attractivechaos/klib/blob/master/khash.h

The interface I really liked, and you can customize the hash equality functions.

rb from jemalloc: https://github.com/jemalloc/jemalloc/blob/dev/include/jemall...

The interface is a bit wacky, but it is flexible enough for my needs. rb shines when you need range queries (keys less than, more than etc.)


I prefer ordered sets with value semantics (which allows using as maps without overhead) to hash tables since I find the performance easier to reason about and often want ordering anyway. The easiest one to implement would be a binary searched vector, but comes with the disadvantage of unstable pointers because of reallocations. rbtrees are definitely trickier, though this one is left leaning which reduces the complexity somewhat (see PDF linked from README).


The rbtree implementation from jemalloc allows you to maintain the value semantics (it is an intrusive design as well). I believe it is also a left-leaning 2-3 red-black tree to be exact. The API is a bit harder to understand but once you get use to it, can be pretty easy to use. It also implemented one trick that embeds the color bit into the right child node pointer, otherwise you are paying a full 8-byte size penalty.


Yeah, that's usually where I draw the line, embedding data in pointers :) Bitfields and packed structs is another angle that I usually don't bother with. Too many details to keep in mind for my feeble brain.


If I were starting a new project in C, I would definitely rely on a solid foundation library like APR[0] or CoreFoundation[1] without reimplementing a bunch of fundamental data structure algorithms by myself. Glib is another alternative, but too bloated in my view.

[0] - https://apr.apache.org [1] - https://github.com/opensource-apple/CF


When I use C, it's typically for performance reasons. I need to do funny things with pointers and bit-wise arithmetic that I can't do in higher-level languages because I'm operating in cases where a few more instructions per loop iteration matter. I usually end up re-implementing very fast, cut-down versions of data structures that work only on a tightly-bounded input set for maximum performance (e.g. very fast hash table for IP addresses recently). This approach doesn't work for the cases where C really shines.


Note that the GitHub link you have there is an unofficial mirror and the git version history consists simply of the official source dumps.

To each their own but IMO repositories created in that manner are generally not very useful (though there are exceptions).

The official location of the CoreFoundation (actually CFLite) source dumps is https://opensource.apple.com/tarballs/CF/

The most recent CFLite tarball is https://opensource.apple.com/tarballs/CF/CF-1153.18.tar.gz

You can browse the CFLite source online at the official location as well: https://opensource.apple.com/source/CF/CF-1153.18/

The docs for CoreFoundation are at https://developer.apple.com/documentation/corefoundation


These days you’re probably better off using the version of CoreFoundation that’s maintained as part of the Swift project:

https://github.com/apple/swift-corelibs-foundation/tree/mast...

...as it’s more up-to-date and developed in the open.


I wouldn’t go as far as to say it’s “developed in the open”. It’s significantly more opaque than Swift or WebKit, for example; I’d describe it as “accepting of patches”.


In that case I would rather go all in and use C++, C is all about control and flexibility.


I can't figure out how the linked list is used. Where is the per-element data?

I thought there would be a usage example in the tests, but there wasn't.


These structures are intended to be used like the Linux Kernel's data structures, via `container_of` (Though in the library the macro is called `c7_baseof`. But it does the exact same thing as `container_of`).

They have somewhat of an example usage of the list in the `deque.c` file: https://github.com/codr7/libcodr7/blob/2b598e3d89d878ef53e05...

If you don't know about `container_of`, the idea is that it allows you to take a pointer to an inner field, and transform it into a pointer to an outer 'container'. For the linked-list, this means that if you want to make a list of `struct car` objects, you simply embed a `struct c7_list` into `struct car`, pass the address of that internal `struct c7_list` object to the list manipulation functions provided here, and then use `container_of` (Or `c7_baseof`) to transform the pointer to the `struct c7_list` into a pointer to the containing `struct car`.

In general, I think `container_of` is one of the best things for writing C code, and it provides a surprising amount of flexibility and nice patterns you can use. It does have the obvious issue of type checking (You could have a `struct wheel` that also has a `struct c7_list` on it, and accidentally add ti to the same `struct c7_list` that has `struct car`), and if your `struct car` needs to be on more than one list at a time it can get confusing knowing which `struct c7_list` entries belong to which list, but in general I'd call it a big net positive.

It's also possibly worth noting, you can use this to implement a non-intrusive list if you want, just simply make an object that wraps a `struct c7_list` and a void pointer (Or a typed pointer, if you want), and then write some extra logic around it for allocating nodes and such.


What a neat trick for creating a fairly generic intrusive data structure. I’d love to have (or put together) a little compendium of clever, non-obvious programming tricks like this.


Also sometimes referred to as "intrusive containers." Related SO post: https://stackoverflow.com/q/5004162/9287029 Also provided for C++ by Boost.Intrusive


Back when I independently invented this due to being self-taught, I didn't know about offsetof(), and I just put the collection at the start and cast the pointer. Learning C is surprisingly obscure even today.


Yeah, even just 'upcasting' is surprisingly effective as well, it's really just inheritance. The lack of being able to embed more than one 'thing' is painful though. And if you're not careful you can run into ugly strict aliasing issues like `struct sockaddr` has (Though a quick `-fno-strict-aliasing` can make all your problems go away...)

And I would agree, I find a surprising number of people think C is 'easy' because it doesn't contain that many built-in constructs, and then get completely stuck when they try to use it. To get really good at C, you need a very solid understanding of common and effective patterns you can use. You can still write C without such things, but it quickly becomes a mess of random patterns and inconsistencies. And unfortunately, I can't really think of one source you can really point to for learning them (though admittedly I haven't really looked into it tons).


Casting around between the type of the first struct member and the type of the overall struct is okay under strict aliasing, but unfortunately sockaddr violates it because sockaddr isn't a member of sockaddr_in. Not that I knew this at the time, it was just a mysterious rule.


Yeah, I used to spend a lot of time reinventing OOP in C back in the days. Part of the problem is I learned C++ long before I took a serious look at C.

These days I definitely prefer embedding/baseof since it's more explicit.

When I need polymorphism, stuffing some function pointers in a struct usually works well enough.


Okay, I'm getting it. Not a knock on the author, but looking at this macro'd, pointer-to-void'd stuff certainly makes me appreciate the STL.

Edit: aha, I see "intrusive" was the keyword I was missing to learn all about this different style.


Using `container_of` lets you avoid having to use void pointers. And the macros like foreach are surprisingly readable once you give them a shot, and they're simple enough they don't really have any big gotchas in how they work.

Not every `container_of` data structure is perfect, but in general I'd say they're just as nice to use as any other language's data structures (Though `container_of` is flexible in different ways). The only big downside is that it not part of the C standard, leading to more than a few implementations, some more featureful than others. This one is probably the "least featureful" version I've seen, which may be intentional. The Linux Kernel's implementation has a lot more utility things like looping in different ways and different types of list manipulations. Some are highly useful, others not at much.


Intrusive lists are very different beasts than lists by value.

There are less allocations involved (perf and points of failure), values can be inserted on different lists without new allocations, deletion is O(1), elements can be heterogeneous (different sizes and types), etc

etc

I seldomly use linked lists, but most of the time i prefer them to be intrusive. There of course are intrusive list implementations on C++.


Same here.

Using non-intrusive linked lists just feels wrong once you've gotten used to the idea of embedding links, I find there are always better options.

Intrusive lists on the other hand is simply the best solution in some cases.


> values can be inserted on different lists without new allocations

Not multiple lists at the same time, surely...?


They were talking about taking a value off of one list and adding it to a different one, which doesn't require an allocation, but the entity is never on more than one list at a time in that scenario.

That said, you can add an entity onto multiple lists if it contains multiple nodes embedded inside. You have to keep track of which nodes are attached to which lists though (Which generally just means being consistent on which you use where). If you give them decent names, then it's not usually a problem, but it can sometimes get a little confusing if you're not careful.


Obviously not


It's all about control and flexibility. If you don't need it you shouldn't pay the price. Having spent quite some time trying to bend the STL to my exact requirements, I don't mind so much.


Thanks for explaining that.


Sorry, didn't get around to that yet as it's trivial and used enough internally that I figured it has enough test coverage for now. The c7_list struct is embedded in the items, rbpool [0] uses it to keep track of nodes [1] and slabs [2] for example. Note that the same structure is used as root.

[0] https://github.com/codr7/libcodr7/blob/master/source/codr7/r...

[1] https://github.com/codr7/libcodr7/blob/master/source/codr7/r...

[2] https://github.com/codr7/libcodr7/blob/master/source/codr7/r...


Why have comparison callbacks return this enum

  enum c7_order {C7_LT, C7_EQ, C7_GT};
instead of a signed int which seems to be the idiomatic way in C?


Because I always end up switching on the return value, and switching on enums is a lot nicer than ints. Especially when there is no agreement on what values are returned, LT is mostly -1 but is often specified as simply less than zero.

When you have a finite set of cases, enum is just the right solution period. I have no idea why people still insist on using ints.


So, like http://troydhanson.github.io/uthash/ but without the documentation?

(NetBSD has had an all-macro red/black tree implementation forever, plus the ancient and well-understood BSD queue.h)


From a quick glance; it's hashed (as in not ordered), not pool allocated and intrusive as opposed to offering value semantics. Couldn't get more different.

There's nothing new under the sun, I never claimed novelty. This is simply how I would do it, take or leave.


"in the spirit of C" is pretty ambiguous. What does this do better or differently from other libraries?

There's no working examples shown. There's a "real-world examples" link, but it seems to just point to another project of yours.

There's no documentation of the API. If I wanted to use this, I would need to manually read the header files and hope that I'm using them correctly.

And to top it all off, there's not a single comment in the entire project.

I don't mean to be harsh, but if you want other developers to use this, then you need to provide more resources.


The descriptions of the containers contain more of the kind of information you're looking for, it's a bit much to expect the caption to explain it all.

It's as real world as it gets for now, the test suite should give you an idea.

I will typically add comments where I feel the code doesn't do a good enough job of explaining itself.

One step at a time.

It's ok, I don't expect anything.


I've only glanced at these; can someone more familiar explain how these are different from the BSD "queue" macros (and their offspring, like the tree headers)?


"In the spirit of C". What exactly is meant by that?


The spirit of chainsaws with no hand guards haha.

But actually, to me that means a thin, light API with a couple of functions and some macros. Does what it needs to do, no more. Error handling is basic if even provided, usually via return of numerical error codes.


That's correct, it does mostly what you tell it to and stays out of the way. If you want error checking beyond the test suite you have to do it yourself. The issues people have with C comes mostly from treating it like any other language. C is all about control and flexibility.

Further, special care is taken to reduce the number of allocations, often using pools/slab allocators. And value semantics, or rather byte array semantics, which means items are treated as blocks of N bytes rather than pointers/references.


So what's the advantage of this over existing libraries?


That would depend on which libraries we are discussing, I don't know them all.

This is simply how I would do it based on my experience.


Definitely cool, although the spirit of C is to be overly paranoid about portability and not link to a library, God forbid. (Single pairs of h/c files like this are usually okay with people)


Or you could link to a library written by someone equally paranoid about portability :)


This is exactly why you shouldn't code in C unless you absolutely have to. So many ugly boiler plates and macros.


There are myriad reasons why you might "absolutely have to" code in C. There are even a few reasons you might code in C even if you have a choice.


I sort of agree with the first part, though some people actually enjoy having this level of control and flexibility even when they don't absolutely need it.

I find chatty generic code where everything is nailed to the floor ugly, Rust, C++, Swift, Java etc. So I guess it depends on who you ask and their experience.

There is nothing wrong with macros, though I definitely prefer the Common Lisp kind.


Is there a reason C libraries generally prefix with 'lib'? Or is it just historic?


If you pass -lcodr7 to the Unix C linker, it would typically go looking for a file called libcodr7.so. You could perhaps view this as mainly historic behaviour.

Some libraries like Cairo seem to be generally known as "Cairo", not "Libcairo". But there is also a trend for libraries to have very straight-to-the-point names. If the PNG parsing library is called libpng.so and the XML parsing library is called libxml.so, it would be confusing to call these libraries just "Png" and "Xml".

As an other example, the two C programs "systemd" and "libsystemd" live in the same repo, but are really two different beasts: libsystemd might be useful to you even on a system which doesn't use systemd.


Besides this; codr7 is my alias, and I wouldn't want to be confused with a C library.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: