Hacker News
15-line hash table in C (archbsd.net)
168 points by api on Dec 12, 2014 | 97 comments



People understand that the compiler/executable doesn't run any faster the fewer newlines there are, right?

This is cute, but as others have pointed out, it isn't really a correct implementation of a hash table. Also, it wouldn't pass a code review anywhere I've ever worked.


> This is cute

Well that's the whole point, right? This is clearly written for fun, it's not production code, it's not supposed to pass a code review. It's just supposed to be cool and interesting, and a challenge for the author.


Exactly! Building cute but ultimately useless[0] programs is a lot of fun and makes for great brainteasers. It's also a great way to interact in a friendly but still technical way with your programmer peers. I recommend it to everyone!

[0] - http://lelandbatey.com/posts/2014/09/binary-tree-printer/


> People understand that the compiler/executable doesn't run any faster the fewer newlines there are, right?

As much as others understand that it's not like this is some obfuscated thing with some bizarro code that eats newlines for breakfast.

It merely avoids 4-5 newlines anybody can easily add, only affecting 1 or 2 statements each. With the newlines expanded it would still be as small as it is.


There's no bounds testing in hget(): some valid sequences of operations will cause buffer overflows. For example, this should segfault:

    int (**table)[2] = hnew();

    for (int j=0; j<40; ++j) {
      hset(table, (10 + j*SIZE), 0);
    }
The problem is, the probing function doesn't wrap (the "t += h" part), so if you have several colliding keys, it will probe for them past the end of the table.


Yeah... it’s easy to write small code that doesn’t work.


It's not hard to fix, however - one extra modulus, as far as I can see.
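
A minimal sketch of that fix, assuming the hget() signature and table layout quoted elsewhere in the thread (SIZE pointers to int[2], key stored in element 0). Note it still spins forever if the table is completely full:

    static int (**hget(int (**t)[2], int k))[2] {
        int h = k & (SIZE - 1);
        while (t[h] && (*t[h])[0] != k)
            h = (h + 1) & (SIZE - 1);   /* the extra modulus: wrap instead of walking off the end */
        return t + h;
    }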


Technically there is bounds checking

  ... & (SIZE - 1)
but it looks broken because t does the C-ish mutating-accumulator thing rather than acting as a base and using h as an offset.

You need a temporary variable of

  int (**)[2]
to avoid doing two additions per iteration, though maybe the compiler can pick that up and do it for you. Anyway, this no longer crashes, but it now runs forever if the hash table fills

https://gist.github.com/pwr22/a08597e475d1aa44cd96

It will still fail looking up a non-existent key, which I don't understand


> Due to C there is no way to declare both an int and that in the loop preamble so you'd need at least one more line

In C99 you're allowed to declare variables in the loop preamble; that restriction only applies to ANSI C (C89).


Not differing types


If you really want to, I think you might be able to declare a struct in there. A curse on your house though....


In the end I wanted the variable to persist beyond the end of the loop anyway


Sorry, non-existent keys return a slot but since it isn't allocated it blew up when I was trying to print out its values :(


And an extra local variable. Which might take up a line of source code!


With two additional casts you can make it compile[1] as C++ - without the incredibly wasteful extra local variable.

[1] http://codepad.org/


Your link to "http://codepad.org/" doesn't show any code. Did you forget some arguments in the URL?



Every time a lookup is performed, isn't it linearly looking through the table to find that key? That doesn't sound like a hash! Maybe I am missing something here.


Instead of a linked list to store hash collisions, it's using linear probing, which just uses the next open spot in the table. (Maybe I'm misunderstanding the question, it looks like they're indexing into the table correctly to me)


h seems to be the hash.

    h = k & (SIZE - 1)


> There's no bounds testing in hget(): some valid sequences of operations will cause buffer overflows.

It wouldn't be C if that wasn't the case.


I love how a comment thread about a fun little program turned into HN showing off their wisdom in software engineering coupled with occasional boasting about tight code reviews.

I will never understand how and why people's egos get threatened by things like this.


I actually enjoyed reading this thread. Yes, there are some snarky comments about missing newlines, but there are also a lot of constructive comments here!


This code reminds me why clever is the enemy of good.


Exactly. Why do people try to write code in fewer lines? More lines == better readability, better debugging, easier to move around, easier to fix single parts and so on.


I do not code much, but I have to read a lot of code when I investigate bugs. I rarely see very concise code, but I quite often see verbose code. The readability of verbose code is very bad: I have to scroll a lot more to understand the algorithm, and often it is split across many files.

In my experience more lines == reduced readability, more difficult to move around, more difficult to fix single parts (any change requires modifying many files) and so on. More lines of code often works against simplicity.

I have seen code that was very difficult to understand. Rewriting it to make it simple never required increasing the number of lines.


I agree in that I'd much rather read terse, expressive code than pages of drivel to convey anything meaningful, having seen a lot of verbose code too. That said, code should be well formed. When I started out in C I only used curly braces when I needed to. After creating several bugs stemming from this I now add them as a rule of thumb. That particular issue is compounded here by the fact that multiple statements are shoved onto a single line - also poor form.
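
For anyone who hasn't been bitten by it, this is the kind of bug the always-use-braces rule guards against (a minimal made-up example, nothing from the article):

    #include <stdio.h>

    int main(void) {
        int err = 0;
        if (err)
            printf("error!\n");
            printf("cleaning up\n");   /* runs unconditionally, despite the indentation */
        return 0;
    }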


I think it's a matter of balance - at one extreme is IOCCC-style excessive terseness+obfuscation, and at the other extreme is the type of excessively indirect code often referred to as "enterprisey". The average probably varies depending on the culture of the language (contrast the typical C project with a Java one, for example), but over the years I've found that as my skill with a language improves, my code tends to become more concise.

Whenever someone brings up the "cleverness" argument I always like to mention this counterpoint: http://www.linusakesson.net/programming/kernighans-lever/


> More lines == better readability, better debugging, easier to move around, easier to fix single parts and so on.

Disagree. There's a sweet spot. Code that is too dense (golfing code) is unreadable but so is code that's too long. Long code can be readable if the substructure is evident, but it's usually not.

There are two issues that come to mind on this. One is that concise code usually requires strong familiarity with the language. For example, Haskell allows dense and readable code-- if you know Haskell. Same with Clojure (Lisp). I hate Java, but one thing I'll say for it and its sprawling code is that it's relatively easy to (superficially?) understand (with an IDE to jump around, because modest programs can be thousands of lines) with a mediocre knowledge of the language.

I'd say the same of research-level mathematics. If you're intimately familiar with linear algebra notation and terminology because you use it every day, it's concise and readable. If not, then "(X^TX)^(-1)X^Ty" looks like line noise.

The second issue is that what tends to make large codebases unreadable is a failure of factoring. It's not that it's 500 lines that is the problem, because the intrinsic complexity of the problem may be at a level where 400-500 lines is appropriate. It's when you have 500 lines in one class (ultimately, "class" and "object" are poorly-defined and that's why so much OOP code is trash) or, worse yet, one function/method (method is just a code-name for "locally interpreted stateful function") that you get hell code. This happens a lot in large imperative codebases (typically not becoming intractable until you have two or more programmers on the same code) because, in imperative code where an action (as opposed to stateless function) is the fundamental unit, things appear to compose better than they actually do.

In other words, it's complicated. In general, there seems to be a density level that accurately represents the intrinsic complexity of the problem. Go denser and you're forcing people to think in a specific (and poorly communicated) way to understand the implicit structure and policies that informed the code construction. On the other hand, typical sprawling Java code is great for testing your IDE's scrolling feature but the amount that can fit on a single page is low, and if this is coupled with long methods and huge classes, you get OOP (here, meaning "business OOP", not "Alan Kay's vision of OOP") spaghetti code.


Exactly... I made a similar comment about making it more readable while increasing the lines to 100 and got downvoted. Folks, any dev can read 100 lines of code...

Maybe the downvote is OK - I'm not grokking the essence of it - but similar bits of code written in Perl are notorious jokes.


(See http://pastebin.com/KypbginU)

// corrected overflow hget

  for (int k2=k,o=0; t[k&(SIZE-1)] && **t[k&(SIZE-1)] != k2 && o<SIZE; ++k,++o);
    return t+(k&(SIZE-1));
// hset now allows overwrite

  for (int (**a)[2] = hget(t, k); a && (*a || (*a=malloc(sizeof(**t)))); (**a)[0]=k,(**a)[1]=v,a=0);


I can see two legitimate cases for writing very short, very dense code.

1: Implement an algorithm in a very concise and straightforward, even if not very efficient, way. An example is the classic quicksort in Haskell, which most literally implements the idea of the algorithm:

    qsort [] = []
    qsort (p:xs) = qsort [ y | y <- xs, y < p ] ++ [p] ++
                   qsort [ y | y <- xs, y >= p ]
2: Implement an algorithm in a super-efficient, while non-obvious, way. An example is the inverse square root calculation made famous by John Carmack (http://www.codemaestro.com/reviews/9).

The linked hashtable implementation, to my mind, is neither very elegant nor fiendishly clever, nor even reasonably correct.


There is something funny about stating that it most "literally" implements the idea of the algorithm, when the main "idea" of the algorithm is to be quick. :)

Also, even in examples of super efficient ways, be wary. The inverse square root you are referring to is actually slower than what many CPUs can do with a single instruction nowadays.

Also, I think you are missing out on the main reason this code was written. Essentially a puzzle to see if it can be done.


Just because it's one instruction doesn't make it faster


In the case of a reciprocal square root estimate it probably is, particularly compared to the code given. A fast reciprocal square root estimate function usually takes less than 5 cycles - and is less than 1/4096th out, so the accuracy is better too.
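
For the curious, "single instruction" here means something like the SSE rsqrtss estimate; a sketch of how you'd reach it from C via the intrinsic (the wrapper name is made up):

    #include <xmmintrin.h>

    /* Reciprocal square root estimate: one rsqrtss instruction,
       relative error bounded to roughly 1.5 * 2^-12.             */
    float rsqrt_est(float x) {
        return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    }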


Thank god, I was just about to run out of newlines!


For those of us who actually ship code, I'd like to suggest khash[1], which is a pretty nice implementation of a hash table in C. It uses lots of macros, so if you're looking for a brain teaser you will be satisfied as well :)

[1]: http://attractivechaos.github.io/klib/#Khash%3A%20generic%20...


uthash[1] is another good solution. It's pretty bulletproof and has lots of options. Also implemented with macros, if that's a plus :-).

[1]: http://troydhanson.github.io/uthash/


I don't think being implemented with macros is a plus. I've seen friends use it and get stuck whenever it failed, because it's impossible to see how it's working. It's the same type of cleverness some people criticize in other comments. Yes, it may be complete and useful, but it's opaque.


See also clibs/hash[1], which provides a friendly API on top of khash.

[1]: https://github.com/clibs/hash


You'll need the clockwise spiral rule to parse this: http://c-faq.com/decl/spiral.anderson.html


The author helpfully explains the code by way of un-optimizing it to something readable: http://pastes.archbsd.net/graphitemaster/hashtable_explinati...


See, why couldn't the code be written this way by default? Then it would've been understandable from the beginning!


That un-optimized version also helpfully lacks the buffer overflow bug.


The "clockwise spiral rule" is wrong in general, as Linus Torvalds helpfully explains here:

http://webcache.googleusercontent.com/search?q=cache:V51nbJE...

(Cache link since the original post returns a 500.)



So, I have been looking at this implementation for several minutes. Ignoring what others have already posted about the code being way too dense for no benefit, how is this a hash table? It looks like a bog-standard associative array.


No doubt, I prefer to have actual hashing going on in my hash tables.


Why the hell does this thing get upvoted? What's good about this implementation? It's cryptic, it's buggy and it's useless for any practical purpose.

Why is this better than an array of structs? How much longer can that be?


If you rewrote this exact logic in Python it would be longer, because then the code would actually be readable.

However, fortunately it's just two chars: {}


It seems to me that the exact logic behind those two chars is roughly 150 times longer :) ( https://github.com/python-git/python/blob/master/Objects/dic... )


Is this some common style?

    int (**hnew())
I've never seen parens used like that. Usually it's:

    int **hnew()


[] has higher precedence than *, so without the parens you get "function returning array of two pointers to pointer to int" (which is illegal) instead of the desired "function returning pointer to pointer to array of two int".
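
A quick sketch of the two readings, using hypothetical names f and g:

    /* int **f(void)[2];    reads as "function returning array of two
                            pointer-to-pointer-to-int" -- illegal in C     */
    int (**g(void))[2];     /* "function returning pointer to pointer to
                               array of two int" -- what the code declares */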


So would that be equal to this?

    typedef int arr2int[2];
    typedef arr2int** ptr_arr2int;

    #define SIZE 1024
    static ptr_arr2int hnew() {
        return calloc(sizeof(int**), SIZE);
    }


The parens are used since the [2] is on the end. Otherwise, it would think it's a pointer to a pointer of an int [2] -> (hnew()[2]).


Might be to fix some -pedantic issue.


No, they are necessary here; otherwise it would declare a function returning an array of two pointers to pointer to int (which is illegal), not a pointer to pointer to int[2].


This doesn't seem to use the hash in the first access. So the first hset seems to always go to the first row of the table.

Perhaps this is what is desired (fixing roll over issue as well):

    static int (**hget(int (**t)[2], int k))[2] {
      int (**t_old)[2] = t;
      int h = k & (SIZE - 1);
      for (t = t + h; **t && ***t != k;
           h = ((h + 1) & (SIZE - 1)), t += h, t = t_old + ((t - t_old) & (SIZE - 1)));
      return t;
    }


Everything is 15 lines when you have semicolons.


Won’t hget() easily run off the end of the array if it has a few collisions? The function doesn’t save off the original value of t and then does an unbounded number of (t += h) operations.

edit: someone else pointed this out already. How did this make it to the front page of HN?


Is this even a hash table? It looks more like a regular associative array. It seems like lookups are linear in time, which ditches one of the biggest advantages of hash tables.

Also, since keys are integers, how would this be useful? Isn't it the same as a regular array?


It took me some serious thinking to figure out why you could take **t, despite *t being NULL.

If I managed to understand it correctly, it is because **t is an array, and arrays are special.

*t is a pointer to an array, i.e. it points to the first element of the array. **t is an array, which means it behaves like a pointer to the first element.

The remarkable consequence of this is that *t and **t have the same numerical value, but different types. *t == **t evaluates to true.
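
A tiny standalone example of that decay, outside the hash table (hypothetical, just to show the two values coincide):

    #include <stdio.h>

    int main(void) {
        int a[2] = {1, 2};
        int (*p)[2] = &a;   /* pointer to the whole array                */
        /* p and *p have different types but print the same address:    */
        printf("%p %p\n", (void *)p, (void *)*p);
        return 0;
    }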


A bug nobody has pointed out yet is that **t dereferences a null pointer. It appears to work because gcc decides to generate the same code for **t as for *t (which is correct behaviour if *t is not NULL, and moot if *t is NULL). The code should just be *t there.

Also, sizeof(int **) in hnew should be sizeof(int (*)[2]).


I'd call this 16 lines - multiple statements on one line is cheating; if we're going that route this could be 3 lines.


A small improvement could be to hash the key using a prime multiplier and successive multiplications, instead of using the key and increments of one. It'd reduce the collisions at the expense of a more computationally expensive hash function.
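
Something along these lines, presumably (a sketch; the constant is the usual Knuth multiplier, not anything from the article, and SIZE is assumed to be the power-of-two table size):

    /* Multiplicative hashing: spread the key's bits before masking. */
    unsigned hash(unsigned k) {
        return (k * 2654435761u) & (SIZE - 1);
    }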


The problem with that method is that it doesn't have data access locality, while linear probing does. Linear probing ends up being more efficient because it is easy on the cache.


It seems to me this would only be true if the keys that collide are related to each other, or you have a vastly oversized table that you collide a lot in, at which point you're just accidentally synthesizing a smaller table.

What am I missing here?

[edit] I guess the other situation would be if the keys are largely sequential, but then a hash table seems like an odd choice of data structure.


Linear probing works by initially hashing the value, call it h(x), then if there is a collision, it checks h(x)+1, h(x)+2, ..., h(x)+k, until it finds an open slot. Lookup works the same way, and deletion is a bit more complicated.

This model plays nicely with the cache, although its downside is there tend to be more "runs" of contiguous filled slots in the hash table. This method still provably takes an expected insert/lookup time of O(1) with a 5-wise independent hash function and a load factor smaller than 1.
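
In code, the lookup side of that looks roughly like this (a generic sketch, not the article's code; SIZE is assumed to be a power of two):

    #define SIZE 1024   /* power-of-two table size */

    /* Returns the slot holding k, the empty slot where k would go,
       or -1 if the table is completely full.                        */
    int probe(const int keys[SIZE], const int used[SIZE], int k) {
        int h = k & (SIZE - 1);                 /* h(x)            */
        for (int i = 0; i < SIZE; ++i) {
            int slot = (h + i) & (SIZE - 1);    /* h(x)+i, wrapped */
            if (!used[slot] || keys[slot] == k)
                return slot;
        }
        return -1;
    }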


I do know how linear probing works, thanks. ;)

But now I see where the confusion lies. I was taking your post to be replying more to the hash function part of the GP, but you were talking specifically about the skip distance. Yes, now I see what you mean, and I'm not actually sure how I misinterpreted so badly in the first place.


Let's say you get a collision. Would you rather:

a) Find your correct key within the same cache-line. But, you have to check 4 more values to get there;

b) Find your correct key in the next try...But, you have to jump to another part of the array.

Once you've indexed into the array, you want to read forward from there. You don't want to jump around.

http://preshing.com/20130107/this-hash-table-is-faster-than-...


It uses quadratic probing. Notice how both h and t increase.


(Insert my usual rant about not checking malloc()/calloc() for failure.)
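
For completeness, the checked version only costs a couple of lines (a sketch, assuming the hnew() signature and SIZE macro quoted elsewhere in the thread):

    #include <stdio.h>
    #include <stdlib.h>

    static int (**hnew(void))[2] {
        int (**t)[2] = calloc(SIZE, sizeof *t);
        if (!t) {
            perror("calloc");
            exit(EXIT_FAILURE);
        }
        return t;
    }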


Note that on many systems (e.g. Linux), malloc/calloc won't always return NULL when you're out of memory because of lazy memory allocation policies. It will only crash when you start reading / writing. That makes it arguably less useful to test the return value.

edit: clarity.


However, mmap and the functions which rely on it will return MAP_FAILED/NULL if no gap large enough is found in the virtual address space or if you've used MAP_FIXED and the area isn't free. If you don't check for errors, the application could potentially end up trying to dereference a NULL pointer.


Ok, let me rephrase then: On modern systems, checking the return value of malloc() is not a proper way to check that the system is out of memory and therefore it doesn't guarantee that the program will run correctly in any case.

That's what I wanted to point out, and I agree that it remains useful to detect other types of errors such as the ones you mentioned.


Is this a step towards you understanding you are useless?


Cheerfully, yes!


What's with the hardcoded size though...


Generally it's easier if the height of a hash table is a fixed value (which can be calculated once at runtime based on how much memory the table can consume) so that the hashing logic is:

target_row_number = int_value_of_key % table_height

If the table height keeps changing then this formula would be inconsistent. This code isn't using % but does roughly the same thing with &, since SIZE is a power of two.
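
A quick way to convince yourself the & does the same job (assuming SIZE stays a power of two and keys are non-negative):

    #include <assert.h>

    #define SIZE 1024   /* power of two */

    int main(void) {
        for (int key = 0; key < 100000; ++key)
            assert((key % SIZE) == (key & (SIZE - 1)));
        return 0;
    }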


Just add this line to make it much more readable: typedef int (table_row)[2];

Then replace the int (**...)[2] declarators with table_row **... :)
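
Applied to the signatures quoted elsewhere in the thread, that would look something like this (a sketch, not the author's code):

    typedef int table_row[2];

    table_row **hnew(void);                    /* was: int (**hnew())[2]                     */
    table_row **hget(table_row **t, int k);    /* was: int (**hget(int (**t)[2], int k))[2]  */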


Is learning how to create a hash table necessary for a software engineer?


Yes. Hash tables, linked lists and graphs of some sort are the fundamental data structures that everyone should have implemented at least once by themselves. IMO, the core of excellent software engineering is having a lucid intuition about fundamental things such as those.

Mainly because they are simple, and a lot of problems are solvable by using them without resorting to Someone Else's Gigantic Platform Framework.


In my experience, there's a strong correlation between programmers who don't know how to implement basic data structures and programmers who make poor decisions when picking one from a collections library. The ones who know that they don't know at least stick to fixed-size arrays and pay the performance price.

So, no... and yes. You can call yourself a software engineer without knowing any of this stuff, but you will be a better one if you do.


Absolutely. The hash table is rich with algorithm design and analysis.


Learning what it is and how it works is necessary. From there, any remotely competent programmer should be able to quickly and easily create one.

Programming isn't about learning how to create specific things. But learning how to create specific things can help you gain the intuition necessary to create other things.


In C... yes, it's a basic data structure you'll probably find yourself needing quite a bit. If you program in some other language that already provides hash table functions/types, then maybe you can get by with just knowing how it works so you can figure out when to use it.


Is this a joke?


hset doesn't allow you to overwrite old entries. But you can do it manually with hget.
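
A sketch of the manual overwrite (hypothetical helper name, assuming the hget() signature quoted elsewhere in the thread and that keys live in element 0, values in element 1):

    /* Overwrites the value only if the key is already present. */
    static int hput(int (**t)[2], int k, int v) {
        int (**slot)[2] = hget(t, k);
        if (!*slot || (**slot)[0] != k)
            return 0;                  /* key not in the table */
        (**slot)[1] = v;
        return 1;
    }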


Why would you use a 2-element array of int here instead of defining a struct of 2 ints?


A wasteful struct - are you kidding?


I'm not. Is there some performance penalty for using a struct over a 2 element array?


No performance penalty. It would make the code longer.

Consider "<Foo> in <bar> lines" type articles to be performance pieces - the point is to make it short, not readable or user friendly.
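
For comparison, the struct version of the entry would look something like this (a sketch; the names are made up):

    /* Same two ints, just with names; layout is identical in practice. */
    struct entry {
        int key;
        int val;
    };
    /* The table would then be struct entry **t instead of int (**t)[2]. */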


In this case, it makes no difference. The array is used instead of a struct to save a few lines of code and making the title of this post more impressive.


unreadable code is unreadable


just think how amazing and readable it could be in 100 lines


C++ wins again.


C++ never won.



