People understand that the compiler/executable doesn't run any faster the fewer newlines there are, right?
This is cute, but as others have pointed out, it isn't really a correct implementation of a hash table. Also, it wouldn't pass a code review anywhere I've ever worked.
Well that's the whole point, right? This is clearly written for fun, it's not production code, it's not supposed to pass a code review. It's just supposed to be cool and interesting, and a challenge for the author.
Exactly! Building cute but ultimately useless[0] programs is a lot of fun and makes for great brainteasers. It's also a great way to interact in a friendly but still technical way with your programmer peers. I recommend it to everyone!
There's no bounds testing in hget(): some valid sequences of operations will cause buffer overflows. For example, this should segfault:
    int (**table)[2] = hnew();
    for (int j = 0; j < 40; ++j) {
        hset(table, 10 + j*SIZE, 0);
    }
The problem is that the probing function doesn't wrap (the "t += h" part), so if you have several colliding keys, it will probe for them past the end of the table.
But it looks broken because t does the C-ish mutating-accumulator thing rather than acting as a base and using h as an offset.
You need a temporary variable of
int (**)[2]
to avoid doing two additions per iteration, though maybe the compiler can pick that up and do it for you. Anyway, this no longer crashes, but it now runs forever if the hash table fills.
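Something like the sketch below might be what's wanted: a probe that both wraps and gives up after visiting every slot once. It's only a sketch against the int (**)[2] layout discussed here; SIZE and the name hget_wrapping are assumptions, since the original hget isn't reproduced in this thread.

    #include <stdlib.h>

    #define SIZE 1000   /* assumed table size */

    /* Return the slot for `key`: either the matching entry or the first empty
       slot along the probe sequence. Returns NULL if the table is full and the
       key is absent, instead of looping forever. */
    int (**hget_wrapping(int (**t)[2], int key))[2] {
        size_t start = (unsigned)key % SIZE;
        for (size_t i = 0; i < SIZE; ++i) {
            int (**slot)[2] = t + (start + i) % SIZE;  /* wrap instead of walking off the end */
            if (*slot == NULL || (**slot)[0] == key)
                return slot;
        }
        return NULL;
    }

The modulus costs a little per probe step, but it bounds the search to one pass over the table.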
Every time a lookup is performed, isn't it linearly looking through the table to find that key? That doesn't sound like a hash! Maybe I am missing something here.
Instead of a linked list to store hash collisions, it's using linear probing, which just uses the next open spot in the table. (Maybe I'm misunderstanding the question, it looks like they're indexing into the table correctly to me)
I love how a comment thread about a fun little program turned into HN showing off their wisdom in software engineering coupled with occasional boasting about tight code reviews.
I will never understand how and why people's egos get threatened by things like this.
I actually enjoyed reading this thread. Yes, there are some snarky comments about missing newlines, but there are also a lot of constructive comments here!
Exactly. Why do people try to write code in fewer lines? More lines == better readability, better debugging, easier to move around, easier to fix single parts and so on.
I do not code much, but I have to read a lot of code when I investigate bugs. I rarely see very concise code, but I quite often see verbose code. The readability of verbose code is very bad: I have to scroll a lot more to understand the algorithm, and it is often split across many files.
In my experience, more lines == reduced readability, more difficulty moving around, more difficulty fixing single parts (any change requires modifying many files), and so on. More lines of code often works against simplicity.
I have seen code that was very difficult to understand. Rewriting it to make it simple never needed to increase the number of lines.
I agree, in that I'd much rather read terse, expressive code than pages of drivel that fail to convey anything meaningful, having seen a lot of verbose code too. That said, code should be well formed. When I started out in C I only used curly braces when I needed to. After creating several bugs stemming from this I now add them as a rule of thumb. That particular issue is compounded here by the fact that multiple statements are shoved onto a single line - also poor form.
I think it's a matter of balance - at one extreme is IOCCC-style excessive terseness+obfuscation, and at the other extreme is the type of excessively indirect code often referred to as "enterprisey". The average probably varies depending on the culture of the language (contrast the typical C project with a Java one, for example), but over the years I've found that as my skill with a language improves, my code tends to become more concise.
More lines == better readability, better debugging, easier to move around, easier to fix single parts and so on.
Disagree. There's a sweet spot. Code that is too dense (golfing code) is unreadable but so is code that's too long. Long code can be readable if the substructure is evident, but it's usually not.
There are two issues that come to mind on this. One is that concise code usually requires strong familiarity with the language. For example, Haskell allows dense and readable code-- if you know Haskell. Same with Clojure (Lisp). I hate Java, but one thing I'll say for it and its sprawling code is that it's relatively easy to (superficially?) understand (with an IDE to jump around, because modest programs can be thousands of lines) with a mediocre knowledge of the language.
I'd say the same of research-level mathematics. If you're intimately familiar with linear algebra notation and terminology because you use it every day, it's concise and readable. If not, then "(X^T X)^(-1) X^T y" looks like line noise.
The second issue is that what tends to make large codebases unreadable is a failure of factoring. It's not that it's 500 lines* that is the problem, because the intrinsic complexity of the problem may be at a level where 400-500 lines is appropriate. It's when you have 500 lines in one class (ultimately, "class" and "object" are poorly-defined and that's why so much OOP code is trash) or, worse yet, one function/method (method is just a code-name for "locally interpreted stateful function") that you get hell code. This happens a lot in large imperative codebases (typically not becoming intractable until you have two or more programmers on the same code) because, in imperative code where an action (as opposed to stateless function) is the fundamental unit, things appear to compose better than they actually do.
In other words, it's complicated. In general, there seems to be a density level that accurately represents the intrinsic complexity of the problem. Go denser and you're forcing people to think in a specific (and poorly communicated) way to understand the implicit structure and policies that informed the code construction. On the other hand, typical sprawling Java code is great for testing your IDE's scrolling feature but the amount that can fit on a single page is low, and if this is coupled with long methods and huge classes, you get OOP (here, meaning "business OOP", not "Alan Kay's vision of OOP") spaghetti code.
Exactly... I made a similar comment about making it more readable while increasing the lines to 100 and got downvoted. Folks, any dev can read 100 lines of code...
Maybe the downvote is OK - I'm not grokking the essence of it - but similar bits of code written in Perl are notorious jokes.
I can see two legitimate cases for writing very short, very dense code.
1: Implement an algorithm in a very concise and straightforward, even if not very efficient, way. An example is the classic quicksort in Haskell, which most literally implements the idea of the algorithm:
    qsort []     = []
    qsort (p:xs) = qsort [ y | y <- xs, y < p ] ++ [p] ++
                   qsort [ y | y <- xs, y >= p ]
2: Implement an algorithm in a super-efficient, while non-obvious, way. An example is the inverse square root calculation made famous by John Carmack (http://www.codemaestro.com/reviews/9).
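For reference, the widely circulated Quake III routine looks roughly like this; it's rewritten here with uint32_t and memcpy instead of the original pointer-cast bit punning, so treat it as a sketch rather than a faithful quotation:

    #include <stdint.h>
    #include <string.h>

    /* Approximate 1/sqrt(x): the magic-constant bit trick plus one
       Newton-Raphson refinement step. */
    static float fast_rsqrt(float x) {
        float half = 0.5f * x;
        uint32_t i;
        memcpy(&i, &x, sizeof i);          /* reinterpret the float's bits as an integer */
        i = 0x5f3759df - (i >> 1);         /* the famous magic constant */
        float y;
        memcpy(&y, &i, sizeof y);
        return y * (1.5f - half * y * y);  /* one Newton iteration */
    }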
The linked hashtable implementation, to my mind, is neither very elegant nor fiendishly clever, nor even reasonably correct.
There is something funny about stating that it most "literally" implements the idea of the algorithm, when the main "idea" of the algorithm is to be quick. :)
Also, even in examples of super-efficient ways, be wary. The inverse square root you are referring to is actually slower than what many CPUs can do with a single instruction nowadays.
Also, I think you are missing out on the main reason this code was written. Essentially a puzzle to see if it can be done.
In the case of a reciprocal square root estimate it probably is, particularly compared to the code given. A fast reciprocal square root estimate function usually takes less than 5 cycles - and is less than 1/4096th out, so the accuracy is better too.
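For the curious, the hardware estimate being referred to is exposed on x86 through an SSE intrinsic. This is an x86-specific sketch; Intel documents the maximum relative error of RSQRTSS as 1.5 * 2^-12:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Single-instruction reciprocal square root estimate. */
    static float rsqrt_hw(float x) {
        return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    }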
For those of us who actually ship code, I'd like to suggest khash[1], which is a pretty nice implementation of a hash table in C. It uses lots of macros, so if you're looking for a brain teaser you will be satisfied as well :)
I don't think being implemented with macros is a plus. I've seen some friends use it and get stuck whenever it was failing because it's impossible to know how it's working. It's the same type of cleverness some people criticize in other comments. Yes, it may be complete and useful but it's opaque.
So, I have been looking at this implementation for several minutes. Ignoring what others have already posted about the code being way too dense for no benefit, how is this a hash table? It looks like a bog-standard associative array.
[] has higher precedence than *, so without the parens you get "function returning array of two pointer to pointer to int" (which is illegal) instead of the desired "function returning pointer to pointer to array of two int".
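To make the precedence concrete, here is the declaration spelled out both ways, plus a typedef that says the same thing more readably (hnew is the name used in the post, but the exact prototype here is an assumption):

    int (**hnew(void))[2];     /* function returning pointer to pointer to array[2] of int */
    /* int **hnew(void)[2];       without the parens: function returning an array, which is illegal */

    typedef int (*pair)[2];    /* pointer to an array of two int, i.e. one key/value entry */
    pair *hnew_alt(void);      /* hypothetical equivalent spelling of the same return type */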
Won’t hget() easily run off the end of the array if it has a few collisions? The function doesn’t save off the original value of t and then does an unbounded number of (t += h) operations.
edit: someone else pointed this out already. How did this make it to the front page of HN?
Is this even a hash table? It looks more like a regular associative array. It seems like lookups are linear in time, which ditches one of the biggest advantages of hash tables.
Also, since the keys are integers, how would this be useful? Isn't it the same as a regular array?
It took me some serious thinking to figure out why you could take **t, despite *t being NULL.
If I managed to understand it correctly, it is because **t is an array, and arrays are special.
*t is a pointer to an array, i.e. it points to the first element of the array. **t is an array, which means it behaves like a pointer to the first element.
The remarkable consequence of this is that *t and **t have the same numerical value, but different types: *t == **t evaluates to true.
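A tiny standalone demo of that decay argument (nothing here is from the posted code; arr and p are just illustrative names):

    #include <stdio.h>

    int main(void) {
        int arr[2] = {1, 2};
        int (*p)[2] = &arr;    /* p plays the role of *t: a pointer to an array */
        /* *p is the array itself; evaluating it decays to a pointer to its
           first element, so the two addresses compare equal. */
        printf("%d\n", (void *)p == (void *)*p);   /* prints 1 */
        return 0;
    }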
A bug nobody has pointed out yet is that **t dereferences a null pointer. It appears to work because gcc decides to generate the same code for **t as for *t (which is correct behaviour if *t is not NULL, and moot if t is NULL). The code should just be *t there.
Also, sizeof(int **) in hnew should be sizeof(int (*)[2]).
A small improvement could be to hash the key using a prime multiplier and successive multiplications, instead of using the key and increments of one. It'd reduce the collisions at the expense of a more computationally expensive hash function.
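Something along those lines might look like this: a Knuth-style multiplicative hash in place of using the key directly (a sketch, not the posted code; 2654435761 is a prime close to 2^32 divided by the golden ratio):

    /* Scramble the key so that sequential keys land in well-spread buckets. */
    static unsigned hash_key(unsigned key) {
        return key * 2654435761u;   /* Knuth's multiplicative constant */
    }
    /* bucket = hash_key(key) % SIZE */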
The problem with that method is that it doesn't have data access locality, while linear probing does. Linear probing ends up being more efficient because it is easy on the cache.
It seems to me this would only be true if the keys that collide are related to each other, or you have a vastly oversized table that you collide a lot in, at which point you're just accidentally synthesizing a smaller table.
What am I missing here?
[edit] I guess the other situation would be if the keys are largely sequential, but then a hash table seems like an odd choice of data structure.
Linear probing works by initially hashing the value, call it h(x), then if there is a collision, it checks h(x)+1, h(x)+2, ..., h(x)+k, until it finds an open slot. Lookup works in the same way, and deletion is a bit more complicated.
This model plays nicely with the cache, although its downside is there tend to be more "runs" of contiguous filled slots in the hash table. This method still provably takes an expected insert/lookup time of O(1) with a 5-wise independent hash function and a load factor smaller than 1.
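In code, the insert side of that scheme looks something like this generic sketch: a flat array of key/value slots with 0 reserved as the "empty" key (the layout and all names here are assumptions, not the posted code):

    #define NSLOTS 1024

    struct slot { int key, value; };

    /* Insert or update: start at h(x) and walk h(x)+1, h(x)+2, ... with
       wraparound until an empty slot or the matching key is found. */
    static int probe_insert(struct slot *table, unsigned (*h)(int), int key, int value) {
        unsigned start = h(key) % NSLOTS;
        for (unsigned i = 0; i < NSLOTS; ++i) {
            struct slot *s = &table[(start + i) % NSLOTS];
            if (s->key == 0 || s->key == key) {   /* empty, or key already present */
                s->key = key;
                s->value = value;
                return 0;
            }
        }
        return -1;   /* table full */
    }

The contiguous walk is what makes it cache-friendly: consecutive probes touch adjacent memory.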
But now I see where the confusion lies. I was taking your post to be replying more to the hash function part of the GP, but you were talking specifically about the skip distance. Yes, now I see what you mean, and I'm not actually sure how I misinterpreted so badly in the first place.
Note that on many systems (e.g. Linux), malloc/calloc won't always return NULL when you're out of memory because of lazy memory allocation policies. It will only crash when you start reading / writing. That makes it arguably less useful to test the return value.
However, mmap and the functions which rely on it will return MAP_FAILED/NULL if no gap large enough is found in the virtual address space or if you've used MAP_FIXED and the area isn't free. If you don't check for errors, the application could potentially end up trying to dereference a NULL pointer.
Ok, let me rephrase then:
On modern systems, checking the return value of malloc() is not a reliable way to detect that the system is out of memory, and therefore it doesn't guarantee that the program will run correctly in every case.
That's what I wanted to point out, and I agree that it remains useful for detecting other types of errors such as the ones you mentioned.
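Concretely, the check costs almost nothing even when overcommit makes it less meaningful for plain out-of-memory. A sketch using the hnew-style signature from the thread (SIZE and the name hnew_checked are assumptions):

    #include <stdlib.h>

    #define SIZE 1000   /* assumed table size */

    int (**hnew_checked(void))[2] {
        int (**t)[2] = calloc(SIZE, sizeof *t);   /* sizeof *t == sizeof(int (*)[2]) */
        if (t == NULL)
            abort();   /* or report the failure to the caller */
        return t;
    }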
Generally it's easier if the size of a hash table is a fixed value (which can be calculated once at runtime based on how much memory the table can consume), so that the hashing logic reduces to a single modulus, something like the sketch below:
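(The names below are illustrative; the point is only that indexing becomes a single modular reduction of the hash.)

    #include <stddef.h>

    /* Sketch only: hash and table_size are hypothetical names, not taken from
       the comment above. With a fixed table size, indexing is one modulus. */
    static size_t bucket_index(size_t hash, size_t table_size) {
        return hash % table_size;
    }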
Yes. Hash tables, linked lists, and graphs of some sort are the fundamental data structures that everyone should have implemented at least once themselves. IMO, the core of excellent software engineering is having a lucid intuition about fundamental things like these.
Mainly because they are simple, and a lot of problems are solvable with them without resorting to Someone Else's Gigantic Platform Framework.
In my experience, there's a strong correlation between programmers who don't know how to implement basic data structures and programmers who make poor decisions when picking one from a collections library. The ones who know that they don't know at least stick to fixed-size arrays and pay the performance price.
So, no... and yes. You can call yourself a software engineer without knowing any of this stuff, but you will be a better one if you do.
Learning what it is and how it works is necessary. From there, any remotely competent programmer should be able to quickly and easily create one.
Programming isn't about learning how to create specific things. But learning how to create specific things can help you gain the intuition necessary to create other things.
In C.. Yes.. it's a basic data type you'll probably find yourself using quite a bit. If you program in some other language that already provides hashtable functions/types, then maybe you can get by with just knowing how it works so you can figure out when to use it.
In this case, it makes no difference. The array is used instead of a struct to save a few lines of code and to make the title of this post more impressive.