People understand that the compiler/executable doesn't run any faster the fewer newlines there are, right?
This is cute, but as others have pointed out, it isn't really a correct implementation of a hash table. Also, it wouldn't pass a code review anywhere I've ever worked.
Well that's the whole point, right? This is clearly written for fun, it's not production code, it's not supposed to pass a code review. It's just supposed to be cool and interesting, and a challenge for the author.
Exactly! Building cute but ultimately useless[0] programs is a lot of fun and makes for great brainteasers. It's also a great way to interact in a friendly but still technical way with your programmer peers. I recommend it to everyone!
There's no bounds testing in hget(): some valid sequences of operations will cause buffer overflows. For example, this should segfault:
    int (**table)[2] = hnew();
    for (int j = 0; j < 40; ++j) {
        hset(table, 10 + j*SIZE, 0);
    }
The problem is that the probing function doesn't wrap (the "t += h" part), so if you have several colliding keys, it will probe for them past the end of the table.
But it looks broken because t does the C-ish mutating-accumulator thing rather than acting as a base and using h as an offset.
You need a temporary variable of
int (**)[2]
to avoid doing two additions per iteration, though maybe the compiler can pick that up and do it for you. Anyway, this no longer crashes, but it now runs forever if the hash table fills.
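Something like the sketch below might be what's wanted: a probe that both wraps and gives up after visiting every slot once. It's only a sketch against the int (**)[2] layout discussed here; SIZE and the name hget_wrapping are assumptions, since the original hget isn't reproduced in this thread.

    #include <stdlib.h>

    #define SIZE 1000   /* assumed table size */

    /* Return the slot for `key`: either the matching entry or the first empty
       slot along the probe sequence. Returns NULL if the table is full and the
       key is absent, instead of looping forever. */
    int (**hget_wrapping(int (**t)[2], int key))[2] {
        size_t start = (unsigned)key % SIZE;
        for (size_t i = 0; i < SIZE; ++i) {
            int (**slot)[2] = t + (start + i) % SIZE;  /* wrap instead of walking off the end */
            if (*slot == NULL || (**slot)[0] == key)
                return slot;
        }
        return NULL;
    }

The modulus costs a little per probe step, but it bounds the search to one pass over the table.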
Every time a lookup is performed, isn't it linearly looking through the table to find that key? That doesn't sound like a hash! Maybe I am missing something here.
Instead of a linked list to store hash collisions, it's using linear probing, which just uses the next open spot in the table. (Maybe I'm misunderstanding the question, it looks like they're indexing into the table correctly to me)
I love how a comment thread about a fun little program turned into HN showing off their wisdom in software engineering coupled with occasional boasting about tight code reviews.
I will never understand how and why people's egos get threatened by things like this.
I actually enjoyed reading this thread. Yes, there are some snarky comments about missing newlines, but there are also a lot of constructive comments here!
Exactly. Why do people try to write code in fewer lines? More lines == better readability, better debugging, easier to move around, easier to fix single parts and so on.
I do not code much, but I have to read a lot of code when I investigate bugs. I rarely see very concise code, but I quite often see verbose code. The readability of verbose code is very bad: I have to scroll a lot more to understand the algorithm, and it is often split across many files.
In my experience, more lines == reduced readability, more difficulty moving around, more difficulty fixing single parts (any change requires modifying many files), and so on. More lines of code often works against simplicity.
I have seen code that was very difficult to understand. Rewriting it to make it simple never needed to increase the number of lines.
I agree, in that I'd much rather read terse, expressive code than pages of drivel that fail to convey anything meaningful, having seen a lot of verbose code too. That said, code should be well formed. When I started out in C I only used curly braces when I needed to. After creating several bugs stemming from this I now add them as a rule of thumb. That particular issue is compounded here by the fact that multiple statements are shoved onto a single line - also poor form.
I think it's a matter of balance - at one extreme is IOCCC-style excessive terseness+obfuscation, and at the other extreme is the type of excessively indirect code often referred to as "enterprisey". The average probably varies depending on the culture of the language (contrast the typical C project with a Java one, for example), but over the years I've found that as my skill with a language improves, my code tends to become more concise.
More lines == better readability, better debugging, easier to move around, easier to fix single parts and so on.
Disagree. There's a sweet spot. Code that is too dense (golfing code) is unreadable but so is code that's too long. Long code can be readable if the substructure is evident, but it's usually not.
There are two issues that come to mind on this. One is that concise code usually requires strong familiarity with the language. For example, Haskell allows dense and readable code-- if you know Haskell. Same with Clojure (Lisp). I hate Java, but one thing I'll say for it and its sprawling code is that it's relatively easy to (superficially?) understand (with an IDE to jump around, because modest programs can be thousands of lines) with a mediocre knowledge of the language.
I'd say the same of research-level mathematics. If you're intimately familiar with linear algebra notation and terminology because you use it every day, it's concise and readable. If not, then "(X^T X)^(-1) X^T y" looks like line noise.
The second issue is that what tends to make large codebases unreadable is a failure of factoring. It's not that it's 500 lines* that is the problem, because the intrinsic complexity of the problem may be at a level where 400-500 lines is appropriate. It's when you have 500 lines in one class (ultimately, "class" and "object" are poorly-defined and that's why so much OOP code is trash) or, worse yet, one function/method (method is just a code-name for "locally interpreted stateful function") that you get hell code. This happens a lot in large imperative codebases (typically not becoming intractable until you have two or more programmers on the same code) because, in imperative code where an action (as opposed to stateless function) is the fundamental unit, things appear to compose better than they actually do.
In other words, it's complicated. In general, there seems to be a density level that accurately represents the intrinsic complexity of the problem. Go denser and you're forcing people to think in a specific (and poorly communicated) way to understand the implicit structure and policies that informed the code construction. On the other hand, typical sprawling Java code is great for testing your IDE's scrolling feature but the amount that can fit on a single page is low, and if this is coupled with long methods and huge classes, you get OOP (here, meaning "business OOP", not "Alan Kay's vision of OOP") spaghetti code.
Exactly... I made a similar comment about making it more readable while increasing the lines to 100 and got downvoted. Folks, any dev can read 100 lines of code...
Maybe the downvote is OK - I'm not grokking the essence of it - but similar bits of code written in Perl are notorious jokes.
I can see two legitimate cases for writing very short, very dense code.
1: Implement an algorithm in a very concise and straightforward, even if not very efficient, way. An example is the classic quicksort in Haskell, which most literally implements the idea of the algorithm:
    qsort []     = []
    qsort (p:xs) = qsort [ y | y <- xs, y < p ] ++ [p] ++
                   qsort [ y | y <- xs, y >= p ]
2: Implement an algorithm in a super-efficient, while non-obvious, way. An example is the inverse square root calculation made famous by John Carmack (http://www.codemaestro.com/reviews/9).
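For reference, the widely circulated Quake III routine looks roughly like this; it's rewritten here with uint32_t and memcpy instead of the original pointer-cast bit punning, so treat it as a sketch rather than a faithful quotation:

    #include <stdint.h>
    #include <string.h>

    /* Approximate 1/sqrt(x): the magic-constant bit trick plus one
       Newton-Raphson refinement step. */
    static float fast_rsqrt(float x) {
        float half = 0.5f * x;
        uint32_t i;
        memcpy(&i, &x, sizeof i);          /* reinterpret the float's bits as an integer */
        i = 0x5f3759df - (i >> 1);         /* the famous magic constant */
        float y;
        memcpy(&y, &i, sizeof y);
        return y * (1.5f - half * y * y);  /* one Newton iteration */
    }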
The linked hashtable implementation, to my mind, is neither very elegant nor fiendishly clever, nor even reasonably correct.
There is something funny about stating that it most "literally" implements the idea of the algorithm, when the main "idea" of the algorithm is to be quick. :)
Also, even in examples of super-efficient ways, be wary. The inverse square root you are referring to is actually slower than what many CPUs can do with a single instruction nowadays.
Also, I think you are missing out on the main reason this code was written. Essentially a puzzle to see if it can be done.
In the case of a reciprocal square root estimate it probably is, particularly compared to the code given. A fast reciprocal square root estimate function usually takes less than 5 cycles - and is less than 1/4096th out, so the accuracy is better too.
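For the curious, the hardware estimate being referred to is exposed on x86 through an SSE intrinsic. This is an x86-specific sketch; Intel documents the maximum relative error of RSQRTSS as 1.5 * 2^-12:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Single-instruction reciprocal square root estimate. */
    static float rsqrt_hw(float x) {
        return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    }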
For those of us who actually ship code, I'd like to suggest khash[1], which is a pretty nice implementation of a hash table in C. It uses lots of macros, so if you're looking for a brain teaser you will be satisfied as well :)
I don't think being implemented with macros is a plus. I've seen some friends use it and get stuck whenever it was failing because it's impossible to know how it's working. It's the same type of cleverness some people criticize in other comments. Yes, it may be complete and useful but it's opaque.
So, I have been looking at this implementation for several minutes. Ignoring what others have already posted about the code being way too dense for no benefit, how is this a hash table? It looks like a bog-standard associative array.
[] has higher precedence than *, so without the parens you get "function returning array of two pointer to pointer to int" (which is illegal) instead of the desired "function returning pointer to pointer to array of two int".
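To make the precedence concrete, here is the declaration spelled out both ways, plus a typedef that says the same thing more readably (hnew is the name used in the post, but the exact prototype here is an assumption):

    int (**hnew(void))[2];     /* function returning pointer to pointer to array[2] of int */
    /* int **hnew(void)[2];       without the parens: function returning an array, which is illegal */

    typedef int (*pair)[2];    /* pointer to an array of two int, i.e. one key/value entry */
    pair *hnew_alt(void);      /* hypothetical equivalent spelling of the same return type */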
Won’t hget() easily run off the end of the array if it has a few collisions? The function doesn’t save off the original value of t and then does an unbounded number of (t += h) operations.
edit: someone else pointed this out already. How did this make it to the front page of HN?
Is this even a hash table? It looks more like a regular associative array. It seems like lookups are linear in time, which ditches one of the biggest advantages of hash tables.
Also, since the keys are integers, how would this be useful? Isn't it the same as a regular array?
It took me some serious thinking to figure out why you could take **t, despite *t being NULL.
If I managed to understand it correctly, it is because **t is an array, and arrays are special.
*t is a pointer to an array, i.e. it points to the first element of the array. **t is an array, which means it behaves like a pointer to the first element.
The remarkable consequence of this is that *t and **t have the same numerical value, but different types: *t == **t evaluates to true.
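A tiny standalone demo of that decay argument (nothing here is from the posted code; arr and p are just illustrative names):

    #include <stdio.h>

    int main(void) {
        int arr[2] = {1, 2};
        int (*p)[2] = &arr;    /* p plays the role of *t: a pointer to an array */
        /* *p is the array itself; evaluating it decays to a pointer to its
           first element, so the two addresses compare equal. */
        printf("%d\n", (void *)p == (void *)*p);   /* prints 1 */
        return 0;
    }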
A bug nobody has pointed out yet is that **t dereferences a null pointer. It appears to work because gcc decides to generate the same code for **t as for *t (which is correct behaviour if *t is not NULL, and moot if t is NULL). The code should just be *t there.
Also, sizeof(int **) in hnew should be sizeof(int (*)[2]).
A small improvement could be to hash the key using a prime multiplier and successive multiplications, instead of using the key and increments of one. It'd reduce the collisions at the expense of a more computationally expensive hash function.
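Something along those lines might look like this: a Knuth-style multiplicative hash in place of using the key directly (a sketch, not the posted code; 2654435761 is a prime close to 2^32 divided by the golden ratio):

    /* Scramble the key so that sequential keys land in well-spread buckets. */
    static unsigned hash_key(unsigned key) {
        return key * 2654435761u;   /* Knuth's multiplicative constant */
    }
    /* bucket = hash_key(key) % SIZE */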
The problem with that method is that it doesn't have data access locality, while linear probing does. Linear probing ends up being more efficient because it is easy on the cache.
It seems to me this would only be true if the keys that collide are related to each other, or you have a vastly oversized table that you collide a lot in, at which point you're just accidentally synthesizing a smaller table.
What am I missing here?
[edit] I guess the other situation would be if the keys are largely sequential, but then a hash table seems like an odd choice of data structure.
Linear probing works by initially hashing the value, call it h(x), then if there is a collision, it checks h(x)+1, h(x)+2, ..., h(x)+k, until it finds an open slot. Lookup works in the same way, and deletion is a bit more complicated.
This model plays nicely with the cache, although its downside is there tend to be more "runs" of contiguous filled slots in the hash table. This method still provably takes an expected insert/lookup time of O(1) with a 5-wise independent hash function and a load factor smaller than 1.
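In code, the insert side of that scheme looks something like this generic sketch: a flat array of key/value slots with 0 reserved as the "empty" key (the layout and all names here are assumptions, not the posted code):

    #define NSLOTS 1024

    struct slot { int key, value; };

    /* Insert or update: start at h(x) and walk h(x)+1, h(x)+2, ... with
       wraparound until an empty slot or the matching key is found. */
    static int probe_insert(struct slot *table, unsigned (*h)(int), int key, int value) {
        unsigned start = h(key) % NSLOTS;
        for (unsigned i = 0; i < NSLOTS; ++i) {
            struct slot *s = &table[(start + i) % NSLOTS];
            if (s->key == 0 || s->key == key) {   /* empty, or key already present */
                s->key = key;
                s->value = value;
                return 0;
            }
        }
        return -1;   /* table full */
    }

The contiguous walk is what makes it cache-friendly: consecutive probes touch adjacent memory.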
But now I see where the confusion lies. I was taking your post to be replying more to the hash function part of the GP, but you were talking specifically about the skip distance. Yes, now I see what you mean, and I'm not actually sure how I misinterpreted so badly in the first place.
Note that on many systems (e.g. Linux), malloc/calloc won't always return NULL when you're out of memory because of lazy memory allocation policies. It will only crash when you start reading / writing. That makes it arguably less useful to test the return value.
However, mmap and the functions which rely on it will return MAP_FAILED/NULL if no gap large enough is found in the virtual address space or if you've used MAP_FIXED and the area isn't free. If you don't check for errors, the application could potentially end up trying to dereference a NULL pointer.
Ok, let me rephrase then:
On modern systems, checking the return value of malloc() is not a reliable way to detect that the system is out of memory, and therefore it doesn't guarantee that the program will run correctly in every case.
That's what I wanted to point out, and I agree that it remains useful for detecting other types of errors such as the ones you mentioned.
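Concretely, the check costs almost nothing even when overcommit makes it less meaningful for plain out-of-memory. A sketch using the hnew-style signature from the thread (SIZE and the name hnew_checked are assumptions):

    #include <stdlib.h>

    #define SIZE 1000   /* assumed table size */

    int (**hnew_checked(void))[2] {
        int (**t)[2] = calloc(SIZE, sizeof *t);   /* sizeof *t == sizeof(int (*)[2]) */
        if (t == NULL)
            abort();   /* or report the failure to the caller */
        return t;
    }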
Generally it's easier if the size of a hash table is a fixed value (which can be calculated once at runtime based on how much memory the table can consume), so that the hashing logic reduces to a single modulus, something like the sketch below:
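(The names below are illustrative; the point is only that indexing becomes a single modular reduction of the hash.)

    #include <stddef.h>

    /* Sketch only: hash and table_size are hypothetical names, not taken from
       the comment above. With a fixed table size, indexing is one modulus. */
    static size_t bucket_index(size_t hash, size_t table_size) {
        return hash % table_size;
    }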
Yes. Hash tables, linked lists, and graphs of some sort are the fundamental data structures that everyone should have implemented at least once themselves. IMO, the core of excellent software engineering is having a lucid intuition about fundamental things like these.
Mainly because they are simple, and a lot of problems are solvable with them without resorting to Someone Else's Gigantic Platform Framework.
In my experience, there's a strong correlation between programmers who don't know how to implement basic data structures and programmers who make poor decisions when picking one from a collections library. The ones who know that they don't know at least stick to fixed-size arrays and pay the performance price.
So, no... and yes. You can call yourself a software engineer without knowing any of this stuff, but you will be a better one if you do.
Learning what it is and how it works is necessary. From there, any remotely competent programmer should be able to quickly and easily create one.
Programming isn't about learning how to create specific things. But learning how to create specific things can help you gain the intuition necessary to create other things.
In C.. Yes.. it's a basic data type you'll probably find yourself using quite a bit. If you program in some other language that already provides hashtable functions/types, then maybe you can get by with just knowing how it works so you can figure out when to use it.
In this case, it makes no difference. The array is used instead of a struct to save a few lines of code and to make the title of this post more impressive.