Unfortunately, my interest has mostly been at the algorithms level: real-world implementations are usually dealing with so few elements that you are more affected by "silly" concerns like cache locality and instruction timing than anything else.

Therefore, it is really critical to figure out how many items you actually need to be able to store. If you only need to store a couple hundred, you are going to do really well with something incredibly "stupid" like a chaining hash table.
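
Something like this is roughly what I mean; a minimal sketch in C, assuming string keys and FNV-1a as the hash (none of these names come from anything you described, it is just an illustration):

    /* A deliberately "stupid" chaining hash table: fixed bucket count,
       string keys, linked-list chains; error handling and deletion omitted. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 256

    struct node {
        struct node *next;
        char *key;
        void *value;
    };

    struct table {
        struct node *buckets[NBUCKETS];
    };

    /* FNV-1a: a cheap, non-cryptographic hash with decent mixing. */
    static uint64_t hash(const char *key) {
        uint64_t h = 0xcbf29ce484222325ull;
        while (*key) {
            h ^= (unsigned char)*key++;
            h *= 0x100000001b3ull;
        }
        return h;
    }

    static void put(struct table *t, const char *key, void *value) {
        size_t b = (size_t)(hash(key) % NBUCKETS);
        for (struct node *n = t->buckets[b]; n; n = n->next)
            if (strcmp(n->key, key) == 0) { n->value = value; return; }
        struct node *n = malloc(sizeof *n);
        n->key = strdup(key);
        n->value = value;
        n->next = t->buckets[b];
        t->buckets[b] = n;
    }

    static void *get(struct table *t, const char *key) {
        for (struct node *n = t->buckets[hash(key) % NBUCKETS]; n; n = n->next)
            if (strcmp(n->key, key) == 0) return n->value;
        return NULL;
    }

With a couple hundred entries the chains stay short enough that the cache behavior of the nodes matters far more than anything asymptotic.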

Another important question is your expected concurrency. I mean, with a really strong random hash function and a data set that reaches a reasonable "steady state" in the number of objects it contains, hash tables are going to be ludicrously efficient.

(edit: I meant to say: that assumes you can tolerate the hash table going offline entirely during a resize before it reaches steady state, although there are algorithms for doing an online concurrent resize; memcached actually uses a concurrently resizable hash table, from what I remember, and is generally considered "very fast".)
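
The incremental flavor of that looks roughly like the following (again just a sketch of the general idea, not memcached's actual code; it reuses struct node and hash() from the sketch above): keep both the old and new bucket arrays around and drain a few old buckets on every operation, so no single call ever pays for the whole rehash.

    /* Online resize sketch: a handful of old buckets are migrated per
       put()/get(), so the table never "goes offline" for a full rehash.
       Reuses struct node and hash() from the chaining sketch above. */
    struct resizing_table {
        struct node **old_buckets;   /* NULL once migration has finished */
        struct node **new_buckets;
        size_t old_size, new_size;
        size_t migrate_pos;          /* next old bucket to drain */
    };

    static void migrate_some(struct resizing_table *t, size_t count) {
        while (t->old_buckets && count--) {
            struct node *n = t->old_buckets[t->migrate_pos];
            while (n) {
                struct node *next = n->next;
                size_t b = (size_t)(hash(n->key) % t->new_size);
                n->next = t->new_buckets[b];
                t->new_buckets[b] = n;
                n = next;
            }
            t->old_buckets[t->migrate_pos] = NULL;
            if (++t->migrate_pos == t->old_size) {
                free(t->old_buckets);    /* everything has been moved over */
                t->old_buckets = NULL;
            }
        }
    }

    /* Lookups probe new_buckets first and fall back to the not-yet-drained
       part of old_buckets; inserts go straight into new_buckets. */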

<sidenote> And, SHA is going to be a ludicrously good hash; arguably /too/ good a hash ;P. Out of curiosity, why /are/ you starting with a cryptographic hash (SHA)? Did your data already come in that form (maybe you are storing iPhone UDIDs ;P)?

These kinds of hash functions are actually /designed/ to be slow (so you can't use brute force) and are generally not needed for search structures (as you hopefully have a plan for organic collisions anyway). ;P </sidenote>
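
(And to be concrete about "your data already came in that form": if the keys really do arrive as SHA digests, there is no point hashing them again; the digest bytes are already uniformly distributed, so you can just reinterpret a few of them as the bucket index. Tiny sketch, names made up:)

    #include <stdint.h>
    #include <string.h>

    /* Keys that arrive as (say) SHA-1 digests are already uniformly
       distributed, so "the hash function" is just a memcpy of a prefix. */
    static size_t bucket_for_digest(const unsigned char digest[20], size_t nbuckets) {
        uint64_t prefix;
        memcpy(&prefix, digest, sizeof prefix);   /* first 8 bytes of the digest */
        return (size_t)(prefix % nbuckets);
    }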

I will say that, before going out and doing a lot of research on this, I was in the position of building something similar-sounding (but using a fast 64-bit hash), and what I ended up with was similar to an "LPC trie" (which I luckily discovered quite quickly, and was able to read up on).

http://www.nada.kth.se/~snilsson/publications/TRASH/trash.pd...

http://www.nada.kth.se/~snilsson/publications/Dynamic-trie-c...

(However, again back to the hash function: the fact that I was doing the hash beforehand is not really how most of these algorithms are thought through by the people who develop them. Once you apply the hash function, you are storing a fixed-size element with a known-random distribution, which makes "use the first X bits as the index and store it in an array" an irritatingly good choice.)
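
(Concretely, the "first X bits" trick is nothing more than a shift; TOP_BITS here is a made-up tuning knob:)

    #include <stdint.h>

    /* Once every key is a fixed-size, uniformly distributed 64-bit hash,
       the top level can be a plain array of 2^TOP_BITS slots indexed by the
       high bits; whatever bits remain disambiguate within a slot. */
    #define TOP_BITS 12   /* 2^12 = 4096 top-level slots */

    static uint64_t top_index(uint64_t h) {
        return h >> (64 - TOP_BITS);   /* high TOP_BITS bits pick the slot */
    }

    static uint64_t remaining_bits(uint64_t h) {
        return h << TOP_BITS;          /* the rest, for the next level down */
    }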

I also generally recommend keeping a lot of the stuff the HAT-trie guy has published in mind (I've got it on my "omg must read" list, but had to get back to actual work). His PhD work was on cache-conscious data structures, and he simply went through and optimized the performance of everything under the sun for cache locality.

http://www.naskitis.com/

(By the way: I'd love to talk to you further at some point on IRC. I'm on irc.saurik.com; send a private message, and make it clear who you are in your initial message: don't just say "u there?" as I get a million of those ;P.)




These kinds of hash functions are actually /designed/ to be slow (so you can't use brute force)

Not so much. I think they are designed to have good mixing, be impractically hard to reverse, and have other such properties; this just makes them slower than the hash functions generally used in hash tables. My understanding is that both MD5 and SHA were designed to be efficient under those constraints. bcrypt, by contrast, is designed to be slow, and is slower than other cryptographic hashes by ridiculous margins (parameterized margins, actually).
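
To put a number on "parameterized": bcrypt's cost setting is an exponent, so the expensive key schedule runs 2^cost times and bumping the cost by one roughly doubles the work. Very rough shape of it (not the real EksBlowfishSetup, which does much more per round):

    #include <stdint.h>

    /* The cost in a bcrypt setting like "$2b$12$..." is log2 of the number
       of key-schedule rounds; this only sketches the parameterization. */
    static void eks_setup_sketch(unsigned cost) {
        uint64_t rounds = 1ull << cost;   /* 2^cost iterations */
        for (uint64_t i = 0; i < rounds; i++) {
            /* expand_key(state, password); expand_key(state, salt); */
        }
    }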



