Those are way complicated, though. I read that HAT-trie paper some time ago and I still don't understand it 100%. Compare this to djb's crit-bit description which fits nicely in a paragraph and tells you most everything you need to know.
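
(To give a sense of how small it is: an internal node in a crit-bit tree is basically just the struct below. Field names roughly follow djb's critbit0 code as I remember it; leaves are plain NUL-terminated strings, distinguished from internal nodes by tagging the low pointer bit. Treat this as a sketch, not the canonical source.)

    #include <stdint.h>

    struct critbit_node {
        void *child[2];     /* left/right; low pointer bit tags leaf vs. internal */
        uint32_t byte;      /* index of the byte where the two subtrees first differ */
        uint8_t otherbits;  /* mask with every bit set EXCEPT the critical bit */
    };

    /* lookup just walks the critical bits of the key:
       direction = (1 + (otherbits | key[byte])) >> 8;   yields 0 or 1 */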

I suppose it's one of those 80-20 issues: compressed binary tries only have a few drawbacks, but addressing those requires a ton of effort.




FWIW, I don't think nedtrie has path-compression, and it is only level-compressed in that it has an array at the top that indexes based on leading zero count (which is likely to lead to a very unbalanced root node in comparison to just doing level compression up there).
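
(Concretely, the root array I mean is something like the sketch below; the names are my own and this is only an illustration of the idea, not nedtrie's actual code. Note how small keys all pile into the last few bins, which is where the imbalance comes from.)

    #include <stdint.h>

    #define ROOT_BINS 64

    struct node;                                  /* per-bin binary trie, details omitted */
    struct trie { struct node *bin[ROOT_BINS]; };

    /* choose the root bin from the leading-zero count of the key */
    static struct node **root_slot(struct trie *t, uint64_t key)
    {
        /* __builtin_clzll is a GCC/Clang builtin; key == 0 is mapped to the last bin */
        unsigned nlz = key ? (unsigned)__builtin_clzll(key) : ROOT_BINS - 1;
        return &t->bin[nlz];                      /* e.g. keys 1..255 all land in bins 56..63 */
    }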

I pointed out the HAT-trie thing because you claimed that that was a problem with tries, and it is definitely something being actively worked on. And, as someone who has spent time implementing this data structure but sounded like you hadn't heard of that work, you might find it fascinating ;P.

(Also, "stratified indexes" are a concept that, while technically applying to most unbalanced structures, was designed for patricia trees and supposedly leads to good balancing properties. It was designed for disk storage, though, so it not quite ideal for in-memory use, but will at least therefore be not /ludicrous/ at memory performance if you stick to the page level.)


Since you are here, what do you think would be a suitable data structure for an associative array keyed on SHA'ed keys (the first 64 bits)? I implemented a simple trie structure (no compression at all) to handle this, but I constantly struggle with the feeling that I may have over-engineered it: (https://github.com/liuliu/ccv/blob/release/lib/ccv_cache.c)
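
(By "no compression at all" I mean roughly the shape below, walking the 64-bit prefix a few bits per level; this is a simplified sketch with made-up names, not the actual ccv code in the link.)

    #include <stdint.h>
    #include <stddef.h>

    /* uncompressed trie over the first 64 bits of the SHA, 4 bits per level */
    struct trie_node {
        struct trie_node *child[16];
        void *value;                    /* only set at depth 16, once all 64 bits are consumed */
    };

    static void *trie_get(struct trie_node *root, uint64_t key)
    {
        struct trie_node *n = root;
        for (int shift = 60; n && shift >= 0; shift -= 4)
            n = n->child[(key >> shift) & 0xf];
        return n ? n->value : NULL;
    }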


Are you sure you really want to be using SHA1 keys? Git's use of SHA1 is a clever hack, but there are simpler, much-faster universal hash functions. Also, once you've passed a key through a hash function, you've lost much of the benefit of the tree/trie ADT.
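
(For example, one member of the classic multiply-shift family for 64-bit words looks like the sketch below; the name hash_ms is mine, 'a' is a random odd constant picked once per table, and string keys need a slightly longer variant. It's a couple of cycles versus a full SHA1 pass.)

    #include <stdint.h>

    /* multiply-shift: map a 64-bit key to a k-bit bucket index, 1 <= k <= 63.
       'a' should be a random odd 64-bit constant chosen when the table is created. */
    static inline uint64_t hash_ms(uint64_t key, uint64_t a, unsigned k)
    {
        return (a * key) >> (64 - k);
    }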


Unfortunately, my interest has mostly been at the algorithms level: real-world implementations are usually dealing with so few elements that you are more affected by "silly" concerns like cache locality and instruction timing than anything else.

Therefore, it is really critical to find out how many items you want to be able to store. If you want to store a couple hundred, you are going to do really well with something incredibly "stupid" like a chaining hash table.
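
(By "stupid" I mean on the order of this little sketch: a fixed bucket count and singly linked chains, with names of my own invention.)

    #include <stdint.h>
    #include <stdlib.h>

    #define NBUCKETS 1024                        /* plenty of slack for "a couple hundred" items */

    struct entry { uint64_t key; void *value; struct entry *next; };
    struct table { struct entry *bucket[NBUCKETS]; };

    static void *table_get(struct table *t, uint64_t key)
    {
        for (struct entry *e = t->bucket[key % NBUCKETS]; e; e = e->next)
            if (e->key == key) return e->value;
        return NULL;
    }

    static int table_put(struct table *t, uint64_t key, void *value)
    {
        struct entry *e = malloc(sizeof *e);
        if (!e) return -1;
        e->key = key; e->value = value;
        e->next = t->bucket[key % NBUCKETS];
        t->bucket[key % NBUCKETS] = e;
        return 0;
    }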

Another important question is your expected concurrency. With a really strong random hash function and a data set that reaches a reasonable "steady state" in the number of objects it contains, hash tables are going to be ludicrously efficient.

(edit: I meant to say: that's if you can tolerate the hash table going offline entirely during a resize before it reaches steady state, although there are some algorithms for doing an online concurrent resize; memcached actually uses a concurrent resizable hash table, from what I remember, and is generally considered "very fast")

<sidenote> And, SHA is going to be a ludicrously good hash; arguably /too/ good a hash ;P. Out of curiosity, why /are/ you starting with a cryptographic hash (SHA)? Did your data already come in that form (maybe you are storing iPhone UDIDs ;P)?

These kinds of hash functions are actually /designed/ to be slow (so you can't use brute force) and are generally not needed for search structures (as you hopefully have a plan for organic collisions anyway). ;P </sidenote>

I will say that, before going out and doing a lot of research on this, I was in the position of building something similar-sounding (but using a fast 64-bit hash), and what I ended up with was similar to an "LPC trie" (which I luckily discovered quite quickly, and was able to read up on).

http://www.nada.kth.se/~snilsson/publications/TRASH/trash.pd...

http://www.nada.kth.se/~snilsson/publications/Dynamic-trie-c...

(However, again back to the hash function: the fact that I was doing the hash beforehand is not really how most of these algorithms are thought through by the people who are developing them. Once you use the hash function you are now storing a fixed-size element with a known-random distribution, which makes "use the first X bits as a hash function and store it in an array" an irritatingly good choice.)
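
(In other words, something like the following, for X = 20 or so; the names are again my own. Since the key has already been through the hash, its top bits are as uniformly distributed as any bucket index you could compute.)

    #include <stdint.h>

    #define X 20                                   /* 2^20 slots, about a million */

    struct slot;                                   /* whatever you hang off each slot */
    struct index { struct slot *slots[1u << X]; };

    static struct slot **index_slot(struct index *ix, uint64_t hashed_key)
    {
        return &ix->slots[hashed_key >> (64 - X)]; /* top X bits of the already-random key */
    }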

I also generally recommend keeping a lot of the stuff the HAT-trie guy has published in mind (I've got it on my "omg must read" list, but had to get back to actual work). His PhD work was on cache-conscious data structures, and he simply went through and optimized the performance of everything under the sun for cache locality.

http://www.naskitis.com/

(By the way: I'd love to talk to you further at some point on IRC. I'm on irc.saurik.com; send a private message, and make it clear who you are in your initial message: don't just say "u there?" as I get a million of those ;P.)


These kinds of hash functions are actually /designed/ to be slow (so you can't use brute force)

Not so much. I think they are designed to have good mixing, be impractically hard to reverse, and have other properties. This just makes them slower than the hash functions generally used in hash tables. My understanding is that both MD5 and SHA were designed to be efficient under these constraints. bcrypt, on the other hand, is designed to be slow, and is slower than other cryptographic hashes by ridiculous margins (and it's parameterized, at that).
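
(For a sense of scale: bcrypt's cost parameter c selects 2^c rounds of its expensive key setup, so going from cost 10 to cost 15 makes each hash 2^5 = 32x slower; fixed-function designs like MD5 and SHA have no such knob.)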



