Nedtries: an ordered container faster than hash tables and red-black trees (nedprod.com)
75 points by jasonwatkinspdx on Jan 16, 2011 | 20 comments



This is basically a radix/patricia/critbit tree. Those may be O(1) in the number of keys, but they are O(n) in the length of the key. Note this is true for hashtables as well.

Such trees are pretty excellent in practice; they give hash-like fast lookup while preserving ordering. Hashtables do have the advantage of avoiding the cache thrashing you get when traversing a tree, which is probably why nedtries is best for fewer than 10,000 elements. Not to mention the key distribution may affect tree balance and hence lookup times, which is probably an issue when you're allocating memory from sequential addresses.

I have an unoptimized and under-tested implementation of such a binary trie, which uses string keys. https://github.com/j0sh/radixtree

edit: another big plus to these types of trees: they use a minimum of space, unlike a hash table.


There has actually been work on cache-conscious tries, such as the HAT-trie.

http://crpit.com/confpapers/CRPITV62Askitis.pdf


Those are way complicated, though. I read that HAT-trie paper some time ago and I still don't understand it 100%. Compare this to djb's crit-bit description which fits nicely in a paragraph and tells you most everything you need to know.

I suppose it's one of those 80-20 issues: compressed binary tries only have a few drawbacks, but addressing those requires a ton of effort.


FWIW, I don't think nedtrie has path-compression, and it is only level-compressed in that it has an array at the top that indexes based on leading zero count (which is likely to lead to a very unbalanced root node in comparison to just doing level compression up there).
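
For reference, the leading-zero-count bucketing looks something like this (a sketch, not nedtries' actual code; the bucket numbering and the GCC/Clang builtin are my assumptions):

    #include <stddef.h>

    /* Sketch only: pick a root bucket by counting leading zeros of the key.
       Keys with many leading zeros (i.e. small integers) all pile into the
       high-numbered buckets, so the buckets are not evenly loaded. */
    static unsigned root_bucket(unsigned long key)
    {
        if (key == 0)
            return 8 * sizeof(unsigned long);   /* dedicated bucket for zero */
        return (unsigned)__builtin_clzl(key);   /* GCC/Clang bit-scan builtin */
    }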

I pointed out the HAT-trie thing because you claimed that was a bug with tries, and it is definitely something being worked on. And, as someone who has spent time implementing this data structure and who sounded like they hadn't heard of that work, you might find it fascinating ;P.

(Also, "stratified indexes" are a concept that, while technically applying to most unbalanced structures, was designed for patricia trees and supposedly leads to good balancing properties. It was designed for disk storage, though, so it not quite ideal for in-memory use, but will at least therefore be not /ludicrous/ at memory performance if you stick to the page level.)


Since you are here, what do you think would be a suitable data structure for an associative array keyed by SHA hashes (the first 64 bits)? I implemented a simple trie structure (no compression at all) to handle this, but I keep wondering whether I over-engineered it: (https://github.com/liuliu/ccv/blob/release/lib/ccv_cache.c)


Are you sure you really want to be using SHA1 keys? Git's use of SHA1 is a clever hack, but there are simpler, much-faster universal hash functions. Also, once you've passed a key through a hash function, you've lost much of the benefit of the tree/trie ADT.
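
For a sense of scale, here's a minimal sketch of a non-cryptographic 64-bit mixer (this is the splitmix64 finalizer; bringing it up here is my suggestion, not something from the thread):

    #include <stdint.h>

    /* splitmix64-style finalizer: a handful of multiplies and shifts gives
       well-mixed 64-bit output, far cheaper than running SHA-1 per key. */
    static uint64_t mix64(uint64_t x)
    {
        x += 0x9E3779B97F4A7C15ULL;
        x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
        x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
        return x ^ (x >> 31);
    }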


Unfortunately, my interest has mostly been at the algorithms level: real-world implementations are usually dealing with so few elements that you are more affected by "silly" concerns like cache locality and instruction timing than anything else.

Therefore, it is really critical to find out how many items you want to be able to store. If you want to store a couple hundred, you are going to do really well with something incredibly "stupid" like a chaining hash table.

Another important question is regarding your expected concurrency. I mean, with a really strong random hash function and a data set that reaches a reasonable "steady state" in the number of objects it contains, hash tables are going to be ludicrously efficient.

(edit: I meant to say: if you can tolerate the hashtable going offline entirely during a resize before it reaches steady state, although there are some algorithms for doing online concurrent resize; memcached actually uses a concurrent resizable hashtable, from what I remember, and is generally considered "very fast")

<sidenote> And, SHA is going to be a ludicrously good hash; arguably /too/ good a hash ;P. Out of curiosity, why /are/ you starting with a cryptographic hash (SHA)? Did your data already come in that form (maybe you are storing iPhone UDIDs ;P)?

These kinds of hash functions are actually /designed/ to be slow (so you can't use brute force) and are generally not needed for search structures (as you hopefully have a plan for organic collisions anyway). ;P </sidenote>

I will say that, before going out and doing a lot of research on this, I was in the position of building something similar-sounding (but using a fast 64-bit hash), and what I ended up with was similar to an "LPC trie" (which I luckily discovered quite quickly, and was able to read up on).

http://www.nada.kth.se/~snilsson/publications/TRASH/trash.pd...

http://www.nada.kth.se/~snilsson/publications/Dynamic-trie-c...

(However, again back to the hash function: the fact that I was doing the hash beforehand is not really how most of these algorithms are thought through by the people who are developing them. Once you use the hash function, you are now storing a fixed-size element with a known-random distribution, which makes "use the first X bits as the hash function and store it in an array" an irritatingly good choice.)
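
To make that concrete, a sketch of the "irritatingly good choice", assuming the key is already a well-mixed 64-bit hash (the table size and names are made up):

    #include <stdint.h>

    #define TABLE_BITS 16   /* 2^16 slots; purely an example figure */

    /* With a uniformly distributed 64-bit key, the top TABLE_BITS bits are
       already a uniform index, so the "hash function" is just a shift. */
    static uint32_t slot_for(uint64_t hashed_key)
    {
        return (uint32_t)(hashed_key >> (64 - TABLE_BITS));
    }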

I also generally recommend keeping in mind a lot of the stuff the HAT-trie guy has published (I've got it on my "omg must read" list, but had to get back to actual work). His PhD work was on cache-conscious data structures, and he simply went through and optimized the performance of everything under the sun for cache locality.

http://www.naskitis.com/

(By the way: I'd love to talk to you further at some point on IRC. I'm on irc.saurik.com; send a private message, and make it clear who you are in your initial message: don't just say "u there?" as I get a million of those ;P.)


These kinds of hash functions are actually /designed/ to be slow (so you can't use brute force)

Not so much. I think they are designed to have good mixing, be impractically hard to reverse, and have other properties; this just makes them slower than the hash functions generally used in hash tables. My understanding is that both MD5 and SHA were designed to be efficient under those constraints. bcrypt is the one designed to be slow, and it is slower than other cryptographic hashes by ridiculous margins (it's parameterized, actually).


These are binary tries with some interesting optimizations, in particular using special bit-scan instructions supported on all modern CPUs to skip common prefixes at the start of a lookup.

According to the author's benchmarks they offer very consistent performance between insert and access operations, which seems to be key to why they perform better than R-B trees. Since tries don't need to rebalance on insert, we avoid the memory operations R-B trees spend reorganizing interior nodes. Even though those are amortized in the R-B tree, we're still saving work.

Likewise, these are faster than small hash tables since we avoid resizing-rehashing overhead. Searching the binary trie is also likely faster than following a collision linked list we might find in a naive hash table implementation. However, for larger containers hashes begin to catch up, since at these sizes the trie involves increasingly more memory accesses.

I'd like to see these tries compared to generalized cuckoo hashing.

I'd also like to see them compared to the bagwell array mapped tries used by clojure and others.


(For the record, the reason why I have stared at this as much as I have just now is because I study these kinds of data structures and therefore want to understand what is new and compelling about this specific one.)

So, tries have the property of being only as balanced as the distribution of their alphabets. It is very easy to end up in a situation where the worst-case performance of a trie is linear in the number of items in the trie. There are some researchers who were doing work on a concept called "stratified indexes", which allows you to index into deeply skewed tries, that you may find interesting.

Regardless, the claims in this article are somewhat off. For example: they state that red-black trees "write out a lot of memory". Any full binary tree with N leaf nodes has N-1 non-leaf nodes. That statement is just as true of red-black trees as it is of this data structure. It is definitely true that the STL red-black tree implementation is inefficient (allocating memory for empty nodes), but that should not be held against the algorithm.

The documentation is even weirder. "Unlike either red-black trees or most hash tables, nedtries can store as many items with identical keys as you like." That is certainly not true, and isn't even true of the STL, which has a red-black tree without the uniqueness restriction called std::multimap.

The documentation also claims that hashtables can store arbitrarily large keys and nedtries can't, which is only as true of hashtables as it is of nedtries. Since hashtables aren't ordered, and in fact require fixed-length keys, you can (and do; normally this statement would run in the other direction of causality) use a hash algorithm to squeeze the information into the key length and then verify the value when you find it. The same thing could be done with nedtries.

I'm also having a difficult time understanding why red-black trees are not an "in place algorithm" and nedtries are. I'm not even certain I understand what it would mean for a search tree to be "in place", but the documentation makes claims about "dynamic memory allocation" being its key distinction, and I see no reason why a red-black tree need require that for search or deletion. Insert requires a single allocation, but so does a trie; and, even with the STL most people use a custom pool allocator for that purpose.

To be clear, I am not at all stating that "nedtries aren't faster than other implementations", just that that has nothing to do with their algorithm. The algorithm seems to just be an array of tries, where the array is designed to help with the common case of "there are a large number of zeros at the top of the key" (caused by using small numbers as keys).

Unfortunately, that makes the distribution among the bin-level tries drastically uneven, making that first "which bucket do I end up in" decision severely unbalanced.

That specific optimization, then, doesn't seem to offer anything over a patricia trie, which itself would dramatically decrease the number of non-leaf node traversals (as nedtries does not describe using any kind of path compression at all). These path-compressed tries are what people normally use for high-performance lookup like routing.


I agree that his prose is a little confusing. Here's my understanding of his intent:

"red-black trees write out a lot of memory"

I think this refers to how R-B trees require rebalancing on insert, particularly accessing siblings of ancestors. This is doubly bad on modern processors, as it introduces a chain of control hazards dependent on those accesses, which can burn a lot of cycles on cache misses. In contrast, the trie structures only access nodes along the direct line down to the appropriate leaf. This has a control hazard (i.e., when to end the loop) and a data dependency between iterations. The control hazard can be easily predicted (assume the search recurses), and modern processors are quite good at unrolling the data dependencies across the shadow registers (or equivalent). Basically, modern processors and compilers are much happier with the sort of inner loops you see in tries than with restructuring trees (and are happiest with the inner loop of accessing a hash table).
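
Concretely, the sort of inner loop meant here looks roughly like this (a plain binary trie over fixed-width keys; the node layout is my assumption, not nedtries' actual layout):

    #include <stdint.h>

    struct tnode {
        struct tnode *child[2];   /* both NULL on a leaf */
        uint64_t      key;        /* valid on leaf nodes */
        void         *value;
    };

    /* One dependent load per iteration and a loop branch the predictor
       handles well; no rebalancing and no sibling accesses.  A single
       full-key compare at the leaf confirms the match. */
    static struct tnode *trie_find(struct tnode *n, uint64_t key)
    {
        int bit = 63;
        while (n && (n->child[0] || n->child[1])) {
            n = n->child[(key >> bit) & 1];
            bit--;
        }
        return (n && n->key == key) ? n : NULL;
    }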

"The documentation also claims that hashtables can store arbitrarily large keys and nedtries can't, which is only as true of hashtables as it is for nedtries."

I believe his implementation assumes size_t sized keys. This limit comes from his root table, which needs as many entries as bits in the key. While you could code up a variant that used arbitrary length keys I think you'd lose most of the performance advantages he measures (remember, we're mostly talking about constant factors here, not complexity class). He doesn't care about this limit, as his primary application is storing free lists in an allocator, which will always have size_t keys.

"I'm also having a difficult time understanding why red-black trees are not an "in place algorithm" and nedtries are."

I'm also confused by this, but I'm assuming he's looking at concurrent access. Tries only ever push down toward leaves, so with careful ordering it's possible to ignore many locking concerns. Or, put a different way, you're only ever allocating new nodes or filling in previously nil slots in existing nodes. Trees with structural rebalancing on insert, like R-B trees, could invalidate a concurrent traversal with a rotation.

"That specific optimization, then, doesn't seem to offer anything over a patricia trie, which itself would dramatically decrease the number of non-leaf node traversals (as nedtries does not describe using any kind of path compression at all)."

I think for his purposes there are some assumptions about distribution: namely if he's storing memory addresses in a nedtrie, it's likely he's allocating them with brk() or the like, and so they're clustered in one or two slots of the root table. Likewise, if he's storing block sizes, then the root table acts as a segregated fit similar to dlmalloc and its many offspring.

I'd agree that the way he overgeneralizes the advantages is a bit confusing, but I do think there's something interesting here vs vanilla patricia tries.


Re: vs. hash tables

Both nedtries (as specified / implemented) and hash tables use fixed length keys. Either can be implemented with an arbitrary fixed length, but both are fixed.

The reason you can store arbitrary length keys in hash tables is only because you hash the key to a fixed length and then specify some form of collision resolution.

This hashing means, of course, that you lose order (disallowing what this author calls "nearest fit" searches). But, if you are considering using hash tables for your use case, you already don't care about that feature.

In the case of nedtries, you don't even need to do anything fancy (linear probing) or use a secondary data structure (chaining): as nedtries already allow you to store the same key twice (so he said... I didn't pay attention to how; worst case you can use chaining ;P), you can make the same tradeoff.

I guess you could argue this is an unrelated data structure, a "hash nedtrie", but if you are comparing the overall abilities of these data structure concepts it seems unfair not to point this out, especially given that it doesn't require changing his implementation (the user can do the disambiguation, as they may already be used to doing with hash tables).
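
A sketch of that tradeoff (hypothetical names; not part of nedtries' API): hash the arbitrary-length key down to the fixed-size trie key, then tell colliding entries apart by comparing the stored full key.

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical entry: the trie is keyed on short_key, but the full key is
       kept alongside the value so collisions can be disambiguated on lookup. */
    struct entry {
        size_t        short_key;      /* hash of full_key; what the trie indexes */
        const char   *full_key;
        void         *value;
        struct entry *next_same_key;  /* chain of entries sharing short_key */
    };

    /* Walk the (possibly several) entries stored under one trie key and
       return the one whose full key actually matches. */
    static struct entry *disambiguate(struct entry *e, const char *full_key)
    {
        for (; e != NULL; e = e->next_same_key)
            if (strcmp(e->full_key, full_key) == 0)
                return e;
        return NULL;
    }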


Re: assumptions about distribution

But if the assumption is that they are all using a couple of slots of that root table, then the root table is not a performance advantage. It would be faster to just start the trie immediately and switch off of the first bit.

The point of having a structure like that at the top is that it is a cheap-o form of level compression, and increases the fan-out of the root node, allowing you to cut the search space down much faster given a single very cheap lookup.

As for segregated fit, the bucket array itself is probably good enough for that. What /could/ make that interesting is if you /also/ stated that "small" (many leading zeros) keys happened to be much more prevalent than those in the next bucket up (in fact, twice as prevalent).

Otherwise, it is just further unbalancing his trie, getting roughly the same tradeoff as if he were using a patricia trie (which will use path compression to eat long runs of 0s and will otherwise be roughly as unbalanced, without the cost of the bucket array, which is many internal nodes' worth of memory).


In fact, it is even worse than that: in the case of random keys (which he admits in the documentation is important for trie balance without helping the user obtain them) he'd be much much better off just chopping the first five bits off the key and using that to index into his top-level array.

Doing that would let the top-level array segment the search space perfectly into 32 sub-buckets, avoiding the fact that the higher order array elements are increasingly meaningless (as they are exponentially smaller).

At that point, however, the data structure is now a level-compressed trie ;P. (LPC tries seem to be the state of the art for this kind of structure.)


Re: vs. red-black trees

I don't know... he keeps talking about dynamic memory allocation and total memory usage. If he means cache locality, he should just say cache locality.

Regardless, if you care about cache performance and dynamic allocation, as the goal of a red-black tree is to be reasonably balanced, you can just treat it as a complete tree and use standard binary tree packing on it (the same algorithm normally used for heaps; maybe allocate each level separately, as he is also concerned with the cost of reallocation).
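
The packing meant there is just the implicit array layout used for heaps, sketched here for completeness (a generic trick, nothing nedtries-specific):

    #include <stddef.h>

    /* Implicit complete-binary-tree layout in one array: no child pointers,
       and each level of the tree sits in contiguous memory. */
    static inline size_t left_child(size_t i)  { return 2 * i + 1; }
    static inline size_t right_child(size_t i) { return 2 * i + 2; }
    static inline size_t parent(size_t i)      { return (i - 1) / 2; }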


djb's "critbit" trees are much more interesting, I think: They are basically binary patricia trees (like nedtries, if I understand correctly). However, djb's observation is that in order to locate a specific string, you only have to look at the "critical" bits - the bits that are different between different entries in the database.

So, for example, if key 1 differs from key 2 (top two keys in the tree) in bit 317, you check bit 317, and you know which one of them it is NOT (and also, which branch of the tree it is not). Proceed recursively -- most keys will, in practice, be identified by log2(n) bits where n is the number of keys in the table, NOT the length of the key.
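
A sketch of that lookup (following djb's description, with made-up types): each internal node records only the index of its critical bit, so the descent touches on the order of log2(n) nodes and the full key is compared exactly once, at the leaf.

    #include <stdint.h>
    #include <string.h>

    struct cb_node {
        uint32_t        crit_bit;   /* index of the bit that differs below here */
        struct cb_node *child[2];   /* child[0]: bit clear, child[1]: bit set;
                                       both NULL on a leaf */
        const char     *key;        /* set only on leaves */
    };

    /* Descend by testing only the critical bits, then confirm with one full
       comparison at the leaf (the descent alone cannot prove membership). */
    static const char *cb_find(const struct cb_node *n, const char *key, size_t len)
    {
        while (n && n->child[0] && n->child[1]) {
            uint32_t byte = n->crit_bit >> 3;
            unsigned bit  = 7 - (n->crit_bit & 7);
            unsigned dir  = byte < len ? ((unsigned char)key[byte] >> bit) & 1 : 0;
            n = n->child[dir];
        }
        return (n && strcmp(n->key, key) == 0) ? n->key : NULL;
    }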


For the record, patricia tries in the literature are defined to have this "critbit" property. The wikipedia article on that subject is a little broken, but djb's article is correct when it states:

"This idea was introduced by Morrison in 1968 under the name PATRICIA, and independently by Gwehenberger at about the same time."


Also, I found this link to be a good reading for anyone interested in the implementation detail: https://github.com/agl/critbit/blob/master/critbit.w


The structure discussed at the link also shares this property, but only at the root of the trie (if I understand the code correctly... my C is relatively weak and it's written in a preprocessor-heavy style I find hard to follow).


I like the concept, but the author haphazardly combines asymptotic analysis with unsubstantiated performance benchmarks and uses some imprecise language.

What does "nearly O(1) complexity" mean? Why does he mention a complexity of "O(8sizeof(void ))"? Shouldn't that constant factor be collapsed?



