Crit-bit Trees: the best way to store sets (cr.yp.to)
106 points by bumbledraven on Sept 19, 2011 | 40 comments



If you're looking for a jvm implementation, there are at least a couple:

https://github.com/rkapsi/patricia-trie

https://github.com/jfager/functional-critbit

The latter is my own, with a focus on memory efficiency (by my benchmarks, it uses about 40% less space than patricia-trie for the same data set). Would love to get some extra eyes on it to figure out how to trim it further or speed it up.


Crit-bit trees are great, but they're not a silver bullet: like every tree structure, even in the lucky case a search can incur O(log n) cache misses, while a good hash table needs just one or two.

Also, as pointed out in another comment, it is not necessarily balanced (unlike red-black trees), and there are some real-world datasets where crit-bit (or Patricia) trees behave poorly - URLs, for example. I myself have seen Patricia trees with height greater than 100 for sets of URLs where log n ~ 30. According to VTune, most of the search time was spent on cache misses.


That's a very good point. One way to circumvent this problem is to combine hashing and crit-bit trees, by constructing an unordered tree on the hash values of the set elements. The nice thing here is that if the hash function behaves like a uniform random variable the resulting tree will be balanced. Furthermore, without ordering the implementation is trivial.

At the same time you will not need rehashing once the table "fills up", and since you don't discard bits from the hash function, the expected number of collisions after inserting n elements with a 32-bit hash is only about n^2 / 2^33 (the birthday bound) - essentially 0 for reasonable values of n.
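The hashing step can be sketched in a few lines. Here mix32 is the well-known murmur3 32-bit finalizer; cb_insert/cb_contains in the comment are hypothetical names standing in for whatever crit-bit implementation is in use:

```c
#include <stdint.h>

/* Instead of building the crit-bit tree on raw keys, build it on
 * h(key) for a well-mixed h.  With uniformly distributed hash bits the
 * tree's expected depth is O(log n) regardless of the key distribution.
 * mix32 is the 32-bit murmur3 finalizer; any good integer mixer works. */
static uint32_t mix32(uint32_t x) {
    x ^= x >> 16;
    x *= 0x85ebca6bu;
    x ^= x >> 13;
    x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}

/* The set operations then simply hash first, e.g.:
 *
 *     cb_insert(&tree, mix32(key));
 *     if (cb_contains(&tree, mix32(key))) ...
 *
 * Because mix32 is a bijection on 32-bit values, it introduces no
 * collisions at all for 32-bit keys; for strings you'd hash the whole
 * string and accept the small birthday-bound collision probability. */
```

Note that each step of the finalizer (xor-shift, multiply by an odd constant) is invertible, which is what makes the whole thing a bijection.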

Additionally, you can use clustering (e.g. build a crit-nibble tree with 16 pointers per node - one 64-byte cache line, assuming 4-byte pointers) to reduce the number of cache misses. This can backfire spectacularly unless the data is essentially random, so the hashing step remains important.

It may not be a silver bullet, but there are some interesting trade-offs.


That is true, but you lose one of the most important properties of crit-bit trees: ordered operations.

If the data structure is unordered, I don't see any reason why it should be better than a normal hash table.


The data structure is not better than a normal hash table. It is a different trade-off.

For instance, you don't have to do a rehashing step, which is important for real-time applications. Memory usage is also fully deterministic: 2*(n-1) words of internal nodes for an n-element set (with 2 words per node).

If none of this matters to you and you merely want a map with fast lookups then a simple hash table is probably a better choice.


The string b-tree also does trie-like search without sacrificing balance:

http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.59...


> A crit-bit tree for a nonempty prefix-free set S of bit strings has an external node for each string in S; an internal node for each bit string x such that x0 and x1 are prefixes of strings in S; and ancestors defined by prefixes.

And that's a perfect (missing) example for "a picture is worth a thousand words". Then again... this coming from djb, I suppose I should be surprised the description is as verbose as it is.


> that's a perfect (missing) example for "a picture is worth a thousand words".

I don't know if it helps, but here's the picture I drew at the top of my patricia trie code. (Crit-bit trees aren't the same as patricia tries, but they're close enough for expository purposes.)

    /**
     * Our Patricia tree structure can be thought of as operating on strings of
     * 9-bit bytes, where 0x00 -- 0xFF are mapped to 0x100 -- 0x1FF and 0x00
     * represents the end-of-string character (note that NUL can occur inside
     * keys).  The field (struct pnode).mask is either 0 or a power of 2; if 0,
     * the left child, if non-NULL, is a pointer to the record associated with
     * the key thus far.  For example, the strings "hello", "hello colin",
     * "hello world", and "wednesday", each associated with pointers to
     * themselves, are stored in the following tree:
     *
     *                       [0x10, 0x60, 0, ""]
     *                        |               |
     *   [0x00, 0x00, 5, "hello"]          [0x00, 0x00, 9, "wednesday"]
     *    |                    |              |                 |
     * "hello"     [0x10, 0x60, 1, " "]     "wednesday"        NULL
     *              |                |
     *   [0x00, 0x00, 5, "colin"]   [0x00, 0x00, 5, "world"]
     *    |                    |       |                 |
     * "hello colin"          NULL  "hello world"       NULL
     *
     */
Edited to add: And here's the node structure:

    /* Structure used to store a Patricia tree node. */
    struct pnode {
            struct pnode *  left;           /* Left child. */
            struct pnode *  right;          /* Right child. */
            uint8_t         mask;           /* Critical bit mask. */
            uint8_t         high;           /* High bits of this node. */
            uint8_t         slen;           /* Length of s[]. */
            uint8_t         s[];            /* Bytes since parent's s[]. */
    };


The particular words you've quoted have managed to condense a thousand pictures — an infinite number, really — into 45 words. It's kind of a counterexample.

Some examples would certainly help, and probably diagrams would help understanding the examples.


That's assuming you take "nonempty prefix-free set", "internal node", and "bit string x such that x0 and x1" (not sure whether the 0 and 1 are supposed to be subscripts or literal bits) for granted. A picture solves the problem in many ways, by giving you real examples.

It's 45 words with lots of assumed knowledge that you could write books about ;)


Yeah, I gotta say, I knew exactly how it worked the moment I looked at the radix tree picture on wikipedia. After reading more than half the linked article though I still had very little understanding of what it really did.


> Another advantage is that a crit-bit tree guarantees good performance: it doesn't have any tricky slowdowns for unusual (or malicious) data.

Consider a crit-bit tree over variable-length bit strings. The set 1, 11, 111, ... (with n bit strings) has depth n.

Depending on how you count, it can be argued that this is still substantially better than the worst-case of hash tables, but in all of the ways I think of counting, a crit-bit tree is worse on this input than a balanced search tree.


Big-O notation alone doesn't settle this. The big question is the cost of the different types of comparisons.

The most important thing to remember is that a crit-bit lookup does O(string-length) "critical" bit lookups, and then an O(string-length) final comparison - for a total of O(string-length).

Whereas a hash lookup usually does O(string-length) computation of the hash value, average case O(1) lookups, and then O(string-length) comparison of the key value - for a total of O(string-length).

A balanced tree does O(log-n) comparisons, of O(string-length) each (the final one giving the confirmation, which was an independent step for both hash and critbit trees).

This is true even on this specific input. It boils down to how the amortized single-character comparison compares to a cache miss.

If crit-bit tree leaves are packed in a cache-friendly way (they are constant size, so this might actually be easy to arrange), they are likely to perform at least as well as balanced trees.


Good points.

Given a query string of length k and cache lines that store B bits, the crit-bit trie for the bad input uses Theta(k) computations and causes Theta(k) cache misses, I think. A balanced tree holding the same set causes Theta(k * lg n) computations and O(lg n + (k * lg n / B)) cache misses, I think. My understanding is that lg n / B is frequently much less than 1, even for small cache line sizes, so that the balanced tree will generally use less I/O than a crit-bit tree but more computation.


Isn't the set supposed to be prefix-free, though?


Yes but you can do the same example with 0, 10, 110, 1110, ...


Yes, that is the worst-case example for the set in a crit-bit tree.


Also known as a radix tree, I believe.


Crit-bit trees are not radix trees.

A crit-bit tree stores next-different-bit positions at branches and entire values at the leaves, and is searched by following branches depending on the value of the relevant bit in the search key (the "critical bit" which gives the data structure its name). Once a leaf is reached the entire search key is compared to the value at the leaf, which is necessary because a key that is not in the tree might differ at bit positions that are non-critical and would not have been tested during the tree descent.

This is a very different arrangement to a radix tree, which is essentially a form of trie.
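The search just described is short in code, too. Here's a minimal sketch over fixed-width 32-bit keys (names like cb_insert are mine, not from the article; djb's version handles variable-length byte strings):

```c
#include <stdint.h>
#include <stdlib.h>

/* Internal nodes hold only a critical-bit position; leaves hold whole
 * keys; a final full comparison at the leaf confirms membership. */
typedef struct node {
    int internal;                 /* 1 = internal node, 0 = leaf */
    union {
        struct { int bit; struct node *child[2]; } in;
        uint32_t key;             /* leaf: the whole key */
    } u;
} node;

/* Bit b of k, counting from the most significant bit (b = 0). */
static int bit_of(uint32_t k, int b) { return (k >> (31 - b)) & 1; }

static node *leaf(uint32_t key) {
    node *n = malloc(sizeof *n);
    n->internal = 0;
    n->u.key = key;
    return n;
}

/* Descend to a leaf, testing only the critical bits. */
static node *walk(node *root, uint32_t key) {
    while (root->internal)
        root = root->u.in.child[bit_of(key, root->u.in.bit)];
    return root;
}

int cb_contains(node *root, uint32_t key) {
    /* The descent skipped the non-critical bits, so this full
     * comparison is essential. */
    return root != NULL && walk(root, key)->u.key == key;
}

node *cb_insert(node *root, uint32_t key) {
    if (!root) return leaf(key);
    uint32_t nearest = walk(root, key)->u.key;
    if (nearest == key) return root;            /* already present */
    int crit = 0;                               /* first differing bit */
    while (bit_of(nearest ^ key, crit) == 0) crit++;
    node *nn = malloc(sizeof *nn);
    nn->internal = 1;
    nn->u.in.bit = crit;
    /* Critical bits increase along any path, so the new node belongs
     * just above the first node whose bit exceeds crit. */
    node **p = &root;
    while ((*p)->internal && (*p)->u.in.bit < crit)
        p = &(*p)->u.in.child[bit_of(key, (*p)->u.in.bit)];
    nn->u.in.child[bit_of(key, crit)] = leaf(key);
    nn->u.in.child[1 - bit_of(key, crit)] = *p;
    *p = nn;
    return root;
}
```

Variable-length keys need end-of-string handling (as in the Patricia diagram earlier in the thread), but the shape of the search is identical.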



Years ago we used this at Singshot to evenly distribute numerical assets across subdirectories. Very useful these days for shardy cloudbucket applications.


agl's C translation written in literate programming style:

PDF: https://github.com/agl/critbit/blob/master/critbit.pdf

Repo: https://github.com/agl/critbit


According to the paper, this code was actually extracted directly from qhasm, which djb released in 2006. This is notable because any code becomes more worthy of admiration once we know it was written by djb. As Mark-Jason Dominus used to say, "I'd drink poisoned Kool-Aid as long as djb told me he'd made it himself."


Thanks for the clarification, you're right. I had thought it was translated from qhasm to C.


Does 'literate programming style' require the lack of indentation in his source file?


The source file is critbit.w. See <http://en.wikipedia.org/wiki/Literate_programming>.


Not sure, maybe that's how CWEB outputs it. Read the PDF instead; it has the properly indented code :-)


I'm not sure it's the best way to store sets. It seems like potentially doing a pointer dereference per bit is quite an overhead. It would be 8 times faster (for dense sets) to have a crit-byte tree. Though if you did that naively, there would be a huge associated storage cost. To mitigate that, you can do what AMTs do and have a bitmap per node followed by an array of pointers.
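The bitmap trick takes only a few lines. This is a hypothetical node layout (the names are mine, not from any particular AMT implementation) branching on 4 bits at a time, with only the existing children actually stored:

```c
#include <stdint.h>

/* Branch on a nibble (16 possible slots) but store only the children
 * that exist: a 16-bit occupancy bitmap plus a packed child array. */
typedef struct amt_node {
    uint16_t bitmap;             /* bit i set => slot i has a child */
    struct amt_node **children;  /* packed array, popcount(bitmap) long */
} amt_node;

static int popcount16(uint16_t x) {
    int c = 0;
    while (x) { x &= x - 1; c++; }  /* clear lowest set bit */
    return c;
}

/* Index of slot's child within the packed array, or -1 if absent. */
int amt_index(const amt_node *n, unsigned slot) {
    uint16_t bit = (uint16_t)(1u << slot);
    if (!(n->bitmap & bit))
        return -1;
    /* Count how many children are stored before this slot. */
    return popcount16(n->bitmap & (uint16_t)(bit - 1));
}
```

So a node with k children costs 2 bytes of bitmap plus k pointers instead of 16 pointers, at the price of one popcount per step (a single instruction on most modern CPUs).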


Crit-bit trees don't branch per bit, they branch per differing bit. Much more efficient for most data sets.


I know that, thanks.


If you don't need the add and remove operations, it's possible to reduce the size significantly by applying automata theory: http://www.n3labs.com/pdf/lexicon-squeeze.pdf

It works extremely well "out of the box" for dictionary words, since the automaton enables sharing of both prefixes and suffixes.


The minimal-acceptor approach is also best when you're considering dynamic programs that effectively involve some kind of FSA intersection - including best-first lazy versions like "find the closest spelling correction in your lexicon to a sequence of words in some text" - because your state (suffix-set) representation is canonical.

I only skimmed the paper, but a simple n*log n approach to produce the same result is to first sort the lexicon, then build the trie while canonicalizing the suffix sets depth first (needing to hash/compare only one level deep, by induction).

This guy wrote an entire thesis comparing various methods for the same task - http://www.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/p...


Implementing approximate matching has been on my todo list for some time. I'm planning to either implement http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5362... or http://www.sciencedirect.com/science/article/pii/S1570866704...

The data structure for the Mealy recognizer described in the paper can very easily be changed to support minimal perfect hash numbers for all elements, without using more space (my implementation actually uses less space, because the same information can be used to calculate the number of strings with a given prefix in O(1) time).
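The counting idea is easy to demonstrate on a plain trie (the structure and names below are my own toy sketch, not the paper's): if each node stores how many accepted words are reachable from it, one traversal yields a word's lexicographic rank, i.e. a minimal perfect hash value. The same counts carry over unchanged to a minimized automaton.

```c
#include <stdlib.h>

/* Toy trie over 'a'..'z' with per-node reachable-word counts. */
typedef struct tnode {
    struct tnode *child[26];
    int accepts;     /* 1 if a word ends here */
    int reachable;   /* accepted words in this subtree */
} tnode;

static tnode *tnode_new(void) { return calloc(1, sizeof(tnode)); }

/* Add a (previously absent) word, bumping counts along the path. */
void trie_add(tnode *root, const char *w) {
    tnode *n = root;
    n->reachable++;
    for (; *w; w++) {
        int c = *w - 'a';
        if (!n->child[c]) n->child[c] = tnode_new();
        n = n->child[c];
        n->reachable++;
    }
    n->accepts = 1;
}

/* Rank of w among all stored words in sorted order, or -1 if absent. */
int trie_rank(const tnode *root, const char *w) {
    const tnode *n = root;
    int rank = 0;
    for (; *w; w++) {
        int c = *w - 'a';
        rank += n->accepts;            /* words ending before this one */
        for (int i = 0; i < c; i++)    /* words in earlier branches */
            if (n->child[i]) rank += n->child[i]->reachable;
        if (!n->child[c]) return -1;
        n = n->child[c];
    }
    return n->accepts ? rank : -1;
}
```

Since ranks are a bijection onto 0..n-1, they double as a minimal perfect hash, and reversing the walk (choosing branches by count) maps numbers back to words.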


You might find this interesting as a testbed -> http://www.facebook.com/careers/puzzles.php?puzzle_id=17


I couldn't find any benchmarks on the web to substantiate the speed claims.

Still, it's a cool data structure, and I respect djb enough to trust that it's at least competitive with hash tables; I'll consider it when I want more operations than offered by a hash table (especially for a set that's very sparse compared to key length).


How is this different from a trie? (http://en.wikipedia.org/wiki/Trie)


It's a particular kind of trie. http://en.wikipedia.org/wiki/Trie#Bitwise_tries


The article puts forward a tree as a better alternative to hashtables:

"I have become convinced that this strategy should change. The revised strategy is much simpler: there should be one fundamental set-storage type, namely a crit-bit tree. Here's how a crit-bit tree stacks up against the competition:

* A hash table supports insertion, deletion, and exact searches. A crit-bit tree supports insertion, deletion, exact searches, and ordered operations such as finding the minimum. Another advantage is that a crit-bit tree guarantees good performance: it doesn't have any tricky slowdowns for unusual (or malicious) data."

Oh, what's the big-O of a lookup in a tree vs lookup in a hash-table again? O(lg N) vs O(1) with bad locality of reference to boot, you say?

FAIL.


He's not arguing that crit-bit trees are better than hash tables; he's arguing that they are a better choice for the fundamental set datatype in languages like Python, Perl, etc.

When you write an application, you know your performance requirements and can pick the set representation most appropriate for the job. A language designer doesn't have this information: their choice of fundamental datatype affects everyone.

Bernstein is arguing that they should choose a set representation with good performance for as wide a range of features as possible rather than one that performs well for a small set of requirements, and very poorly for others.

I can see arguments both ways here, but FAIL is a bit too strong IMO.


You weaken his rhetoric; he writes as if it's crit-bit all the way, for all purposes.

And since there is not a queue of people complaining that Python's dict isn't iterable in order, I'd say he's wrong. Random access in a dict is the most common use case by a long chalk.



