Fixie tries (cipht.net)
81 points by luu on Dec 30, 2017 | 5 comments



The Array Mapped Trie is an interesting data structure that's really hard to represent in code. Even this Rust version is forced to use unsafe code and manual allocation to pull it off.

Most libraries seem to resort to using a dynamic array, like a Go slice:

    type bitmapNode struct {
        childBitmap uint64
        children []node
        bitpos uint64
    }
But that wastes space (the slice header carries a length and capacity, even though the bitmap already determines the number of children) and may add a level of indirection, so it doesn't really represent the actual data structure at all.
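
A minimal sketch (not the article's code) of how the bitmap-indexed lookup avoids storing a length at all: a child's position in the compressed array is simply the popcount of the occupied bits below its slot.

    #include <stdint.h>

    /* Hedged sketch: `bitmap` marks which of the 64 possible slots are
     * occupied; a child's index in the compressed array is the number of
     * occupied slots below it. Returns -1 if the slot is empty.
     * (__builtin_popcountll is a GCC/Clang builtin.) */
    static int child_index(uint64_t bitmap, unsigned slot) {
        uint64_t bit = 1ULL << slot;
        if (!(bitmap & bit))
            return -1;
        return __builtin_popcountll(bitmap & (bit - 1));
    }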

The original paper includes this note: http://lampwww.epfl.ch/papers/idealhashtrees.pdf

> As Nilsson and Tikkanen [1998] remark, performance critically depends on the memory allocation system and careful organization is rewarded.

There are all kinds of clever workarounds to this problem, like tagging pointers with extra meaning in their unused bits, or, perhaps most fruitfully, using different representations for different-sized nodes. For example: https://db.in.tum.de/~leis/papers/ART.pdf and https://www.usenix.org/system/files/conference/fast17/fast17...

These adaptive radix trees have 4 node types: Node4, Node16, Node48, and Node256. Node4 uses two sorted arrays of up to 4 items; Node16 uses an idea similar to the array mapped trie, with a bitfield and compressed arrays; Node48 and Node256 are more like traditional tries, since at that size the memory overhead is worth the trade-off.
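
A rough sketch of those four node kinds (layouts are illustrative, not the papers' exact structs):

    #include <stdint.h>

    typedef struct art_node art_node;   /* hypothetical common node type */

    /* Small nodes keep parallel key/child arrays; Node48 adds a 256-entry
     * index mapping a key byte to one of 48 slots; Node256 is a plain trie
     * node with a direct child pointer per byte. */
    typedef struct { uint8_t keys[4];    art_node *children[4];  } node4;
    typedef struct { uint8_t keys[16];   art_node *children[16]; } node16;
    typedef struct { uint8_t index[256]; art_node *children[48]; } node48;
    typedef struct { art_node *children[256]; } node256;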

What's frustrating about these data structures is just how hard they are to implement. Pointers, structs, arrays, and bytes seem wholly inadequate to the task, and your code becomes a mishmash of low-level, specialized instructions, hacky unsafe memory access, and byzantine bit-shifting. This note in the code just about sums it up:

    // It should be possible to implement this for arbitrary types
    // implementing BitAnd and friends, but my attempts to do so were
    // constantly thwarted.
    macro_rules! trivial_fixed_length_key_impl {
        ($($name:ident,)*) => {
            $(#[allow(trivial_numeric_casts)] impl FixedLengthKey for $name {
                const LEVELS: usize = 2 * mem::size_of::<Self>();
                #[inline]
                fn nibble(&self, idx: usize) -> u8 {
                    ((*self >> (4*(Self::LEVELS-idx-1))) & 15) as u8
                }
                #[inline]
                fn empty() -> Self { 0 }
                #[inline]
                fn concat_nibble(&mut self, nibble: u8) {
                    *self = (*self << 4) | (nibble as Self)
                }
            })*
        }
    }
It's a circle-shaped data structure trying to fit into a square-shaped programming worldview.

I wonder if we end up with hash tables not because they're technically superior, but because all of our programming languages make them easy to implement and reason about, and when we try to leave that box we constantly bang our heads against a wall and end up with code that always just looks like gobbledygook.

The latter linked paper is particularly damning in this respect, as it introduces this esoteric world of persistent memory, which doesn't really exist yet but will supposedly dominate computing for the next decade. Are we going to get new versions of programming languages to take advantage of this technology, or are we going to continue down the path of becoming wholly reliant on super-intelligent compilers to do paradigm translation?

Do we really want a world where only computers know how to program computers anymore?


I did some work on making Array Mapped Tries cache efficient here: https://sinusoid.es/misc/immer/immer-icfp17.pdf

I am also exploring Hash Array Mapped Tries. CHAMP is a good start at making them cache efficient: https://michael.steindorfer.name/publications/oopsla15.pdf

I agree that the complexity of implementing these data-structures is annoying.

One interesting insight, though, is that Array Mapped Trie addressing works much like the page-table hierarchies used for virtual memory addressing. I think there is a lot of value in providing an Array Mapped Trie based core construct at the language-runtime level.
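
A minimal illustration of that analogy (names are made up): each trie level consumes a fixed-width chunk of the key, just as each page-table level consumes a fixed-width chunk of a virtual address.

    #include <stddef.h>
    #include <stdint.h>

    /* Hedged sketch: extract the child index for a given trie level from a
     * 64-bit key, the same way a page walk extracts an index for each
     * page-table level from a virtual address. */
    static size_t level_index(uint64_t key, unsigned level, unsigned bits_per_level) {
        unsigned shift = 64 - (level + 1) * bits_per_level;
        return (size_t)((key >> shift) & ((1ULL << bits_per_level) - 1));
    }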

It is not an easy data structure to program ad hoc. Clojure did the world a huge service by using these as its default data structures and showing their value, all with a convenient and pleasant syntax. I think there are interesting new opportunities and research directions in using them at other levels between the OS and the language runtime for concurrency- and immutability-centric languages...


It seems to me the gobbledygook is more a problem with your chosen language than with the data structure. In C it is perfectly straightforward.

Also, we have long been living in a world where computers are needed to program computers. Or when did you last type in a program in raw machine language?


I disagree that it's straightforward in C. C is built around pointers, but the bitmasked array uses a completely different kind of pointer. I.e., you can't just do

    n->n.num_children++;
    n->children[c] = (art_node*)child;
You have to do:

    __m128i cmp;

    // Compare the key to all 16 stored keys
    cmp = _mm_cmplt_epi8(_mm_set1_epi8(c),
                         _mm_loadu_si128((__m128i*)n->keys));

    // Use a mask to ignore children that don't exist
    unsigned bitfield = _mm_movemask_epi8(cmp) & mask;
And that's all CPU-specific.
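
A portable scalar equivalent would look roughly like this (a hedged sketch, not libart's actual fallback; note that _mm_cmplt_epi8 compares signed bytes, so the two only agree for key bytes below 128):

    #include <stdint.h>

    /* Build the same bitfield without SIMD: one bit per stored key that
     * compares greater than the search byte. */
    static unsigned keys_greater_than(const uint8_t *keys, int num_children, uint8_t c) {
        unsigned bitfield = 0;
        for (int i = 0; i < num_children; i++)
            if (c < keys[i])
                bitfield |= 1u << i;
        return bitfield;
    }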

I'm not saying it can't be done, but rather the tools are a poor fit for the problem. Have you tried to build this data structure? Can you link to the code? I'd love to see what a clean, high-level, understandable example would look like.

> Also we are long in a world where computers are needed to program computers.

Yes, you're correct about this. I just think it's headed even further in this direction. Languages like C++ and Rust are designed to give the programmer control, but the platforms we're programming for seem to be designed to be impossible to understand.

If I were tasked with building data structures like this I would really want some tools to help me feel confident the code was correct.


You might like my “qp trie” code.

https://dotat.at/prog/qp/README.html

It is in a rather experimental proof-of-concept state, but it has been adapted for production use in the Knot-DNS server.

The main correctness tool I used was a custom data-structure fuzzer, with various sanity checks, including a reference implementation in Perl that produces the same results as the real implementation under test (when it works).
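
For anyone curious, that style of differential testing looks roughly like this (a hedged sketch; trie_*, ref_*, and random_key are hypothetical stand-ins, not qp-trie's actual API): drive the structure under test and a trusted reference with the same operations and assert they agree.

    #include <assert.h>

    struct trie;      /* structure under test (hypothetical) */
    struct ref_map;   /* trusted reference model (hypothetical) */
    void trie_insert(struct trie *, const char *);
    int  trie_contains(struct trie *, const char *);
    void ref_insert(struct ref_map *, const char *);
    int  ref_contains(struct ref_map *, const char *);
    const char *random_key(void);

    static void fuzz_round(struct trie *t, struct ref_map *r, unsigned ops) {
        for (unsigned i = 0; i < ops; i++) {
            const char *k = random_key();
            trie_insert(t, k);
            ref_insert(r, k);
            /* both implementations must always agree */
            assert(trie_contains(t, k) == ref_contains(r, k));
        }
    }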



