
In my experience, it's not just throughput that's important, but also 99th percentile latency. If I understand fractal trees correctly, you sometimes need to rewrite all of your elements on disk. How do you do this without causing lag?


That's happily not how fractal trees work, at all. We have a few talks online describing how they work. Zardosht has one here http://vimeo.com/m/26471692 . I thought Bradley had a more detailed one at MIT but I can't find it right now. (EDIT: found it! http://video.mit.edu/watch/lecture-19-how-tokudb-fractal-tre...)

Basically, I think you're thinking of the COLA. What we implement does have a literal tree structure, with nodes and children and the whole thing, so at any point you're just writing out new copies of individual nodes, which are on the order of a megabyte. At no point do we have to rewrite a large portion of the tree, so there aren't any latency issues.


Thank you very much for the links. Is my understanding correct?

Essentially, a fractal tree is a B-Tree (or perhaps a B+Tree?) with buffers on each branch (per child). Operations get added to these buffers and when one becomes full, the operations get passed to the corresponding child node. Operations are applied when they reach the node that is responsible for the data concerned.
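For concreteness, here's a toy Python sketch of that idea (my own hypothetical code, nothing like TokuDB's implementation: the names, the tiny buffer capacity, and the dict-backed leaves are all made up for illustration). Each internal node keeps one buffer of pending operations per child; an insert just lands in the root's buffer, and operations only get applied when a chain of flushes carries them down to the responsible leaf:

```python
BUFFER_CAP = 4  # tiny for illustration; real buffers are on the order of megabytes

class Node:
    def __init__(self, pivots=None, children=None):
        self.pivots = pivots or []        # pivots[i] separates children[i] and children[i+1]
        self.children = children or []    # empty list => this node is a leaf
        self.buffers = [[] for _ in self.children]  # one op buffer per child
        self.data = {}                    # leaf-level storage

    def is_leaf(self):
        return not self.children

    def child_index(self, key):
        i = 0
        while i < len(self.pivots) and key >= self.pivots[i]:
            i += 1
        return i

    def insert(self, key, value):
        if self.is_leaf():
            self.data[key] = value        # the operation is finally applied here
            return
        i = self.child_index(key)
        self.buffers[i].append((key, value))   # otherwise, just buffer it
        if len(self.buffers[i]) >= BUFFER_CAP:
            self.flush(i)                 # full buffer: push the batch one level down

    def flush(self, i):
        ops, self.buffers[i] = self.buffers[i], []
        for k, v in ops:                  # one flush moves a whole buffer of ops
            self.children[i].insert(k, v)

    def get(self, key):
        if self.is_leaf():
            return self.data.get(key)
        i = self.child_index(key)
        for k, v in reversed(self.buffers[i]):  # newest pending op wins
            if k == key:
                return v
        return self.children[i].get(key)
```

For example, with a two-level tree `Node(pivots=[10], children=[Node(), Node()])`, inserts accumulate in the root's two buffers and only hit the leaves in batches of `BUFFER_CAP`.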


That's about it! It's like a B+Tree in that the data resides in the leaves (well, and in the buffers), except that the fanout is a lot lower because we save room in the internal nodes for the buffers.


Hi. I thought the block size is much bigger in fractal trees (like 4MiB instead of 4KiB) than in B-trees hence the fanout would be about the same?

I'm trying to experiment with these ideas on my side, but can't quite grok how large the buffers must be at each level of the tree.

Let's take a concrete example: 2^40 (1T) records with an 8-byte key and an 8-byte value. Consider a traditional B-tree with an 8KiB block size, assume links take 8 bytes too, and assume for a moment that blocks are completely full. That's 1G leaf blocks, a fanout of about 1K, and hence a full 4-level tree: 1 root block, 1K level-2 blocks, 1M level-3 blocks, 1G leaf blocks.
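A quick sanity check of that sizing (this just replays the rounded assumptions above — ~1K fanout and ~1K records per leaf — rather than the exact 8KiB/16B arithmetic):

```python
# Assumed figures from the example: 2^40 records, fanout and
# records-per-leaf both rounded to ~1K as in the comment above.
records = 2**40
fanout = 1024                       # ~1K children per internal block
leaf_records = 1024                 # ~1K records per leaf block
leaves = records // leaf_records    # 2^30, i.e. about 1G leaf blocks

levels, blocks = 1, leaves          # count levels walking up to the root
while blocks > 1:
    blocks = (blocks + fanout - 1) // fanout
    levels += 1

print(leaves, levels)               # 1073741824 4
```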

In this setting, I understand that a fractal tree will attach a buffer to each internal node. How large will they be at level 1 (root), level 2, and level 3 ?


The buffers have the same capacity at each level of the tree. It doesn't have to be that way, and you can find different strategies in the literature, but that's the way we do it.

The degree of each internal node is way lower, because we want to use the space in internal nodes for buffers more than for keys. In a 16TB tree, we would probably have a tree of height 6 or 7.

What we restrict is the size of a whole internal node, so that any one buffer in it can get much more full than the others if it needs to. We want to pick a node size that makes sense for your media. 4MB is a good compromise for spinning disks between their typical bandwidth and seek time, and it's large enough to give the garbage collector a rest on SSDs.


OK, I'm starting to get the concept. Here is how I'm understanding the whole thing:

Assume a node size that can hold 1M records in a buffer and 32 links (that's about 16 MiB, more or less; more than what you suggest, but it simplifies the explanations).

Assume further that to operate on a node, you read it entirely into memory, modify it, sort it, then write it entirely back to disk, and that all of that takes about 1 second. This is pessimistic, as there are surely more efficient ways to keep 1M-record nodes sorted.

Now consider completely random insertions of 16-byte records and assume that the keys split completely uniformly among the subtrees.

1. Every 1M insertions, you need to spill from level-1 (the root buffer) to level-2, which generates about 32 IOs.

2. Additionally, every 32M insertions, you need to spill all level-2 buffers to level-3, which generates 32*32 IOs.

n. Every 32^(n-1) M insertions, you need an additional 32^n IOs to spill from level-n to level-(n+1).

So, all in all, to insert 32^n M records, you need a total of (n+1)*32^(n+1) IOs, which means 32*(n+1) IOs per 1M insertions.

For example, in a 16TiB data store, i.e. with n=8, you need 288 IOs per million insertions, or about 3,500 random insertions/second at 1s per random IO.

In comparison, a B-Tree with a fanout 1024 and a block size 16KiB would generate about 4 random IOs per insertion. These IOs would be dominated by the seek time (~10ms), so you could get only 25 insertions per second.

On the other hand, a point query in this B-Tree would take 4 IOs instead of 8 random I/Os in the fractal tree. So, that's the tradeoff: much better insertion rate vs. slightly worse query time.

Note that a small fanout (about the square root of the equivalent B-tree fanout) is important. If you take 1024 as the fanout in the fractal tree, the tree has half the height, but the number of IOs per 1M insertions jumps to 1024*5 instead of 32*9.
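A quick script replaying the arithmetic above (hedged: it just encodes the assumptions from this thread — fanout 32, n=8 levels, 1 second per node-sized IO, and a 4-level B-tree at ~10ms per seek — it doesn't measure anything):

```python
# Fractal-tree side: 32*(n+1) IOs per 1M inserts, at 1 s per node IO.
fanout, n = 32, 8
ios_per_million = fanout * (n + 1)            # 288 for n = 8
fractal_rate = 1_000_000 / ios_per_million    # inserts/sec at 1 s per IO

# B-tree side: ~4 random IOs per insert, dominated by ~10 ms seeks.
btree_ios_per_insert = 4
btree_rate = 1 / (btree_ios_per_insert * 0.010)

print(ios_per_million, round(fractal_rate), round(btree_rate))
```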

Have I got it right?


The argument I like goes like this:

Flushing one buffer one level down costs O(1) I/Os. It moves B elements, so flushing one element one level down costs O(1/B) amortized I/Os. The tree has fanout O(1), so it has height O(log N). Therefore, each element must be flushed O(log N) times before it reaches its target leaf. So the cost to get one element down to its leaf is O((log N)/B) amortized I/Os.

Compare this with a B-tree, which costs O(log_B N) = O((log N)/(log B)) per insertion.
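Plugging in some concrete (assumed) numbers makes the gap vivid — say N = 2^40 elements and B = 2^20 elements per buffer/block:

```python
import math

# Assumed sizes: 2^40 elements total, 2^20 elements moved per flush/block.
N, B = 2**40, 2**20
fractal_cost = math.log2(N) / B               # O((log N)/B) amortized IOs/insert
btree_cost = math.log2(N) / math.log2(B)      # O((log N)/(log B)) = O(log_B N)

print(fractal_cost, btree_cost)               # ~4e-05 vs 2.0
```

That is, the fractal tree pays a vanishing fraction of an IO per insert, while the B-tree pays a couple of full random IOs.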

I'm pretty sure your math is right, though it attacks it from the other direction and I've only just skimmed it. It seems sound though.

The point query tradeoff is there in the theory, you're right, but it assumes a cold cache. In practice, most people have enough RAM for all their internal nodes, whether they're using a B-tree or a fractal tree, so either way it's typically 1 I/O per random point query in either data structure.

Be careful with asterisks here. :)


Good.

Now, what is the recommended layout of the node buffers (the 4MiB buffers)? I've read about Packed Memory Arrays, but don't quite get the complete idea. Can we really do better than read, merge, write?

(A pity there's no preview button on this forum...)


Unsorted is fine, we're just counting I/Os, and it takes O(1) I/Os to read or write a buffer regardless of the layout.


I should add that, to reduce the cost of a flush by a pretty decent constant number of I/Os, you keep one buffer per child in each node. Basically, you want to do the sorting work on arrival into the node, not during the flush.
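A toy sketch of that "sort on arrival" idea (an assumption-laden illustration, not the real implementation): each node keeps one sorted buffer per child, so a flush is just handing an already-sorted run down to the child.

```python
import bisect

class Buffers:
    def __init__(self, n_children):
        self.per_child = [[] for _ in range(n_children)]

    def add(self, child_idx, key, value):
        # The sorting work happens now, on arrival into the node...
        bisect.insort(self.per_child[child_idx], (key, value))

    def flush(self, child_idx):
        # ...so the flush just ships a sorted run downward.
        run, self.per_child[child_idx] = self.per_child[child_idx], []
        return run
```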


I'm a little confused. I thought fractal trees worked cache obliviously or am I mistaken?


In theory, it is cache oblivious, and the CO-DAM model informs our decisions about the implementation, but no, the implementation itself isn't actually cache oblivious. Shh, don't tell on us.

A "fractal tree" is defined by our marketing team as "whatever it is we actually implement." If you want to talk cache-oblivious data structures, we can talk about things like the COLA and cache-oblivious streaming B-trees which have rigorous definitions in the literature. At some point, if you want to achieve a certain level of detail, you have to pick one or the other in order to continue the conversation.


Sounds like a Y-Tree to me: "A Novel Index Supporting High Volume Data Warehouse Insertions," C. Jermaine et al., VLDB '99.

How are fractal trees different?


Is that the same as Willard's Y-fast tree? If so, then no they're not very similar. If not then I'm unfamiliar and I should read that paper.


Nope, not the same. From the descriptions you give above, they're very similar (a modified B-tree with heaps of delayed insertions in internal nodes), though the analysis in the paper does not use CO techniques.

This tweak to B-trees to support fast insertions works so well that I am really surprised it has not become more common in practice. The only downside I can think of is that flushes down the tree when internal buffers get large may cause lower-level pages to become overfull, but if you allow more than one flush operation at each level you can avoid this problem. Of course, that's only important if you are paranoid about internal nodes actually fitting in a page; if you don't care about that, then the worst-case analysis looks much better, since a flush operation causes at most one flush in the pages immediately below it in the tree structure.


I was surprised too, when I learned this technique, that people weren't already doing it. I think it's just an age thing. B-trees are old so they have all the kinks worked out, at least in mature implementations. Fractal trees are new enough that there are still a bunch of tradeoffs to play with and evaluate.

When you actually go and implement the system, you find all these behaviors that really aren't expressed in the theory. There are lots of things we're still experimenting with. Flushing isn't too hard: if a flush makes another node way too big, you can just flush that one too. We don't allow our cascading flushes to fan out, to prevent latency-style problems, but they can flush all the way down to a leaf. There are some other fun problems around query strategies and concurrency control too. It's definitely a fun structure to play with.



