Let’s Invent B(+)-Trees (shachaf.net)
420 points by panic on April 28, 2020 | 69 comments



That made sense. I was following along, and then all of a sudden, it just kind of ended.

As a layman who doesn't clearly remember B Trees, it would be awesome to have even a sentence at the end, like

...and that's B Trees! Commonly used for storing fields in relational databases, filesystems, and more!

For fellow laymen, https://en.wikipedia.org/wiki/B-tree isn't bad, but is there more?


The Postgres documentation on btrees is absolutely stellar if you can afford some extended time to study it [1]. From there you can read the actual code or watch some videos on how PostgreSQL indexes use them. You can even dump indexes locally to see their content, from the root all the way down to the leaf nodes [2] (this assumes you've read and understood the above).

[1] https://github.com/postgres/postgres/tree/master/src/backend...

[2] https://www.postgresql.org/docs/10/pageinspect.html


Totally agree. The first example was clear and intuitive, but I was lost when the author outlined the idea of blocks indexing blocks indexing values. Maybe an additional example could clarify this, which would considerably improve the quality of the post.


This sounds exactly how the excellent Python Sorted Containers library works, and achieves "faster than C" performance in pure Python by effective use of the bisect module (which itself is written in C). http://www.grantjenks.com/docs/sortedcontainers/implementati...

> The Sorted Containers internal implementation is based on a couple observations. The first is that Python’s list is fast, really fast. Lists have great characteristics for memory management and random access. The second is that bisect.insort is fast. This is somewhat counter-intuitive since it involves shifting a series of items in a list. But modern processors do this really well. A lot of time has been spent optimizing mem-copy/mem-move-like operations both in hardware and software.

> But using only one list and bisect.insort would produce sluggish behavior for lengths exceeding ten thousand. So the implementation of Sorted List uses a list of lists to store elements. In this way, inserting or deleting is most often performed on a short list. Only rarely does a new list need to be added or deleted.

> Sorted List maintains three internal variables: _lists, _maxes, and _index. The first is simply the list of lists, each member is a sorted sublist of elements. The second contains the maximum element in each of the sublists. This is used for fast binary-search. The last maintains a tree of pair-wise sums of the lengths of the lists.
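
A minimal sketch of that list-of-lists idea in Python (illustrative only -- the class, constants, and names here are made up, not the library's actual code):

  import bisect

  class TinySortedList:
      """Toy list-of-lists sorted container, roughly in the spirit of the
      description above; the real library also keeps the _index tree of
      sublist lengths and rebalances more carefully."""
      LOAD = 1000  # split sublists that grow beyond this

      def __init__(self):
          self._lists = []   # sorted sublists
          self._maxes = []   # max element of each sublist

      def add(self, value):
          if not self._lists:
              self._lists.append([value])
              self._maxes.append(value)
              return
          # Binary-search _maxes to find which sublist the value belongs in.
          i = bisect.bisect_left(self._maxes, value)
          if i == len(self._lists):
              i -= 1
          sub = self._lists[i]
          bisect.insort(sub, value)          # the fast mem-move insert
          self._maxes[i] = sub[-1]
          if len(sub) > self.LOAD:           # keep sublists short
              half = len(sub) // 2
              self._lists[i:i + 1] = [sub[:half], sub[half:]]
              self._maxes[i:i + 1] = [sub[half - 1], sub[-1]]

      def __contains__(self, value):
          i = bisect.bisect_left(self._maxes, value)
          if i == len(self._lists):
              return False
          sub = self._lists[i]
          j = bisect.bisect_left(sub, value)
          return j < len(sub) and sub[j] == value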


"Faster than C", by effective use of C? Maybe I'm missing something, but that seems to immediately be a contradiction.


Yeah, you're right. On its homepage it's labelled as "fast as C extensions" and "often faster than C implementations". I guess because C implementations often use inferior data structures -- I used to say "algorithm is better than assembly".


Haha, his web directory layout cracked me up.

After reading, I thought to myself, "Hmm... I wonder what else he has written. I'll navigate upwards a level to https://shachaf.net/w "...

It returns a blank page with the letter "w". That actually made me laugh out loud.

"Ah, shucks. Let's go up again."

Returns a non-styled page that reads "Success!".


And then when you click on Success you get this graph (which is a bit sad) - https://shachaf.net/pet.txt


Oh shoot, I didn't even notice that. Looks like he's at a constant Happiness=100 though.


This is weirdly relevant. I just finished implementing a persistent B-tree library in Scheme.[1][2]

I don't know why B-trees aren't used more as a purely-functional data structure. Once they get large enough, they don't move around much when changed; there is no rebalancing except on deletes, and even that only affects the direct path to the deleted node.

[1] https://github.com/ar-nelson/schemepunk#b-trees

[2] https://github.com/ar-nelson/schemepunk/blob/master/btree.sl...


Log-structured-merge trees are that; they are not a functional data structure in memory, but on disk they are.

For those who don't know about this, here's how it works:

Construct a B-tree in memory ('level 0'). When it reaches a certain size, move it to disk (to L1, level 1) in a compact fashion, by building the tree from the leaves up. There is no need to keep spare space in any L1 node because it is a functional data structure and won't be mutated in place. The L0 tree in RAM can now be scrapped and built up from scratch with more insertions. Again, when it exceeds the threshold, the in-memory L0 tree and the on-disk L1 tree from earlier are merged to create a new L1 tree. This goes on until the L1 tree has grown beyond a level-1 threshold size (some k times the L0 size), at which point the L1 tree is pushed into L2. And so on. L0 is in RAM; L1 and the rest are on disk.
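
A toy sketch of that flush-and-merge cycle (Python, with in-memory lists standing in for on-disk trees; the names and thresholds are invented for illustration):

  import bisect

  class TinyLSM:
      L0_LIMIT = 4   # flush the in-memory level at this size

      def __init__(self):
          self.l0 = {}       # in-memory "level 0": key -> value
          self.runs = []     # on-"disk" sorted runs, newest first

      def put(self, key, value):
          self.l0[key] = value
          if len(self.l0) >= self.L0_LIMIT:
              run = sorted(self.l0.items())
              self.l0 = {}
              # Push the new run down, merging with runs of similar size,
              # like carrying a full level into the next one.
              while self.runs and len(self.runs[0]) <= len(run):
                  run = self._merge(run, self.runs.pop(0))
              self.runs.insert(0, run)

      def get(self, key):
          if key in self.l0:
              return self.l0[key]
          for run in self.runs:              # newest runs shadow older ones
              i = bisect.bisect_left(run, (key,))
              if i < len(run) and run[i][0] == key:
                  return run[i][1]
          return None

      @staticmethod
      def _merge(newer, older):
          merged = dict(older)
          merged.update(newer)               # newer entries win on duplicates
          return sorted(merged.items())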

Lookups are more expensive because multiple trees may have to be consulted, but with a judicious use of multiple cores and bloom filters, that cost can be recouped.

This avoidance of in-place mutations works esp. well with SSDs and other copy-on-write systems.


What does the phrase "bloom filter" mean to you?


eh? I'm curious why you would ask that.


I have implemented a HAMT (hash array mapped trie) in Pascal.

It has the same tree structure, with a large number of children per node. But the hash is used as the key, so it needs no balancing; it is just assumed the hash is sufficiently well distributed.
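
For the curious, a toy Python sketch of that idea (the library above is in Pascal; a real HAMT packs children into a bitmap-indexed array instead of a dict, and handles full hash collisions, both skipped here):

  BITS = 5                       # 32-way branching
  MASK = (1 << BITS) - 1

  def hamt_put(node, key, value, shift=0):
      """Insert by consuming the hash 5 bits at a time; no rebalancing needed."""
      if node is None:
          return ("leaf", key, value)
      if node[0] == "leaf":
          _, k, v = node
          if k == key:
              return ("leaf", key, value)
          # Two keys land on the same spot: push both one level deeper.
          # (Assumes their hashes are not fully identical.)
          branch = hamt_put(("branch", {}), k, v, shift)
          return hamt_put(branch, key, value, shift)
      _, children = node
      idx = (hash(key) >> shift) & MASK
      new_children = dict(children)
      new_children[idx] = hamt_put(children.get(idx), key, value, shift + BITS)
      return ("branch", new_children)

  def hamt_get(node, key, shift=0):
      while node is not None:
          if node[0] == "leaf":
              return node[2] if node[1] == key else None
          node = node[1].get((hash(key) >> shift) & MASK)
          shift += BITS
      return None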


I like the concept of a persistent B-tree; can you share more about the storage mechanism?


I guess I should have disambiguated my usage of "persistent". In this case it refers to a purely-functional data structure[1], not a data structure that is persisted to disk. Inserting an item into a persistent B-tree produces a new tree, while leaving the old one intact. Ideally, this is done with minimal copying; unchanged subtrees are just pointers to the same subtrees in the old tree.
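
The sharing trick is easiest to see with a plain binary search tree; a persistent B-tree does the same thing, just copying one wider node per level. A minimal sketch (not the schemepunk code):

  from collections import namedtuple

  Node = namedtuple("Node", "key value left right")   # immutable node

  def insert(node, key, value):
      """Return a NEW tree containing (key, value); the old tree is untouched.
      Only nodes on the root-to-insertion path are copied; every other
      subtree is shared with the old version."""
      if node is None:
          return Node(key, value, None, None)
      if key < node.key:
          return node._replace(left=insert(node.left, key, value))
      if key > node.key:
          return node._replace(right=insert(node.right, key, value))
      return node._replace(value=value)

  v1 = None
  for k in [5, 2, 8, 1]:
      v1 = insert(v1, k, str(k))
  v2 = insert(v1, 3, "3")        # v2 has key 3, v1 still doesn't
  assert v2.right is v1.right    # the untouched right subtree is shared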

This is very similar to persistent hash tries, which are used by purely-functional languages like Haskell. But I designed this B-tree library as an implementation of SRFI 146, which uses a comparison function, not a hash function, to create a key-value mapping.

If you're interested in B-trees being persisted to disk, that's how most databases and most filesystems already work.

[1] https://www.geeksforgeeks.org/persistent-data-structures/


Persistent B-trees do seem like they'd work fine. You will have to copy whole blocks that get modified, but I don't think that should hurt much. I'm fairly curious how benchmarks will look compared to the usual choices (Haskell has what, finger trees and radix trees or something? And red-black or AVL have persistent implementations all over of course.)


The reference implementation of SRFI 146 (persistent mappings) is a red-black tree[1]. I benchmarked my B-tree implementation against it and found that, in almost every R7RS Scheme, constructing small trees (~100 elements) and querying and deletion on large trees (~10,000 elements) was 2-3 times faster than red-black trees. The major difference was constructing large trees, which could be easily 10 times faster.

[1] https://srfi.schemers.org/srfi-146/srfi-146.html


I see; the JavaScript world usually calls a "persistent" data structure an "immutable" data structure.


Yeah, that's the usual terminology in the JS world. Persistence (the naming) comes out of algorithm theory, which IMO has pretty poor naming in general (dynamic programming and cache-oblivious also come to mind as unfortunate names). Not entirely sure why that is.

There are a few "levels" of persistence, just as an aside. Immutability, I think, corresponds to the ~strongest level, but it also means it can be the hardest to achieve. One easier version allows only _querying_ old versions of the data structure, for instance, with no modifications except at the tip. Sometimes that's enough.


There is also a bit of weirdness here in that they aren't really the same thing. For example, see this persistent union-find, which is not immutable but ensures its side effects are safely hidden: https://www.lri.fr/~filliatr/ftp/publis/puf-wml07.pdf


Persistent (in both senses of the word) B-trees are used extensively in CouchDB. You can read about their design here:

https://guide.couchdb.org/draft/btree.html


I know quite a number of 'production grade' databases use B-trees internally, even MySQL. I'm more interested in a tiny implementation that just maps the tree onto the hard disk, without other complex features.


Interesting. :-)

I recently posted a brief document I wrote about a similar structure, immutable AVL trees. I find AVL trees a very nifty structure, and I think they don't get as much credit as they deserve: O(log n) insertion, lookup, deletion, and even append (for position-based trees, with some assumptions), where log n is in practice about the same as "constant" for most values of n one encounters. Incredibly simple implementation. And, if immutable (based on shared memory), they allow trivial snapshotting, with multiple "concurrent" trees sharing most of their memory.

Anyhow, it's here: https://github.com/alefore/weblog/blob/master/immutable-avl-...

I have a small implementation here (which I use, among other things, to hold all lines in a file, in my text editor): https://github.com/alefore/edge/blob/master/src/const_tree.h


All node-based trees with a small branching factor of 2-3 have the same performance problems on modern hardware, irrespective of the theoretical big-O algorithmic complexity.

They all have poor memory locality, high overheads, and high indirection. This is kryptonite for typical x86 or ARM CPUs. Unless you're developing only for some tiny embedded CPU with no cache, no branch predictor, and a low clock frequency, such algorithms belong in the dustbin of history.

Practically always, you get better actual performance when using array-based structures such as hashtables, B-Trees, or the like.

Features like snapshotting can be implemented using virtual memory tricks, but that is a gimmick rarely used outside of pure functional languages. You might find that simply copying an array when needed for a snapshot is faster than a fancy tree with sharing as a native capability. The exception would be certain dynamic programming scenarios where cloning is very common.


The traditional ZFS implementation(s) use AVL trees as the in-memory data structure for allocating space (i.e., finding free disk space).

However the performance limitations of them have started to show, so there's been work done to switch to B-trees[1].

[1]: https://www.youtube.com/watch?v=LZpaTGNvalE


Thank you for your reply. Interesting points.

I guess I'll implement an analogous immutable B-Tree structure and run some benchmarks. I suspect you're probably right. I'll probably experiment with different branching factors and tree sizes.

I rely a lot on the trivial (i.e., zero cost) snapshotting for feeding work to background threads. For my workload, having to do deep copies constantly would be prohibitively expensive (and I'd rather not deal with the complexity of explicit locking). That said, I'm now curious to see whether using immutable B-Trees will yield significantly better performance. I suspect they likely will. Exciting. :-)

Thanks again!


I don't know about you, but when I started learning data structures, my professor told me AVL trees lost the popularity battle with red-black trees, and people tended to use red-black trees more. He didn't mention AVL trees beyond a footnote. Nowadays even red-black trees aren't favored, due to practical concerns like caches, so I assume AVL trees are very niche.


> Nowadays even red-black trees aren't favored due to practical concerns like caches

Not sure that's true. In addition to the main Linux trunk using them, Java 8 also adopted them as an improvement to its HashMap (I found this by googling), so I don't think they're out of favor.

Edit: language.


There are cases where you need them, but it's relatively rare, they definitely shouldn't be your first choice.

I'm fuzzy on all the current Linux use cases, but do remember one of the main users of rb trees was CFS. It's a neat scheduling algorithm.

Java 8 specifically: HashMap is implemented with linked-list chaining. This is already not very performant, but you can only do so much with Java being a reference-heavy language. RB trees are used if a bucket chain grows excessively long - so it's addressing an edge case, speeding up some worst-case scenarios.

Dense hash tables based on open addressing outperform bucketed chaining. Look also at abseil's swiss table or folly's f14 if you want to see how they've further advanced to take advantage of the hardware.

In general: flat, dense, linear structures are king for performance.
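
To make the contrast concrete, here is a minimal linear-probing (open addressing) map in Python -- illustrative only; real tables like abseil's add control bytes, SIMD probing, and deletion handling:

  class LinearProbingMap:
      """Keys and values live in flat arrays; a lookup probes consecutive
      slots instead of chasing per-entry node pointers."""
      EMPTY = object()

      def __init__(self, capacity=16):
          self._keys = [self.EMPTY] * capacity
          self._vals = [None] * capacity
          self._size = 0

      def _slot(self, key):
          i = hash(key) % len(self._keys)
          while self._keys[i] is not self.EMPTY and self._keys[i] != key:
              i = (i + 1) % len(self._keys)    # linear probe
          return i

      def put(self, key, value):
          if (self._size + 1) * 4 > len(self._keys) * 3:   # load factor <= 0.75
              self._grow()
          i = self._slot(key)
          if self._keys[i] is self.EMPTY:
              self._size += 1
          self._keys[i] = key
          self._vals[i] = value

      def get(self, key, default=None):
          i = self._slot(key)
          return default if self._keys[i] is self.EMPTY else self._vals[i]

      def _grow(self):
          items = [(k, v) for k, v in zip(self._keys, self._vals)
                   if k is not self.EMPTY]
          self._keys = [self.EMPTY] * (2 * len(self._keys))
          self._vals = [None] * len(self._keys)
          self._size = 0
          for k, v in items:
              self.put(k, v)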


>you can only do so much with Java being a reference heavy language.

This is definitely not true; implementing an array-based hashtable is trivial, and it outperforms java.util.HashMap in most cases. It's just that the decision to have LinkedHashMap (in Java 1.2) extend HashMap crippled the latter. The issue has been discussed quite a few times on the Java core mailing list.


I agree, HashMap is unnecessarily hamstrung because the API requirements push it into a bucket chained implementation (C++ also made this mistake and is one of the downsides of std::unordered_map). And you can definitely write a faster implementation with an array based hashtable.

But I still stand by "you can only do so much being a reference heavy language." Unless you stick to purely primitive types, implement the hash table off heap, or Project Valhalla bears fruit, it's hard to get the data layout you'd want for a really good implementation. So I agree it can be better, but it's going to be hard to get to best - hence my comment.


I do agree data layout is hard - pretty much direct byte buffers (off heap) and some poor man's memory manager (plus serialization/deserialization code).

Project Valhalla and structs are something people have been asking for since Java 1.4 or so. Still nowhere near. And indeed, "best" would take a custom implementation, case by case. It's doable, but usually very far from pretty.


I was not commenting on whether one should use it or not, just that it is still used and picked as recently as Java 8. We are in agreement for the rest of your post. :+1:


Ah! I'm sorry, I had misinterpreted your original comment.


The improvement to HashMap applies only to high-collision nodes whose keys also implement Comparable. It's a very special case. java.util.TreeMap has always been a red/black tree.

On a flip note: HashMap in Java is sort of a jack of all trades. It's node-based, which means it has poor memory/caching characteristics - around 36 bytes per standard node (compared to ~10 for an array-backed one) - and Nodes+array tend to be top-3 objects in heap dumps. The iteration order is sort of 'random' (unlike LinkedHashMap), which has been a source of lots of issues that don't manifest during testing. However, it doesn't degrade in virtually any use case. Normally, I don't use it - either a custom one (CompactHashMap), LinkedHashMap (the go-to hashtable), or ConcurrentHashMap.


Thanks for sharing! I don’t remember C++ well at all anymore, but it’s cool to see an immutable AVL tree in it since it seems like it would be less common. I wrote one in F# a while back for fun/curiosity since I had no idea how it would be done at the time. Mine was purely key based though. The indexing idea is neat too!


I published a reference implementation of B-trees that is optimized for clarity and conciseness: https://www.nayuki.io/res/btree-set/btreeset.py ; https://www.nayuki.io/page/btree-set


You can follow along and build a toy B+-tree-based SQLite clone in this (incomplete) tutorial.

https://cstack.github.io/db_tutorial/


I like this website; it's simple and to the point. And it works without JavaScript :)


Does anyone here know if there’s a data structure more ideal (i.e. lower-overhead) than a B+-tree for holding onto large amounts of on-disk data, if you can constrain the operations you’re going to do? Personally, in my workload, I know I’m never going to delete from the tree, only insert and query. Does this “get me” anything?

For another consideration: B+-trees (or trees generally) aren’t necessarily the best when the workload is mostly appends to the end of the keyspace. Is there an on-disk data structure that is tuned for mostly-append, rarely-insert workloads? Or how about for insert workloads, where each insert is actually a batch of inserts of a large contiguous range of sorted keys? (I’ve observed that this workload makes LevelDB fall over, so LSM trees aren’t it.)


I've personally used a B+-tree variant in this case with special-cased insert rules. Rather than splitting nodes when they're full, if they fill up due to appending at the right-hand side, we simply start the next leaf (or next interior + leaf pair) during insert. So, for pure in-order insertions, we end up with a maximally efficient tree with no splits needed; the rest of the logic still works if you then go back and need to update or insert into the middle of the tree at a later date.

I'm sure I'm not the only one with that particular implementation, but I'm not aware of any named algorithm/variant for it.
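
A tiny sketch of that insert rule (hypothetical, showing just the leaf level; not the parent poster's code). In-order inserts never split, they just open a new rightmost leaf, so a sequential load packs every leaf full:

  import bisect

  LEAF_CAPACITY = 4

  def insert_key(leaves, key):
      """leaves: list of sorted key lists, left to right (interior nodes omitted)."""
      last = leaves[-1]
      if not last or key >= last[-1]:
          # Appending at the right-hand edge: never split, just start the
          # next leaf when the rightmost one is full.
          if len(last) < LEAF_CAPACITY:
              last.append(key)
          else:
              leaves.append([key])
          return
      # Out-of-order insert: normal B+-tree rule -- find the leaf, insert,
      # split in half if it overflows.
      i = next(i for i, leaf in enumerate(leaves) if key <= leaf[-1])
      leaf = leaves[i]
      bisect.insort(leaf, key)
      if len(leaf) > LEAF_CAPACITY:
          half = len(leaf) // 2
          leaves[i:i + 1] = [leaf[:half], leaf[half:]]

  leaves = [[]]
  for k in range(1, 10):
      insert_key(leaves, k)
  assert leaves == [[1, 2, 3, 4], [5, 6, 7, 8], [9]]   # full leaves, no splits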


I like this approach to describing B-trees! I recently taught B-trees for a data structures course and presented them using a similar method. Thought I'd leave the link in case it was useful: http://web.stanford.edu/class/cs166/lectures/05/Slides05.pdf


Really wonderful presentation, thank you! Especially the way the "animations" (slides with incremental changes, actually) convey the dynamics of the sequence of events being explained.


I'm somewhat ignorant of the theory, but when I ran into this I just kept a separate unsorted array of to-be-added values. In moments of idleness the extra array is sorted, and if time allows, the whole thing is merged into the original.

Also a fun approach for an array of sorted numbers is to use bits and have the offset represent the value.
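
The bit trick in that last sentence, sketched in Python (an integer used as a bitmap; works when the values are smallish non-negative ints):

  def bitset_add(bits, n):
      return bits | (1 << n)        # the bit's offset *is* the value

  def bitset_sorted_values(bits):
      out, i = [], 0
      while bits:                    # scanning bits low-to-high yields sorted order
          if bits & 1:
              out.append(i)
          bits >>= 1
          i += 1
      return out

  s = 0
  for v in [9, 2, 7, 2]:             # duplicates are free
      s = bitset_add(s, v)
  assert bitset_sorted_values(s) == [2, 7, 9]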


Very... short article. Anyway, my favorite kind of tree is the R-tree, if you want to go down the research rabbit hole.


That is like a German pun

tree is Baum

R tree could be called Raum.

Which means space, which fits, because the tree stores the keys spatially.


I like it!


What a brilliant one-page summary of a complicated topic. Well done, a pleasure reading it. Bookmarked.


While implementing, I found that there is a large number of disk hops (and hence greater time required to read), and accessing data is slow. Only a chunk of data fits in a disk block (remember, there is a predefined size limit for blocks).

How to get around this?


B+-trees degrade quite fast with the slightest hint of randomness in the inserted data.

To get around this unfortunate property, people invented log-structured merge trees - you keep the new part of your data in a structure that is easy to insert into randomly, and once in a while merge it (as sorted data!) with the main data. This way degradation is greatly reduced, at the expense of read speed when several pieces have to be consulted before a merge.

BTW, LSM trees are a variant of the logarithmic method - a way to create dynamic structures (fast to insert and query) from static ones (slow to insert, fast to query). Even sorted arrays can be made fast for insertion this way.

Anyway, try to keep your insertion order as sorted as possible. Delay the actual insertion into the on-disk data if needed.


Could you elaborate on the "logarithmic method"? The fastest technique I know of to insert into sorted arrays uses a square-root decomposition.


You can split a collection of 2^(k-1) <= N < 2^k elements into up to k arrays of length 2^i. Insertion into an empty slot of capacity one is easy; inserting when there is already a single-element array merges the two into an array of two elements. If you already have an array of two elements, you merge everything into an array of four elements, and so on.

The merge operation is fast and cache oblivious. Thus, insertion into a set of these arrays is fast.

Queries are not as fast - O(log^2 N). But I can direct you to the Cache-Oblivious Lookahead Arrays (COLA) paper, which describes a technique to get lookups back down to O(log N).

Or you can simply merge these arrays into one at the end of a bulk insertion.

Source code (not mine): https://github.com/giannitedesco/cola
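
A compact Python sketch of that carry-and-merge idea (illustrative, not the linked code):

  import bisect

  class LogMethodSet:
      """Logarithmic method: sorted arrays whose sizes are distinct powers of
      two.  Inserting works like binary addition with carries: a new
      one-element array is merged with equal-sized arrays until it finds an
      empty slot."""

      def __init__(self):
          self.arrays = []          # arrays[i] is [] or a sorted list of length 2**i

      def insert(self, x):
          carry = [x]
          for i, arr in enumerate(self.arrays):
              if not arr:
                  self.arrays[i] = carry
                  return
              carry = self._merge(arr, carry)   # two size-2**i arrays -> one of 2**(i+1)
              self.arrays[i] = []
          self.arrays.append(carry)

      def __contains__(self, x):
          # Binary-search every non-empty level: O(log^2 N) overall.
          for arr in self.arrays:
              i = bisect.bisect_left(arr, x)
              if i < len(arr) and arr[i] == x:
                  return True
          return False

      @staticmethod
      def _merge(a, b):
          out, i, j = [], 0, 0
          while i < len(a) and j < len(b):
              if a[i] <= b[j]:
                  out.append(a[i]); i += 1
              else:
                  out.append(b[j]); j += 1
          return out + a[i:] + b[j:]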


Interval-halving? Is that a thing with trees?


I understand that B+ trees made relational databases practical.

Is that true? How are they used (sorted, according to the link)? How do B+ trees differ from B-trees?


They have some things that a relational database likes:

+ Fast search (O(log N))

+ Insert/delete are fast (O(log N) ... most of the time)

+ Iteration is fast (unlike a hash map or similar)

You could have a look at [0] for a deeper dive.

[0] https://cstack.github.io/db_tutorial/parts/part7.html


Thanks, sounds like sorting is an implementation detail, and not the main benefit in itself.


Imagine you're doing:

SELECT * FROM EMPLOYEE, DEPARTMENT WHERE EMPLOYEE.DEPARTMENTID = DEPARTMENT.ID

The naive thing would be to iterate over the employees and, for each one, read the corresponding department. However, that would mean one disk seek per employee.

Disk seeks on the rotational disks relational DBs were developed for took hundreds of ms to complete. Even today's rotational disks have seeks that take dozens of ms. So doing one seek per join result doesn't work at all. You can add caching to disguise this to some extent, but that costs memory.

Compare to B-trees: you build an index on EMPLOYEE.DEPARTMENTID. That index is also a B-tree. Because both are sorted, you can just walk the employee index and the department table in parallel by id. You only need to do n/k seeks (where k is the branching factor of the B-tree), and you only need one record's worth of caching per table (the one under the current cursor).

Bonus: because the indices are sorted, it's easy to pipeline each seek after the previous one to further reduce the latency of the query.
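
That parallel walk, sketched in Python (with in-memory lists standing in for the on-disk B-trees; the names are made up for illustration):

  def merge_join(emp_index, departments):
      """emp_index:   (department_id, employee_id) pairs, sorted -- the B-tree
                      index on EMPLOYEE.DEPARTMENTID.
         departments: (department_id, dept_row) pairs, sorted by id -- the
                      DEPARTMENT table, which a B-tree also keeps in key order.
         Both inputs are walked once, in order, so seeks stay sequential."""
      results = []
      d = 0
      for dept_id, emp_id in emp_index:
          while d < len(departments) and departments[d][0] < dept_id:
              d += 1                      # advance the department cursor
          if d < len(departments) and departments[d][0] == dept_id:
              results.append((emp_id, departments[d][1]))
      return results

  emp_index = [(1, 101), (1, 104), (2, 102), (3, 103)]
  departments = [(1, "Sales"), (2, "Eng"), (3, "Ops")]
  assert merge_join(emp_index, departments) == [
      (101, "Sales"), (104, "Sales"), (102, "Eng"), (103, "Ops")]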


Thanks, makes sense! I've seen before that comparing sorted values is faster, even when not all of them match.

If "building an index" means it's just the indices (not all rows of the table), doesn't that indirection mean an extra seek to get the full row? (I probably have a fundamental misunderstanding here)


You're right!

But you can sort the second column of the index too. So in our case, the index would be sorted by dept id, then employee id.

  deptid  empid
  1       1
  1       3
  1       7
  1       10
  ...
  2       4
  2       5
  2       9
  ...
  3       2
  3       6
  3       8
Now if we just do the basic thing and run through the index in order, we'll end up doing m sorted scans of the employee table, where m is the number of unique department ids. Not ideal, but if departments are large, still far better than seeking each employee individually.

But we can do better: If there are relatively few departments, the database can put a cursor at the beginning of each run of them:

  deptid  empid
  1       1    <- cursor
  1       3
  1       7
  1       10
  2       4    <- cursor
  2       5
  2       9
  3       2    <- cursor
  3       6
  3       8
Now the database can scan those cursors in parallel, merge sort them, and feed the result into scanning the employee id table. Now we again have a single sorted scan of the employee table, at the cost of m extra memory.

This isn't a general solution. If there are zillions of departments and each one has only a handful of employees, this doesn't work. And in modern databases low-cardinality indexes like our dept_emp_id index sometimes use more specialized data structures.

One of the beautiful/crazy things about working on databases is that it's a product that promises an abstraction which can't actually be implemented perfectly in all cases. Databases use all kinds of heuristics internally to decide between different strategies for answering queries. And vendors are constantly refining, trying to get closer to an ideal they can never actually reach.

But this particular strategy is a common and important one and an example of why B-trees were so important to early relational systems.
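
To make the cursor-merge step above concrete, a tiny sketch in Python (heapq.merge plays the role of the parallel cursors; a real join would carry the department id along with each employee id):

  import heapq

  # One "cursor" per department: that department's run of employee ids,
  # already sorted within the run, as in the index above.
  runs_by_dept = {1: [1, 3, 7, 10], 2: [4, 5, 9], 3: [2, 6, 8]}

  # Merging the runs gives a single sorted pass over the employee table.
  scan_order = list(heapq.merge(*runs_by_dept.values()))
  assert scan_order == [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]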


I love B-trees. Probably my favourite data structure. But do yourself a favour and read Knuth's treatment of them.


I love the idea but I'd love to see someone implement this and then show performance and memory comparisons to trees and normal arrays.


Faint memories of MUMPS...


Very intuitive explanation.


i don’t get it. what’s next, visual bubble sort?


Uhh ... why is this the top article on HN? I know all about B-trees, and I do not understand what is new or interesting in this very brief page.


Your knowledge isn't mine -- I found this to be quite a condensed way to understand something I've little relation to.

What do you have to contribute that teaches rather than criticises?


The exposition is tight.






