On Disk IO, Part 3: LSM Trees

derefr · on Sept 28, 2017

Are LSM trees still relevant under a mostly-NVMe storage architecture? If not, what is used instead?

makmanalp · on Sept 28, 2017

SSDs are still a leaky abstraction, e.g. https://news.ycombinator.com/item?id=15271254

Edit: the other thing is that choosing LSM trees (and other data structures) isn't just about storage, it's about access patterns. With a an LSM-tree, you're optimizing for insertions happening often, at the expense of querying. Insertion With B-trees, you spend more time inserting data into the tree, but it's easier to query later.

High performance data systems do optimize for this tradeoff even at the level of L1/L2/L3 cache and main memory, let alone between main memory and disk (which is way slower). The fact that this opportunity for optimization exists is just an irrefutable result of the cache hierarchy, i.e. the fact that there's faster and slower bits of memory, and the slower bits tend to be closer to the CPU and smaller. And this structure happens mainly because of expense, and the laws of physics.

If you're always inserting stuff, you want the location that you're inserting stuff into to be easily accessible. With LSM trees, you're inserting stuff sequentially, so it's easy to keep that spot in CPU cache. With B-trees, you have to scan all over the tree to insert stuff, meaning stuff is flying in and out of cache as you're locating and inserting things.

jdefarge · on Sept 28, 2017

You are absolutely right. Just an addendum: there are modern B-Trees that are optimised to leverage the cache lines, part of the so called cache oblivious data structures ( http://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/pa... ). A cache oblivious B-Tree is the basis of Percona's TokuDB storage engine: https://www.percona.com/doc/percona-tokudb/ft-index.html

kdkeyser · on Sept 29, 2017

I believe TokuDB is using a flavor of B epsilon trees, i.e. a B tree with, per node, an associated staging area for writes to that node or its child nodes. Keeps inserts localised at the top of the tree, until the staging areas fill up. Then these staged inserts are propagated in batches down the tree.

In my opinion, a nice middle ground between the B tree and LSM approaches.

ewams · on Sept 28, 2017

My understanding is nvme drives are SSD drives, it's just a different interface. It goes over the pcie bus instead of thru sata / sas which have to go thru another processor that introduces overhead. So they are just faster SSD drives because the connection method and communication format is different.

jdefarge · on Sept 28, 2017

By the way, there's an interest research on designing a new file system based on LSM tree that leverages SSD's sequential writing, as it is mainly motivated by SSDs' shortcomings with random writes: https://www.usenix.org/legacy/event/fast/tech/full_papers/Mi...

loeg · on Sept 28, 2017

You are correct about current flash NVMe drives. The main benefit of NVMe over SATA is lower protocol overhead and more queues. However, it is expected that some sort of low-latency non-volatile storage will become available at some point and that will probably only make sense to put behind NVMe.

Intel's Optane is one example, although it isn't quite as fast or as durable as Intel's initial marketing claims.

haneefmubarak · on Sept 28, 2017

Yep! Even on an NVMe architecture, large sequential writes are much more efficient than randomized small writes.

Vogtinator · on Sept 29, 2017

In the first part, the author makes several mistakes including confusing O_DIRECT with DMA. So take this with a grain of salt.

ifesdjeen · on Sept 29, 2017

If you could list the mistakes, the author could try making corrections and improve the text.

As regard O_DIRECT vs DMA, not sure what exactly you're saying. Understanding Linux Kernel book, for example, is using them quite interchangeably as well (4th edition, 2.6 kernel). Block devices are handled through interrupts and DMA, and these are lower-level means for doing IO. Scylla devs are using it quite interchangeably as well: https://github.com/scylladb/seastar/blob/master/core/reactor...

Wording might have been not optimal, but author never implied that O_DIRECT is DMA.

disclaimer: I'm the author.

cryptonector · on Sept 30, 2017

The author confused DMA with mmap(). The confusion is in the second part, when referencing the first. Whatever, it's braino.