In the 8 years since this was written, Log-Structured Merge Trees (the concrete realization of this idea) have basically "won". BigTable, AppEngine, LevelDB, Cassandra, HBase, MongoDB, and several others are all built around them.
There's a powerful hardware trend driving this, namely that disk capacities and write bandwidth are still increasing rapidly, but seek times have basically plateaued. That means that data structures that rely on append-only operations can continue to scale to take advantage of bigger disks, but data structures that rely on disk seeks (e.g. B-trees) have hit a bottleneck. Also, as the number of cores continues to increase, playback & processing from a sequential log can often be parallelized, but updating your on-disk indexes blocks on I/O.
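To make the append-only vs. seek-based distinction concrete, here's a minimal sketch (a hypothetical file layout, not any particular engine's format): the log only ever writes at the tail of the file, while an in-place structure has to seek to the record's slot for every update.

```python
import os

# Append-only log: every write goes to the tail of the file (pure sequential I/O).
def log_append(f, record: bytes) -> None:
    f.seek(0, os.SEEK_END)   # always the same place: the end of the log
    f.write(record)

# In-place update (B-tree style): every write first seeks to the record's slot,
# and that per-update seek is the part that has stopped getting faster.
def update_in_place(f, offset: int, record: bytes) -> None:
    f.seek(offset)
    f.write(record)

with open("store.db", "w+b") as f:
    log_append(f, b"key1=value1\n")
    update_in_place(f, 0, b"key1=value2\n")
```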
SSDs have their own distinct access patterns - they give you (relatively) fast seek times and there's no particular penalty for random access vs. sequential usage, but you really want to avoid writes, both to preserve the lifetime of the drive and because mixing reads & writes on an SSD lowers performance.
In practice (and I haven't worked in a big corp for a couple years now), I don't see SSDs replacing mechanical disks. Rather, they're being used for different functions. Disks are becoming a data warehouse; they're the new tape, but one that systems can write directly to, without needing scheduled backups. Operational serving is moving toward SSDs and RAM, with a periodic processing step that pulls data off disks and builds a pre-made shard that gets written all at once to the SSD. LSM trees are well-adapted to the data-warehousing part of this, which is why you continue to see big uptake of BigTable/Cassandra/Mongo. On the SSD end...most people are using custom algorithms & data structures, AFAIK, but you're seeing a resurgence of data structures like perfect hashes, bloom filters, and plain old sorted arrays, all of which have great read performance but can't be incrementally updated without rebuilding the whole data structure.
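As a rough illustration of that read-optimized family (a toy Bloom filter with made-up parameters, not any particular system's implementation): membership checks are cheap and the structure is compact, but growing it means rebuilding from the full key set, which is exactly the "build a shard and write it all at once" pattern.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: compact, fast membership checks, no in-place growth."""
    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Reads are cheap, but growing the filter means rebuilding it from the full
# key set -- i.e. the periodic "pre-made shard written all at once" step.
bf = BloomFilter()
bf.add("user:42")
assert bf.might_contain("user:42")
```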
Do you think you could point me towards a source for "mixing reads & writes with a SSD lowers performance"? This seems reasonable, but I've never actually heard it before, and a cursory googling didn't turn much up. I have heard before that small, random writes on SSDs are much more expensive than sequential writes because random writes can produce write amplification and increase fragmentation. It was my impression that SSDs stand to benefit significantly from log structured filesystems despite the mixed read/write load, but I don't follow it that closely and could be wrong.
In case anyone else is interested, in addition to the above video, this [1] paper goes into some of the details, including this explanation:
"Reads and writes on SSDs can interfere with each other for many reasons. (1) Both operations share many critical resources, such as the ECC engines and the lock-protected mapping table, etc. Parallel jobs accessing such resources need to be serialized. (2) Both writes and reads can generate background operations internally, such as readahead and asynchronous write-back [6]. (3) Mingled reads and writes can foil certain internal optimizations. For example, flash memory chips often provide a cache mode [2] to pipeline a sequence of reads or writes. A flash memory plane has two registers, data register and cache register. When handling a sequence of reads or writes, data can be transferred between the cache register and the controller, while concurrently moving another page between the flash medium and the data register. However, such pipelined operations must be performed in one direction, so mingling reads and writes would interrupt the pipelining."
Also relevant to the "seek" conversation below: it turns out many SSDs have read-ahead caches built in, so sequential reads are much faster than random reads after an initial warm-up, just like hard disks (the difference, however, is closer to ~5x than the ~100x you see in HDDs).
> there's no particular penalty for random-access vs. sequential usage
This is not really accurate; random access has a penalty compared to sequential access (that's just the physics of it, folks™). The distinguishing characteristic is that SSDs have more parallelism (channels × banks) than e.g. RAM, which otherwise behaves similarly with respect to access patterns. Thus, SSDs achieve close to sequential throughput given a large number of independent random reads (i.e. high queue depth).
This is not a nitpick - the fact that SSDs do not have seek times is an important distinction when discussing storage engines.
You might want to read the following by Jay Kreps, one of the authors of Kafka, Voldemort and Samza:
"SSDs and Distributed Data Systems"
>"On traditional hard drives the trade-off between in-place and log-structured storage is a bit of a wash. Write performance for log-structured storage is vastly better, but most applications have more reads than writes. Read performance may be better or worse depending on the details of the implementation (a traditional log-structured merge tree is definitely worse for reads since it does multiple seeks for each uncached read, but hashing variants or SSTables that use bloom filters to avoid unnecessary lookups need not be). However the move to SSDs completely changes this dynamic. Since SSDs have fast seek performance grouping data by key is much less important.
...
Here are the facts you need to know about SSDs:
They eliminate “seek time” and consequentially make random reads very, very fast"[1]
There is "seek time" in SSDs, just as there is "seek time" in RAM. It's just orders of magnitude faster.
Some of the principles remain the same across disk, SSD and RAM. Sequential access is usually faster than random ("seek") access.
Seek time from a storage point of view just means the lookup latency - the cost of jumping to a random source of data (an area on disk or in memory or in another datacenter). You could think of a hash table lookup as "seek time". It could be attributed to an L1/L2 cache reference, main memory reference, physical disk seek or network roundtrip etc.
Just because the concept of "seek time" originated with spinning rust does not mean it must remain there.
No, there is no seek time in SSDs at all; there is simply a constant latency. And no, RAM does not have seek times either. The term "seek" is not some loose and fungible term that can be applied to different subsystems.
A seek is an agreed upon industry concept. A seek is composed of the following components[1]:
• start-up: the arm is accelerated, typically with constant torque (force) until it reaches half of the seek distance, or a fixed maximum velocity;
• coast: for a long seek, the arm will be allowed to coast across most of the distance at its maximum speed;
• slowdown: the arm is brought to rest, hopefully exactly on top of the desired track;
• settle: the disk controller uses servo positioning information to fine-tune the position of the head over the desired track on the selected surface.
Additionally, a write seek takes slightly longer than a read seek on a rotating disk because of the settling described above. This has no analogue in SSDs.
Rotating disks also have something known as average latency, which is the time it takes for the sector being accessed to rotate into position under a read/write head. The reason I bring this up is that seek and latency are two distinct measurements on mechanical disks, and it is the latency measurement that could be applied to SSDs, since it is a constant; on a mechanical disk it is gated by the rotational speed (RPM), but it is still a constant.
And the reason this all matters is that in order to calculate the average IOPS of a mechanical disk you use the following formula: divide 1 by the sum of the average latency and the average seek time (both expressed in seconds), i.e. IOPS = 1 / (average latency + average seek time).
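As a worked example with illustrative numbers for a 7200 RPM drive (typical figures, not measurements of any specific model):

```python
# Worked example with illustrative numbers for a 7200 RPM drive
# (typical figures, not measurements of any specific model).
rotation_ms = 60_000 / 7200          # one full revolution: ~8.33 ms
avg_latency_ms = rotation_ms / 2     # average rotational latency: ~4.17 ms
avg_seek_ms = 9.0                    # a commonly quoted average seek time

iops = 1000 / (avg_latency_ms + avg_seek_ms)
print(f"~{iops:.0f} IOPS")           # ~76 random IOPS
```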
This formula is obviously meaningless in the context of SSDs.
And average seek time is also well understood to mean:
The average of all possible seek times, which technically is the total time to perform all possible seeks divided by the number of possible seeks, but in practice it is determined by statistical methods or simply approximated as the time of a seek across one-third of the tracks. This approximation goes back to IBM and the Winchester drives, I believe.
So no, in SSDs there is no such thing as a seek. There is simply a latency, and that latency is a constant.
Log structured storage only becomes more ubiquitous with SSDs, since this is what they're all doing under the hood of their flash translation layer. The eventual long-term strategy might be open channel SSDs that allow the host system to provide the flash translation layer so that there aren't two log-structured filesystems on top of each other.
My layman's guess is something that combines the two: larger immutable storage using B-trees (or similar), and more live data using log-structured storage, with semi-regular promotion of data from log to B-tree and eventual clean-up of data no longer referenceable in the B-tree.
Well, there is already a lot of thought that goes into LSM trees on flash storage - check out RocksDB [0] for example, which goes to great lengths to let the user manage read/write amplification, a problem that is especially acute on flash storage. So I don't think that more SSD adoption will fundamentally change anything.
The tradeoff from LSM tree to B-tree, very generally speaking, is more about access patterns: LSM trees lend themselves to insert-heavy workloads because the structure is conceptually just a big array that you very quickly append stuff to the end of without checking the rest of the array. That's the magic of why it's so fast for insertions - there's no overhead. You just ignore the older key/values. When doing a read, you read backwards from the end, reading only the newest values. When you fill your memory budget, you flush your array to a lower layer, removing the duplicate old values. I encourage everyone to read the paper - it's not that arcane, just skip the difficult bits [1]. (nb: the original uses trees at each layer, but people have moved away from that for various reasons of changing hardware, scale, use cases)
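To make that concrete, here's a very stripped-down sketch of the behaviour described above (in-memory only, no SSTables, compaction, or bloom filters; the names are mine):

```python
class ToyLSM:
    """Very rough sketch of the append/flush behaviour described above:
    no persistence, compaction, or bloom filters, just the core idea."""

    def __init__(self, memtable_limit: int = 4):
        self.memtable = []        # newest entries at the end
        self.levels = []          # flushed, de-duplicated levels, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Insertion is just an append; older versions of the key are ignored.
        self.memtable.append((key, value))
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        # Read backwards: the newest write for a key wins.
        for k, v in reversed(self.memtable):
            if k == key:
                return v
        for level in self.levels:
            if key in level:
                return level[key]
        return None

    def _flush(self):
        # dict() keeps the last value written for each key, dropping old duplicates.
        self.levels.insert(0, dict(self.memtable))
        self.memtable = []

db = ToyLSM()
db.put("a", 1); db.put("a", 2); db.put("b", 3); db.put("c", 4)  # 4th put triggers a flush
assert db.get("a") == 2
```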
Another point is that oftentimes these data structures span storage layers (or the "cache hierarchy" as database systems people like calling it) - e.g. you could have an LSM tree that has a top layer fitting in L3, then a bunch more in memory, then the majority of it spilling over to disk. Another example is the Bw-Tree [2], which introduces a mapping table that is a storage-agnostic lookup table that tells you wherever a record is, disk or memory or otherwise, and is smart about paging stuff in and out of memory based on hotness. So is this really one "data structure"? Is it a combo of many? Could we change each layer based on the storage type? It gets confusing.
When you squint, a lot of data structures start looking like each other, except for certain details that form their fundamental characteristics. Figure 4 in the Monkey paper [3] is one of my favourite figures ever - their observation is that there is a tradeoff continuum between update and lookup costs, and on one end of that continuum are logs (very easy to append to, hard to find things in without searching the whole thing), while the other end is a sorted array (very easy to find things in with binary search, very hard to append to). This is already pretty cool, but then in the middle are tiered and leveled LSM trees, and you can actually tweak the degree of tiering and leveling to trade off between the two. Then when you start thinking about how big you want your levels to be, how you want to connect them, etc., you eventually arrive at binary trees, B-trees, etc. It's sort of magical how related all the storage structures are.
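Here's a toy illustration of the two endpoints of that continuum (nothing from the paper, just the asymptotics): a log appends in constant time but answers lookups by scanning, while a sorted array answers lookups by binary search but pays a shift on every insert.

```python
import bisect

class Log:
    """Append-optimized endpoint: O(1) writes, O(n) lookups."""
    def __init__(self):
        self.entries = []
    def put(self, key, value):
        self.entries.append((key, value))       # just append
    def get(self, key):
        for k, v in reversed(self.entries):     # scan, newest first
            if k == key:
                return v
        return None

class SortedArray:
    """Read-optimized endpoint: O(log n) lookups, O(n) inserts."""
    def __init__(self):
        self.keys, self.values = [], []
    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)  # shift everything after i
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)
    def get(self, key):
        i = bisect.bisect_left(self.keys, key)  # binary search
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

log, arr = Log(), SortedArray()
log.put("k", 1); arr.put("k", 1)
assert log.get("k") == arr.get("k") == 1
```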
MongoDB does not support LSM in any of the production releases of WiredTiger yet (or in MMAPv1, to the best of my knowledge). There is support via MongoRocks (which uses RocksDB via Mongo's swappable engines), but that is Facebook's fork of LevelDB.
I am impressed by the writing style. Very clear and delivers a solid explanation.
What relation, if any, would this type of system have to "persistent data structures", a term I have seen while browsing functional programming topics? Is this somewhat like a persistent data structure until old parts are overwritten ("garbage collected"?)? Is there a flavor of persistent data structure similar to this?
We were looking for ways to implement a key-value store on a distributed log, and we actually started from a very basic persistent red-black tree, adapting it along the way for log-structured storage. The technique worked really well: http://noahdesu.github.io/2016/08/02/zlog-kvstore-intro.html
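As a rough sketch of why persistent structures map so naturally onto a log (a toy path-copying BST, much simpler than the red-black tree in the linked post): an insert copies only the root-to-leaf path, shares every other node with the previous version, and never mutates an existing node, so nodes could just as well be immutable records appended to a log.

```python
from typing import Optional

class Node:
    """Immutable node: once created it is never modified, so it could just as
    well be a record appended to a log and referenced by its log position."""
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(root: Optional[Node], key, value) -> Node:
    # Path copying: only the nodes on the root-to-leaf path are new;
    # every other subtree is shared with the previous version.
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)

v1 = insert(None, "b", 1)
v2 = insert(v1, "a", 2)   # v1 is still a complete, valid tree (the "old parts")
```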
Slightly related perhaps, and something I have been curious about for a while: event sourcing seems like a very powerful pattern, but I haven't seen wide adoption. The best documentation seems to be some MS dev library notes and a discussion from M. Fowler.
Are there any open source implementations of a database that uses event sourcing?
I built https://github.com/amark/gun after I was using MongoDB to event source data. The key piece for me was wanting to have Firebase-like realtime sync for my event sourcing. It is MIT/ZLIB/Apache2 licensed.
I've become a bit of a CouchDB zealot of late for this exact reason... Native support for incrementally updating map-reduce combined with change-feeds makes implementing event sourcing straightforward. Oh, and it works with PouchDB or Couchbase Lite on web and mobile platforms! A sibling poster wrote Gun(DB?) which would be the next DB I'd try if CouchDB couldn't fit the bill.
If you wanted to reimplement event sourcing in Couch, it can be done as a versioned merge in a map-reduce view that takes sequentially ordered events (e.g. `ui-event:0000025`) and maps those changes into a "versioned" JSON. This versioned JSON changes leaf values into versioned objects like `"fieldValue": { "_val": 34.00, "_ver": 25 }`, which I call property fields. This is necessary because CouchDB is a B+tree implementation and reduce operations are sequential but not contiguous. The reduce phase merges all events sequentially by choosing the property fields with the latest event – making it a CRDT type – and resulting in a single JSON document with the latest fields, with "_ver" info for each. This scheme has the drawback that each event needs a unique, serially ordered ID to work; I'm not sure how timestamps would work. I chose the ID-based scheme to let the UI know when a user interaction conflicts (a vector clock of sorts) and let it display the conflict to the user or decide how to merge the new state.
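CouchDB views are written in JavaScript, but the merge logic itself is easy to sketch; here's a rough Python rendition of the scheme described above (field and function names are mine, not CouchDB's API):

```python
def map_event(event_id: int, changes: dict) -> dict:
    # Map phase: turn an event's changed fields into "property fields"
    # that carry the version (the event's serial ID) alongside the value.
    return {field: {"_val": value, "_ver": event_id}
            for field, value in changes.items()}

def reduce_versions(mapped_events: list) -> dict:
    # Reduce phase: for each field, keep the property with the highest version.
    doc = {}
    for mapped in mapped_events:
        for field, prop in mapped.items():
            if field not in doc or prop["_ver"] > doc[field]["_ver"]:
                doc[field] = prop
    return doc

events = [map_event(24, {"fieldValue": 33.00}),
          map_event(25, {"fieldValue": 34.00})]
assert reduce_versions(events)["fieldValue"] == {"_val": 34.00, "_ver": 25}
```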
It's not the most efficient, but for most moderate-sized UIs with a single state tree it performs fine. Enough to run on a Raspberry Pi. :)
The only thing I haven't implemented in this is user permissions, but I suspect using something like JWT tokens could provide a way to "verify" a user event on the CouchDB server.
I'd be curious to know if this scheme would perform much better with a log-structured merge database, but there don't seem to be any obvious benefits to me, imho. Alternatively, does anyone else know of other DBs with incremental map-reduce? Postgres materialized views (pre v10) and Mongo map-reduce views both seem to require explicit calls to update them after new events.
[edit: mobile phone grammar fixes]
Another notable product based on log-structured storage is ObjectiveFS (https://objectivefs.com/), which implements a POSIX filesystem on top of Amazon S3 and other object stores. It's proprietary, so I don't know much about how it works. But it claims to be a log-structured filesystem.