I've been working occasionally on a database engine design that takes this idea to heart. It's been hugely beneficial.
The big-picture design is simple to describe: an append-only log of copy-on-write page states, with Kafka-style compaction to maintain whatever window of history you want. This is supplemented with a compact in-memory index that holds the LSNs of the latest page states, to make accessing live data as fast as possible.
The write path goes through a single batch committer thread to amortize fsync overhead (technically two threads in series, so I can overlap prepping the next batch with waiting on fsync). A single writer isn't a big limitation, since even consumer-grade SSDs can write at GiB/s now. An append-only log also means I hit the happy path on SSDs: FTLs have come a long way in equalizing sequential vs. random write performance, but that parity comes at the cost of a lot more write cycles behind the scenes to shuffle data around.
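To make that concrete, here's roughly the shape of the pipeline in Go (type and function names are made up for the sketch, not lifted from my actual code):

    package db

    import "os"

    // Sketch of the two-stage commit pipeline. Stage one drains the submit channel and
    // packs a batch; stage two writes it, fsyncs once, then unblocks the waiting clients.
    type txn struct {
        payload []byte     // serialized page writes for this transaction
        done    chan error // the client blocks on this until the batch is durable
    }

    type batch struct {
        buf     []byte
        waiters []chan error
    }

    func prepLoop(submit <-chan txn, toSync chan<- *batch) {
        for {
            b := &batch{}
            t := <-submit // wait for the first transaction of the batch
            b.buf = append(b.buf, t.payload...)
            b.waiters = append(b.waiters, t.done)
        drain:
            for { // then greedily take whatever else is already queued
                select {
                case t := <-submit:
                    b.buf = append(b.buf, t.payload...)
                    b.waiters = append(b.waiters, t.done)
                default:
                    break drain
                }
            }
            toSync <- b // hand off; the next batch can be packed while this one syncs
        }
    }

    func syncLoop(toSync <-chan *batch, log *os.File) {
        for b := range toSync {
            _, err := log.Write(b.buf)
            if err == nil {
                err = log.Sync() // one fsync amortized across the whole batch
            }
            for _, done := range b.waiters {
                done <- err // clients learn the outcome only after the data is durable
            }
        }
    }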
Here are two examples of how immutability has drastically simplified the design:
First, the in-memory index is just a hand-tweaked version of a Bagwell trie. Since the trie is copy-on-write, once the fsync finishes, the write thread can publish the new index state with a single atomic pointer write. Client threads never see any intermediate state; from their point of view, the database moves atomically from snapshot to snapshot. (This is the part I'm coding up atm.)
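The publish step really is just an atomic pointer swap. A minimal sketch, with indexRoot standing in for the real trie types:

    package db

    import "sync/atomic"

    // indexRoot stands in for the real copy-on-write trie; once built it is never mutated.
    type indexRoot struct {
        // ...trie nodes mapping page IDs to the LSN of their latest state...
    }

    func (r *indexRoot) find(pageID uint64) (lsn uint64, ok bool) {
        return 0, false // placeholder for the real trie lookup
    }

    var liveIndex atomic.Pointer[indexRoot]

    // Run by the committer after fsync succeeds: one atomic store moves every
    // subsequent reader onto the new snapshot.
    func publish(next *indexRoot) {
        liveIndex.Store(next)
    }

    // Run by any client goroutine: whatever root it loads remains valid and immutable
    // for as long as it holds the pointer, regardless of later publishes.
    func lookup(pageID uint64) (lsn uint64, ok bool) {
        return liveIndex.Load().find(pageID)
    }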
A second example: I want an on-heap cache, to avoid overtaxing the SSD and to sidestep the downfalls of a pure mmap caching approach (mmap can still serve as the miss path for this cache, however). I'm planning on using a lock-free hash table, and since I'm working in a language that doesn't provide one as a library, I'll have to roll my own. Ordinarily that would be an insane idea; such structures are infamous for being nearly impossible to get right, even with formal modeling tools. But in my case the relationship LSN -> immutable page state is stable, so the cache can be quite sloppy. With a mutable cache I'd essentially have to ensure every CAS landed in exactly the right place, even under resizing and so on. With immutability, duplicates are only a potential memory waste, not a correctness problem, so I get to be far sloppier.
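To illustrate why the sloppiness is safe, here's a loose sketch: a direct-mapped array of slots rather than the real lock-free hash table, with invented names, but the key property is the same:

    package db

    import "sync/atomic"

    type cacheEntry struct {
        lsn  uint64
        page []byte // immutable page state
    }

    type sloppyCache struct {
        slots []atomic.Pointer[cacheEntry]
    }

    func newSloppyCache(n int) *sloppyCache {
        return &sloppyCache{slots: make([]atomic.Pointer[cacheEntry], n)}
    }

    func (c *sloppyCache) get(lsn uint64) ([]byte, bool) {
        e := c.slots[lsn%uint64(len(c.slots))].Load()
        if e != nil && e.lsn == lsn {
            return e.page, true
        }
        return nil, false // miss: the caller falls back to mmap / the log
    }

    func (c *sloppyCache) put(lsn uint64, page []byte) {
        // No CAS loop and no resize coordination: if two goroutines race on a slot,
        // either winner is a correct value for its own LSN, and a clobbered duplicate
        // only costs an extra re-read later, never a wrong answer.
        c.slots[lsn%uint64(len(c.slots))].Store(&cacheEntry{lsn: lsn, page: page})
    }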
One shouldn't treat any single thing as a panacea, but I've definitely come around to the view that immutable-by-default is a good approach. You get to trivially dodge entire categories of nasty problems when doing something like the above.
I'd love to hear anything more you can share about the stuff you work on. It sounds really interesting.
I really enjoy seeing how similar themes emerge across the industry.
I've been working on a key-value store that is purpose-built for low-latency NVMe devices.
I have an append-only log (spread across file segments for GC purposes) whose only contents are the nodes of a basic splay tree. The key material is randomly derived, so I don't have to worry about sequential-integer insert pathologies. Each write to the database produces a modified root of the splay tree, so the log is a continuous stream of consistent database snapshots. Readers that don't require serialization can asynchronously work with a historical result set as appropriate, without any translation or synchronization required: you just point the reader at a prior root node's offset and everything works out.
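To sketch the reader side (in Go purely for illustration, and with names invented rather than taken from my actual code): any root offset in the log is a frozen snapshot, so a read is just a plain binary-search-tree descent over node offsets.

    package kv

    import (
        "bytes"
        "io"
    )

    type node struct {
        key, value  []byte
        left, right int64 // log offsets of the child nodes, or -1 if absent
    }

    // readNode is assumed: decode one serialized tree node at the given log offset.
    func readNode(log io.ReaderAt, off int64) (node, error) {
        return node{left: -1, right: -1}, nil // placeholder for the real decoding
    }

    func get(log io.ReaderAt, rootOff int64, key []byte) ([]byte, bool, error) {
        for off := rootOff; off >= 0; {
            n, err := readNode(log, off)
            if err != nil {
                return nil, false, err
            }
            switch c := bytes.Compare(key, n.key); {
            case c == 0:
                return n.value, true, nil
            case c < 0:
                off = n.left
            default:
                off = n.right
            }
        }
        return nil, false, nil
    }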
Can you explain the API a bit more? If you're batching writes in a single thread, doesn't that imply that clients don't 'know' when their writes are actually committed to disk? Or are their requests kept open until fsync?
So first, a disclaimer: I've had this design, or something like it, in the back of my head for a couple of years while keeping up with VLDB papers and the like. It's only recently that I've gotten serious about shipping a proof of concept, and as it stands I'm working bottom-up, trying to confirm that the riskier components will work. So it's by no means a done deal. Obviously I think I'm on the right track, or I wouldn't be doing it.
I'm working in Go. Client goroutines that execute read-write transactions buffer up their read and write sets, then submit them to the committer goroutine via a channel. The struct they submit includes a channel on which the committer notifies them of the result, so nothing returns to the client unless it's stable on disk. Assuming I get far enough, a future version will optionally also wait for Raft replication.
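Roughly this shape, with the field and type names made up for the sketch:

    package db

    type pageRead struct {
        pageID, lsn uint64 // what the txn observed, used for validation at commit
    }

    type pageWrite struct {
        pageID uint64
        data   []byte // the buffered page update
    }

    type commitReq struct {
        readSet  []pageRead
        writeSet []pageWrite
        done     chan error // the committer sends the outcome here after fsync
    }

    type DB struct {
        submit chan commitReq // consumed by the committer goroutine
    }

    func (db *DB) commit(readSet []pageRead, writeSet []pageWrite) error {
        req := commitReq{
            readSet:  readSet,
            writeSet: writeSet,
            done:     make(chan error, 1),
        }
        db.submit <- req  // hand the buffered transaction to the committer
        return <-req.done // nothing returns to the client until it's stable on disk
    }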