I've been working occasionally on a database engine design that takes this idea to heart. It's been hugely beneficial.
The big-picture design is simple to describe: an append-only log of copy-on-write page states, with Kafka-style compaction to maintain whatever window of history you want. This is supplemented with a compact in-memory index that holds the LSNs of the latest page states, to make accessing live data as fast as possible.
The write path goes through a single batch committer thread to amortize fsync overhead (technically two threads in series, so I can overlap prepping the next batch with waiting on fsync). A single writer isn't a big limitation, since even consumer-grade SSDs can write at GiB/s now. An append-only log also means I hit the happy path on SSDs: FTLs have come a long way in equalizing sequential vs. random write performance, but that parity comes at the cost of a lot more write cycles behind the scenes to shuffle data around.
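To make that concrete, here's roughly the shape of the pipeline in Go (type and function names are made up for the sketch, not lifted from my actual code):

    package db

    import "os"

    // Sketch of the two-stage commit pipeline. Stage one drains the submit channel and
    // packs a batch; stage two writes it, fsyncs once, then unblocks the waiting clients.
    type txn struct {
        payload []byte     // serialized page writes for this transaction
        done    chan error // the client blocks on this until the batch is durable
    }

    type batch struct {
        buf     []byte
        waiters []chan error
    }

    func prepLoop(submit <-chan txn, toSync chan<- *batch) {
        for {
            b := &batch{}
            t := <-submit // wait for the first transaction of the batch
            b.buf = append(b.buf, t.payload...)
            b.waiters = append(b.waiters, t.done)
        drain:
            for { // then greedily take whatever else is already queued
                select {
                case t := <-submit:
                    b.buf = append(b.buf, t.payload...)
                    b.waiters = append(b.waiters, t.done)
                default:
                    break drain
                }
            }
            toSync <- b // hand off; the next batch can be packed while this one syncs
        }
    }

    func syncLoop(toSync <-chan *batch, log *os.File) {
        for b := range toSync {
            _, err := log.Write(b.buf)
            if err == nil {
                err = log.Sync() // one fsync amortized across the whole batch
            }
            for _, done := range b.waiters {
                done <- err // clients learn the outcome only after the data is durable
            }
        }
    }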
Here are two examples of how immutability has drastically simplified the design:
First, the in-memory index is just a hand-tweaked version of a Bagwell trie. Since the trie is copy-on-write, once the fsync finishes, the write thread can publish the new index state with a single atomic pointer write. Client threads never see any intermediate state; from their point of view, the database moves atomically from snapshot to snapshot. (This is the part I'm coding up atm.)
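The publish step really is just an atomic pointer swap. A minimal sketch, with indexRoot standing in for the real trie types:

    package db

    import "sync/atomic"

    // indexRoot stands in for the real copy-on-write trie; once built it is never mutated.
    type indexRoot struct {
        // ...trie nodes mapping page IDs to the LSN of their latest state...
    }

    func (r *indexRoot) find(pageID uint64) (lsn uint64, ok bool) {
        return 0, false // placeholder for the real trie lookup
    }

    var liveIndex atomic.Pointer[indexRoot]

    // Run by the committer after fsync succeeds: one atomic store moves every
    // subsequent reader onto the new snapshot.
    func publish(next *indexRoot) {
        liveIndex.Store(next)
    }

    // Run by any client goroutine: whatever root it loads remains valid and immutable
    // for as long as it holds the pointer, regardless of later publishes.
    func lookup(pageID uint64) (lsn uint64, ok bool) {
        return liveIndex.Load().find(pageID)
    }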
A second example: I want an on-heap cache, to avoid overtaxing the SSD and to sidestep the downfalls of a pure mmap caching approach (mmap can still serve as the miss path for this cache, however). I'm planning on using a lock-free hash table, and since I'm working in a language that doesn't provide one as a library, I'll have to roll my own. Ordinarily that would be an insane idea; such structures are infamous for being nearly impossible to get right, even with formal modeling tools. But in my case the relationship LSN -> immutable page state is stable, so the cache can be quite sloppy. With a mutable cache I'd essentially have to ensure every CAS landed in exactly the right place, even under resizing and so on. With immutability, duplicates are only a potential memory waste, not a correctness problem, so I get to be far sloppier.
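To illustrate why the sloppiness is safe, here's a loose sketch: a direct-mapped array of slots rather than the real lock-free hash table, with invented names, but the key property is the same:

    package db

    import "sync/atomic"

    type cacheEntry struct {
        lsn  uint64
        page []byte // immutable page state
    }

    type sloppyCache struct {
        slots []atomic.Pointer[cacheEntry]
    }

    func newSloppyCache(n int) *sloppyCache {
        return &sloppyCache{slots: make([]atomic.Pointer[cacheEntry], n)}
    }

    func (c *sloppyCache) get(lsn uint64) ([]byte, bool) {
        e := c.slots[lsn%uint64(len(c.slots))].Load()
        if e != nil && e.lsn == lsn {
            return e.page, true
        }
        return nil, false // miss: the caller falls back to mmap / the log
    }

    func (c *sloppyCache) put(lsn uint64, page []byte) {
        // No CAS loop and no resize coordination: if two goroutines race on a slot,
        // either winner is a correct value for its own LSN, and a clobbered duplicate
        // only costs an extra re-read later, never a wrong answer.
        c.slots[lsn%uint64(len(c.slots))].Store(&cacheEntry{lsn: lsn, page: page})
    }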
One shouldn't treat any single thing as a panacea, but I've definitely come around to the view that immutable-by-default is a good approach. You get to trivially dodge entire categories of nasty problems when doing something like the above.
I'd love to hear anything more you can share about the stuff you work on. It sounds really interesting.
I really enjoy seeing how similar themes emerge across the industry.
I've been working on a key-value store that is purpose-built for low-latency NVMe devices.
I have an append-only log (spread across file segments for GC purposes) whose only contents are the nodes of a basic splay tree. The key material is randomly derived, so I don't have to worry about sequential-integer insert pathologies. Each write to the database produces a modified root of the splay tree, so the log is a continuous stream of consistent database snapshots. Readers that don't require serialization can asynchronously work with a historical result set as appropriate, without any translation or synchronization required: you just point the reader at a prior root node's offset and everything works out.
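To sketch the reader side (in Go purely for illustration, and with names invented rather than taken from my actual code): any root offset in the log is a frozen snapshot, so a read is just a plain binary-search-tree descent over node offsets.

    package kv

    import (
        "bytes"
        "io"
    )

    type node struct {
        key, value  []byte
        left, right int64 // log offsets of the child nodes, or -1 if absent
    }

    // readNode is assumed: decode one serialized tree node at the given log offset.
    func readNode(log io.ReaderAt, off int64) (node, error) {
        return node{left: -1, right: -1}, nil // placeholder for the real decoding
    }

    func get(log io.ReaderAt, rootOff int64, key []byte) ([]byte, bool, error) {
        for off := rootOff; off >= 0; {
            n, err := readNode(log, off)
            if err != nil {
                return nil, false, err
            }
            switch c := bytes.Compare(key, n.key); {
            case c == 0:
                return n.value, true, nil
            case c < 0:
                off = n.left
            default:
                off = n.right
            }
        }
        return nil, false, nil
    }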
Can you explain the API a bit more? If you're batching writes in a single thread, doesn't that imply that clients don't 'know' when their writes are actually committed to disk? Or are their requests kept open until fsync?
So first, a disclaimer: I've had this design, or something like it, in the back of my head for a couple of years while keeping up with VLDB papers and the like. It's only recently that I've gotten serious about shipping a proof of concept, and as it stands I'm working bottom-up, trying to confirm that the riskier components will work. So it's by no means a done deal. Obviously I think I'm on the right track, or I wouldn't be doing it.
I'm working in Go. Client goroutines that execute read-write transactions buffer up their read and write sets, then submit them to the committer goroutine via a channel. The struct they submit includes a channel on which the committer notifies them of the result, so nothing returns to the client unless it's stable on disk. Assuming I get far enough, a future version will optionally also wait for Raft replication.
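Roughly this shape, with the field and type names made up for the sketch:

    package db

    type pageRead struct {
        pageID, lsn uint64 // what the txn observed, used for validation at commit
    }

    type pageWrite struct {
        pageID uint64
        data   []byte // the buffered page update
    }

    type commitReq struct {
        readSet  []pageRead
        writeSet []pageWrite
        done     chan error // the committer sends the outcome here after fsync
    }

    type DB struct {
        submit chan commitReq // consumed by the committer goroutine
    }

    func (db *DB) commit(readSet []pageRead, writeSet []pageWrite) error {
        req := commitReq{
            readSet:  readSet,
            writeSet: writeSet,
            done:     make(chan error, 1),
        }
        db.submit <- req  // hand the buffered transaction to the committer
        return <-req.done // nothing returns to the client until it's stable on disk
    }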