CRDTs Turned Inside Out

lxpz · on Jan 25, 2024

As the main author of the Merkle Search Tree paper, I'm really glad to see the design become more known and start to be used. I did not go on and build a practical system that made use of the structure, although I originally came up with the data structure while thinking about how to build decentralized social networks and data stores for collaborative applications. Thankfully, the design itself is generic and can be adpted to many use cases, and it's great to see that people have gone as far as to build independant implementations in Go and Rust. I'd be curious to see how it performs in practice, in systems that have a real-world use, and not in the simplistic synthetic benchmark scenario of the paper. Hopefully it holds up to the promise :)

pfraze · on Jan 26, 2024

At Martin Kleppman's recommendation, we adopted MSTs for AT Protocol's data repository structures. This isn't my specialty so I wasn't deeply involved in evaluating or implementing them, but my understanding is that the self-balancing was a deciding factor and that everyone has been very happy with the outcomes. You should chat with Dan Holmgren at some point if you'd like to hear about his experience with it. Appreciate your work.

iamwil · on Jan 25, 2024

I thought it was really neat when I came across it. There was a detail that I didn't find in the paper (or might have missed).

What if the key you're adding has a lot more (or a lot less) leading zeros in its hash than the number of leading zeros in the current root layer? Do you just add nodes in between the root and the new node? What should go in those in-between nodes?

lxpz · on Jan 26, 2024

I think as long as your construction is deterministic, both options are possible: you can either add the intermediate nodes that have only one children, or you can skip adding them and allow links between non-consecutive levels. In that second scenario you would add intermediary-level nodes only when necessary, i.e. only when they actually need to have several children. The second approach is better for performance but might have a bit more implementation complexity.

LAC-Tech · on Jan 25, 2024

Normally, this is the job of vector clocks in State-based CRDTs

Is the author mistaking vector clocks for version vectors? If so he would not be the first; the famous Amazon Dynamo white paper makes the same mistake when discussing their CRDT implemenetation. But verson vectors and vector clocks are different things.

Alternately, I may be ignorant of a class of CRDTs solved with vector clocks.

iamwil · on Jan 26, 2024

I think you're correct. Thanks for the correction. I've updated the post.

The difference is subtle, and for those interested, the difference is detailed here: https://haslab.wordpress.com/2011/07/08/version-vectors-are-...

Vector Clocks need to establish a partial order among a, potentially ever growing, set of events occurring in the processes.

Version Vectors need to establish a partial order (more precisely a pre-order) among the replicas in the distributed system.

macintux · on Jan 25, 2024

At Basho, we also got version vectors and vector clocks reversed, probably thanks to the Dynamo paper. As I recall, it's a subtle difference.

LAC-Tech · on Jan 25, 2024

Yeah version vectors were first coined in a very underrated paper from 4 decades ago; "Detection of Mutual Inconsistency in Distributed Systems" [0].

In it they not only invent version vectors, they also perfectly describe what came to be known as the "multi-value register CRDT", some three decades before the term was coined. (Amazon Dyano paper does the same, but only a few years before).

Also, it contains this line which always makes me laugh:

Network partitioning can completely destroy mutual consistency in the worst case, and this fact has led to a certain amount of restrictiveness, vagueness, and even nervousness in past discussions, of how it may be handled.

[0] https://pages.cs.wisc.edu/~remzi/Classes/739/Fall2018/Papers...

0x457 · on Jan 26, 2024

IIRC clocks are incremented every time peer sends or receives information, while versions incremented only on mutations.

taylorius · on Jan 25, 2024

I think the article would benefit from quickly defining what a CRDT is - (Conflict free Replicated Data Type). Even just expanding the acronym once would give a bit of an idea.

iamwil · on Jan 25, 2024

Ah, I'll put it in as a note in the beginning. You can see a more basic introduction in a previous post I wrote:

https://interjectedfuture.com/trade-offs-between-different-c...

https://news.ycombinator.com/item?id=38916647

Here's some introductory material I found helpful:

- [An interactive intro to CRDTs](https://jakelazaroff.com/words/an-interactive-intro-to-crdts...)

- [An introduction to state-based CRDTs](https://www.bartoszsypytkowski.com/the-state-of-a-state-base...)

- [CRDTs for non-academics](https://www.youtube.com/watch?v=vBU70EjwGfw)

- [CRDT: The Hard Parts](https://www.youtube.com/watch?v=x7drE24geUw)

- [Readings in CRDTs](https://christophermeiklejohn.com/crdt/2014/07/22/readings-i...)

- [crdt.tech](https://crdt.tech)

taylorius · on Jan 26, 2024

Thanks! Interesting stuff!

h4ch1 · on Jan 25, 2024

I usually just do a quick search on hn.algolia for previous discussions regarding a topic since I somehow imbibe information faster through reading discourse, where different opinions/viewpoints help me wrap my head around a concept faster. Also have noticed the same topic is often shared through different mediums (videos/articles), that also helps, getting curated, diverse content.

May not be the same for you, but just a random observation d:)

sitkack · on Jan 25, 2024

Mixture of debating armchair experts.

vacuity · on Jan 25, 2024

Sometimes threads will veer off into opinionated fights, but there's something wonderful about the threads where people chime in on a topic and keep building on the content. I often learn something new from those.

sitkack · on Jan 27, 2024

That's why we are here!

sroussey · on Jan 26, 2024

Should be added 7B to help people make the connection. ;)

lmm · on Jan 25, 2024

Disagree. CRDT is pretty much a word at this point, searching for the acronym will get you more useful results than expanding it.

koolba · on Jan 25, 2024

Without context it’s impossible to know the topic though. Given the audience here, my first thought at the title was something k8s related for Custom Resource Definition Templates?

lmm · on Jan 26, 2024

> Without context it’s impossible to know the topic though.

The topic is CRDTs; of course if you don't know what that is you have to look it up, but that's true for any word you might find in a headline (e.g. Fortnite, looking at the front page).

> Given the audience here, my first thought at the title was something k8s related for Custom Resource Definition Templates?

Searching for "CRDT" gets me at least 3 pages of nothing but CRDTs, and nothing about Custom Resource Definition Templates. So I think the meaning is pretty established.

squeaky-clean · on Jan 26, 2024

That Fortnite article still explains what Fortnite is in the opening. "The Epic Games Store will include popular game Fortnite". They're not asking for it to be forced into the title, just somewhere near the top of the page.

marcellus23 · on Jan 26, 2024

There's no reason not to expand it. Expanding the acronym does no harm to anyone who is already familiar with it, and is very helpful to people who are not.

eru · on Jan 26, 2024

That's why you give both the acronym and expand it.

echelon · on Jan 25, 2024

tl;dr - "multiplayer web apps"

The technical specifics can follow.

tadfisher · on Jan 25, 2024

Prolly trees are currently used (and maintained) in Dolt [1, 2], who dive a little deeper into the implementation.

[1]: https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-tabl... [2]: https://docs.dolthub.com/architecture/storage-engine/prolly-...

zachmu · on Jan 25, 2024

Our implementation of the Merkle DAG data structure is optimized for tabular data since Dolt is a SQL database, but a lot of it generalizes to other kinds of data. For example, we put a lot of work into making sure that chunk sizes had a predictable distribution, since very large chunks kill query performance:

https://www.dolthub.com/blog/2022-06-27-prolly-chunker/

We also have a lot of work to do with preserving structural sharing across schema changes, which we're not great at yet.

https://www.dolthub.com/blog/2024-01-19-structural-sharing-w...

jchanimal · on Jan 26, 2024

Fireproof uses a 2-layer CRDT (the inner CRDT is encrypted). We also use prolly-trees for indexing. You can read more about the storage engine here: https://fireproof.storage/posts/remote-access-crdt-wrapped-m...

Edit to add: we extracted the outer blockstore so anyone doing immutable data or ipfs can take advantage of our sync connectors. https://www.npmjs.com/package/@fireproof/encrypted-blockstor...

michaelmure · on Jan 26, 2024

A practical example of a op-based Merkle-DAG CRDT is (I believe) git-bug[1]. Some doc here[2].

I originally wrote something akin to an op-based CRDT and enforcing a purely linear series of commits in git's DAG, but eventually found that it doesn't really work in a multiplayer configuration. Eventually, I realized that I could instead have a real DAG to capture the concurrent editions, with "empty commits" as DAG merge operation.

The result is more or less what is described in that article, with some nice properties:

- written in go, I now have a generic implementation[3][4] that, given a set of operation, can easily support many practical use cases (bug tracker issue is the first, kanban and pull-requests coming).

- git itself is taking care of the storage and the synchronization with peers (aka git remotes). I get the full set of upsides described in the article.

- unfinished for now, but I can leverage some git construct to crypto sign each interaction with the data structure, to prove authenticity and later construct a proper access right system (who can edit a comment, who has admin right ...).

- additionally to the DAG structure, I also have lamport clock to give an order between each independent DAGs (last edited bug ...). They are also used as a help to compute the final order within a DAG if there is ambiguity (concurrent editing).

I'm much more an engineer than a researcher, so it'd be awesome to have the opinion of the community and especially iamwil, hecturchi or lxpz.

[1]: https://github.com/MichaelMure/git-bug [2]: https://github.com/MichaelMure/git-bug/blob/master/doc/model... [3]: https://github.com/MichaelMure/git-bug/blob/master/entity/da... [4]: https://github.com/MichaelMure/git-bug/tree/master/entity/da...

hecturchi · on Jan 26, 2024

I'm the main author of the Merkle-DAGs CRDTs and I'm very happy to see this here. I'm a bit sorry now that I used the "Merkle-CRDT" name all over the paper because Merkle-Search-Trees CRDTs also deserve that tag.

I think it is good to explain Merkle-DAG CRDTs (MD-CRDTs) as "merklerized "op-based CRDTs", and I would add that Merkle-Search-Tree CRDTs (MST-CRDTs) are akin to "state-based" CRDTs, to complete the analogy. This helps to bridge traditional and Merkle CRDTs and I would probably introduce it the same way, but it is also not quite right when going into some more depth. I will try to explain why.

In MD-CRDTs we have a growing-DAG that works like a chain of operations that are applied in the order that the DAG is traversed. It looks a lot like an operation log, except it can have multiple branches and, if we follow my paper, operations do not need to have any total order (unlike op-based crdts). We know the latest operation in the DAG (the root node) happened AFTER its children, so there is an embedded notion of order. However, if it has multiple children, we don't know how to order the operations in their respective branches, and we don't need to because they should be commutative etc. For that to work, the paper explains that DAG-nodes embed not operations, but "state deltas". So in reality, MD-CRDTs are delta-based crdts where the merkle-dag structure stores the deltas as they are broadcasted from replicas. It's just that deltas look a lot like operations.

In MST-CRDTs, if my understanding is correct, replicas will be broadcasting the roots of the DAG pointing the full state (similar to state-based CRDTs broadcasting the full state). However, thanks to Merklelization etc., changes to the state only update sections of the full DAG, and diffing, retrieving or broadcasting only the changed parts is super easy and very compact. Since the nodes values will be CRDTs themselves, they can be merged on conflict just fine. In practice, it is sort of also dealing with delta-crdts.

The main thing in both is that reconstructing the full state from deltas is easy and efficient because everything is linked, so you just follow links and drop branches that you already have. Deltas without headaches.

Perhaps another way to conceptually understand both is that in MD-CRDTs, you add DAG-nodes to the top of the DAG as root DAG nodes, leaving the bottom of the DAG untouched. In MST-CRDTs, you add dag nodes to the bottom (the leaves) of the DAG, which cascades into having new root DAG nodes. In both cases you broadcast new root DAG nodes and the rest of people traverse and sync.

Now, what are the practical consequences for implementations:

- The main problem in pure MD-CRDTs (with no caveats) is the unbounded growth (ala blockchain) even when the state becomes smaller. This only matters if you plan to delete or overwrite things from the state. In MST-CRDTs it is "easy" to discard orphaned data. However, one of the main advantages in MD-CRDTs is that "syncing" happens from the last operation/delta. This can be important if you care about having certain notion of order, i.e. that "elements added last" are the first ones to be synced into an empty replica. Another advantage may be to be able to reconstruct multiple views of the same state by re-traversing the DAG, which has "everything".

- MST-CRDTs are very good for database tables, syncing is very efficient, things are compact. I haven't implemented them myself but they seem quite adequate for that use-case. When thinking of downsides, I wonder if they suffer from the cost of re-hashing operations as they grow with a very large number of keys and with heavy usage (larger blocks to hash or more levels to update). One very important advantage of MST-CRDTs is that they are more resilient if a leave-DAG-node is unretrivable. In MD-CRDTs an unretrievable DAG node can break DAG-traversal and with it a large portion of the state while in MST-CRDTs, this will affect only a key or a set of keys related to the missing DAG node.

In the end, for both the devil will be in implementation details: how good is your pubsub code? How good and parallelizable is your update-processing code? How good is your GC process? How good is your blockstore? All those things become performance bottlenecks even when the theoretical properties work perfectly.

hecturchi · on Jan 26, 2024

I forgot: key-value store using MD-CRDTs was implemented here: https://github.com/ipfs/go-ds-crdt

The trickiest part was not the CRDT, but the DAG traversal with multiple workers processing parallel updates on multiple branches and switching CRDT-DAG roots as they finish branches.

iamwil · on Jan 26, 2024

Were you inspired by Git? How did you come up with MD-CRDTs?

hecturchi · on Jan 26, 2024

I worked on IPFS Cluster and people like my co-authors had been experimenting with this already, but didn't have a name nor was formalized. I needed a way to sync state and decided to write it down "properly" as a paper along the way.

I think it's very valuable to bring people into the topic which is very cool (that's why I added a very long intro to try to provide all the background).