In fact, convergence is a very easy property to preserve in all distributed systems. The trivial but technically valid version of convergence is to throw away all the writes and always return an empty document. A "last writer wins" version at the document level is what you get from a blob store like S3, but while it does converge, it's not that great either.
What we probably want from a distributed system is useful convergence properties that preserve the intent of the participants. A CRDT might not be a good fit for a bank account: if we can both withdraw the last $20 from my account, the bank will be upset. On the other hand, it's a pretty great way of combining independent observations into a list: it doesn't matter what order the observations arrive. Easy!
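To make the "order doesn't matter" point concrete, here is a minimal sketch of a grow-only collection of observations. It assumes the observations are stored as a set rather than an ordered list, and the `merge` helper and sample data are made up for illustration; it is not how any particular library does it.

```python
from itertools import permutations


def merge(a: frozenset, b: frozenset) -> frozenset:
    """Merge two replicas of a grow-only set by taking the union."""
    return a | b


# Three replicas that each recorded observations independently
# (hypothetical sample data).
replicas = [
    frozenset({"robin at 07:10"}),
    frozenset({"wren at 07:12", "robin at 07:10"}),
    frozenset({"heron at 07:30"}),
]

# Deliver the replicas in every possible arrival order and collect
# the distinct final states.
results = set()
for order in permutations(replicas):
    state = frozenset()
    for replica in order:
        state = merge(state, replica)
    results.add(state)

# Every arrival order ends in the same state.
assert len(results) == 1
```

Union is commutative, associative, and idempotent, which is why arrival order and duplicate deliveries don't change the outcome. A bank balance has no such merge: two concurrent withdrawals of the last $20 can't both be honoured.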
Most CRDTs aim to preserve causality: if I see your change, and then make my change, my new value will win. If we both make changes without knowing about each other, that's a conflict.
Of course if we both edit unrelated fields -- maybe it's not a conflict! At least, that's how we handle it in Automerge.
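One common way to decide "did I see your change before making mine?" is a version vector. The sketch below is only an illustration of that idea under my own assumptions (the `compare` function and the actor names are hypothetical); it is not a description of Automerge's internals.

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Classify how change `a` relates to change `b`.

    Each argument maps an actor name to the number of that actor's
    changes that had been seen when the change was made.
    """
    actors = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in actors)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in actors)
    if a_ahead and b_ahead:
        return "concurrent"  # neither side knew about the other: a conflict
    if a_ahead:
        return "after"       # a was made with knowledge of b: a wins
    if b_ahead:
        return "before"      # b was made with knowledge of a: b wins
    return "equal"


yours = {"you": 1}
mine_after_seeing_yours = {"you": 1, "me": 1}
mine_without_seeing_yours = {"me": 1}

assert compare(mine_after_seeing_yours, yours) == "after"
assert compare(mine_without_seeing_yours, yours) == "concurrent"
```

Keeping one such comparison per field, rather than per document, is one way concurrent edits to unrelated fields can avoid being flagged as conflicts.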
In the most conservative case, we should never merge data automatically. Most systems have unmodeled constraints. For example, sometimes a `git` merge will produce no conflicts but fail to compile anyway. Git's model (another CRDT) doesn't model program behaviour, nor do we expect it to. In this case, we rely on a combination of our experience, programming tools, and git's version history tooling to figure out what went wrong.
The conclusion I have is that a CRDT should give us robust tools for minimizing conflict, but also needs to be able to explain how things got to be the way they are and what you can do to make them how you want.
We've made a decent amount of progress on this in Automerge and have a paper coming up about this problem soon, but I agree there's still more distance to go. If there are particular questions you have about merge semantics, I'm all ears! We'll continue to explore this space for the foreseeable future, and I'd love to hear about new questions.
The last thing I want to add is that when you say "CRDTs are only really useful for a narrow subset of data", you're really drawing a lot of conclusions all at once about other people's needs and interests. From my perspective, CRDTs are useful for a lot of kinds of data. Not everything, certainly, but from where I sit, perhaps more kinds of data than a limited single-node relational database, and more kinds than a POSIX file, which doesn't retain any history at all.
> From my perspective, CRDTs are useful for a lot of kinds of data.
Yep I 100% agree.
I think the highest value uses for technology like this are in creative applications. I think about wikis, blogs, shared whiteboards, music production and video editing. In all of these cases, "referential integrity" (database constraints) doesn't really matter that much, and the working set is usually pretty small.
Sketch was outcompeted by Figma because Figma used a CRDT as its backend, which enabled it to be collaborative. Sketch had an arguably better product, and was first to market. But it was stuck in the single-editor model because they didn't have a tool like Automerge.
As for conflicts, increasingly my favorite CRDT for "general purpose" data (JSON trees) is the MVRegister. In the case of a conflict, an MV (multi-value) register stores all of the conflicting values. But the application doesn't have to care: we can still treat it like a "last writer wins" register.
To make this work, the CRDT provides two APIs, a simple one and a complex one:
- The simple API just gives the application "the current value". In the case of concurrent edits, the system quietly chooses a winner. This is enough for most software most of the time. It's certainly enough to get started.
- The complex API returns all current values when a conflict has happened. Applications further along in their development lifecycle can use this API to present conflicts to the user and ask the user what should happen. (Or the application can resolve the conflict itself using application-specific logic).
The nice thing about this approach is that the data itself doesn't have to change. It's just an application / UI change to show conflicts. So collaborative applications can be written without caring about conflicts (at first). And later, when conflicts between multiple users cause problems, the applications can move to a richer API if they want to. (And remember, it all works like git under the hood anyway. We can store the full history, so even when conflicts are resolved in a weird way, you still haven't lost the users' original edits.)
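Here's a rough sketch of what that two-API register could look like. Everything in it is my own assumption for illustration: the class name, the `write`/`value`/`values` method names, the use of version vectors to detect concurrency, and the tie-break rule. It also assumes each replica has a unique actor name. It isn't the API of Automerge or any other library, and it leaves out the history storage mentioned above.

```python
Clock = dict[str, int]  # actor name -> number of that actor's writes seen


def dominates(a: Clock, b: Clock) -> bool:
    """True if clock `a` has seen everything in `b` and is not identical to it."""
    return a != b and all(a.get(k, 0) >= n for k, n in b.items())


class MVRegister:
    def __init__(self, actor: str):
        self.actor = actor
        self.entries: list[tuple[object, Clock]] = []  # (value, clock) pairs

    def write(self, value) -> None:
        # A new write has seen every current entry, so its clock is their
        # pointwise maximum plus one tick for this actor.
        clock: Clock = {}
        for _, seen in self.entries:
            for k, n in seen.items():
                clock[k] = max(clock.get(k, 0), n)
        clock[self.actor] = clock.get(self.actor, 0) + 1
        self.entries = [(value, clock)]

    def merge(self, other: "MVRegister") -> None:
        combined = self.entries + [e for e in other.entries if e not in self.entries]
        # Drop entries that some other entry has causally overwritten.
        self.entries = [
            (v, c) for v, c in combined
            if not any(dominates(c2, c) for _, c2 in combined)
        ]

    def values(self) -> list:
        """Complex API: all surviving values, in a deterministic order."""
        ordered = sorted(self.entries, key=lambda e: sorted(e[1].items()))
        return [v for v, _ in ordered]

    def value(self):
        """Simple API: quietly pick one winner (arbitrary but deterministic)."""
        vals = self.values()
        return vals[-1] if vals else None


alice, bob = MVRegister("alice"), MVRegister("bob")
alice.write("blue")
bob.write("green")             # concurrent with alice's write
alice.merge(bob)
bob.merge(alice)
assert alice.values() == bob.values() == ["blue", "green"]
assert alice.value() == bob.value()   # both replicas pick the same winner

alice.write("teal")            # alice has seen both values, so this resolves the conflict
bob.merge(alice)
assert bob.values() == ["teal"]
```

The stored data is the same whichever method the application calls, so switching from `value()` to `values()` later is purely an application-side change.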
> Most CRDTs aim to preserve causality: if I see your change, and then make my change, my new value will win. If we both make changes without knowing about each other, that's a conflict.
I haven't kept track of CRDTs since I worked with them in ~2015 and read the paper by Shapiro et al., but I thought a casual description would be more along the lines of "once we both receive each other's changes, we will agree on the final state"? Or does that no longer reflect the current state of the art, or was I just mistaken at the time?
Would you say that Automerge is useful for applications that don’t involve a human? I’m imagining a cluster of “service registry” services that use Automerge as a way to manage shared state between them. There wouldn’t be a human to fix a merge conflict, so all possible merge outcomes would need to be well defined.
The CRDT examples I see are all oriented around human collaboration. Are they a bad choice for something more akin to a distributed database?