LWW / LUW (last-write-wins / last-update-wins) systems are a fine first approximation, but they generally can't fulfill the strictest guarantees and desirable outcomes for a database without accounting for a lot of edge cases.
A good conflict management system will have that only as a default. Better ones also provide mechanisms for overriding LWW in specific scenarios, on a per-table or even per-row basis, depending on the incoming/outgoing values from both nodes. The trick is building these conflict management rules so they're independently deterministic, or you'll end up with node divergence.
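To make that concrete, here's a rough sketch (hypothetical row format, not any product's actual API) of what "independently deterministic" means: the winner is a pure function of the two row versions, so every node picks the same one regardless of arrival order.

```python
# Hypothetical per-table override: the winner depends only on the two row
# versions, never on arrival order, so all nodes converge independently.

def resolve(local: dict, remote: dict) -> dict:
    # Default rule: last-write-wins on the commit timestamp...
    if local["ts"] != remote["ts"]:
        return local if local["ts"] > remote["ts"] else remote
    # ...with a deterministic tie-break on node id. Breaking ties on
    # "whichever change arrived last" would diverge across nodes.
    return local if local["node_id"] > remote["node_id"] else remote

a = {"ts": 100, "node_id": "n1", "qty": 5}
b = {"ts": 100, "node_id": "n2", "qty": 7}
# Symmetric: both argument orders yield the same winner on every node.
assert resolve(a, b) == resolve(b, a) == b
```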
Normally we just do what you said, and strongly recommend sticky sessions, geographically disparate segments, etc., to prevent the problem before it happens. It's far easier to avoid a conflict than to correct it after the fact.
While a consensus model "fixes" this, it drastically cuts your throughput to a function of the latency of the most remote node in the quorum, essentially defeating the purpose of having a local Master node. Ideally you would reserve it for the specific transactions that are sensitive enough to require absolute consistency regardless of the incurred latency (see PACELC). And even that requires a distributed deadlock detector, unless more than just the transactions are managed through the consensus layer.
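Back-of-the-envelope, with made-up RTTs, on how hard that quorum round trip caps a single session:

```python
# Assumed numbers: a 3-node consensus group, leader local, two remote
# replicas. A majority (2 of 3) must ack; the leader's own vote is free.

replica_rtts_ms = [70, 140]  # hypothetical round trips to the remote replicas

def commit_latency_ms(rtts, cluster_size=3):
    """Time until enough remote acks arrive for a majority commit."""
    remote_acks_needed = (cluster_size // 2 + 1) - 1  # majority minus leader
    return sorted(rtts)[remote_acks_needed - 1]

lat = commit_latency_ms(replica_rtts_ms)
print(f"~{lat} ms per commit -> ~{1000 / lat:.0f} serial commits/s per session")
# ~70 ms per commit -> ~14 serial commits/s per session, versus thousands
# when the commit only has to touch the local node.
```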
BDR, for example, provides both of these models for the above reasons. Multi-Master is not an easy problem to solve, and it requires a multi-faceted approach to even come close to something usable in a generalized scenario.
I don't agree with the "fine" in the "fine first approximation". I believe it is indeed a bad solution.
LWW/LUW basically leads to data loss, something unexpected for Postgres users, and something that potentially undermines its heritage of caring for data to the extreme.
From a practical perspective, if you run an inventory or sales system for physical goods, a counter over multi-master with said conflict resolution may lead to double bookings, lost bookings, or negative inventory stock. None of these events are fine to me.
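Here's the classic lost update under LWW in miniature (made-up values): both nodes sell the last unit, resolution keeps only one write, and a sale silently vanishes.

```python
# Both nodes read stock = 1 and sell "the last unit" concurrently.
stock = 1
read_a, read_b = stock, stock

# Each writes back read - 1, stamped with its local commit time.
write_a = {"value": read_a - 1, "ts": 100}  # node A's sale
write_b = {"value": read_b - 1, "ts": 101}  # node B's sale

# LWW keeps only the newest write; node A's sale is silently discarded.
winner = max(write_a, write_b, key=lambda w: w["ts"])
stock = winner["value"]

print(stock)  # 0 -- yet two units were "sold" from a stock of 1
```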
I'd prefer not to have conflict resolution, and instead have the replication system explicitly check that the data sets are disjoint. In that case, these multi-master systems would be very useful (for a particular set of applications).
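Something as simple as this sketch (hypothetical partition column, in Python for brevity) captures the idea: each node owns a region and refuses foreign writes, so a conflict can never form.

```python
# Assumed deployment config: this node owns the "apac" partition.
NODE_REGION = "apac"

def check_write(row: dict) -> None:
    """Reject any write whose partition key this node doesn't own."""
    if row["region"] != NODE_REGION:
        raise PermissionError(
            f"row belongs to {row['region']!r}; this node owns {NODE_REGION!r}"
        )

check_write({"id": 1, "region": "apac", "balance": 42})  # accepted
try:
    check_write({"id": 2, "region": "emea", "balance": 10})
except PermissionError as err:
    print(err)  # rejected up front -- nothing to "resolve" later
```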
That consensus is "slow" is also a problem well solved in the real world. Sure, it's not easy, but when you partition and parallelize, you overcome many of the problems. Look at Spanner, AzureDB, or the Postgres-compatible APIs of YugabyteDB and CockroachDB. They have all implemented multi-master with consensus.
Right tool for the right job. Users in Tokyo won't be editing accounts for users in Australia. There's often a natural partition there, where you can safely leverage data locality on an eventual-consistency basis.
Conflict-free Replicated Data Types (CRDTs) and column-level conflict resolution also largely address the issue with data merges and coarse counter systems. I'll just ignore the general design flaw there, considering these should be insert-only ledger systems rather than counters, because bad or naive implementations abound. I get that the application layer shouldn't necessarily have to coddle the database, given that absolute ACID would resist non-ideal usage patterns.
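For anyone unfamiliar, a grow-only counter is the simplest CRDT; a minimal sketch (assuming each node only ever increments its own slot) shows why merges can't drop increments the way LWW can:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merge by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Commutative, associative, idempotent: any replication order converges.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

a, b = GCounter("tokyo"), GCounter("sydney")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5  # no increment lost, no coordination needed
```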
I haven't played with the others, but I set up an 8-node CockroachDB cluster and managed to get pgbench to run on it. I was clearly doing something wrong, because I wasn't impressed with the throughput even at very high client counts. From what I could see, I was being throttled by some mix of RTT and locking contention within the default 3-node consensus group. Even supposing that isn't problematic with non-overlapping shard locality, you still hit a hard performance ceiling when multiple clients interact with the same pool fragment.
> Right tool for the right job. Users in Tokyo won't be editing accounts for users in Australia. There's often a natural partition [...]
Exactly. That's why I'd say tools that support multi-master should provide and enhance (enforce) support for this use case, and avoid (not support) the cases where data durability and consistency are lost.
> Conflict-free Replicated Data Types (CRDTs) [...]
Yes, those are another solution to the problem... but I don't see them provided by BDR or Pgcat.
> I'll just ignore the general design flaw there, considering these should be an insert-only ledger system rather than counters [...]
It's not that I advocate for that design, but many users may choose this implementation. Saying "hey, you need to do all this, implement these best practices, and pass the Jepsen test just to be sure you don't screw up data consistency and durability" puts a heavy burden on the user. Databases, if anything, are meant to abstract users away from complex data issues and provide solid primitives to work with. Again: Postgres is known as a very well trusted data store with given ACID properties. Conflict resolution breaks with this.
> I haven't played with the others, but I set up an 8-node CockroachDB cluster and managed to get pgbench to run on it. I was clearly doing something wrong, because I wasn't impressed with the throughput even at very high client counts [...]
I haven't benchmarked it myself, but I've heard stories of low performance. Actually, I always believed that the internal key-value store that they use would never scale to represent table workloads. But give the other solutions a shot. Spanner, for one, is known to process millions of transactions per second, with global (planet-scale) replication (of course, with notable lag, but total consistency). That's pretty cool ;)
> Actually, I always believed that the internal key-value store that they use would never scale to represent table workloads.
Care to elaborate here? As someone working at that layer of the system: our RocksDB usage is but a blip in any execution trace (as it should be; the network overhead inherent to a distributed system dominates single-node key-value perf). That aside, plenty of RDBMS systems are designed to sit atop internal key-value stores. See MySQL+InnoDB[0], or the MySQL+RocksDB[1] used at Facebook.
Don't get me wrong, both RocksDB and the work done by CCDB are pretty cool.
Yet I still believe that layering a row model as the V of a K-V store introduces, by definition, inefficiencies when accessing data the way row stores do, as compared to pure row storage. It's not that it can't work, but I believe it can never be as efficient as a more row-oriented storage layout (say, like Postgres's).
I have no idea what you're saying. What's "row-oriented storage" if not storing all the column values of a row in sequence, in contrast to storing all the values of a column across the table in sequence (aka a "column store")? What does the fact that it's exposed behind a KV interface have to do with anything? What's "more" about Postgres' "row-orientedness" compared to MySQL's?
In case you didn't know, a row [idA, valA, valB, valC] is not stored as [idA: [valA, valB, valC]]. It's more like [id/colA: valA, id/colB: valB, id/colC: valC] (modulo caveats around what we call column families[0], where you can have it be more like the former if you want). My explanation here is pretty bad, but [1][2] go into more detail.
I know it well. I have read most of the CCDB posts and architecture documentation. Pretty good job.
There are several ways to map a row to K-V stores, and different databases have chosen different approaches, I'm not referring specifically to CCDB's.
Whether you do [idA: [valA, valB, valC]] or [id/colA: valA, id/colB: valB, id/colC: valC], what I'm saying is that I believe either is less efficient than [idA, valA, valB, valC], which also more clearly supports the case of compound keys (aka [idA, idB, idC, valA, ...]). Both of those are ways Postgres stores rows.
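To spell out the three layouts in code (toy dict encodings, nothing like the real on-disk formats):

```python
row = {"id": "idA", "colA": "valA", "colB": "valB", "colC": "valC"}

# (1) One KV pair per column (the id/col form above):
per_column = {f"{row['id']}/{c}": v for c, v in row.items() if c != "id"}

# (2) One KV pair per row, the value packing all columns:
per_row = {row["id"]: [row["colA"], row["colB"], row["colC"]]}

# (3) Flat heap tuple, Postgres-style; compound keys just extend the prefix:
heap_tuple = ("idA", "valA", "valB", "valC")

# Naively, a full-row read touches 3 keys in (1) but 1 key in (2), while
# (3) is read in place from a heap page (modulo LSM range scans making (1)
# cheaper than three pure point lookups).
print(per_column, per_row, heap_tuple, sep="\n")
```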