Pgcat – Enhanced PostgreSQL logical replication (github.com/kingluo)
145 points by jinhualuo on Feb 10, 2020 | hide | past | favorite | 21 comments



> Instead of worker processes and low-level row ingression, pgcat uses sql template to apply high-level sql commands, so it could make largest compatibility on target table type. It is written in golang and runs in separate process.

Doing "pgcat uses sql template to apply high-level sql commands" unconditionally is somewhat sad. It has WAY higher overhead than the referenced "low-level row ingression". It makes sense to do so in some cases (applying into a view, triggers, apply functions, ...), but doing it always doesn't seem great.


Yes, the template-generated SQL brings some overhead, but that's the cost of flexibility: the target is not limited to a base table. It can be a view, a foreign table, a partitioned table, or a Citus distributed table. It also supports table name mapping.
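For intuition, here's a minimal sketch of the template idea in Go (the language pgcat is written in): rendering a high-level UPSERT from a template, so the same statement can target a base table, a view with INSTEAD OF triggers, or a foreign table. The template text and names are illustrative, not pgcat's actual templates.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderUpsert renders a high-level UPSERT from a SQL template, so the
// generated statement works against views, foreign tables, or
// partitioned tables, not just base tables. Illustrative only.
func renderUpsert(table, key string, cols []string) (string, error) {
	const tmpl = `INSERT INTO {{.Table}} ({{join .Cols ", "}}) ` +
		`VALUES ({{ph .Cols}}) ` +
		`ON CONFLICT ({{.Key}}) DO UPDATE SET {{set .Cols}}`
	funcs := template.FuncMap{
		"join": strings.Join,
		"ph": func(cols []string) string { // $1, $2, ...
			ps := make([]string, len(cols))
			for i := range cols {
				ps[i] = fmt.Sprintf("$%d", i+1)
			}
			return strings.Join(ps, ", ")
		},
		"set": func(cols []string) string { // col = EXCLUDED.col, ...
			parts := make([]string, len(cols))
			for i, c := range cols {
				parts[i] = c + " = EXCLUDED." + c
			}
			return strings.Join(parts, ", ")
		},
	}
	t, err := template.New("upsert").Funcs(funcs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	err = t.Execute(&b, struct {
		Table, Key string
		Cols       []string
	}{table, key, cols})
	return b.String(), err
}

func main() {
	sql, err := renderUpsert("public.accounts_view", "id", []string{"id", "balance"})
	if err != nil {
		panic(err)
	}
	fmt.Println(sql)
}
```

The flexibility/overhead trade-off is visible here: the output is a plain SQL statement the server must parse and plan, rather than a direct low-level row write.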


You can detect whether a target is a plain table or not, and use a different path if not.

And independent of using SQL or not: executing statements separately out of process kills you via latency, even if you're on the same machine. The context switches are just too expensive. That makes single-threaded apply untenable for anything with more than a few clients, or for bulk loads.


I will do more performance testing, especially in a production environment.

Detecting the target table type is easy, but it seems tricky to redirect the flow to Postgres's built-in logical replication (there's no need to re-implement it, especially since it's written in C, so redirection is the right way).

In fact, you could use Postgres logical replication directly in some cases, like big, real-time data flows. But it currently has one big flaw: bi-directional or multi-master replication is impossible, because changes would cycle endlessly between nodes. It does not filter by origin at all; check the function pgoutput_origin_filter.
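The loop-prevention idea being discussed can be sketched like this (types and names are made up for illustration; in core Postgres the decision point is the C function pgoutput_origin_filter, which currently filters nothing):

```go
package main

import "fmt"

// Change is a decoded logical-replication change. An empty Origin
// means it was generated locally; otherwise it was replayed from
// another node. Illustrative types, not Postgres internals.
type Change struct {
	Origin string // "" = local, otherwise the source node name
	Desc   string
}

// shouldForward reports whether a change may be published to
// subscribers. Forwarding only locally-originated changes prevents a
// change from being echoed back and forth between two masters.
func shouldForward(c Change) bool {
	return c.Origin == ""
}

func main() {
	local := Change{Origin: "", Desc: "local INSERT"}
	replayed := Change{Origin: "nodeB", Desc: "replayed from nodeB"}
	fmt.Println(shouldForward(local), shouldForward(replayed)) // true false
}
```

Without that filter, node A applies B's change, re-publishes it, B applies it again, and so on: the "dead cycle loop" described above.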

And even if the origin filtering were right, it would still lack conflict resolution. pgcat can help there: it resolves conflicts with triggers, so you don't even need pgcat for the replication itself; check the pgcat_setup_lww command.


I'm not a big fan of conflict resolution, and LWW is a pretty bad choice in most cases. Conflict resolution means data loss (it violates the D of ACID: a user expects a committed transaction to endure, but it may be deleted or overwritten by another one), and most likely the C too.

The effects of this are hard to predict and can hit your business logic pretty hard. This is not the kind of behavior you expect from Postgres. This would fail a Jepsen test within the first second.

So I'd advise not to use it. The only use case I see for bi-directional replication is where you ensure that all the master nodes have disjoint sets of data, and the logical replication just aggregates them, without a possibility for conflicts.

If you really want multi-master you need a system that works on top of a consensus protocol. And there are definitely tools out there for this. But not LWW.


Yes, if multiple masters change the same rows of the same table, it causes conflicts.

Consensus protocols like Raft have a limitation: they require low latency between nodes, so they are normally deployed within a single data center. Moreover, if all changes (conflicting or not) are handled by the consensus protocol, performance would be bad. In fact, data centers are normally loosely coupled (the application normally picks the same data center as its data source/target), so data centers just need some kind of lightweight conflict resolution.

pgcat's LWW, just like Cassandra's, provides a counter-column concept, which is used to retain each change across data centers, plus simple aggregation helper functions. pgcat's LWW is implemented as a Postgres trigger, so if you need to customize it for your business, you can modify the trigger function to fit.
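As a rough illustration of what last-write-wins resolution does (pgcat's real logic lives in the plpgsql trigger installed by pgcat_setup_lww; this Go sketch and its names are hypothetical):

```go
package main

import "fmt"

// Version is one node's committed value for a conflicting row.
// Hypothetical sketch, not pgcat's actual trigger logic.
type Version struct {
	TS   int64  // commit timestamp (e.g. microseconds since epoch)
	Node string // origin node, used as a deterministic tiebreaker
	Val  string
}

// lww picks the winner. Timestamp ties are broken by node name so
// every node resolves the same conflict identically and the cluster
// does not diverge.
func lww(a, b Version) Version {
	if a.TS != b.TS {
		if a.TS > b.TS {
			return a
		}
		return b
	}
	if a.Node > b.Node {
		return a
	}
	return b
}

func main() {
	a := Version{TS: 100, Node: "tokyo", Val: "x"}
	b := Version{TS: 200, Node: "sydney", Val: "y"}
	fmt.Println(lww(a, b).Val) // y: newer timestamp wins
}
```

Note that the losing version's committed value is simply discarded, which is the data-loss concern discussed elsewhere in this thread.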


LWW / LUW systems are a fine first approximation, but generally do not fulfill the strictest guarantees and desirable outcomes for a database without accounting for a lot of edge cases.

A good conflict management system will have that only as a default. Better ones will also provide mechanisms for overriding LWW in specific scenarios on a per table or even row basis depending on the in/out values from both nodes. The trick is building these conflict management rules so they're independently deterministic, or you'll end up with node divergence.

Normally we just do what you said, and strongly recommend interacting with sticky sessions, geographically disparate segments, etc., to prevent the problem before it happens. It's far easier to avoid a conflict than to correct it after the fact.

While a consensus model "fixes" this, it drastically cuts your throughput to a function of the latency of the most remote node in the quorum, essentially defeating the purpose of having a local Master node. Ideally you would reserve this for specific transactions that are sensitive in that they require absolute consistency regardless of incurred latency (See PACELC). And even this requires a distributed deadlock detector unless more than just the transactions are managed through the consensus layer.

BDR for example provides both of these models for the above reasons. Multi-Master is not an easy problem to solve, and requires a multi-faceted approach to even come close to something usable in a generalized scenario.


I don't agree with the "fine" in the "fine first approximation". I believe it is indeed a bad solution.

LWW/LUW basically leads to data loss, something unexpected for Postgres users, which potentially undermines its heritage of caring for data to the extreme.

From a practical perspective, if you run an inventory or sales of physical goods, a counter over a multi-master with said conflict resolution may lead to double bookings, lost bookings or negative inventory stocks. None of these events are fine to me.

I'd prefer not to have conflict resolution and have the replication system explicitly check that the data sets are disjoint. In that case, these multi-master systems would be very useful (for a particular set of applications).

That consensus is "slow" is a problem that's also well solved in the real world. Sure, it's not easy, but when you partition and parallelize, you overcome many of the problems. Look at Spanner, AzureDB or the Postgres-compatible APIs of YugabyteDB or CockroachDB. They have all implemented multi-master with consensus.


Right tool for the right job. Users in Tokyo won't be editing accounts for users in Australia. There's often a natural partition there where you can safely leverage the data locality on an eventual-consistency basis.

Conflict-free Replicated Data Types (CRDTs) and column-level conflict resolution also largely address the issue with data merges and coarse counter systems. I'll just ignore the general design flaw there, considering these should be an insert-only ledger system rather than counters, because bad or naive implementations abound. I get that the application layer shouldn't necessarily have to coddle the database, given absolute ACID would resist non-ideal usage patterns.

I haven't played with the others, but I set up an 8-node CockroachDB and managed to get pgbench to run on it. I was clearly doing something wrong, because I wasn't impressed at the throughput even with very high client counts. From what I could see there, I was clearly being throttled by some mix of RTT and locking contention between the 3-node default consensus group. Supposing this is not problematic in non-overlapping shard locality, you still hit a hard performance ceiling when multiple clients interact with the same pool fragment.

I'll have to look at the others now.


> Right tool for the right job. Users in Tokyo won't be editing accounts for users in Australia. There's often a natural partition [...]

Exactly. That's why I'd say tools that support multi-master should provide and even enforce support for this use case, and avoid (not support) the case where data durability and consistency are lost.

> Conflict-free Replicated Data Types (CRDTs) [...]

Yes, that's another solution for the problem... but I don't see them provided by BDR or Pgcat.

> I'll just ignore the general design flaw there, considering these should be an insert-only ledger system rather than counters [...]

It's not that I advocate for that system, but many users may choose this implementation. Saying "hey, you need to do all this and implement these best practices (and pass the Jepsen test)" just to be sure you don't screw up data consistency and durability... is putting a heavy burden on the user. Databases, if anything, abstract users away from complex issues with data and provide solid primitives to work on it. Again: Postgres is known as a very well trusted data store with given ACID properties. Conflict resolution breaks with this.

> I haven't played with the others, but I set up an 8-node CockroachDB and managed to get pgbench to run on it. I was clearly doing something wrong, because I wasn't impressed at the throughput even with very high client counts [...]

I haven't benchmarked it myself, but I've heard stories of low performance. Actually, I always believed that the internal key-value store they use would never scale to represent table workloads. But give the other solutions a shot. Spanner, for one, is known to process millions of transactions per second, with global (planet-scale) replication (of course, with notable lag, but total consistency). That's pretty cool ;)


> Actually, I always believed that the internal key-value store that they use would never scale to represent table workloads.

Care to elaborate here? As someone working at that layer of the system, our RocksDB usage is but a blip in any execution trace (as it should be, any network overhead you have given it's a distributed system would dominate single-node key-value perf). That aside, plenty of RDBMS systems are designed such that they sit atop internal key-value stores. See MySQL+InnoDB[0], or MySQL+RocksDB[1] used at facebook.

[0]: https://en.wikipedia.org/wiki/InnoDB

[1]: https://myrocks.io/


Don't get me wrong, both RocksDB and the work done by CCDB is pretty cool.

Yet I still believe that layering a row model as the V of a K-V store introduces, by definition, inefficiencies in accessing data the way row stores do, compared to pure row storage. It's not that it can't work, but I believe it can never be as efficient as more row-oriented storage (say, like Postgres).


I have no idea what you're saying. What's "row-oriented storage" if not storing all the column values of a row in sequence, in contrast to storing all the values of a column across the table in sequence (aka a "column store")? What does the fact that it's exposed behind a KV interface have to do with anything? What's "more" about Postgres' "row-orientedness" compared to MySQL?

In case you didn't know, a row [idA, valA, valB, valC] is not stored as [idA: [valA, valB, valC]]. It's more [id/colA: valA, id/colB: valB, id/colC: valC] (modulo caveats around what we call column families[0], where you can have it be more like option (a) if you want). My explanation here is pretty bad, but [1][2] go into more details.

[0]: https://www.cockroachlabs.com/docs/stable/column-families.ht...

[1]: https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mappin...

[2]: https://github.com/cockroachdb/cockroach/blob/master/docs/te...


I know well. I have read most CCDB posts and architecture documentation. Pretty good job.

There are several ways to map a row to K-V stores, and different databases have chosen different approaches, I'm not referring specifically to CCDB's.

Whether you do [idA: [valA, valB, valC]] or [id/colA: valA, id/colB: valB, id/colC: valC], what I'm saying is that I believe it is less efficient than [idA, valA, valB, valC], which also more clearly supports compound keys (aka [idA, idB, idC, valA, ...]). Both of those are how Postgres stores rows.
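To make the comparison concrete, here's an illustrative sketch of the two mappings (key formats are made up for the example; this is not CockroachDB's actual encoding):

```go
package main

import "fmt"

// Compare the number of KV entries a full-row read touches under the
// two mappings being debated. Illustrative only.
func main() {
	id := "idA"
	cols := []string{"colA", "colB", "colC"}
	vals := []string{"valA", "valB", "valC"}

	// (1) packed row: a single key whose value holds every column,
	// roughly how a row store lays a tuple out contiguously.
	packed := map[string][]string{id: vals}

	// (2) per-column: one KV entry per column, id/col -> val.
	perCol := map[string]string{}
	for i, c := range cols {
		perCol[id+"/"+c] = vals[i]
	}

	// Reading the whole row touches 1 key in (1) but len(cols) keys
	// in (2), which is the overhead being argued about above.
	fmt.Println(len(packed), len(perCol)) // 1 3
}
```

Column families, mentioned upthread, are a middle ground: they group several columns under one key, moving mapping (2) closer to mapping (1).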


PM @CockroachDB here. Have you seen our performance page? https://www.cockroachlabs.com/docs/dev/performance.html CRDB can easily scale throughput.


Note that LWW is optional; pgcat implements LWW via triggers. See pgcat_setup_lww (https://github.com/kingluo/pgcat/tree/master/cmd/pgcat_setup...).


What use cases are there for a replication that doesn't necessarily produce an accurate replica?


Looks really nice! Out of interest, at what stage would the DBAs here consider running a newish third-party project such as this on a production workload?


I use pg logical in production today on a small system (I'm only a couple hundred gigs). This tool looks really promising. However, for my core data I look at six months of real-world load on a test replica before putting it anywhere close to live.

Also, for some bits we also run some near shit-hack custom pg extensions in production, so... I'm not risk averse.


Personally I would test this in a greenfield project first, unless this solved a serious problem I was facing.


Yep. Former DBA here. Don't trust something new with production data without getting really familiar with it, learning its corner cases and how to recover, and learning to trust it with your data.



