PostgreSQL Logical Replication Explained (fastware.com)
125 points by kristianpaul on March 17, 2023 | hide | past | favorite | 23 comments



Couple of questions, maybe someone here can answer them.

If DDL is not supported, can you not sync schema changes automatically? If so, what is the recommended way?

If a stored procedure is triggered on the publisher and it mutates data… is that mutation what is sent? Or would the sproc also execute on the subscriber?

Does it do a good job of syncing, or is it common to reach a state where you have to blast out the subscribers to get them back in a good synced state with the publisher?


PostgreSQL logical replication only cares about the data. It doesn't require that you have the same database schema - or even a database - on the receiving end. It doesn't know about stored procedures.

You would typically use it for situations where you don't want to replicate your whole database.

You can subscribe to a narrow/specific set of changes (for example if you only care about inserts to a specific table, no other changes).
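A minimal sketch of such a narrow publication, assuming a running PostgreSQL (10+) and an illustrative table named `orders`:

```sql
-- Publish only INSERTs on one table; updates, deletes and
-- truncates on it are not sent to subscribers.
CREATE PUBLICATION orders_inserts
    FOR TABLE orders
    WITH (publish = 'insert');
```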

It has some significant limitations that mean it can't really be used as a "Kafka Streams lite":

You can't have different permissions for different publications. If a subscriber can read one, it can read them all.

There is no separate storage for the publications or the subscribers: if any subscriber is slow/stuck/offline, it will hold back all your WAL for the entire database cluster.
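You can watch how much WAL a lagging subscriber is pinning with a query against the `pg_replication_slots` catalog view (a sketch; run it on the publisher):

```sql
-- Retained WAL per replication slot; an inactive slot with a large
-- retained_wal value is what eventually fills the disk.
SELECT slot_name,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```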

--

If you want an exact replica of your database, you have to use physical replication. That still doesn't specifically care about stored procedures, but it replicates data and schema changes exactly.


The schema does not change automatically. We generally add new columns to the subscriber before adding them to the publisher if we want zero impact. But if we don't, there's no data loss; we just end up with replication held up until the column is added.
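The subscriber-first ordering described above boils down to running the same DDL on both sides, subscriber first (table and column names here are illustrative):

```sql
-- 1. On the subscriber first, so incoming rows always fit:
ALTER TABLE my_table ADD COLUMN new_col text;

-- 2. Then on the publisher:
ALTER TABLE my_table ADD COLUMN new_col text;
```

Doing it in the other order just pauses replication with errors in the subscriber's log until the column exists there too.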

Stored procedures, triggers, etc. don't exist on the subscriber. They just get any mutations on any tables they are subscribed to.


I really like this comment because I'm curious about all of these too.

For the second one, I'm wondering if it's like MySQL's binlog replication, where functions should be deterministic, which makes me think that they get rerun on the replicated server.


> if it's like MySQL's binlog replication, where functions should be deterministic, which makes me think that they get rerun on the replicated server.

For DML, MySQL replication actually allows you to choose between replicating the raw SQL statement (statement-based replication) vs replicating its effect (row-based replication). The former requires functions to be deterministic, but the latter does not.

Both modes are still technically logical replication, though, as it's a separate, higher-level abstraction that is independent of each storage engine's physical transaction logs.
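For reference, the choice between the two modes is a server setting in MySQL (this is MySQL syntax, not PostgreSQL; it normally requires appropriate privileges):

```sql
-- Replicate row images (the effects of each statement):
SET GLOBAL binlog_format = 'ROW';

-- Or replicate the SQL text itself, which requires functions
-- to be deterministic ('MIXED' is also available):
SET GLOBAL binlog_format = 'STATEMENT';
```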


Oh very cool - thanks for filling me in!


Logical replication is also great for replicating to other systems - for example Debezium [1] that writes all changes to a Kafka stream.

I'm using it to develop a system to replicate data to in-app SQLite databases, via an in-between storage layer [2]. Logical replication is quite a low-level tool with many tricky cases, which can be difficult to handle when integrating with it directly.

Some examples:

1. Any value over 8KB compressed (configurable) is stored separately from the rest of the row (TOAST storage), and unchanged TOAST values are not included in the replicated record by default. You need to keep track of old values in the external system, or use REPLICA IDENTITY FULL (which adds a lot of overhead on the source database).

2. PostgreSQL's primary keys can be pretty much any combination of columns, and may or may not be used as the table's replica identity, which may change at any time. If "REPLICA IDENTITY FULL" is used, you don't even have an explicit primary key on the receiver side - the entire record is considered the identity. Or with "REPLICA IDENTITY NOTHING", there is no identity - every operation is treated as an insert. The replica identity is global per table, so if logical replication is used to replicate to multiple systems, you may not have full control over it. This means many different combinations of replica identity need to be handled.

3. For initial sync you need to read the tables directly. It takes extra effort to make sure these are replicated in the same way as with incremental replication - for example taking into account the list of published tables, replica identity, row filters and column lists.

4. Depending on what is used for high availability, replication slots may get lost in a fail-over event, meaning you'll have to re-sync all data from scratch. This applies whether physical or logical replication is used for the standby. The only case where this is not an issue is where the underlying block storage is replicated, which is the case in AWS RDS for example.
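The replica identity variants mentioned in (1) and (2) are set per table; a sketch of the options (table and index names are illustrative):

```sql
-- Default: the primary key is the replica identity.
ALTER TABLE t REPLICA IDENTITY DEFAULT;

-- Log the whole old row for UPDATE/DELETE; avoids the TOAST
-- old-value problem at the cost of extra WAL volume.
ALTER TABLE t REPLICA IDENTITY FULL;

-- Use a specific unique, non-partial index on NOT NULL columns.
ALTER TABLE t REPLICA IDENTITY USING INDEX t_some_unique_idx;

-- No old-row information is logged at all; UPDATE/DELETE on the
-- table will error if a publication publishes those operations.
ALTER TABLE t REPLICA IDENTITY NOTHING;
```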

[1]: https://debezium.io

[2]: https://powersync.co


> Logical replication is also great for replicating to other systems - for example Debezium [1] that writes all changes to a Kafka stream.

So I'm looking for a Rust library (or a C library) that receives a Postgres logical replication stream and is compatible with Postgres 15, do you know of something?

Debezium is unfortunately written in Java, and I think your Powersync is proprietary, right? (I didn't find a Github link.) I can find stuff like https://www.npmjs.com/package/pg-logical-replication but again, it's Javascript.


For C there should be good options.

For Rust it doesn't appear that well-supported.

A very simple approach is to poll for changes using `pg_logical_slot_get_changes()` - that should work with any driver. That's what I used for my initial experimentation, before switching over to the streaming replication protocol for better performance.
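The polling approach works from any SQL driver, since it's just function calls. A sketch using the `test_decoding` output plugin that ships with PostgreSQL (the slot name is illustrative):

```sql
-- Create a logical slot using the built-in test_decoding plugin.
SELECT pg_create_logical_replication_slot('poll_demo', 'test_decoding');

-- Poll for changes. Note: *get* consumes and acknowledges the
-- changes; use pg_logical_slot_peek_changes() to read without
-- advancing the slot.
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('poll_demo', NULL, NULL);
```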

The streaming replication protocol is not that complicated, but currently you'll have to handle some of the low-level protocol yourself, or work with some very experimental implementations. There's a project to help get you started at [1], and some more discussion at [2].

For the logical decoder, wal2json is quite nice to experiment with, but I've found pgoutput is not that complicated and gives you something closer to the raw data.
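Swapping the decoder is just a matter of the plugin name when creating the slot. A sketch assuming the wal2json extension is installed on the server (it is not part of core PostgreSQL):

```sql
-- wal2json emits one JSON document per change in the data column.
SELECT pg_create_logical_replication_slot('json_demo', 'wal2json');

SELECT data
FROM pg_logical_slot_peek_changes('json_demo', NULL, NULL);
```

pgoutput, by contrast, is the built-in binary plugin used by native logical replication, so it's only practical over the streaming replication protocol rather than these SQL functions.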

[1]: https://github.com/seddonm1/logicaldecoding/

[2]: https://github.com/sfackler/rust-postgres/issues/116


We use this exact setup (debezium to kafka), on AWS RDS Aurora. It works well but has major drawbacks: for example AWS doesn’t let you do major postgresql updates if you have a live replication slot, so you have to go through a dance of deleting and recreating replication slots and kafka connectors, which also means stopping all writes to the DB, which means a whole lot of downtime


This is a great resource.

The PG docs are generally quite good, but logical replication is one of those things where experience and this sort of resource are invaluable.

That said, I also feel like logical replication is still pretty hard to use. One example, not covered here, is that the status of slots is not replicated to any secondaries. There are solutions for this now (see https://www.percona.com/blog/how-patroni-addresses-the-probl...) but generally the domain of dealing with logical changes is still one where I feel like MySQL binlog has a leg up (though still some sharp edges) in making it a bit easier to work with.


Wow, managed to write an entire article without pointing to the official documentation once. Every link is to other articles on their website, fastware.com.


Great post. PG replication is a hard topic with lots of edge cases, and I would say it's still far from end-user usability. We recently introduced PG CDC (via logical replication) to any destination https://www.cloudquery.io/blog/postgres-cdc-to-any-destinati...


I understand PG16 is laying the foundation for multiple-primary setups. Does anyone know what the rough edges are and how far it is from production-ready?


Last time I checked there was no support for synchronous logical replication in the `pglogical` plugin that's used for logical decoding. Did something change lately?



Yeah, but I remember the plugin ignoring that setting.


Does making PostgreSQL replication easy still mean paying for a third-party solution?


logical replication does wonders for live monitoring.


Great post. If it is forcing you to replicate the entire pg cluster then it sounds like you’re setting up streaming replication. For logical replication you would set up a publication either for all tables in a certain db or for tables of your choosing and then you would subscribe to that publication from the destination db.
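The publication/subscription pair described above is two statements, one on each side (all names, hosts, and credentials here are illustrative placeholders):

```sql
-- On the source database:
CREATE PUBLICATION app_pub FOR TABLE users, orders;
-- (or: CREATE PUBLICATION app_pub FOR ALL TABLES;)

-- On the destination database:
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=source-host dbname=app user=repl password=secret'
    PUBLICATION app_pub;
```

Creating the subscription also creates a replication slot on the source and, by default, copies the existing table contents before streaming changes.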


[flagged]


I suspect your perspective is based on your personal exposure. India has a lot of people, and as such, a lot of content producers, some of whom will naturally be "a few years behind". There is still a significant amount of modern and "cutting edge" content being produced in India.


Possibly that you heavily index on the latest and greatest hype, and are not tapped into that info stream from India? Also possible that culturally they focus on providing business value, which is often by using tried and true technologies?

It’s not like they’re talking about PHP becoming object oriented here…


The article in the OP talks about logical replication in PostgreSQL 15, how is this outdated?



