Show HN: Peerdb Streams – Simple, native Postgres change data capture
177 points by saisrirampur 7 months ago | 40 comments
Hello HN, I am Sai Srirampur, one of the co-founders of PeerDB (https://github.com/PeerDB-io/peerdb). We spent the past 7 months building a solid experience for replicating data from Postgres to data warehouses. Now we're expanding to queues.

PeerDB Streams provides a simple and native way to replicate changes as they happen in Postgres to Queues (Kafka, Redpanda, Google PubSub, etc). We use Postgres logical decoding to enable Change Data Capture (CDC).
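
To make the mechanism concrete for readers who haven't used logical decoding before, here is what it looks like in stock Postgres (requires wal_level = logical; the customers table is hypothetical). PeerDB builds on this and consumes the decoded change stream for you:

    -- create a logical replication slot with the built-in test_decoding plugin
    SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    -- make a change...
    INSERT INTO customers (id, name) VALUES (1, 'alice');

    -- ...and peek at the decoded changes without consuming them
    SELECT * FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);

    -- drop the slot when done, otherwise it keeps retaining WAL
    SELECT pg_drop_replication_slot('demo_slot');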

Blog post here: https://blog.peerdb.io/peerdb-streams-simple-native-postgres.... 10-min quickstart here: https://docs.peerdb.io/quickstart/streams-quickstart.

We chose queues because many users found existing tools complex. Debezium is the most widely used tool for this use case and has large production usage. However, a common pain point among our users is its significant learning curve; it can take months to productionize.

A few issues:

a) Interacting through a command-line interface, understanding the various settings, and learning best practices for running it in production is not trivial. Debezium UI, released to address usability concerns [1], is still in an incubating state [2]. Additionally, reading Debezium resources to get started can be overwhelming [3].

b) Supporting data formats and transformations isn't easy. It requires a Java project, building JAR packages, and setting up a runtime path for the Kafka Connect plugin.

c) For queues other than Kafka, Debezium is not as native and doesn't offer the same level of configurability. For example, with Event Hubs, it is difficult to stream to topics spread across namespaces and subscriptions.

TL;DR: Debezium aims to provide a comprehensive experience for engineers implementing CDC rather than making it dead simple for them. You can do a lot with Debezium, but you need to know a lot about it.

At PeerDB, we are building a simple yet comprehensive experience for Postgres CDC. The goal is to enable engineers to build prod-grade Postgres CDC with a minimal learning curve, within a few days.

PeerDB’s feature-set isn't at Debezium's level yet, and as we evolve, we might face similar challenges. However, we're putting usability at the forefront and we believe that we can achieve the above goal.

First, PeerDB offers a simple UI to set up Postgres and Kafka by creating PEERs and to initiate CDC by creating a MIRROR. Through the UI, users can monitor the progress of CDC, including throughput and latency; set up Slack/email alerts based on replication slot growth; and investigate Postgres-native metrics, including slot size. Here is a demo showing the PeerDB UI in action:

https://www.loom.com/share/ebcfb7646a1e48738835853b760e5d04

Second, for users who prefer a CLI, we provide a Postgres-compatible SQL layer to manage CDC. This offers the same level of features as the UI and is more intuitive compared to bash scripts.
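
To give a feel for it, creating peers and a mirror from psql looks roughly like this (names and option keys here are illustrative and abbreviated; the quickstart linked above has the exact syntax):

    -- register the source and target as PEERs
    CREATE PEER source_pg FROM POSTGRES WITH (
      host = 'localhost', port = 5432, user = 'postgres',
      password = '...', database = 'app'
    );
    CREATE PEER target_kafka FROM KAFKA WITH (servers = 'localhost:9092');

    -- start CDC from the Postgres peer to the Kafka peer
    CREATE MIRROR orders_cdc FROM source_pg TO target_kafka
      WITH TABLE MAPPING (public.orders:orders_topic);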

Third, users can perform row-level transformations using Lua scripts executed at runtime. This enables features such as encrypting/masking PII data, supporting various data formats (JSON, MsgPack, Protobuf, etc.), and more. We offer a script editor along with a bunch of useful templates [5].

Fourth, we provide native connectors to non-Kafka targets. We also provide native configurability options tailored to these platforms. For example, with Event Hubs, users can perform CDC to topics distributed across different namespaces and subscriptions [4].

Finally, we are laser-focused on Postgres, enabling specific optimizations like native metrics for replication, wait events, and number of connections. Features like faster initial loads through parallel snapshotting and decoding transactions in flight are in private beta.
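
(These metrics are also available directly from Postgres if you want to cross-check what the UI shows; for instance, the standard queries below report slot lag in bytes, replication progress, and wait events across connections.)

    -- how much WAL each replication slot is retaining
    SELECT slot_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS slot_lag_bytes
    FROM pg_replication_slots;

    -- replication progress of connected consumers
    SELECT application_name, state, replay_lsn FROM pg_stat_replication;

    -- what connections are currently waiting on
    SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    GROUP BY 1, 2;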

Our hope is to provide the best data-movement experience for Postgres. PeerDB Streams is another step in that direction. We would love to get your feedback on product experience, our thesis and anything else that comes to your mind. It would be super useful for us. Thank you!

References:

[1] https://debezium.io/blog/2020/10/22/towards-debezium-ui/

[2] https://debezium.io/documentation/reference/stable/operation...

[3] https://medium.com/@cooper.wolfe/i-hated-debezium-so-much-i-...

[4] https://blog.peerdb.io/enterprise-grade-replication-from-pos...

[5] https://github.com/PeerDB-io/examples

[6] https://app.peerdb.cloud

[7] https://github.com/PeerDB-io/PeerDB




Nice to see more product development and offerings in this area. Well done.

[Full disclosure, I work for Prisma and we have a similar product called Pulse (https://prisma.io/pulse)]

Another use case for CDC is compliance. I reckon that in the near future, to ensure compliance with data regulations, CDC will become the better option for devs vs. traditional seek/update/delete functions.


What is Change Data Capture (CDC)?

PeerDB doesn't seem to explain the core problem it solves; here's a reference from Debezium (mentioned in the text):

> set up and configure Debezium to monitor your databases, and then your applications consume events for each row-level change made to the database. Only committed changes are visible, so your application doesn't have to worry about transactions or changes that are rolled back.

It's good to know! This model effectively turns row changes into side-effect invocations pushed onto a queue.


To my knowledge, CDC provides a solution to the following 2 pain points. Would love to hear more:

1. Replication: Imagine you have 5 different analytical environments. Do you back up & restore to each destination, essentially re-writing 10TB of data, when only 2MB worth of data changed? You'd be surprised how many organizations still do this.

2. System Triggering: How can you inform a downstream system that an event occurred? There are many mechanisms to do so, and they vary in complexity, but when the requirement is that simple (if a row in the DB changes, tag the change as an event and pass it to all the downstream systems that need to know), CDC fits naturally.


Additionally, it can help with systems where two-phase commits can lead to weird and complex situations. This is where applying the saga pattern and a transactional outbox can help: it allows abstracting the event stream away from the main application logic, and the CDC mirror can then handle generating the stream. This also allows plug-and-play of other downstream services/sinks from the CDC stream.
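
A minimal sketch of the outbox side in Postgres (table and column names are made up):

    -- events are written in the same transaction as the business change
    CREATE TABLE outbox (
      id         bigserial PRIMARY KEY,
      aggregate  text        NOT NULL,
      event_type text        NOT NULL,
      payload    jsonb       NOT NULL,
      created_at timestamptz NOT NULL DEFAULT now()
    );

    BEGIN;
    INSERT INTO orders (id, customer_id, total) VALUES (42, 7, 99.50);
    INSERT INTO outbox (aggregate, event_type, payload)
    VALUES ('order', 'order_created', '{"order_id": 42, "total": 99.50}');
    COMMIT;

    -- CDC on the outbox table then publishes each committed event to the queue,
    -- so the application never talks to the broker directly.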


Change Data Capture (and its use cases) is a well-known topic in the data engineering and streaming communities, hence we didn't go into more detail about it. But we will add it to our future posts. Thanks for the feedback here!

Here's some commentary on CDC use cases: CDC provides a reliable stream of database changes, enabling use cases such as real-time alerting (e.g., fraud detection in banking), replicating OLTP data to other types of stores (OLAP, search, etc.) for real-time analytics and search, and implementing the transactional outbox pattern, which exposes database changes to other microservices/apps (e.g., a bank exposing a raw feed of transactions to its rewards microservice).


Noob question: What is the advantage of replicating data into a warehouse vs. just querying it in place on a postgres database?


Typically data warehouses are OLAP databases that have much better performance than OLTP databases for large queries.

There might also be several applications in a company, each with their own database, and a need to produce reports based on combinations of data from multiple applications.

I think that in many cases your question is based on an idea that is completely right: engineers are too eager to split applications out into multiple databases and tack on separate data warehouses. The costs of maintaining separate databases are often higher than initially thought, especially when some of the data in the warehouse needs to go back into the application database, for example for customer-facing analytics. I think many companies would be better served by considering traditional data warehousing needs directly in their main application databases and abstaining from splitting out databases. Having one single ACID source of truth and paying a bit more for a single beefy database server makes a lot more sense than is commonly thought, especially now that many customer-facing products, like recommendation systems, are “data driven”. At least that’s my impression after working in the space for a while.


If the postgres database is recording business transactions, you don't want to cause your business to stop being able to take credit cards because you generated a report.


Assuming you use a connection pool, why would it stop? Either the query returns a result or it doesn't. Am I missing something?


Reporting queries can put a significant load on the db, to the point that it interrupts service.


Furthermore, Postgres is an OLTP (transactional) database, designed to efficiently perform updates and deletes on individual rows. OLAP (analytical) databases/query engines like ClickHouse, Presto, Druid, etc. are designed for efficient processing across many rows of mostly unchanging data.

Analytical queries (like "find the average sales price across all orders over the past year, grouped by store and region") can be 100-1000x faster in an OLAP database compared to Postgres.
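
For concreteness, that example query might look like the following (hypothetical schema); a columnar engine scans only the few columns involved instead of whole rows:

    SELECT s.region,
           s.store_name,
           avg(o.sale_price) AS avg_sale_price
    FROM orders o
    JOIN stores s ON s.store_id = o.store_id
    WHERE o.ordered_at >= now() - interval '1 year'
    GROUP BY s.region, s.store_name;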


I see, thanks!


What about using a read-only replica for reporting? Are there any downsides to that? Seems easier to manage.


That’s the use case for CDC: to make it equally easy to use a DW. As always, the complexity is just air you move around in the balloon. The OLTP DB can spit out the events and forget them; how you load them efficiently is now a data engineer’s problem to solve (if it were easy to write at event grain to an OLAP store, you wouldn’t need an OLTP one). Kafka usually enters the room at this stage, and the simplification promise becomes tenuous.


That works great if all the data you need for reporting is in the database you're replicating.

You'd likely want a data warehouse if you also need to report on data that isn't in your prod database (e.g. stripe, your CRM, marketing data, etc.).

If setting up a data warehouse, etl, BI, etc. sounds like a lot of work to get reports, you're right, it is.

shameless plug: we're making this all much simpler at https://www.definite.app/


Additionally, unless your data model is designed as append-only (which is unusual and requires logic downstream), you won't be able to track updates and deletions, which are valuable for reporting.


Event sourcing

Also, reporting/analytics workloads tend to be ad hoc queries that are hard to optimize, so you generally favor fast storage over indexes. For analytics and reporting, it's frequently more efficient to use a columnar DB than a row-oriented relational DB.


When you need to do large/medium-scale analytical queries. Postgres is fairly slow for aggregate/group queries needed for analytics. Think if you're building Google Analytics type functionality.


Data warehouses are structured to handle large volumes of data and complex queries more efficiently than a typical transactional database like PostgreSQL.


We're currently in the process of adding Kafka support to EDFS (Event Driven Federated Subscriptions), a specification to add NATS, Kafka, and other pub sub or streaming services to a federated graph.

There's one thing missing for us to hook it up to your CDC solution: for each message, we need to set the __typename field. Does your solution have a way to accomplish this?

EDFS reference: https://cosmo-docs.wundergraph.com/router/event-driven-feder...


Yes, that is doable through a very simple Lua-based row-level transformation. :) Here is an example transformation that adds an additional field: https://github.com/PeerDB-io/examples/blob/main/usdcad.lua . More details on row-level transformations are planned to be released later this week. :)


If you're open to it, we could collaborate on a demo of the two components, PeerDB CDC and EDFS, and work together on cross promoting the solutions. We've got a large customer base in the enterprise segment that's interested in EDFS with Kafka, and I think that your solution fits very well into the picture.

If you're interested, just fill out our contact form and mention this thread. This could be really cool!


How do you handle Postgres cluster failover? Does PeerDB automatically restore logical replication slot on a new primary?


Great question! We have retry logic in place to handle Postgres restarts. If the failure is in-place (Postgres comes back on the same node), you should be fine, as the slot will persist. If Postgres fails over to the standby, PeerDB will wait until the slot is created. We did consider automatically creating the slot on retries if it doesn't exist, but ensuring data reliability (creating the slot right after failover before any new changes are ingested, so nothing is missed) is tricky. So, as of now, we leave it to the user.
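
For reference, the manual steps on the new primary are roughly the following (slot and plugin names are illustrative); note that any changes written between the failover and the slot creation are not captured, which is exactly the tricky part:

    -- check whether the logical slot exists on the new primary
    SELECT slot_name, active, restart_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'peerdb_slot';

    -- recreate it if you accept the gap since failover
    SELECT pg_create_logical_replication_slot('peerdb_slot', 'pgoutput');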


Thanks! I'd say this is a very complicated problem to leave to the user. Debezium does not solve it either, and from a DBA's point of view it is a blocker for adoption. It would be nice to have some solution that ensures the logical replication slot persists through failover.


Things finally got a bit easier with Postgres 16, which allows you to create replication slots on standby servers. It still requires manual intervention to make sure they don't fall too far behind their primary counterparts and to do the failover, but it's not too terrible, and it guarantees that failover can happen without missing any events. I've described the approach here: https://www.decodable.co/blog/logical-replication-from-postg.... Postgres 17 should make this even simpler.
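
For reference, the bookkeeping on the Postgres 16 standby looks roughly like this (slot name and LSN are placeholders; the primary slot's confirmed position has to be fetched out of band):

    -- check the standby slot's position
    SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots;

    -- periodically advance it so it doesn't trail the primary slot too far
    -- (the LSN here is a placeholder read from the primary)
    SELECT pg_replication_slot_advance('cdc_slot', '0/5000000'::pg_lsn);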


Quite a bit of work is going on to support failover slots upstream - https://github.com/search?q=repo%3Apostgres%2Fpostgres+failo... From what I understand, it could get into Postgres 17 (released later this year) if there are no big issues. :)


Are BigQuery's clustered and partitioned tables supported (both as a source and a sink)?

Also, how is the deduplication process handled? Fivetran, for example, creates staging tables and scans the target table. Since it does support BigQuery's integer-based partitioning, a table partitioned by primary key helps with cost optimization.


Yep, the target tables on BQ are automatically clustered on the primary key, and there is an option to further partition them on the peerdb_synced_at column. We've seen this reduce replication costs by 50% and also help reduce costs for off-stream transformations. Sharing a customer case study that you might find interesting :) https://www.peerdb.io/customers/harmonic-customer-story

Regarding the deduplication process, we stage raw CDC data in a staging table and run periodic MERGE operations. The raw table is also auto-clustered and partitioned similarly to the final table, which helps save costs.
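
Not our exact query, but the general shape of that raw-to-final MERGE (BigQuery SQL, with illustrative table and column names) is:

    MERGE `dataset.orders` t
    USING (
      -- keep only the latest staged change per primary key
      SELECT * EXCEPT (rn) FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY synced_at DESC) AS rn
        FROM `dataset.orders_raw`
      ) ranked WHERE rn = 1
    ) s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET amount = s.amount, status = s.status
    WHEN NOT MATCHED AND s.op != 'delete' THEN
      INSERT (id, amount, status) VALUES (s.id, s.amount, s.status)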


You can partition your BigQuery table however you like and Fivetran will leave it in place. I don’t think there’s any benefit to partitioning the staging table.


Why do you recommend a heartbeat table to mitigate WAL slot growth if the PeerDB Stream targets a specific table? Presumably this means that the WAL slot is subscribed to all table changes, even if only specific tables are actually included in the CDC? Why not just subscribe to the WAL that you need?
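
(For other readers, the heartbeat workaround I'm referring to is roughly this, with illustrative names: a tiny table included in the stream and touched on a schedule, so the consumer keeps acknowledging WAL and the slot's restart_lsn advances even when the mirrored tables are idle.)

    CREATE TABLE heartbeat (id int PRIMARY KEY, last_beat timestamptz);
    INSERT INTO heartbeat VALUES (1, now());

    -- run periodically via cron / pg_cron
    UPDATE heartbeat SET last_beat = now() WHERE id = 1;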


What is the difference between using this and Postgres LISTEN/NOTIFY with triggers on each row change?


Triggers have much larger overhead than writing to the replication stream (which most databases should be doing anyway), so if you have a lot of traffic this is faster.
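
To make the comparison concrete, the trigger + LISTEN/NOTIFY approach looks like this (hypothetical orders table); every write pays for the trigger call, NOTIFY payloads are capped at about 8000 bytes, and notifications are lost if no listener is connected. Logical decoding instead reads changes that are already in the WAL, so there's no extra work on the write path:

    CREATE OR REPLACE FUNCTION notify_change() RETURNS trigger AS $$
    BEGIN
      PERFORM pg_notify('changes', row_to_json(NEW)::text);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_notify
    AFTER INSERT OR UPDATE ON orders
    FOR EACH ROW EXECUTE FUNCTION notify_change();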


Sorry, I didn't quite understand. What exactly is this replication stream? Are you talking about the WAL logs? Any idea why Postgres hasn't implemented this natively if it's better?


Would love to know how this compares to Supabase's realtime. https://github.com/supabase/realtime


SQS and EventBridge targets not even on the roadmap? Why? Any specific reason?


So far, we have prioritized connectors that most of our users requested, including Redpanda, PubSub, and Event Hubs. We will add SQS and EventBridge to our backlog. It shouldn't be too hard to add these connectors; for example, adding the PubSub connector took a few weeks. Thanks for your feedback! :)


You're underselling it; I did PubSub in an afternoon. But it was an easy stretch, since PeerDB already had GCP auth logic for BigQuery, and PubSub mostly matches Kafka, so the code only needed a couple of tweaks.


Does PeerDB help in replicating from one Citus cluster to another Citus cluster?

If not, then is there any other solution?


We can use PeerDB to replicate each node (worker and coordinator) from one Citus cluster to corresponding nodes in another Citus cluster. It will require some customization, and I am happy to assist you through the process. Also, just as a heads up, prior to PeerDB, I worked at Citus for 8 years as an early (solutions) engineer, so I have a pretty good understanding of Citus internals. :)



