There are a few different approaches. The main categories, from simplest to most complex:
1) Read replicas with copied data. The most straightforward option, letting you use the same SQL syntax and tooling. Examples: Postgres read replicas and BemiDB (disclaimer: I'm a contributor)
2) Operational databases with integrations. Designed for sub-second real-time use cases; they bring their own extended SQL syntax for things like window functions. Examples: Materialize and RisingWave
3) Analytical databases with syncing. They allow writing and reading data directly and are optimized for analytical workloads. Examples: ClickHouse and DuckDB
4) Data warehouses with ETL. Great for large volumes of data, traditionally used with batch ETL processing. Examples: Snowflake and Redshift
Hey HN! We’re Evgeny and Arjun. We’ve built a managed version of BemiDB that syncs with your existing PostgreSQL database and gives you fast analytical queries without heavy ETL pipelines.
BemiDB Cloud automatically replicates your data into an Apache Iceberg table stored in your own S3-compatible bucket. The data is compressed columnar Parquet under the hood. We embed DuckDB for query execution, and we speak the Postgres wire protocol so you can use all your existing ORMs, BI tools, or notebooks.
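To make the "existing tooling" point concrete, here is a minimal sketch of connecting with a standard Postgres client (psycopg in Python). The hostname, credentials, and table/column names are hypothetical placeholders:

    import psycopg  # psycopg 3; any Postgres driver or ORM should work the same way

    # Placeholder DSN pointing at a BemiDB endpoint
    conn_str = "postgresql://user:password@bemidb-host:5432/dbname"

    with psycopg.connect(conn_str) as conn:
        # An aggregation that benefits from columnar Parquet scans,
        # written in plain Postgres-flavored SQL.
        rows = conn.execute(
            """
            SELECT date_trunc('month', created_at) AS month,
                   count(*)                        AS orders,
                   sum(total_amount)               AS revenue
            FROM orders
            GROUP BY 1
            ORDER BY 1
            """
        ).fetchall()

        for month, orders, revenue in rows:
            print(month, orders, revenue)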
We’ve seen teams push Postgres read replicas too far for analytics or wrangle big ETL flows. We’re aiming for something simpler. Instead of standing up a data warehouse with complex pipelines, you can point BemiDB at your Postgres instance, click to sync, and start querying. All data remains in an open format so you aren’t locked in.
Our initial approach is to do full table re-syncs periodically. We're starting to work on CDC with logical replication for incremental syncing, supporting inserts/updates/deletes according to the Iceberg spec. In short, it'd produce "diff" Parquet files and "stitch" them together using metadata (enabling time travel queries, schema evolution, etc.). Here is our roadmap: https://github.com/BemiHQ/BemiDB#future-roadmap
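For illustration, here is a rough sketch (using pyiceberg directly, not anything BemiDB-specific) of what the Iceberg metadata enables once those snapshots exist; the metadata path is a hypothetical placeholder and S3 credentials are assumed to be configured in the environment:

    from pyiceberg.table import StaticTable

    # Point directly at a table's current metadata file in object storage (placeholder path).
    table = StaticTable.from_metadata(
        "s3://my-bucket/iceberg/public/orders/metadata/v42.metadata.json"
    )

    # Each sync produces a new snapshot; older ones stay queryable (time travel).
    for snapshot in table.snapshots():
        print(snapshot.snapshot_id, snapshot.timestamp_ms)

    # Read the table as of an older snapshot: the reader "stitches" together the
    # Parquet files referenced by that snapshot's manifests.
    old_snapshot_id = table.snapshots()[0].snapshot_id
    arrow_table = table.scan(snapshot_id=old_snapshot_id).to_arrow()
    print(arrow_table.num_rows)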
That's why our current approach is to build the missing or partially functional pieces ourselves so we can move fast. For example, DuckDB doesn't read Iceberg tables fully according to the spec, can't write to them, etc.
Great ideas! We'll keep this suggestion about read/write separation in mind. We started with a simple unified solution, but we'll keep iterating, listening to feedback, and addressing it :)
- Separating compute and storage simplifies managing the system by making compute "ephemeral"
- Compute resources can be scaled separately without worrying about scaling storage
- Object storage provides much higher durability (99.999999999% on S3) compared to disks
- Open table formats on S3 become a universal interface in the data space, allowing you to bring in many other data tools if necessary (see the sketch after this list)
- Costs at scale can actually be lower since there is no data transfer cost within the same region. For example, you can check out WarpStream (Kafka on object storage) case studies that claim 5-10x savings
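As a rough illustration of the "universal interface" point above, other tools can read the same Parquet files straight out of the bucket; the bucket and prefix below are hypothetical placeholders, and credentials are assumed to be set in the environment:

    import pyarrow.dataset as ds

    # Placeholder prefix containing the synced Parquet data files.
    dataset = ds.dataset(
        "s3://my-bucket/iceberg/public/orders/data/",
        format="parquet",
    )

    # The same files can be handed to pandas, Spark, Trino, etc. without copying them.
    # (Note: reading data files directly skips Iceberg's snapshot metadata;
    # use an Iceberg-aware reader when you need an exact snapshot.)
    table = dataset.to_table(columns=["created_at", "total_amount"])
    print(table.num_rows)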
I'd say that querying data from S3 is not ideal when low-latency queries are required. Generally, there can be a few roundtrip requests to fetch metadata (JSON, Avro) and data (Parquet) files, which may add around 1s of latency. However, we have caching on our roadmap (it could be as simple as a TTL for the fetched data, or more sophisticated caching depending on the synced & queried data)
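To illustrate the "simple TTL" idea, here is a purely hypothetical sketch (not how BemiDB implements caching): keep fetched S3 objects, both Iceberg metadata and Parquet files, in memory for a short period to avoid repeated roundtrips on hot queries.

    import time

    class TTLCache:
        def __init__(self, ttl_seconds: float = 60.0):
            self.ttl = ttl_seconds
            self._entries: dict[str, tuple[float, bytes]] = {}

        def get(self, key: str, fetch):
            """Return cached bytes for `key`, or call `fetch()` and cache the result."""
            now = time.monotonic()
            entry = self._entries.get(key)
            if entry is not None and now - entry[0] < self.ttl:
                return entry[1]
            value = fetch()
            self._entries[key] = (now, value)
            return value

    # Usage with a hypothetical S3 client:
    #   cache = TTLCache(ttl_seconds=30)
    #   data = cache.get("s3://bucket/metadata/v42.metadata.json",
    #                    lambda: s3.get_object(...)["Body"].read())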
BemiDB natively supports two storage layers: local disk and S3 (we assumed that most people would choose S3 in production environments to simplify management).
When I query Iceberg tables stored on a local SSD, it's super fast.
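For a sense of the gap, here is an illustrative sketch with plain DuckDB (not BemiDB internals) timing the same Parquet scan from a local SSD path vs. an S3 path. Both paths are hypothetical placeholders, and S3 credentials are assumed to be configured:

    import time
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")  # needed for s3:// reads
    con.execute("LOAD httpfs")

    def timed_count(path: str) -> None:
        start = time.monotonic()
        count = con.execute(f"SELECT count(*) FROM read_parquet('{path}')").fetchone()[0]
        print(f"{path}: {count} rows in {time.monotonic() - start:.2f}s")

    timed_count("/mnt/ssd/bemidb/orders/*.parquet")        # local disk: no network roundtrips
    timed_count("s3://my-bucket/bemidb/orders/*.parquet")  # S3: metadata + range requests add latency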