There are a few different approaches. The main categories, from simplest to most complex:
1) Read replicas with copied data. The most straightforward option, letting you use the same SQL syntax and tooling. Examples: Postgres read replicas and BemiDB (disclaimer: I'm a contributor)
2) Operational databases with integrations. Designed for sub-second real-time use cases; they bring their own extended SQL syntax for things like window functions. Examples: Materialize and RisingWave
3) Analytical databases with syncing. They allow writing and reading data directly and are optimized for analytical workloads. Examples: ClickHouse and DuckDB
4) Data warehouses with ETL. Great for large volumes of data, traditionally used with batch ETL processing. Examples: Snowflake and Redshift
Hey HN! We’re Evgeny and Arjun. We’ve built a managed version of BemiDB that syncs with your existing PostgreSQL database and gives you fast analytical queries without heavy ETL pipelines.
BemiDB Cloud automatically replicates your data into an Apache Iceberg table stored in your own S3-compatible bucket. The data is compressed columnar Parquet under the hood. We embed DuckDB for query execution, and we speak the Postgres wire protocol so you can use all your existing ORMs, BI tools, or notebooks.
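To make the "existing tooling" point concrete, here is a minimal sketch of connecting with a standard Postgres client (psycopg in Python). The hostname, credentials, and table/column names are hypothetical placeholders:

    import psycopg  # psycopg 3; any Postgres driver or ORM should work the same way

    # Placeholder DSN pointing at a BemiDB endpoint
    conn_str = "postgresql://user:password@bemidb-host:5432/dbname"

    with psycopg.connect(conn_str) as conn:
        # An aggregation that benefits from columnar Parquet scans,
        # written in plain Postgres-flavored SQL.
        rows = conn.execute(
            """
            SELECT date_trunc('month', created_at) AS month,
                   count(*)                        AS orders,
                   sum(total_amount)               AS revenue
            FROM orders
            GROUP BY 1
            ORDER BY 1
            """
        ).fetchall()

        for month, orders, revenue in rows:
            print(month, orders, revenue)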
We’ve seen teams push Postgres read replicas too far for analytics or wrangle big ETL flows. We’re aiming for something simpler. Instead of standing up a data warehouse with complex pipelines, you can point BemiDB at your Postgres instance, click to sync, and start querying. All data remains in an open format so you aren’t locked in.
Our initial approach is to do full table re-syncs periodically. We're starting to work on CDC with logical replication for incremental syncing, supporting inserts/updates/deletes according to the Iceberg spec. In short, it'd produce "diff" Parquet files and "stitch" them together using metadata (enabling time travel queries, schema evolution, etc.). Here is our roadmap: https://github.com/BemiHQ/BemiDB#future-roadmap
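For illustration, here is a rough sketch (using pyiceberg directly, not anything BemiDB-specific) of what the Iceberg metadata enables once those snapshots exist; the metadata path is a hypothetical placeholder and S3 credentials are assumed to be configured in the environment:

    from pyiceberg.table import StaticTable

    # Point directly at a table's current metadata file in object storage (placeholder path).
    table = StaticTable.from_metadata(
        "s3://my-bucket/iceberg/public/orders/metadata/v42.metadata.json"
    )

    # Each sync produces a new snapshot; older ones stay queryable (time travel).
    for snapshot in table.snapshots():
        print(snapshot.snapshot_id, snapshot.timestamp_ms)

    # Read the table as of an older snapshot: the reader "stitches" together the
    # Parquet files referenced by that snapshot's manifests.
    old_snapshot_id = table.snapshots()[0].snapshot_id
    arrow_table = table.scan(snapshot_id=old_snapshot_id).to_arrow()
    print(arrow_table.num_rows)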
That's why our current approach is to build the missing or partially functional pieces ourselves so we can move fast. For example, DuckDB doesn't read Iceberg tables fully according to the spec, can't write to them, etc.
Great ideas! We'll keep this suggestion about read/write separation in mind. We started with a simple unified solution, but we'll keep iterating, listening to feedback, and addressing it :)
- Separating compute and storage simplifies managing the system by making compute "ephemeral"
- Compute resources can be scaled separately without worrying about scaling storage
- Object storage provides much higher durability (99.999999999% on S3) compared to disks
- Open table formats on S3 become a universal interface in the data space, allowing you to bring in many other data tools if necessary (see the sketch after this list)
- Costs at scale can actually be lower since there is no data transfer cost within the same region. For example, you can check out WarpStream (Kafka on object storage) case studies that claim 5-10x savings
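As a rough illustration of the "universal interface" point above, other tools can read the same Parquet files straight out of the bucket; the bucket and prefix below are hypothetical placeholders, and credentials are assumed to be set in the environment:

    import pyarrow.dataset as ds

    # Placeholder prefix containing the synced Parquet data files.
    dataset = ds.dataset(
        "s3://my-bucket/iceberg/public/orders/data/",
        format="parquet",
    )

    # The same files can be handed to pandas, Spark, Trino, etc. without copying them.
    # (Note: reading data files directly skips Iceberg's snapshot metadata;
    # use an Iceberg-aware reader when you need an exact snapshot.)
    table = dataset.to_table(columns=["created_at", "total_amount"])
    print(table.num_rows)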
I'd say that querying data from S3 is not ideal when low-latency queries are required. Generally, there can be a few roundtrip requests to fetch metadata (JSON, Avro) and data (Parquet) files, which may add around 1s of latency. However, we have caching on our roadmap (it could be as simple as a TTL for the fetched data, or more sophisticated caching depending on the synced & queried data)
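To illustrate the "simple TTL" idea, here is a purely hypothetical sketch (not how BemiDB implements caching): keep fetched S3 objects, both Iceberg metadata and Parquet files, in memory for a short period to avoid repeated roundtrips on hot queries.

    import time

    class TTLCache:
        def __init__(self, ttl_seconds: float = 60.0):
            self.ttl = ttl_seconds
            self._entries: dict[str, tuple[float, bytes]] = {}

        def get(self, key: str, fetch):
            """Return cached bytes for `key`, or call `fetch()` and cache the result."""
            now = time.monotonic()
            entry = self._entries.get(key)
            if entry is not None and now - entry[0] < self.ttl:
                return entry[1]
            value = fetch()
            self._entries[key] = (now, value)
            return value

    # Usage with a hypothetical S3 client:
    #   cache = TTLCache(ttl_seconds=30)
    #   data = cache.get("s3://bucket/metadata/v42.metadata.json",
    #                    lambda: s3.get_object(...)["Body"].read())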
BemiDB natively supports two storage layers: local disk and S3 (we assumed that most people would choose S3 in production environments to simplify management).
When I query Iceberg tables stored on a local SSD, it's super fast.
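For a sense of the gap, here is an illustrative sketch with plain DuckDB (not BemiDB internals) timing the same Parquet scan from a local SSD path vs. an S3 path. Both paths are hypothetical placeholders, and S3 credentials are assumed to be configured:

    import time
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")  # needed for s3:// reads
    con.execute("LOAD httpfs")

    def timed_count(path: str) -> None:
        start = time.monotonic()
        count = con.execute(f"SELECT count(*) FROM read_parquet('{path}')").fetchone()[0]
        print(f"{path}: {count} rows in {time.monotonic() - start:.2f}s")

    timed_count("/mnt/ssd/bemidb/orders/*.parquet")        # local disk: no network roundtrips
    timed_count("s3://my-bucket/bemidb/orders/*.parquet")  # S3: metadata + range requests add latency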