Hacker News

Very cool!

Could you share the key differences between this and the previous pg_analytics, and the motivation for making it a separate plugin?



Whereas pg_analytics stores the data in Postgres block storage, pg_lakehouse does not use Postgres storage at all.

This makes it a much simpler (and in our opinion, more elegant) extension. We learned that many of our users already stored their Parquet files in S3, so it made sense to connect directly to S3 rather than asking them to ingest those Parquet files into Postgres.

It also accelerates the path to production readiness, since we're not touching Postgres internals (no need to mess with Postgres MVCC, write-ahead logs, transactions, etc.).


If users already have a data lake of some kind generating Parquet files, the case for using Postgres to query that data is questionable. I think the Postgres way of doing things should be prioritised if you want to keep your product in a unique position.


Can you elaborate on what you mean by the "Postgres way of doing things"? Also, what is wrong with using Postgres to query data in external object stores? It is a common occurrence for businesses to store parquet artefacts in object storage, and querying them is often desirable.


It depends. If you're happy with Databricks, etc., you might be good. But we've seen many users want the simplicity of querying data from Postgres for analytics, especially when JOINing analytics and transactional data.
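A minimal sketch of what such a JOIN might look like, assuming a Parquet-backed foreign table (`sales`, hypothetical) exposed through pg_lakehouse alongside an ordinary transactional table (`customers`). The exact DDL is illustrative, modeled on standard Postgres foreign data wrapper syntax; pg_lakehouse's actual setup may differ:

```sql
-- Hypothetical setup: assumes an S3-backed server object (s3_server)
-- has already been configured via the extension's foreign data wrapper.
CREATE EXTENSION pg_lakehouse;

CREATE FOREIGN TABLE sales ()   -- columns inferred from the Parquet schema
SERVER s3_server
OPTIONS (path 's3://my-bucket/sales/', extension 'parquet');

-- JOIN analytical data in S3 with a local transactional table
SELECT c.name, SUM(s.amount) AS total_spend
FROM sales s
JOIN customers c ON c.id = s.customer_id
GROUP BY c.name;
```

The appeal is that the analytical rows never have to be ingested into Postgres storage; they are scanned from object storage at query time, while `customers` stays a normal heap table.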



