
I'm duck-curious.

Looking at how it's deployed, as an in-process database, how do people actually use this in production? Trying to figure out where I might actually want to think about replacing current databases or analyses with DuckDB.

E.g., if you deployed new code:

1. Do you have a stateful machine where you're doing an old-school "kill the old process, start the new process" deploy, and there's some duckdb file on disk that is maintained?

2. Or do you back that duckdb file with some sort of shared disk (e.g. EBS), and have a rolling deploy where multiple applications access the same DB at the same time?

3. Or is DuckDB treated as ephemeral, and you're using it to process data on the fly, so persisted state isn't an issue?




We use DuckDB extensively where I work (https://watershed.com). The primary way we're using it is to query Parquet-formatted files stored in GCS, and we have some machinery to make that doable on demand for reporting and analysis "online" queries.


Storing data in Parquet files and querying via DuckDB is fast and kind of magical.


Shockingly fast, and having the intermediate files be immutable is super nice.


Haven't used it yet, but this aspect seems very appealing.


Do you load the Parquet files in duckdb or just query them directly?


We query them directly in most cases with the httpfs plugin, but for hot paths we fetch them and cache them on disk locally.
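Roughly, the direct-query path looks like this; a minimal sketch, assuming GCS is reached through its S3-compatible endpoint (bucket, columns, and credentials here are all made up):

    INSTALL httpfs;
    LOAD httpfs;

    -- point the S3 API at GCS (HMAC credentials assumed)
    SET s3_endpoint = 'storage.googleapis.com';
    SET s3_access_key_id = '<hmac key>';
    SET s3_secret_access_key = '<hmac secret>';

    -- query the Parquet files in place, no load step
    SELECT customer_id, sum(amount) AS total_spend
    FROM read_parquet('s3://my-bucket/events/*.parquet')
    GROUP BY customer_id;

    -- hot path: fetch the file first and read it from local disk
    SELECT count(*) FROM read_parquet('/var/cache/events/2023-01.parquet');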


There's a great podcast/interview with the creator of duckdb. He's pretty clear about thinking of the use case as more or less equivalent to mysql but for aggregated queries. I think trying to use it in place of something like a fully fledged postgres server might get weird, less because of any issues with duckdb and more because that isn't what it's designed for.

Podcast/interview link for anyone interested: https://www.dataengineeringpodcast.com/duckdb-in-process-ola...


Yes, using it in production - stateless and ephemeral. For sure there's a learning curve.


I see. Would it be fair to say you treat it almost like Pandas, except that it has a lower memory footprint since data is written to disk instead of memory? I.e. you use it for on-the-fly analysis of large frames of data, not like a more traditional database/data warehouse?


I'm not sure they're saying that.

BTW, your questions are exactly the ones I've been asking over the last few months, but with a lot of focus over the last few days. Still learning as much as I can, so the following might not be true.

For what it's worth, there's a difference between using duckdb to query a set of files vs loading a bunch of files into a table. But once the data has been loaded into a table it can be backed up as a duckdb db file.

Therefore it might be more performant to preprocess duckdb db files (perhaps a process that works in conjunction with whatever manages your external tables) and load these db files into duckdb as needed (on-the-fly analysis), instead of loading data files into duckdb, transforming, and running CTAS every time.

https://duckdb.org/docs/sql/statements/attach
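A rough sketch of what I mean (file and table names are made up): a preprocessing step materializes the .db file from the raw Parquet once, and the query path just attaches it read-only.

    -- preprocessing step: build the .db file once
    ATTACH 'daily_rollup.duckdb' AS rollup;
    CREATE TABLE rollup.events AS
        SELECT * FROM read_parquet('raw/events_*.parquet');
    DETACH rollup;

    -- query path: attach the prebuilt file read-only and query it
    ATTACH 'daily_rollup.duckdb' AS rollup (READ_ONLY);
    SELECT event_type, count(*) FROM rollup.events GROUP BY event_type;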

Of course, all of this might introduce more latency, especially if you're trying to do near-real-time (NRT) analytics.

I assume you could partition your data into multiple db files similar to how you would probably do it with your data files (managing external tables).


Ah, almost like a lower-level Trino, where DuckDB is able to push out queries to different data storage endpoints?


Unfortunately not. At least not without a little intervention. See this blog post for more details about what I mean. They inspect the iceberg table's catalogue to list the related parquet files and then load them into duckdb.

https://www.arecadata.com/getting-started-with-iceberg-using...
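I.e. once the catalogue has told you which data files belong to the current snapshot, the query side is just something like this (paths made up):

    -- read_parquet accepts an explicit list of files, so you can
    -- query exactly the data files the catalogue points at
    SELECT *
    FROM read_parquet([
        's3://warehouse/my_table/data/part-00000.parquet',
        's3://warehouse/my_table/data/part-00001.parquet'
    ]);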

You would still need to interact with some kind of catalogue to understand which .db files you need to fetch.

And honestly I don't really know or understand the performance implications of the attach command.

I'm excited to see if the duckdb team will be able to integrate with external tables directly one day. (not that the data files would be .db files)

Imagine this:

1) you have an externally managed external table (Iceberg, Delta, etc., managed by Glue, Databricks, etc.)

2) register this table in duckdb

    CREATE OR REPLACE EXTERNAL TABLE my_table ... TYPE = 'ICEBERG' CATALOG = 's3://...' CREDENTIALS = '...' etc
3) simply interact with the table in duckdb as you would any other table


Good questions. I'm thinking about evaluating it along with pola.rs as a data science tool. Would be interesting to hear from someone using it already.



