
I'm duck-curious.

Looking at how it's deployed, as an in-process database, how do people actually use this in production? Trying to figure out where I might actually want to think about replacing current databases or analyses with DuckDB.

E.g., if you deployed new code:

1. Do you have a stateful machine where you're doing an old-school "kill the old process, start the new process" deploy, and there's some duckdb file on disk that is maintained?

2. Or do you back that duckdb file with some sort of shared disk (e.g. EBS), and have a rolling deploy where multiple applications access the same DB at the same time?

3. Or is DuckDB treated as ephemeral, and you're using it to process data on the fly, so persisted state isn't an issue?




We use DuckDB extensively where I work (https://watershed.com). The primary way we're using it is to query Parquet-formatted files stored in GCS, and we have some machinery to make that doable on demand for reporting and analysis "online" queries.


Storing data in Parquet files and querying via DuckDB is fast and kind of magical.


Shockingly fast, and having the intermediate files be immutable is super nice.


Haven't used it yet, but this aspect seems very appealing.


Do you load the Parquet files in duckdb or just query them directly?


We query them directly in most cases with the httpfs plugin, but for hot paths we fetch them and cache them on disk locally.
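Roughly, the direct-query path looks like this; a minimal sketch, assuming GCS is reached through its S3-compatible endpoint (bucket, columns, and credentials here are all made up):

    INSTALL httpfs;
    LOAD httpfs;

    -- point the S3 API at GCS (HMAC credentials assumed)
    SET s3_endpoint = 'storage.googleapis.com';
    SET s3_access_key_id = '<hmac key>';
    SET s3_secret_access_key = '<hmac secret>';

    -- query the Parquet files in place, no load step
    SELECT customer_id, sum(amount) AS total_spend
    FROM read_parquet('s3://my-bucket/events/*.parquet')
    GROUP BY customer_id;

    -- hot path: fetch the file first and read it from local disk
    SELECT count(*) FROM read_parquet('/var/cache/events/2023-01.parquet');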


There's a great podcast/interview with the creator of duckdb. He's pretty clear about thinking of the use case as more or less equivalent to mysql but for aggregated queries. I think trying to use it in place of something like a fully fledged postgres server might get weird, less because of any issues with duckdb and more because that isn't what it's designed for.

Podcast/interview link for anyone interested: https://www.dataengineeringpodcast.com/duckdb-in-process-ola...


Yes, using it in production - stateless and ephemeral. For sure there's a learning curve.


I see. Would it be fair to say you treat it almost like Pandas, except that it has a lower memory footprint since data is written to disk instead of memory? I.e. you use it for on-the-fly analysis of large frames of data, not like a more traditional database/data warehouse?


I'm not sure they're saying that.

BTW, your questions are exactly the ones I've been asking over the last few months, but with a lot of focus over the last few days. Still learning as much as I can, so the following might not be true.

For what it's worth, there's a difference between using duckdb to query a set of files vs loading a bunch of files into a table. But once the data has been loaded into a table it can be backed up as a duckdb db file.

Therefore it might be more performant to preprocess duckdb db files (perhaps a process that works in conjunction with whatever manages your external tables) and load these db files into duckdb as needed (on-the-fly analysis), instead of loading data files into duckdb, transforming, and running CTAS every time.

https://duckdb.org/docs/sql/statements/attach
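A rough sketch of what I mean (file and table names are made up): a preprocessing step materializes the .db file from the raw Parquet once, and the query path just attaches it read-only.

    -- preprocessing step: build the .db file once
    ATTACH 'daily_rollup.duckdb' AS rollup;
    CREATE TABLE rollup.events AS
        SELECT * FROM read_parquet('raw/events_*.parquet');
    DETACH rollup;

    -- query path: attach the prebuilt file read-only and query it
    ATTACH 'daily_rollup.duckdb' AS rollup (READ_ONLY);
    SELECT event_type, count(*) FROM rollup.events GROUP BY event_type;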

Of course, all of this might introduce more latency, especially if you're trying to do near-real-time (NRT) analytics.

I assume you could partition your data into multiple db files similar to how you would probably do it with your data files (managing external tables).


Ah, almost like a lower-level Trino, where DuckDB is able to push out queries to different data storage endpoints?


Unfortunately not. At least not without a little intervention. See this blog post for more details about what I mean. They inspect the iceberg table's catalogue to list the related parquet files and then load them into duckdb.

https://www.arecadata.com/getting-started-with-iceberg-using...
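I.e. once the catalogue has told you which data files belong to the current snapshot, the query side is just something like this (paths made up):

    -- read_parquet accepts an explicit list of files, so you can
    -- query exactly the data files the catalogue points at
    SELECT *
    FROM read_parquet([
        's3://warehouse/my_table/data/part-00000.parquet',
        's3://warehouse/my_table/data/part-00001.parquet'
    ]);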

You would still need to interact with some kind of catalogue to understand which .db files you need to fetch.

And honestly I don't really know or understand the performance implications of the attach command.

I'm excited to see if the duckdb team will be able to integrate with external tables directly one day. (not that the data files would be .db files)

Imagine this:

1) you have an externally managed external table (Iceberg, Delta, etc., managed by Glue, Databricks, etc.)

2) register this table in duckdb

    CREATE OR REPLACE EXTERNAL TABLE my_table ... TYPE = 'ICEBERG' CATALOG = 's3://...' CREDENTIALS = '...' etc
3) simply interact with the table in duckdb as you would any other table


Good questions. I'm thinking about evaluating it along with pola.rs as a data science tool. Would be interesting to hear from someone using it already.



