Given that it's deployed as an in-process database, how do people actually use this in production? I'm trying to figure out where I might actually want to replace current databases or analyses with DuckDB.
E.g., if you deployed new code:
1. Do you have a stateful machine where you do an old-school "kill the old process, start the new process" deploy, with some DuckDB file on disk that is maintained across deploys?
2. Or do you put that DuckDB file on some sort of shared disk (e.g. EBS) and do a rolling deploy where multiple application instances access the same DB at the same time?
3. Or is DuckDB treated as ephemeral, and you're using it to process data on the fly, so persisted state isn't an issue? (See the sketch below for what I mean.)
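To make option 3 concrete, here's a minimal sketch of what I mean by ephemeral, on-the-fly use; the input file and query are made up:

```python
import duckdb

# Purely in-memory database: nothing is persisted, so a redeploy
# simply starts over with a fresh connection.
con = duckdb.connect(':memory:')
summary = con.execute("""
    SELECT status, COUNT(*) AS n
    FROM read_csv_auto('requests.csv')  -- hypothetical input file
    GROUP BY status
""").fetchall()
print(summary)
```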
We use DuckDB extensively where I work (https://watershed.com). The primary way we use it is to query Parquet-formatted files stored in GCS, and we have some machinery to make that doable on demand for "online" reporting and analysis queries.
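Roughly this shape, in other words. This is a minimal sketch, not our actual machinery: it assumes DuckDB's httpfs extension and a GCS HMAC key via the secrets manager (DuckDB 0.10+), and the bucket, path, and columns are made up:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# GCS access via DuckDB's secrets manager; the HMAC key values
# here are placeholders.
con.execute("""
    CREATE SECRET gcs_creds (
        TYPE GCS,
        KEY_ID 'my-hmac-key-id',
        SECRET 'my-hmac-secret'
    )
""")

# Aggregate directly over Parquet files in the bucket: no load step.
rows = con.execute("""
    SELECT account_id, SUM(emissions_kg) AS total
    FROM read_parquet('gs://example-bucket/reports/*.parquet')
    GROUP BY account_id
""").fetchall()
```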
There's a great podcast interview with the creator of DuckDB. He's pretty clear about thinking of the use case as more or less equivalent to MySQL, but for aggregate queries. I think trying to use it in place of something like a fully fledged Postgres server might get weird, less because of any issues with DuckDB and more because that isn't what it's designed for.
I see. Would it be fair to say you treat it almost like Pandas, except that it has a lower memory footprint since data can stay on disk instead of in memory? I.e., you use it for on-the-fly analysis of large frames of data, not as a more traditional database/data warehouse?
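Roughly the contrast I have in mind (a sketch only; the file and columns are hypothetical):

```python
import duckdb

# DuckDB streams the Parquet file and only materializes the small
# aggregate result, so peak memory stays well below the data size.
con = duckdb.connect()
top = con.execute("""
    SELECT user_id, COUNT(*) AS events
    FROM read_parquet('events.parquet')  -- hypothetical file
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").df()  # small result handed back as a pandas DataFrame

# The Pandas equivalent loads the whole frame into memory first:
#   import pandas as pd
#   df = pd.read_parquet('events.parquet')
#   top = df.groupby('user_id').size().nlargest(10)
```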
BTW, your questions are exactly the ones I've been asking over the last few months, with a lot of focus over the last few days. I'm still learning as much as I can, so the following might not be entirely accurate.
For what it's worth, there's a difference between using DuckDB to query a set of files and loading a bunch of files into a table. But once the data has been loaded into a table, it can be backed up as a DuckDB database file.
It might therefore be more performant to preprocess DuckDB database files (perhaps in a process that works in conjunction with whatever manages your external tables) and load those files into DuckDB as needed for on-the-fly analysis, instead of loading data files into DuckDB, transforming them, and running a CTAS every time. Something like the sketch below.
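A two-step flow of this shape, where the file and table names are made up and the details of the preprocessing step would depend on your setup:

```python
import duckdb

# Step 1 (preprocessing, done once per batch of files): load raw
# Parquet into a persistent DuckDB database file.
con = duckdb.connect('sales.duckdb')  # hypothetical file name
con.execute("""
    CREATE TABLE sales AS
    SELECT * FROM read_parquet('raw/sales_*.parquet')
""")
con.close()

# Step 2 (on-the-fly analysis): open the prebuilt file read-only
# instead of re-ingesting and re-transforming the raw data each time.
con = duckdb.connect('sales.duckdb', read_only=True)
print(con.execute("SELECT COUNT(*) FROM sales").fetchone())
```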
Unfortunately not. At least not without a little intervention. See this blog post for more details about what I mean: they inspect the Iceberg table's catalogue to list the related Parquet files and then load them into DuckDB.
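If I've understood it right, the approach is roughly this shape. This is a sketch rather than the post's actual code: I'm assuming pyiceberg as the catalogue client, and the catalog name and table name are made up:

```python
import duckdb
from pyiceberg.catalog import load_catalog

# Ask the Iceberg catalogue which Parquet files make up the table's
# current snapshot (assumes a catalog configured as 'default').
catalog = load_catalog('default')
table = catalog.load_table('analytics.events')  # hypothetical table
files = [task.file.file_path for task in table.scan().plan_files()]

# Then hand that file list straight to DuckDB.
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
file_list = ", ".join(f"'{f}'" for f in files)
count = con.execute(
    f"SELECT COUNT(*) FROM read_parquet([{file_list}])"
).fetchone()
```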