I suppose that leads to a broader question: when should you use an in-memory database, and when should you use a data frame library? The distinction between the two seems to be getting blurry (which maybe is a good thing).
Very blurry. The answer now is just "whichever is easier for the small part of the task right now". Since duckdb happily talks arrow, you can use pandas for part of it, quickly do some SQL where that is easier (with no data copying) then switch back to pandas for something. You don't really have to choose which one to use any more.
Exactly. In my DuckDB workflow I use Pandas data frames and DuckDB queries interchangeably.
import duckdb as db
import pandas as pd
df = pd.read_excel(“z.xlsx”)
df2 = db.query(“select * from df join ‘s3://bucket/a.parquet’ b on df.col b.col”).df()
df3 = df2.col.apply(lambda x: x)
DuckDB can refer any Pandas data frame in the namespace as a SQL object. You can query across Parquet, CSV and Pandas data frames seamlessly.
Need to join Excel with Parquet with CSV? No problem. You can do it all within DuckDB.