
I suppose that leads to a broader question: when should you use an in-memory database, and when should you use a data frame library? The distinction between the two seems to be getting blurry (which maybe is a good thing).



Very blurry. The answer now is just "whichever is easier for the part of the task at hand". Since DuckDB happily talks Arrow, you can use pandas for part of it, quickly do some SQL where that's easier (with no data copying), then switch back to pandas for the next step. You don't really have to choose which one to use any more.
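A rough sketch of what that back-and-forth looks like (the data frame and column names here are made up):

  import duckdb
  import pandas as pd

  df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})

  # hop into SQL where it's easier; DuckDB scans the local df directly
  out = duckdb.query("select id, value from df where value > 15").df()

  # ...and straight back to pandas for whatever comes next
  out["value"] = out["value"] * 2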


Exactly. In my DuckDB workflow I use Pandas data frames and DuckDB queries interchangeably.

  import duckdb as db
  import pandas as pd

  # Excel -> pandas data frame
  df = pd.read_excel("z.xlsx")

  # join the data frame against a remote Parquet file in SQL
  df2 = db.query("select * from df join 's3://bucket/a.parquet' b on df.col = b.col").df()

  # and back to plain pandas
  df3 = df2.col.apply(lambda x: x)

DuckDB can refer to any Pandas data frame in the namespace as a SQL object. You can query across Parquet, CSV, and Pandas data frames seamlessly.

Need to join Excel with Parquet with CSV? No problem. You can do it all within DuckDB.
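Roughly like this (file names and join keys are placeholders; reading from S3 assumes the httpfs extension is available):

  import duckdb
  import pandas as pd

  # Excel comes in through pandas, then shows up as a table in SQL
  xl = pd.read_excel("orders.xlsx")

  result = duckdb.query("""
      select *
      from xl
      join 'customers.csv' c on xl.customer_id = c.customer_id
      join 's3://bucket/products.parquet' p on xl.product_id = p.product_id
  """).df()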



