Also a tip: for interactive queries, do not store Parquet in S3.
S3 is high-throughput but high-latency storage. It's good for bulk reads, but not for random reads, and querying Parquet involves random reads. Parquet on S3 is fine for batch jobs (like Spark jobs), but it's very slow for interactive queries (Presto, Athena, DuckDB).
The solution is to store Parquet on low-latency storage. S3 has an offering called S3 Express One Zone (low-latency S3 that costs a bit more). Or use EBS, which is block storage that doesn't suffer from S3's high latency.
You can do realtime in the sense that you can build NumPy arrays in memory from realtime data and then use them as columns in DuckDB. This is the approach I took when designing KlongPy to interop array operations with DuckDB.
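For illustration, a minimal sketch of that pattern (not KlongPy itself; the names and data are made up): build NumPy arrays, wrap them as columns, and let DuckDB scan them in place.

    import duckdb
    import numpy as np
    import pandas as pd

    # Pretend these arrays were just built from a realtime feed
    # (names and values are made up for the example).
    ts = np.arange(5, dtype=np.int64)
    price = np.array([10.0, 10.5, 10.4, 10.9, 11.2])

    # Wrap the arrays as columns; DuckDB scans the DataFrame in place,
    # so there's no copy into a separate database file.
    ticks = pd.DataFrame({"ts": ts, "price": price})

    # DuckDB's replacement scan picks up the local variable `ticks` by name.
    print(duckdb.sql("SELECT ts, price FROM ticks WHERE price > 10.4").df())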
Not real time, just historical. (I don’t see why it can’t be used for real time though... but haven’t thought through the caveats)
Also, not sure what you mean by Parquet is not good at appending? On the contrary, Parquet is designed for an append-only paradigm (like Hadoop back in the day). You can just drop a new parquet file and it’s appended.
If you have 1.parquet, all you have to do is drop 2.parquet in the same folder or Hive hierarchy. Then query:
    SELECT * FROM '*.parquet';
DuckDB automatically scans all the Parquet files in that directory structure when it queries. If there's a predicate, it uses the min/max statistics in each file's footer metadata to skip files that don't contain the requested data, so it's very fast.
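In code, the "append" is literally just writing a new file next to the old ones (a rough sketch; the ticks/ directory and column names are made up):

    import os
    import duckdb
    import pandas as pd

    os.makedirs("ticks", exist_ok=True)

    # "Append" by writing a brand-new file; nothing already on disk is touched.
    new_batch = pd.DataFrame({"ts": [100, 101], "price": [11.3, 11.1]})
    duckdb.sql("COPY (SELECT * FROM new_batch) TO 'ticks/2.parquet' (FORMAT PARQUET)")

    # The glob picks up 1.parquet, 2.parquet, and anything added later.
    print(duckdb.sql("SELECT count(*) FROM 'ticks/*.parquet'").df())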
In practice we use a directory structure called Hive partitioning, which helps DuckDB do partition elimination to skip over irrelevant partitions, making it even faster.
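A hedged sketch of what that looks like (the events.parquet file and event_date column are just illustrative):

    import duckdb

    # PARTITION_BY writes a Hive-style hierarchy: events/event_date=2024-01-01/...
    duckdb.sql("""
        COPY (SELECT * FROM 'events.parquet')
        TO 'events' (FORMAT PARQUET, PARTITION_BY (event_date))
    """)

    # hive_partitioning reads the directory names back as a column, and the
    # WHERE clause lets DuckDB skip every partition except the one it needs.
    print(duckdb.sql("""
        SELECT count(*)
        FROM read_parquet('events/*/*.parquet', hive_partitioning = true)
        WHERE event_date = '2024-01-01'
    """).df())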
Now, it's not so good at updating, because Parquet is a write-once format (not read-write). Updating a single record entails regenerating the entire Parquet file. So if you have late-arriving updates, you need extra work to identify the affected partition and overwrite it. Either that, or use bitemporal modeling (add a data-arrival timestamp [1]) and add a latest-arrival clause to your query (which costs more compute). If you have a scenario where existing data changes a lot, Parquet is not a good format for you; look into Timescale (a time-series database built on Postgres) instead.
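For the bitemporal route, the query side can look something like this sketch (event_id and arrival_ts are made-up column names): corrections land as new rows with a later arrival_ts, and the query keeps only the newest version of each key.

    import duckdb

    # Keep only the most recently arrived version of each event_id.
    print(duckdb.sql("""
        SELECT *
        FROM 'events/*/*.parquet'
        QUALIFY row_number() OVER (PARTITION BY event_id ORDER BY arrival_ts DESC) = 1
    """).df())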