DuckDB – An in-process SQL OLAP database management system (duckdb.org)
161 points by freilanzer on Feb 10, 2023 | 103 comments



DuckDB is terrific. I'm bullish on its potential for simplifying many big data pipelines. Particularly, it's plausible that DuckDB + Parquet could be used on a large SMP machine (32+ cores and 128GB+ memory) to deal with data munging for 100s of gigabytes to several terabytes, all from SQL, without dealing with Hadoop, Spark, Ray, etc.

I have successfully used DuckDB like above for preparing an ML dataset from about 100GB of input.
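For anyone curious what that looks like in practice, a minimal sketch (the file paths and column names here are made up):

    import duckdb

    # Aggregate over a directory of Parquet files; DuckDB streams the scan,
    # so the inputs never have to fit in memory at once
    features = duckdb.query("""
        SELECT user_id,
               count(*)            AS n_events,
               avg(event_duration) AS avg_duration
        FROM read_parquet('input/*.parquet')
        GROUP BY user_id
    """).df()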

DuckDB is undergoing rapid development these days. There have been format-breaking changes and bugs that could lose data. I would not yet trust DuckDB for long-term storage or archival purposes. Parquet is a better choice for that.


I use ClickHouse to store close to 1TB of API analytics data (which would be 10TB in MongoDB; ClickHouse has insane compression) and it's a wonderful and stable SQL-first alternative to DuckDB - which is a very exciting piece of software, but is indeed too young to embed into boring production. The last time I checked the DuckDB npm package, it used callbacks instead of promises/await...


I can understand how the older callback API for node.js might form a negative impression, but it's really not indicative of the maturity of the core db engine at all. And remember: the vast majority of users use the Python API. Even better news is that, as of a couple of months ago, there is now this package (which I wrote at MotherDuck and we have open sourced) which provides typed promise wrappers for the Duckdb API: https://www.npmjs.com/package/duckdb-async. This is an independent npm package for now, but was developed in close coordination with the DuckDb core team.


I'd love to hear real-world experiences from anyone who's tried to run jobs that would usually require a Spark cluster on a single machine with loads of cores and memory.

How big can you go, and how does speed compare to Spark? (I'm guessing significantly faster from my experience using Duckdb on smaller machines)


I have a single-machine EC2 instance with 32 cores and 240GB memory and about 200GB of partitioned Parquet files. I use DuckDB and Python with complex SQL (window functions, inequality joins, quantile functions, etc.) to extract data from this dataset.

Because it’s a single machine (no distributed cluster) DuckDB can heavily parallelize and vectorize. I don’t know if I can give you perf numbers but complex analytic queries over the entire dataset regularly finish in 1-2 mins (not scientific since I’m not telling what kinds of queries I’m running).
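To give a flavor of the kind of query I mean, a sketch only (the table layout and column names are invented):

    import duckdb

    # Window function plus a quantile aggregate over partitioned Parquet
    result = duckdb.query("""
        WITH ranked AS (
            SELECT device_id,
                   reading,
                   row_number() OVER (PARTITION BY device_id ORDER BY event_ts DESC) AS rn
            FROM read_parquet('data/date=*/*.parquet')
        )
        SELECT device_id,
               quantile_cont(reading, 0.95) AS p95_reading,
               count(*)                     AS n
        FROM ranked
        WHERE rn <= 1000   -- most recent 1000 readings per device
        GROUP BY device_id
    """).df()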

I’ve used Spark SQL and DuckDB overall is just more ergonomic, less boilerplate and is much faster since it is so lightweight.

Granted DuckDB can only process data on one machine (whereas Spark can scale up indefinitely by adding machines) but most data sets I work with fit on a single beefy machine.

Distributed computing — most of the time, you ain’t gonna need it.

It’s like StackOverflow: it serves 2B requests a month but only runs on a few on-prem servers. Most people think this is impossible but you can actually do a lot with very few machines if you’re smart about it. Same with data. Big data is overrated.


Thanks - very interesting


Last week I used it as glue for a monthly job, processing ~1B event records, juggling them a bit and pushing them to S3 as 2M parquet segment files.

Mixed experience; I would definitely not put it in a system that isn’t OK with crashing frequently.

It segfaults on Alpine (argh, C++!), and force exits the whole NodeJS process when it gets unexpected HTTP responses from S3.

In an archive run of 2M pushes it’ll crash the process 4-5 times.

Overall still really, really like it, but learned to not trust it


Node + AWS for data ingress has been pretty painful in my experience (mostly dynamo feeds from large csv (whois database)). In the end, rewrote in C# (core 2) and it was able to complete more reliably. I'm guessing that go and rust would also be better. I like node, really like JS, but I just think that maybe the AWS libraries aren't that great in the space. It would run for 3-5 hours, then just blow up unexpectedly, even with plenty of memory overhead, and not really bandwidth limited, with appropriate retries and slowdown for dynamo rejections.

If I never have to write ETL pipelines again, I won't be upset about it.


There was a bug that was recently fixed in Node 16.17+ that was causing hard Node process crashes when doing stuff with S3, and I think it had to do with receiving multiple packets at once or something.


My experience with a modest machine and a ~100GB dataset was that DuckDB was significantly easier to use and much faster (20x) than Ray Data. Have not compared directly with Spark.

There was no cluster to set up or administer using DuckDB.


spark is just a tool to let you take a computation that would run in an hour on your laptop if coded properly and send it to a server with 1000 cores where it runs in 2 hours.


DuckDB is a relational OLAP store. If you want to do transformations on relational data using SQL then I think nowadays you would look at the modern data stack and do it with DBT.

If you have genuinely big and unstructured data then of course you need a cluster and would reach for Spark.

If you have smallish data then maybe DuckDB has a role, because working with SQL is nicer than Pandas. But a lot of the time you actually need the complexity of Pandas to do the transformation you need.

DuckDB is neat but I still can’t quite convince myself of a killer use case.


I am not sure I understand the first comment very well. Are you saying to use the modern data stack instead of DuckDB? Because DBT and DuckDB don't seem to be at odds; they can work together. FWIW, I think the only important breakthrough in the "modern data stack" is really dbt. The rest, nothing modern about it.


It’s more a comment on where the market is at rather than a recommendation. I’m not saying it is bad tech, but I don’t see its niche.

If you have a few billion records and you want to filter, join, aggregate them using SQL then DBT against an OLAP server solves that issue so well that it doesn’t leave much white space for DuckDB.

I mentioned modern data stack because when you have SaaS, low code, consumption based billing, open source etc then it treads even more on the DuckDB value prop. DuckDB would have been great if Oracle was my only choice, but when I have Snowflake and Clickhouse in the toolbox it is a tougher market for them to carve out a niche.


Also fwiw it's always lowercase - dbt.


> working with SQL is nicer than Pandas

Really? I prefer working with dataframe apis. You get a nice sql-like paradigm plus all the control structures of the runtime.


Databases are just much, much faster than Pandas, and that's before you start factoring in the extraction and loading of data. I treat Pandas as a last resort when I can't do something in SQL; generally this is something like integrating with external services or running recordlinkage.


If you're curious, I've written a FOSS record linkage library that executes everything as SQL. It supports multiple SQL backends including DuckDB and Spark for scale, and runs faster than most competitors because it's able to leverage the speed of these backends: https://github.com/moj-analytical-services/splink


Oh hot tip! Thank you! Love the blog btw


You might be interested in checking out Ibis (https://ibis-project.org/). It provides a dataframe-like API, abstracting over many common execution engines (duckdb, postgres, bigquery, spark, ...). Ibis wrapping duckdb has pretty much replaced pandas as my tool of choice for local data analysis. All the performance of duckdb with all the ergonomics of a dataframe API. (disclaimer: I contribute to Ibis for work).
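A rough sketch of what that looks like (exact method names such as read_parquet can vary between Ibis versions, and the file and column names here are invented):

    import ibis

    con = ibis.duckdb.connect()                  # in-memory DuckDB backend
    events = con.read_parquet("events.parquet")  # lazily defined table expression

    expr = (
        events
        .filter(events.status == "ok")
        .group_by("user_id")
        .aggregate(n=events.count(), avg_dur=events.duration.mean())
    )
    df = expr.execute()  # compiled to SQL, run by duckdb, returned as a pandas DataFrame

The expression stays lazy until execute(), so you keep the dataframe ergonomics while duckdb does the heavy lifting.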


Interesting! One newbie question: how does Ibis differ from SQLAlchemy?


SQLAlchemy is an ORM, whereas Ibis looks to be a dataframe API that is sort of a DSL over SQL. It doesn't try to map relational data to an object-oriented paradigm like SQLAlchemy does.


SQL is much nicer for anything non-trivial. Pandas methods get unwieldy for complex aggregations.

Also Pandas methods are imperative so cannot be optimized. SQL is declarative so it can be optimized to the hilt and DuckDB is faster than Pandas in almost all cases, even on Pandas data frames themselves! (partly due to vectorization).
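For example, DuckDB can run SQL directly against a DataFrame that's already in memory (a small sketch with made-up data):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "temp": [3.1, 22.4, 4.0]})

    # DuckDB picks up local DataFrames by variable name ("replacement scans"),
    # so df can be queried like a table without copying it into a database first
    out = duckdb.query("SELECT city, avg(temp) AS avg_temp FROM df GROUP BY city").df()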


As far as I can tell, DuckDB is an alternative to "data frame" libraries like Data.table, Polars, Pandas, etc. Is that the case? What makes DuckDB a better choice than, say, Polars?


This blog post offers a nice summary: https://motherduck.com/blog/six-reasons-duckdb-slaps/


The blog post doesn't really make a comparison between DuckDB and data frame libraries. It mentions that the DuckDB Python bindings can interoperate with Pandas, but it doesn't really explain why you would use DuckDB instead of Pandas, or Polars (which is both faster and more portable than Pandas).


Don't Polars and Pandas both require your entire data to fit in memory?


Pandas doesn't. Polars I think has some lazy-loading capability, but it's not the default mode of operation and I don't think it supports all features. If DuckDB doesn't, then that's a big advantage.


I think you mean that Pandas does require your entire data to fit in memory? https://pandas.pydata.org/docs/user_guide/scale.html


Yes, thank you, that was a typo.


Apparently DuckDB paired with Arrow lets you work performantly on bigger-than-memory Parquet data (I haven't tried this though)
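Roughly, the pattern would be something like this (also untested on my end; the paths and column names are invented). DuckDB streams the Parquet scan in chunks, so only the aggregated result gets materialized, here as a pyarrow Table:

    import duckdb

    tbl = duckdb.query("""
        SELECT category, sum(amount) AS total
        FROM read_parquet('big/*.parquet')
        GROUP BY category
    """).arrow()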


Arrow seems to be the common denominator for larger-than-RAM data. Polars is completely built on Arrow, so the same advantages should apply.

https://github.com/pola-rs/polars#handles-larger-than-ram-da...


It’s a drop-in alternative to SQLite that’s column-oriented/OLAP. I’ve been profiling entire projects in production, switching between SQLite and DuckDB (no clear conclusions yet)


I suppose that leads to a broader question: when should you use an in-memory database, and when should you use a data frame library? The distinction between the two seems to be getting blurry (which maybe is a good thing).


Very blurry. The answer now is just "whichever is easier for the small part of the task right now". Since duckdb happily talks arrow, you can use pandas for part of it, quickly do some SQL where that is easier (with no data copying) then switch back to pandas for something. You don't really have to choose which one to use any more.


Exactly. In my DuckDB workflow I use Pandas data frames and DuckDB queries interchangeably.

  import duckdb as db
  import pandas as pd

  df = pd.read_excel("z.xlsx")
  df2 = db.query("select * from df join 's3://bucket/a.parquet' b on df.col = b.col").df()
  df3 = df2.col.apply(lambda x: x)

DuckDB can refer to any Pandas data frame in the namespace as a SQL object. You can query across Parquet, CSV and Pandas data frames seamlessly.

Need to join Excel with Parquet with CSV? No problem. You can do it all within DuckDB.


Pandas is in a separate category from all of these, including polars. If you were to say “pandas in long format only” then yes that would be correct, but the power of pandas comes in its ability to work in a long relational or wide ndarray style. Pandas was originally written to replace excel in financial/econometric modeling, not as a replacement for sql. Models written solely in the long relational style are near unmaintainable for constantly evolving models with hundreds of data sources and thousands of interactions being developed and tuned by teams of analysts and engineers.

For example, this is how some basic operations would look in pandas.

Bump prices in 2020 up $1:

    prices_df.loc['2020'] += 1
Add expected temperature offsets to base temperature forecast:

    temp_df + offset_df
Now imagine thousands of such operations, and you can see the necessity of pandas in models like this.


SQL is easier and more natural to work with than Pandas.


Sometimes yes and sometimes no. For example if you had price data and you wanted to bump prices up a dollar in 2020 how would you do that in sql?

In pandas it’s:

    prices_df.loc['2020'] += 1
If you had a temperature forecast and you wanted to add the expected temperature miss to them, how would you do that on sql?

In pandas it’s:

    temps_df + expected_miss_df


How would you do df.T in sql?


df.T is a special Pandas dataframe transpose on the dataframe index and the columns.

DuckDB produces Pandas dataframes, so you would just do df.T. No need to choose between one or the other.

But to answer your original question, the SQL analogue to a transpose is the PIVOT/UNPIVOT pair of operations, which are mathematically rotation operations on invariants (your dimensions). This makes them much more general than a transpose -- which is just a rotation operation on the rows/cols. PIVOT/UNPIVOT work on non-square data and allow you to specify different types of aggregations. PIVOT/UNPIVOT keywords are not yet implemented in DuckDB but are on the roadmap if I'm not mistaken.
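In the meantime the usual workaround is conditional aggregation. A sketch (table and column names invented; sales could equally be a Pandas data frame in scope):

    import duckdb

    # Manual "pivot" via conditional aggregation until the PIVOT keyword lands
    wide = duckdb.query("""
        SELECT region,
               sum(CASE WHEN year = 2021 THEN revenue END) AS revenue_2021,
               sum(CASE WHEN year = 2022 THEN revenue END) AS revenue_2022
        FROM sales
        GROUP BY region
    """).df()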


>DuckDB produces Pandas dataframes, so you would just do df.T. No need to choose between one or the other.

Transforming into a pandas df isn't zero copy.

UNPIVOT and PIVOT are quite verbose compared to df.T.


32 cores and 128GB RAM are now desktop-class specs. Latest generation commodity servers can supply you with hundreds of cores and TBs of RAM. Note: "commodity" != "cheap", at least not necessarily.

Ninja edit before anyone misconstrues this. I am not saying that the typical desktop has these specs. I am saying that the class of hardware that is most commonly run on desktops includes SKUs that can meet this spec. Desktop-class means the same motherboard socket and processor architecture.


Recently tried the GUI tool for DuckDB, forgot what it's called, something like 'Tab', and was quite disappointed. I feel duckdb needs a good tool like sqliteviewer to really take off.


I think you're referring to Tad (https://www.tadviewer.com), which I developed. Tad isn't "the GUI tool for DuckDb"; it's a desktop app that provides a pivot table based viewer for tabular data files (CSV, Parquet, and DuckDb/SQLite database files). It uses DuckDb as its engine, but pre-dates DuckDb and was developed independently. It's listed in the DuckDb docs along with several others tools that work with or use DuckDb. All that said, I'm sorry you found it disappointing, and would welcome any constructive feedback on what specifically you found lacking, either here or to tad-feedback@tadviewer.com.


Then I think I used it wrong, happens often. Thank you for your work. I was/am new to duckdb and since it was listed in the docs I assumed it was something like a SQL viewer.


DBeaver supports duckdb.


Reminds me of this blog post on streaming data to Parquet files and running queries on the data in its native format.

https://pedram.substack.com/p/streaming-data-pipelines-with-...


DuckDB is awesome!

Last year I was working on something using SQLite where users could perform analytical queries that would scan the entire 3GB DB and generate aggregates. It would take at least 45 seconds to do the queries.

I did a dump of the db and imported to DuckDB. The same queries now only take 1.5 seconds with exactly the same SQL.

Obviously there is a trade off, inserts are slower on DuckDB. But for a low write, analytical read app it's perfect.

I tried to use the SQLite connector but struggled to get it working. Need to circle back and have another go.

With their SQLite and Postgres connectors, as a Django dev I would love an app that lets you run specific queries on your DB via DuckDB almost transparently. Would be awesome for analytical dashboards.
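For reference (and for when I circle back to it), the SQLite scanner is supposed to work roughly like this per the docs; untested by me, and the file/table names are placeholders:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL sqlite_scanner;")
    con.execute("LOAD sqlite_scanner;")
    # Make the SQLite file's tables visible to DuckDB
    con.execute("CALL sqlite_attach('app.db');")

    df = con.execute("SELECT count(*) AS n, avg(total) AS avg_total FROM orders").df()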


I'd need a MySQL/MariaDB connector for it to be really useful for me. But still, it looks great.


Yeah, I see it as the sqlite of the reporting world. Especially with columnar stores and parquet support.


I'm duck-curious.

Looking at how it's deployed, as an in process database, how do people actually use this in production? Trying to figure out where I might actually want to think about replacing current databases or analyses with DuckDB.

EG if you deployed new code

1. Do you have a stateful machine where you're doing an old-school "kill the old process, start the new process" deploy, and there's some duckdb file on disk that is maintained?

2. Or do you back that duckdb file with some sort of shared disk (e.g. EBS), and have a rolling deploy where multiple applications access the same DB at the same time?

3. Or is DuckDB treated as ephemeral, and you're using it to process data on the fly, so persisted state isn't an issue?


We use DuckDB extensively where I work (https://watershed.com), the primary way we're using it is to query Parquet formatted files stored in GCS, and we have some machinery to make that doable on demand for reporting and analysis "online" queries.


Storing data in Parquet files and querying via DuckDB is fast and kind of magical.


Shockingly fast, and having the intermediate files be immutable is super nice.


Haven't used it yet, but this aspect seems very appealing.


Do you load the Parquet files in duckdb or just query them directly?


We query them directly in most cases with the httpfs plugin, but for hot paths we fetch them and cache them on disk locally.
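Roughly, the direct-query path looks like this (a sketch shown against S3-style paths; the bucket, paths and credential settings are all placeholders):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    # Placeholder region/credentials; in practice these come from config or the environment
    con.execute("SET s3_region='us-east-1';")
    con.execute("SET s3_access_key_id='...';")
    con.execute("SET s3_secret_access_key='...';")

    df = con.execute("""
        SELECT account_id, sum(value) AS total
        FROM read_parquet('s3://my-bucket/reports/*.parquet')
        GROUP BY account_id
    """).df()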


There's a great podcast/interview with the creator of duckdb. He's pretty clear about thinking of the use case as more or less equivalent to MySQL, but for aggregate queries. I think trying to use it in place of something like a fully-fledged Postgres server might get weird, less because of any issues with duckdb and more because that isn't what it's designed for.

Podcast/interview link for anyone interested: https://www.dataengineeringpodcast.com/duckdb-in-process-ola...


Yes, using it in production - stateless and ephemeral. For sure there’s a learning curve.


I see. Would it be fair to say you treat it almost like Pandas, except that it has a lower memory footprint since data is written to disk instead of held in memory? I.e. you use it for on-the-fly analysis of large frames of data, not like a more traditional database/data warehouse?


I'm not sure they're saying that.

BTW, your questions are exactly the ones I've been asking over the last few months, but also with a lot of focus over the last few days. Still learning as much as I can, so the following might not be true.

For what it's worth, there's a difference between using duckdb to query a set of files vs loading a bunch of files in to a table. But once the data has been loaded into a table it can be backed up as a duckdb db file.

Therefore it might be more performant to preprocess duckdb db files (perhaps a process that works in conjunction with whatever manages your external tables) and load these db files into duckdb as needed (on the fly analysis) instead of loading datafiles into duckdb, transforming and CTAS every time.

https://duckdb.org/docs/sql/statements/attach

Of course all of this might be introducing more latency esp if you're trying to do NRT analytics.

I assume you could partition your data into multiple db files similar to how you would probably do it with your data files (managing external tables).
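The attach flow itself is pretty terse; a sketch, with file and table names invented:

    import duckdb

    con = duckdb.connect()  # in-memory "session" database
    # Attach pre-built database files and query across them
    con.execute("ATTACH 'daily_2023_02_10.db' AS day1;")
    con.execute("ATTACH 'daily_2023_02_11.db' AS day2;")

    df = con.execute("""
        SELECT * FROM day1.events
        UNION ALL
        SELECT * FROM day2.events
    """).df()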


Ah, almost like a little bit of a lower level trino, where DuckDB is able to push out queries to different data storage endpoints?


Unfortunately not. At least not without a little intervention. See this blog post for more details about what I mean. They inspect the iceberg table's catalogue to list the related parquet files and then load them into duckdb.

https://www.arecadata.com/getting-started-with-iceberg-using...

You would still need to interact with some kind of catalogue to understand which .db files you need to fetch.

And honestly I don't really know or understand the performance implications of the attach command.

I'm excited to see if the duckdb team will be able to integrate with external tables directly one day. (not that data files would be .db files)

Imagine this:

1) you have an external managed external table (iceberg, delta, etc... managed by Glue, databricks, etc)

2) register this table in duckdb

    CREATE OR REPLACE EXTERNAL TABLE my_table ... TYPE = 'ICEBERG' CATALOG = 's3://...' CREDENTIALS = '...' etc
3) simply interact with table in duckdb as you would any other table


Good questions, I'm thinking about evaluating it along with pola.rs as data science tools. Would be interesting to hear from someone using it already.


There's a nice intro to DuckDB for julia developers by Bogumił Kamiński, the creator of DataFrames.jl here: https://juliazoid.com/welcome-to-duckdb-3c4e75f50b97.

Interesting, since in some ways, as he points out, it's in direct competition with DataFrames for use cases, but he gives it a very positive treatment and shows how they can work together, using the advantages of standard SQL along with the processing power of DataFrames.


When someone gives fair opinions on something that directly competes with their own work, you should take their opinion very seriously. It’s an excellent quality in a person, and shows they’re more focused on the problem than their ego.


Quite a while ago, when duckdb was just a duckling, I wrote an R package that supported direct manipulation of R dataframes using SQL.[1] duckdb was the engine for this.

The approach was never as fast as data.table but did approach the speed of dplyr for more complex queries.

Life had other things in store for me and I haven’t touched this library for a while now.

At the time there was no Julia connector for duckdb, but now that there is, I’d like to try this approach in that language.

[1] https://github.com/phillc73/duckdf


I appreciate the clarity on the explicitly unsupported use cases in "When to not use DuckDB."

There are so many infrastructure products, especially database products for some reason, where the marketing team takes control of the messaging away from engineers and pushes outlandish claims about how their new DB is faster than all the competition, can support any workload, can scale infinitely, etc.


We're big fans of DuckDb at https://prequel.co! We use it as part of our own dataframe implementation in Go. The speed is unbeatable and the tool is top notch. There are a few rough edges (it's not quite 1.0 level of stability yet), but the team is super responsive and has fixed bugs we've reported in < 48hrs pretty much every time.


DuckDB has an awesome integration with Apache Arrow and very useful implementations in R and Python. Great for easy larger-than-memory analyses

https://duckdb.org/2021/12/03/duck-arrow.html


DuckDB is such a breath of fresh air!

We at MotherDuck are working very closely with the DuckDB folks to build a DuckDB-based cloud service. I'm talking to various folks in the industry about the details in 1:1s. Feel free to reach out to tino at motherduck.com.

(co-founder and head of Produck at MotherDuck)


Great to see this posted here! DuckDB is an integral part of an in-browser data analytics tool that I've been working on. It compiles to WASM and runs in a web worker. Queries against WASM DuckDB regularly run 10x faster than the original JavaScript implementation!


In case you're interested in or willing to share more details about the tool or your architecture, I'd be an interested reader.


It's amazing to see Rust used so much even in web projects. It's my favorite language and I don't want to use it for web programming anymore. I've used it for too many "real" web apps (I mean webrtc signaling and web socket) to go through the pains of optimization. But it's still fun to work with.


DuckCon 2023 - State of the Duck - Mark Raasveldt and Hannes Mühleisen

https://youtu.be/rdnPkLSkoyU


We just switched to DuckDB from Postgres to analyze AWS billing data and wrote up our experience, https://www.vantage.sh/blog/querying-aws-cost-data-duckdb

Arguably Postgres was never the right tool to use for this analysis but nonetheless I was surprised at how much faster DuckDB was.


ha! i'm using duckdb for a similar use-case.. plotting 18-month aggregates takes less than 2 seconds.. vendor only provides data for the last 3 months (through the web portal).

(i'm "only" working with 30 million rows though)


It's disappointing that C++ was chosen to build something that is going to live in-process. Rust would have been so much safer. All the segfaults you get when running DuckDB support this statement.


I agree about Rust's memory safety advantage over C++, but I disagree that it's disappointing from a project perspective. Some DB experts made a good DB using a performant language they're comfortable with.

You can't make project choices in a vacuum, and you can't assume others can either. People have limited time. The choice they were facing was probably not C++ vs Rust, but C++ vs nothing because they didn't have time to learn a new language before starting their project.

Also, their first release was in 2019, so they probably heard of it, but that's around the beginning of its recent spike in popularity. It's starting to be viewed as a good long term option, but back then a lot of people were still wondering if it was a fad.

I'm learning Rust, and I'm a big fan, but this is a bad take.


We built <osmos.io> in Rust and we started in 2019. Our first lines of product code were also my first lines of Rust.

It can be done and it’s not that hard.

I think the choice of C++ for an in-process DB that is going to be very popular makes the entire industry less secure. If Chrome, one of the largest budget C++ code bases, still has memory bugs then there is no way DuckDB won’t.


Anyone tried using the Go driver? It has a few open bugs:

https://github.com/marcboeker/go-duckdb/issues

And it's only a third-party effort:

https://duckdb.org/docs/api/overview


I've found the CGO boundary to be quite slow for large result sets and have taken to just running commands that do SELECT and COPY to files on the system and then read those.
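For reference, the export side of that workaround is just a COPY statement. A sketch (shown via the Python client here; the query and output path are placeholders), with the client then reading the Parquet file however is fastest:

    import duckdb

    con = duckdb.connect("analytics.db")
    # Instead of pulling a big result set row by row across the driver boundary,
    # have DuckDB write it out and read the file afterwards
    con.execute("""
        COPY (SELECT * FROM events WHERE day = '2023-02-10')
        TO 'out/events_2023_02_10.parquet' (FORMAT PARQUET)
    """)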


What is the differentiation of DuckDB compared to previous OLAP databases such as Snowflake or Singlestore?


Our database is heroku postgresql database. What's the best way to get this working with DuckDB? I see there's a postgresql connector but I'm not totally following how to deploy it. Would I just spin up a dyno with the docker image / custom build pack and connect it to the DB?


Doesn't Postgres have a columnar option? If so, you could probably get better performance for your analytical interactions if you switched some tables to columnar.

Otherwise check out postgres scanner. https://github.com/duckdblabs/postgres_scanner

They have a blog entry about it too: https://motherduck.com/blog/duckdb-ecosystem-newsletter-two/
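Based on those, usage is roughly this (a sketch; the connection string and table names are placeholders):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL postgres_scanner;")
    con.execute("LOAD postgres_scanner;")
    # Expose the Postgres tables inside DuckDB
    con.execute("CALL postgres_attach('dbname=mydb host=localhost user=me');")

    # The analytical query runs in DuckDB, reading rows from Postgres on the fly
    df = con.execute("SELECT status, count(*) AS n FROM orders GROUP BY status").df()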


I wonder if, in the near future, we'll also see JVM-based in-memory or hybrid OLAP database systems that make use of SIMD instructions and columnar storage layouts via the incubating Vector API.

It would be also interesting to see how we can process semi-structured data in a similar way.


Not trying to troll, but assuming proficiency in python, when would someone prefer this to say Pandas (or Polars)?

I've written a lot of OLAP queries (wrote a materialization layer for MonetDb and Postgres years ago). I find Pandas so much easier to work with for semi complicated work.


The obvious one is speed/data size: you can handle much larger data with duckdb compared to pandas. Depending on the data, maybe 10x larger, sometimes even more.

From an ergonomics perspective, I find Pandas much harder to use casually than SQL. When I was an IC and was using it a lot, I was proficient in it. But now that I code maybe 5 hours/month at work, I can't really do anything non-trivial besides basic stuff/pivots. OTOH, I never really forget SQL.


Does anyone have experience using client-side DuckDB with WASM? It looks really compelling to ship analytics to the browser by running DuckDB over parquet files.


I don't get this:

> When to not use DuckDB: Multiple concurrent processes reading from a single writable database

So no concurrent reads? Or is it no concurrent reads while writing?


the latter - you can either have multiple concurrent readers or a single writer. see https://duckdb.org/faq.html#how-does-duckdb-handle-concurren....


What if I run multiple processes? One is to write, the rest are reads?


Has anyone tried both duckdb and clickhouse-local?


I use duckdb and looked into clickhouse-local. The dealbreaker for me was that clickhouse-local supported only a subset of SQL (that most people would probably be ok with, but not sufficient for a lot of complex analytics work). For instance, clickhouse doesn't support lead/lag functions natively (though it does propose workarounds).

Here's clickhouse's SQL support: https://clickhouse.com/docs/en/sql-reference/

Compare this to DuckDB's (on the sidebar): https://duckdb.org/docs/sql/introduction

DuckDB's SQL coverage is much more complete and matches my experience with full blown databases like Postgres and Redshift.

As well, performance-wise DuckDB is currently still somewhat faster than clickhouse-local [1] but I would say this is a secondary consideration -- as long as either is "fast enough for your purposes" this shouldn't be an issue -- and clickhouse is plenty fast.

The primary consideration for me would be the SQL support. That said, if you don't use any complex SQL, clickhouse-local seems like it would be a worthy contender.

[1] https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...
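To make the lead/lag point concrete, this is the kind of query I mean, which DuckDB handles like any full-blown database (table and column names invented):

    import duckdb

    # Day-over-day cost change per account using lag()
    df = duckdb.query("""
        SELECT account_id,
               day,
               cost - lag(cost) OVER (PARTITION BY account_id ORDER BY day) AS cost_delta
        FROM read_parquet('costs/*.parquet')
    """).df()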


> The dealbreaker for me was that clickhouse-local supported only a subset of SQL

Yeah, I think there was some advanced psql stuff that I wanted to do with duckdb but wasn't able to, so I can imagine that with clickhouse-local it would be even worse. Not that duckdb isn't enough for me.

Now I need to make up my mind about duckdb vs nushell. Both are great. Well, I can use both, maybe.

Thank you very much!


Isn’t SQLite adding features for analytic queries that should take the wind out of the sails of duckdb?


I doubt sqlite will catch up soon in terms of analytics, for a few reasons.

The SQL dialect is lacking so much that it seems intentional. They are meant to be a transactional database, not an analytics one.

Sqlite also takes pride in stability (deployed on a billion Android devices). Adding 100+ analytics capabilities, e.g. functions, is not gonna be easy in terms of maintaining stability.

I want to be wrong though because my paid app (superintendent.app) uses Sqlite. Not supporting analytics well is the number one complaint.


Not really. There was an academic project around this but I've not seen signs that it will be integrated with SQLite core: https://simonwillison.net/2022/Sep/1/sqlite-duckdb-paper/

The key thing to consider here is trade-offs.

Analytical databases tend to be optimized for analytical queries at the expense of fast atomic read-write transactions.

SQLite is mainly used in situations where fast atomic read-write transactions are key - that's why it's used in so many mobile phone applications, for example.

It's not going to grow analytical-query-at-scale capabilities if that means negatively impacting the stuff it's really good at already.


I see no MDX query support. How is this an OLAP database? Or do I misunderstand what it is?


MDX is Microsoft proprietary tech; OLAP is a more generic term for analytical processing (in contrast to transaction processing, as in CRUD).

The OLAP cube is Microsoft's take on OLAP using technologies from 1990s tech stacks (Windows Server + SQL Server + SSAS).

DuckDB is a modern take on OLAP.


Every OLAP database that I've used supports MDX

MDX == OLAP

SQL == relational


Related—others?:

Motherduck Raises $47.5M for DuckDB - https://news.ycombinator.com/item?id=33610218 - Nov 2022 (2 comments)

Modern Data Stack in a Box with DuckDB - https://news.ycombinator.com/item?id=33191938 - Oct 2022 (5 comments)

Querying Postgres Tables Directly from DuckDB - https://news.ycombinator.com/item?id=33035803 - Sept 2022 (38 comments)

Notes on the SQLite DuckDB Paper - https://news.ycombinator.com/item?id=32684424 - Sept 2022 (28 comments)

Show HN: CSVFiddle – Query CSV files with DuckDB in the browser - https://news.ycombinator.com/item?id=31946039 - July 2022 (13 comments)

Show HN: Easily Convert WARC (Web Archive) into Parquet, Then Query with DuckDB - https://news.ycombinator.com/item?id=31867179 - June 2022 (15 comments)

Range joins in DuckDB - https://news.ycombinator.com/item?id=31530639 - May 2022 (24 comments)

Friendlier SQL with DuckDB - https://news.ycombinator.com/item?id=31355050 - May 2022 (133 comments)

Fast analysis with DuckDB and Pyarrow - https://news.ycombinator.com/item?id=31217782 - April 2022 (53 comments)

Directly running DuckDB queries on data stored in SQLite files - https://news.ycombinator.com/item?id=30801575 - March 2022 (23 comments)

Parallel Grouped Aggregation in DuckDB - https://news.ycombinator.com/item?id=30589250 - March 2022 (10 comments)

DuckDB quacks Arrow: A zero-copy data integration between Arrow and DuckDB - https://news.ycombinator.com/item?id=29433941 - Dec 2021 (13 comments)

DuckDB-Wasm: Efficient analytical SQL in the browser - https://news.ycombinator.com/item?id=29039235 - Oct 2021 (58 comments)

Comparing SQLite, DuckDB and Arrow with UN trade data - https://news.ycombinator.com/item?id=29010103 - Oct 2021 (79 comments)

DuckDB is the better SQLite with APIs for Java/Python/R and it's got potential - https://news.ycombinator.com/item?id=28692997 - Sept 2021 (2 comments)

Fastest table sort in the West – Redesigning DuckDB's sort - https://news.ycombinator.com/item?id=28328657 - Aug 2021 (27 comments)

Querying Parquet with Precision Using DuckDB - https://news.ycombinator.com/item?id=27634840 - June 2021 (32 comments)

DuckDB now has a Node.js API - https://news.ycombinator.com/item?id=25289574 - Dec 2020 (4 comments)

DuckDB – An embeddable SQL database like SQLite, but supports Postgres features - https://news.ycombinator.com/item?id=24531085 - Sept 2020 (160 comments)

DuckDB: SQLite for Analytics - https://news.ycombinator.com/item?id=23287278 - May 2020 (67 comments)



