For disclosure, I'm a minor contributor to Dask, so I'm probably a little biased.

One side I probably haven't put forward, though, is that the memory footprint of something like Dask/Spark is higher because of its overheads. If you don't have scalable resources, then a Polars/DuckDB option is probably your most reliable choice (i.e. the one that'll hit the fewest memory errors on a given architecture).
Matt Rocklin (the Dask creator) has put some effort into benchmarking across all of these recently. You can see the presentation here: https://youtu.be/5YXiRl41DKA?si=5Pt9XQHp5Y1cG7tM
If you don't feel like watching the whole presentation, the main takeaway is: local is faster, until it isn't. Polars outperforms Spark by a wide margin, until the data size gets large enough that things swap over.

Some operations are performance-sensitive without having scaling issues, so those kinds of jobs are great candidates for something like Polars.