
Dask is actually similar to Spark in that it lets you distribute computation across multiple machines, beyond your own computer.

Matt Rocklin (the Dask creator) has put some effort into benchmarking across all of these recently. You can see the presentation here: https://youtu.be/5YXiRl41DKA?si=5Pt9XQHp5Y1cG7tM

If you don't feel like watching a whole presentation, the main takeaway is: local is faster, until it isn't.

Polars outperforms Spark by a wide margin, until the data grows large enough that the ranking flips.

Some operations can be performance-sensitive without having scaling issues. So I guess those kinds of jobs are great candidates for something like Polars.




An analogy would be whether I should buy a two-seater sports car or a family car when I know I'll have to drive more than one person around.

I think I'm the kind of guy who prefers reliability and versatility over speed.

(When I was a kid you'd say Alfa Romeo vs Volvo)


I'm right there with you!

Full disclosure: I'm a minor contributor to Dask, so I'm probably a little biased.

I guess one side I probably haven't put forward, though, is that the memory footprint of something like Dask/Spark is higher because of their overheads. If you don't have scalable resources, then a Polars/DuckDB option would probably be your most reliable choice (i.e. the one that'll hit the fewest memory errors on the given architecture).


Nice video, thank you. It's interesting to see how each technology behaves depending on the scale of the dataset. DuckDB is definitely killing it.



