
Spark is never going to be the right choice when running on a single system.

Spark is for when you have hundreds of machines' worth of processing to do.




> Spark is for when you have hundreds of machines' worth of processing to do.

Absolutely agree. However, most uses of Spark I've seen in my career come from people who only think they have hundreds of machines' worth of processing to do.


And even when you do have quite a lot of machines' worth of processing, single-threaded streaming of the data on a single machine can still beat any distributed framework, because the overhead of distribution is large.
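To make that concrete, here's a hedged sketch (the data and function are made up, not from the pipeline under discussion) of the single-threaded streaming pattern: each record is touched exactly once, with no serialization, shuffling, or task-scheduling overhead at all:

```python
import csv
import io

def total_by_key(lines):
    """Stream CSV rows once, aggregating values per key in memory."""
    totals = {}
    for key, value in csv.reader(lines):
        totals[key] = totals.get(key, 0) + float(value)
    return totals

# Toy in-memory data; a real run would stream from a file handle,
# so memory use stays constant regardless of input size.
data = io.StringIO("a,1\nb,2\na,3\n")
print(total_by_key(data))  # {'a': 4.0, 'b': 2.0}
```

A distributed framework has to pay for partitioning, network shuffles, and coordination before it does any of this work, which is exactly the overhead the comment refers to.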


A favorite paper, “Scalability! But at what COST?”: the authors show that a single-machine, even single-threaded, implementation can wipe the floor with highly parallel distributed implementations using many more cores.

http://dsrg.pdos.csail.mit.edu/2016/06/26/scalability-cost/


We didn't develop the PySpark pipeline. It was handed to us to be improved, and we did a full rewrite to leave it cleaner and more understandable. We also tried a persistence switch, so that if a step failed we could resume from a previous one, to see whether that was the better choice. I also had zero hands-on experience with PySpark and DuckDB. But yes, I was amazed at how far it fell behind DuckDB; I wasn't expecting such a difference. Also, this pipeline did indeed run in the cloud, but it wasn't possible to test it there, so the only option was to run it locally.
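A minimal sketch of the persistence-switch idea described above (all names here are hypothetical, not the actual pipeline's code): each step writes its result to disk, so a rerun after a failure resumes from the last completed step instead of recomputing everything:

```python
import pickle
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # hypothetical location

def run_step(name, fn, data, persist=True):
    """Run a pipeline step, reusing an on-disk checkpoint if one exists."""
    path = CHECKPOINT_DIR / f"{name}.pkl"
    if persist and path.exists():
        # A previous run already completed this step: load its result.
        return pickle.loads(path.read_bytes())
    result = fn(data)
    if persist:
        CHECKPOINT_DIR.mkdir(exist_ok=True)
        path.write_bytes(pickle.dumps(result))
    return result

# Hypothetical two-step pipeline: if "clean" failed, rerunning the
# script would skip "load" by reading its checkpoint from disk.
raw = run_step("load", lambda _: [3, 1, 2], None)
clean = run_step("clean", sorted, raw)
print(clean)  # [1, 2, 3]
```

PySpark's own `DataFrame.persist()` only caches within a session; resuming across runs needs something like the explicit write-to-storage pattern above.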


MotherDuck has an excellent article about this: https://motherduck.com/blog/the-simple-joys-of-scaling-up/



