And even when you have quite a lot of machines' worth of processing to do, single-threaded streaming of data on a single machine can still beat out a distributed framework, because the overhead of distribution is large.
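As a minimal sketch of what that single-threaded streaming looks like (not from the thread; the function name and data are made up for illustration): one pass over a lazy iterable, constant memory, no cluster coordination at all.

```python
# Hypothetical sketch: a single-threaded streaming aggregation.
# Records are consumed one at a time from a generator, so nothing
# is materialized in memory and there is no distribution overhead.

def word_count_stream(lines):
    """Count words across an iterable of lines in a single pass."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# A generator stands in for a large file or network stream:
lines = (f"event user{i % 3}" for i in range(1000))
counts = word_count_stream(lines)
```

For workloads that fit one machine, this kind of loop pays none of the serialization, shuffling, or scheduling costs a cluster framework imposes.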
A favorite paper of mine is “Scalability! But at what COST?”. The authors show that a single-machine implementation (even a single-threaded one) can wipe the floor with highly parallel implementations.
We didn't develop the PySpark pipeline ourselves. It was given to us to improve, and we did a full rewrite to leave it cleaner and more understandable. We also tried a persistence switch, so that if a step failed we could resume from a previous one. I had zero hands-on experience with PySpark and DuckDB. But yes, I was amazed at how far it fell behind DuckDB; I wasn't expecting such a difference. Also, this pipeline did indeed run in the cloud, but it wasn't possible to test it there, so the only choice was to run it locally.
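The resume-from-a-previous-step idea can be sketched without Spark at all: each step checkpoints its output to disk, and a rerun skips any step whose output already exists. Everything here (step names, the JSON format, `run_step`) is hypothetical, not the commenter's actual pipeline.

```python
import json
import os
import tempfile

def run_step(name, func, inputs, workdir):
    """Run `func` unless a checkpoint for `name` already exists on disk."""
    path = os.path.join(workdir, f"{name}.json")
    if os.path.exists(path):              # resume: reuse the saved result
        with open(path) as f:
            return json.load(f)
    result = func(inputs)
    with open(path, "w") as f:            # checkpoint the result
        json.dump(result, f)
    return result

workdir = tempfile.mkdtemp()
step1 = run_step("clean", lambda xs: [x * 2 for x in xs], [1, 2, 3], workdir)
step2 = run_step("total", sum, step1, workdir)
```

In Spark itself the analogous knobs are things like `persist()`/`checkpoint()` on an RDD or DataFrame, which cache or materialize intermediate results so a failed job doesn't recompute from scratch.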
Spark is for when you have hundreds of machines' worth of processing to do.