One more: Don't parallelize when you're going to blow L1 caches for no good reason.
I wrote a gradient descent solver a while back, before I realized Vowpal Wabbit existed.
Vowpal Wabbit had a dedicated thread to read in records and pass them through a queue to another thread doing the math. I didn't, because it was a quick POC impl and I couldn't be arsed. My impl was faster.
This advice probably goes for most usages of auto-parallel collection impls that the author was pointing out.
That is a good point. If you have roughly N things going on, N cores, and each thing can saturate a core pretty well, it is better to keep each one on its own core so they can rip through their own L1 cache rather than ruin each other's.
The loop abstractions don't give you the control to ensure that, though I would hope the good ones are smart enough to at least keep each thread on one core and on its own chunk of the array.
I call this heterogeneous vs. homogeneous threads/processes. Heterogeneous threads are more icache and dcache friendly. They also make your program more modular.
Linux has controls to set the affinity of threads for cores.