One more: Don't parallelize when you're going to blow L1 caches for no good reason.
I wrote a gradient descent solver a while back, before I realized Vowpal Wabbit existed.
Vowpal Wabbit had a dedicated thread to read in records and pass them through a queue to another thread doing the math. I didn't, because it was a quick POC impl and I couldn't be arsed. My impl was faster.
This advice probably goes for most usages of auto-parallel collection impls that the author was pointing out.
That is a good point. If you have roughly N things going on, N cores, and each thing can saturate a core pretty well, it is better to keep each one on its own core so they can rip through their own L1 cache rather than ruin each other's.
The loop abstractions don't give you the control to ensure that, though I would hope the good ones are smart enough to at least keep each thread on one core and on its own chunk of the array.
I call this heterogeneous vs. homogeneous threads/processes. Heterogeneous threads are more icache and dcache friendly. They also make your program more modular.
Linux has controls to set the affinity of threads for cores.