That is a good point. If you have around N things going on and N cores, and each thing can pretty much saturate one core, it is better to keep each on its own core so it can rip through its own L1 cache rather than ruin each other's.
The loop abstractions don't give you the control to ensure that, though I would hope the good ones are smart enough to at least keep each thread on one core and on its own chunk of an array.
I call this heterogeneous vs. homogeneous threads/processes. Heterogeneous threads are more icache and dcache friendly. They also make your program more modular.
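Linux has controls to set the affinity of threads for cores: the sched_setaffinity() system call, pthread_setaffinity_np() for pthreads, and the taskset command from the shell.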
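For anyone who wants to try it, here is a minimal sketch of pinning on Linux using Python's `os.sched_setaffinity`, which wraps the same system call. The helper name `pin_to_cpu` is made up for illustration, and this assumes a Linux host (the call is not available on macOS or Windows). It only shows the mechanism; whether pinning actually helps depends on the workload.

```python
import os

def pin_to_cpu(cpu):
    """Pin the calling process to one CPU and return the resulting affinity set.

    `pin_to_cpu` is a hypothetical helper name, not a library function.
    A pid of 0 means "the caller". Linux only.
    """
    allowed = os.sched_getaffinity(0)
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    original = os.sched_getaffinity(0)   # remember where we were allowed to run
    target = min(original)               # pick any CPU we are allowed to use
    print("before:", sorted(original))
    print("after: ", sorted(pin_to_cpu(target)))
    os.sched_setaffinity(0, original)    # restore the original affinity
```

With one worker process per core, each worker would call something like this once at startup with its own core number, so the scheduler stops migrating it and its cache stays warm.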
That would be interesting to explore.