99% of the time a loop works just fine, because there are no measurable gains to be had from parallelism. For the 1% where performance matters, it's usually a bit more involved than simply using a map or fold, and is hopefully already packaged as an off-the-shelf library. To get measurable gains from parallelism, one has to be very intentional about balancing communication against computation. Think of carefully designed libraries like cuDNN.
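
To make the communication vs computation point concrete, here is a toy cost model, not a benchmark. It assumes the work splits evenly across workers and that coordination overhead grows linearly with the number of workers; both assumptions, and the names `parallel_time` and `speedup`, are made up for illustration.

```python
def parallel_time(work_s, workers, overhead_per_worker_s):
    """Estimated wall time in seconds under the toy model.

    Assumption: compute divides evenly across workers, while
    communication/coordination cost grows linearly with worker count.
    """
    return work_s / workers + overhead_per_worker_s * workers


def speedup(work_s, workers, overhead_per_worker_s):
    """Ratio of sequential time to modeled parallel time (>1 means a win)."""
    return work_s / parallel_time(work_s, workers, overhead_per_worker_s)


# Tiny job: 1 ms of work, 0.5 ms overhead per worker, 8 workers.
small = speedup(0.001, 8, 0.0005)

# Big job: 10 s of work, same overhead, same worker count.
large = speedup(10.0, 8, 0.0005)

print(f"small job speedup: {small:.2f}x")
print(f"large job speedup: {large:.2f}x")
```

Under these assumed numbers the small job gets slower when parallelized, while the large one gets close to the ideal 8x. Real systems have messier overheads (memory bandwidth, synchronization, data transfer), which is why this balance tends to be tuned inside dedicated libraries rather than at the level of a single map or fold.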