
Task parallelism was something we used to work on 20 years ago, when we thought it was the solution to scaling. But then we found that the supercomputer people were right all along: the only thing that really scales well is data parallelism. So the focus over the last 5-10 years has been on finding data-parallel formulations of the problems we care about (say, deep neural network training) and then mapping them onto either a distributed pipeline (MapReduce) or a GPU.
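
To make the distinction concrete, here is a minimal sketch of the data-parallel pattern in Python: one function applied independently to every element of a dataset, so the same code scales just by splitting the data across more workers. The names process_frame and frames are hypothetical stand-ins, not anything from the thread.

    # Data parallelism: identical per-item work, scaled by partitioning the data.
    from multiprocessing import Pool

    def process_frame(frame):
        # Placeholder per-item work; in practice a filter, a kernel, a gradient step, etc.
        return sum(frame)

    frames = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

    if __name__ == "__main__":
        with Pool() as pool:
            # Same function over every element; adding workers needs no change to the code.
            results = pool.map(process_frame, frames)

Contrast this with task parallelism, where each worker runs a different piece of the computation and the coordination between those pieces is what limits scaling.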

> It is within recent memory that if you wanted 20+ CPUs at your disposal, you'd have to build a cluster with explicit job management, topologically-optimized communications, and a fair amount of physical redundancy.

You are still thinking about concurrency, not parallelism. Yes, the cluster people had to think this way, because they cared about throughput across many jobs; but no, the HPC people who needed performance never thought like this, because they only cared about the performance of a single job.

> As many of the applications requiring low-end clusters tended to involve random numbers or floating point calculations, we also had the annoyance of minor discrepancies such as clock drift affecting the final output.

Part of the problem, I think, is that we've been confused for a long time. Our PHBs saw problems (say, massive video frame processing) and reached for solutions that were completely inappropriate for them (cluster computing). It's only recently that we've realized there are often other, better options (like running MapReduce on that cluster).
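
As a rough sketch of what "running MapReduce on that cluster" means for the frame-processing example: express the job as an independent map over frames plus an associative reduce over the partial results, which is exactly the shape a cluster can parallelize over data. The functions and data below are illustrative placeholders, not the actual workload.

    # MapReduce shape of the hypothetical frame-processing job.
    from functools import reduce

    def map_phase(frame):
        # Independent per-frame work; no communication between frames.
        return len(frame)

    def reduce_phase(a, b):
        # Associative combine of partial results, so it can run as a tree on the cluster.
        return a + b

    frames = [b"frame-0", b"frame-01", b"frame-012"]
    total = reduce(reduce_phase, map(map_phase, frames), 0)

Run serially this is trivial, but because the map step has no cross-frame dependencies and the reduce is associative, a MapReduce framework can distribute both phases without the explicit job management the parent comment describes.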



