1. I certainly did not mean to imply that ATLAS or MKL were the best CPU linear ...

jedbrown · on April 21, 2015

2. Most languages have a way to store raw arrays that can be passed directly to the numerical routines. If you have to marshal, then it depends on the problem size and amount of work done in the numerical routine.

3. Yes.

4. NVIDIA has a track record of misleading comparisons. They cleaned up their act in the these CUDA-7 benchmarks: http://devblogs.nvidia.com/parallelforall/cuda-7-release-can... in response to this G+ discussion https://plus.google.com/+JeffHammondScience/posts/G1MzHqZaxy... . Thanks to Szilárd Páll for calling attention to this discussion and to Mark Harris for updating the plots. Note that this should not be interpreted as "MKL/Xeon is catching up" but rather that performance comparisons are sensitive to the details of the experiment and it can be hard to recognize the consequences of biased configurations. Better normalization and standards for comparison can help.