1. I certainly did not mean to imply that ATLAS or MKL were the best CPU linear algebra libraries out there. Thanks for pointing out OpenBLAS and offering that comparison.
2. If I were to run a highly iterative algorithm, say with 100,000 or more iterations, would the memory marshaling not be significant? Or would it be unnoticeable since I would be using these libraries anyway? I also personally find the array notation and slicing to be quite nice.
3. The use of "restrict" is nice. I was not aware of that feature. Looking at this reference [1], it seems that Fortran basically does the equivalent of automatically using "restrict".
4. According to NVIDIA [2], cuBLAS outperforms MKL by 6x-17x (it seems MKL is catching up since I last checked). Is NVIDIA's comparison misleading? What speedup would you consider reasonable to expect? The fixed-cost efficiency and variable-cost efficiency are good points to bring into the discussion. Your point about carefully evaluating workload is also a good one. I don't want people to think that throwing an algorithm into CUDA will magically speed up the overall program. Network I/O, disk I/O, and device (GPU) I/O are all important bottlenecks to consider.
2. Most languages have a way to store raw arrays that can be passed directly to the numerical routines. If you have to marshal, then it depends on the problem size and amount of work done in the numerical routine.
3. Yes.
4. NVIDIA has a track record of misleading comparisons. They cleaned up their act in these CUDA 7 benchmarks: http://devblogs.nvidia.com/parallelforall/cuda-7-release-can... in response to this G+ discussion: https://plus.google.com/+JeffHammondScience/posts/G1MzHqZaxy...
Thanks to Szilárd Páll for calling attention to this discussion and to Mark Harris for updating the plots. Note that this should not be interpreted as "MKL/Xeon is catching up" but rather that performance comparisons are sensitive to the details of the experiment and it can be hard to recognize the consequences of biased configurations. Better normalization and standards for comparison can help.
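For anyone else who had not seen "restrict" before, here is a minimal sketch of what point 3 is about. The function names (axpy_may_alias, axpy_restrict) are made up for illustration and do not come from any library. The qualifier is a promise from the caller that the arrays do not overlap, which is roughly the assumption a Fortran compiler is already allowed to make about dummy arguments. Whether it actually produces faster code depends on the compiler and flags.

```c
#include <stddef.h>

/* Without restrict: the compiler has to allow for x and y overlapping,
   so a store to y[i] might change a later x[j]. That can limit
   vectorization or force a runtime overlap check. */
void axpy_may_alias(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* With restrict (C99): the caller promises that, for the duration of the
   call, memory written through y is not also accessed through x.
   Breaking that promise is undefined behavior, so this is a contract,
   not something the compiler verifies. */
void axpy_restrict(size_t n, double alpha,
                   const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

Both versions give the same result when x and y are distinct arrays; the only difference is what the optimizer is permitted to assume.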
[1]: http://www-cs-students.stanford.edu/~blynn/c/ch05.html
[2]: https://developer.nvidia.com/cuBLAS