You can't show that "temporary arrays, choice of language, and parallelism" or "pointer indirection, cache efficiency, and SIMD vectorization" don't matter unless you compare an implementation that handles those things against one that doesn't.
BLAS libraries all land within range of each other because they all lay out memory in basically the same way, do indirection in basically the same way, handle the cache in basically the same way, and use SIMD in basically the same way.
As soon as you step out of those prebuilt blocks and build your own function over arrays of numbers, you're going to lose far more than a factor of 3 until you understand how to handle these issues.
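To make the gap concrete, here is a minimal sketch (assuming NumPy as the BLAS-backed baseline): a hand-rolled triple-loop matrix multiply computes the same result as the library call, but does its own indexing in interpreted code and gets none of the cache blocking or SIMD that the BLAS routine behind `a @ b` provides.

```python
import numpy as np

def naive_matmul(a, b):
    """Textbook triple-loop matrix multiply: correct, but no cache
    blocking, no SIMD, and per-element indexing overhead."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i, p] * b[p, j]
            out[i, j] = s
    return out

rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = rng.random((64, 64))

# Same answer; the BLAS-backed a @ b is typically far faster.
assert np.allclose(naive_matmul(a, b), a @ b)
```

Timing the two (e.g. with `timeit`) on any realistic size shows the kind of multi-order-of-magnitude gap being discussed, well beyond a factor of 3.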
>You can't show that "temporary arrays, choice of language, and parallelism" or "pointer indirection, cache efficiency, and SIMD vectorization" don't matter unless you compare an implementation that handles those things against one that doesn't.
I think he said the exact opposite. He's basically saying that algorithmic time complexity matters (and therefore identifying your problem class matters, since it may buy you a more efficient algorithm), because you can only get so far with the generic algorithm, however good your implementation is.