I'm always impressed by how much faster you can go compared to a naive algorithm - if you know what you are doing. This clocks in at 730x, and that might not be the peak.
Compare: The original steam engine trains went, I believe, 10 km/h, so if they received the same level of speedup, they'd reach somewhere between Mach 5 and 10. You can classify this algorithm as hypersonic ;-)
Tangent: wind resistance grows with the square of velocity. People aren't impressed that a pro cyclist is going "only" twice as fast as them, but that's four times the force, for three times as long.
If a modern train engine produced 700 times as much power, that's only 26 times faster for the same cargo (don't they pull more now?). And we do have trains that can go 260 km/h... just not with cargo.
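Spelling out where the 26 comes from (this treats required power as growing with the square of speed, per the square law above - just making the arithmetic explicit, not endorsing the physics):

sqrt(700) ≈ 26.5
10 km/h * 26.5 ≈ 265 km/h

which is right around those 260 km/h trains.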
I actually find the practicalities of algorithm choice even more fascinating. For example, Strassen's algorithm is asymptotically the fastest known algorithm for dense matrix multiplication in certain situations (medium sized matrices), but it is rarely used because it is complicated to implement and less numerically stable.
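For context on the "complicated to implement" part, here is the standard 2x2-block Strassen scheme (the well-known textbook formulas, quoted for reference, not anything from the post): it trades 8 block multiplications for 7 at the cost of 18 block additions/subtractions.

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22)B11
M3 = A11(B12 - B22)
M4 = A22(B21 - B11)
M5 = (A11 + A12)B22
M6 = (A21 - A11)(B11 + B12)
M7 = (A12 - A22)(B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

Applied recursively, that one saved multiplication per level is where the sub-cubic exponent comes from; the extra additions and subtractions are also where the stability headaches start.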
The problem with the algorithms faster than conventional matrix multiplication is that they are defined on exact mathematical numbers.
E.g. Strassen's algorithm assumes something like:
(A+B)C - BC = AC
With the conventional floating-point representation this becomes a problem if B >> A, because of rounding errors.
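A toy scalar illustration of that cancellation (a minimal sketch of my own, not the actual blocked algorithm; in single precision, with B >> A the low bits of A vanish in the sum):

    #include <cstdio>

    int main() {
        // Scalar stand-ins for matrix blocks: B is 1e8 times larger than A,
        // so A is completely lost when the float sum A + B is rounded.
        float A = 1.0e-4f;
        float B = 1.0e+4f;
        float C = 3.0f;

        float exact    = A * C;                // ~3.0e-4
        float identity = (A + B) * C - B * C;  // evaluates to 0.0f here

        std::printf("A*C         = %g\n", exact);
        std::printf("(A+B)C - BC = %g\n", identity);  // the "AC" the identity promises
        return 0;
    }

The same effect shows up blockwise in the real algorithms, which is why the error bounds for Strassen-like methods are weaker than for the conventional triple loop.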
Most algorithms faster than Strassen assume:
((fA + B)C - BC) / f = AC
and
(fA + B)C = BC + fAC
f is then chosen in such a way that fAC is negligible. But if that is the case, fA is much smaller than B and we automatically get numerical problems.
There is a procedure to remove the fAC terms (which would imply that f could be chosen close to 1), which amortizes over enough recursions. But for Coppersmith-Winograd (and derivatives like Le Gall's algorithm), we need to divide the matrix in each recursion step into enough submatrices that the indices can be treated heuristically (let's say a division of each index into 100 subindices per iteration). This would imply the sides of each matrix must be at least of size 100^n_recursion. We will never need to multiply matrices of that size.
Oh, right indeed. I had forgotten about those since they aren't really practical. I honestly find it pretty amazing that any sub-cubic algorithms can even exist for dense matrix multiplication. It's just far enough away from my expertise to seem magical.
And when you see the world record bicycle speed of 144km/h on a recumbent with full body fairing, you realize just how significant wind resistance is and how important aerodynamics is.
When we care about speed, we would use OpenBLAS, MKL or Eigen for matrix multiplication. How does Armadillo compare to them, or does Armadillo actually use BLAS? In that case, which BLAS implementation is used in this benchmark?
Armadillo is built on any BLAS you might like to use. (Eigen is a hand-implemented replacement for BLAS, by the way.) So I, too, am interested in what BLAS implementation is used here. OpenBLAS vs. regular LAPACK/BLAS (or ATLAS) can make quite a large difference.
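For example (a minimal sketch, nothing benchmark-specific), the multiply below just forwards to the gemm of whichever BLAS Armadillo was linked against:

    #include <armadillo>
    #include <iostream>

    int main() {
        // Two 1000x1000 random matrices; A * B is dispatched to the dgemm of
        // whatever BLAS was chosen at link time (OpenBLAS, MKL, ATLAS, the
        // reference implementation, ...), so swapping BLAS needs no code change.
        arma::mat A = arma::randu<arma::mat>(1000, 1000);
        arma::mat B = arma::randu<arma::mat>(1000, 1000);
        arma::mat C = A * B;
        std::cout << arma::accu(C) << std::endl;  // touch the result
        return 0;
    }

So which implementation a given benchmark exercised is a build-configuration question, not something visible in the source.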
Eigen, Armadillo, Blaze, and ETL all have their own replacement implementations for BLAS but can be linked against any version. By the way, MKL supports AVX512, while OpenBLAS does not as of yet. Benchmarks show a factor of 4 between the two for gemm.
It's a factor of three for the large-matrix serial case on KNL -- the OpenBLAS issue on KNL -- whereas you might expect a factor of two by analogy with avx/avx2.
For large avx512 BLAS (and maybe other x86_64 targets, now that it's dynamically dispatched), use BLIS. BLIS also provides a non-BLAS interface. For small matrix multiplication, use libxsmm, of course.
Remember that the world isn't all amd64/x86_64, in which case BLIS is infinitely faster than MKL, and it's probably faster even on Bulldozer/Zen. (I haven't compared on Bulldozer recently, and don't have Zen.)
I was looking at [0] for that number. You're right, it's closer to 3 than 4; I must have rounded ~12k down to 10k and 37k up to 40k. I could imagine some other factors speeding it up further, as well. There were a number of missing instructions in AVX2 that they've filled in for AVX512 which could play a role.
Thanks for the heads-up re: BLIS, I'd forgotten about it; it's probably the best option, especially considering its open-source status.
Interestingly, given this blog post, I'm 100% sure that there are further improvements that can be made. To my knowledge, Strassen's isn't actually used by most high-performance matrix libraries because the constant factors are high enough that they outweigh the theoretical gains.
If I'm reading page 65 of that presentation correctly, naive Strassen (what this person did) should be approximately 75% as fast as MKL on an 8-core machine for a 4Kx4K matrix, and even the improved algorithm outlined in the paper is only approximately equivalent -- and I wouldn't call it "Strassen's".
I stand by what I said (although that paper was a cool read!)
So ABC Strassen (one of the variants in the paper) crosses over at 4000x4000 on 10 cores and is ahead on fewer cores. I shared this as the current state of performance engineering for Strassen-like algorithms: it offers modest benefits in more practical regimes than the conventional wisdom suggests. I agree that many applications of dense matrices do not benefit.
Worth noting that for many use cases, using the wide floating-point vectors from AVX and on is a bad idea. They can take upwards of 2ms to turn on and have weird effects (this delay is due to having to adjust the voltage the core gets).
Unless you know this is what your consumers want, I wouldn’t use it in library code.
There are reasons you might avoid the widest SIMD in particular cases, particularly for avx2 v. avx512, including the number of available SIMD units, clock speed, and the number of available registers. However, recommending against avx in library code -- we have dense matrix multiplication here! -- is odd. (Various kernels other than fft and gemm have had non-trivial work to use avxN assembler or intrinsics, and doubtless others could benefit.)
There are relevant snippets of information on SIMD trade-offs around FFTW, BLIS, and, especially, libxsmm. This stuff definitely isn't simple.
Yeah, to be clear, I wasn't saying it's definitely the wrong choice for all library code, or anything that strong. More that it's worth considering the cases your users will actually hit, since for several of them it is likely not the right choice.
For dense matrix multiplication it’s probably worth it, but the article is phrased much more broadly in its conclusions, and does not mention the fact that this issue exists (possibly because the author isn’t aware of it — which I don’t fault them for).
To be clear, a 2ms stall when calling certain functions has the potential to be devastating for anything that wants to run at speeds approximating realtime.
Another approach is to start with the obvious 3 nested loops, then replace them with a single loop of n^3 iterations. Take the bits of this single loop variable and deinterleave them to obtain the 3 indices of the nested-loop variant. In other words, treat the loop counter as if its bits are a3b3c3a2b2c2a1b1c1a0b0c0, and so on. This is simpler with 2 interleaved values but still not bad. It produces a sort of 3-dimensional z-order traversal. The overhead can be minimized by looping with a stride of some power of 2 and unrolling the inner portion to do that many multiply-accumulates (see the sketch below).
If you do this while also storing the matrix in z-order in memory, the offsets in the unrolled part become constants independent of the matrix size.
This technique is applicable to anything using nested loops where the indices take you to a lot of memory repeatedly.
I'm still waiting for a couple of bit-interleave and bit-deinterleave instructions, but those wouldn't be usable without intrinsics.
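A minimal sketch of the deinterleaving idea (my own illustration, not from the post; assumes n is a power of two, row-major storage, and a zero-initialized C):

    #include <cstdint>
    #include <vector>

    // Recover one index from a 3-way bit-interleaved counter by collecting
    // every third bit: bits 0, 3, 6, ... of x become bits 0, 1, 2, ... of r.
    static inline uint32_t deinterleave3(uint64_t x) {
        uint32_t r = 0;
        for (int b = 0; b < 21; ++b)          // 21 groups of 3 bits cover 63 bits
            r |= (uint32_t)((x >> (3 * b)) & 1u) << b;
        return r;
    }

    // One flat loop of n^3 iterations; the counter's bits read ...a1b1c1 a0b0c0,
    // so deinterleaving recovers the three nested-loop indices and the traversal
    // walks A, B and C in a 3-dimensional z-order pattern.
    void matmul_zorder(const std::vector<double>& A,
                       const std::vector<double>& B,
                       std::vector<double>& C, uint32_t n) {
        for (uint64_t t = 0; t < (uint64_t)n * n * n; ++t) {
            uint32_t k = deinterleave3(t);      // bits 0, 3, 6, ...
            uint32_t j = deinterleave3(t >> 1); // bits 1, 4, 7, ...
            uint32_t i = deinterleave3(t >> 2); // bits 2, 5, 8, ...
            C[i * n + j] += A[i * n + k] * B[k * n + j];
        }
    }

The stride-of-a-power-of-two and unrolling refinement from the comment would then replace the per-iteration deinterleave with a handful of constant offsets.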