Performance experiments with matrix multiplication in Rust (vorner.github.io)
97 points by signa11 on May 13, 2018 | 24 comments



I'm always impressed by how much faster you can go compared to a naive algorithm, if you know what you are doing. This one clocks in at 730x, and that might not even be the peak.

Compare: the original steam engine trains went, I believe, 10 km/h, so if they received the same level of speedup, they would reach somewhere between Mach 5 and 10. You can classify this algorithm as hypersonic ;-)


Tangent: drag from wind resistance scales with the square of velocity. People aren’t impressed that a pro cyclist is going “only” twice as fast as them, but that’s pushing against four times the force, for three times as long.

If a modern train engine produced 700 times as much power, that’s only about 26 times faster for the same cargo (don’t they pull more now?). And we do have trains that can go 260 km/h... just not with cargo.

progress, man.


I actually find the practicalities of algorithm choice even more fascinating. For example, Strassen's algorithm is asymptotically the fastest known algorithm for dense matrix multiplication in certain situations (medium sized matrices), but it is rarely used because it is complicated to implement and less numerically stable.


> Strassen's algorithm is asymptotically the fastest known algorithm for dense matrix multiplication

Coppersmith–Winograd approaches take the crown for asymptotic scaling in matrix-matrix multiplication [1].

It kind of blows my mind that a tight bound on the complexity of matrix-matrix multiplication still isn't known!

Edit: Some quick googling shows that François Le Gall took the crown in 2014.

[1] https://en.wikipedia.org/wiki/Computational_complexity_of_ma...


The problem with the algorithms faster than conventional matrix multiplication is that they are defined on exact mathematical numbers.

E.g. Strassen's algorithm assumes something like:

(A+B)C - BC = AC

With the conventional floating point representation this becomes a problem if B >> A, because of rounding errors.

Most algorithms faster than Strassen assume:

((fA + B)C - BC)/f = AC

and

(fA + B)C = BC + fAC

f is then chosen in such a way that fAC is negligible. But if that is the case, fA is much smaller than B and we automatically get numerical problems.
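A tiny sketch of that effect (my own illustration, not the parent's, with f32 scalars standing in for the matrices): once fA is far below the rounding granularity of B, the subtraction recovers nothing.

    // Scalars standing in for matrices; values chosen so that f*a is far
    // below the rounding granularity of b in f32.
    fn main() {
        let (a, b, c): (f32, f32, f32) = (1.0, 1.0e8, 3.0);
        let f: f32 = 1.0e-3;

        let exact = a * c;                                 // 3.0
        let strassen_like = ((f * a + b) * c - b * c) / f; // f*a is swallowed by b

        println!("exact         = {}", exact);         // 3
        println!("strassen-like = {}", strassen_like); // 0
    }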

There is a procedure to remove the fAC terms (which would imply that f could be chosen close to 1), which amortizes over enough recursions. But for Coppersmith-Winograd (and derivatives like Le Gall's algorithm), we need to divide the matrix in each recursion step into enough submatrices that the indices can be treated heuristically (let's say a division of each index in each iteration into 100 subindices). This would imply the sides of each matrix must be at least of size 100^n_recursion. We won't ever need to multiply matrices of that size.


Thank you for posting your comment. Very insightful and packed with goodies!


Oh, right indeed. I had forgotten about those since they aren't really practical. I honestly find it pretty amazing that any sub-cubic algorithms can even exist for dense matrix multiplication. It's just far enough away from my expertise to seem magical.


And when you see the world record bicycle speed of 144km/h on a recumbent with full body fairing, you realize just how significant wind resistance is and how important aerodynamics is.


I was in Japan and the shinkansen definitely had cargo on it


When we care about speed, we would use OpenBLAS, MKL or Eigen for matrix multiplication. How does Armadillo compare to them, or does Armadillo actually use a BLAS underneath? In that case, what BLAS implementation is used in this benchmark?
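For context, a rough sketch (mine, not from the article) of what "just use a BLAS" looks like from Rust: declare the standard CBLAS symbol and link against whichever implementation is installed (OpenBLAS, MKL, ...), e.g. via a build script.

    use std::os::raw::c_int;

    extern "C" {
        // C = alpha*A*B + beta*C through the standard CBLAS interface.
        fn cblas_sgemm(
            layout: c_int, trans_a: c_int, trans_b: c_int,
            m: c_int, n: c_int, k: c_int,
            alpha: f32, a: *const f32, lda: c_int,
            b: *const f32, ldb: c_int,
            beta: f32, c: *mut f32, ldc: c_int,
        );
    }

    const CBLAS_ROW_MAJOR: c_int = 101;
    const CBLAS_NO_TRANS: c_int = 111;

    // Square row-major multiply, delegated to whatever BLAS we linked against.
    fn gemm(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
        let n = n as c_int;
        unsafe {
            cblas_sgemm(
                CBLAS_ROW_MAJOR, CBLAS_NO_TRANS, CBLAS_NO_TRANS,
                n, n, n,
                1.0, a.as_ptr(), n,
                b.as_ptr(), n,
                0.0, c.as_mut_ptr(), n,
            );
        }
    }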


Armadillo is built on any BLAS you might like to use. (Eigen is a hand-implemented replacement for BLAS, by the way.) So I, too, am interested in what BLAS implementation is used here. OpenBLAS vs. regular LAPACK/BLAS (or ATLAS) can make quite a large difference.


Eigen, Armadillo, Blaze, and ETL all have their own replacement implementations for BLAS but can be linked against any version. By the way, MKL supports AVX512, while OpenBLAS does not as of yet. Benchmarks show a factor of 4 between the two for gemm.


It's a factor of three for the large-matrix serial case on KNL -- the OpenBLAS issue on KNL -- whereas you might expect a factor of two by analogy with avx/avx2.

For avx512 (and maybe other x86_64, which is now dynamically dispatched) large BLAS, use BLIS. BLIS also provides a non-BLAS interface. For small matrix multiplication, use libxsmm, of course.

Remember that the world isn't all amd64/x86_64, in which case BLIS is infinitely faster than MKL, and it's probably faster even on Bulldozer/Zen. (I haven't compared on Bulldozer recently, and don't have Zen.)


I was looking at [0] for that number. You're right, it's closer to 3 than 4; I must have rounded ~12k down to 10k and 37k up to 40k. I could imagine some other factors speeding it up further, as well. There were a number of missing instructions in AVX2 that they've filled in for AVX512 which could play a role.

Thanks for the heads-up RE: BLIS, I'd forgotten about them; it's probably the best option, especially considering its open source status.

[0] https://github.com/xianyi/OpenBLAS/issues/991#issuecomment-3...


Check out MOMMS [1], which is a Rust library for matrix multiplication algorithms inspired by/based on BLIS.

1: https://github.com/tlrmchlsmth/momms


Interestingly, given this blog post, I'm 100% sure that there are further improvements that can be made. To my knowledge, Strassen's isn't actually used by most high performance matrix libraries because the constant factors are high enough to outweigh the theoretical gains.



If I'm reading page 65 of that presentation correctly, naive Strassen (what this person did) should be approximately 75% as fast as MKL on an 8 core machine for a 4Kx4K matrix, and even the improved algorithm outlined in the paper is only approximately equivalent, and I wouldn't call it "Strassen's".

I stand by what I said (although that paper was a cool read!)


So ABC Strassen (one of the variants in the paper) crosses over at 4000x4000 on 10 cores and is ahead on fewer cores. I shared this as the current state of performance engineering for Strassen-like algorithms: it offers modest benefits in more practical regimes than the conventional wisdom suggests. I agree that many applications of dense matrices do not benefit.


Worth noting that for many use cases, using the wide floating point vectors from AVX onward is a bad idea. They can take upwards of 2ms to turn on and have weird effects (the delay comes from having to adjust the voltage the core gets).

Unless you know this is what your consumers want, I wouldn’t use it in library code.

See https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e0... for a writeup about it.
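If you want to see whether your own chip does this, a crude probe (my sketch, x86_64 only, and the numbers vary a lot by CPU and OS, so it may not show a clean effect) is to time the first burst of 256-bit work against later ones:

    use std::arch::x86_64::*;
    use std::time::Instant;

    #[target_feature(enable = "avx")]
    unsafe fn avx_burst(data: &mut [f32]) {
        // A chunk of 256-bit adds, enough to force the AVX units awake.
        let ones = _mm256_set1_ps(1.0);
        for chunk in data.chunks_exact_mut(8) {
            let v = _mm256_loadu_ps(chunk.as_ptr());
            _mm256_storeu_ps(chunk.as_mut_ptr(), _mm256_add_ps(v, ones));
        }
    }

    fn main() {
        let mut data = vec![0.0f32; 1 << 16];
        if is_x86_feature_detected!("avx") {
            for i in 0..5 {
                let t = Instant::now();
                unsafe { avx_burst(&mut data) };
                // The first iteration may include the voltage/frequency transition.
                println!("burst {}: {:?}", i, t.elapsed());
            }
        }
    }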


There are reasons you might avoid the widest SIMD in particular cases, particularly for avx2 vs. avx512: the number of available SIMD units, clock speed, and the number of available registers. However, recommending against AVX in library code -- we have dense matrix multiplication here! -- is odd. (Various kernels other than fft and gemm have had non-trivial work to use avxN assembler or intrinsics, and doubtless others could benefit.)

There are relevant snippets of information on SIMD trade-offs around FFTW, BLIS, and, especially, libxsmm. This stuff definitely isn't simple.


Yeah, to be clear, I wasn’t saying it’s definitely the wrong choice for all library code, or anything so strong. More that it’s worth considering the cases your users will actually hit, since for several of them it is likely not the right choice.

For dense matrix multiplication it’s probably worth it, but the article is phrased much more broadly in its conclusions, and does not mention the fact that this issue exists (possibly because the author isn’t aware of it — which I don’t fault them for).

To be clear, a 2ms stall when calling certain functions has the potential to be devastating for anything that wants to run at speeds approximating realtime.


Another approach is to start with the obvious 3 nested loops. Then replace them with a single loop of n^3 iterations. Take the bits of this single loop variable and deinterleave them to obtain the 3 indices of the nested-loop variant. In other words, treat the loop counter as if its bits are a3b3c3a2b2c2a1b1c1a0b0c0 and so on (rough sketch at the end of this comment). This is simpler with 2 interleaved values but still not bad. It produces a sort of 3-dimensional z-order traversal. The overhead can be minimized by looping with a stride of some power of 2 and unrolling the inner portion to do that many multiply-accumulates.

If you do this while also storing the matrix in z-order in memory the offsets in the unrolled part become constants independent of the matrix size.

This technique is applicable to anything using nested loops where the indices take you over a lot of memory repeatedly.

I'm still waiting for a couple of bit interleave and deinterleave instructions, but those wouldn't be usable without intrinsics.
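A rough sketch of the deinterleave step (illustrative only; a real version would unroll the inner iterations and use z-order storage as described above, this just shows the index trick):

    // Deinterleave one Morton-style counter into the three loop indices.
    // Bit layout matches ...a2b2c2a1b1c1a0b0c0 from above (a=i, b=j, c=k).
    fn deinterleave3(mut m: u64) -> (usize, usize, usize) {
        let (mut i, mut j, mut k) = (0usize, 0usize, 0usize);
        let mut t = 0;
        while m != 0 {
            k |= ((m & 1) as usize) << t;
            j |= (((m >> 1) & 1) as usize) << t;
            i |= (((m >> 2) & 1) as usize) << t;
            m >>= 3;
            t += 1;
        }
        (i, j, k)
    }

    // Single-loop multiply of n x n row-major matrices, n a power of two,
    // c assumed zeroed. The traversal is a 3-D z-order walk over (i, j, k).
    fn matmul_zorder(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
        for m in 0..(n * n * n) as u64 {
            let (i, j, k) = deinterleave3(m);
            c[i * n + j] += a[i * n + k] * b[k * n + j];
        }
    }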


Can it reach or exceed GPU performance? Thinking of the ML craze for GPUs.



