I really like this writeup. Note that using SIMD this way (horizontal SIMD, within a single matrix) may not be worth it if you know you will be multiplying many matrices of the same size. It may be better to do vertical SIMD: run the plain scalar algorithm on 4 or 8 matrices at a time, one matrix per lane, like GPUs would do for vertex shaders. This does mean you may have to interleave your matrices in an odd way to optimize memory access, though.
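To make the interleaving concrete, here is a rough sketch of what I mean, assuming 4x4 float matrices and 8 lanes (e.g. one AVX register of floats). The names `MatBatch`, `mul_batch`, `LANES` and `N` are made up for illustration. It uses no intrinsics: the innermost loop runs over lanes with unit stride, which is the shape a compiler can typically auto-vectorize, but whether it actually does depends on your compiler and flags.

```cpp
#include <cassert>
#include <cstddef>

// Assumed sizes for this sketch: 8 lanes of 4x4 float matrices.
constexpr std::size_t LANES = 8;
constexpr std::size_t N = 4;

// Hypothetical interleaved ("structure of arrays") layout:
// e[r][c][lane] holds element (r, c) of matrix number `lane`,
// so the same element of all 8 matrices is contiguous in memory.
struct MatBatch {
    alignas(32) float e[N][N][LANES];
};

// out[lane] = a[lane] * b[lane] for every lane.
// This is the ordinary scalar triple loop; only the innermost
// lane loop is new. `out` must not alias `a` or `b`.
inline void mul_batch(const MatBatch& a, const MatBatch& b, MatBatch& out) {
    assert(&out != &a && &out != &b);
    for (std::size_t r = 0; r < N; ++r) {
        for (std::size_t c = 0; c < N; ++c) {
            float acc[LANES] = {};  // one accumulator per matrix
            for (std::size_t k = 0; k < N; ++k) {
                // Unit-stride over lanes: one multiply-add per matrix.
                for (std::size_t lane = 0; lane < LANES; ++lane) {
                    acc[lane] += a.e[r][k][lane] * b.e[k][c][lane];
                }
            }
            for (std::size_t lane = 0; lane < LANES; ++lane) {
                out.e[r][c][lane] = acc[lane];
            }
        }
    }
}
```

The cost is the layout: if your matrices arrive as ordinary row-major arrays, you have to transpose them into this form first, so it mostly pays off when the data can stay interleaved across many operations.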