Why GEMM is at the heart of deep learning (petewarden.com)
63 points by shepardrtc on April 20, 2015 | 27 comments



Fortran can be a pain to write, but it can lead to very efficient programs, especially with the incorporation of advanced math kernels like ATLAS or MKL. Of course, CUDA can lead to a 10x-40x boost in GEMM performance in certain algorithms as well. If you have a very matrix-heavy hot loop, or could possibly represent an algorithm as a matrix operation [1], it might be worth your time to write some modern Fortran [2].

[1]: http://kukuruku.co/hub/algorithms/using-the-quick-raise-of-m...

[2]: http://fortranwiki.org/fortran/show/Fortran+2008
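
For concreteness, here is a minimal sketch of what "representing the hot loop as a matrix operation" boils down to in practice, written against the CBLAS interface that OpenBLAS/ATLAS/MKL all expose (the sizes, layout choice, and the function name dense_multiply are placeholders, not anything from the article):

    #include <cblas.h>   /* CBLAS interface shipped with OpenBLAS/ATLAS/MKL */

    /* C = 1.0*A*B + 0.0*C, with A (m x k), B (k x n), C (m x n),
       all stored row-major as flat double arrays. */
    void dense_multiply(int m, int n, int k,
                        const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0, A, k,   /* lda = k for row-major A */
                         B, n,   /* ldb = n for row-major B */
                    0.0, C, n);  /* ldc = n for row-major C */
    }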


1. ATLAS hasn't had great performance in ages because it explores the wrong parameter space. OpenBLAS or MKL are much better choices and don't need autotuning.

2. You can call these libraries from any language, so Fortran offers no advantage there.

3. Fortran performance is similar to C and C++ (assuming use of restrict where needed; there's a small sketch of restrict at the end of this comment).

4. CUDA will not lead to a 10-40x boost unless your CPU implementation is garbage. The realistic standard is about 2x when normalized by watts; see the Green500 for example. The 2x also holds for price when using server-grade parts (like Tesla), but GPUs come out further ahead if you buy consumer grade. Note that you still have to pay to buy and power a host, and that GPUs are a lot less versatile in terms of problem size/turn-around time and workload. They have their place, but evaluate your workload carefully before concluding that the cost proposition favors GPUs.
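
As a small illustration of the restrict point above (a toy sketch, not a benchmark): the keyword promises the compiler that the arrays don't alias, which is roughly the guarantee a Fortran compiler gets for free on dummy array arguments, and it lets loops like this vectorize cleanly.

    #include <stddef.h>

    /* Without restrict, the compiler must assume x and y might overlap and
       be conservative about reloading memory each iteration; with restrict
       it can vectorize freely, much like Fortran does for array arguments. */
    void axpy(size_t n, double a,
              const double *restrict x, double *restrict y)
    {
        for (size_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }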


1. I certainly did not mean to imply that ATLAS or MKL were the best CPU linear algebra libraries out there. Thanks for pointing out OpenBLAS and offering that comparison.

2. If I were to run a highly iterative algorithm, say with 100,000 or more iterations, would the memory marshaling not be significant? Or would it be unnoticeable since I would be using these libraries anyway? I also personally find the array notation and slicing to be quite nice.

3. The use of "restrict" is nice. I was not aware of that feature. Looking at this reference [1], it seems that Fortran basically does the equivalent of automatically using "restrict".

4. According to NVIDIA [2], cuBLAS outperforms MKL by 6x-17x (it seems MKL is catching up since I last checked). Is this misleading by NVIDIA? What is your expectation for a reasonable execution speed increase? The fixed-cost efficiency and variable-cost efficiency are good points to bring into the discussion. Your point about carefully evaluating workload is also a good one. I don't want people to think that throwing an algorithm into CUDA will magically speed up the overall program. Network I/O, disk I/O, and device (GPU) I/O are all important bottlenecks to consider.

[1]: http://www-cs-students.stanford.edu/~blynn/c/ch05.html

[2]: https://developer.nvidia.com/cuBLAS


2. Most languages have a way to store raw arrays that can be passed directly to the numerical routines. If you have to marshal, then it depends on the problem size and amount of work done in the numerical routine.

3. Yes.

4. NVIDIA has a track record of misleading comparisons. They cleaned up their act in these CUDA-7 benchmarks: http://devblogs.nvidia.com/parallelforall/cuda-7-release-can... in response to this G+ discussion https://plus.google.com/+JeffHammondScience/posts/G1MzHqZaxy... . Thanks to Szilárd Páll for calling attention to this discussion and to Mark Harris for updating the plots. Note that this should not be interpreted as "MKL/Xeon is catching up" but rather that performance comparisons are sensitive to the details of the experiment, and it can be hard to recognize the consequences of biased configurations. Better normalization and standards for comparison can help.


1. Some time ago I did some benchmarking of different BLAS implementations on our HPC systems using different CPUs (http://stackoverflow.com/questions/5260068/multithreaded-bla...). MKL and OpenBLAS performed best, but ATLAS is not that bad. Of course I would still rather use OpenBLAS/MKL than ATLAS, because they perform better and are also much easier to install.


It never occurred to me to consider the performance per watt of CUDA vs C/C++/Fortran on the CPU (instead of only raw performance). So should one consider CUDA only when favoring raw performance over cost (and only when the algorithm yields a significant improvement when implemented to run on a GPU)?


CUDA on consumer GPUs, for workloads that execute efficiently on those architectures, can provide about a 2x benefit in energy efficiency and somewhat more in acquisition cost, provided there is enough work at each stage of your algorithm to amortize kernel launch latency and PCIe transfer latency and bandwidth. If you buy server-grade GPUs, the cost proposition is similar to power -- at best a ~2x benefit.
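
To make the "amortize PCIe transfer" point concrete, here is a hedged host-side C sketch of a single cuBLAS GEMM (error checking omitted; gpu_dgemm is just an illustrative name). The cudaMemcpy calls are the PCIe traffic: for small matrices they, plus the launch overhead, can easily dominate the multiply itself.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* C = A*B on the GPU; all matrices are n x n, column-major doubles. */
    void gpu_dgemm(int n, const double *A, const double *B, double *C)
    {
        size_t bytes = (size_t)n * n * sizeof(double);
        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);  /* host -> device */
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
        cublasDestroy(handle);

        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);  /* device -> host */
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }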


You should use FFTs for the convolution, see http://arxiv.org/abs/1312.5851

Also, while the original BLAS was in Fortran, you probably want OpenBLAS, which is in C. Or maybe Intel's MKL.


FWIW, FFT-based convolutional methods aren't better for a lot of common sizes (including a bunch of AlexNet/VGG/GoogLeNet sizes). http://arxiv.org/abs/1412.7580 has a pretty thorough evaluation of im2col + gemm vs fused gemm (cudnn) vs fft.
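
For anyone unfamiliar with the "im2col + gemm" lowering the paper compares against: each input patch is copied into a column of a large intermediate matrix, and the whole convolution then becomes one GEMM between the filter bank and that matrix. A rough single-channel sketch (real implementations also handle multiple channels, padding, and strides):

    /* Lower a convolution to GEMM: each k x k patch of the (h x w) input
       becomes one column of "cols", which has k*k rows and
       (h-k+1)*(w-k+1) columns. A (num_filters x k*k) filter matrix times
       "cols" is then a plain GEMM producing every output pixel at once. */
    void im2col_single_channel(const float *img, int h, int w, int k,
                               float *cols)
    {
        int out_h = h - k + 1, out_w = w - k + 1;
        for (int oy = 0; oy < out_h; ++oy)
            for (int ox = 0; ox < out_w; ++ox)
                for (int ky = 0; ky < k; ++ky)
                    for (int kx = 0; kx < k; ++kx) {
                        int row = ky * k + kx;      /* position in the patch */
                        int col = oy * out_w + ox;  /* which output pixel    */
                        cols[row * (out_h * out_w) + col] =
                            img[(oy + ky) * w + (ox + kx)];
                    }
    }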


Thanks for the article tip, I hadn't seen it. But I don't agree with your conclusion. In figures 1-6, you should look at the last column, because the other channel counts are quite low. Knowing that, the space-domain algorithm only really has a chance at 3x3 filters, which is quite small.

And since they claim fbfft was even faster than cuFFT, the situation should look even better for FFT.

EDIT: And they are about tied in the 3x3, 64 case.


The current trend in convolutional neural networks seems to be moving toward more convolutions with smaller kernels. This year's second-place ILSVRC winner (VGG) is essentially a giant stack of 19 convolution layers with super tiny kernels, sandwiched between nonlinearities and pooling. "To reduce the number of parameters in such very deep networks, we use very small 3×3 filters in all convolutional layers" http://arxiv.org/pdf/1409.1556.pdf


The article actually touches on this issue a bit.

"The main competitor to the GEMM approach is using Fourier transforms to do the operation in frequency space, but the use of strides in our convolutions makes it hard to be as efficient."


For 'small' workloads, it's not uncommon to find situations where direct computation might make more sense. The author touches on that a bit, as others have pointed out.


According to Github, 51.6% of the code in OpenBLAS is Fortran:

https://github.com/xianyi/OpenBLAS


Oops! The "A x B = C" figure is exactly backwards. A matrix with shape (r, k) times a matrix with shape (k, c) should return a matrix with shape (r, c). The figure incorrectly claims that a matrix with shape (k, m) times a matrix with shape (n, k) returns a matrix with shape (n, m), which is only true if all of these numbers are equal (e.g. square matrices).

Kind of an embarrassing typo for an article that is essentially about matrix multiplication.

Here is a corrected (transposed) version of that figure: http://i.imgur.com/Bo6CJCv.png

Edit: It looks like every figure in the article suffers from this problem. I know FORTRAN matrices store each column contiguously in memory, but whether a matrix is accessed like *(ptr + r * ncols + c) or *(ptr + c * nrows + r) is (in my opinion) a low-level detail and shouldn't affect our thinking about how multiplication works.
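
To spell out the layout point in code (just an illustration of the two conventions, not anything from the article): the same flat buffer can be read either way, and only the index arithmetic differs; the mathematical shape of the matrix does not.

    /* Row-major ("C-style") element (r, c) of a matrix with ncols columns. */
    double get_row_major(const double *ptr, int r, int c, int ncols)
    {
        return *(ptr + r * ncols + c);
    }

    /* Column-major ("Fortran-style") element (r, c) with nrows rows. */
    double get_col_major(const double *ptr, int r, int c, int nrows)
    {
        return *(ptr + c * nrows + r);
    }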


Either I'm missing something here or the article is correct. Generally, the first dimension of a matrix corresponds to the rows and the second dimension to the columns. So in the original figure, A is m x k, B is k x n, and C is m x n, and this seems right. See http://en.wikipedia.org/wiki/Matrix_%28mathematics%29#Size


I think the author revised the figure(s) between the time of the parent comment (by gcr) and your comment. At least, the A * B = C figure's filename seems to imply a revision [1].

EDIT: yep, the figures were revised. Compare the corrected version [1] vs the original [2].

[1] https://petewarden.files.wordpress.com/2015/04/gemm_correcte...

[2] https://petewarden.files.wordpress.com/2015/04/gemm.png

EDIT 2: I completely missed that the author put a notice (about having revised the figures) at the bottom of the post.


Yes, that was my screwup, sorry for any confusion! Despite multiple linear algebra courses and working in 3D graphics for a decade, I still haven't internalized the basics of matrix notation. It doesn't help that my usual sandbox (Eigen) is column major by default, which is the wrong in-memory order for my raster-image-trained brain to visualize. Funnily enough, I find tensor notation a bit easier, despite being less familiar.


No worries. Also, I found the article very interesting and informative. Thanks!


Being computationally limited by GEMM is both a benefit and a pitfall. The operation itself is highly parallelizable and typically compute-bound (as opposed to memory-bound). As a result, advances in GPU performance will translate into improvements in GEMM. However, it also means that algorithmically you're limited by the hardware. GEMM is an _extremely_ mature operation, so algorithmic improvements are unlikely. Instead, any algorithmic improvements would likely stem from replacing GEMM with a more tailored, specialized operation.
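
A rough back-of-the-envelope number for why GEMM tends to be compute-bound: multiplying two n x n matrices costs about 2n^3 flops but only touches 3n^2 values. For n = 1024 in double precision that is roughly 2.1 billion flops against about 25 MB of operands, on the order of 85 flops per byte if each operand is moved only once -- far above the flop-per-byte ratio needed to keep the arithmetic units busy rather than waiting on memory, so a well-blocked GEMM saturates the ALUs instead of the memory bus.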


Different parts of the computation are bound in different ways. The "rule of thumb" for AlexNet-derived networks (with conv layers followed by fully connected layers) is that 95% of the computation time is spent in the conv layers, but 95% of the parameters are stored as the weights of the FC layers. I imagine the latter stage could be memory-bound while the former stage is compute-bound.

Not sure if this holds true for other NN structures like GoogLeNet or the weirder recent fully convolutional networks.


great writeup!

in general, to make deep learning faster, avoid using backpropagation


There's a bit of confusion here, because backpropagation is only used to train the network. After your network is trained, you don't want to change it, so you wouldn't run backpropagation in production.

As for making training faster, second-order methods don't generally work as well as backpropagation for large networks. Some notes from Yann LeCun: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf


What else do you use to train the network?



I would be very interested to see if this approach outperforms more conventional deep networks trained with backpropagation.

It's important to note that (from what I understand) this paper seems to focus on building single-layer NNs, which are quite different from the behemoth AlexNet that this article focuses on.


Very early on, people experimented heavily with non-gradient based optimization (genetic algorithms and the like) to optimize neural networks, but the results were rather poor.

For very large parameter spaces, gradient-based optimization is really the only game in town. There are also lots of clever tricks (gradient clipping, momentum, adagrad, etc.) involved in getting the optimization to converge well.
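
For concreteness, one of those tricks, classical momentum, is only a couple of lines on top of plain gradient descent (a toy sketch; w, v, lr, and mu are just illustrative names and hyperparameters, not from any particular library):

    /* One SGD-with-momentum update of an n-element parameter vector w.
       The velocity v persists across iterations; mu (~0.9) and lr are
       hyperparameters to tune. */
    void sgd_momentum_step(int n, double *w, double *v,
                           const double *grad, double lr, double mu)
    {
        for (int i = 0; i < n; ++i) {
            v[i] = mu * v[i] - lr * grad[i];  /* accumulate a velocity   */
            w[i] += v[i];                     /* step along the velocity */
        }
    }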



