Why GEMM is at the heart of deep learning (petewarden.com)
63 points by shepardrtc on April 20, 2015 | 27 comments



Fortran can be a pain to write, but it can lead to very efficient programs, especially with the incorporation of advanced math kernels like ATLAS or MKL. Of course, CUDA can lead to a 10x-40x boost in GEMM performance in certain algorithms as well. If you have a very matrix-heavy hot loop, or could possibly represent an algorithm as a matrix operation [1], it might be worth your time to write some modern Fortran [2].

[1]: http://kukuruku.co/hub/algorithms/using-the-quick-raise-of-m...

[2]: http://fortranwiki.org/fortran/show/Fortran+2008
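
For concreteness, here is a minimal sketch of what "representing the hot loop as a matrix operation" boils down to in practice, written against the CBLAS interface that OpenBLAS/ATLAS/MKL all expose (the sizes, layout choice, and the function name dense_multiply are placeholders, not anything from the article):

    #include <cblas.h>   /* CBLAS interface shipped with OpenBLAS/ATLAS/MKL */

    /* C = 1.0*A*B + 0.0*C, with A (m x k), B (k x n), C (m x n),
       all stored row-major as flat double arrays. */
    void dense_multiply(int m, int n, int k,
                        const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0, A, k,   /* lda = k for row-major A */
                         B, n,   /* ldb = n for row-major B */
                    0.0, C, n);  /* ldc = n for row-major C */
    }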


1. ATLAS hasn't had great performance in ages because it explores the wrong parameter space. OpenBLAS or MKL are much better choices and don't need autotuning.

2. You can call these libraries from any language, so Fortran offers no advantage there.

3. Fortran performance is similar to C and C++ (assuming use of restrict where needed; there's a small sketch of restrict at the end of this comment).

4. CUDA will not lead to a 10-40x boost unless your CPU implementation is garbage. The realistic standard is about 2x when normalized by watts; see the Green500 for example. The 2x also holds for price when using server-grade parts (like Tesla), but GPUs come out further ahead if you buy consumer grade. Note that you still have to pay to buy and power a host, and that GPUs are a lot less versatile in terms of problem size/turn-around time and workload. They have their place, but evaluate your workload carefully before concluding that the cost proposition favors GPUs.
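
As a small illustration of the restrict point above (a toy sketch, not a benchmark): the keyword promises the compiler that the arrays don't alias, which is roughly the guarantee a Fortran compiler gets for free on dummy array arguments, and it lets loops like this vectorize cleanly.

    #include <stddef.h>

    /* Without restrict, the compiler must assume x and y might overlap and
       be conservative about reloading memory each iteration; with restrict
       it can vectorize freely, much like Fortran does for array arguments. */
    void axpy(size_t n, double a,
              const double *restrict x, double *restrict y)
    {
        for (size_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }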


1. I certainly did not mean to imply that ATLAS or MKL were the best CPU linear algebra libraries out there. Thanks for pointing out OpenBLAS and offering that comparison.

2. If I were to run a highly iterative algorithm, say with 100,000 or more iterations, would the memory marshaling not be significant? Or would it be unnoticeable since I would be using these libraries anyway? I also personally find the array notation and slicing to be quite nice.

3. The use of "restrict" is nice. I was not aware of that feature. Looking at this reference [1], it seems that Fortran basically does the equivalent of automatically using "restrict".

4. According to NVIDIA [2], cuBLAS outperforms MKL by 6x-17x (it seems MKL is catching up since I last checked). Is this misleading by NVIDIA? What is your expectation for a reasonable execution speed increase? The fixed-cost efficiency and variable-cost efficiency are good points to bring into the discussion. Your point about carefully evaluating workload is also a good one. I don't want people to think that throwing an algorithm into CUDA will magically speed up the overall program. Network I/O, disk I/O, and device (GPU) I/O are all important bottlenecks to consider.

[1]: http://www-cs-students.stanford.edu/~blynn/c/ch05.html

[2]: https://developer.nvidia.com/cuBLAS


2. Most languages have a way to store raw arrays that can be passed directly to the numerical routines. If you have to marshal, then it depends on the problem size and amount of work done in the numerical routine.

3. Yes.

4. NVIDIA has a track record of misleading comparisons. They cleaned up their act in these CUDA-7 benchmarks: http://devblogs.nvidia.com/parallelforall/cuda-7-release-can... in response to this G+ discussion https://plus.google.com/+JeffHammondScience/posts/G1MzHqZaxy... . Thanks to Szilárd Páll for calling attention to this discussion and to Mark Harris for updating the plots. Note that this should not be interpreted as "MKL/Xeon is catching up" but rather that performance comparisons are sensitive to the details of the experiment, and it can be hard to recognize the consequences of biased configurations. Better normalization and standards for comparison can help.


1. Some time ago I did some benchmarking of different BLAS implementations on our HPC systems using different CPUs (http://stackoverflow.com/questions/5260068/multithreaded-bla...). MKL and OpenBLAS performed best, but ATLAS is not that bad. Of course I would still rather use OpenBLAS/MKL than ATLAS, because they perform better and are also much easier to install.


It never occurred to me to consider the performance per watt of CUDA vs C/C++/Fortran on the CPU (instead of only raw performance). So should one consider CUDA only when favoring raw performance over cost (and only when the algorithm yields a significant improvement when implemented to run on a GPU)?


CUDA on consumer GPUs, for workloads that execute efficiently on those architectures, can provide about a 2x benefit in energy efficiency and somewhat more in acquisition cost, provided there is enough work at each stage of your algorithm to amortize kernel launch latency and PCIe transfer latency and bandwidth. If you buy server-grade GPUs, the cost proposition is similar to power -- at best a ~2x benefit.
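
To make the "amortize PCIe transfer" point concrete, here is a hedged host-side C sketch of a single cuBLAS GEMM (error checking omitted; gpu_dgemm is just an illustrative name). The cudaMemcpy calls are the PCIe traffic: for small matrices they, plus the launch overhead, can easily dominate the multiply itself.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* C = A*B on the GPU; all matrices are n x n, column-major doubles. */
    void gpu_dgemm(int n, const double *A, const double *B, double *C)
    {
        size_t bytes = (size_t)n * n * sizeof(double);
        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);  /* host -> device */
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
        cublasDestroy(handle);

        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);  /* device -> host */
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }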


You should use FFTs for the convolution, see http://arxiv.org/abs/1312.5851

Also, while the original BLAS was in Fortran, you probably want OpenBLAS, which is in C. Or maybe Intel's MKL.


FWIW, FFT-based convolutional methods aren't better for a lot of common sizes (including a bunch of AlexNet/VGG/GoogLeNet sizes). http://arxiv.org/abs/1412.7580 has a pretty thorough evaluation of im2col + gemm vs fused gemm (cudnn) vs fft.
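
For anyone unfamiliar with the "im2col + gemm" lowering the paper compares against: each input patch is copied into a column of a large intermediate matrix, and the whole convolution then becomes one GEMM between the filter bank and that matrix. A rough single-channel sketch (real implementations also handle multiple channels, padding, and strides):

    /* Lower a convolution to GEMM: each k x k patch of the (h x w) input
       becomes one column of "cols", which has k*k rows and
       (h-k+1)*(w-k+1) columns. A (num_filters x k*k) filter matrix times
       "cols" is then a plain GEMM producing every output pixel at once. */
    void im2col_single_channel(const float *img, int h, int w, int k,
                               float *cols)
    {
        int out_h = h - k + 1, out_w = w - k + 1;
        for (int oy = 0; oy < out_h; ++oy)
            for (int ox = 0; ox < out_w; ++ox)
                for (int ky = 0; ky < k; ++ky)
                    for (int kx = 0; kx < k; ++kx) {
                        int row = ky * k + kx;      /* position in the patch */
                        int col = oy * out_w + ox;  /* which output pixel    */
                        cols[row * (out_h * out_w) + col] =
                            img[(oy + ky) * w + (ox + kx)];
                    }
    }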


Thanks for the article tip, I hadn't seen it. But I don't agree with your conclusion. In figures 1-6, you should look at the last column, because the other channel counts are quite low. Knowing that, the space-domain algorithm only really has a chance at 3x3 filters, which is quite small.

And since they claim fbfft was even faster than cuFFT, the situation should look even better for FFT.

EDIT: And they are about tied in the 3x3, 64 case.


The current trend in convolutional neural networks seems to be moving toward more convolutions with smaller kernels. This year's second-place ILSVRC winner (VGG) is essentially a giant stack of 19 convolution layers with super tiny kernels, sandwiched between nonlinearities and pooling. "To reduce the number of parameters in such very deep networks, we use very small 3×3 filters in all convolutional layers" http://arxiv.org/pdf/1409.1556.pdf


The article actually touches on this issue a bit.

"The main competitor to the GEMM approach is using Fourier transforms to do the operation in frequency space, but the use of strides in our convolutions makes it hard to be as efficient."


For 'small' workloads, it's not uncommon to find situations where direct computation might make more sense. The author touches on that a bit, as others have pointed out.


According to Github, 51.6% of the code in OpenBLAS is Fortran:

https://github.com/xianyi/OpenBLAS


Oops! The "A x B = C" figure is exactly backwards. A matrix with shape (r, k) times a matrix with shape (k, c) should return a matrix with shape (r, c). The figure incorrectly claims that a matrix with shape (k, m) times a matrix with shape (n, k) returns a matrix with shape (n, m), which is only true if all of these numbers are equal (e.g. square matrices).

Kind of an embarrassing typo for an article that is essentially about matrix multiplication.

Here is a corrected (transposed) version of that figure: http://i.imgur.com/Bo6CJCv.png

Edit: It looks like every figure in the article suffers from this problem. I know FORTRAN matrices store each column contiguously in memory, but whether a matrix is accessed like *(ptr + r * ncols + c) or *(ptr + c * nrows + r) is (in my opinion) a low-level detail and shouldn't affect our thinking about how multiplication works.
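
To spell out the layout point in code (just an illustration of the two conventions, not anything from the article): the same flat buffer can be read either way, and only the index arithmetic differs; the mathematical shape of the matrix does not.

    /* Row-major ("C-style") element (r, c) of a matrix with ncols columns. */
    double get_row_major(const double *ptr, int r, int c, int ncols)
    {
        return *(ptr + r * ncols + c);
    }

    /* Column-major ("Fortran-style") element (r, c) with nrows rows. */
    double get_col_major(const double *ptr, int r, int c, int nrows)
    {
        return *(ptr + c * nrows + r);
    }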


Either I'm missing something here or the article is correct. Generally, the first dimension of a matrix corresponds to the rows and the second dimension to the columns. So in the original figure, A is m x k, B is k x n, and C is m x n, and this seems right. See http://en.wikipedia.org/wiki/Matrix_%28mathematics%29#Size


I think the author revised the figure(s) between the time of the parent comment (by gcr) and your comment. At least, the A * B = C figure's filename seems to imply a revision [1].

EDIT: yep, the figures were revised. Compare the corrected version [1] vs the original [2].

[1] https://petewarden.files.wordpress.com/2015/04/gemm_correcte...

[2] https://petewarden.files.wordpress.com/2015/04/gemm.png

EDIT 2: I completely missed that the author put a notice (about having revised the figures) at the bottom of the post.


Yes, that was my screwup, sorry for any confusion! Despite multiple linear algebra courses and working in 3D graphics for a decade, I still haven't internalized the basics of matrix notation. It doesn't help that my usual sandbox (Eigen) is column major by default, which is the wrong in-memory order for my raster-image-trained brain to visualize. Funnily enough, I find tensor notation a bit easier, despite being less familiar.


No worries. Also, I found the article very interesting and informative. Thanks!


Being computationally limited by GEMM is both a benefit and a pitfall. The operation itself is highly parallelizable and typically compute-bound (as opposed to memory-bound). As a result, advances in GPU performance will translate into improvements in GEMM. However, it also means that algorithmically you're limited by the hardware. GEMM is an _extremely_ mature operation, so algorithmic improvements are unlikely. Instead, any algorithmic improvements would likely stem from replacing GEMM with a more tailored, specialized operation.
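
A rough back-of-the-envelope number for why GEMM tends to be compute-bound: multiplying two n x n matrices costs about 2n^3 flops but only touches 3n^2 values. For n = 1024 in double precision that is roughly 2.1 billion flops against about 25 MB of operands, on the order of 85 flops per byte if each operand is moved only once -- far above the flop-per-byte ratio needed to keep the arithmetic units busy rather than waiting on memory, so a well-blocked GEMM saturates the ALUs instead of the memory bus.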


Different parts of the computation are bound in different ways. The "rule of thumb" for AlexNet-derived networks (with conv layers followed by fully connected layers) is that 95% of the computation time is spent in the conv layers, but 95% of the parameters are stored as the weights of the FC layers. I imagine the latter stage could be memory-bound while the former stage is compute-bound.

Not sure if this holds true for other NN structures like GoogLeNet or the weirder recent fully convolutional networks.


great writeup!

in general, to make deep learning faster, avoid using backpropagation


There's a bit of confusion here, because backpropagation is only used to train the network. After your network is trained, you don't want to change it, so you wouldn't run backpropagation in production.

As for making training faster, second-order methods don't generally work as well as backpropagation for large networks. Some notes from Yann LeCun: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf


What else do you use to train the network?



I would be very interested to see if this approach outperforms more conventional deep networks trained with backpropagation.

It's important to note that (from what I understand) this paper seems to focus on building single-layer NNs, which are quite different from the behemoth AlexNet that this article focuses on.


Very early on, people experimented heavily with non-gradient based optimization (genetic algorithms and the like) to optimize neural networks, but the results were rather poor.

For very large parameter spaces, gradient-based optimization is really the only game in town. There are also lots of clever tricks (gradient clipping, momentum, adagrad, etc.) involved in getting the optimization to converge well.
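
For concreteness, one of those tricks, classical momentum, is only a couple of lines on top of plain gradient descent (a toy sketch; w, v, lr, and mu are just illustrative names and hyperparameters, not from any particular library):

    /* One SGD-with-momentum update of an n-element parameter vector w.
       The velocity v persists across iterations; mu (~0.9) and lr are
       hyperparameters to tune. */
    void sgd_momentum_step(int n, double *w, double *v,
                           const double *grad, double lr, double mu)
    {
        for (int i = 0; i < n; ++i) {
            v[i] = mu * v[i] - lr * grad[i];  /* accumulate a velocity   */
            w[i] += v[i];                     /* step along the velocity */
        }
    }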



