In general, which public implementation of spatial convolution is fastest depends so heavily on the kernel parameters (and even then, there are often knobs you can twiddle for a given implementation that can lead to substantial speedups) and on how much GPU memory you are willing to spend (for the GEMM- and FFT-based methods) that the best approach is probably just to benchmark a bunch of implementations on your network and choose the best one.
There are plans for Theano to build 'meta optimizers' that determine the fastest implementation separately for each layer in a network, and for each 'pass' (forward, backward w.r.t. weights, backward w.r.t. input): https://github.com/Theano/Theano/issues/2072
Sort of like how the FFTW library for FFT computation tries out a bunch of different 'plans' and then uses the fastest. I really hope this idea comes to fruition, as it would automate the "benchmark a bunch of implementations and choose the best" step, and make the process more granular.
Meta-optimizers will need to factor in a few more things, like memory usage. Right now the FFT modules are insanely memory-hungry, and the versions written with reasonable memory usage (Michael Mathieu wrote one) are not as fast as batched cuFFT + cuBLAS.
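To make the "benchmark everything and pick the winner" idea concrete, here's a rough sketch. The two numpy implementations below (a naive direct convolution and an FFT-based one) are just CPU-side stand-ins for the real GPU kernels you'd actually be comparing (GEMM, cuFFT, cuDNN, ...), and it only measures time, not the memory usage mentioned above:

    # Sketch: time each available convolution implementation per layer
    # configuration and pick the fastest. Shapes are made up for illustration.
    import time
    import numpy as np

    def direct_conv2d(img, kern):
        """Valid-mode 2D convolution via explicit loops (stand-in for a direct/GEMM kernel)."""
        H, W = img.shape
        kh, kw = kern.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern[::-1, ::-1])
        return out

    def fft_conv2d(img, kern):
        """Valid-mode 2D convolution via FFT (stand-in for a cuFFT-based kernel)."""
        H, W = img.shape
        kh, kw = kern.shape
        # circular convolution at the image size; the 'valid' region is alias-free
        full = np.fft.irfft2(np.fft.rfft2(img) * np.fft.rfft2(kern, img.shape))
        return full[kh - 1:H, kw - 1:W]

    def pick_fastest(impls, img, kern, reps=3):
        timings = {}
        for name, fn in impls.items():
            t0 = time.time()
            for _ in range(reps):
                fn(img, kern)
            timings[name] = (time.time() - t0) / reps
        return min(timings, key=timings.get), timings

    impls = {"direct": direct_conv2d, "fft": fft_conv2d}
    for ksize in (3, 9, 15):  # small vs large filters favour different implementations
        img, kern = np.random.randn(128, 128), np.random.randn(ksize, ksize)
        best, timings = pick_fastest(impls, img, kern)
        print(ksize, best, timings)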
I don't know where this figure comes from: "Nvidia's official benchmarks show a 10x speed-up when using Caffe with cuDNN." The graph in the original announcement ( http://devblogs.nvidia.com/parallelforall/accelerate-machine... ) shows an 11x speed-up for using Caffe on the GPU versus CPU, and a 14x speed-up for using Caffe + cuDNN (again compared to CPU).
Then again, these figures are also kind of meaningless, as the specifics of the experiments aren't given. There are a lot of parameters that affect the performance of different implementations in different ways: the number of input feature maps, the number of filters, the filter width/height, and the input width/height. The FFT approach tends to do well for large filter sizes and lots of input feature maps, for example.
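For a rough sense of why filter size matters so much, here's a back-of-the-envelope FLOP estimate. The constants (5·n·log2(n) per FFT, 8 ops per complex multiply-accumulate) and the layer shape are made up for illustration, and it ignores the memory overhead and constant factors that hurt the FFT approach in practice:

    # Rough FLOP estimates: direct convolution scales with k^2, the FFT
    # approach's pointwise stage does not, so large filters favour FFT.
    import math

    def direct_flops(batch, c_in, c_out, h, w, k):
        # one multiply-add per filter tap, per output pixel, per (c_in, c_out) pair
        return 2.0 * batch * c_in * c_out * h * w * k * k

    def fft_flops(batch, c_in, c_out, h, w, k):
        n = h * w
        fft = 5.0 * n * math.log2(n)                 # rough cost of one 2D FFT/IFFT
        transforms = (batch * c_in + c_in * c_out + batch * c_out) * fft
        pointwise = 8.0 * batch * c_in * c_out * n   # complex MACs, independent of k
        return transforms + pointwise

    for k in (3, 5, 7, 11):
        d = direct_flops(128, 96, 256, 32, 32, k)
        f = fft_flops(128, 96, 256, 32, 32, k)
        print("k=%2d  direct/fft flop ratio: %.2f" % (k, d / f))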
I've played around with cuDNN a bit using Theano. The Theano bindings are a work in progress, but they've already wrapped the convolution. Compared to conv_gemm (Theano's version of Caffe's GEMM convolution approach), it seems to be sometimes faster, sometimes slower. Soumith Chintala maintains a GitHub repository with benchmarks of various convolution implementations: https://github.com/soumith/convnet-benchmarks
His results aren't that spectacular either.
In my own experiments cuDNN did pretty well for very small filter sizes (e.g. 3x3), often beating even the memory-hungry FFT approach. This is great because the top two scoring entries in the 2014 ImageNet competition made use of lots of convolutional layers with small filters.
Of course the main benefit of cuDNN is that it will become faster over time, and that it will always be adapted to the latest NVIDIA GPUs without requiring code changes (provided that they keep maintaining it).
You can probably figure out why 3x3 or smaller convolutions would be faster without much thought. Hint: FP performance is improving much faster than either internal bandwidth or CUDA kernel launch latency.
And IMO this opens up the window for future GPUs to do even better. That said, when 9 TFLOPs is ~$1,100 (2 x GTX 980), I'm not too worried about memory usage in the long term.
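One way to put numbers on the FLOPs-vs-bandwidth point: a rough arithmetic-intensity estimate for a made-up 3x3 layer, compared against approximate GTX 980 specs (~4.6 TFLOP/s single precision, ~224 GB/s):

    # Back-of-the-envelope arithmetic intensity of a 3x3 conv layer versus the
    # GPU's FLOP/byte balance point. Layer shape is invented for illustration.
    batch, c_in, c_out, h, w, k = 128, 64, 64, 56, 56, 3

    flops = 2.0 * batch * c_in * c_out * h * w * k * k
    # bytes that must cross device memory at least once, at 4 bytes per float
    bytes_moved = 4.0 * (batch * c_in * h * w      # input
                         + batch * c_out * h * w   # output
                         + c_in * c_out * k * k)   # weights

    intensity = flops / bytes_moved        # FLOPs per byte for this layer
    device_ratio = 4.6e12 / 224e9          # FLOPs per byte the GPU can sustain

    print("layer intensity: ~%.0f flop/byte, GPU balance point: ~%.0f flop/byte"
          % (intensity, device_ratio))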
Neophyte question: is floating point necessary for a CNN, as opposed to using 32 bits in some fixed encoding? If I used a 32-bit value split on the order of +-4.27 or +-8.23, wouldn't that have enough accuracy? I'm assuming the weights and parameters don't go much above, say, 8 or 16 after all the ReLU stuff.
Probably not. The noise in the training data itself, noise from dropout and various other sources are going to trump any quantization noise anyway, so you can use fairly inaccurate representations. This is why everyone's using gamer GPUs for this stuff, and not Tesla cards: single precision is enough, and much cheaper to come by.
Although, I guess the cutoff might have to be a bit higher than 8.23. Maybe the neuron activations would never exceed that range, but some intermediate computations could.
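If I'm reading the +-8.23 notation as a sign bit, 8 integer bits, and 23 fraction bits, here's a quick sketch of what that buys you: a range of roughly +-256 with steps of 2^-23 (about 1.2e-7), so the quantization noise really is tiny compared to the other noise sources mentioned above:

    # Sketch of a signed Q8.23 fixed-point round trip (this reading of the
    # notation is my assumption, not the poster's definition).
    import numpy as np

    FRAC_BITS = 23
    SCALE = 2 ** FRAC_BITS              # Q8.23: sign bit + 8 integer bits + 23 fraction bits

    def to_q8_23(x):
        q = np.round(np.asarray(x, dtype=np.float64) * SCALE)
        q = np.clip(q, -(2 ** 31), 2 ** 31 - 1)   # saturate at ~+-256 instead of wrapping
        return q.astype(np.int32)

    def from_q8_23(q):
        return q.astype(np.float64) / SCALE

    w = np.random.randn(5).astype(np.float32)
    rt = from_q8_23(to_q8_23(w))
    print("original  :", w)
    print("round trip:", rt)
    print("max error :", np.max(np.abs(w.astype(np.float64) - rt)))   # ~6e-8, i.e. 2**-24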
Supposedly the new Maxwell GPUs have some new instructions for working with half-precision floats ( https://developer.nvidia.com/sites/default/files/akamai/open... ). I wonder how complete this implementation is, because half-precision might be sufficient to train a convnet, and this would result in a significant speedup.
In these deep ReLU networks that are not renormalized in between (like OverFeat, which has no normalization), some of the activations get pretty big (on the order of 1e3)!
You also can't clip them and get away with it; you have to either renormalize the layers to do half precision (and live with the extra cost) or stick to full precision. I was doing fun stuff early this year with fixed-precision nets (8-bit/16-bit). Things get very interesting :)
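Not the poster's actual code, but a sketch of the "renormalize rather than clip" idea for low-precision activations: pick a per-layer scale from the observed dynamic range, quantize to int8, and carry the scale along instead of silently clipping values that blow past a fixed range:

    # Per-layer rescaling before 8-bit quantization, so activations that reach
    # ~1e3 still fit without clipping. Shapes and magnitudes are made up.
    import numpy as np

    def quantize_per_layer(acts, bits=8):
        qmax = 2 ** (bits - 1) - 1                   # 127 for int8
        scale = np.max(np.abs(acts)) / qmax          # per-layer scale factor
        q = np.round(acts / scale).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    # ReLU-style activations with a long tail reaching ~1e3, as described above
    acts = np.abs(np.random.randn(4096).astype(np.float32)) * 300.0
    q, scale = quantize_per_layer(acts)
    err = np.max(np.abs(acts - dequantize(q, scale)))
    print("scale=%.3f  max abs error=%.3f" % (scale, err))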
That's plenty IMO for most inputs and weights. Where it gets tricky is in accumulation. You could constrain the weights for each unit I guess, but this is the sort of work best done under the hood rather than by the data scientist IMO. I'd personally choose 32-bit accumulation just because it would drastically simplify code development.
I've also worked with fixed precision elsewhere. It's awesome if you understand the dynamic range of your application. It's a migraine headache if you don't.
Say what? You can get insane single-precision efficiency out of GTX 980. Double-precision OTOH is a disaster (1/150, yes, 1/150, as opposed to GK104's 1/30 or so). But for most machine learning, double-precision is overkill.
IMO what remains to be seen is what one could do with FP16 data mixed with FP32 accumulation. My signal processing friends think a pure FP16 network is just asking for trouble and are of the opinion that a minimum of 20 bits is needed for stable accumulation and 24 is more than enough. That's not far off from Teradeep's 8-bit data plus 16-bit accumulators for image data.
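A small demonstration of why pure-FP16 accumulation makes people nervous: once a running float16 sum gets big enough, the spacing between representable values exceeds the addends and the sum just stops moving, while keeping the data in FP16 but accumulating in FP32 stays close. The numbers here are contrived for effect:

    # FP16 data, FP16 vs FP32 accumulator.
    import numpy as np

    data = np.full(20000, 0.1, dtype=np.float16)

    acc16 = np.float16(0.0)
    acc32 = np.float32(0.0)
    for x in data:
        acc16 = np.float16(acc16 + x)        # pure FP16 accumulator
        acc32 = acc32 + np.float32(x)        # FP16 data, FP32 accumulator

    print("intended sum:", 0.1 * len(data))  # 2000.0 (0.1 isn't exact in fp16, expect ~1999.5)
    print("fp16 accum  :", float(acc16))     # stalls at 256, where fp16 spacing (0.25) swamps 0.1
    print("fp32 accum  :", float(acc32))     # ~1999.5, close to the intended sum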
This is either late or on time, but it's still quite interesting to see the GPU industry move into what I believe is an as-yet unexplored market segment. Moving forward, it seems GPUs will be used less and less for their traditional graphics purposes, and more and more for deep learning applications.
Sorry, I mean generically enabling GPGPU stuff; of course they don't want high-performance apps that don't have to go through the store (WebGL / WebCL).
If that's the real reason Apple won't expose OpenCL and why Google has been aimlessly puttering with reinventing Ian Buck's Ph.D. thesis for the past 4 years, I just died a little inside.