In general, which public implementation of spatial convolution is fastest depends so heavily on the kernel parameters (and even then, there are often knobs you can twiddle for a given implementation that can lead to substantial speedups) and on how much GPU memory you are willing to spend (for the GEMM- and FFT-based methods) that the best approach is probably just to benchmark a bunch of implementations on your network and choose the best one.
There are plans for Theano to build 'meta optimizers' that determine the fastest implementation separately for each layer in a network, and for each 'pass' (forward, backward w.r.t. weights, backward w.r.t. input): https://github.com/Theano/Theano/issues/2072
Sort of like how the FFTW library for FFT computation tries out a bunch of different 'plans' and then uses the fastest. I really hope this idea comes to fruition, as it would automate the "benchmark a bunch of implementations and choose the best" step, and make the process more granular.
Meta-optimizers will need to factor in a few more things, like memory usage. Right now the FFT modules are insanely memory-hungry, and the versions written with reasonable memory usage (Michael Mathieu wrote one) are not as fast as batched cuFFT + cuBLAS.
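To make the "benchmark everything and pick the winner" idea concrete, here's a rough sketch. The two numpy implementations below (a naive direct convolution and an FFT-based one) are just CPU-side stand-ins for the real GPU kernels you'd actually be comparing (GEMM, cuFFT, cuDNN, ...), and it only measures time, not the memory usage mentioned above:

    # Sketch: time each available convolution implementation per layer
    # configuration and pick the fastest. Shapes are made up for illustration.
    import time
    import numpy as np

    def direct_conv2d(img, kern):
        """Valid-mode 2D convolution via explicit loops (stand-in for a direct/GEMM kernel)."""
        H, W = img.shape
        kh, kw = kern.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern[::-1, ::-1])
        return out

    def fft_conv2d(img, kern):
        """Valid-mode 2D convolution via FFT (stand-in for a cuFFT-based kernel)."""
        H, W = img.shape
        kh, kw = kern.shape
        # circular convolution at the image size; the 'valid' region is alias-free
        full = np.fft.irfft2(np.fft.rfft2(img) * np.fft.rfft2(kern, img.shape))
        return full[kh - 1:H, kw - 1:W]

    def pick_fastest(impls, img, kern, reps=3):
        timings = {}
        for name, fn in impls.items():
            t0 = time.time()
            for _ in range(reps):
                fn(img, kern)
            timings[name] = (time.time() - t0) / reps
        return min(timings, key=timings.get), timings

    impls = {"direct": direct_conv2d, "fft": fft_conv2d}
    for ksize in (3, 9, 15):  # small vs large filters favour different implementations
        img, kern = np.random.randn(128, 128), np.random.randn(ksize, ksize)
        best, timings = pick_fastest(impls, img, kern)
        print(ksize, best, timings)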
I don't know where this figure comes from: "Nvidia's official benchmarks show a 10x speed-up when using Caffe with cuDNN." The graph in the original announcement ( http://devblogs.nvidia.com/parallelforall/accelerate-machine... ) shows an 11x speed-up for using Caffe on the GPU versus CPU, and a 14x speed-up for using Caffe + cuDNN (again compared to CPU).
Then again, these figures are also kind of meaningless, as the specifics of the experiments aren't given. There are a lot of parameters that affect the performance of different implementations in different ways: the number of input feature maps, the number of filters, the filter width/height, and the input width/height. The FFT approach tends to do well for large filter sizes and lots of input feature maps, for example.
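For a rough sense of why filter size matters so much, here's a back-of-the-envelope FLOP estimate. The constants (5·n·log2(n) per FFT, 8 ops per complex multiply-accumulate) and the layer shape are made up for illustration, and it ignores the memory overhead and constant factors that hurt the FFT approach in practice:

    # Rough FLOP estimates: direct convolution scales with k^2, the FFT
    # approach's pointwise stage does not, so large filters favour FFT.
    import math

    def direct_flops(batch, c_in, c_out, h, w, k):
        # one multiply-add per filter tap, per output pixel, per (c_in, c_out) pair
        return 2.0 * batch * c_in * c_out * h * w * k * k

    def fft_flops(batch, c_in, c_out, h, w, k):
        n = h * w
        fft = 5.0 * n * math.log2(n)                 # rough cost of one 2D FFT/IFFT
        transforms = (batch * c_in + c_in * c_out + batch * c_out) * fft
        pointwise = 8.0 * batch * c_in * c_out * n   # complex MACs, independent of k
        return transforms + pointwise

    for k in (3, 5, 7, 11):
        d = direct_flops(128, 96, 256, 32, 32, k)
        f = fft_flops(128, 96, 256, 32, 32, k)
        print("k=%2d  direct/fft flop ratio: %.2f" % (k, d / f))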
I've played around with cuDNN a bit using Theano. The Theano bindings are a work in progress, but they've already wrapped the convolution. Compared to conv_gemm (Theano's version of Caffe's GEMM convolution approach), it seems to be sometimes faster, sometimes slower. Soumith Chintala maintains a GitHub repository with benchmarks of various convolution implementations: https://github.com/soumith/convnet-benchmarks
His results aren't that spectacular either.
In my own experiments cuDNN did pretty well for very small filter sizes (e.g. 3x3), often beating even the memory-hungry FFT approach. This is great because the top two scoring entries in the 2014 ImageNet competition made use of lots of convolutional layers with small filters.
Of course the main benefit of cuDNN is that it will become faster over time, and that it will always be adapted to the latest NVIDIA GPUs without requiring code changes (provided that they keep maintaining it).
You can probably figure out why 3x3 or smaller convolutions would be faster without much thought. Hint: FP performance is improving much faster than either internal bandwidth or CUDA kernel launch latency.
And IMO this opens up the window for future GPUs to do even better. That said, when 9 TFLOPs is ~$1,100 (2 x GTX 980), I'm not too worried about memory usage in the long term.
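One way to put numbers on the FLOPs-vs-bandwidth point: a rough arithmetic-intensity estimate for a made-up 3x3 layer, compared against approximate GTX 980 specs (~4.6 TFLOP/s single precision, ~224 GB/s):

    # Back-of-the-envelope arithmetic intensity of a 3x3 conv layer versus the
    # GPU's FLOP/byte balance point. Layer shape is invented for illustration.
    batch, c_in, c_out, h, w, k = 128, 64, 64, 56, 56, 3

    flops = 2.0 * batch * c_in * c_out * h * w * k * k
    # bytes that must cross device memory at least once, at 4 bytes per float
    bytes_moved = 4.0 * (batch * c_in * h * w      # input
                         + batch * c_out * h * w   # output
                         + c_in * c_out * k * k)   # weights

    intensity = flops / bytes_moved        # FLOPs per byte for this layer
    device_ratio = 4.6e12 / 224e9          # FLOPs per byte the GPU can sustain

    print("layer intensity: ~%.0f flop/byte, GPU balance point: ~%.0f flop/byte"
          % (intensity, device_ratio))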
Neophyte question: is floating point necessary for a CNN, as opposed to using 32 bits in some fixed encoding? If I used a 32-bit value split on the order of +-4.27 or +-8.23, wouldn't that have enough accuracy? I'm assuming the weights and parameters don't go much above, say, 8 or 16 after all the ReLU stuff.
Probably not. The noise in the training data itself, noise from dropout and various other sources are going to trump any quantization noise anyway, so you can use fairly inaccurate representations. This is why everyone's using gamer GPUs for this stuff, and not Tesla cards: single precision is enough, and much cheaper to come by.
Although, I guess the cutoff might have to be a bit higher than 8.23. Maybe the neuron activations would never exceed that range, but some intermediate computations could.
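If I'm reading the +-8.23 notation as a sign bit, 8 integer bits, and 23 fraction bits, here's a quick sketch of what that buys you: a range of roughly +-256 with steps of 2^-23 (about 1.2e-7), so the quantization noise really is tiny compared to the other noise sources mentioned above:

    # Sketch of a signed Q8.23 fixed-point round trip (this reading of the
    # notation is my assumption, not the poster's definition).
    import numpy as np

    FRAC_BITS = 23
    SCALE = 2 ** FRAC_BITS              # Q8.23: sign bit + 8 integer bits + 23 fraction bits

    def to_q8_23(x):
        q = np.round(np.asarray(x, dtype=np.float64) * SCALE)
        q = np.clip(q, -(2 ** 31), 2 ** 31 - 1)   # saturate at ~+-256 instead of wrapping
        return q.astype(np.int32)

    def from_q8_23(q):
        return q.astype(np.float64) / SCALE

    w = np.random.randn(5).astype(np.float32)
    rt = from_q8_23(to_q8_23(w))
    print("original  :", w)
    print("round trip:", rt)
    print("max error :", np.max(np.abs(w.astype(np.float64) - rt)))   # ~6e-8, i.e. 2**-24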
Supposedly the new Maxwell GPUs have some new instructions for working with half-precision floats ( https://developer.nvidia.com/sites/default/files/akamai/open... ). I wonder how complete this implementation is, because half-precision might be sufficient to train a convnet, and this would result in a significant speedup.
In these deep ReLU networks that are not renormalized in between (like OverFeat, which has no normalization), some of the activations get pretty big (on the order of 1e3)!
You also can't clip them and get away with it; you have to either renormalize the layers to do half precision (and live with the extra cost) or stick to full precision. I was doing fun stuff early this year with fixed-precision nets (8-bit/16-bit). Things get very interesting :)
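Not the poster's actual code, but a sketch of the "renormalize rather than clip" idea for low-precision activations: pick a per-layer scale from the observed dynamic range, quantize to int8, and carry the scale along instead of silently clipping values that blow past a fixed range:

    # Per-layer rescaling before 8-bit quantization, so activations that reach
    # ~1e3 still fit without clipping. Shapes and magnitudes are made up.
    import numpy as np

    def quantize_per_layer(acts, bits=8):
        qmax = 2 ** (bits - 1) - 1                   # 127 for int8
        scale = np.max(np.abs(acts)) / qmax          # per-layer scale factor
        q = np.round(acts / scale).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    # ReLU-style activations with a long tail reaching ~1e3, as described above
    acts = np.abs(np.random.randn(4096).astype(np.float32)) * 300.0
    q, scale = quantize_per_layer(acts)
    err = np.max(np.abs(acts - dequantize(q, scale)))
    print("scale=%.3f  max abs error=%.3f" % (scale, err))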
That's plenty IMO for most inputs and weights. Where it gets tricky is in accumulation. You could constrain the weights for each unit I guess, but this is the sort of work best done under the hood rather than by the data scientist IMO. I'd personally choose 32-bit accumulation just because it would drastically simplify code development.
I've also worked with fixed precision elsewhere. It's awesome if you understand the dynamic range of your application. It's a migraine headache if you don't.
Say what? You can get insane single-precision efficiency out of GTX 980. Double-precision OTOH is a disaster (1/150, yes, 1/150, as opposed to GK104's 1/30 or so). But for most machine learning, double-precision is overkill.
IMO what remains to be seen is what one could do with FP16 data mixed with FP32 accumulation. My signal processing friends think a pure FP16 network is just asking for trouble and are of the opinion that a minimum of 20 bits is needed for stable accumulation and 24 is more than enough. That's not far off from Teradeep's 8-bit data plus 16-bit accumulators for image data.
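A small demonstration of why pure-FP16 accumulation makes people nervous: once a running float16 sum gets big enough, the spacing between representable values exceeds the addends and the sum just stops moving, while keeping the data in FP16 but accumulating in FP32 stays close. The numbers here are contrived for effect:

    # FP16 data, FP16 vs FP32 accumulator.
    import numpy as np

    data = np.full(20000, 0.1, dtype=np.float16)

    acc16 = np.float16(0.0)
    acc32 = np.float32(0.0)
    for x in data:
        acc16 = np.float16(acc16 + x)        # pure FP16 accumulator
        acc32 = acc32 + np.float32(x)        # FP16 data, FP32 accumulator

    print("intended sum:", 0.1 * len(data))  # 2000.0 (0.1 isn't exact in fp16, expect ~1999.5)
    print("fp16 accum  :", float(acc16))     # stalls at 256, where fp16 spacing (0.25) swamps 0.1
    print("fp32 accum  :", float(acc32))     # ~1999.5, close to the intended sum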
This is either late or on time, but it's still quite interesting to see the GPU industry move into what I believe is an as-yet unexplored market segment. Moving forward, it seems GPUs will be used less and less for their traditional graphics purposes, and more and more for deep learning applications.
Sorry, I mean generically enabling GPGPU stuff; of course they don't want high-performance apps that don't have to go through the store (WebGL / WebCL).
If that's the real reason Apple won't expose OpenCL and why Google has been aimlessly puttering with reinventing Ian Buck's Ph.D. thesis for the past 4 years, I just died a little inside.