The P3 instances are the first widely and easily accessible machines that use the NVIDIA Tesla V100 GPUs. These GPUs are straight up scary in terms of firepower. To give an understanding of the speed-up compared to the P2 instances for a research project of mine:

+ P2 (K80) with single GPU: ~95 seconds per epoch

+ P3 (V100) with single GPU: ~20 seconds per epoch

Admittedly this isn't exactly fair to either GPU - the K80 cards are straight up ancient now and the Volta isn't sitting at 100% GPU utilization as it burns through the data too quickly (CUDA kernel launch and Python overhead suddenly become major bottlenecks). Still, this gives you an indication of what a leap this is if you're using GPUs on AWS. Oh, and the V100 comes with 16GB of (faster) RAM compared to the K80's 12GB, so you win there too.
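
If you want to reproduce this sort of per-epoch timing yourself, a minimal sketch in PyTorch looks roughly like the below (model, loader, criterion and optimizer are placeholder names, not the actual code from my repos linked further down). The torch.cuda.synchronize() calls matter: CUDA launches are asynchronous, so without them the timer can stop before the GPU has actually finished.

    import time
    import torch

    def time_epoch(model, loader, criterion, optimizer):
        # Time one training epoch; synchronize so the clock only stops
        # once the GPU has finished all the queued work.
        torch.cuda.synchronize()
        start = time.time()
        for x, y in loader:
            x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        torch.cuda.synchronize()
        return time.time() - start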

For anyone using the standard set of frameworks (TensorFlow, Keras, PyTorch, Chainer, MXNet, DyNet, DeepLearning4j, ...), this type of speed-up will likely require you to do nothing - except throw more money at the P3 instance :)

If you really want to get into the black magic of speed-ups, these cards also feature full FP16 support, which means you can double your TFLOPS by dropping from FP32 to FP16. You'll run into a million problems during training due to the lower precision, but these aren't insurmountable and may well be worth the pain for the additional speed-up / better RAM usage.
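
To give a rough feel for what manual FP16 training with loss scaling looks like in PyTorch - this is only an illustrative sketch (the toy model and the scale factor of 128 are arbitrary choices, not a recipe from my repos): keep an FP32 master copy of the weights, run forward/backward in FP16, scale the loss up so tiny gradients don't underflow, then unscale before the optimizer step.

    import torch
    import torch.nn as nn

    # Toy FP16 model purely for illustration
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda().half()
    criterion = nn.CrossEntropyLoss()

    # FP32 master copy of the parameters; the optimizer only ever updates these
    master = [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]
    optimizer = torch.optim.SGD(master, lr=0.1)

    loss_scale = 128.0  # scale the loss so small gradients don't underflow in FP16

    for step in range(10):
        x = torch.randn(32, 512, device='cuda').half()
        y = torch.randint(0, 10, (32,), device='cuda')

        model.zero_grad()
        loss = criterion(model(x), y)
        (loss * loss_scale).backward()

        # Move the FP16 gradients onto the FP32 master weights and undo the scaling
        for m, p in zip(master, model.parameters()):
            m.grad = p.grad.detach().float() / loss_scale
        optimizer.step()

        # Copy the updated FP32 weights back into the FP16 model
        with torch.no_grad():
            for m, p in zip(master, model.parameters()):
                p.copy_(m)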

- Good overview of Volta's advantages compared to even the recent P100: https://devblogs.nvidia.com/parallelforall/inside-volta/

- Simple table comparing V100 / P100 / K40 / M40: https://www.anandtech.com/show/11367/nvidia-volta-unveiled-g...

- NVIDIA's V100 GPU architecture white paper: http://www.nvidia.com/object/volta-architecture-whitepaper.h...

- The numbers above were from my PyTorch code at https://github.com/salesforce/awd-lstm-lm and the Quasi-Recurrent Neural Network (QRNN) at https://github.com/salesforce/pytorch-qrnn, which features a custom CUDA kernel for speed (a minimal usage sketch is below)
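
If you want to try the QRNN layer itself, a minimal usage sketch based on the pytorch-qrnn README (the import path and constructor arguments are as documented there; the shapes are just illustrative, and it expects a CUDA device):

    import torch
    from torchqrnn import QRNN  # from the salesforce/pytorch-qrnn repo above

    seq_len, batch_size, input_size, hidden_size = 35, 20, 256, 256
    x = torch.rand(seq_len, batch_size, input_size).cuda()

    # Two stacked QRNN layers, used much like an nn.LSTM stack
    qrnn = QRNN(input_size, hidden_size, num_layers=2, dropout=0.4).cuda()
    output, hidden = qrnn(x)
    print(output.size())  # (seq_len, batch_size, hidden_size)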

