The P3 instances are the first widely and easily accessible machines that use the NVIDIA Tesla V100 GPUs. These GPUs are straight up scary in terms of firepower. To give an understanding of the speed-up compared to the P2 instances for a research project of mine:

+ P2 (K80) with single GPU: ~95 seconds per epoch

+ P3 (V100) with single GPU: ~20 seconds per epoch

Admittedly this isn't exactly fair to either GPU: the K80 cards are straight up ancient now, and the Volta isn't sitting at 100% GPU utilization as it burns through the data too quickly (CUDA kernel launch and Python overhead suddenly become major bottlenecks). It still gives you an indication of what a leap this is if you're using GPUs on AWS, however. Oh, and the V100 comes with 16GB of (faster) RAM compared to the K80's 12GB, so you win there too.

For anyone using the standard set of frameworks (TensorFlow, Keras, PyTorch, Chainer, MXNet, DyNet, DeepLearning4j, ...), this type of speed-up will likely require you to do nothing - except throw more money at the P3 instance :)

If you really want to get into the black magic of speed-ups, these cards also feature full FP16 support, which means you can double your TFLOPS by dropping to FP16 from FP32. You'll run into a million problems during training due to the lower precision but these aren't insurmountable and may well be worth the pain for the additional speed-up / better RAM usage.
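
To give a flavour of the naive version in PyTorch (the layer sizes here are made up for the example): halve the model and the inputs, and the matrix multiplications run on the FP16 path. Fine for inference; training is where the precision problems above start to bite.

    import torch

    # Naive FP16: cast the whole model and the inputs to half precision.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).half().cuda()

    x = torch.randn(512, 1024).half().cuda()
    y = model(x)    # matmuls now run in FP16
    print(y.dtype)  # torch.float16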

- Good overview of Volta's advantages compared to even the recent P100: https://devblogs.nvidia.com/parallelforall/inside-volta/

- Simple table comparing V100 / P100 / K40 / M40: https://www.anandtech.com/show/11367/nvidia-volta-unveiled-g...

- NVIDIA's V100 GPU architecture white paper: http://www.nvidia.com/object/volta-architecture-whitepaper.h...

- The numbers above were using my PyTorch code at https://github.com/salesforce/awd-lstm-lm and the Quasi-Recurrent Neural Network (QRNN) at https://github.com/salesforce/pytorch-qrnn which features a custom CUDA kernel for speed




Great write-up as usual! Could you elaborate a bit more on the Python overhead? We have FP16 support running in DL4J, but I don't think we've really done much with Volta yet beyond getting it working. In practice (especially when we do multi-GPU async background loading of data), we find the GPUs being data-starved. I would love to compare notes on what you're seeing with PyTorch.


Honestly, I didn't spend enough time delving into the Python overhead, especially in terms of the framework. Most of it would be an issue of my own making, however, rather than the framework's. The original code I wrote was never written with data loading / saving in mind as a source of speed issues, so I avoided what would have been premature optimization at the time.

Some of the slowdowns now just seem silly and aren't even listed in the per-epoch timings: PyTorch doesn't have an asynchronous torch.save(). This means that if you save your model after each epoch, and the model save takes a few seconds, you're increasing your per-epoch timings by 5-10% just by saving the damn thing!
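
As a workaround, something like this hypothetical helper (not part of PyTorch itself) gets the save off the critical path: snapshot the state dict to CPU synchronously, then write it to disk in a background thread.

    import threading
    import torch

    def async_save(model, path):
        # The only blocking part: copy the weights to CPU memory so
        # training can keep mutating the GPU tensors in the meantime.
        cpu_state = {k: v.detach().cpu().clone()
                     for k, v in model.state_dict().items()}
        # The slow disk I/O happens off the main thread.
        t = threading.Thread(target=torch.save, args=(cpu_state, path))
        t.start()
        return t  # join() it before exiting if you need the file on disk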

Regarding FP16, PyTorch supports it, and there's even a pull request that updates the examples repo with FP16 support for language modeling and ImageNet. It's not likely to be merged as it greatly complicates a codebase that's meant primarily for teaching purposes, but it's lovely to look at. I also think many of the FP16 issues will get a general wrapper and become far more transparent to the end user. For the most part they're all outlined in NVIDIA / Baidu's "Mixed Precision Training" paper. It might be useful for DeepLearning4j to go through the most common heavy-throughput use cases and get them running (just as an example of how to work around the issues, really) if customers are using P100s/V100s?
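
The core recipe from that paper is simple enough to sketch in PyTorch: FP16 weights and activations for compute, an FP32 "master" copy of the weights for the update, and a scaled loss so small gradients don't underflow. A rough sketch, assuming model, loss_fn and loader already exist (the scale factor is illustrative):

    import torch

    LOSS_SCALE = 128.0  # static scale; the paper also discusses dynamic scaling

    model = model.half().cuda()  # FP16 copy used for forward/backward
    master = [p.detach().float().clone().requires_grad_()
              for p in model.parameters()]  # FP32 master weights
    opt = torch.optim.SGD(master, lr=0.1)

    for inputs, targets in loader:
        loss = loss_fn(model(inputs.half().cuda()), targets.cuda())
        (loss * LOSS_SCALE).backward()  # scale up so FP16 grads don't vanish

        for m, p in zip(master, model.parameters()):
            m.grad = p.grad.float() / LOSS_SCALE  # unscale into FP32
            p.grad = None
        opt.step()

        with torch.no_grad():  # push updated FP32 weights back to the FP16 model
            for m, p in zip(master, model.parameters()):
                p.copy_(m)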

I'm really interested in exploring the FP16 aspect, as the QRNN, especially on a single GPU, is sitting at basically 100% utilization, with almost all the time spent on matrix multiplications. FP16 is about the only way to speed it up at that stage. This gets a tad more complicated, as the CUDA kernel is not written in FP16 (and rewriting it in FP16 is not easy), but even converting FP16 -> FP32 -> (QRNN element-wise CUDA kernel) -> FP16 ("pseudo" FP16) should still be a crazy speedup. I tested that on the P100 and it took AWD-QRNN from ~28 seconds per epoch to ~18.
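
The "pseudo" FP16 part is just a cast on either side of the FP32-only kernel; the surrounding matrix multiplications stay FP16, which is where the speedup lives anyway. A minimal sketch, where fp32_kernel stands in for the real QRNN element-wise op:

    import torch

    class PseudoFP16(torch.nn.Module):
        """Wrap an FP32-only op so it slots into an otherwise FP16 model."""
        def __init__(self, fp32_kernel):
            super().__init__()
            self.fp32_kernel = fp32_kernel

        def forward(self, x):
            # FP16 -> FP32 -> (FP32-only kernel) -> FP16
            return self.fp32_kernel(x.float()).half()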

- PyTorch async save issue: https://github.com/pytorch/pytorch/issues/1567

- PyTorch FP16 examples pull request: https://github.com/pytorch/examples/pull/203

- "Mixed Precision Training": https://arxiv.org/abs/1710.03740


Nice comment. Regarding your point about reducing precision to FP16 for performance gains, you might want to read a recently published paper from the Baidu Research and NVIDIA teams on mixed-precision training of deep learning models (the link to the paper is at the end of the following relevant post): https://www.nextplatform.com/2017/10/11/baidu-sheds-precisio.... Enjoy! :-)


I've been using the P100 on Softlayer and was impressed. Looks like the V100 may be 2-3x faster on some tasks; it will be interesting to test it.

P.S. With that memory speed, it can probably run at 300-400 MH/s on ETH.


Genuinely curious: given that Softlayer bare-metal server prices start at $700 per month, is there even a remote chance of this actually being profitable?



Nope, that would be hugely unprofitable. To give you an idea:

The P100 instances on Softlayer would cost around $2,000/mo and would generate approximately $170/mo in ETH when fully optimized. One could probably build a DIY rig with the same hashing power for less than $2k total.


I can't easily find pricing information on the P3 instances. Have you come across a simple table with the prices?


On-demand prices from Amazon's pricing page https://aws.amazon.com/ec2/pricing/on-demand/ (select the Virginia region):

p3.2xlarge: 8 vCPU, 61 GB RAM, $3.06/h

p3.8xlarge: 32 vCPU, 244 GB RAM, $12.24/h

p3.16xlarge: 64 vCPU, 488 GB RAM, $24.48/h


Unfortunately, P3 isn't listed yet, but this is my go-to site for EC2 pricing: http://www.ec2instances.info/


Oh, it is now listed. Be sure to click on "Columns" and add "GPU" to see the different options.


It's showing up for me on https://aws.amazon.com/ec2/pricing/ in each of the On-Demand Instances, Reserved Instances, Spot Instances, and Dedicated Hosts pricing lists. Are you selecting the regions where this is available: US East (N. Virginia), US West (Oregon), EU West (Ireland) and Asia Pacific (Tokyo)?


This got me. Seems they aren't yet available in the Ohio region.


Oops, just saw that you referenced the same paper in a comment below. Sorry! :-)



