DeepLearning10: The 8x Nvidia GTX 1080 Ti GPU Monster (servethehome.com)
135 points by EvgeniyZh on June 7, 2017 | 66 comments



I've been evaluating this space a fair bit recently. If you want to optimize FLOPS/$, especially for research-workstation sorts of setups, there are unfortunately not many options for getting more than 6 GPUs on a motherboard without going to server grade, where you're spending basically $3-4K more for unclear benefit - maybe a factor of two in GPU-to-GPU bandwidth.

The bitcoin miners have figured out one way to handle this, which is by using a variety of PCIe splitting systems. I've seen examples of people putting 8 GPUs in 4 slots with these splitters. The problem is that the majority of these splitters take your x16 connection and turn it into 2 to 4 x1 PCIe lanes, which wastes a lot of bandwidth. That's fine for the miners, since their cards run mostly independently. If I could find compatible PCIe splitters that split x16 into 2 x8 channels, that would be a really sweet spot in performance/$, but unfortunately I've yet to find them. So right now I'm going to stick to 6 GPUs, which you can get on a $500 consumer motherboard with just a few riser cables.

See for example: http://amfeltec.com/products/flexible-x4-pci-express-4-way-s...


Not sure if you already knew about them, but there are consumer X99 motherboards that include PLX8747 switches to mux 32 PCIe lanes from the CPU into 7 PCIe x8 slots, at a premium of maybe $200 for the motherboard and $300 for a compatible CPU. (ASUS X99-E-10G WS)

The catch being that the 7 slots are next to each other, so you either have to make a custom water loop with single slot GPUs or use simple risers on half of them. But that's probably the best bandwidth you can currently get between >4 GPUs on consumer parts.


That sounds amazing - do you know of any in particular? I can't seem to find any motherboards with 8 PCIe slots. Or do they need some sort of additional splitter?


Here's a guy building a water-cooled 7x1080 system: https://www.youtube.com/watch?v=9hsQmcSwGv0&list=PLj4jtAjLgQ... With the Ti series you don't have to cut the excess off the card either, it seems, since there's no pesky DVI port to saw off.


Wow, thanks!


Sorry, I was wrong, it's only 7 slots - the ASUS X99-E-10G WS is the board I was thinking of.


Pretty amazing, though— there's a picture of it on this review:

http://proclockers.com/reviews/motherboards/asrock-x99-ws-e-...


Still pretty good - thanks!


What about AMD's Threadripper? It supports 60 PCIe lanes.


But what slots will the motherboards have? Four x16 or eight x8?


The ones I've seen so far have either four or five slots. The lane allocation isn't set in stone; I've seen a few that have 3-x16, 1-x1, and 3-x4 (M2) [1] but I've also seen one that has 4-x16, 1-x1, and 2-x4 (M2) [2]. I assume those motherboards are using PCIe switches, but there are plenty of configurations that wouldn't require the switch.

[1] https://www.pcper.com/news/Motherboards/Computex-2017-ASRock...

[2] http://wccftech.com/gigabyte-x399-aorus-gaming-7-amd-ryzen-t...


The author wasn't joking about the noise levels. This machine sounds like an F1 race car.

If you don't require a rack mounted server, a cluster of workstations like NVIDIA's DIGITS DevBox is far more cost efficient (and less noisy). I run a compute intensive business (Dreamscopeapp.com) and we opted to build a cluster of desktop-like machines instead of using a rack mounted solution. Another benefit is you don't run into the power issues mentioned in the post.

My start-up actually sells the machine described in this post: https://lambdal.com/nvidia-gpu-server

And a machine inspired by the NVIDIA DIGITS DevBox: https://lambdal.com/nvidia-gpu-workstation-devbox


>The author wasn't joking about the noise levels. This machine sounds like an F1 race car.

One of the reasons 3D artists who use GPU rendering generally go for liquid cooling (that, and it makes the cards single slot):

http://rawandrendered.com/Octane-Render-Hepta-GPU-Build


So - tried your quote form. The options are 4x 1080ti, 4x titan Xp, or 8x P100 --- but no 8x 1080ti? Or is the quote form wrong?

16.5K seems pretty reasonable for 8x 1080ti with a bit of profit for building it, but unreasonable for only 4x 1080ti. My home-built 4x1080ti box (without quite enough PCIe bandwidth, admittedly) is under $6k. I'm assuming/hoping there's an error there. :)

Screenshot of the order form: https://www.dropbox.com/s/2nm00w1rd6du6ey/Screenshot%202017-...

Oh, also - if I want a quote on both the big server and the little workstation I have to enter my contact info twice? Not particularly customer-friendly.


For a quad-GPU config you should look at the dev box type option. It's $8,750 for a machine with 4 1080 Tis, 64GB of RAM, and a 1TB SATA SSD. Quite a steep margin if you ask me, considering a 128GB RAM machine that you build yourself would cost you at most $5,700 (taxes included) if you get everything from Amazon, and probably under $5k if you're willing to shop around a little.


I feel like you should be able to do better than $15K for $7.2K worth of graphics cards.


You can save a ton of money by building your own machine.

The server we sell is packaged with software we wrote that makes administering it significantly easier. We also provide technical support and even a limited amount of free machine learning consulting. The customers who purchase this server want a headache-free solution and aren't as price sensitive as a lone researcher.


Here's a parts list to build a PC with 1080 Tis: https://pcpartpicker.com/user/Jjn2009/saved/#view=McvMpg

Notice the custom part box accounts for two more GPUs; I'm not sure why the site doesn't let you add 4 to the GPU section.

This setup ranges from $5250 with 4 GPUs to $3240 with 1 GPU. You might want to bump up the PSU for 4 GPUs; it's currently 1500 watts, which may or may not be enough at max load. The article shows a max of ~2800 watts with 8 GPUs.
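For a rough sanity check, here's a back-of-the-envelope power budget (just a sketch; the per-component figures are assumptions, not measurements):

    # Rough power budget for a 4-GPU build (assumed figures, not measured).
    gpu_board_w = 250      # typical 1080 Ti board power
    num_gpus = 4
    cpu_w = 140            # assumed HEDT CPU under load
    platform_w = 100       # assumed motherboard, RAM, drives, fans

    load_w = gpu_board_w * num_gpus + cpu_w + platform_w   # ~1240 W
    psu_w = 1500
    headroom_w = psu_w * 0.8 - load_w   # common rule of thumb: stay near/below ~80% sustained
    print(load_w, headroom_w)           # ~1240 W load, slightly over the 80% mark

So a 1500 W unit is right on the edge for four cards at full tilt, which matches the "may or may not be enough" caveat above.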


Nice - btw, the Rosewill Tokamak 1500 from Newegg is a way to save a few bucks on that build, though it's out of stock (they just had it on sale). It's also 80+ Titanium.


Actually, that's a pretty bad config:

Mobo does not support 4x16 PCIe lanes (that's why they didn't want you to add 4 GPU cards).

Mobo is limited to 64GB RAM.

$520 for two 256GB 840 Pro SSDs? Seriously?

Here's a better mobo: https://www.amazon.com/Motherboards-X99-E-WS-USB-3-1/dp/B00X...

Also, you can literally double the RAM for the same money: http://www.ebay.com/itm/128gb-DDR4-8-Crucial-16gb-DDR4-2400m... (keep in mind that RAM speed is irrelevant for DL tasks).


I purchased a couple of the machines from these guys for my team and had a great experience. Highly recommended if you have better ways to spend your time than building them yourself.


Can confirm that. Have this box (with 8 K80s) and we measured 90 dB next to the box.


I ask this out of deep curiosity and by no means with any intent to offend: how does (will?) dreamscopeapp.com make any money? (I don't have an Android device or I'd install it – I'm asking because it's entirely unapparent from your web presence.)


No offense taken!

Dreamscope doesn't actually make money :) It's a little under break even. It brings in revenue through a $9.99/mo premium subscription, which gives customers higher resolution images.


Ah – that's fairly reasonable! Thanks for the answer :)


Doesn't the fact that you get 8 PCIe lanes per GPU affect performance?


With this setup, you get one GPU that runs with 16 lanes (and as you mentioned, three that run with 8).

Bottleneck depends on the workload. If you're training a small/fast network, data bandwidth is a real problem.

That being said, for most cases, a workstation build that provides every GPU with 16 lanes is far less cost effective.
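To put rough numbers on the bandwidth difference (a sketch with assumed figures; PCIe 3.0 moves roughly 0.985 GB/s per lane after encoding overhead, and the batch shape here is just an example):

    # Time to ship one training batch host->GPU over x16 vs x8 PCIe 3.0.
    lane_gbs = 0.985                        # approx GB/s per PCIe 3.0 lane
    batch_bytes = 256 * 3 * 224 * 224 * 4   # 256 fp32 images at 3x224x224, ~154 MB

    for lanes in (16, 8):
        bandwidth = lanes * lane_gbs                    # GB/s
        ms = batch_bytes / (bandwidth * 1e9) * 1e3
        print(f"x{lanes}: {ms:.1f} ms per batch")       # ~9.8 ms vs ~19.5 ms

Whether the extra ~10 ms matters depends on whether the GPU finishes its compute for that batch faster than the transfer, which is why small/fast networks feel it first.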


Nice! Will you be supporting Volta GPUs when they launch?


Yes!


How much do the devboxes go for? Prices aren't mentioned.


It seems odd that they insist on putting the power connectors on the top of the card instead of the back, which would avoid a lot of space constraint issues.

It also indicates there might be a market for a specialized 90° connector that can squeeze into tight spaces like that.


I'm not sure I agree here; a lot of consumer cases that I've worked with are just barely deep enough to house a card of this size. Putting the power connector on the end of the card would add some extra room to make the case slimmer perhaps, but most consumer cases tend to be generously wide, while not having a lot of extra depth. Wires plugging into the end of the card would compete with the hard drive / CD ROM / Media Card Bay, etc etc.

I agree with nVidia's choice here, but you also raise a valid point; certain cases and configurations would benefit from the added flexibility of that adapter, so there may well be a market.


It can be a squeeze in some of the Mini-ATX cases, but I've never had a problem with anything bigger. There's always at least four inches to spare.


I had an Antec case (solo or sonata or something like that) where I had to cut into the drive bays with tin snips to fit a 1080ti. Modern cases have more room for GPUs.


Also, super compact mini-ITX cases almost always use a PCIe extender to mount the card away from the motherboard.


There's at least one on the market: https://www.evga.com/articles/01051/evga-powerlink/


Power on the top of the card makes it easier to fit in most cases. When it's on the end of the card, you usually have to connect the power before installing the card. You also get a much nicer cable bend on top since the cables don't have to twist.


Consumer cases tend to be fairly wide, to accommodate 5.25" drive bays and tall CPU coolers. Length tends to be the primary constraint on GPU size in these applications.


I wouldn't be surprised if it's a form of price discrimination to deliberately make consumer GPUs not fit into server cases.


Server GPUs come with the exact same connector in the same place and often use the exact same PCB if it's a reference design card.


In all the photos I've seen consumer GPUs have power on top and server GPUs have it on the end.


Most of these boards are made and sold by third parties (MSI, ASUS, Gigabyte, ...).


That's not the top of the card, that's actually the side of the card.


When it goes into a server, as in the article, it's on the top.


Curious: can one train a single NN over multiple GPUs? Or is this useful mainly for parallel training of multiple NNs?


Current front page post addresses this: http://tinyclouds.org/residency/

"At Google, one has relatively unbounded access to GPUs and CPUs. So part of this project was figuring out how to scale the training—because even with these restricted datasets training would take weeks on a single GPU.

The most ideal way to distribute training is Asynchronous SGD. In this setup you start N machines each independently training the same model, sharing weights at each step. The weights are hosted on a separate "parameter servers", which are RPC'd at each step to get the latest values and to send gradient updates. Assuming your data pipeline is good enough, you can increase the number of training steps taken per second linearly, by adding workers; since they don't depend on each other. However as you increase the number of workers, the weights that they use become increasingly out-of-date or "stale", due to peer updates. In classification networks, this doesn't seem to be a huge problem; people are able to scale training to dozens of machines. However PixelCNN seems particularly sensitive to stale gradients—more workers with ASGD provided little benefit.

The other method is Synchronous SGD, in which the workers synchronize at each step, and gradients from each are averaged. This is mathematically the same as SGD. More workers increase the batch size. But Sync SGD allows individual workers to use smaller and faster batch sizes, and thus increase the steps/sec. Sync SGD has its own problems. First, it requires many machines to synchronize often, which inevitably leads to increased idle time. Second, beyond having each machine do batch size 1, you can't increase the steps taken per second by adding machines. Ultimately I found the easiest setup was to provision 8 GPUs on one machine and use Sync SGD—but this still took days to train.

The other way you can take advantage of lots of compute is by doing larger hyperparameter searches. Not sure what batch size to use? Try all of them! I tried hundreds of configurations before arriving at the one we published."
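For the synchronous case described above, gradient averaging across workers is mathematically just SGD with a bigger batch. A minimal numpy sketch (the toy linear model and all names are made up for illustration):

    import numpy as np

    # Toy synchronous SGD: each "worker" computes a gradient on its own shard,
    # the gradients are averaged, and one shared weight vector is updated.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(8, 3))
    target = data @ np.array([1.0, -2.0, 0.5])   # true weights to recover
    w = np.zeros(3)

    def grad(w, x, y):
        # gradient of mean squared error for a linear model (toy stand-in for backprop)
        return 2 * x.T @ (x @ w - y) / len(y)

    num_workers, lr = 4, 0.1
    shards = np.array_split(np.arange(len(data)), num_workers)

    for step in range(100):
        grads = [grad(w, data[idx], target[idx]) for idx in shards]  # in parallel, in reality
        w -= lr * np.mean(grads, axis=0)    # synchronize: average gradients, one update

    print(w)   # converges toward [1.0, -2.0, 0.5]

The asynchronous variant drops the averaging step: each worker pushes its gradient to a parameter server as soon as it's done, which is where the stale-gradient problem mentioned above comes from.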


Most definitely.

Simplified: Training works by taking an input sample (say an image), running it through the network, seeing if your answer is right, then updating the weights.

If you had 4 GPUs, each GPU would process 1/4 of the input images. Then, after they are done, they would all pool their updates and update a global view of the network. Repeat.


In practice, images are not particularly large and a batch of them would easily fit on a single GPU. What's more common is either (a) performing the forward and backward passes on 4 GPUs where each GPU has its own batch, then collecting the gradient from all 4 backward passes or (b) splitting the computation for individual layers across multiple GPUs.

Both (a) and (b) have various trade-offs. Some models perform worse with large batch sizes, so (a) is not preferred, and others are hard or impossible to parallelize at the layer level, ruling out (b). Google NMT did (b), though it required many trade-offs and restrictions (see my blog post[1]), while many image based tasks are happy with large batch sizes so go with (a).

[1]: http://smerity.com/articles/2016/google_nmt_arch.html
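For contrast with (a), here's a toy sketch of (b), putting different layers on different devices. This is PyTorch-flavored and assumes two CUDA devices are available; the layer sizes are arbitrary and it's only meant to show where activations cross the bus, not how Google NMT actually partitions its model:

    import torch
    import torch.nn as nn

    # Toy model parallelism: first half of the network on one GPU, second half on another.
    class TwoDeviceNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.part1 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to('cuda:0')
            self.part2 = nn.Linear(1024, 10).to('cuda:1')

        def forward(self, x):
            h = self.part1(x.to('cuda:0'))
            return self.part2(h.to('cuda:1'))   # activations hop across the PCIe bus here

    model = TwoDeviceNet()
    out = model(torch.randn(32, 512))           # backward() sends gradients back the other way

The per-step communication here is activations and their gradients rather than averaged weight gradients, which is why the PCIe topology matters so much for this style of parallelism.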


Yes you can.


Random question (which I am sure has been asked before): is there any way to harness the power of the block-chain to, say, fold proteins? Or something "useful" like that.

I'm not saying that securing the block-chain isn't useful in and of itself, I'm just wondering if we could sort of set up the block-chain to swap in/out problems that are "hard to solve easy to verify and also provide other benefits to humanity". Example: say we swap the current proof of work with a protein folding problem instead, and then when we've "folded all the proteins" (or just decide it isn't a useful problem or whatever) in the future, we just revert it back to the current proof of work. Then maybe we find other similar problems and we could swap them in and out as needed.

I'm guessing the current miners are hyper optimized for whatever the current proof of work is, which would be the main road block (outrage at a "wasted" investment into sha-256 specific machines).

I'm not really up to date on all the tech / politics that would go into a change like that, but curious if it were technically possible.
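For reference, the property being traded away is "expensive to solve, one hash to verify." A minimal hashcash-style sketch (the difficulty and names are arbitrary):

    import hashlib

    # Hashcash-style proof of work: finding a nonce is brute force, checking it is one hash.
    def solve(header: bytes, difficulty_bits: int = 18) -> int:
        target = 1 << (256 - difficulty_bits)
        nonce = 0
        while int.from_bytes(hashlib.sha256(header + nonce.to_bytes(8, 'big')).digest(), 'big') >= target:
            nonce += 1
        return nonce

    def verify(header: bytes, nonce: int, difficulty_bits: int = 18) -> bool:
        target = 1 << (256 - difficulty_bits)
        return int.from_bytes(hashlib.sha256(header + nonce.to_bytes(8, 'big')).digest(), 'big') < target

    nonce = solve(b"block header")         # ~2**18 hashes on average
    print(verify(b"block header", nonce))  # True, checked with a single hash

Any replacement workload (protein folding or otherwise) has to keep that asymmetry, plus the property that the problem instance is tied to the chain so miners can't precompute or reuse answers.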


It's not clear how to build a proof of work function which is both flexible enough to be of practical value and rigid enough to be secure against tasks designed with hostile intent. Primecoin is the closest I've seen. Gridcoin and foldcoin aren't serious from a security perspective.


Foldcoin is very similar to what you're describing. The problem is that it's not actually clear we can do what you're describing in a cryptographically secure way - foldcoin, for example, essentially relies on the security of the Folding@Home network, something I find extremely sketchy. Regardless, it's not actually clear that Folding@Home itself is a net positive for society anyway (see https://www.gwern.net/Charity%20is%20not%20about%20helping)


Foldingcoin and gridcoin do just that. They don't get much attention though.


Slightly OT: why are we still limited to 256GB of memory? Why isn't memory capacity increasing like Moore's Law?


A quick glance at the article: "There are 24 DIMM slots and you can use LRDIMMs". 64GB LRDIMMs at least seem to be available (24 x 64GB would already be 1.5TB), and some news from 2015 also mentioned a 128GB DIMM from Samsung.


I seriously doubt you need to spend more than 100% of the cost of the 8 GPUs on the rest of the system.

If your 8 GPUs cost ~$6k USD, you should be able to build a system for under ~$10k USD (even ~$8k). Any extra money you spend is more out of a desire to "max out" your specs than for an actual performance boost.


Nice write-up, thanks for sharing! We have been building and selling similarly spec'd boxes in the EU - if anyone is interested, check out http://deeplearningbox.com/ They come preconfigured with all the major deep learning libs.


Thanks for posting this. I put together a similar build recently for home and ended up running into many of the same issues. Ended up deciding to go with a regular motherboard and 3x Ti cards, which is enough for what I'm doing now and avoids many of the problems with bumping up to 4+ cards.


Do 1080 Tis have fp16 support? Seems like a waste if the model can be trained in fp16 and you're using full 32-bit.

Similarly, you should probably try a bunch of other frameworks (Caffe2, CNTK, MXNet) as they might be better at handling this non-standard configuration.


No double-speed fp16 on the 1080 Ti.


It does, however, have int8 support.


Yep, 4x int8 (44 TOPS) on the 1080 Ti. Is the framework support there for inference at 4x-speed int8 on the 1080 Ti? How about training - I thought you need fp16 minimum for training. I've seen some research into lower-precision training (XNOR) but I'm unsure how mature it is.

Being able to use 44 TOPS for training on a single 1080 Ti would be pretty awesome.


Yes - here's a doc about doing quantized inference in TensorFlow, for example: https://www.tensorflow.org/performance/quantization

AFAIK, there's still a bit of a performance gap between just using TF and using the specialized gemmlowp library on Android, but that part's getting cleaned up.

Haven't seen much in generalized results on training using lower precision.
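For intuition about what int8 quantization is doing, here's a framework-agnostic numpy sketch of symmetric weight/activation quantization with int32 accumulation (not the actual TF or gemmlowp kernels; shapes and scales are made up):

    import numpy as np

    # Quantize to int8 with one fp32 scale per tensor, matmul in integers, rescale back.
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(256, 128)).astype(np.float32)
    x = rng.normal(size=(1, 256)).astype(np.float32)

    w_scale = np.abs(w).max() / 127.0
    x_scale = np.abs(x).max() / 127.0
    w_q = np.clip(np.round(w / w_scale), -127, 127).astype(np.int8)
    x_q = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)

    y_int = x_q.astype(np.int32) @ w_q.astype(np.int32)   # int8 GEMM accumulates in int32
    y_approx = y_int * (x_scale * w_scale)                 # back to float

    print(np.max(np.abs(x @ w - y_approx)))                # small quantization error

Training is harder than inference here because the tiny gradient updates get lost in the rounding, which is why fp16 (or fp32 master weights) is usually the floor for training.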


Does that work with Pascal CUDA8 INT8 out of the box?


I'm not sure - I believe it depends on getting cuDNN6 working, and from this bug, I can't quite tell if it works or not (but it's probably not officially supported yet): https://github.com/tensorflow/tensorflow/issues/8828


Their test case is a GAN. I'm not sure I've ever seen someone train a GAN on int8. It'd probably work...



