I've been evaluating this space a fair bit recently. If you want to optimize FLOPS/$, especially for research-workstation sorts of setups, there are unfortunately not a lot of options for getting more than 6 GPUs on a motherboard without going to server grade, where you're spending basically $3-4K more for unclear benefit - maybe a factor of two in GPU-to-GPU bandwidth.
The bitcoin miners have figured out one way to handle this, which is by using a variety of PCIe splitting systems. I've seen examples of people putting 8 GPUs in 4 slots with these splitters. The problem is that the majority of these splitters take your x16 connection and turn it into 2 to 4 x1 PCIe lanes, which is a lot of wasted bandwidth. This is fine for the miners, since the cards run mostly independently. If I could find compatible PCIe splitters that could split x16 into 2 x8 channels, that would be a really sweet spot in performance/$, but unfortunately I've yet to find them. So right now I'm going to stick to 6 GPUs, which you can get on a $500 consumer motherboard with just a few riser cables.
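For a rough sense of how much bandwidth those x1 risers give up, here's a back-of-the-envelope sketch (my own numbers, assuming PCIe 3.0 at roughly 985 MB/s of usable bandwidth per lane):

    # Approximate usable PCIe 3.0 bandwidth per GPU for a few splitter setups.
    GBPS_PER_LANE = 0.985  # ~985 MB/s per PCIe 3.0 lane after encoding overhead

    configs = {
        "x16 (no splitter)": 16,
        "x8 (the hoped-for 2-way split)": 8,
        "x1 (typical mining riser)": 1,
    }

    for name, lanes in configs.items():
        print(f"{name}: ~{lanes * GBPS_PER_LANE:.1f} GB/s per GPU")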
Not sure if you already knew about them, but there are consumer X99 motherboards that include PLX8747 switches to mux 32 PCIe lanes from the CPU into 7 PCIe x8 slots, at a premium of maybe $200 for the motherboard and $300 for a compatible CPU (ASUS X99-E-10G WS).
The catch being that the 7 slots are next to each other, so you either have to make a custom water loop with single slot GPUs or use simple risers on half of them. But that's probably the best bandwidth you can currently get between >4 GPUs on consumer parts.
That sounds amazing - do you know of any in particular? I can't seem to find any motherboards with 8 PCIe slots. Or do they need some sort of additional splitter?
The ones I've seen so far have either four or five slots. The lane allocation isn't set in stone; I've seen a few that have 3-x16, 1-x1, and 3-x4 (M2) [1] but I've also seen one that has 4-x16, 1-x1, and 2-x4 (M2) [2]. I assume those motherboards are using PCIe switches, but there are plenty of configurations that wouldn't require the switch.
The author wasn't joking about the noise levels. This machine sounds like an F1 race car.
If you don't require a rack-mounted server, a cluster of workstations like NVIDIA's DIGITS DevBox is far more cost-efficient (and less noisy). I run a compute-intensive business (Dreamscopeapp.com) and we opted to build a cluster of desktop-like machines instead of using a rack-mounted solution. Another benefit is that you don't run into the power issues mentioned in the post.
So - tried your quote form. The options are 4x 1080ti, 4x titan Xp, or 8x P100 --- but no 8x 1080ti? Or is the quote form wrong?
$16.5K seems pretty reasonable for 8x 1080ti with a bit of profit for building it, but unreasonable for only 4x 1080ti. My home-built 4x 1080ti box (without quite enough PCIe bandwidth, admittedly) is under $6k. I'm assuming/hoping there's an error there. :)
Oh, also - if I want a quote on both the big server and the little workstation I have to enter my contact info twice? Not particularly customer-friendly.
For a quad-GPU config you should look at the dev box type option. It's $8,750 for a machine with 4 1080 Tis, 64GB of RAM, and a 1TB SATA SSD. Quite a steep margin if you ask me, considering a 128GB RAM machine that you build yourself would cost you at most $5,700 (taxes included) if you get everything from Amazon, and probably under $5k if you're willing to shop around a little.
You can save a ton of money by building your own machine.
The server we sell is packaged with software we wrote that makes administering it significantly easier. We also provide technical support and even a limited amount of free machine learning consulting. The customers who purchase this server want a headache-free solution and aren't as price sensitive as a lone researcher.
Notice the custom part box accounts for two more GPUs; I'm not sure why the site doesn't let you add 4 to the GPU section.
This setup ranges from $5,250 with 4 GPUs to $3,240 with 1 GPU. You might want to bump up the PSU for 4 GPUs; it's currently 1,500 watts, which may or may not be enough at max load. The article shows a max of ~2,800 watts with 8 GPUs.
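For what it's worth, a rough power-budget sketch with assumed per-part draws (these numbers are my own guesses, not from the article; measuring at the wall is the only real answer):

    # Rough worst-case power budget for a 4-GPU box (all component draws assumed).
    GPU_W = 250       # 1080 Ti board power; transient spikes can go higher
    CPU_W = 140       # high-end desktop CPU under load
    REST_W = 100      # motherboard, RAM, drives, fans
    HEADROOM = 1.2    # ~20% margin so the PSU isn't pinned at its limit

    n_gpus = 4
    total = (n_gpus * GPU_W + CPU_W + REST_W) * HEADROOM
    print(f"Suggested PSU capacity for {n_gpus} GPUs: ~{total:.0f} W")
    # -> ~1488 W, i.e. a 1500 W unit has essentially no margin at sustained max load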
Nice - btw, the Rosewill Tokamak 1500 from Newegg is a way to save a few bucks on that build, though it's out of stock (they just had it on sale). It's also 80 Plus Titanium.
I purchased a couple of the machines from these guys for my team and had a great experience. Highly recommended if you have better ways to spend your time than building them yourself.
I ask this out of deep curiosity and by no means with any intent to offend: how does (will?) dreamscopeapp.com make any money? (I don't have an Android device or I'd install it – I'm asking because it's entirely unapparent from your web presence.)
Dreamscope doesn't actually make money :) It's a little under break even. It brings in revenue through a $9.99/mo premium subscription, which gives customers higher resolution images.
It seems odd that they insist on putting the power connectors on the top of the card instead of the back, which would avoid a lot of space-constraint issues.
It also indicates there might be a market for a specialized 90° connector that can squeeze into tight spaces like that.
I'm not sure I agree here; a lot of consumer cases that I've worked with are just barely deep enough to house a card of this size. Putting the power connector on the end of the card might let the case be slimmer, but most consumer cases tend to be generously wide while not having a lot of extra depth. Wires plugging into the end of the card would compete with the hard drive / CD-ROM / media card bay, etc.
I agree with nVidia's choice here, but you also raise a valid point; certain cases and configurations would benefit from the added flexibility of that adapter, so there may well be a market.
I had an Antec case (solo or sonata or something like that) where I had to cut into the drive bays with tin snips to fit a 1080ti. Modern cases have more room for GPUs.
Power on the top of the card makes it easier to fit in most cases. When it's on the end of the card, you usually have to connect the power before installing the card. You also get a much nicer cable bend on top since the cables don't have to twist.
Consumer cases tend to be fairly wide, to accommodate 5.25" drive bays and tall CPU coolers. Length tends to be the primary constraint on GPU size in these applications.
"At Google, one has relatively unbounded access to GPUs and CPUs. So part of this project was figuring out how to scale the training—because even with these restricted datasets training would take weeks on a single GPU.
The most ideal way to distribute training is Asynchronous SGD. In this setup you start N machines, each independently training the same model and sharing weights at each step. The weights are hosted on separate "parameter servers", which are RPC'd at each step to get the latest values and to send gradient updates. Assuming your data pipeline is good enough, you can increase the number of training steps taken per second linearly by adding workers, since they don't depend on each other. However, as you increase the number of workers, the weights that they use become increasingly out-of-date or "stale", due to peer updates. In classification networks, this doesn't seem to be a huge problem; people are able to scale training to dozens of machines. However, PixelCNN seems particularly sensitive to stale gradients—more workers with ASGD provided little benefit.
The other method is Synchronous SGD, in which the workers synchronize at each step, and gradients from each are averaged. This is mathematically the same as SGD. More workers increase the batch size. But Sync SGD allows individual workers to use smaller and faster batch sizes, and thus increase the steps/sec. Sync SGD has its own problems. First, it requires many machines to synchronize often, which inevitably leads to increased idle time. Second, beyond having each machine do batch size 1, you can't increase the steps taken per second by adding machines. Ultimately I found the easiest setup was to provision 8 GPUs on one machine and use Sync SGD—but this still took days to train.
The other way you can take advantage of lots of compute is by doing larger hyperparameter searches. Not sure what batch size to use? Try all of them! I tried hundreds of configurations before arriving at the one we published."
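To make the "stale gradients" point concrete, here's a throwaway scalar simulation (entirely my own toy, not anything from the post): each worker pulls a possibly out-of-date copy of the weights, computes a gradient against it, and the update still gets applied to the current weights.

    import random

    # Toy async-SGD flavour: gradients are computed against stale weights but
    # applied to the current ones. With small staleness it still converges;
    # crank up the max delay (or the learning rate) and it starts to oscillate.
    lr = 0.1
    params = 0.0                      # "parameter server" state: one scalar weight
    history = [params]                # past values, used to fake stale reads

    def gradient(w):
        return 2 * (w - 3)            # gradient of (w - 3)^2, optimum at w = 3

    for step in range(50):
        staleness = random.randint(0, 5)                     # worker lag in updates
        stale_w = history[max(0, len(history) - 1 - staleness)]
        params -= lr * gradient(stale_w)                     # applied to fresh state
        history.append(params)

    print(f"final weight ~ {params:.3f} (target 3.0)")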
Simplified: Training works by taking an input sample (say an image), running it through the network, seeing if your answer is right, then updating the weights.
If you had 4 GPUs, each GPU would process 1/4 of the input images. Then after they are done, they would all pool their updates and update a global view of the network. Repeat.
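A minimal sketch of that loop (PyTorch, run on CPU here with a stand-in linear "network" so the mechanics are visible without 4 physical GPUs):

    import torch
    import torch.nn as nn

    # Data-parallel step: each "GPU" gets a shard of the batch, gradients are
    # averaged, then one global weight update is applied.
    n_workers = 4
    model = nn.Linear(10, 2)                       # stand-in for a real network
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    inputs = torch.randn(32, 10)                   # one global batch
    targets = torch.randint(0, 2, (32,))

    opt.zero_grad()
    for shard_x, shard_y in zip(inputs.chunk(n_workers), targets.chunk(n_workers)):
        loss = loss_fn(model(shard_x), shard_y)    # each worker's 1/4 of the batch
        (loss / n_workers).backward()              # .grad accumulates the average
    opt.step()                                     # one global update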
In practice, images are not particularly large and a batch of them would easily fit on a single GPU. What's more common is either (a) performing the forward and backward passes on 4 GPUs where each GPU has its own batch, then collecting the gradient from all 4 backward passes or (b) splitting the computation for individual layers across multiple GPUs.
Both (a) and (b) have various trade-offs. Some models perform worse with large batch sizes, so (a) is not preferred, and others are hard or impossible to parallelize at the layer level, ruling out (b). Google NMT did (b), though it required many trade-offs and restrictions (see my blog post[1]), while many image based tasks are happy with large batch sizes so go with (a).
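And a minimal sketch of (b), splitting the model itself across devices so activations rather than gradients cross the boundary (device names are placeholders; with fewer than two GPUs, point both at "cpu" and the mechanics are the same):

    import torch
    import torch.nn as nn

    # Model parallelism: different layers live on different devices.
    dev0, dev1 = "cuda:0", "cuda:1"   # placeholders; use "cpu" if you lack 2 GPUs

    stage1 = nn.Sequential(nn.Linear(10, 64), nn.ReLU()).to(dev0)
    stage2 = nn.Linear(64, 2).to(dev1)

    x = torch.randn(8, 10, device=dev0)
    h = stage1(x)                     # runs on the first device
    out = stage2(h.to(dev1))          # activations hop over, the rest runs there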
Random question (which I am sure has been asked before): is there any way to harness the power of the block-chain to, say, fold proteins? Or something "useful" like that.
I'm not saying that securing the block-chain isn't useful in and of itself, I'm just wondering if we could sort of set up the block-chain to swap in/out problems that are "hard to solve easy to verify and also provide other benefits to humanity". Example: say we swap the current proof of work with a protein folding problem instead, and then when we've "folded all the proteins" (or just decide it isn't a useful problem or whatever) in the future, we just revert it back to the current proof of work. Then maybe we find other similar problems and we could swap them in and out as needed.
I'm guessing the current miners are hyper optimized for whatever the current proof of work is, which would be the main road block (outrage at a "wasted" investment into sha-256 specific machines).
I'm not really up to date on all the tech / politics that would go into a change like that, but curious if it were technically possible.
It's not clear how to build a proof of work function which is both flexible enough to be of practical value and rigid enough to be secure against tasks designed with hostile intent. Primecoin is the closest I've seen. Gridcoin and foldcoin aren't serious from a security perspective.
Foldcoin is very similar to what you're describing. The problem is that it's not actually clear we can do what you're describing in a cryptographically secure way - foldcoin for example essentially relies on the security of the Folding@Home network, something I find extremely sketchy. Regardless, it's not actually clear that Folding@Home itself is a net positive for society anyway (see https://www.gwern.net/Charity%20is%20not%20about%20helping)
Quick glance at the article: "There are 24 DIMM slots and you can use LRDIMMs". 64GB DIMMs at least seem to be available (24 x 64GB would already be 1.5TB), and some news from 2015 also mentioned a 128GB DIMM from Samsung.
I seriously doubt you need to spend more than 100% of the cost of the 8 GPUs on the rest of the system.
If your 8 GPUs cost ~6k USD, you should be able to build a system for under ~10k USD (even ~8k). Any extra money you spend is more out of a desire to "max out" your specs and less about a performance boost.
Nice write-up, thanks for sharing! We have been building and selling similarly spec'd boxes in the EU - if anyone is interested, check out http://deeplearningbox.com/ They come preconfigured with all the major deep learning libs.
Thanks for posting this. I put together a similar build recently for home and ran into many of the same issues. I ended up deciding to go with a regular MB and 3x Ti cards, which is enough for what I'm doing now and avoids many of the problems with bumping out to 4+ cards.
Do 1080 Tis have fp16 support? Seems like a waste if the model can be trained in fp16 and you're using full 32-bit.
Similarly, you should probably try a bunch of other frameworks (Caffe2, CNTK, MXNet) as they might be better at handling this non-standard configuration.
Yep, 4x int8 (44 TOPS) on the 1080 Ti. Is the framework support there for inference at 4x-speed int8 on the 1080 Ti? How about training - I thought you need fp16 minimum for training. I've seen some research into lower-precision training (XNOR) but I'm unsure how mature it is.
Being able to use 44 TOPS for training on a single 1080ti would be pretty awesome.
AFAIK, there's still a bit of a performance gap between just using TF and using the specialized gemmlowp library on Android, but that part's getting cleaned up.
Haven't seen many generalized results on training with lower precision.
I'm not sure - I believe it depends on getting cuDNN6 working, and from this bug, I can't quite tell if it works or not (but it's probably not officially supported yet): https://github.com/tensorflow/tensorflow/issues/8828
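If you want to check empirically whether fp16 buys you anything on a given card, a quick and unscientific matmul benchmark is enough to see it (PyTorch sketch, assuming a CUDA build; on a 1080 Ti fp16 compute is heavily throttled, so don't expect a win):

    import time
    import torch

    def matmul_tflops(dtype, n=4096, iters=20):
        # Time n x n matmuls on the GPU and report rough throughput in TFLOPS.
        a = torch.randn(n, n, device="cuda", dtype=dtype)
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        return iters * 2 * n ** 3 / (time.time() - start) / 1e12

    print("fp32: %.1f TFLOPS" % matmul_tflops(torch.float32))
    print("fp16: %.1f TFLOPS" % matmul_tflops(torch.float16))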
"If I could find compatible PCIe splitters that could split x16 into 2 x8 channels, that would be a really sweet spot in performance/$, but unfortunately I've yet to find them."
See for example: http://amfeltec.com/products/flexible-x4-pci-express-4-way-s...