Nvidia Announces PCIe Tesla V100 (anandtech.com)
52 points by endorphone on June 20, 2017 | 41 comments



Don't get too excited about V100. P100 was announced over a year ago and is still not available today from AWS, GCE, or Azure (despite various announcements). Although you can now, finally, order one at retail for the low, low price of $12k+ [1] [2].

The upcoming Volta-based consumer GPUs are going to be your best value for machine learning, not V100.

[1] https://www.amazon.com/NVIDIA-Tesla-P100-computing-processor... [2] http://accessories.us.dell.com/sna/productdetail.aspx?c=us&l...


You can get the Quadro GP100, which is pretty much identical (I've got 2).

http://images.nvidia.com/content/pdf/quadro/data-sheets/3020...

Agreed re consumer GPUs for machine learning though.


Interesting, I hadn't heard of the GP100, which was only released in February. Still, at $6,000, it's as expensive as eight 1080 Tis.


Yeah, but double precision on the 1080 Ti is crippled, so those eight 1080 Tis would (in theory) have about 50% of the FP64 performance of a single Quadro GP100. There are good reasons to use the higher-end chips in some cases (maybe not for ML).
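
That roughly checks out on paper (back-of-the-envelope only, using approximate published peak figures as assumptions):

    # Quick check of the ratio, using approximate published peak FP64 figures
    # (assumptions for illustration, not measured numbers).
    gtx_1080_ti_fp64_tflops = 0.332      # heavily rate-limited on the consumer die
    quadro_gp100_fp64_tflops = 5.2       # roughly half of its FP32 rate

    eight_cards = 8 * gtx_1080_ti_fp64_tflops
    print(eight_cards / quadro_gp100_fp64_tflops)   # ~0.51, i.e. about 50%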


Many of the early P100s went to supercomputers. The V100 is interesting because they implemented "tensor cores", and claim that it will offer 12 times the deep learning performance of the already class leading P100.

I'm not sure how consumer GPUs that don't exist can be better, though. That seems highly presumptive, especially given that most machine learning setups are heat, cooling and space limited.


> most machine learning setups are heat, cooling and space limited.

You forgot "cost". V100, like P100, will cost literally more than 10 times what consumer cards do, for a likely speedup of less than 2 times. For the vast majority of people, that will not be worth it.


The 1080 Ti you mentioned elsewhere achieves about 1/60th of the P100's FP16 deep learning performance. The V100 would widen that gap by another order of magnitude. And at the other end, FP64, it has a similar deficiency.

Of course, eventually they may come out with a cheap but powerful deep learning GPU (though with the V100 having an 800mm2 die, "inexpensive" is going to be relative), and it's impossible for me to compare with future products, but there's a good reason these cost as much as they do.


Where did you get those numbers? For deep learning P100 is ~ 2x faster than 1080 Ti (optimistically). 60x is just ludicrously wrong.

The gap between V100 and 1180 is yet to be determined and depends mostly on what Nvidia does with the tensor units. We shall see. But I am extremely confident in saying that the performance gap will be nowhere near 60x, let alone higher than that. And despite the early announcement of V100, 1180 is likely to be available to most people long before V100, just like P100 and 1080 before.


> 60x is just ludicrously wrong.

I said that FP16 performance is at least 60x worse, which it absolutely is. The cores on the GP102 do not natively support FP16, so each FP16 operand has to be converted to FP32 and then processed, with significant overhead. The P100 can use FP16 directly, yielding not only double the processing speed (because its FP32 cores are really dual FP16 cores, much like split registers on x86) but also a major memory saving. The 1080 (Ti) is crippled at FP64 as well, again dramatically slower than the P100. It's a consumer GPU, and FP16 and FP64 just aren't usually consumer needs.

These aren't just the same thing, with one having a sucker price. They are dramatically different chips, and many applications require the P100.

> just like P100 and 1080 before.

Comparing the P100 and 1080Ti as if the latter is the consumer version of the former is not useful. They are profoundly different chips.


> I said that FP16 performance is at least 60x worse, which it absolutely is.

OK, I didn't think that's what you meant, because it's a rather silly benchmark to use: it artificially disadvantages the 1080. FP32 deep learning performance on a 1080 Ti will usually be ~half of the P100's FP16 performance. FP16 is an advantage, but not a 60x advantage, and the 15+ 1080 Tis that you can buy for the price of one P100 are going to be far faster in most deep learning scenarios (admittedly at a higher power/etc. cost).
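
As a back-of-the-envelope sketch of that claim (the prices and peak-TFLOPS figures below are rough assumptions, not benchmarks):

    # Rough price/performance comparison, using approximate retail prices and
    # published peak throughput figures (assumptions, not measurements).
    p100_price = 12_000          # USD, per the retail listings linked above
    p100_fp16_tflops = 18.7      # Tesla P100 PCIe, peak FP16
    ti_price = 700               # USD, GTX 1080 Ti
    ti_fp32_tflops = 11.3        # GTX 1080 Ti, peak FP32

    cards = p100_price / ti_price
    print(ti_fp32_tflops / p100_fp16_tflops)                 # ~0.6: one card, FP32 vs FP16
    print(cards, cards * ti_fp32_tflops / p100_fp16_tflops)  # ~17 cards, ~10x aggregate peak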

> Comparing the P100 and 1080Ti as if the latter is the consumer version of the former is not useful. They are profoundly different chips.

I couldn't disagree more. The chips are similar enough that most deep learning applications could run on either. Really, the only reasons to choose the P100 for deep learning, at the ridiculous prices Nvidia is charging, are memory capacity and bandwidth, along with the FP16 advantage.


> > Comparing the P100 and 1080Ti as if the latter is the consumer version of the former is not useful. They are profoundly different chips.

> I couldn't disagree more. The chips are similar enough that most deep learning applications could run on either. Really, the only reasons to choose the P100 for deep learning, at the ridiculous prices Nvidia is charging, are memory capacity and bandwidth, along with the FP16 advantage.

Sure, but there are more applications than just deep learning, which is where things get fuzzy. The P100 and 1080 Ti are fundamentally different for anything involving FP64. I think the point of the GP is, or I would hope it is, not to collapse the comparison down to deep learning when things like scientific computing make it necessarily more nuanced.


The "NVIDIA Tesla Family Specification Comparison" table indicates 112 TFLOPS "Tensor Performance (Deep Learning)" for Tesla V100 (PCIe).

Is that double precision?

The Nvidia 1080 Ti has a double precision performance of 332 GFLOPS [1]. If the above number is for double precision computing, the Tesla V100 (PCIe) would be about 337 times as fast (!!)

Does anyone have more insight into these numbers?

[1] https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...

UPDATE: I should have read the article more carefully; it seems to be a mix of FP16 (half precision) and FP32 (single precision). That would likely mean a factor of ~10 in computational performance (specifically for deep learning).
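
Treating the numbers purely as peak-throughput arithmetic (the 1080 Ti figures are approximate published peaks, used here as assumptions):

    # Peak-throughput ratios only; real deep-learning speedups will be lower.
    v100_tensor_tflops = 112.0      # Tesla V100 PCIe, "tensor" (FP16 in, FP32 accumulate)
    ti_fp64_gflops = 332.0          # GTX 1080 Ti, peak FP64 (from the Wikipedia table)
    ti_fp32_tflops = 11.3           # GTX 1080 Ti, peak FP32 (approximate)

    print(v100_tensor_tflops * 1000 / ti_fp64_gflops)  # ~337x vs. FP64 (the misleading comparison)
    print(v100_tensor_tflops / ti_fp32_tflops)         # ~10x vs. FP32 (the updated estimate)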


"...matrix operations with FP16 inputs and FP32 accumulation..." http://images.nvidia.com/content/volta-architecture/pdf/Volt...


Thanks, I've now read the article more carefully and updated my comment.


On this topic, hasn't there been research showing that lower-precision values can be used in neural network algorithms without much loss in performance?


There has been a lot of research on this topic (just Google it), with good results.
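
One way to see why it tends to work, as a toy sketch (not taken from any particular paper): half precision only rounds each value by a tiny relative amount, and networks are quite tolerant of that noise as long as the dynamic range is handled.

    import numpy as np

    # Toy illustration of how little information FP16 throws away per value.
    rng = np.random.RandomState(0)
    w = rng.randn(1000).astype(np.float32)          # pretend these are weights
    w16 = w.astype(np.float16).astype(np.float32)   # round-trip through half precision

    rel_err = np.abs(w - w16) / np.abs(w)
    print(rel_err.max())   # on the order of 5e-4: roughly 3 significant decimal digits survive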


I wonder when such briefs will start including cryptocurrency mining performance on a regular basis.


Probably quite a while, as NVIDIA currently doesn't mention cryptocurrency mining publicly to keep investors from worrying about the sustainability of their recent revenue growth.


I wonder whether cryptomining sells more gpus than deep learning, outside public cloud providers.


I don't know about the deep learning side of things, but on the mining side of things, over the past month, my calculations are that ~50,000 GPUs are added to mine the major GPU-minable coins every day -- from roughly one million GPUs on April 1 to about 3.5 million GPUs now. At an average of $250 revenue per card, that'd work out to ~$13M revenue per day in GPU sales for cryptomining, split between AMD cards and NVidia cards. This, however, is up about ten-fold from March, and in all likelihood will return to those levels as mining returns inevitably decline.

Source of this data is my own calculations using historical data on mining difficulty for the various coins, plus benchmarks for typical/optimal-ish GPUs used to mine each coin.
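
The shape of that calculation, with purely illustrative numbers rather than the actual inputs:

    # Estimate GPUs mining a coin from network hashrate and a typical card's hashrate,
    # then turn the daily growth into hardware revenue. All numbers below are illustrative.
    eth_network_hashrate = 60e12      # H/s (example value for the period discussed)
    gpu_hashrate = 25e6               # H/s per card on Ethash (typical mid-range GPU)
    avg_card_price = 250              # USD, blended across AMD/NVIDIA

    gpus_now = eth_network_hashrate / gpu_hashrate
    print(gpus_now)                   # ~2.4 million cards on this one coin alone

    gpus_added_per_day = 50_000       # from the day-over-day difficulty growth
    print(gpus_added_per_day * avg_card_price)  # ~$12.5M/day in GPU sales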


Are you assuming that all mining is happening with GPUs? Because a large portion of it now is being done with dedicated mining hardware.


Yes, I am assuming that all mining is done with GPUs for ETH/ETC/ZEC/XMR and some others. I'm only including coins which are currently profitable to mine on GPUs, which I take as an indication that those algorithms have seen little/no efficient ASIC implementation yet. I'm thus not including bitcoin, since GPUs are little more than space heaters mining bitcoin; and I'm not including e.g. lbry and litecoin and some others because I don't know enough about the ASICification of those.
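
The "still profitable on GPUs" test is just this comparison, sketched with placeholder numbers:

    # Is a coin still worth mining on a GPU? Compare daily coin revenue to electricity.
    # Placeholder numbers for illustration only.
    gpu_hashrate = 25e6            # H/s on the coin's algorithm
    network_hashrate = 60e12       # H/s
    coins_per_day = 5_900 * 5      # blocks/day * reward per block (example)
    coin_price = 350               # USD
    power_watts, power_price = 150, 0.10   # card draw and USD per kWh

    revenue = (gpu_hashrate / network_hashrate) * coins_per_day * coin_price
    cost = power_watts / 1000 * 24 * power_price
    print(revenue, cost)           # clearly profitable -> probably no efficient ASICs yet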

It's possible, though, as you suggest, that there are some folks running ASICs or improved hashing algorithms at a small scale, small enough not to overwhelm profitability of GPU mining but large enough to muddy calculations which assume that all mining is being done on GPUs.


GP states they are referring to "... the major GPU-minable coins ... ".


I interpreted that as "possible to mine on GPUs", not "profitable to mine", but I guess I misunderstood that.


Hard to say... most recent quarter had datacenter at $409mm in revenue. Mining falls under gaming, which has grown ~$600mm since 2015 to >$1b per quarter today. Try to buy a gaming GPU today and you won't be able to -- they're all sold out because of cryptomining.


The Nvidia 1050 Ti is almost untouched for mining purposes, and there's only about a 10% price increase on 1060s and 1070s. 1080s and 1080 Tis you can get at retail. The GPUs that are in insane demand are the AMD RX 580/570/480.


Only mid-tier, where they're most power efficient. High end cards are generally still available, since the efficiency curve drops off.


It is rumored that NVIDIA will announce mining-only cards within weeks: http://wccftech.com/nvidia-pascal-gpu-cryptocurrency-mining-...


Why don't they run them themselves?


You can usually sell mining equipment for more than the net present value of the coins it will mine.
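
As a sketch of that comparison (every number below is a placeholder assumption):

    # Net present value of the coins a card is expected to mine (ignoring electricity),
    # versus just selling the card today. All inputs are placeholder assumptions.
    daily_revenue = 3.00        # USD of coins mined per day, at today's difficulty/price
    daily_decay = 0.99          # revenue shrinks ~1%/day as network difficulty rises
    discount = 0.9995           # per-day time value of money
    horizon_days = 365

    npv = sum(daily_revenue * (daily_decay * discount) ** t for t in range(horizon_days))
    print(npv)                  # ~$280 here -- often less than the card's resale price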


I was under the impression that ASICs were far better for mining. Have GPUs caught up again?


GPUs are definitely dead for bitcoin. An ASIC will give you orders of magnitude more efficient SHA256 hashing than a GPU.

But for many other coins, there are two big factors bringing GPUs back into the game for mining:

- Lots of alt-coins competing for popularity; many might have ASIC mining implementations in the future, but there either hasn't been enough time or enough money to be made by doing it yet.

- There's been various research done to reduce the potential gain in efficiency from an ASIC implementation over using commodity hardware. This was done in response to the concentrating effect of ASIC mining operations, which put a greater percentage of global hashpower in a smaller set of hands. The main way ethereum implements this "ASIC resistance" is by exercising memory bandwidth -- an area where GPUs are already quite optimized -- in the hashing algorithm. https://github.com/ethereum/wiki/wiki/Ethash goes into detail.
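
To make the memory-bandwidth point concrete, here is a deliberately simplified toy (not Ethash itself; see the linked wiki page for the real algorithm): each candidate's hash depends on pseudo-random reads from a large dataset, so throughput is bounded by memory bandwidth rather than ALU count.

    import hashlib
    import numpy as np

    # Toy memory-hard "hash": mix in values fetched from random positions in a big array.
    # Ethash itself differs in detail, but it creates the same kind of bottleneck.
    DATASET = np.random.bytes(1 << 28)          # 256 MB here; Ethash's DAG is gigabytes

    def toy_hash(nonce: int, rounds: int = 64) -> bytes:
        h = hashlib.sha3_256(nonce.to_bytes(8, "little")).digest()
        for _ in range(rounds):
            idx = int.from_bytes(h[:8], "little") % (len(DATASET) - 64)
            h = hashlib.sha3_256(h + DATASET[idx:idx + 64]).digest()  # random 64-byte read
        return h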


I remember a time, back before the turn of the century, when my friends and I joked about how cool it would be to run Doom on a Cray. Who knew that less than 20 years later we could be...


Doom wouldn't run very well on a Cray machine for multiple reasons, one of which is that it doesn't have the right video hardware.

A Raspberry Pi has way more compute power than a first-generation Cray, and a 386-vintage machine has better single-core, non-vectorized performance.


With the incredible die size of Volta, and the cost associated with it, I wonder if the rumors are true that AMD's next-gen GPU, codenamed Navi, is using the same strategy as their Zen CPUs (Ryzen, Threadripper, Epyc): multiple smaller dies connected together for better yields and lower cost.

Would something like that be even possible with GPUs? I know the Exascale paper from AMD mentions GPU chiplets which sound like this.


TensorFlow 1.2 has integrated new instructions from Intel. Does anyone have figures for 1.2 showing CPU performance vs. GPUs?


What else besides deep learning can the tensor cores do well?


CFD and other high-fidelity multiphysics simulations can be efficient when GPU-bound. It all comes down to being able to write a matrix and stream a bunch of vectors to multiply it by. When considering Krylov subspace methods, this extends to covering a large part of the problem space of linear algebra.

I think the Tesla cards in particular are double precision, which is not necessarily the best for deep learning applications.
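
As a minimal sketch of that access pattern, here is the conjugate gradient method in plain NumPy (illustrative only, not tied to any GPU library): nearly all of the runtime is the repeated matrix-vector product, which is the part you would stream through the accelerator.

    import numpy as np

    def conjugate_gradient(A, b, iters=100, tol=1e-8):
        """Solve A x = b for symmetric positive-definite A."""
        x = np.zeros_like(b, dtype=float)
        r = b - A @ x
        p = r.copy()
        rs = r @ r
        for _ in range(iters):
            Ap = A @ p                      # the matrix-vector product dominates the cost
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x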


High-fidelity multiphysics seems incompatible with FP16, unless you're talking some kind of mixed precision method.


Multiplying and adding matrices are fairly fundamental operations.

Here is a starting point: https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_op...


Yes, but they are limited in size and in precision to FP16, which is not sufficient for most applications, especially scientific computing... What this person is asking, I think, is: can you actually think of another concrete application?



