I get why the 450% increase on BERT is the headline, but I think it's even more impressive that the smallest increase vs. A100 is still 50% (for ResNet).
You don't get these types of generational improvements in the CPU world these days. In some ways, though, it's better that the gains go to the GPU because, after years of talk, we actually do seem to be in the middle of some really awesome ML developments.
This is pretty big news when you look at the numbers floating around for the time to train models like Stable Diffusion, which is by far the most disruptive change that's happened recently. 150,000 GPU-hours just went down to 100,000 if we take the conservative estimate.
Not fast enough (not at that price point), but getting there. Of course, since it's an entirely parallel problem, the real metric is cost-per-model, since in reality you'd just buy time from the cloud.
I figure we're going to start seeing some big changes once cost-per-model puts this within reach of the hobbyist. If I could buy a custom model today for, say, $1000, that's accessible to the hobbyist experimentalist.
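Back-of-the-envelope, with a purely illustrative $2/GPU-hour rental rate (an assumption, not real cloud pricing):

    # Rough cost-per-model sketch; the hourly rate and budget threshold are just illustrative.
    gpu_hours_old = 150_000   # ballpark figure floated for a Stable-Diffusion-class model
    gpu_hours_new = 100_000   # the conservative post-H100 estimate above
    hourly_rate = 2.00        # assumed $/GPU-hour for rented cloud accelerators

    for hours in (gpu_hours_old, gpu_hours_new):
        print(f"{hours:>7,} GPU-hours @ ${hourly_rate:.2f}/h = ${hours * hourly_rate:,.0f}")

    # Effective $/GPU-hour needed to hit a $1000 hobbyist budget at the new estimate:
    budget = 1_000
    print(f"break-even rate for ${budget}: ${budget / gpu_hours_new:.4f}/GPU-hour")

Under that assumed rate, a from-scratch model is still a couple of orders of magnitude away from the $1000 scenario.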
Obviously perf and efficiency would be terrible, but could you distribute the training of a diffusion model across volunteers' computers? I would think that, for many fangroups, you'd have a lot of folks who would like to customize a diffusion model for their favorite characters and/or art.
Crowdfunding model training seems especially interesting, since in theory you could fully automate the process: submit the dataset and code you want to train with, then once it's funded the service automatically runs the training and releases the resulting model to backers.
I think it was donated by AWS or something. This is not a scenario where it makes sense to buy and maintain your own server farm of GPUs when there's already a race to the bottom in cloud-based GPU rentals.
Sure, I think having dedicated acceleration blocks for transformers explains that pretty well, and is definitely exciting for people who can take advantage of it.
I was more contextualizing this, where a potential 75% increase in power (on a smaller process node) to deliver 50% more performance is less impressive:
> I think it's even more impressive that the smallest increase vs. A100 is still 50% (for ResNet)
If I were deploying 8-GPU nodes today, I'm not sure whether I'd rather switch to 4-GPU nodes or double my node TDP. The DGX A100 and DGX H100 both have 8 cards and 640GB of VRAM, but peak power increased from 6.2kW to 10.2kW. If you're training giant models and are constrained on VRAM, you'll potentially need just as many nodes as before just to fit your models into memory. Your training will be faster, but it's hard to avoid the power increase.
The power increase is much more reasonable if you stick with PCIe SKUs (300W -> 350W, literally half the power draw of the SXM SKU), but you pay for that with 20% less compute and 33% less memory bandwidth.
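To make the "just as many nodes" point concrete, here's a rough sketch; the model footprint below is a made-up number purely for illustration, and the DGX figures are the ones quoted above:

    # Node-count / power sketch for a VRAM-bound job (DGX figures from above; footprint is hypothetical).
    model_vram_needed_gb = 5_120          # assumed model + optimizer-state footprint

    systems = {
        "DGX A100": {"vram_gb": 640, "peak_kw": 6.2},
        "DGX H100": {"vram_gb": 640, "peak_kw": 10.2},
    }

    for name, s in systems.items():
        nodes = -(-model_vram_needed_gb // s["vram_gb"])   # ceiling division
        print(f"{name}: {nodes} nodes, ~{nodes * s['peak_kw']:.1f} kW peak")

    # Same node count either way (VRAM per node is unchanged), so total peak power
    # rises in direct proportion to the per-node jump.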
I'm able to trivially sustain 83% (250W / 300W) of the TDP of an Ampere A40 while training a model the size of ResNet-50 with a basic PyTorch training loop.
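For context, "basic PyTorch training loop" here really means nothing more elaborate than something like the sketch below (synthetic data and a placeholder batch size; power draw can just be watched via nvidia-smi while it runs):

    # Minimal ResNet-50 training loop of the kind described above.
    import torch
    import torchvision

    device = torch.device("cuda")
    model = torchvision.models.resnet50(num_classes=1000).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    batch_size = 128  # assumption; just needs to be large enough to keep the GPU busy
    for step in range(1000):
        images = torch.randn(batch_size, 3, 224, 224, device=device)
        labels = torch.randint(0, 1000, (batch_size,), device=device)

        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

        if step % 100 == 0:
            print(f"step {step}: loss {loss.item():.3f}")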
Remember the lower-TDP PCIe H100 has 20% slower compute and 33% slower memory than the SXM model, suggesting that the increased power delivery from the PCIe (350W) to the SXM (700W) model is a major factor in performance even for H100 vs. H100.
I don't think it's misleading to say that power is extremely likely to be a factor in the demonstrated performance increase from A100 to H100 for non-transformer workloads until proven otherwise. I don't think anyone here has a DGX H100 in hand yet to test this.
Edit: wait, also I looked closer at the numbers - in MLPerf 2.1, NVIDIA only submitted results for 1x H100. There's no 1x A100 result submitted for resnet50, just an 8x A100 number, which NVIDIA seems to have divided by 8 to get a "per accelerator" number to compare to their 1x H100. That doesn't feel like a clean comparison, as you can have performance loss when scaling. I'd rather see 1v1 or 8v8.
Vertical power delivery involves putting the voltage regulators on the opposite side of the circuit board from the chip, so that the high-current supply paths travel a minimum distance. This reduces both the resistance and the inductance of the paths that power takes to the transistors, which means significantly less loss. That, in turn, means less heat related to those losses and less infrastructure to mitigate them (fewer bypass capacitors, etc.), so more of the thermal and area budgets can go to compute.
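To give a rough sense of scale (the 700W / 0.8V operating point and the path resistances below are illustrative assumptions, not measured figures): at these power levels the package is sinking on the order of 875A, so even tens of micro-ohms in the delivery path turns into real heat.

    # Illustrative I^2 * R loss for a high-power accelerator package.
    # All numbers here are assumptions chosen to show the scale of the effect.
    power_w = 700.0
    core_voltage_v = 0.8
    current_a = power_w / core_voltage_v                  # ~875 A delivered to the die

    for path_resistance_ohm in (100e-6, 50e-6, 20e-6):    # longer lateral vs. shorter vertical paths
        loss_w = current_a ** 2 * path_resistance_ohm
        print(f"{path_resistance_ohm * 1e6:>5.0f} uOhm path -> {loss_w:6.1f} W lost as heat")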
4.5x faster seems perfectly reasonable to me; I always see the x as meaning a multiplier (or divider) over the original. Saying that this is 450% faster would be blasphemous.
Both are correct in the grammatical sense, but they have different meanings, and in the Nvidia case, "4.5x as fast" would have been correct, or they should have used the less impressive-sounding "3.5x faster".
It might be easier to grasp if you use 1x. E.g. "it's 1x as fast" means it's exactly the same. You're applying the multiplier directly to get the new speed. 0.5x as fast means it's half as fast.
"It's 1x faster" means you add the result of the multiplication to the initial value. So it's twice as fast. 0.5x faster still means it's faster, you add 50% of the initial speed. This way you cannot really express that something got slower, except by using negative values, which might be rather confusing.
I think this might also work in most other languages originating in Europe.
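A trivial sketch of the two readings as used here (arbitrary speed units):

    # "Nx as fast" vs. "Nx faster", under the strict reading described above.
    baseline = 100.0

    def times_as_fast(n, base=baseline):
        return n * base          # multiply the baseline directly

    def times_faster(n, base=baseline):
        return base + n * base   # add N times the baseline on top

    print(times_as_fast(4.5))    # 450.0 -> what Nvidia's headline number means
    print(times_faster(3.5))     # 450.0 -> the equivalent "faster" phrasing
    print(times_as_fast(1.0))    # 100.0 -> "1x as fast" = unchanged
    print(times_faster(1.0))     # 200.0 -> "1x faster"  = twice as fast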
How about "50% faster" and "50% as fast"? If these are different (which to my ear they clearly are), then "200% faster" and "200% as fast" are different too. Naturally it's all quite ambiguous so whoever's writing the press release can pick the more impressive meaning in each case.
To me percentages behave differently from "times". I agree with you on the percentages but disagree on "times faster" and "times as fast", which I think are the same.
The key difference in my reading here is the use of percentages or x, where x I read as "times". I would never expect someone to say "half times faster".
The same issue exists in most European languages, and there are two schools of thought.
Some people want to use language that is logical and consistent. For them, "3.5x faster" means the same as "350% faster" and "4.5x as fast". Others believe in convenience and redundancy. They think that "4.5x faster" means the same as "4.5x as fast", because the numbers are the same. Many of them don't like expressions such as "X% faster" for X >= 100, because the numbers tend to be misleading.
I used to be in the former camp when I was young. Today I'm middle-aged and lazy. When I see "3.5x" in the text, I assume that it means "3.5x". I don't want to read the text carefully to determine that you actually meant "4.5x" when you wrote "3.5x". I interpret the "faster" part as redundancy. It tells me that we are talking about speed and that the thing we are talking about is faster than the baseline.
As a favor, could you list what you think are the most important recent ML developments? It's hard to keep up, and I get the sense from your comment that you might have a condensed summary list that would be helpful.
- Diffusion models for video (see https://video-diffusion.github.io/, this paper is from April but I expect to see a lot more published research in this area soon)
- OpenAI Minecraft w/VPT (first model with non-zero success rate at mining diamonds in <20min)
- AlphaCode (from February, reasonably high success rate on solving competitive programming problems)
> - AlphaCode (from February, reasonably high success rate on solving competitive programming problems)
"Reasonably high" was 50% within 10 attempts, meaning the success rate on a single attempt could be as low as 5%, and who knows how many of those problems had leaked into the training data.
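Rough arithmetic behind that (the independence assumption in the second estimate is mine):

    # If success-within-10-attempts is 50%, what does that imply per attempt?
    p_within_10 = 0.5

    # Union bound: P(any of 10 succeeds) <= 10 * p, so p >= 0.05 -> "as low as 5%".
    lower_bound = p_within_10 / 10
    print(f"lower bound per attempt: {lower_bound:.1%}")

    # If the 10 attempts were independent: 1 - (1 - p)**10 = 0.5 -> p ~ 6.7%.
    independent_p = 1 - (1 - p_within_10) ** (1 / 10)
    print(f"independent-attempts estimate: {independent_p:.1%}")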
I'd add GPT-3 & GitHub Copilot, which my team and I use professionally. It's far from perfect, but it's a great GSD tool, especially for stuff like regexes, bash scripts, and weird APIs.
Out of interest, I've been running the huggingface version of StableDiffusion a bunch, using the M1-accelerated branch on my M1 Max[1]. I'm getting 1.54 it/s compared to 2.0 it/s for an Nvidia Tesla T4 on Google Colab.
The Tesla T4 gets 21,691 queries/second for ResNet, compared to 81,292 q/s for the new H100, 41,893 q/s for the A100, and 6,164 q/s for the new Jetson.
So you can expect maybe 15,000 q/s on an M1 Max. But some tests seem to indicate a lot less[2]; not sure what is happening there.
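The ballpark is just a naive linear scaling of the T4's ResNet number by the Stable Diffusion it/s ratio (rounded down, since the workloads are quite different), something like:

    # Naive extrapolation; assumes ResNet throughput scales like Stable Diffusion it/s, which is a big if.
    m1_max_its = 1.54          # it/s, Stable Diffusion on M1 Max (measured above)
    t4_its = 2.0               # it/s, Stable Diffusion on a Tesla T4 (Colab)
    t4_resnet_qps = 21_691     # ResNet queries/second for the T4 (from the numbers above)

    estimated_m1_resnet_qps = t4_resnet_qps * (m1_max_its / t4_its)
    print(f"estimated M1 Max ResNet throughput: ~{estimated_m1_resnet_qps:,.0f} q/s")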
If my understanding is correct, "nm" numbers across fabs don't really compare very well. So 5nm working out well for TSMC doesn't say anything about 5nm for Intel.
Yes and no. Even with new DX versions it seems only a "meh" reentry to the GPU market - with older versions it looks like a catastrophe to me. The M1 was a "wow" entry.
I'm really curious what will happen with the export ban here. Inevitably, the blowback will be alternatives that emerge in markets outside the United States. I'm curious whether there are any non-US GPU contenders on the horizon?
Why do you think it’s bigger than Hopper? It’s most likely going to be slower than H100, with less memory, and no cuda support. Software is likely to be its weak point (e.g. AMD is still struggling in this regard). Though if the price is 1/20 of H100 it might still be successful.
Now that the USA has removed its competitors from the Chinese market, this GPU will be a commercial success regardless of any technical disadvantages.
So the company that designed it will survive, grow, and be able to design the next generation of GPUs, protected from stronger competitors.
Had China decreed a tariff on imported datacenter GPUs to help its domestic producers, the USA would have protested against such a protectionist measure. But now the USA has itself done what was needed to protect Chinese datacenter GPUs.
ASICs / Google-TPU-like designs for AI inference/training will be easy to replicate and will deliver 95% of the power of existing solutions, or maybe even improve on them, because they are simple in design.
As for dedicated GPUs, there is so much additional work on the software/driver side that unless there's a big concerted effort by a single company to build a solid foundation, I see it as unlikely. Look at Intel struggling to get their GPUs out there.
Dedicated accelerator cards just for ML, though? I believe some already exist, and more will come.
And the biggest push to invest in re-inventing a software stack is a legal ban on access to the existing software.
If building a separate CUDA-like framework previously competed against just paying a bit more to Nvidia, the decision is now infinitely simpler, since using Nvidia products is out of the question.
There is not a lot of high performance digital logic that does not originate in the US.
ARM Ltd is possibly one exception, though after many, many years of attempting high-performance designs and coming out with mediocre ones, they promptly had their doors absolutely blown off by Apple, i.e., the US PA Semi design team, in its first outing (or second, if you count the PA6T, which I guess you should, but the PA6T itself was superior in performance to ARM cores when it was released too; they just didn't play quite so obviously in the same markets). Japan does some boutique supercomputer CPUs, I suppose, although I don't think they're actually competitive so much as a protected and subsidized venture.
I don't know why. China, Russia, Japan, and Europe have been attempting this for a long time with comparatively little success, and then you get a startup in the US coming out with something great.
It's interesting; it almost seems there are dynasties with some secret sauce, the recipe of which was discovered in the 1960s and is only passed down by word of mouth. There are so many of these design teams in startups and established companies you hear about with roots in DEC or Intel or IBM.
Maybe that's changing. Maybe it's less true with GPUs than CPUs. Not sure though, USA is certainly the center of the universe for all that stuff at the moment.
Well, the Mali GPU was from Norway. I guess you would say Mali is not high performance, but it is interesting that a country like Norway can come up with a commercially competitive GPU design. I am sure China can as well.
It's not as if China hasn't been trying to develop these kinds of HPC technologies for several decades already and will only now begin in response to this latest round of embargoes. And they have produced things. They just aren't competitive.
According to their failures in markets where they compete with other products.
I don't know about this chip. Possibly this time it'll be different, unlike all the previous times it would be different but wasn't. I don't think that's been established yet though. And certainly it's not for CPU cores or even more general purpose GPUs.
You don't know about the Huawei Ascend, which launched in 2019 (see https://www.huawei.com/en/news/2019/8/huawei-ascend-910-most...), and which, three years on, everyone actually trying to deploy AI inference (at least here in East Asia) has at least heard of as one of the best solutions on the market, and yet you're still commenting on AI accelerators and "their failures in markets where they compete with other products". Okay. Go on being clueless.
It's already here. What the fuck are you talking about, "I don't think that's been established yet". That would have made sense in 2019, not in 2022.
It's not that far from the truth, in that much of high-performance chip design knowledge is acquired through on-the-job training / working with more experienced people. Learning computer architecture in school just teaches the basics about how digital logic and CPUs work (frequently with 1980s designs), not how to actually build a screaming fast 4 GHz chip that you can tape out and get high production yield on. So Intel, Nvidia, etc. train lots of engineers that go on to start other companies (or move between companies).
The pay at SV semiconductor companies is great, but not as high as at the big modern web companies. Moving from Intel or Nvidia to Google is a big salary bump for a software person.
On the hardware side I think this is less true, but hardware folks and electrical engineers in general don't get the absolutely outrageous SV salaries.
Computer chip design is one of those fields that takes decades of work to build up the expertise and breakthroughs required to deliver cutting-edge performance. Just look at how long it took AMD to catch up to Nvidia, and that's coming from an established company. It's one of the few fields where you can't just copy your way to the state of the art. China is of course working on their own home-grown solutions, but they're nowhere near the level of US-designed silicon.
The ban makes sense given that ML compute is dual-use, but I wonder how hard it would be to train, say, a missile targeting system on AWS. I wonder if any nation-state would take the risk of using the public cloud for classified work as a way around sanctions.
I don't think it's about dual-use ML; I think it's more about edge compute. Someone correct me if I'm wrong, but I think radar resolution is compute-bound, so an order of magnitude or two could defeat stealth.
I don't get why this is news now. The H100 diagrams have been online for weeks/months. I remember being confused by them when I researched A100 options in July.
If you can turn whitepaper diagrams and throughput claims into accurate performance estimates in your head, I suggest you embark on a lucrative career path as a world-leading computer architect immediately, if you haven't already done so.