AMD compute growth isn't in places where people see it, and I think that gives a wrong impression. (Or it means people have missed the big shifts over the last two years.)
It would be interesting to see how much these "supercomputers" are actually used, and what parts of them are used.
I use my university's "supercomputer" every now and then when I need lots of VRAM, and there are rarely many other users. E.g. I've never had to queue for a GPU even though I use only the top model, which probably should be the most utilized.
Also, I'd guess there can be Nvidia cards in the grid even if "the computer" is AMD.
Of course it doesn't matter to AMD whether the compute is actually used as long as it's bought, but lots of theoretical AMD FLOPS standing somewhere doesn't necessarily mean AMD is used much for compute.
It is a pretty safe bet that if someone builds a supercomputer there is a business case for it. Spending big on compute and then leaving it idle is terrible economics. I agree with Certhas that although this is not a consumer-first strategy, it might be working. AMD's management is not incapable, for all that they've been convincingly outmanoeuvred by Nvidia.
That being said, there is a certain irony and schadenfreude in the AMD laptop from the thread root being bricked. The AMD engineers are at least aware that running a compute demo on their products is an uncomfortable experience. The consumer situation is not acceptable, even if strategically AMD is doing OK.
I find it a safer bet that there are terrible economics all over. Especially when the buyers are not the users, as is usually the case with supercomputers (just like with all "enterprise" stuff).
In the cluster I'm using there are 36 nodes, of which 13 are currently not idling (which doesn't mean they are computing). There are 8 V100 GPUs and 7 A100 GPUs, and all are idling. Admittedly it's holiday season and 3AM here, but it's similar at other times too.
This is of course great for me, but I think the safer bet is that the typical load average of a "supercomputer" is under 0.10. And the less useful the hardware, the lower its load will be.
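For anyone curious to check their own cluster, here's a minimal sketch of the sort of snapshot I look at, assuming a Slurm-managed system (the "gpu" partition name is just a placeholder for whatever the local setup calls it):

    # Minimal sketch, assuming a Slurm cluster; adjust partition names to your site.
    import subprocess
    from collections import Counter

    def node_states(partition=None):
        """Count nodes per Slurm state (idle, mixed, allocated, down, ...)."""
        cmd = ["sinfo", "-h", "-o", "%T %D"]
        if partition:
            cmd += ["-p", partition]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        counts = Counter()
        for line in out.splitlines():
            state, count = line.split()
            counts[state] += int(count)
        return counts

    def pending_jobs(partition=None):
        """Count queued (pending) jobs, i.e. demand the machine can't currently serve."""
        cmd = ["squeue", "-h", "-t", "PENDING"]
        if partition:
            cmd += ["-p", partition]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return len(out.splitlines())

    if __name__ == "__main__":
        print("all nodes:", dict(node_states()))
        print("gpu nodes:", dict(node_states("gpu")))
        print("pending gpu jobs:", pending_jobs("gpu"))

A snapshot like this only shows the instantaneous state, of course; you'd need accounting data for real utilisation figures.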
It is not reasonable to compare your local cluster to the largest clusters within DOE or their equivalents in Europe/Japan. These machines regularly run at >90% utilization, and you will not be given an allocation if you can't prove that you'll actually use the machine.
I do see the phenomenon you describe on smaller university clusters, but those are not power users who know how to leverage HPC to the fullest. People in DOE spend their careers working to use as much of these machines as efficiently as possible.
In Europe at least, supercomputers are organised in tiers. Tier 0 machines are the highest grade; Tier 3 are small local university clusters like the one you describe. From Tier 1 or Tier 2 upward you usually have to apply for time, and those machines are definitely highly utilised. At Tier 3 the situation will vary a lot from one university to the next, but you can be sure that funding bodies will look at utilisation before deciding on upgrades.
Also, this number of GPUs is not sufficient for competitive pure-ML research groups, from what I have seen. The point of these small, decentralised, underutilised resources is to have slack for experimentation. Want to explore an ML application with a master's student in your (non-ML) field? Go for it.
Edit: No idea how much of the total HPC market is in the many small installs vs. the fewer large ones. My instinct is that funders prefer to fund large centralised infrastructure, and getting smaller decentralised stuff done is always a battle. But that's all based on very local experience, and I couldn't guess how well it generalises.
When you ask your funding agency for an HPC upgrade or a new machine, the first thing they will want from you are utilisation numbers of current infrastructure. The second thing they will ask is why you don't just apply for time on a bigger machine.
Despite the clichés, spending taxpayer money is really hard. In fact my impression is always that the fear that resources get misused is a major driver of the inefficient bureaucracies in government. If we were more tolerant of taxpayer money being wasted we could spend it more efficiently. But any individual instance of misuse can be weaponized by those who prefer for power to stay in the hands of the rich...
At least where I'm from, new HPC clusters aren't really asked for by the users, but they are "infrastructure projects" of their own.
With the difficulty of spending taxpayer money, I fully agree. I even think HPC clusters are a bit of a symptom of this. It's often really hard to buy a beefy enough workstation of your own that would fit the bill, or to just buy time from cloud services. Instead you have to faff with an HPC cluster and its bureaucracy, because that doesn't mean extra spending, and in particular it doesn't require running a tender, which is the epitome of the inefficiency caused by the paranoia about wasted spending.
I've worked for large businesses, and in those it's a lot easier to spend on all sorts of useless stuff, at least when times are good. When times get bad, the (pointless) bureaucracy and red tape can easily get worse than in government organizations.
> At least where I'm from, new HPC clusters aren't really asked for by the users, but they are "infrastructure projects" of their own.
Because the users expect them to be renewed and improved. Otherwise the research can’t be done. None of our users tell us to buy new systems. But they cite us like mad, so we can buy systems every year.
> It would be interesting to see how much these "supercomputers" are actually used, and what parts of them are used.
I'm in that ecosystem. Access is limited and demand is huge. There are literal queues and breakneck competition for time slots. Same for the CPU and GPU partitions.
They generally run at ~95% utilization. Even our small cluster runs at 98%.
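For context, on Slurm-based systems numbers like that typically come out of the accounting database. A minimal sketch of how such a figure can be pulled, assuming slurmdbd accounting is enabled (the dates are placeholders):

    # Minimal sketch, assuming a Slurm cluster with accounting (slurmdbd) enabled.
    import subprocess

    def utilization_report(start, end):
        """Print Slurm's allocated/idle/down breakdown for the period, as percentages."""
        cmd = ["sreport", "cluster", "utilization",
               f"start={start}", f"end={end}", "-t", "percent"]
        print(subprocess.run(cmd, capture_output=True, text=True, check=True).stdout)

    if __name__ == "__main__":
        utilization_report("2024-01-01", "2024-12-31")  # placeholder reporting window

The allocated share in that report is roughly what gets quoted as utilisation.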
Well then I'm really unsure what's happening. Any serious researcher in either of those fields should be both able to and trying to expand into all available supercompute.
Supercomputers are government funded in 95% of cases, and I recommend you look at the conditions attached to government tenders and the criteria governments have to check when buying. A government isn't a normal business partner that only looks at performance; many other criteria enter into the decision making.
Or let me ask you directly: can you name one enterprise that would buy a supercomputer, wait 5+ years for it, and fund the development of hardware that doesn't exist yet, at a time when the competition can deliver a supercomputer within a year using an existing product?
No sane CEO would have done Frontier or El Capitan. Such things work only with government funding, where the government decides to wait and fund an alternative. But AMD is indeed a bit lucky that it happened, because otherwise they wouldn't have been forced to push the Instinct line.
In the commercial world, things work differently. There is always a TCO calculation, but one critical aspect since the 90s has been SW. No matter how good the HW is, the opportunity costs on the SW side can force enterprises to use the inferior HW because of what's already deployed. If industrial vision-computing SW is optimized for CUDA, or even runs only with CUDA, then any competitor has a very hard time penetrating that market; they first have to invest a lot of money to make their products equally appealing.
AMD is making a huge mistake and is not nearly paranoid enough to see it. For two decades, AMD and Intel have been in a nice spot because PC and HPC computing required x86, and to this date that has guaranteed steady demand. But in that timeframe mobile computing has been lost to ARM. ML/AI doesn't require x86, as Nvidia demonstrates by combining its own ARM CPUs into the mix, and ARM itself wants more and more of the PC and HPC computing cake, with MS eager to help with ARM versions of its OS.
What that means is that if some day x86 isn't as dominant anymore and ARM becomes equally good, then AMD/Intel will suddenly face more competition in CPUs and might even have to offer non-x86 solutions themselves. Their position would then drop to yet another commodity CPU offering.
In the AI accelerator space we will witness something similar. Nvidia has created a platform and earns tons of money with it by combining and optimizing SW+HW. Big Tech is great at SW but not yet at HW, so the only logical thing to do is to get better at HW. All the large tech companies are working on their own accelerators, and they will build their own platforms around them to compete with Nvidia and lock in customers in exactly the same way. The primary losers in all of this will be HW-only vendors without a platform, hoping that Big Tech will support them on its platforms. Amazon and Google have already shown that they have no intention of supporting anything besides their own platforms and Nvidia (which they support only because of customer demand).
But I see no evidence that the strategy is wrong or failing. AMD is already powering a massive and rapidly growing share of Top 500 HPC:
https://www.top500.org/statistics/treemaps/