Nvidia DGX GH200: 100 Terabyte GPU Memory System (nvidia.com)
542 points by MacsHeadroom on May 31, 2023 | 373 comments



I really have to wonder if anyone can compete with this kind of systems integration capability. A core having 900GBps connectivity to the cluster memory at such relatively low power is epic beyond words. 800Gbps ethernet across PCIe is uncompetitive in the extreme.

How the rest of the industry can respond is such a mystery. And will it be lone competitors, or will a new PC era be able to start, with an ecosystem of capabilities?


This seems to be competing directly with Google's TPU pods. Looks like TPU v4 has a 300 GB/s interconnect, and 32 GB HBM per chip * 4096 chips = 131 TB (which is all HBM, so higher bandwidth than the LPDDR in Nvidia's system). So yeah, Nvidia's interconnect seems better. However, TPU v4 was deployed in 2020 (!) and Nvidia's thing won't be ready until next year. I've gotta imagine that TPU v5 has already been deployed internally for a while now, but hasn't been disclosed yet. Who knows, TPU v6 might even be deployed before this Nvidia thing.


Just want to flag a potential unit issue: 900gbps vs 300GB/s?

Also worth noting - TPUv4 uses a 6-way 3D torus interconnect vs the 3-way "multi ToR" NVLINK topology; the total bisection bandwidth of the TPUv4 pod is over 1PB/s!

Can't wait to see what TPUv5 looks like. As you say, it's probably already chugging away with v6 on track to tape out in a year.

That said, I think NVidia has nailed bringing the ecosystem along, and I think making the whole setup look more like "one huge GPU" could simplify a lot of ML programming.

I am actually disappointed I haven't seen more of that style in CPU programming. Where's my 20,000 core 100TB RAM VM instance?


>900gbps vs 300GB/s?

The Nvidia device uses a 900 GBps switched fabric between any of the 256 nodes in the system. The TPUv4 3D torus network is basically a set of 56 GBps connections forming separate rings. From a raw perspective, the Nvidia solution is the overwhelming winner. There is absolutely no contest.


> Where's my 20,000 core 100TB RAM VM instance?

You could simulate this with a bunch of regular machines and a networked hypervisor.

You could do some kind of smart caching so that processes rarely need to wait to access RAM stored on a remote machine.

Combine that with a big lock eliding/speculation scheme (i.e. when a process reads memory that might have been written by a remote CPU, you continue as if it hadn't been, and if you later find out that data was written then you roll back). These rollbacks 'undo' all work done in however many microseconds it takes for data to travel from one side of the machine cluster to the other.

Reads of RAM that aren't cached yet on the local node can also be speculated - you just assume that RAM contained null bytes and continue execution, rolling back and replaying when the actual data arrives.

So if you can make sure that processes are contending for locks and writing conflicting data less often than once per system-roundtrip-latency, then you should get a high performance system.
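
A toy sketch of that speculate-and-roll-back idea, where every name and mechanism (the undo-buffer checkpoint, the assume-zero read) is purely illustrative:

    import copy

    class SpeculativeNode:
        """Toy model of one node speculating on remote RAM."""

        def __init__(self, local_ram):
            self.ram = dict(local_ram)   # locally cached "pages"
            self.assumptions = {}        # addr -> value we speculated on
            self.checkpoint = {}         # state to roll back to

        def begin(self):
            # Cheap in hardware (undo buffer / write queue), clunky in software.
            self.checkpoint = copy.deepcopy(self.ram)
            self.assumptions.clear()

        def read(self, addr):
            # Not cached yet? Assume zeroes and keep executing.
            value = self.ram.get(addr, 0)
            self.assumptions[addr] = value
            return value

        def on_remote_data(self, addr, actual):
            # The real data (or an invalidation) arrives one round trip later.
            if addr in self.assumptions and self.assumptions[addr] != actual:
                self.ram = dict(self.checkpoint)   # roll back speculative work
                self.ram[addr] = actual
                return "rolled back"
            self.ram[addr] = actual
            return "speculation held"

    node = SpeculativeNode({})
    node.begin()
    node.read(0x20)                          # speculate: page is all zeroes
    print(node.on_remote_data(0x20, 0))      # speculation held
    print(node.on_remote_data(0x20, 42))     # rolled back (a remote CPU wrote it)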


This is certainly a very interesting thought to entertain and your ideas make sense. One thing that makes things harder on the CPU side in this hypothetical scenario is that CPUs tend to execute much more diverse instructions/computations than GPUs. So all the caching & speculation you mention is probably all the more important.


After writing the comment, I considered writing a little toy example just to try out the idea... It would be neat to see Linux boot with 1000 CPU's...

But upon further thought, a lot of things such a system would need are actually rather inefficient to implement in software (ie. rollbackable RAM), yet quite cheap in hardware (for example rollbackable RAM can be implemented with regular RAM plus either a buffer of 'overwritten data' or a write queue)


A write queue with a dupe back-end to, say, a blob on S3 or whatever would be interesting, since mirrors of outcomes could be stored.

The biggest issues, it seems, are bandwidth and humans' patience for a response...


> Where's my 20,000 core 100TB RAM VM instance?

Machines with terabytes of RAM do exist and get used - working well on such setups is a goal of modern JVM GCs for instance - but making a single machine that large which acts like a single machine isn't easy, nor especially desirable. One machine is a unified failure domain outside of mainframe-land, so if you had a 20k core machine with 100TB of RAM you could never reboot it to apply OS updates and it'd die all the time from failed parts.

Even once you get beyond that most software stacks use locking and stop scaling beyond a few hundred cores at best and that's assuming very heavily optimized stacks. AI workloads are easier because they're designed from scratch to be inherently parallel without lots of little locks and custom data structures all over the place like a regular computer has.

Disk storage is one of the places where you can parallelize and scale out relatively easily and you do see datacenter sized disks there.


Nvidia's is 900GB/s, not gbps


You're totally right. My bad. I went back to double check the docs and indeed the chips are 900GB/s total over 18 NVLink connections.

https://www.nvidia.com/en-us/data-center/nvlink/


If I'm reading the docs right (TBH - I'm probably not) it looks like on a z16 you can get 200 cores and 40TB of Memory on a single "VM" (LPAR).

So 1/100th of the CPU and 40% of the RAM. (I suspect the RAM comparison is reasonable - I'm not sure about how to compare the CPU's).


If TPUv4 pods were so powerful, why are most new models trained with NVIDIA cards rather than TPU?


These are just my guesses but:

Software for TPU is still in its early stages. CUDA is well established. You can test on a gaming GPU that you can find (locally!) in many markets. XLA is meant to solve this, but first impressions matter and my first impression was that it has not yet "solved" this issue.

TPU is only available via Google Cloud - as far as I know they don't have NVIDIA's widespread distribution to various HPC/supercomputer systems. This also has implications on scaling up more than a few pods, as they will need to be colocated with speedy interconnect (which is provided by the various existing HPC systems that use NVIDIA's chips).

Finally, I think many people are discovering that the supposed benefits of TPU are marginal at best in the face of the types of natural scaling issues that both GPU's and TPU's suffer from when scaling out to e.g. hundreds of pods.

I'm certain that someone with more experience than I could give a better answer though - and again, all speculation. I refuse to use TPU because Google Cloud's system for getting access to said TPU's was horrible for me when I tried it. I believe John Carmack has a nice tweet thread specifying the same issues I ran into.

In general, Google has a habit of developing tech for other Googlers first, and as such winds up ignoring a lot of real-world scenarios faced by researchers/practitioners. NVIDIA on the other hand has been working directly with a ton of institutions and businesses ever since the inception of CUDA.

That their TPU's have seen any adoption at all is mostly due to their research program which granted very cheap access to TPU's to tons of people.


TPUs are mostly hoarded by Google Research (including Deepmind) and Ads. Very few are being used by external people.


The greatest artificial minds of our generation are thinking about how to make us click on ads?


It was ads that made the money to develop the artificial minds in the first place.


This is technically correct to the extent of paperclip maximization and I don't like it.


This is starting to sound very paperclippy. Ads fund the AIs to make us click on ads to fund AIs that are even better at getting us to click on even more ads.


It's ok, as soon as the AI figures out a better way to gather resources, it'll pivot.

(this is not meant to be reassuring).


That is only true for Google. If anything bootstrapped AI, it was gaming.


what's the gaming story? most of the ai we know today builds on academic work going back to the 90s


Probably refers to the development of and increase in computing power of gpus, I guess.


It’s matrix multiplication all the way down.


Yeah, and look at how some very simple clustering ML/recommender systems impact social/political dynamics all to keep people engaged on the site and maximize chances to click ads ( see youtube/facebook, etc. ).


Last time I checked, OpenAI wasn't earning money from ads.


Last time I checked, OpenAI didn't develop transformers.


Ex Machina vibes


Damn right, but I don't understand why. That is, why does the ads business generate so much profit that it allows building such ridiculously powerful devices? Is it because it's genuinely full of money, or is it because Google is so central that it makes tons of money out of lots and lots and lots of small adverts?


It's a monopoly on eyeballs. People don't casually walk in front of domain names; they must find them on Google.

As a result, spending ad money on Google is ridiculously expensive, but companies accept this because there is no alternative, hoping to "build long lasting relations" with the people who make them pay upwards of 1 dollar per click.


At the same time it is also a huge bubble that Google is just hoping will never burst. People and businesses way overestimate the impact their ads are having and way underestimate the impact that treating customers well can have.


I definitely think this is the strategy of Google's leaders; they've heard too much of "how do you monetize your products?" from investors and now they are maximizing profits for the current software generation. I wonder though if that bulk of money will be that much of an advantage when the tides turn. It could attract the wrong kind of leadership, among other things like customer distrust, and turn the company into an IBM of some sort. Namely, I would rather maximize YouTube Premium memberships (which is at "only" 50 million) over ads (surely they've local-maximized the balance between the two as it is) - but it's easier said than done.


I think both are important. Word of mouth is useful and important but no one would use google to search to buy stuff if that was the only way to reach customers.

Also, if you're established it's probably a good idea not to let new competitors get a foothold in the market with an easy Google win.

It's also pretty effective for local businesses, because not a lot of local businesses are tech savvy enough to use it effectively.


> people who make them pay upwards of 1 dollar per click

FWIW, the cheapest (quality) clicks I've seen, at least in the B2B space, are closer to $3/click, and it can quickly balloon to upwards of $10/click, especially on company brand names where competitors are bidding on another company's brand name.

Knowing this, I cringe every time I'm screensharing with someone and they search "[B2B Company] login" to login to a tool they use every day. Each login = $2-$10

It's not uncommon for companies to spend $100k+/year JUST bidding on their own company name.


It honestly escapes me how these companies can be sustainable. The whole market is sooo inefficient. Companies also pay crazy money to appear in privileged positions in supermarkets shelves, and they will often pay crazy money for simply being in the supermarket at all

I just don't get where all the marketing money is coming from. Bootstrapping is clearly not an option these days


Computers DOUBLED the productivity of the USA since the second world war. All that money went to a few people and groups, and none of it went to average people. For decades, companies have just been sloshing the same giant pile of cash around and around the Ads ecosystem.

That bag of chips did not cost $4 to make, not even a little close.


Because there is no incentive for customers to tell businesses what they want, businesses tell their customers what they should want.


My working theory is that advertising is the overhead cost of doing capitalism. There is a certain percentage of resources which has to be spent on advertising to keep the system functioning. Google is good at grabbing a large portion of a huge pile of money.


Not really. It's sufficient to show cool products in "TV" shows (robotic vacuum cleaner in a procedural crime drama might even be a plot device, absorbing murderer's hair to be found by detectives, gasp!).

Coupled with a magazine or a show presenting new product categories for those interested, customers will eventually visit a physical or online shop and check out the goods. And then word of mouth will do the rest.

Aggressive advertising will mostly just help you get ahead of your competitors and perhaps speed up the adoption rate at the cost of increased volatility of the market and to the detriment of people's mental health.

We would be better off regulating aggressive ads away.


> Aggressive advertising will mostly just help you get ahead of your competitors

That's a hell of a load-bearing "just" you managed to insert there. Getting ahead of your competitors in market share can be the difference between having a company succeed or fail.


So if nobody is "getting ahead of competitors", does it mean that "capitalism is not functioning"? (which was the point of the comment to which the reply was)


Product placement is still advertising, likewise advertising plays a role in getting people to go to that online or brick and mortar shop instead of some other one.


I propose a law: nobody can advertise a product without mentioning all the brands which offer same or similar product on the market (and the mention must be neutral or positive).

Or: all advertisers of all brands with a same or similar product must collaborate. Only voluntary input counts as collaboration; if a brand simply doesn't care about presentation of itself in the advertisement, they have trivially collaborated. Easiest way to implement this is giving every owner of all relevant brands a right to veto every entire final advertisement product (this right could also be surrendered, for all or some possible vetoed advertisements, in exchange for something in a contract).

Ignoring flaws of this proposition itself, what could be society's reasons for rejecting it? Does society perhaps want havers of more money to gain further advantage over havers of less money?


>nobody can advertise a product without mentioning all the brands which offer same or similar product on the market

Maybe 50 years ago that would have worked. Today, not so much. Go to Amazon and look up, well, just about anything. What is BEHENO, what is DINGEE, what is Etoolia, what is Romedia, what are the over 300 different 6/7 letter companies that show up when I search for some random product.

Unfortunately your proposal causes its own parasitic effect of countless companies forming up to feed off the big advertisers' budgets.


Since the product is standard, why is it actually bad? If there are too many brands to be included in a single advert, just choose randomly (the lower the price, the higher the probability for a single brand; I don't know the function).


Because, in the US, this will quickly fall foul of free speech laws. Over 'public' airwaves maybe you could go some distance with this, but advertising on private property, as long as it is not fraudulent, will present a constitutional challenge to what you're saying.

And you're also creating a regulatory nightmare. Say I put up an ad for XXYZXX company, and it includes ZZXYZZ and YYXZYY information (I mean totally random picks), and I just happen to have a stake in those companies too. Now you're going to have to track hundreds of thousands of these entities to ensure no fraud is occurring, and in most cases the fines for this kind of behavior are well under the cost of doing business.

Everything you've said so far just creates bigger messes and solves nothing.


It solves a hypothetical skew towards brands offered by already richer businesses.

About regulation, how hard is it to just audit the random picking procedure?

I now understand that my second variant, with vetoing of final advertisement, is very flawed (one can cheaply obstruct anyone's advertisement by making a company that vetoes any version of it). How about dividing an advertisement into pieces of information solely about each distinct brand, and let every brand owner compose the piece for its brand? Then all pieces are added into final concrete form in a collaboration - I think it would succeed in most cases, and if brand owners can't collaborate, then an independent company will work on it.

Then we need to look how exactly freedom of speech is defined. If it means ability to express views without attaching any additional information, then such freedom is incompatible with my proposal. But if freedom of speech allows attaching additional information as long as base message is preserved, I see no problems. Note that the proposal essentially just forces you to advertise other brands as they wish, along with any advertisement that you do, which (brands) it doesn't mention.


@h4kor one of my crazy ideas is to cap the money companies are allowed to spend on marketing once they reach a certain size. It would encourage a better form of decentralized capitalism and prevent monopolies.


This could easily turn out to be counterproductive. It would provide an additional incentive to hide marketing in all kinds of other business activities rather than openly advertise what's on offer.

Marketing is already difficult to tell apart from other company communications, product documentation, etc. What about a company blog showing how to use their products? Is that marketing or product documentation?


The point is that "openly" advertising would be capped. That would reduce the price of doing so, making it more affordable to smaller players and removing the insane profits ad monopolists enjoy today. Plus, "openly" advertising is one of the most effective ways of advertising. Lastly, by diverting marketing budgets to non-traditional routes (charity donations, etc), the economy would benefit as money would be spread more evenly across


I wonder if there’s some sort of automatic stabilizer that could be applied instead.

Tax ad companies, and spend that money on education. The better ad companies are doing, the more we spend on education, the fewer gullible marks we produce, the worse ad companies will do.


sure, but what do you consider to be an ad company? is a newspaper that places sponsored articles an ad company? accounting for "marketing" expenses might be easier to track and at the end of the day, companies use accountants that are liable and so need to report accurately


Exactly, it’s the mechanism for exchanging information in a capitalist economy.

Conversely, in Communist systems they could never get this right. Factories were just told to produce 5 or 10% more than last year, didn’t matter if the product quality was worse or if people didn’t want it.


There was some competition amongst consumer goods producers, and TV and other ads, in the USSR. High scarcity of good quality stuff meant they didn't need to advertise, but there was also an oversupply of junk nobody needed. Those companies had to move their inventories somehow, since it was much harder for them to go bankrupt.


unfortunately pure capitalism has no mechanisms for externalities and information hiding.


A little freaky when you think about what that really means. Some of the most advanced AI systems in the world are solely focused on being good at manipulating human behavior. Cool... cool cool cool............


Tangentially, I think this explains the conspiracy theory that ad companies are spying on everyone's phones and serving ads based on what we talk about in real life.

Think about all the stuff ChatGPT and GPT-4 can do with even minimal prompting. Even when they hallucinate, the text is still ostensibly coherent and natural sounding. Now imagine a similarly powerful model, but its input is a ton of metadata about your behavior and its output is ads.

Now consider that adtech has had substantially more funding for substantially longer than research into LLMs, so ad serving models are probably way more powerful and optimized than even GPT-4.

It's freaky to think about indeed.


Another thing is: people's individual behavior is not as unique as we'd like to think. As a whole everyone is unique, but in single surprisingly complex aspects of our life we are hardly ever alone.


so Hari Seldon was right in his psychohistoric theory ?


It's been that way for over a decade now. Welcome


It’s not very good at it if it is.


No some are into the space industry.

So we can have internet anywhere. To click on ads.


It's ads that make the market efficient. Potential customers should know the corresponding producers so that the information assumption of an ideal market holds.


Ads can have both persuasive (propaganda) and informative functions.

Informative ads make the market more efficient. Persuasive ads actively make the market less efficient.

Most ads in the US in 2023 seem to be persuasive.

Perhaps the ad industry would become more useful (and smaller) if we managed to effectively regulate it to significantly reduce the persuasive bits.

I think that most people would support this if you explained it right - from the free-market perspective, this would give you a better market.


How else would I know that “Elon Musk created a TeslaX platform that allows everyone to get rich”? Or was it Pavel Durov… Seriously, I can’t even report these on YouTube.


There is a mythology to Google's TPU that is not validated by real world numbers. Where we can actually test (I mean - TPUv4 pods are available right now on their cloud) it is very good, but remains uncompetitive with the H100. I mean, Google itself says you shouldn't compare them, doing the classic "the H100 is on a better process node so it's unfair". People will always point at a mythical next generation that is surely way better, despite the fact that Google is currently building big supercomputers with their TPUv4. And in Google's shootout, again comparing with the last generation of Nvidia hardware (the A100), Google's biggest advantage was in the connection fabric, which with this DGX GH200 Nvidia not only overcame, but bested by an order of magnitude.

More competition would be fantastic. Better pricing at scale would be fantastic. But there is absolutely no doubt that Nvidia is far ahead of Google right now. Tesla has made some believable claims about its own efforts with its own hardware, so who knows, maybe they're the real challenger.


To add to the other answers, TPUv4 was not released to cloud customers until last year. And I bet availability is not as good as GPUs, even in Google Cloud (obviously TPUs are not available at all in other clouds).


Availability only on GCP and in particular cost.


Google has advertised that they have better perf/$ than GPUs, is this wrong or do you just mean absolute cost (so not available in small enough slices)?

edit: actually now i can't find the claim, maybe i misremember what the papers said.


Perf/$ where $ is what it costs _them_, not the $ they're ready to sell it to others for as a product. Cloud margins in the high two-digit percents are typical, and I'd imagine even higher for very specialized products in high demand from deep-pocketed customers.


https://arxiv.org/abs/2304.01433 does claim "1.2x-1.7x faster and uses 1.3x-1.9x less power than the NVIDIA A100".


From a personal use case, the number of instructions available on TPUs is still limited and some workarounds are needed when designing custom layers. Even if they're available in platforms like Colab or Kaggle, people still lean toward GPUs as they are more versatile.


The network topology of TPUv4 is far far inferior though. It's a torus. No switches.


What evidence is there that Google would be able to out compete nvidia on AI hardware?


None. Heck, I can’t even search my gmail effectively any more, so if they can’t maintain a core product, I doubt they can build a new one of any quality. alphabet are now just a big, bloated catch-up corporation running on inertia and past glory.

I don’t think they will exist in 10 years.


>None. Heck, I can’t even search my gmail effectively any more

Their search products have actually gotten worse with AI. Google Images running just off basic image recognition (as in, is this the same image?) plus the context of where it found the image was far superior at identifying what an image is than the ML-based Google Images.

The OG version could identify a frame from a movie and provide higher res versions. The ML version goes "errr, looks like a woman on a street, here are random photos of unrelated women on unrelated streets with maybe a similar color scheme" - close to useless; why would anyone want that? Yandex image search blows it out of the water simply by being Google Image Search from a decade ago.


This is the kind of stuff that I see as being the crux of their downfall. Snippets have also gone to pot over the last year or so.

The overall theme is that product is no longer the focus, but rather navel-gazing - that’s to say, their internal world no longer aligns with the external world, and that is a fundamentally dangerous place for a business.


I thought I imagined it being worse but yeah...


Someone who believes Google won’t exist in 10 years is delusional beyond words.


Yes yes, and the East India Company will reign supreme for all time, Refco is too important to fail, Blockbuster will dominate home entertainment forever, and it’s simply inconceivable that a single trader could bring down Baring Brothers, they’ve been going for centuries!

Businesses fail. google will likely still exist, but alphabet, I don’t see a future for - just a gradual withering followed by a collapse and disintegration into myriad properties in a fire sale. They are brittle, overburdened by unity of disparity, culturally adrift, and they aren’t taking risks any more. Inertia will keep it all going for a while, but not forever.

Sure, I may be wrong, but I do put my money where my mouth is, and I am right more often than not.


Your reply is interesting because you strongly believe Alphabet will fail, but only supported that by arguing that, over the very long term, companies fail.

I see a lot of hate for Alphabet on HN. It seems very emotional. I think people feel personally betrayed by their bad behaviours because they were 'supposed to be better'.

The thing is, there are a lot of companies you can hate. Exxon, McDonald's, BlackRock, even Microsoft - there are people who are very mad at these companies.

That's not an argument that the company is doomed. If you are really putting your money where your mouth is (what, shorting Google?) then I hope you have better reasoning as to why they will fail not just eventually but this year.


I don’t hate alphabet - neither do I love them. I look at them through the lens of history. You on the other hand seem to be emotionally wounded by my assessment of them.

None of the companies you list are likely to collapse soon, as they remain focussed on their various missions, and have a unity of purpose. Out of all of them, I think Microsoft is the most likely to fail, as they are likely to be blindsided when the user-focussed desktop OS era ends. Their diversification efforts have been a mixed bag, and without windows, they are far, far less significant.

What I do look at is sentiment analysis - what other people feel and think about businesses, as that drives the market.

No, I don’t short, as just buying equities which are beginning significant growth is just as effective and doesn’t drive demise - I held goog for nearly 20 years, and sold off late ‘21, as I think they’ve peaked, and anything from here on is speculative froth.

You’ll note I keep saying “I think”, rather than making statements of fact - because this is purely what I think - I am not a sibyl.

You seem to have missed this:

>> They are brittle, overburdened by unity of disparity, culturally adrift, and they aren’t taking risks any more.


> Out of all of them, I think Microsoft is the most likely to fail, as they are likely to be blindsided when the user-focussed desktop OS

That might have been a reasonable assessment back in Ballmer’s era. But what you’re saying has already happened, years ago…

They have mostly reinvented themselves since then. Enterprise/Office isn’t going anywhere. Xbox is fine too. And there is a lot of growth in their cloud/etc. business.

IMHO, out of Google, Amazon & Facebook, Microsoft seems to be the least dysfunctional and in general the best positioned to be successful in the future.


Xbox doesn't seem fine. I think it's propped up by Game Pass having cross-platform title access with Windows while still sitting on the Xbox balance sheet, but growth and the number of exclusives don't paint a healthy picture.


Yeah by Xbox I mean the console + game pass + PC/Xbox gaming division. The console itself at this point is not much more than a cheap(ish) locked down gaming PC.


For context, I have never worked for or with google, and don't use their products much other than search. So I don't have much emotional connection to the company. My comments were more motivated by a kind of concern.

My perception is that Google split into a number of focussed business units when they became Alphabet, with the Google component being execution focussed and the more speculative stuff spun out into other group companies like deepmind, waymo, etc. That's why the Google unit stopped doing nice incubator projects that we were all excited about.

From what I've seen, this cash cow execution business unit has been fairly effective - in particular they've done a good job of entering the cloud market space, producing a differentiated product that is penetrating their target customers. They have not been able to compete with Microsoft's excellent and deeply embedded IT sales capability, so they've done well to go after people with big problems that other vendors' more civilian offerings are not so great for. They are currently the first choice platform for AI training, for instance.

I'd contrast this to Facebook, who seem to be trying to become a deep tech VR hardware vendor in the same business unit as their cash cow entertainment and advertising business, which has confused investors and probably distracted their focus.

We can see that Google has innovated. For instance, a lot of Tesla's stock price is based on the idea that they are going to run autonomous taxis, and instead of owning cars we will just hail a Tesla when we need one. Tesla does not run autonomous taxis, but you can ride a Google Waymo taxi today in Phoenix, and they are running autonomous trucks, which is a big industry Tesla isn't even attempting yet. They are doing a lot in medicine and medical devices. This seems a lot more diversified and innovative than other companies - it's just not as visible to the HN community as an RSS reader or some other internet thing we care about.

We can also say that... on the AI thing, I think it's very early days. Microsoft have a shaky looking deal with the first mover, but Alphabet and Facebook have the advantage of actually using AI extensively in their real businesses and may be able to deliver product market fit better. Time will tell.

On the stocks front, I agree with your overall thesis - I think it's harder for these conglomerates to grow than a new company just because they are already giants in their niche and even adding a new niche generates less growth in percentage terms than for a smaller company starting from a lower number. I just wouldn't actually bet against google as much as I would some of the others.


You originally said 10 years, so hopelessly delusional lol. Since you're so confident, let's bet $10,000. By your claim, by 2034 (I'll give you some extra time) Alphabet Corporation and all subsidiaries will no longer exist. If they still exist, I get your $10K. If they do not, you get my $10K.

We can both give the money to a mutually trusted third party now.


but IBM still exists, and Microsoft after missing the mobile market. Even Nokia exists.


If you believe this it implies you have gone full malthusian, because the innovation engine that made Google possible could also disappear Google, but without that engine we are all screwed.


> core product

Gmail is a freebie! The core product is how they index your messages to create an anonymous profile that they will then offer on reverse bid to advertisers when you do a search or visit an AdWords site.


Gmail dev here (but not search), I don't think anything has changed with search. Operators still work too. What's actually wrong?

Do you just have more email now?


If you don’t match the terms in your email exactly right with your query, gmail starts returning email that matches one of the terms which is rarely what you want.


What would you expect it to do? How is it supposed to know what you want if you don't provide it with the exact search terms? Shouldn't it do partial matches if there are no full matches ?


It should look for things that match the meaning of the words searched.

E.g. what’s possible with embedding where the query terms are matched with similar messages that mean the same thing.


Is HBM mostly Samsung?


Market share for last year was 50% SK Hynix, 40% Samsung and 10% Micron, but as there is currently huge demand things may change depending on capacity.


I thought SK Hynix was the big producer of HBM? But that could be out of date.


Some of the specs seem inaccurate here; HBM has been present in NVIDIA datacenter GPUs for a while now. LPDDR is for their gaming hardware.


The GH200 uses a combination of HBM3 for the GPU and LPDDR5 for the CPU but it's a unified memory system so the GPU can access all the RAM. Gaming GPUs use GDDR which is a third flavor.


As always in economics it is about volumes and margins.

If the competitors (mainly AMD, Intel and to some extent ARM) keep seeing growing volumes and insane margins, they will be attracted to invest and take part in that market.

Until now the gaming GPU market did not bring AMD the margins needed to really push them to bring better competition to Nvidia. Even 10-15 years ago, when ATI was way ahead of Nvidia technologically for 2-3 years (the HD 4000 and HD 5000 generations vs the Nvidia flops of the 9000, 200 and 400 series), Nvidia was posting billions in profits while ATI posted a whole... 19 million in profits across 3 years.

But today's GPU market, thanks to its non-gaming sales, is too big to ignore (which is why Intel entered it as well) and those players will likely react.

You don't need to have the best premier product, you need to have your products good and priced well enough that they will be chosen over the competitor's.


Cerebras supposedly can: https://www.servethehome.com/cerebras-wafer-scale-engine-2-w...

In hindsight the 40GB of SRAM feels kind of quaint, but nevertheless their very fat nodes let them get away with more than Nvidia could with A100s, as you can see in the slides.

CS2 is a little old now. I bet an update is just around the corner.


I have to wonder why these engineers are not paid millions of dollars per year. As a lowly backend dev this seems so much more impressive than my new API that retrieves something from a database...


Engineers are really poor negotiators, probably because they neglect "people skills".


Because their scales of what are important things are so often entirely unintelligible to the rest of the planet.


Hardware Engineers are really poor negotiators.


No need for the quotes, it's exactly the issue


Because 10 engineers paid hundreds of thousands a year do a better job than 1 engineer paid millions.

And that's because it is very high skill work, otherwise, it would have been 100 engineers paid tens of thousands.


perhaps they are, even if in stock share price appreciation?


Isn't a lot of this what AMD already sold as ORNL Frontier 2 years ago? The main difference seems to be that the external bandwidth here is indeed crazy via NVLink (though it is only 450GB/s per direction, so the same as 64 lanes of PCIe Gen5...) and they have two networks for communication (although I suppose HPE Slingshot is as good as the InfiniBand here)...


Sounds like you haven't seen wafer-scale integration computing. Tesla has one, and commercial companies like Cerebras will sell you a cabinet without the miles of fiber networking.

https://www.cerebras.net/andromeda/


FWIW, I believe it’s 900GB/s full duplex, so 450GB/s each way. Still a ton!


I'm curious how you can keep that fed with data fast enough. What kind of interfaces to your network do you need to keep it busy and not just waiting on data.


Ultimately how fast their transistors can switch and at what power is determined by TSMC, which everyone else can use too. Same for density of interconnect.


For comparison, a lane of PCIe 6.0 is 7.56 GB/s. If all 128 lanes of an AMD CPU could be used as interconnect, that would be 967 GB/s.


Is there really a market for a response though? Now, I'll be honest that I know very little about this market. What I do know from doing a decade of presales before covid hit, is that people who buy GPUs go for aggregate max on a big node farm. Now, most of my clients who bought GPU-heavy scale-out nodes were in the financial industry, so maybe deep learning stuff is different. Their workloads were massively parallel, and could scale out instead of needing something singularly fast.

So I guess my question is - what use case is there for a huge truck that goes 200mph and takes 4 trips, when you could just buy 16 regular trucks and move your apartment in the same amount of time at half the cost?


The exciting thing about CXL is we can start to find out if peripheral or hopefully close-networked computing fabrics can be useful & interesting, beyond the small circumstances Nvidia will offer. Having an ecosystem that everyone can participate in will let us explore. Money can't buy that. Talent can't buy that. You need to socialize to really find out the possible values.

'The street finds its own uses for things' is the well known Gibson adage, and typically it's a comment aimed low. But our entire era of amazing computing began with the Gang of Nine enabling lowness to such a degree that it quickly became the highest tech, the best. Sure, you can still buy a mainframe and they have amazing feats, but it's not where the value is, and the value is where it is because possibility was unchained, unleashed from corporate dominion, and spread wide. I think we can find amazing new futures with CXL & mad bandwidth connectivity.


The reason that analogy falls short is because it's easier to drive the huge truck at 200mph than it is to find 16 truck drivers. It's really neat when you figure out how to map/reduce your algorithm so you can parallelize it, but it would be even easier if you didn't even have to in the first place. And that's assuming that it is even parallelizable in the first place. Not all algorithms can be optimized like that and needs a bigger system to run on.


There are workloads that are data parallel, and scale like the GPU-heavy scale-out nodes that you describe.

The other approach, which you do when models themselves are massive, is model parallelism. You split it into multiple parts that run on different nodes.

In both cases, you need to distribute weight updates through the network although the traffic patterns can be different.

To maximize the performance in both scenarios, systems designers optimize for all-reduce and bisection bandwidth.

There are also other tricks, for example the TPUv4 ICI network is optically switched, and it is configured when a workload starts to maximize bandwidth for the requested topology ("twisting the torus" in the published paper).


Using something like Stable Diffusion and generating all the frames at once (for video) as a single image. For that kind of usage one needs to have RAM for the whole image. This setup could generate videos like that in the same time as I generate an image on my home computer.


There were rumors floating around that GPT-4 was going to be a 100 trillion parameter model. Those rumors seemed ridiculous in hindsight, but this announcement makes me rethink how ridiculous it really was. 100 Terabytes of GPU memory is exactly what you need to train that class of model.

However, I’m not even sure enough text data exists in the world to saturate 100T parameters. Maybe if you generated massive quantities of text with GPT-4 and used that dataset as your pre-training data. Training on the entirety of the internet then becomes just another fine tuning step. The bulk of the training could be on some 400TB dataset of generated text.


Rule of thumb is that you need ~20 tokens per parameter. The average token size is ~4 characters, probably more for larger models where you want a larger dictionary, but for simplicity I'll say it's 5 bytes to make the numbers round. So you need 100 bytes of text data per parameter, or 10 PB for a 100T model. Now, recent research says that you can reuse the same data about 4 times before it starts hindering performance, but that doesn't help much in our case.

But in this case what is really ridiculous is the compute requirement. The compute required for an optimal model grows roughly quadratically (both your model and your data grow linearly). So for a 100T model you need 1e30 FLOPs. This machine gives you 1e18 FLOPs per second. It will take 30k years to train this model on one of these (or 30k of these to train it in a year, but then utilization starts kicking in).
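
A quick sanity check of that arithmetic, using the same rough assumptions as above (~6*N*D FLOPs, ~20 tokens per parameter, ~5 bytes per token, ~1 exaFLOP/s for one of these systems); it lands in the same ballpark as the ~30k-year figure, the gap being just the rounding of 1.2e30 down to 1e30:

    params = 100e12                    # 100T-parameter model
    tokens = 20 * params               # ~20 training tokens per parameter
    bytes_per_token = 5                # ~5 bytes of text per token
    flops = 6 * params * tokens        # ~6 FLOPs per parameter per token

    dataset_pb = tokens * bytes_per_token / 1e15
    machine_flops = 1e18               # ~1 exaFLOP/s for one DGX GH200 (low precision)
    years = flops / machine_flops / (3600 * 24 * 365)

    print(f"~{dataset_pb:.0f} PB of text, ~{flops:.0e} FLOPs, ~{years:,.0f} years on one system")
    # ~10 PB of text, ~1e+30 FLOPs, ~38,052 years on one system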


"The best time to start training a 100T param model was 30k years ago, the second best time is now."


Probably the best time to start training a 100T model is never.


What if you could train an AI with a desired outcome to their answers?

I.E. ; "answer this question where the outcome is the most beneficial to quality of life"


I'll take your question further: what if we have unlimited data (say some crazy rich RL environment or a way to produce high quality and diverse synthetic data)? You still have to get these 1e30 FLOPs. Let's say you can connect 100 of these bad boys together at 40% utilization, for a total of 4e19 FLOPs/s. Assume also that Moore's law keeps working indefinitely. When should we start training the 100T model on it to get it as early as possible? We wait x years and then start training on a machine with 4e19*2^(x/2) FLOPs/s. Turns out the answer is ~16 years, after which we'll have 1e22 FLOPs/s and 1e30 FLOPs will take another 3 years.
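
The same wait-vs-train trade-off written out, assuming a clean 2x in available compute every 2 years and nothing else changing:

    total_flops = 1e30                 # for the 100T-parameter model, as above
    base_rate = 4e19                   # 100 of these at 40% utilization, FLOPs/s
    year = 3600 * 24 * 365

    def done_after(wait_years):
        # Wait, buy hardware that has doubled every 2 years, then train to completion.
        rate = base_rate * 2 ** (wait_years / 2)
        return wait_years + total_flops / rate / year

    best = min(range(60), key=done_after)
    print(best, round(done_after(best), 1))    # 16 years of waiting, done ~19.1 years out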


> life

A strange game. The only winning move is not to play. unplugs self


A properly designed AI agent would do exactly that.


That's obviously false under the assumption computing power will increase as it has in the past.


For the uninitiated, it's a tree planting quote.


I am not going to be embarrassed for the following Q ;

Please ELI5 where I can have a glossary of AI/ML terms - where do I get fluency in speaking about Tokens, Models, Training, Parameters, etc...

Please don't be snarky - this is info that everyone younger than I am needs as well.

Is there a Canon? Where is it?


At the risk of sounding snarky, https://chat.openai.com would be a good introduction, followed by books, which GPT could recommend.


I have no idea tbh. I learned these a while ago (~7 years ago), and the materials I used then are heavily outdated and also I won't be able to remember what they were. I guess any intro course to deep learning should talk about these. Stanford ones used to be good. Maybe someone else can be more useful about it.


I'd recommend starting here:

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

It's pretty lengthy but doesn't require a PhD to understand. If you can get to the end of it you'll have a much better understanding of what's going on.


To be honest, I'd start with some introduction to Transformer YouTube videos. They'll cover a lot of these terms and you'll then have a better understanding to find additional resources.


> Rule of thumb is that you need ~20 tokens per parameter.

That rule of thumb is wrong. The chinchilla paper has it anywhere between 1 and 100 tokens per parameter.


> Those rumors seemed ridiculous in hindsight

No, those rumors seemed ridiculous even then. Many AI influencers were posting some of the most absurd material, often making basic mistakes (like confusing training tokens with parameters), but anyone in the field could have easily told you that 100T parameters sounded ridiculous.

On that note, "100 Terabytes of GPU memory is exactly what you need to train that class of model." is also likely false. That's how much you'd need to fit such a model into memory at 1 byte per param. Not train it.


https://huggingface.co/docs/transformers/perf_train_gpu_one#...

You can't train a 100T model with "only" 100TB of VRAM. You need for each parameter 4 bytes (weights) + 4 bytes (gradient) + 8 bytes (AdamW optimizer state) + forward activations that depend on the batch size, sequence length, etc., maybe more if you use mixed precision, and you also need to distribute the weights.
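
Tallying just those per-parameter terms (activations left out, since as noted they depend on batch size and sequence length):

    def train_vram_tb(params):
        # 4 bytes weight + 4 bytes gradient + 8 bytes AdamW state, per parameter
        return params * (4 + 4 + 8) / 1e12

    print(train_vram_tb(7e9))      # 0.112 TB for a 7B model, before activations
    print(train_vram_tb(100e12))   # 1600.0 TB for a 100T model: way beyond 100 TB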


The general rule of thumb that I'm familiar with is that you need about 80 bytes of VRAM per parameter when you are doing training. Inference is different and a lot more efficient, and LoRA is also different and more efficient, but training a base model requires a LOT of memory.

A machine like this would top out below 2 trillion parameters using the training algorithms that I'm familiar with.


I suppose it would be 12 bytes? 4 bytes for base model, 4 bytes for optimizer momentum and 4 bytes for optimizer second moment EWA.


I don't know what the breakdown is, but I know there was code for training the llama models on a DGX (640 GB of VRAM, repo is now gone), and it could only train the 7b model without using deepspeed 3 (offloading).

The ML engineers in my group chat say "1 DGX to train 7b at full speed, 2 for 13b, 4 for 30b, 8 for 65b"


Why 80? It's matrix operations on 4 byte numbers for single precision.


Because you need a lot more information to perform back-propagation.


It's not "a lot more" information, it's holding derivative (single number) per parameter, right?


For automatic differentiation (backpropagation) you need to store the intermediate results per layer of the forward pass. With checkpointing you can only store every nth layer and recompute the rest accordingly to reduce memory requirements in favor of more compute.
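
A toy sketch of that every-nth-layer idea in plain Python; the "layers" here are stand-in functions, not a real framework:

    def forward_with_checkpoints(x, layers, every=4):
        """Run the layers, keeping activations only at every `every`-th layer."""
        saved = {0: x}
        for i, layer in enumerate(layers, start=1):
            x = layer(x)
            if i % every == 0:
                saved[i] = x
        return x, saved

    def activation_at(i, layers, saved):
        """Recompute layer i's output from the nearest stored checkpoint."""
        start = max(k for k in saved if k <= i)
        x = saved[start]
        for layer in layers[start:i]:       # redo the skipped forward work
            x = layer(x)
        return x

    layers = [lambda v, k=k: v * 2 + k for k in range(12)]   # stand-in "layers"
    out, saved = forward_with_checkpoints(1.0, layers, every=4)
    print(len(saved), activation_at(7, layers, saved))       # 4 activations stored instead of 13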


What intermediate results you need to store?

For backpropagation you take the diff between actual and expected output and you go backwards to calculate derivatives and apply them with the optimiser - that's 8 extra bytes for single precision floats per trainable parameter.

Why do you need 80?


You also need the optimizer (e.g. Adam)'s state, which is usually double the parameter's size. So if using fp16, one parameter takes up 6 bytes in memory.


Yes, if you use ADAM - but it doesn't add up to 80, does it?

Even for fp64 it adds only 16 bytes.

RMSPRop, Adagrad have half of this overhead.

SGD has no optimizer overhead of course.


It's not per parameter, you also need to hold activations for back prop to work.


You need activations for inference as well.

But all of that (trainable parameters, activations, optimizer state) is like 12 bytes per trainable parameter, not 80.


Not the GP, but I believe that they are talking about the size of the training data set in relation to the model size.


You don't need to and can't really load all training data.

For LLMs you need to load a single row of context size - that's a vector of, say, 8k numbers, which is 32kB for single precision floats.


For the numerically challenged like me: 100TB is 100 trillion bytes, giving you 1 byte per parameter at 100T params.

LLaMA can apparently run quantized to 4 bits per param (not sure if worth it though), which would allow you to run a ~200T parameter model on one of these systems if I'm understanding right.
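
The byte math, for anyone checking (weights only, ignoring activations and KV cache):

    memory = 100e12                        # ~100 TB of unified GPU memory, in bytes
    for bits in (16, 8, 4):
        print(bits, "bit ->", memory / (bits / 8) / 1e12, "T params fit (weights only)")
    # 16 bit -> 50.0, 8 bit -> 100.0, 4 bit -> 200.0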


You can’t quantize it for training due to numerical instability. For inference you don’t usually use such a big cluster.


I think people talking about a 100T GPT didn't mean a dense transformer but some sort of extreme Mixture-of-Experts which is much more amenable to low-resource setups and complicates this discussion.

In any case, it's almost certainly not bigger than 1T, even if it's not a dense transformer (PaLM-2 is and makes do with 340B, but it isn't exactly on par).


> LLaMA can apparently run quantized to 4 bits per param (not sure if worth it though)

From the GPTQ paper https://arxiv.org/abs/2210.17323:

"... with negligible accuracy degradation relative to the uncompressed baseline"


That would work for inference, but for efficient training you’d also want your training set to fit in memory.


There is much more out there than text. Audio, visual, touch, smell. Text isn't something humans directly train on, but representations of text from our senses.

GPT-4 was trained on image data. Besides gaining understanding of image content it also showed improved language abilities over a GPT-4 trained with only text. Facebook is working on a smaller model with text, image, video, audio, lidar depth, infrared heat, and 6-axis motion data. If a GPT-4 was trained with data like that, what capabilities would it have? Rumor says we will know in a few months.


My understanding is that the image data used a decoder-only stage, i.e. mapping images to tokens, basically taking the image textual description instead of the actual pixels so it can't "see" but can understand the "narration of an image"


John Conner, is that you?


> However, I’m not even sure enough text data exists in the world

I hope these models move significantly beyond text at some point. For backend programmers it's ok, but for the rest of the technical world (circuits, mechanical engineering, front end, sound, etc), it's fairly limited.


My understanding is that this is already the case, see PaLM-E as one such example of a multimodal model.


> Maybe if you generated massive quantities of text with GPT-4 and used that dataset as your pre-training data

Hello spurious regression


These 100T rumors were ridiculous from the start, not just in hindsight.


I think we’re going to start seeing learning based on all the video out there. Text is just computationally easier, but video contains a lot of information that people rarely write about, because it’s completely obvious to humans who grew up in the real world.

Also, I think training in simulated realities will be big, especially for learning how to interact with complex systems, for developing strategic planning heuristics.


There may not be enough text content on the internet, but there’s plenty of audio and video content, and there has already been some research about connecting that as an input to an LLM. So far we’ve seen that the more diverse the training data the more versatile the model, so I suspect multi-modal input training is inevitably where LLM’s are going.


As far as I can tell, the "100 trillion" number comes from an interview with the CEO of Cerebras when he was doing press for the WSE-2 release in 2021: https://www.wired.com/story/cerebras-chip-cluster-neural-net...


You don’t really need to fit fully in memory. Memory requirement to train is

~6 * D * P * precision

Where D is the number of tokens per step (context window * mini batch size) and P is the number of parameters.

So if you want to fit fully into memory with a mini batch of 1, a context window of 32k, and 16 bit precision, that’s 144e12 / 6 / 32e3 / 2 = 375M params.

If you apply one token at a time then

144e12/6/2 = 12 T param

Ofc, in reality you have model parallelism as well…


I have to wonder how much improvement you would get with a 100 trillion parameter model. There seems to be diminishing returns in model size. That effort could almost certainly be better spent.


Let's record every conversation on Android to collect training data! Anyone can do the math?


So how exactly (in the technical sense) is this more energy efficient than both PCIe and Infiniband (which seems to be a claim somewhere too, together with the added bandwidth)?

EDIT: so the whitepaper is surprisingly good for that (somehow all the articles are very weird...): https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-ho... - essentially they connect the GPUs with NVLink instead of PCIe (so, vertical integrator's heaven) and then NVLink forms a separate interconnect for the GPUs. So this is cool and essentially what Fujitsu, Google, ... have done for some time. A fun thing is that they like to add up their NVLink duplex bandwidth but don't do so for PCIe... (which would then suddenly have the same bandwidth as the GPU side).

Still very cool to see the mainframe come back alive ...

(it's a bit sad they bought Mellanox - monopolies are sad...)


Article written by marketing. White Paper written by apps engineering. You can always tell so much more about semiconductor products if you can get your hands on material written by engineers. The information hasn't been distilled through layers of internal training sessions.


This is awe-inspiring and almost scary, it's pretty much beyond my understanding how much data these systems are meant to process.

What is also beyond me is how someone at Nvidia thinks that the label sequence "1.00E+2; 1.00E+3; 1.00E+4; 1.00E+5; 1.00E+6" for the vertical axis in "Figure 1" is more readable than "100; 1,000; 10,000; 100,000; 1,000,000" would have been. The latter is 5 chars less (total), even. Or, if exponential notation is important for the Big Serious Computing People, then perhaps they could have dropped the ".00" part from each value? Or, if I'm allowed to dream, gone with actual exponential notation?


The exponent number is the number of zeros; it's way more readable and faster to interpret than counting zeros.


It's easier to think in (possibly relative) orders of magnitude than with absolute numbers, instinctively it's what we do when we read large numbers.


It's scientific notation, and it makes the graph a log scale. It allows you to see they gained more than two orders of magnitude in a single generation.


I'm not sure what's the problem with the exponential notation? It shows scale in order of magnitudes.


> I'm not sure what's the problem with the exponential notation?

Same. It's just more efficient and readable than counting the zeros while considering the culture-bound digit group separator norms like thousands/millions 3,3 vs lakhs/crores 2,2,3, cf. https://en.wikipedia.org/wiki/Indian_numbering_system?useski...

Personally, I think it'd have been better to drop the .00, which adds nothing: 1E2, 1E3, etc. would be far better.


There is a semantic difference: 1,000,000 means exactly 1,000,000, whereas 1.00E+6 means anything that rounds to 1.00×10^6 at three significant figures. The decimal places after the 1 in 1.00E+6 specify the precision of the measurement.


I don't think they specify any precision. It's just a way to write very large/small numbers approximately. (Though these numbers here aren't really considered large.)


They do. See wikipedia [significant figures] https://en.wikipedia.org/wiki/Scientific_notation#Significan...

In this case, it is pointless though, since the precision is actually known.


I think the scientific notation is much more readable. Writing out the number of zeroes just leads me to counting them.


I can't find how much it will cost or how much power it will use. I mean it will be a lot and maybe only Google and Microsoft and Facebook can afford it but I still want to know.


The DGX A100 was $200k at launch. I found a DGX H100 in the mid-$300k area. And those are 8-GPU systems. So you need 32 of those, and each one will definitely cost more, plus networking. A super low estimate would be $500k each for $16M total. But considering it's moving from 98GB to 480GB of RAM per GPU, it might be more like $1.5M per 8; round it to say $50M.

And at 1/8th the power per GB, 700 watts / 96GB / 8 * 480GB comes to around 450 watts each, and 115 kW for the 256.
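
Putting those guesses in one place (every price here is an assumption carried over from the estimate above, not a quote):

    systems = 256 // 8                      # 32 eight-GPU boxes to reach 256 GPUs
    low, high = 500e3, 1.5e6                # guessed price range per box
    print(f"${systems * low / 1e6:.0f}M to ${systems * high / 1e6:.0f}M")   # $16M to $48M

    watts_per_gpu = 700 / 96 / 8 * 480      # 1/8th the power per GB, scaled to 480 GB
    print(f"~{watts_per_gpu:.0f} W per GPU, ~{watts_per_gpu * 256 / 1e3:.0f} kW for 256")
    # ~438 W per GPU, ~112 kW for 256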


What does this mean for the AI race? For example, what if a newish company (newer than Google/Facebook/Microsoft/etc.) like Anthropic, Scale, Perplexity, or Stability is able to scrape together $5B USD in funding and spend their hardware budget on these things. Say they can buy $1B of them and spend the rest on hackers and operating expenses (idk if that's realistic). So maybe they could purchase and operate like 20 of them. Say that they spend six months doing experimental things and then the next six months training their Tsar Model. If they follow the Chinchilla scaling laws and normal architectures, how good will these models be?


For the AI race, it means the door opens to a competitor. Could be AMD, Google, or Amazon, all of which have offerings in this space.

However, while the hardware isn't cheap, it's still likely not a blocker. Costs do inhibit more experimental research, though.


A startup that rents this compute when needed will likely have more of an advantage on the AI front.

Selling shovels is good business, but it doesn't compete directly in the AI arena.


I have no expertise in GPU systems used for AI training, but would it be possible to buy a bunch of consumer cards and get the same performance? Or is this not possible because consumer cards top out around 40-ish GB of RAM, so models would not fit, or would be "swapping" like crazy and be slow?


Consumer cards only have PCIe 4.0, at most 24GB VRAM and the only recent model with NVLink, the RTX 3090, can only be connected to exactly one other card. It doesn't scale beyond that. So you are limited to PCIe 4.0 x16 speeds.


Technically the A6000 has 48 GB of VRAM and works in 2-way NVLink.


The NVLink interconnect between all the GPUs is a huge part of it, and consumer gear cannot come even remotely close to that bandwidth. Then there's the density of RAM to compute and power. A single 4090 is 450 watts for 24GB, whereas this gets roughly 20x the memory for the same watts per node; matching the memory with consumer cards works out to 2.3 MW or so. At $0.14/kWh, that's something like $325/hour in power costs just to run it, not counting the additional cooling you are definitely going to need. And I am sure there are inefficiencies this doesn't cover, but that's 240 V at 10,000+ amps.
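A quick sketch of that math in Python (the 4090 count is just the hypothetical number needed to match the memory, not a real build):

    total_gb = 480 * 256          # ~123 TB of node memory across the system
    cards = total_gb / 24         # 5,120 RTX 4090s to match it
    watts = cards * 450           # 2,304,000 W, i.e. ~2.3 MW
    print(watts / 1000 * 0.14)    # ~$323/hour at $0.14/kWh
    print(watts / 240)            # ~9,600 A at 240 V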


Not the same. Not all problems can be efficiently divided among NUMA nodes with low bandwidth interconnects.


> would it be possible

No


With that much memory I could probably run Crysis 4 and an Electron app side by side!


It's in Nvidia's interest to make sure they don't end up in a situation where a small group of very big customers buys a large slice of their production. For Nvidia, a market of the same size with many small-to-medium customers is a lot better, as those customers have far less power to force Nvidia into something that isn't in its interest. I expect to see moves from Nvidia to help smaller players and open-source or semi-open models avoid being crushed by the big players. Not because they are nice, but because it is in their best interest.


I wonder if you can even run this on a regular 20A circuit - I'm thinking no - 20 * 120 = 2400 Watts - I assume that will not be enough...


The hob in my kitchen is 7.3 kW. A normal 32 A × 240 V circuit allows up to 7.68 kW, and the 6 mm^2 cable is rated to something like 45 amps.

This seems fairly common e.g. https://www.currys.co.uk/products/aeg-ikb64401fb-59-cm-elect...

I am sure data centers have larger circuit breakers and chunkier cables than my kitchen appliances!


Woah, I looked at the specs:

    Front left: 2.3 kW / 3.7 kW
Cripes. You can boil water extremely fast on that IH setup! I'm living with 1.5 kW, and it is painful...


When we first got our induction cooktop I was so excited about how ridiculously fast we could boil water. Which is definitely an odd thing to get excited about. It definitely isn't that powerful though. That's a lot of power.


For a single Grace+Hopper node? I'd bet it fits in that budget: the Grace Hopper datasheet says the combo has a CPU + GPU + memory TDP programmable from 450 W to 1000 W, which leaves more than half the room for the rest of the node's power budget. For the DGX GH200? It's 18,432 CPU cores with 256 GPUs across 16 full racks of servers :p


Here's an example of an 8x H100 machine - look at the tech specs: https://lambdalabs.com/deep-learning/servers/hyperplane

6x 3000 W PSUs in a 3+3 redundant config, so 9,000 watts usable. That means at least a 240 V × 50 A feed, ×2 for redundancy.


I would guess about $100m and ~1.5-2MW


Yet another thing to put on my wishlist for whenever I become an eccentric billionaire.

I know that I will never be able to afford such a thing (or possibly even afford to power it for more than a few minutes), but a man can dream.


Did supercomputers ever produce something meaningful, or did advancement usually come out of "scrappier" setups?

I remember hearing a lot about rankings of supercomputers, but less so about what they actually achieved.


Yeah, they do all the time. In my parallel computing course back in grad school we got to use an 800-core test machine where people were running simulations of weather patterns, climate change, earthquakes and whatnot. A lot of that can be done by taking advantage of all of those cores. Academia in particular uses these heavily to get closer to the "physics", within clear discrete limitations.


What is the difference between a supercomputer and a million ordinary servers connected together?


Typically:

  1. Low latency network, 1-2us.  Most servers can't ping their local switch that quickly, let alone the most distant switch among 1M nodes
  2. High bandwidth network, at least 200 Gbit
  3. A parallel filesystem
  4. Very few node types
  5. Network topology designed for low latency/high bandwidth: things like hypercube, dragonfly, or fat tree
  6. Software stack that is aware of the topology and makes use of it for efficiency and collective operations
  7. Tuned system images to minimize noise, maximize efficiency, and reduce context switches and interrupts.  Reserving cores for handling interrupts is common at larger core counts.


Simplicity of the programming model, basically, though in the end it all just comes down to bandwidth and latency.


Communication speed / latency is a big one. Sometimes it matters how quickly extremely large volumes of data can be sent between cores.


Supercomputers exist in meaningful part to compensate for our lack of ability to do nuclear tests. This is why the national labs run them.


>Did supercomputers ever produce something meaningful

Supercomputers do all the hard work in research universities all the time. Hell, astrophysics and research involving telescopes and observatories use them all the time.


Yes, absolutely. Most climate models run on supercomputers, same with molecular dynamics, large scale fluid dynamics, energy systems simulations and of course a whole lot of weapons research.


> supercomputers ever produce something meaningful

Absolutely, they contribute to research all of the time.

Some of them have pages where they list research outputs that they enabled (though this is of course limited to those authors tell them about!).


Ever-better weather forecasts, for one. I can remember that, about two decades ago, forecasts were still rather wobbly and could only see a couple of days into the future. Now a 10-day forecast is routine, and surprisingly good. Much of that improvement came from more powerful supercomputers.


Weather forecasts are vastly more accurate because of supercomputers. And they're improving all the time.


We've never had architectures that scale so effectively, unlocking new cognitive capabilities by just increasing parameters/exaflops/datasets without writing a lot more code or changing the architecture. Ilya Sutskever mentioned this in some interview, that transformers are the first with that property but probably won't be the last or best.


Most of the research done at CERN, for example.

Besides the outcomes that were adopted by the industry, before cloud computing there was grid computing, exactly to manage such resources at scale.

https://en.wikipedia.org/wiki/Worldwide_LHC_Computing_Grid


I think part of this comes down to your definition of "supercomputer", but I mean, pretty much the entire internet is powered by servers. I've never worked there, but I'm assuming that AWS data centers have very powerful computers designed to handle thousands of VMs/containers each, and I suspect with all the AI hype, a large percentage of them have very beefy GPUs in there as well.

If you're talking about the more stereotypical "high performance supercomputers", I think that they are still used very liberally within the defense industry. I think Lockheed Martin, for example, uses them for CFD analysis.


Google might've been built on a laptop, but it can't scale on a laptop. Same applies to coding an algorithm on a scrappy setup, and then scaling it to sequence DNA or simulate a phenomenon.


Not sure what this means, because google effectively scaled on laptops (generic x86).


In my head the way I differentiate "supercomputers" (national labs) and "warehouse-scale computers" (google/amazon/azure) is:

1. Workload: for national labs this is mostly sparse fp64, in my understanding; for warehouse-scale computing it's lots of integer work, highly branchy, lots of pointer chasing, stuff like that.

2. Latency/reliability vs. throughput: warehouse-scale computing jobs often run at awful utilization, in the 5-20% range depending on how you measure, in order to respond to shocks of various kinds and provide nice abstractions for developers. Fundamentally these systems are used live by humans, and human time is very valuable, so making sure the service stays up and returns quickly is paramount. In my understanding, supercomputing workloads are much more throughput-oriented, where you need to do an enormous amount of computation to get some answer but it doesn't much matter whether the answer comes in one week or two weeks.

3. Interconnect: warehouse-scale computing workloads are mostly fairly separable, and the place where different requests become intertangled is the database. In the supercomputing world, in my understanding, there are often significant interconnect needs all the time, so extremely high performance networking is emphasized.


Nice ontological classification, thank you!


Yes, especially when you account for shared university supercomputers, and especially back when it took a supercomputer to do much of anything.

Generally, given there are far more less-powerful computers, accessible to much scrappier interests, one would expect more innovation to happen on them.


Weather simulations and forecasting are very useful to society and practically almost all available public weather forecasting datasets were computed in some supercomputer cluster.


Have people experimented with distributed training of parts of the model to avoid needing these absolutely massive GPUs? Anyone have pointers to large scale distributed training done recently?


Yes, certainly. One industry use-case that comes to mind is Baidu; white-paper link below [1]. Pretty much all the large model developers distribute their model training across hardware in some way, using a blend of GPU/TPU/FPGA accelerators across multiple CPU nodes. Moving all the data around is expensive though, in both power consumption and time, which is why NVIDIA's new system would be of interest.

[1] http://research.baidu.com/Public/uploads/5e76df66c467b.pdf


This is fantastic, thanks.


The DGX described here is a distributed system in the sense that many nodes, each with their own GPUs, exist and are part of the overall whole. They are connected over Infiniband and use RDMA in order to read/write memory across the cluster. Therefore training is also distributed among the nodes in the sense that each node takes part of the process.

The difference is that Nvidia's software and hardware stack combined makes all these systems, all these aggregate GPUs, look like One Really Big GPU. Not hundreds of small ones. That's not only good for users because they can take existing programs and migrate them to these big machines and get improved performance, but also good because it's generally much easier to program "one big machine" as opposed to programming and orchestrating many small ones. This is an attractive proposition for many but it requires an insane amount of integration to achieve.

So, the major differentiator here isn't the lack of or existence of many discrete machines connected together. It's the programming model, at this scale, that's different. And Nvidia is way ahead of everyone else here in terms of programming models; once full heterogeneous memory management for CUDA arrives in a stable consumer driver, it'll be a massive change for others to catch up with.

What you might also be referring to is the idea of "distributed training", or what is called "ensemble learning" where you individually train a bunch of small unique models that, when combined together, perform better than if they were one giant model (or at least are as accurate/efficient as a giant model.) It's "The P2P model" of training because you can take lots of small models and collectively aggregate them. That's an open problem people are attacking but not really relevant in the case of the DGX.

Many hyperscalers, such as Microsoft and their project "Brainwave", have very complex heterogeneous AI datacenter stacks consisting of GPUs, FPGAs, TPUs and CPUs. (Google "Microsoft Brainwave" for some papers.) This DGX is positioned as an alternative to that but also as a tool for their customers to use since many want to train large models efficiently.


It's not InfiniBand; they're using NVLink this time.


NVLink is the GPU interconnect and part of the memory fabric. InfiniBand at 400Gbps is the data and storage interconnect for the CPUs but also to build clusters bigger than 256 GPUs. It's all in the docs.



The issue is that distributed training needs high bandwidth and very low latency to be efficient. In a single computer you can fit about 8-10 GPUs, or if you go to extremes like in this system you might fit 16. To scale beyond that, you connect multiple computers in the same rack via InfiniBand (an optical-fibre networking technology; the system in the article comes with a 400G InfiniBand network adapter).

But systems that can host many GPUs tend to be expensive, and electricity is expensive, so at scale the expensive GPUs make sense. For a homebrew solution you can stick four consumer GPUs in a case and might save a buck.


There's also Petals: https://petals.ml/


this is distributed training with RDMA-aware and directly interconnected GPUs?


How far are we from fully modeling the human brain? I mean besides an easy way to identify all the neuronal connections...

This makes me feel like we're close to that one terrifying short-story.


We are massively far away from modeling the human brain. First of all, no one can agree what level of detail is necessary to model the brain, and that varies tremendously by scientific question. Personally, my lower limit would be something like the computational package NEURON, which models voltages across axon compartments and the distribution of ion channels. My upper confidence bound is that we don't care about anything subatomic.

At the upper bound: in molecular dynamics, which is used extensively in modern-day neuroscience to understand the function of ion channels and GPCRs, a single H100 can simulate roughly 70 ns/day for a ~1M-atom system. There are 8.64e+13 nanoseconds per day, and ~1e+26 atoms in a human brain. Therefore, a back-of-envelope upper limit is 1e+26 atoms / 1e+6 atoms per GPU * (8.64e+13 ns / 70 ns) ≈ 1.2e+32 H100 GPUs to simulate a whole brain at atomic resolution in real time.

Calculating the lower bound is more difficult, but let's start by saying you can get away with an fp16 for each synapse. Storing the weights of such a model for 100 trillion synapses is 200 terabytes, and if you figure roughly 4x the weight size to do anything useful, then this system is within spitting distance. Note that this example lower bound is massively less complex than the NEURON-style model I suggested, as the entire field of neuromodulators, homeostatic mechanisms, glia, and more is thrown out, all of which are important for modeling how the brain works under certain computational regimes.
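A quick back-of-envelope for both bounds (a sketch built on the figures above; the ~1M atoms at ~70 ns/day per H100 is my assumption for a typical MD workload):

    # Upper bound: atom-level molecular dynamics of a whole brain, in real time
    atoms_in_brain = 1e26
    atoms_per_gpu = 1e6           # ~1M-atom system per H100 (assumption)
    sim_ns_per_day = 70           # ~70 ns of simulated time per day
    real_ns_per_day = 8.64e13
    gpus = (atoms_in_brain / atoms_per_gpu) * (real_ns_per_day / sim_ns_per_day)
    print(f"{gpus:.1e} H100s")    # ~1.2e+32

    # Lower bound: one fp16 weight per synapse
    synapses = 100e12
    weights_tb = synapses * 2 / 1e12
    print(weights_tb, weights_tb * 4)   # 200 TB of weights, ~800 TB with working memory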


For the lower bound there is a dark horse factor that has spooked Geoffrey Hinton. He thinks that biological brains aren't able to do backpropagation effectively through multiple layers, and so differentiable programming frameworks are much more powerful than what the brain has, at an algorithmic level. In other words, he thinks that computers are able to learn more effectively than any neuron-based biological brain. Of course right now there are caveats. The brain appears to have more 'statistical efficiency' meaning it appears to learn more from less data, and the brain is obviously more energy-efficient. There is also the possibility that Geoffrey Hinton is just wrong.


Biological brains also don't really operate layer by layer and can have connections between random neurons, so it's probably a lot more space efficient. Impossible to say if any of that actually matters though.


Can you cite a reference for Geoffrey Hinton (and/or others) on this "dark horse factor" line of thought? I think it resonates with a lot of people, and having a Schelling point to refer to in discussions would be handy.

I believe the situation is a lot more extreme than the absurd efficiency of differentiable programming. I have been meaning to write up (but been too busy to do so) an insight where I believe training can be made ridiculously cheap computationally speaking (in a way that combines with differentiable programming, not replaces it). I am agnostic if this is what the brain does, but wouldn't be surprised at all if the brain does in fact do back-propagation (or uses the insight that I've been meaning to write up).


That is a spectacular response!

My bio knowledge is very basic, so forgive the naivety in these two questions.

First, I'm not asking you to go through the math on the spot, but I'm guessing that lower-bound capability is well understood in 'the field', but is it documented against various species? Perhaps mapping against current / projected GPU/compute systems capabilities? (I know there's a project to model a worm's brain, IIRC down to molecular level. But I'm picturing a 'we are 3 years away from being able to emulate a basset hound, 4 years for a border collie' - that kind of roadmap.)

Second, you said the upper bound is to ignore the sub-atomic. I thought we had proton and electron gradients, at least in metabolism. I believe a proton there is effectively a hydrogen ion, but electrons would imply some potential need to emulate at the sub-atomic level? Have I misunderstood the bounding / chemistry involved?


The electron and proton are still considered part of the realm of atomic physics. When physicists speak of sub-atomic, they tend to mean any physics below the Aufbau model of electrons and nuclei.


We will never simulate the entire brain atom by atom, and we won't need to, the same way we never simulate atom by atom, or place structural atoms by hand, when we build a bridge, a rocket, or a tree house; we can be far more intelligent than that [1]. In the limit, the entire thing could be even simpler than we can currently imagine [2]. But yes, before we start leveraging equations, we must first find the principle of gravitation for collective intelligence [3].

[1] https://en.wikipedia.org/wiki/Hodgkin%E2%80%93Huxley_model

[2] https://en.wikipedia.org/wiki/Reaction%E2%80%93diffusion_sys...

[3] Michael Levin | Cell Intelligence in Physiological and Morphological Spaces, https://www.youtube.com/watch?v=jLiHLDrOTW8


I think we're a long way away, if I understand this article about the difficulty of accurately modeling even a single biological neuron:

https://www.quantamagazine.org/how-computationally-complex-i...

> "If each biological neuron is like a five-layer artificial neural network, then perhaps an image classification network with 50 layers is equivalent to 10 real neurons in a biological network."

The complexity explodes quickly because each biological neuron's behavior is modulated by a large number of biochemical neurotransmitters, on top of all the dendritic connections (up to 15,000 each, apparently).


When AI becomes good enough, maybe we will stop thinking about trying to imitate human brains. If we viewed our brain's decision-making power objectively, we would find several flaws; for example, the heuristics that let us make quick decisions about mundane things are also our greatest weakness (short-sightedness). We are poor at incorporating data to make good decisions and are constantly biased by external stimuli.

Why would you want to make anything close to the brain? What real scientific, engineering, or humanitarian uses does doing that even have? AI is already, and going forward should strive to be, a ground-up redesign of intelligence.


> Why would you want to make anything close to the brain? What real scientific or engineering or humanitarian uses does doing that even have?

To have models of the human brain that we can poke at and change and tinker with and etc., so that we can get better ideas of how therapy techniques, medications, ... will impact the actual real people that might benefit from them.


You would need to define what you mean by 'modeling the human brain'. If it means AGI or anything similar, then we're very far.

To paraphrase an analogy I've heard somewhere (in a similar context): we're building better and better ladders, maybe even lifts, with this last push in the ML field. But the brain is on the moon - even the best lifts won't get us there.


Sorry if I'm missing an obvious reference, but what short story do you mean?

(edit: thanks for both replies already, and any others that might fit; I understand now the reference was likely to Asimov)




Wouldn't modelling the human brain mean we'd be using less power? We're using brute force to try to get similar results to what the brain does.


The question that interests me is how far are we from modelling a human cell, neuron or not? Because that's how we cure cancer.


Is human consciousness just a product of neural networks, or are there additional mechanisms we're not aware of?


The brain is analog and chemical, AI will be digital and silicon. We have no idea how to map from one to the other.


> The brain is analog and chemical, AI will be digital and silicon.

Says who?

Sure, if you assume that “AGI is just scaling up GPT”, it will be digital and silicon. But that’s a big assumption.

For all we know, AGI will only ever, if it exists, be analog and chemical.

> We have no idea how to map from one to the other.

Plus, even if we had an easy one-to-one mapping function between them, we don’t understand the source well enough to do the mapping.


It does not need to be, but today the computers we use are overwhelmingly based on silicon. Also OP mentioned AI, not AGI.


An analog-to-digital and digital-to-analog converter is less than a cup of coffee in some places [1].

[1] https://protosupplies.com/product/pcf8591-a-d-and-d-a-conver...


Oh god, my background is CE/ECE stuff and you managed to trigger me. I don't want to be rude... just bluntly saying you triggered me. Doing something really small for A/D and D/A at 8 bits, without worrying much about resolution and data loss, is one thing. For something at massive scale, the problem is a lot less trivial and a lot more mathematical.


Haha, sorry, was more of a tongue-in-cheek reply to "We have no idea how [to] map from [analog] to [digital]".


Is the Quantum computer hypothesis dead?


I don't see how quantum computers are relevant? We can't build them, and there certainly isn't any interesting quantum computation in the brain.


What do you mean we can’t build them?

To your second point, we have little to no ability to understand yet what quantum effects may or may not be active in brain/consciousness function. We certainly can’t exclude the possibility.


I mean lots of people have tried to build quantum computers, and so far no-one has succeeded in anything that's describable as a "computer", instead of "half a dozen logic gates". Perhaps in the future.

We can fairly well exclude the possibility of interesting quantum effects in human consciousness, because the human brain is a hot, dense environment that might as well have been literally designed to eliminate the possibility. It's the exact opposite of how you want a quantum computer to be built.

Which doesn't mean there aren't plenty of quantum effects involved in the molecular physics, but that isn't what is normally meant by 'quantum computer'. Transistors would also meet that definition.


I do not agree with either of your assertions:

1) That 433 qubits does not make a computer and is instead “a half dozen logic gates.” I agree a half dozen logic gates is not a computer. 433 qubits is not comparable in terms of information capacity or processing capacity to a half dozen logic gates. This number is also publicly doubling annually now — I would bet the systems we don’t know about are more complex. Importantly, a computer in this context is not something you would attach a monitor to — it is just an electronic device for storing and processing data.

2) That we have any good idea of the limits of how biological systems might be influenced by quantum effects within specific temperature ranges. You certainly wouldn’t construct a human brain to interface with quantum effects given the present state of our knowledge in constructing these kinds of systems. But then we can’t even construct a self-replicating cell yet, nevermind a brain. It’s hard to imagine we understand the limits at work there.


For many applications there is no need to fully model the human brain. An approximation of a particular aspect would be sufficient in most cases. We didn’t build aeroplanes by fully modeling a bird, we just need aerodynamics.


Wow, 480 GB per GPU! What happened to the end of Moore's law?

I hope this improvement translates to consumer GPUs as 24GB is a big limitation.


Moore's law stated that transistor density per square inch roughly doubles every 2 years.

I see no end in sight for that specific law, as we can always go vertical if needed. It had also held up as of 2020 [1].

The folks who conflated Moore's law with the compute capabilities of a CPU doubling every 2 years were wrong.

[1]https://en.wikipedia.org/wiki/Moore%27s_law


Per your link, Moore's law also doesn't state anything about density. Density is just one of the ways to achieve "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year"; i.e., Moore's law only ever stated that transistor count at a given price roughly doubled every couple of years.


Moore's original 1965 article on the topic [0], and the same with additional context in an interview from 2005 [1].

> "The original Moore’s Law came out of an article I published in 1965...I had no idea this was going to be an accurate prediction, but amazingly enough instead of ten doubling, we got 9 over the 10 years, but still followed pretty well along the curve. And one of my friends, Dr. Carver Mead, a Professor at Cal Tech, dubbed this Moore’s Law. So the original one was doubling every year in complexity now in 1975, I had to go back and revisit this... and I noticed we were losing one of the key factors that let us make this remarkable rate of progress... and it was one that was contributing about half of the advances were making. So then I changed it to looking forward, we’d only be doubling every couple of years, and that was really the two predictions I made. Now the one that gets quoted is doubling every 18 months...I think it was Dave House, who used to work here at Intel, did that, he decided that the complexity was doubling every two years and the transistors were getting faster, that computer performance was going to double every 18 months... but that’s what got on Intel’s Website... and everything else. I never said 18 months that’s the way it often gets quoted."

Anyway, see slide 13 here [2] (2021). "Pop-culture" Moore's law states that the number of transistors per area will double every n months. That's still happening. Besides, neither Moore's law nor Dennard scaling is even the most critical scaling law to be concerned about...

...that's probably Koomey's law[3][5], which looks well on track to hold for the rest of our careers. But eventually as computing approaches the Landauer limit[4] it must asymptotically level off as well. Probably starting around year 2050. Then we'll need to actually start "doing more with less" and minimizing the number of computations done for specific tasks. That will begin a very very productive time for custom silicon that is very task-specialized and low-level algorithmic optimization.

[2] Shows that Moore's law (green line) is expected to start leveling off soon, but it has not yet slowed down. It also shows Koomey's law (orange line) holding indefinitely. Fun fact, if Koomey's law holds, we'll have exaflop power in <20W in about 20 years. Which should be enough for people to create ChatGPT-4 in their pocket.

0: https://www.rfcafe.com/references/electronics-mag/gordon-moo...

1: https://cdn3.weka-fachmedien.de/media_uploads/documents/1429...

2: (Slide 13) https://www.sec.gov/Archives/edgar/data/937966/0001193125212...

3: "The constant rate of doubling of the number of computations per joule of energy dissipated" https://en.wikipedia.org/wiki/Koomey%27s_law

4: "The thermodynamic limit for the minimum amount of energy theoretically necessary to perform an irreversible single-bit operation." https://en.wikipedia.org/wiki/Landauer%27s_principle

5: https://www.koomey.com/post/14466436072

6: https://www.koomey.com/post/153838038643


Thank you for the detailed writeup with sources, I enjoyed this. I'd somehow never heard of Koomey's law despite working in tech, this is very interesting and directly relevant to the widespread deployment of AI (biological neural networks still blow silicon out of the water for computations per joule).


An implicit aspect of Moore's law has been that cost per transistor goes down as density increases. This doesn't seem to be the case anymore. The technology required to get higher transistor density is getting ridiculously expensive. We're not seeing the power benefit of scaling down transistors either, since leakage is starting to get too high. I guess there's one more trick in the pipeline with Gate-All-Around, but I don't see a path to better gate control after that. And if we don't get power consumption per transistor down, then stacking transistors in layers to increase density isn't going to be very viable for compute chips, since you need to get the heat out of the chip. IIRC, Intel is working on putting the power metal layers on the back side of the chip, which grows the chip vertically in the other direction, so to speak. It helps wick away heat as well, so it could open a path for a few layers of compute transistors. But all this adds a huge amount of complexity to manufacturing, so at some point it might not be worth the cost anymore.


I thought the power benefits of shrinking still hold up rather well, in contrast to cost. E.g. new Nvidia gaming cards have smaller GPUs for the same price as the respective old generation, meaning the cost per chip area doesn't stay constant for improved manufacturing nodes. So the price per transistor shrinks slower than the number of transistors per chip area grows. At some point in the future the price per transistor would go up rather than decrease. Then the value of shrinking structures could stem, at best, from lower power draw per transistor. For mobile devices. But even power draw per transistor may stop decreasing at some point. Then further shrinking the process nodes would be useless.


That's a common misconception, and I'm not surprised it made it into Wikipedia. Moore's 1965 paper was the first time anybody had pointed out the exponential nature of progress in miniaturizing transistors and packing more of them on the same integrated circuit. But it wasn't until 1975, at the same conference where Dennard presented his scaling laws, that the phrase "Moore's Law" was coined in an interview where someone was trying to explain Dennard scaling to a reporter. The original coining was ambiguous as to whether it meant more transistors, faster transistors, or smaller transistors, and that ambiguity remained in its usage because from 1975 until about 2005 they all went together, just as Dennard said they would.

And my lecture notes from a class I took on semiconductor physics in college had a photocopy of a memo from Moore himself endorsing this broader conception of Moore's Law.


To be fair, if you stack them, density is not going up - it only goes up if you ignore the number of stacked layers and count the total stacked transistors against the area of a single layer. Plus, stacking is great, but given the heat issues, isn't the industry moving toward many dielets with a massive interconnect?


Heat issues are very valid. But the "per square inch" density is relative to a square inch of fab wafer. So if it can be done on one wafer, it counts. If it's stacking discrete chiplets, not so much.


Stacking likely wouldn't save substantial cost compared to producing multiple separate wafers. It could even increase cost if it decreases yield. That's very different from making the transistors smaller, where the cost per transistor decreased exponentially in the past. People focus too much on Moore's law (transistors per area), when the only interesting quantities are 1) price per performance and 2) power draw per performance.


If measured per square inch, the 3rd and, more importantly, the 4th dimension are not accounted for and are basically free.

Another way to say it, channeling the famous Doc Brown: Moore wasn't thinking fourth-dimensionally.

For shame, really.


> I see no end in sight for that specific law. As we can always go vertically if needed.

Not a hardware person but heat dissipation becomes more of a problem when you go vertical IIRC.


Smaller transistors mean lower energy consumption; going vertical won't give you that.

That ceiling will be hit much earlier than what the process/technique would otherwise allow.


Smaller energy per transistor, yes, but if you're packing more into the package, the package's total consumption will go up. Also, I think leakage current (and the heat that comes with it) goes up as the feature size shrinks.


Not to be pedantic but wouldn't stacking transistors have no effect on density per square inch? Since it would only increase density per cubic inch.


We include multi-story buildings when we calculate population density, why not include multi-story chips?


Stacking transistors increases density per square inch if it can be done on a single wafer of silicon, because it's "per square inch of fab wafer silicon".


I view it as if you cut out 1 square inch of a motherboard: every 2 years you'd expect to see roughly double the number of transistors in that cut-out piece.

Scaling vertically would "technically" still meet that.


I don't think that's what they mean by per square inch. They mean in a plane, not a volume. If you add a third dimension the law stays the same, because a volume is just stacked planes and the density law applies to each plane independently. That's why node sizes are a single value, not a two-dimensional value. A 3nm node is 3nm feature sizes, regardless of dimensionality.


The quote given on Wikipedia is:

> The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.

If he was talking about the area of a single transistor, there would be more concise ways to put it.


Even a 1000ft thick motherboard?


That'll show Moore


More’s law: if there is more of it, it does more!


> As we can always go vertically if needed.

I don't think you understand how difficult non-planar transistors are to engineer at scale


How difficult is it? Difficult or really really difficult?


The end of Dennard scaling was the performance breakdown. It meant chip frequencies couldn't be cranked higher and higher, as heat dissipation became more and more of an issue: https://en.wikipedia.org/wiki/Dennard_scaling


Except for GPUs, which are for highly parallel tasks anyway.


A bit self-serving but GPU scaling supposedly follows Huang’s Law (from Jensen Huang of Nvidia) which claims GPUs more than double (~1.7x) every 2 years

https://en.m.wikipedia.org/wiki/Huang%27s_law#


> more than double (~1.7x) every 2 years

To clarify in case anyone else finds this confusing. The linked article suggests a 1.7x annual increase, which compounds to 2.89x every two years.


I'm pretty sure this law no longer holds, as new Nvidia GPUs show only meager performance improvements over their (in-class) predecessors. Though this could also be due to the price per chip area increasing faster than performance.


The 4090 is twice as fast as the 3090 in almost all metrics.


It came out two years later, so according to Huang's law we would still expect more than that. Moreover, most other models have seen much smaller improvements, like the 3060/4060.


There's a 24 GB limitation for consumer GPUs, because AMD and Intel aren't competitive.


It's 80 GB per GPU, and the consumer GPUs purposefully get less memory to induce demand for the server-grade parts.


On top of contracts strictly penalizing utilization of consumer GPUs in data centers, at that! Even with the memory, bandwidth etc. handicaps, servers with 4090/3090s would have been competitive for many ML tasks.


big limitations for what? AI models? It's certainly not for gaming and we're not quite to the point of consumers running huge AI models on their desktops. The HN crowd is, as always, not representative of the broader consumer market.


Aren’t we?

We’re seeing ChatGPT plug-ins for games, to provide intelligent conversation — and we’ve seen DNNs in StarCraft and similar.

To me, the “next gen” of gaming is intelligent NPCs, combining those features to create realistic behavior. That will require that consumer GPUs get closer to supercomputer GPUs:

More tensor cores and higher memory.


Other than the holographic projection, it feels like we're in reach of the holodeck - you ask for a scene with a character or general backstory, and you go in. Fun times. Now on that energy to matter and holographic projection part...


It's not a holodeck but Google has an interactive display now that feels like an open window. It doesn't even register in my mind as a display, it feels like looking through a literal portal to another location in physical space.


I tried to Google for more information, but I didn't find anything. Can you share a link?



That is amazing. Do you work for Google or are you part of the early access program? How does it work? I cannot believe we haven't seen this on HN before!



DLSS needs less memory than rendering in native resolution, not more.


Are there denser memory chips, or is it more of the same memory chips? Putting more memory chips on something doesn't have anything to do with increased density, which is what Moore's law is about.


A large system, so much higher chance of something breaking.

What happens if it loses a node or a link? Or some memory becomes unreliable? This thing needs some sophisticated fault tolerance.


Nvidia's enterprise GPUs are surprisingly unreliable. Working on a 128-GPU A100 cluster on AWS, one would fail every few days. I didn't have any insight into whether it was a hardware or software failure.


> Working on a 128 GPU A100 cluster on AWS, 1 would fail every few days

Define "fail".

> I didn't have any insight on whether it was a hardware or software failure.

Have scripts check nvidia-smi for ECC errors and dmesg for devices dropping off the PCI bus.

For the former, replace the card. For the latter, just perform a device reset (a power toggle of the device and a rescan of the bus is often enough to be back online within 5 seconds).
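Something like this minimal sketch, assuming nvidia-smi is on the PATH and the kernel log is readable (the match strings and follow-up actions are placeholders to adapt):

    import subprocess

    def ecc_report():
        # Dump per-GPU ECC counters; persistent aggregate errors suggest swapping the card.
        return subprocess.run(["nvidia-smi", "-q", "-d", "ECC"],
                              capture_output=True, text=True).stdout

    def pci_dropouts():
        # Heuristic: look for GPUs that fell off the PCI bus (reading dmesg may need root).
        log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        return [line for line in log.splitlines()
                if "NVRM" in line and "fallen off the bus" in line]

    if __name__ == "__main__":
        print(ecc_report())
        for line in pci_dropouts():
            print("possible device dropout:", line)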


What does an AWS user do with this advice?


They either figure out how to write scripts or ask AWS support how to get that done.


I know my comment is not directly related to the post, but it's not completely unrelated. Nvidia's CEO gave a commencement speech recently. In that, he said:

I contacted the CEO of Sega and suggested that they should find another partner. But I also needed Sega to pay us in full or Nvidia would be out of business.

Just as with failures, there are lessons to learn behind every success. Does anyone here have any insight into how Nvidia came out of that embarrassing, incompetent phase and became a path-breaking, trend-setting powerhouse?


I don't have a direct answer, but the "Acquired" podcast did a two-part series on Nvidia in 2022 Q1; it's a good, deep answer to why they are the way they are.

part 1: https://www.acquired.fm/episodes/nvidia-the-gpu-company-1993...

part 2: https://www.acquired.fm/episodes/nvidia-the-machine-learning...

In short, Jensen has an almost Elon-like appetite for "bet the company" tier risk. He's never been comfortable with a plateau, and is always looking for the next mountain to jump to, before the plate tectonics of the industry come around to form it.

There aren't a lot of CEOs and companies that oversee a company- or industry-transforming shift more than once; he's definitely in that camp.

But that's a gross oversimplification, the story is more interesting. Check it out!


Thank you for your response and those links.

It seems Jensen is either reserved or media-shy. He does not make as many public appearances as his contemporaries like Musk/Jobs/Gates/Bezos.

Also, there has not been any book on him. But there are a couple on the way - https://www.amazon.com/s?k=Jensen+Huang&ref=nb_sb_noss . I hope he'll choose to publish a 2,000-3,000-page biography. I love reading the life stories of successful leaders.


The trouble with all this is that people are forced to buy these insanely expensive systems merely for the benefit of fitting all that stuff in the VRAM, even if they don't end up using the compute cores on these machines (which, let's be honest, aren't really all that much better than the gaming GPUs).

The needs are more in line with commodity server hardware with user choice of CPU/RAM, etc. Sounds to me like there's a market for disruption. Pity that the deep-learning community is under the chokehold of Nvidia's software.


Eh, time will tell. Nvidia is killing it because they have "semi-decent" software toolchains like CUDA, while just about every other hardware player botched their software. That said, there's a lot of interesting development on the XLA and PyTorch 2.0 sides that lowers straight down to LLVM, bypassing Nvidia's CUDA moat.


The same is also true for https://github.com/ROCmSoftwarePlatform/rocBLAS and https://github.com/ROCmSoftwarePlatform/hipBLASLt, although the build stack and distribution leave a lot to be desired, and it's otherwise quite unstable.


Why do we still call those processors GPU when they are not meant to process graphics?


A GPU specializes in vector math, which is what gaming graphics uses; hence the name Graphics Processing Unit and its original use case. It just so happens that LLMs are powered by vector math much like graphics applications are.


So it should be called VMPU then, not GPU.


CPUs are also arguably not the centre of processing in such systems.


Good point. "AI accelerators" is a thing now that competes directly with "GPUs used for AI as its sole purpose".


The technical term is GPGPU, general-purpose computing on GPUs, but I like to call them GPUs for short.


So, a GPPU.


That’s very interesting - why such an increase in memory capacity? I hope they can translate this to cheaper cards too


It's not an increase in capacity of per-GPU "GPU memory" (the HBM directly connected to the H100 here is up to 96GB, where the previous generation was 80GB), but rather reflects the product of two things:

1. Each node here is a more tightly-coupled CPU+GPU two-chip pairing, and the CPU side has a significantly larger pool of 480GB of LPDDR ("regular" RAM). So each GPU is part of a node that includes up to 480+96GB of total memory.

2. There are way more nodes: 256, up from 8.
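Putting rough numbers on that (a sketch; the per-node figures are the ones above):

    hbm_per_gpu = 96       # GB of HBM3 on the Hopper side (up to)
    lpddr_per_cpu = 480    # GB of LPDDR5X on the Grace side
    nodes = 256

    print((hbm_per_gpu + lpddr_per_cpu) * nodes / 1000)   # ~147 TB addressable system-wide
    print(lpddr_per_cpu * nodes / 1000)                   # ~123 TB of that is CPU-side LPDDR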


>480+96GB of total memory

Is this memory unified like Apple Silicon? Meaning, can a model be deployed onto 576GB of total memory? Can the GPU read memory directly from the 480GB pool? Same question for the CPU being able to directly access the 96GB.


It should be mapped as one address space, so yes to the question about loading across multiple GPUs. It's not fully unified though; at this scale of computer it's simply impossible to put 100s of GB on an SoC like that. Instead, the GPU and CPU have DMA over PCIe and NVLink, which is plenty fast for AI and scientific compute purposes. "Unified memory" doesn't make much sense for supercomputers this large.


`Nvidia discovers DMA`


This device has a fully switched fabric allowing comms between any of the 256 "superchip" nodes at 900GB/s. That is dramatically faster than a direct host-to-GPU 32-lane PCIe connection (which is crazy), and obviously dwarfs any existing machine-to-machine connectivity. The actual usability of shared memory across the array is improved significantly.

I mean... Nvidia has obviously been using DMA for decades. This isn't just DMA.
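Headline-number comparison, for scale (a sketch; note the NVLink figure is usually quoted as aggregate bidirectional bandwidth while PCIe is usually quoted per direction, so it isn't perfectly apples-to-apples):

    pcie_gen4_x16 = 32     # GB/s per direction
    pcie_gen5_x16 = 64     # GB/s per direction
    nvlink_gh200 = 900     # GB/s, the per-GPU figure quoted for the NVLink fabric
    print(nvlink_gh200 / pcie_gen5_x16)   # ~14x the Gen5 x16 headline number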


Parent discovers the difference between DMA and RDMA


No, I mean the fact that Nvidia is now claiming that the memory the CPU has access to can be counted as memory for the GPU. The fabric is neat; the "we have 500 GB of RAM per GPU" claim is questionable.


NVLink provides cache-coherent load/store access, so the point is actually that it's not DMA.


They do make PCI hardware, don't they?


It is surely driven by the gargantuan increase in demand for training and running massive models.


How does this compare to Cerebras? https://www.cerebras.net/


I think this has a lot more memory than Cerebras. Their site doesn't say how much memory they can attach to each Cerebras chip and I've gotta imagine that's because it doesn't look good vs their competitors.


I watched Huang's Computex presentation and was pretty impressed by the hardware they're launching, but it's worth noting some caveats. While it was announced to be in "full production," according to the blog post, DGX GH200 won't be available until the end of the year, which puts it about on the same timeline as AMD's delivery of MI300.

Also, while the 1 exaFLOPS topline number is impressive, it has some asterisks. Each GH200's H100 GPU only does 34 TFLOPS of FP64 according to the data sheet. [1] At 256 nodes, this is a mere 8.7 petaFLOPS, or 0.0087 exaFLOPS. You only get to the 1 exaFLOPS number (from looking at the data sheet) if you are doing sparse FP8 Tensor Core FLOPS (3968 TFLOPS/GH, non-sparse is halved). It's worth keeping this in mind when comparing to something like Frontier (1.1 exaFLOPS) [2] or the upcoming El Capitan (expected 2 exaFLOPS) [3] - Top500 uses LINPACK, which benchmarks FP64 FLOPS. Of course, for AI training, FP8 or BF16 are probably the most relevant numbers for perf/W and perf/$... Frontier and El Capitan are each estimated to cost ~$600M, and while exact numbers weren't given, I'd expect a full 256-node DGX GH200 to come in between $50-100M.

AMD will be holding a "Data Center and AI" event on June 13th, so we'll soon get to see how competitive they are (the announced MI300 specs are 24 Zen 4 cores with a CDNA3 architecture that is 8X faster than the MI250X (383 FP16/BF16 TFLOPS), so about 3000 TFLOPS, with 128GB of unified HBM3 on an 8192-bit bus (6.5TB/s theoretical), which is in the same ballpark as Grace Hopper). I think it'll mostly come down to software, but with drop-in PyTorch support, OpenAI's Triton, etc., I'm somewhat optimistic that it will be worth it for big players to do some software lift (that others can benefit from), if the cost competitiveness of the hardware is there. For details already made public, see: https://www.tomshardware.com/news/new-amd-instinct-mi300-det...

[1] https://resources.nvidia.com/en-us-dgx-gh200/nvidia-grace-ho...

[2] https://en.wikipedia.org/wiki/Frontier_(supercomputer)

[3] https://en.wikipedia.org/wiki/El_Capitan_(supercomputer)
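For reference, the arithmetic behind those totals (a sketch; per-GPU figures as read from the data sheet above):

    gpus = 256
    fp64_tflops = 34           # vector FP64 per Hopper GPU
    fp8_sparse_tflops = 3968   # sparse FP8 Tensor Core (halve for dense)

    print(gpus * fp64_tflops / 1e3)        # ~8.7 PFLOPS of FP64
    print(gpus * fp8_sparse_tflops / 1e6)  # ~1.0 EFLOPS of sparse FP8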


I can finally play Crysis on highest settings.


Shame they don't seem to be keeping pace with AMD, even after the Mellanox acquisition.


imagine a beowulf cluster of them


How does it compare with Tesla Dojo?


okay, but can it play crysis at 60fps


But can it play Doom?


Hell, with a developer emulation force of 13 MegaCarmacks it can rewrite 50,000 Doom per second!

(But that number drops to only 10/s if the rewrites are in Rust)


> (But that number drops to only 10/s if the rewrites are in Rust)

Because of the slow toolchain, or because of the trademark lawsuits?


Yes, you will be able to train an AI model that is capable of beating Doom using this machine.


I think you're more likely to train a model capable of writing Doom using this machine.


If it can infer 35 fps, then it can be Doom.


You mean millions of Doom?


At this point, I'm convinced that someone is here which could make that cluster play trillions of Doom. I mean, just how many pregnancy tests worth of compute is this?


Megadoom


With each doom instance being played by an AI?


Chrome will still find a way to eat all that up and lag.


Epic meme sir, here's your updoot


Best comment of the day!


That RAM is distributed among the GPUs right?

256 x 450 W = 115 kW ≈ 82 MWh/month ≈ $82,000/month at peak EU prices this winter, roughly $1/kWh (which will be normal next winter).

For what, something that gets everything wrong?


It's an agnostic system. They could use this thing for curing cancer or predicting earthquakes, but... just you watch and see what the Free Market uses it for.


Seeing how badly both of those are going, and will go in the future, I guess it's "progress".

But a correction to my comment above: these become one global memory... how slow it is and how much corruption they will have is unknown, but holy hell...


Celebrating new ways to further burn up the planet rather than discovering more efficient and better ways for training, inference and fine-tuning AI systems without needing to scale up more GPUs, TPUs, data centers and water for the same purpose.

The end result of this announcement is another expensive system only available to the same incumbent tech giants with tens of billions at their disposal.


> The end result of this announcement is another expensive system only available to the same incumbent of tech giants with tens of billions at their disposal.

Certainly wasn't the case when a public research university partnership seeded by a generous donation from Nvidia co-founder/UF alumnus Chris Malachowsky was formally announced[1] shortly after DGX A100 launch[2] several years ago, never mind the handful of other academic early adopters mentioned in the press release.

Of course, we tend to conveniently forget such exogenous details.

[1] https://news.ufl.edu/2020/07/nvidia-partnership/

[2] https://nvidianews.nvidia.com/news/nvidias-new-ampere-data-c...


We found a way: nuclear fission reactors. So that problem is solved.


> We found a way: nuclear fission reactors.

Nope. I'm talking about efficient methods for training, inference, and fine-tuning these AI models that don't require lots of data centers, TPUs, GPUs, etc. You're talking about something else.

Petrol and diesel cars are already burning up the planet, but the main difference is that there are efficient alternatives available today, like electric cars, to use instead.

AI (deep learning), however, does not have any viable, efficient methods for training and fine-tuning these models at all [0][1], and it wastes a tremendous amount of resources, all to keep up with scalability.

So that problem is still NOT solved after a decade of using GPUs; the wastage is getting worse.

[0] https://gizmodo.com/chatgpt-ai-water-185000-gallons-training...

[1] https://www.independent.co.uk/tech/chatgpt-data-centre-water...


> efficient methods in training, inferencing and fine-tuning these AI models that doesn't require lots of data centers, TPUs, GPUs, etc.

Exactly the types of problems future AI models could solve.

Dire climate alarms are based on predictions made using models. As modelling advances as a field, both predictions and solutions become more and more voluminous and accurate, while also revealing mistakes and failures of prior models.

Anyone concerned with climate should rally behind this kind of general progress. Further, it simply is progressing, and fields that don't embrace it will be left behind. We're in the midst of an unprecedented revolution which touches everything.


> efficient methods in training, inferencing and fine-tuning these AI models

Which can also be achieved by training more with the same amount of spent energy.

Why learn about training ("make training more efficient") on old hardware, which is more energy inefficient?


It goes deeper than that, into the algorithms: it should not take tens of billions of dollars and multiple data centers to train, fine-tune, and do inference with these AI models. A decade later, there are no viable alternatives other than the costly replacement of hardware with even more expensive hardware.

Add scalability on top of that and you realize that training AI models scales terribly with more data, as it is very energy- and time-inefficient. Even if you replaced all the hardware in the data centers, it still wouldn't reduce emissions, and replacing it costs billions either way. That is my entire point.

So that does nothing to solve the issue. Only ignores and prolongs it.


> A decade later, there are no viable alternatives to solve that instead of the costly replacement of hardware with more expensive hardware.

I mean, that's the root of scaling as a principle, right?

You could viably start training an AI on your cell phone. It would be completely useless, lack meaningful parameter saturation and take months to reach an inferencing checkpoint, but you could do it. Nvidia is offering a similar system to people, but at a scale that doesn't suck like a cellphone does. Businesses can then choose how much power they need on-site, or rent it from a cloud provider.

If a product like this convinces some customers to ditch older and less efficient training silicon, I don't see how it's any more antagonistic than other CPU designers with perennial product updates.


Yet the climate is still changing. Inflation is rising. The world has definitely advanced, but it never became a better place.


The world has never been better. This is an incredible time to be alive.


I don't feel that way personally... I have Nth level anxiety, maybe even N+1 level anxiety about losing my job to AI. Maybe in the future whoever comes next can enjoy things, but this literally keeps me up at night. With talks of extinction, job loss, etc. I feel like I wish I wasn't alive at this time.



