Nvidia DGX GH200: 100 Terabyte GPU Memory System (nvidia.com)
542 points by MacsHeadroom on May 31, 2023 | 373 comments



I really have to wonder if anyone can compete with this kind of systems integration capability. A core having 900GBps connectivity to the cluster memory at such relatively low power is epic beyond words. 800Gbps ethernet across PCIe is uncompetitive in the extreme.

How the rest of the industry can respond is such a mystery. And will it be lone competitors, or will a new PC era be able to start, with an ecosystem of capabilities?


This seems to be competing directly with Google's TPU pods. Looks like TPU v4 has a 300 GB/s interconnect, and 32 GB HBM per chip * 4096 chips = 131 TB (which is all HBM, so higher bandwidth than the LPDDR in Nvidia's system). So yeah, Nvidia's interconnect seems better. However, TPU v4 was deployed in 2020 (!) and Nvidia's thing won't be ready until next year. I've gotta imagine that TPU v5 has already been deployed internally for a while now, but hasn't been disclosed yet. Who knows, TPU v6 might even be deployed before this Nvidia thing.


Just want to flag a potential unit issue: 900gbps vs 300GB/s?

Also worth noting - TPUv4 uses a 6-way 3D torus interconnect vs the 3-way "multi ToR" NVLINK topology; the total bisection bandwidth of the TPUv4 pod is over 1PB/s!

Can't wait to see what TPUv5 looks like. As you say, it's probably already chugging away with v6 on track to tape out in a year.

That said, I think NVidia has nailed bringing the ecosystem along, and I think making the whole setup look more like "one huge GPU" could simplify a lot of ML programming.

I am actually disappointed I haven't seen more of that style in CPU programming. Where's my 20,000 core 100TB RAM VM instance?


>900gbps vs 300GB/s?

The Nvidia device uses a 900 GBps switched fabric between any of the 256 nodes in the system. The TPUv4 3D torus network is basically a set of 56 GBps connections forming separate rings. From a raw perspective, the Nvidia solution is the overwhelming winner. There is absolutely no contest.


> Where's my 20,000 core 100TB RAM VM instance?

You could simulate this with a bunch of regular machines and a networked hypervisor.

You could do some kind of smart caching so that processes rarely need to wait to access RAM stored on a remote machine.

Combine that with a big lock eliding/speculation scheme (i.e. when a process reads memory that might have been written by a remote CPU, you continue as if it hadn't been, and if you later find out that data was written then you roll back). These rollbacks 'undo' all work done in however many microseconds it takes for data to travel from one side of the machine cluster to the other.

Reads of RAM that aren't cached yet on the local node can also be speculated - you just assume that RAM contained null bytes and continue execution, rolling back and replaying when the actual data arrives.

So if you can make sure that processes are contending for locks and writing conflicting data less often than once per system-roundtrip-latency, then you should get a high performance system.
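
A toy sketch of that speculate-and-roll-back idea, where every name and mechanism (the undo-buffer checkpoint, the assume-zero read) is purely illustrative:

    import copy

    class SpeculativeNode:
        """Toy model of one node speculating on remote RAM."""

        def __init__(self, local_ram):
            self.ram = dict(local_ram)   # locally cached "pages"
            self.assumptions = {}        # addr -> value we speculated on
            self.checkpoint = {}         # state to roll back to

        def begin(self):
            # Cheap in hardware (undo buffer / write queue), clunky in software.
            self.checkpoint = copy.deepcopy(self.ram)
            self.assumptions.clear()

        def read(self, addr):
            # Not cached yet? Assume zeroes and keep executing.
            value = self.ram.get(addr, 0)
            self.assumptions[addr] = value
            return value

        def on_remote_data(self, addr, actual):
            # The real data (or an invalidation) arrives one round trip later.
            if addr in self.assumptions and self.assumptions[addr] != actual:
                self.ram = dict(self.checkpoint)   # roll back speculative work
                self.ram[addr] = actual
                return "rolled back"
            self.ram[addr] = actual
            return "speculation held"

    node = SpeculativeNode({})
    node.begin()
    node.read(0x20)                          # speculate: page is all zeroes
    print(node.on_remote_data(0x20, 0))      # speculation held
    print(node.on_remote_data(0x20, 42))     # rolled back (a remote CPU wrote it)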


This is certainly a very interesting thought to entertain and your ideas make sense. One thing that makes things harder on the CPU side in this hypothetical scenario is that CPUs tend to execute much more diverse instructions/computations than GPUs. So all the caching & speculation you mention is probably all the more important.


After writing the comment, I considered writing a little toy example just to try out the idea... It would be neat to see Linux boot with 1000 CPU's...

But upon further thought, a lot of things such a system would need are actually rather inefficient to implement in software (ie. rollbackable RAM), yet quite cheap in hardware (for example rollbackable RAM can be implemented with regular RAM plus either a buffer of 'overwritten data' or a write queue)


A write queue with a dupe back-end to, say, a blob on S3 or whatever would be interesting, since mirrors of outcomes could be stored.

The biggest issues, it seems, are bandwidth and humans' patience for a response...


> Where's my 20,000 core 100TB RAM VM instance?

Machines with terabytes of RAM do exist and get used - working well on such setups is a goal of modern JVM GCs for instance - but making a single machine that large which acts like a single machine isn't easy, nor especially desirable. One machine is a unified failure domain outside of mainframe-land, so if you had a 20k core machine with 100TB of RAM you could never reboot it to apply OS updates and it'd die all the time from failed parts.

Even once you get beyond that most software stacks use locking and stop scaling beyond a few hundred cores at best and that's assuming very heavily optimized stacks. AI workloads are easier because they're designed from scratch to be inherently parallel without lots of little locks and custom data structures all over the place like a regular computer has.

Disk storage is one of the places where you can parallelize and scale out relatively easily and you do see datacenter sized disks there.


Nvidia's is 900GB/s, not gbps


You're totally right. My bad. I went back to double check the docs and indeed the chips are 900GB/s total over 18 NVLink connections.

https://www.nvidia.com/en-us/data-center/nvlink/


If I'm reading the docs right (TBH - I'm probably not) it looks like on a z16 you can get 200 cores and 40TB of Memory on a single "VM" (LPAR).

So 1/100th of the CPU and 40% of the RAM. (I suspect the RAM comparison is reasonable - I'm not sure about how to compare the CPU's).


If TPUv4 pods were so powerful, why are most new models trained with NVIDIA cards rather than TPU?


These are just my guesses but:

Software for TPU is still in its early stages. CUDA is well established. You can test on a gaming GPU that you can find (locally!) in many markets. XLA is meant to solve this, but first impressions matter and my first impression was that it has not yet "solved" this issue.

TPU is only available via Google Cloud - as far as I know they don't have NVIDIA's widespread distribution to various HPC/supercomputer systems. This also has implications on scaling up more than a few pods, as they will need to be colocated with speedy interconnect (which is provided by the various existing HPC systems that use NVIDIA's chips).

Finally, I think many people are discovering that the supposed benefits of TPU are marginal at best in the face of the types of natural scaling issues that both GPU's and TPU's suffer from when scaling out to e.g. hundreds of pods.

I'm certain that someone with more experience than I could give a better answer though - and again, all speculation. I refuse to use TPU because Google Cloud's system for getting access to said TPU's was horrible for me when I tried it. I believe John Carmack has a nice tweet thread specifying the same issues I ran into.

In general, Google has a habit of developing tech for other Googlers first, and as such winds up ignoring a lot of real-world scenarios faced by researchers/practitioners. NVIDIA on the other hand has been working directly with a ton of institutions and businesses ever since the inception of CUDA.

That their TPU's have seen any adoption at all is mostly due to their research program which granted very cheap access to TPU's to tons of people.


TPUs are mostly hoarded by Google Research (including Deepmind) and Ads. Very few are being used by external people.


The greatest artificial minds of our generation are thinking about how to make us click on ads?


It was ads that made the money to develop the artificial minds in the first place.


This is technically correct to the extent of paperclip maximization and I don't like it.


This is starting to sound very paperclippy. Ads fund the AIs to make us click on ads to fund AIs that are even better at getting us to click on even more ads.


It's ok, as soon as the AI figures out a better way to gather resources, it'll pivot.

(this is not meant to be reassuring).


That is only true for Google. If anything bootstrapped AI, it was gaming.


what's the gaming story? most of the ai we know today builds on academic work going back to the 90s


Probably refers to the development of and increase in computing power of gpus, I guess.


It’s matrix multiplication all the way down.


Yeah, and look at how some very simple clustering ML/recommender systems impact social/political dynamics all to keep people engaged on the site and maximize chances to click ads ( see youtube/facebook, etc. ).


Last time I checked, OpenAI wasn't earning money from ads.


Last time I checked, OpenAI didn't develop transformers.


Ex Machina vibes


Damn right, but I don't understand why. That is, why does the ads business generate so much profit that it allows building such ridiculously powerful devices? Is it because it's genuinely full of money, or is it because Google is so central that it makes tons of money out of lots and lots and lots of small adverts?


It's a monopoly on eyeballs. People don't casually walk in front of domain names; they must find them on Google.

As a result, spending ad money on Google is ridiculously expensive, but companies accept this because there is no alternative, hoping to "build long lasting relations" with the people who make them pay upwards of 1 dollar per click.


At the same time it is also a huge bubble that Google is just hoping will never burst. People and businesses way overestimate the impact their ads are having and way underestimate the impact that treating customers well can have.


I definitely think this is the strategy of Google's leaders; they've heard too much of "how do you monetize your products?" from investors and now they are maximizing profits for the current software generation. I wonder though if that bulk of money will be that much of an advantage when the tides turn. It could attract the wrong kind of leadership, among other things like customer distrust, and turn the company into an IBM of some sort. Namely, I would rather maximize YouTube Premium memberships (which is at "only" 50 million) over ads (surely they've local-maximized the balance between the two as it is) - but it's easier said than done.


I think both are important. Word of mouth is useful and important but no one would use google to search to buy stuff if that was the only way to reach customers.

Also, if you're established it's probably a good idea not to let new competitors get a foothold in the market with an easy Google win.

It's also pretty effective for local businesses, because not a lot of local businesses are tech savvy enough to use it effectively.


> people who make them pay upwards of 1 dollar per click

FWIW, the cheapest (quality) clicks I've seen, at least in the B2B space, are closer to $3/click, and it can quickly balloon to upwards of $10/click, especially on company brand names where competitors are bidding on another company's brand name.

Knowing this, I cringe every time I'm screensharing with someone and they search "[B2B Company] login" to login to a tool they use every day. Each login = $2-$10

It's not uncommon for companies to spend $100k+/year JUST bidding on their own company name.


It honestly escapes me how these companies can be sustainable. The whole market is sooo inefficient. Companies also pay crazy money to appear in privileged positions in supermarkets shelves, and they will often pay crazy money for simply being in the supermarket at all

I just don't get where all the marketing money is coming from. Bootstrapping is clearly not an option these days


Computers DOUBLED the productivity of the USA since the second world war. All that money went to a few people and groups, and none of it went to average people. For decades, companies have just been sloshing the same giant pile of cash around and around the Ads ecosystem.

That bag of chips did not cost $4 to make, not even a little close.


Because there is no incentive for customers to tell businesses what they want, businesses tell their customers what they should want.


My working theory is that advertising is the overhead cost of doing capitalism. There is a certain percentage of resources which has to be spent on advertising to keep the system functioning. Google is good at grabbing a large portion of a huge pile of money.


Not really. It's sufficient to show cool products in "TV" shows (robotic vacuum cleaner in a procedural crime drama might even be a plot device, absorbing murderer's hair to be found by detectives, gasp!).

Coupled with a magazine or a show presenting new product categories for those interested, customers will eventually visit a physical or online shop and check out the goods. And then word of mouth will do the rest.

Aggressive advertising will mostly just help you get ahead of your competitors and perhaps speed up the adoption rate at the cost of increased volatility of the market and to the detriment of people's mental health.

We would be better off regulating aggressive ads away.


> Aggressive advertising will mostly just help you get ahead of your competitors

That's a hell of a load-bearing "just" you managed to insert there. Getting ahead of your competitors in market share can be the difference between having a company succeed or fail.


So if nobody is "getting ahead of competitors", does it mean that "capitalism is not functioning"? (which was the point of the comment to which the reply was)


Product placement is still advertising, likewise advertising plays a role in getting people to go to that online or brick and mortar shop instead of some other one.


I propose a law: nobody can advertise a product without mentioning all the brands which offer same or similar product on the market (and the mention must be neutral or positive).

Or: all advertisers of all brands with a same or similar product must collaborate. Only voluntary input counts as collaboration; if a brand simply doesn't care about presentation of itself in the advertisement, they have trivially collaborated. Easiest way to implement this is giving every owner of all relevant brands a right to veto every entire final advertisement product (this right could also be surrendered, for all or some possible vetoed advertisements, in exchange for something in a contract).

Ignoring flaws of this proposition itself, what could be society's reasons for rejecting it? Does society perhaps want havers of more money to gain further advantage over havers of less money?


>nobody can advertise a product without mentioning all the brands which offer same or similar product on the market

Maybe 50 years ago that would have worked. Today, not so much. Go to Amazon and look up, well, just about anything. What is BEHENO, what is DINGEE, what is Etoolia, what is Romedia, what are the over 300 different 6/7 letter companies that show up when I search for some random product.

Unfortunately your proposal causes its own parasitic effect of countless companies forming up to feed off the big advertisers' budgets.


Since the product is standard, why is it actually bad? If there are too many brands to be included in a single advert, just choose randomly (the lower the price, the higher the probability for a single brand; I don't know the function).


Because, in the US, this will quickly fall foul of free speech laws. Over 'public' airwaves maybe you could go some distance with this, but advertising on private property, as long as it is not fraudulent, will present a constitutional challenge to what you're saying.

And you're also creating a regulatory nightmare. Say I put up an ad for XXYZXX company, and it includes ZZXYZZ and YYXZYY information (I mean totally random picks), and I just happen to have a stake in those companies too. Now you're going to have to track hundreds of thousands of these entities to ensure no fraud is occurring, and in most cases the fines for this kind of behavior are well under the cost of doing business.

Everything you've said so far just creates bigger messes and solves nothing.


It solves a hypothetical skew towards brands offered by already richer businesses.

About regulation, how hard is it to just audit the random picking procedure?

I now understand that my second variant, with vetoing of final advertisement, is very flawed (one can cheaply obstruct anyone's advertisement by making a company that vetoes any version of it). How about dividing an advertisement into pieces of information solely about each distinct brand, and let every brand owner compose the piece for its brand? Then all pieces are added into final concrete form in a collaboration - I think it would succeed in most cases, and if brand owners can't collaborate, then an independent company will work on it.

Then we need to look how exactly freedom of speech is defined. If it means ability to express views without attaching any additional information, then such freedom is incompatible with my proposal. But if freedom of speech allows attaching additional information as long as base message is preserved, I see no problems. Note that the proposal essentially just forces you to advertise other brands as they wish, along with any advertisement that you do, which (brands) it doesn't mention.


@h4kor one of my crazy ideas is to cap the money companies are allowed to spend on marketing once they reach a certain size. It would encourage a better form of decentralized capitalism and prevent monopolies.


This could easily turn out to be counterproductive. It would provide an additional incentive to hide marketing in all kinds of other business activities rather than openly advertise what's on offer.

Marketing is already difficult to tell apart from other company communications, product documentation, etc. What about a company blog showing how to use their products? Is that marketing or product documentation?


The point is that "openly" advertising would be capped. That would reduce the price of doing so, making it more affordable to smaller players and removing the insane profits ad monopolists enjoy today. Plus, "openly" advertising is one of the most effective ways of advertising. Lastly, by diverting marketing budgets to non-traditional routes (charity donations, etc), the economy would benefit as money would be spread more evenly across


I wonder if there’s some sort of automatic stabilizer that could be applied instead.

Tax ad companies, and spend that money on education. The better ad companies are doing, the more we spend on education, the fewer gullible marks we produce, the worse ad companies will do.


sure, but what do you consider to be an ad company? is a newspaper that places sponsored articles an ad company? accounting for "marketing" expenses might be easier to track and at the end of the day, companies use accountants that are liable and so need to report accurately


Exactly, it’s the mechanism for exchanging information in a capitalist economy.

Conversely, in Communist systems they could never get this right. Factories were just told to produce 5 or 10% more than last year, didn’t matter if the product quality was worse or if people didn’t want it.


There was some competition amongst consumer goods producers, and TV and other ads, in the USSR. High scarcity of good quality stuff meant they didn't need to advertise, but there was also an oversupply of junk nobody needed. Those companies had to move their inventories somehow, since it was much harder for them to go bankrupt.


unfortunately pure capitalism has no mechanisms for externalities and information hiding.


A little freaky when you think about what that really means. Some of the most advanced AI systems in the world are solely focused on being good at manipulating human behavior. Cool... cool cool cool............


Tangentially, I think this explains the conspiracy theory that ad companies are spying on everyone's phones and serving ads based on what we talk about in real life.

Think about all the stuff ChatGPT and GPT-4 can do with even minimal prompting. Even when they hallucinate, the text is still ostensibly coherent and natural sounding. Now imagine a similarly powerful model, but its input is a ton of metadata about your behavior and its output is ads.

Now consider that adtech has had substantially more funding for substantially longer than research into LLMs, so ad serving models are probably way more powerful and optimized than even GPT-4.

It's freaky to think about indeed.


Another thing is: people's individual behavior is not as unique as we'd like to think. As a whole everyone is unique, but in single surprisingly complex aspects of our life we are hardly ever alone.


so Hari Seldon was right in his psychohistoric theory ?


It's been that way for over a decade now. Welcome


It’s not very good at it if it is.


No some are into the space industry.

So we can have internet anywhere. To click on ads.


It's ads that make the market efficient. Potential customers should know the corresponding producers so that the information assumption of an ideal market holds.


Ads can have both persuasive (propaganda) and informative functions.

Informative ads make the market more efficient. Persuasive ads actively make the market less efficient.

Most ads in the US in 2023 seem to be persuasive.

Perhaps the ad industry would become more useful (and smaller) if we managed to effectively regulate it to significantly reduce the persuasive bits.

I think that most people would support this if you explained it right - from the free-market perspective, this would give you a better market.


How else would I know that “Elon Musk created a TeslaX platform that allows everyone to get rich”? Or was it Pavel Durov… Seriously, I can’t even report these on YouTube.


There is a mythology to Google's TPU that is not validated by real world numbers. Where we can actually test (I mean - TPUv4 pods are available right now on their cloud) it is very good, but remains uncompetitive with the H100. I mean, Google itself says you shouldn't compare them, doing the classic "the H100 is on a better process node so it's unfair". People will always point at a mythical next generation that is surely way better, despite the fact that Google is currently building big supercomputers with their TPUv4. And in Google's shootout, again comparing with the last generation of Nvidia hardware (the A100), Google's biggest advantage was in the connection fabric, which with this DGX GH200 Nvidia not only overcame, but bested by an order of magnitude.

More competition would be fantastic. Better pricing at scale would be fantastic. But there is absolutely no doubt that Nvidia is far ahead of Google right now. Tesla has made some believable claims about its own efforts with its own hardware, so who knows, maybe they're the real challenger.


To add to the other answers, TPUv4 was not released to cloud customers until last year. And I bet availability is not as good as GPUs, even in Google Cloud (obviously TPUs are not available at all in other clouds).


Availability only on GCP and in particular cost.


Google has advertised that they have better perf/$ than GPUs, is this wrong or do you just mean absolute cost (so not available in small enough slices)?

edit: actually now i can't find the claim, maybe i misremember what the papers said.


Perf/$ where $ is what it costs _them_, not the $ they're ready to sell it to others for as a product. Cloud margins in the high two-digit percents are typical, and I'd imagine even higher for very specialized products in high demand from deep-pocketed customers.


https://arxiv.org/abs/2304.01433 does claim "1.2x-1.7x faster and uses 1.3x-1.9x less power than the NVIDIA A100".


From a personal use case, the number of instructions available on TPUs is still limited and some workarounds are needed when designing custom layers. Even if they're available in platforms like Colab or Kaggle, people still lean toward GPUs as they are more versatile.


The network topology of TPUv4 is far far inferior though. It's a torus. No switches.


What evidence is there that Google would be able to out compete nvidia on AI hardware?


None. Heck, I can’t even search my gmail effectively any more, so if they can’t maintain a core product, I doubt they can build a new one of any quality. alphabet are now just a big, bloated catch-up corporation running on inertia and past glory.

I don’t think they will exist in 10 years.


>None. Heck, I can’t even search my gmail effectively any more

Their search products have actually gotten worse with AI. Google Images running just off basic image recognition (as in, is this the same image?) plus the context of where it found the image was far superior at identifying what an image is than the ML-based Google Images.

The OG version could identify a frame from a movie and provide higher res versions. The ML version goes "errr, looks like a woman on a street, here are random photos of unrelated women on unrelated streets with maybe a similar color scheme" - close to useless; why would anyone want that? Yandex image search blows it out of the water simply by being Google Image Search from a decade ago.


This is the kind of stuff that I see as being the crux of their downfall. Snippets have also gone to pot over the last year or so.

The overall theme is that product is no longer the focus, but rather navel-gazing - that’s to say, their internal world no longer aligns with the external world, and that is a fundamentally dangerous place for a business.


I thought I imagined it being worse but yeah...


Someone who believes Google won’t exist in 10 years is delusional beyond words.


Yes yes, and the East India Company will reign supreme for all time, Refco is too important to fail, Blockbuster will dominate home entertainment forever, and it’s simply inconceivable that a single trader could bring down Baring Brothers, they’ve been going for centuries!

Businesses fail. google will likely still exist, but alphabet, I don’t see a future for - just a gradual withering followed by a collapse and disintegration into myriad properties in a fire sale. They are brittle, overburdened by unity of disparity, culturally adrift, and they aren’t taking risks any more. Inertia will keep it all going for a while, but not forever.

Sure, I may be wrong, but I do put my money where my mouth is, and I am right more often than not.


Your reply is interesting because you strongly believe Alphabet will fail, but only supported that by arguing that, over the very long term, companies fail.

I see a lot of hate for Alphabet on HN. It seems very emotional. I think people feel personally betrayed by their bad behaviours because they were 'supposed to be better'.

The thing is, there are a lot of companies you can hate. Exxon, McDonald's, BlackRock, even Microsoft - there are people who are very mad at these companies.

That's not an argument that the company is doomed. If you are really putting your money where your mouth is (what, shorting Google?) then I hope you have better reasoning as to why they will fail not just eventually but this year.


I don’t hate alphabet - neither do I love them. I look at them through the lens of history. You on the other hand seem to be emotionally wounded by my assessment of them.

None of the companies you list are likely to collapse soon, as they remain focussed on their various missions, and have a unity of purpose. Out of all of them, I think Microsoft is the most likely to fail, as they are likely to be blindsided when the user-focussed desktop OS era ends. Their diversification efforts have been a mixed bag, and without windows, they are far, far less significant.

What I do look at is sentiment analysis - what other people feel and think about businesses, as that drives the market.

No, I don’t short, as just buying equities which are beginning significant growth is just as effective and doesn’t drive demise - I held goog for nearly 20 years, and sold off late ‘21, as I think they’ve peaked, and anything from here on is speculative froth.

You’ll note I keep saying “I think”, rather than making statements of fact - because this is purely what I think - I am not a sibyl.

You seem to have missed this:

>> They are brittle, overburdened by unity of disparity, culturally adrift, and they aren’t taking risks any more.


> Out of all of them, I think Microsoft is the most likely to fail, as they are likely to be blindsided when the user-focussed desktop OS

That might have been a reasonable assessment back in Ballmer’s era. But what you’re saying has already happened, years ago…

They have mostly reinvented themselves since then. Enterprise/Office isn’t going anywhere. Xbox is fine too. And there is a lot of growth in their cloud/etc. business.

IMHO, out of Google, Amazon & Facebook, Microsoft seems to be the least dysfunctional and in general the best positioned to be successful in the future.


Xbox doesn't seem fine. I think it's propped up by Game Pass having cross-platform title access with Windows while still sitting on the Xbox balance sheet, but growth and the number of exclusives don't paint a healthy picture.


Yeah by Xbox I mean the console + game pass + PC/Xbox gaming division. The console itself at this point is not much more than a cheap(ish) locked down gaming PC.


For context, I have never worked for or with google, and don't use their products much other than search. So I don't have much emotional connection to the company. My comments were more motivated by a kind of concern.

My perception is that Google split into a number of focussed business units when they became Alphabet, with the Google component being execution focussed and the more speculative stuff spun out into other group companies like deepmind, waymo, etc. That's why the Google unit stopped doing nice incubator projects that we were all excited about.

From what I've seen, this cash cow execution business unit has been fairly effective - in particular they've done a good job of entering the cloud market space, producing a differentiated product that is penetrating their target customers. They have not been able to compete with Microsoft's excellent and deeply embedded IT sales capability, so they've done well to go after people with big problems that other vendors' more civilian offerings are not so great for. They are currently the first choice platform for AI training, for instance.

I'd contrast this to Facebook, who seem to be trying to become a deep tech VR hardware vendor in the same business unit as their cash cow entertainment and advertising business, which has confused investors and probably distracted their focus.

We can see that Google has innovated. For instance, a lot of Tesla's stock price is based on the idea that they are going to run autonomous taxis, and instead of owning cars we will just hail a Tesla when we need one. Tesla does not run autonomous taxis, but you can ride a Google Waymo taxi today in Phoenix, and they are running autonomous trucks, which is a big industry Tesla isn't even attempting yet. They are doing a lot in medicine and medical devices. This seems a lot more diversified and innovative than other companies - it's just not as visible to the HN community as an RSS reader or some other internet thing we care about.

We can also say that... on the AI thing, I think it's very early days. Microsoft have a shaky looking deal with the first mover, but Alphabet and Facebook have the advantage of actually using AI extensively in their real businesses and may be able to deliver product market fit better. Time will tell.

On the stocks front, I agree with your overall thesis - I think it's harder for these conglomerates to grow than a new company just because they are already giants in their niche and even adding a new niche generates less growth in percentage terms than for a smaller company starting from a lower number. I just wouldn't actually bet against google as much as I would some of the others.


You originally said 10 years, so hopelessly delusional lol. Since you're so confident, let's bet $10,000. By your claim, by 2034 (I'll give you some extra time) Alphabet Corporation and all subsidiaries will no longer exist. If they still exist, I get your $10K. If they do not, you get my $10K.

We can both give the money to a mutually trusted third party now.


but IBM still exists, and Microsoft after missing the mobile market. Even Nokia exists.


If you believe this it implies you have gone full malthusian, because the innovation engine that made Google possible could also disappear Google, but without that engine we are all screwed.


> core product

Gmail is a freebie! The core product is how they index your messages to create an anonymous profile that they will then offer on reverse bid to advertisers when you do a search or visit an AdWords site.


Gmail dev here (but not search), I don't think anything has changed with search. Operators still work too. What's actually wrong?

Do you just have more email now?


If you don’t match the terms in your email exactly right with your query, gmail starts returning email that matches one of the terms which is rarely what you want.


What would you expect it to do? How is it supposed to know what you want if you don't provide it with the exact search terms? Shouldn't it do partial matches if there are no full matches ?


It should look for things that match the meaning of the words searched.

E.g. what’s possible with embedding where the query terms are matched with similar messages that mean the same thing.


Is HBM mostly Samsung?


Market share for last year was 50% SK Hynix, 40% Samsung and 10% Micron, but as there is currently huge demand things may change depending on capacity.


I thought SK Hynix was the big producer of HBM? But that could be out of date.


Some of the specs seem inaccurate here; HBM has been present in NVIDIA datacenter GPUs for a while now. LPDDR is for their gaming hardware.


The GH200 uses a combination of HBM3 for the GPU and LPDDR5 for the CPU but it's a unified memory system so the GPU can access all the RAM. Gaming GPUs use GDDR which is a third flavor.


As always in economics it is about volumes and margins.

If the competitors (mainly AMD, Intel and to some extent ARM) keep seeing growing volumes and insane margins, they will be attracted to invest and take part in that market.

Until now the gaming GPU market did not bring AMD the margins needed to really push them to bring better competition to Nvidia. Even 10-15 years ago, when ATI was way ahead of Nvidia technologically for 2-3 years (the HD 4000 and HD 5000 generations vs the Nvidia flops of the 9000, 200 and 400 series), Nvidia was posting billions in profits while ATI posted a whole... 19 million in profits across 3 years.

But today's GPU market, thanks to its non-gaming sales, is too big to ignore (which is why Intel entered it as well) and those players will likely react.

You don't need to have the best premier product, you need to have your products good and priced well enough that they will be chosen over the competitor's.


Cerebras supposedly can: https://www.servethehome.com/cerebras-wafer-scale-engine-2-w...

In hindsight the 40GB of SRAM feels kind of quaint, but nevertheless their very fat nodes let them get away with more than Nvidia could with A100s, as you can see in the slides.

CS2 is a little old now. I bet an update is just around the corner.


I have to wonder why these engineers are not paid millions of dollars per year. As a lowly backend dev this seems so much more impressive than my new API that retrieves something from a database...


Engineers are really poor negotiators, probably because they neglect "people skills".


Because their scales of what are important things are so often entirely unintelligible to the rest of the planet.


Hardware Engineers are really poor negotiators.


No need for the quotes, it's exactly the issue


Because 10 engineers paid hundreds of thousands a year do a better job than 1 engineer paid millions.

And that's because it is very high skill work, otherwise, it would have been 100 engineers paid tens of thousands.


perhaps they are, even if in stock share price appreciation?


Isn't a lot of this what AMD already sold as ORNL Frontier 2 years ago? The main difference seems to be that the external bandwidth here is indeed crazy via NVLink (though it is only 450GB/s per direction, so the same as 64 lanes of PCIe Gen5...) and they have two networks for communication (although I suppose HPE Slingshot is as good as the InfiniBand here)...


Sounds like you haven't seen wafer-scale integration computing. Tesla has one, and commercial companies like Cerebras will sell you a cabinet without the miles of fiber networking.

https://www.cerebras.net/andromeda/


FWIW, I believe it’s 900GB/s full duplex, so 450GB/s each way. Still a ton!


I'm curious how you can keep that fed with data fast enough. What kind of interfaces to your network do you need to keep it busy and not just waiting on data.


Ultimately how fast their transistors can switch and at what power is determined by TSMC, which everyone else can use too. Same for density of interconnect.


For comparison, a lane of PCIe 6.0 is 7.56 GB/s. If all 128 lanes of an AMD CPU could be used as interconnect, that would be 967 GB/s.


Is there really a market for a response though? Now, I'll be honest that I know very little about this market. What I do know from doing a decade of presales before covid hit, is that people who buy GPUs go for aggregate max on a big node farm. Now, most of my clients who bought GPU-heavy scale-out nodes were in the financial industry, so maybe deep learning stuff is different. Their workloads were massively parallel, and could scale out instead of needing something singularly fast.

So I guess my question is - what use case is there for a huge truck that goes 200mph and takes 4 trips, when you could just buy 16 regular trucks and move your apartment in the same amount of time at half the cost?


The exciting thing about CXL is we can start to find out if peripheral or hopefully close-networked computing fabrics can be useful & interesting, beyond the small circumstances Nvidia will offer. Having an ecosystem that everyone can participate in will let us explore. Money can't buy that. Talent can't buy that. You need to socialize to really find out the possible values.

'The street finds its own uses for things' is the well known Gibson adage, and typically it's a comment aimed low. But our entire era of amazing computing began with the Gang of Nine enabling lowness to such a degree that it quickly became the highest tech, the best. Sure, you can still buy a mainframe and they have amazing feats, but it's not where the value is, and the value is where it is because possibility was unchained, unleashed from corporate dominion, and spread wide. I think we can find amazing new futures with CXL & mad bandwidth connectivity.


The reason that analogy falls short is because it's easier to drive the huge truck at 200mph than it is to find 16 truck drivers. It's really neat when you figure out how to map/reduce your algorithm so you can parallelize it, but it would be even easier if you didn't even have to in the first place. And that's assuming that it is even parallelizable in the first place. Not all algorithms can be optimized like that and needs a bigger system to run on.


There are workloads that are data parallel, and scale like the GPU-heavy scale-out nodes that you describe.

The other approach, which you do when models themselves are massive, is model parallelism. You split it into multiple parts that run on different nodes.

In both cases, you need to distribute weight updates through the network although the traffic patterns can be different.

To maximize the performance in both scenarios, systems designers optimize for all-reduce and bisection bandwidth.

There are also other tricks, for example the TPUv4 ICI network is optically switched, and it is configured when a workload starts to maximize bandwidth for the requested topology ("twisting the torus" in the published paper).


Using something like Stable Diffusion and generating all the frames at once (for video) as a single image. For that kind of usage one needs to have RAM for the whole image. This setup could generate videos like that in the same time as I generate an image on my home computer.


There were rumors floating around that GPT-4 was going to be a 100 trillion parameter model. Those rumors seemed ridiculous in hindsight, but this announcement makes me rethink how ridiculous it really was. 100 Terabytes of GPU memory is exactly what you need to train that class of model.

However, I’m not even sure enough text data exists in the world to saturate 100T parameters. Maybe if you generated massive quantities of text with GPT-4 and used that dataset as your pre-training data. Training on the entirety of the internet then becomes just another fine tuning step. The bulk of the training could be on some 400TB dataset of generated text.


Rule of thumb is that you need ~20 tokens per parameter. The average token size is ~4 characters, probably more for larger models where you want a larger dictionary, but for simplicity I'll say it's 5 bytes to make the numbers round. So you need 100 bytes of text data per parameter, or 10 PB for a 100T model. Now, recent research says that you can reuse the same data about 4 times before it starts hindering performance, but that doesn't help much in our case.

But in this case what is really ridiculous is the compute requirement. The compute required for an optimal model grows roughly quadratically (both your model and your data grow linearly). So for a 100T model you need 1e30 FLOPs. This machine gives you 1e18 FLOPs per second. It will take 30k years to train this model on one of these (or 30k of these to train it in a year, but then utilization starts kicking in).
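
A quick sanity check of that arithmetic, using the same rough assumptions as above (~6*N*D FLOPs, ~20 tokens per parameter, ~5 bytes per token, ~1 exaFLOP/s for one of these systems); it lands in the same ballpark as the ~30k-year figure, the gap being just the rounding of 1.2e30 down to 1e30:

    params = 100e12                    # 100T-parameter model
    tokens = 20 * params               # ~20 training tokens per parameter
    bytes_per_token = 5                # ~5 bytes of text per token
    flops = 6 * params * tokens        # ~6 FLOPs per parameter per token

    dataset_pb = tokens * bytes_per_token / 1e15
    machine_flops = 1e18               # ~1 exaFLOP/s for one DGX GH200 (low precision)
    years = flops / machine_flops / (3600 * 24 * 365)

    print(f"~{dataset_pb:.0f} PB of text, ~{flops:.0e} FLOPs, ~{years:,.0f} years on one system")
    # ~10 PB of text, ~1e+30 FLOPs, ~38,052 years on one system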


"The best time to start training a 100T param model was 30k years ago, the second best time is now."


Probably the best time to start training a 100T model is never.


What if you could train an AI with a desired outcome to their answers?

I.E. ; "answer this question where the outcome is the most beneficial to quality of life"


I'll take your question further: what if we have unlimited data (say some crazy rich RL environment or a way to produce high quality and diverse synthetic data)? You still have to get these 1e30 FLOPs. Let's say you can connect 100 of these bad boys together at 40% utilization, for a total of 4e19 FLOPs/s. Assume also that Moore's law keeps working indefinitely. When should we start training the 100T model on it to get it as early as possible? We wait x years and then start training on a machine with 4e19*2^(x/2) FLOPs/s. Turns out the answer is ~16 years, after which we'll have 1e22 FLOPs/s and 1e30 FLOPs will take another 3 years.
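
The same wait-vs-train trade-off written out, assuming a clean 2x in available compute every 2 years and nothing else changing:

    total_flops = 1e30                 # for the 100T-parameter model, as above
    base_rate = 4e19                   # 100 of these at 40% utilization, FLOPs/s
    year = 3600 * 24 * 365

    def done_after(wait_years):
        # Wait, buy hardware that has doubled every 2 years, then train to completion.
        rate = base_rate * 2 ** (wait_years / 2)
        return wait_years + total_flops / rate / year

    best = min(range(60), key=done_after)
    print(best, round(done_after(best), 1))    # 16 years of waiting, done ~19.1 years out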


> life

A strange game. The only winning move is not to play. unplugs self


A properly designed AI agent would do exactly that.


That's obviously false under the assumption computing power will increase as it has in the past.


For the uninitiated, it's a tree planting quote.


I am not going to be embarrassed for the following Q ;

Please ELI5 where I can have a glossary of AI/ML terms - where do I get fluency in speaking about Tokens, Models, Training, Parameters, etc...

Please don't be snarky - this is info that everyone younger than I am needs as well.

Is there a Canon? Where is it?


At the risk of sounding snarky, https://chat.openai.com would be a good introduction, followed by books, which GPT could recommend.


I have no idea tbh. I learned these a while ago (~7 years ago), and the materials I used then are heavily outdated and also I won't be able to remember what they were. I guess any intro course to deep learning should talk about these. Stanford ones used to be good. Maybe someone else can be more useful about it.


I'd recommend starting here:

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

It's pretty lengthy but doesn't require a PhD to understand. If you can get to the end of it you'll have a much better understanding of what's going on.


To be honest, I'd start with some introduction to Transformer YouTube videos. They'll cover a lot of these terms and you'll then have a better understanding to find additional resources.


> Rule of thumb is that you need ~20 tokens per parameter.

That rule of thumb is wrong. The chinchilla paper has it anywhere between 1 and 100 tokens per parameter.


> Those rumors seemed ridiculous in hindsight

No, those rumors seemed ridiculous even then. Many AI influencers were posting some of the most absurd material, often making basic mistakes (like confusing training tokens with parameters), but anyone in the field could have easily told you that 100T parameters sounded ridiculous.

On that note, "100 Terabytes of GPU memory is exactly what you need to train that class of model." is also likely false. That's how much you'd need to fit such a model into memory at 1 byte per param. Not train it.


https://huggingface.co/docs/transformers/perf_train_gpu_one#...

You can't train a 100T model with "only" 100TB of VRAM. You need for each parameter 4 bytes (weights) + 4 bytes (gradient) + 8 bytes (AdamW optimizer state) + forward activations that depend on the batch size, sequence length, etc., maybe more if you use mixed precision, and you also need to distribute the weights.
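
Tallying just those per-parameter terms (activations left out, since as noted they depend on batch size and sequence length):

    def train_vram_tb(params):
        # 4 bytes weight + 4 bytes gradient + 8 bytes AdamW state, per parameter
        return params * (4 + 4 + 8) / 1e12

    print(train_vram_tb(7e9))      # 0.112 TB for a 7B model, before activations
    print(train_vram_tb(100e12))   # 1600.0 TB for a 100T model: way beyond 100 TB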


The general rule of thumb that I'm familiar with is that you need about 80 bytes of VRAM per parameter when you are doing training. Inference is different and a lot more efficient, and LoRA is also different and more efficient, but training a base model requires a LOT of memory.

A machine like this would top out below 2 trillion parameters using the training algorithms that I'm familiar with.


I suppose it would be 12 bytes? 4 bytes for base model, 4 bytes for optimizer momentum and 4 bytes for optimizer second moment EWA.


I don't know what the breakdown is, but I know there was code for training the llama models on a DGX (640 GB of VRAM, repo is now gone), and it could only train the 7b model without using deepspeed 3 (offloading).

The ML engineers in my group chat say "1 DGX to train 7b at full speed, 2 for 13b, 4 for 30b, 8 for 65b"


Why 80? It's matrix operations on 4 byte numbers for single precision.


Because you need a lot more information to perform back-propagation.


It's not "a lot more" information, it's holding derivative (single number) per parameter, right?


For automatic differentiation (backpropagation) you need to store the intermediate results per layer of the forward pass. With checkpointing you can only store every nth layer and recompute the rest accordingly to reduce memory requirements in favor of more compute.
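
A toy sketch of that every-nth-layer idea in plain Python; the "layers" here are stand-in functions, not a real framework:

    def forward_with_checkpoints(x, layers, every=4):
        """Run the layers, keeping activations only at every `every`-th layer."""
        saved = {0: x}
        for i, layer in enumerate(layers, start=1):
            x = layer(x)
            if i % every == 0:
                saved[i] = x
        return x, saved

    def activation_at(i, layers, saved):
        """Recompute layer i's output from the nearest stored checkpoint."""
        start = max(k for k in saved if k <= i)
        x = saved[start]
        for layer in layers[start:i]:       # redo the skipped forward work
            x = layer(x)
        return x

    layers = [lambda v, k=k: v * 2 + k for k in range(12)]   # stand-in "layers"
    out, saved = forward_with_checkpoints(1.0, layers, every=4)
    print(len(saved), activation_at(7, layers, saved))       # 4 activations stored instead of 13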


What intermediate results you need to store?

For backpropagation you take the diff between actual and expected output and you go backwards to calculate derivatives and apply them with the optimiser - that's 8 extra bytes for single precision floats per trainable parameter.

Why do you need 80?


You also need the optimizer (e.g. Adam)'s state, which is usually double the parameter's size. So if using fp16, one parameter takes up 6 bytes in memory.


Yes, if you use ADAM - but it doesn't add up to 80, does it?

Even for fp64 it adds only 16 bytes.

RMSPRop, Adagrad have half of this overhead.

SGD has no optimizer overhead of course.


It's not per parameter, you also need to hold activations for back prop to work.


You need activations for inference as well.

But all of that (trainable parameters, activations, optimizer state) is like 12 bytes per trainable parameter, not 80.


Not the GP, but I believe that they are talking about the size of the training data set in relation to the model size.


You don't need to and can't really load all training data.

For LLMs you need to load a single row of context size - that's a vector of, say, 8k numbers, which is 32kB for single precision floats.


For the numerically challenged like me: 100TB is 100 trillion bytes, giving you 1 byte per parameter at 100T params.

LLaMA can apparently run quantized to 4 bits per param (not sure if worth it though), which would allow you to run a ~200T parameter model on one of these systems if I'm understanding right.
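
The byte math, for anyone checking (weights only, ignoring activations and KV cache):

    memory = 100e12                        # ~100 TB of unified GPU memory, in bytes
    for bits in (16, 8, 4):
        print(bits, "bit ->", memory / (bits / 8) / 1e12, "T params fit (weights only)")
    # 16 bit -> 50.0, 8 bit -> 100.0, 4 bit -> 200.0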


You can’t quantize it for training due to numerical instability. For inference you don’t usually use such a big cluster.


I think people talking about a 100T GPT didn't mean a dense transformer but some sort of extreme Mixture-of-Experts which is much more amenable to low-resource setups and complicates this discussion.

In any case, it's almost certainly not bigger than 1T, even if it's not a dense transformer (PaLM-2 is and makes do with 340B, but it isn't exactly on par).


> LLaMA can apparently run quantized to 4 bits per param (not sure if worth it though)

From the GPTQ paper https://arxiv.org/abs/2210.17323:

"... with negligible accuracy degradation relative to the uncompressed baseline"


That would work for inference, but for efficient training you’d also want your training set to fit in memory.


There is much more out there than text. Audio, visual, touch, smell. Text isn't something humans directly train on, but representations of text from our senses.

GPT-4 was trained on image data. Besides gaining understanding of image content it also showed improved language abilities over a GPT-4 trained with only text. Facebook is working on a smaller model with text, image, video, audio, lidar depth, infrared heat, and 6-axis motion data. If a GPT-4 was trained with data like that, what capabilities would it have? Rumor says we will know in a few months.


My understanding is that the image data used a decoder-only stage, i.e. mapping images to tokens, basically taking the image textual description instead of the actual pixels so it can't "see" but can understand the "narration of an image"


John Conner, is that you?


> However, I’m not even sure enough text data exists in the world

I hope these models move significantly beyond text at some point. For backend programmers it's ok, but for the rest of the technical world (circuits, mechanical engineering, front end, sound, etc), it's fairly limited.


My understanding is that this is already the case, see PaLM-E as one such example of a multimodal model.


> Maybe if you generated massive quantities of text with GPT-4 and used that dataset as your pre-training data

Hello spurious regression


These 100T rumors were ridiculous from the start, not just in hindsight.


I think we’re going to start seeing learning based on all the video out there. Text is just computationally easier, but video contains a lot of information that people rarely write about, because it’s completely obvious to humans who grew up in the real world.

Also, I think training in simulated realities will be big, especially for learning how to interact with complex systems, for developing strategic planning heuristics.


There may not be enough text content on the internet, but there’s plenty of audio and video content, and there has already been some research about connecting that as an input to an LLM. So far we’ve seen that the more diverse the training data the more versatile the model, so I suspect multi-modal input training is inevitably where LLM’s are going.


As far as I can tell, the "100 trillion" number comes from an interview with the CEO of Cerebras when he was doing press for the WSE-2 release in 2021: https://www.wired.com/story/cerebras-chip-cluster-neural-net...


You don’t really need to fit fully in memory. Memory requirement to train is

~6 * D * P * precision

Where D is the number of tokens per step (context window * mini batch size) and P is the number of parameters.

So if you want to fit fully into memory with a mini batch of 1, a context window of 32k, and 16 bit precision, that’s 144e12 / 6 / 32e3 / 2 = 375M params.

If you apply one token at a time then

144e12/6/2 = 12 T param

Ofc, in reality you have model parallelism as well…


I have to wonder how much improvement you would get with a 100 trillion parameter model. There seems to be diminishing returns in model size. That effort could almost certainly be better spent.


Let's record every conversation on Android to collect training data! Anyone can do the math?


So how exactly (in the technical sense) is this more energy efficient than both PCIe and Infiniband (which seems to be a claim somewhere too, together with the added bandwidth)?

EDIT: so the whitepaper is surprisingly good for that (somehow all the articles are very weird...): https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-ho... - essentially they connect the GPUs with NVLink instead of PCIe (so, vertical integrator's heaven) and then NVLink forms a separate interconnect for the GPUs. So this is cool and essentially what Fujitsu, Google, ... have done for some time. A fun thing is that they like to add up their NVLink duplex bandwidth but don't do so for PCIe... (which would then suddenly have the same bandwidth as the GPU side).

Still very cool to see the mainframe come back alive ...

(it's a bit sad they bought Mellanox - monopolies are sad...)


Article written by marketing. White Paper written by apps engineering. You can always tell so much more about semiconductor products if you can get your hands on material written by engineers. The information hasn't been distilled through layers of internal training sessions.


This is awe-inspiring and almost scary, it's pretty much beyond my understanding how much data these systems are meant to process.

What is also beyond me is how someone at Nvidia thinks that the label sequence "1.00E+2; 1.00E+3; 1.00E+4; 1.00E+5; 1.00E+6" for the vertical axis in "Figure 1" is more readable than "100; 1,000; 10,000; 100,000; 1,000,000" would have been. The latter is 5 chars less (total), even. Or, if exponential notation is important for the Big Serious Computing People, then perhaps they could have dropped the ".00" part from each value? Or, if I'm allowed to dream, gone with actual exponential notation?


The exponent number is the number of zeros; it's way more readable and faster to interpret than counting zeros.


It's easier to think in (possibly relative) orders of magnitude than with absolute numbers, instinctively it's what we do when we read large numbers.


It's scientific notation, and it makes the graph a log scale. It allows you to see they gained more than two orders of magnitude in a single generation.


I'm not sure what's the problem with the exponential notation? It shows scale in order of magnitudes.


> I'm not sure what's the problem with the exponential notation?

Same. It's just more efficient and readable than counting the zeros while considering the culture-bound digit group separator norms like thousands/millions 3,3 vs lakhs/crores 2,2,3, cf. https://en.wikipedia.org/wiki/Indian_numbering_system?useski...

Personally, I think it'd have been better to drop the .00, which adds nothing: 1E2, 1E3, etc. would be far better.


There is a semantic difference: 1,000,000 means exactly 1,000,000, whereas 1.00E+6 means anything that rounds to 1.00×10^6 at three significant figures. The decimal places after the 1 in 1.00E+6 specify the precision of the measurement.


I don't think they specify any precision. It's just a way to write very large/small numbers approximately. (Though these numbers here aren't really considered large.)


They do. See wikipedia [significant figures] https://en.wikipedia.org/wiki/Scientific_notation#Significan...

In this case, it is pointless though, since the precision is actually known.


I think the scientific notation is much more readable. Writing out the number of zeroes just leads me to counting them.


I can't find how much it will cost or how much power it will use. I mean it will be a lot and maybe only Google and Microsoft and Facebook can afford it but I still want to know.


The DGX A100 was $200k at launch. I found a DGX H100 in the mid-$300k area. And those are 8-GPU systems. So you need 32 of those, and each one will definitely cost more, plus networking. A super low estimate would be $500k each for $16M total. But considering it's moving from 98GB to 480GB of RAM per GPU, it might be more like $1.5M per 8; round it to say $50M.

And at 1/8th the power per GB, 700 watts / 96GB / 8 * 480GB comes to around 450 watts each, and 115 kW for the 256.
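
Putting those guesses in one place (every price here is an assumption carried over from the estimate above, not a quote):

    systems = 256 // 8                      # 32 eight-GPU boxes to reach 256 GPUs
    low, high = 500e3, 1.5e6                # guessed price range per box
    print(f"${systems * low / 1e6:.0f}M to ${systems * high / 1e6:.0f}M")   # $16M to $48M

    watts_per_gpu = 700 / 96 / 8 * 480      # 1/8th the power per GB, scaled to 480 GB
    print(f"~{watts_per_gpu:.0f} W per GPU, ~{watts_per_gpu * 256 / 1e3:.0f} kW for 256")
    # ~438 W per GPU, ~112 kW for 256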


What does this mean for the AI race? For example, what if a newish company (newer than Google/Facebook/Microsoft/etc.) like Anthropic, Scale, Perplexity, or Stability is able to scrape together $5B USD in funding and spend their hardware budget on these things. Say they can buy $1B of them and spend the rest on hackers and operating expenses (idk if that's realistic). So maybe they could purchase and operate like 20 of them. Say that they spend six months doing experimental things and then the next six months training their Tsar Model. If they follow the Chinchilla scaling laws and normal architectures, how good will these models be?


For the AI race, it means the door opens to a competitor. Could be AMD, Google, or Amazon, all of which have offerings in this space.

However, while the hardware isn't cheap, it's still likely not a blocker. Costs do inhibit more experimental research, though.


A startup that rents this compute when needed will likely have more of an advantage on the AI front.

Selling shovels is good business, but it doesn't compete directly in the AI arena.


I have no expertise in GPU systems used for AI training, but would it be possible to buy a bunch of consumer cards and get the same performance? Or is this not possible because consumer cards top out around 40-ish GB of RAM, so models would not fit, or would be "swapping" like crazy and be slow?


Consumer cards only have PCIe 4.0, at most 24GB VRAM and the only recent model with NVLink, the RTX 3090, can only be connected to exactly one other card. It doesn't scale beyond that. So you are limited to PCIe 4.0 x16 speeds.


Technically the A6000 has 48 GB of VRAM and works in 2-way NVLink.


The NVLink interconnect between all the GPUs is a huge part of it, and consumer gear cannot come even remotely close to that bandwidth. Then there's the density of RAM to compute and power. A single 4090 is 450 watts for 24GB, whereas this gets roughly 20x the memory for the same watts per node; matching the memory with consumer cards works out to 2.3 MW or so. At $0.14/kWh, that's something like $325/hour in power costs just to run it, not counting the additional cooling you are definitely going to need. And I am sure there are inefficiencies this doesn't cover, but that's 240 V at 10,000+ amps.
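A quick sketch of that math in Python (the 4090 count is just the hypothetical number needed to match the memory, not a real build):

    total_gb = 480 * 256          # ~123 TB of node memory across the system
    cards = total_gb / 24         # 5,120 RTX 4090s to match it
    watts = cards * 450           # 2,304,000 W, i.e. ~2.3 MW
    print(watts / 1000 * 0.14)    # ~$323/hour at $0.14/kWh
    print(watts / 240)            # ~9,600 A at 240 V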


Not the same. Not all problems can be efficiently divided among NUMA nodes with low bandwidth interconnects.


> would it be possible

No


With that much memory I could probably run Crysis 4 and an Electron app side by side!


It's in Nvidia's interest to make sure they don't end up in a situation where a small group of very big customers buys a large slice of their production. For Nvidia, a market of the same size with many small-to-medium customers is a lot better, as those customers have far less power to force Nvidia into something that isn't in its interest. I expect to see moves from Nvidia to help smaller players and open-source or semi-open models avoid being crushed by the big players. Not because they are nice, but because it is in their best interest.


I wonder if you can even run this on a regular 20A circuit - I'm thinking no - 20 * 120 = 2400 Watts - I assume that will not be enough...


The hob in my kitchen is 7.3 kW. A normal 32 A × 240 V circuit allows up to 7.68 kW, and the 6 mm^2 cable is rated to something like 45 amps.

This seems fairly common e.g. https://www.currys.co.uk/products/aeg-ikb64401fb-59-cm-elect...

I am sure data centers have larger circuit breakers and chunkier cables than my kitchen appliances!


Woah, I looked at the specs:

    Front left: 2.3 kW / 3.7 kW
Cripes. You can boil water extremely fast on that IH setup! I'm living with 1.5 kW, and it is painful...


When we first got our induction cooktop I was so excited about how ridiculously fast we could boil water. Which is definitely an odd thing to get excited about. It definitely isn't that powerful though. That's a lot of power.


For a single Grace+Hopper node? I'd bet it fits in that budget: the Grace Hopper datasheet says the combo has a CPU + GPU + memory TDP programmable from 450 W to 1000 W, which leaves more than half the room for the rest of the node's power budget. For the DGX GH200? It's 18,432 CPU cores with 256 GPUs across 16 full racks of servers :p


Here's an example of an 8x H100 machine - look at the tech specs: https://lambdalabs.com/deep-learning/servers/hyperplane

6x 3000 W PSUs in a 3+3 redundant config, so 9,000 watts usable. That means at least a 240 V × 50 A feed, ×2 for redundancy.


I would guess about $100m and ~1.5-2MW


Yet another thing to put on my wishlist for whenever I become an eccentric billionaire.

I know that I will never be able to afford such a thing (or possibly even afford to power it for more than a few minutes), but a man can dream.


Did supercomputers ever produce something meaningful, or did advancement usually come out of "scrappier" setups?

I remember hearing a lot about rankings of supercomputers, but less so about what they actually achieved.


Yeah, they do all the time. In my parallel computing course back in grad school we got to use an 800-core test machine where people were running simulations of weather patterns, climate change, earthquakes and whatnot. A lot of that can be done by taking advantage of all of those cores. Academia in particular uses these heavily to get closer to the "physics", within clear discrete limitations.


What is the difference between a supercomputer and a million ordinary servers connected together?


Typically:

  1. Low latency network, 1-2us.  Most servers can't ping their local switch that quickly, let alone the most distant switch among 1M nodes
  2. High bandwidth network, at least 200 Gbit
  3. A parallel filesystem
  4. Very few node types
  5. Network topology designed for low latency/high bandwidth: things like hypercube, dragonfly, or fat tree
  6. Software stack that is aware of the topology and makes use of it for efficiency and collective operations
  7. Tuned system images to minimize noise, maximize efficiency, and reduce context switches and interrupts.  Reserving cores for handling interrupts is common at larger core counts.


Simplicity of the programming model, basically, though in the end it all just comes down to bandwidth and latency.


Communication speed / latency is a big one. Sometimes it matters how quickly extremely large volumes of data can be sent between cores.


Supercomputers exist in meaningful part to compensate for our lack of ability to do nuclear tests. This is why the national labs run them.


>Did supercomputers ever produce something meaningful

Supercomputers do all the hard work in research universities all the time. Hell, astrophysics and research involving telescopes and observatories use them all the time.


Yes, absolutely. Most climate models run on supercomputers, same with molecular dynamics, large scale fluid dynamics, energy systems simulations and of course a whole lot of weapons research.


> supercomputers ever produce something meaningful

Absolutely, they contribute to research all of the time.

Some of them have pages where they list research outputs that they enabled (though this is of course limited to those authors tell them about!).


Ever-better weather forecasts, for one. I can remember that, about two decades ago, forecasts were still rather wobbly and could only see a couple of days into the future. Now a 10-day forecast is routine, and surprisingly good. Much of that improvement came from more powerful supercomputers.


Weather forecasts are vastly more accurate because of supercomputers. And they're improving all the time.


We've never had architectures that scale so effectively, unlocking new cognitive capabilities by just increasing parameters/exaflops/datasets without writing a lot more code or changing the architecture. Ilya Sutskever mentioned this in some interview, that transformers are the first with that property but probably won't be the last or best.


Most of the research done at CERN, for example.

Besides the outcomes that were adopted by the industry, before cloud computing there was grid computing, exactly to manage such resources at scale.

https://en.wikipedia.org/wiki/Worldwide_LHC_Computing_Grid


I think part of this comes down to your definition of "supercomputer", but I mean, pretty much the entire internet is powered by servers. I've never worked there, but I'm assuming that AWS data centers have very powerful computers designed to handle thousands of VMs/containers each, and I suspect with all the AI hype, a large percentage of them have very beefy GPUs in there as well.

If you're talking about the more stereotypical "high performance supercomputers", I think that they are still used very liberally within the defense industry. I think Lockheed Martin, for example, uses them for CFD analysis.


Google might've been built on a laptop, but it can't scale on a laptop. Same applies to coding an algorithm on a scrappy setup, and then scaling it to sequence DNA or simulate a phenomenon.


Not sure what this means, because google effectively scaled on laptops (generic x86).


In my head the way I differentiate "supercomputers" (national labs) and "warehouse-scale computers" (google/amazon/azure) is:

1. Workload: for national labs this is mostly sparse fp64, in my understanding; for warehouse-scale computing it's lots of integer work, highly branchy, lots of pointer chasing, stuff like that.

2. Latency/reliability vs. throughput: warehouse-scale computing jobs often run at awful utilization, in the 5-20% range depending on how you measure, in order to respond to shocks of various kinds and provide nice abstractions for developers. Fundamentally these systems are used live by humans, and human time is very valuable, so making sure the service stays up and returns quickly is paramount. In my understanding, supercomputing workloads are much more throughput-oriented, where you need to do an enormous amount of computation to get some answer but it doesn't much matter whether the answer comes in one week or two weeks.

3. Interconnect: warehouse-scale computing workloads are mostly fairly separable, and the place where different requests become intertangled is the database. In the supercomputing world, in my understanding, there are often significant interconnect needs all the time, so extremely high performance networking is emphasized.


Nice ontological classification, thank you!


Yes, especially when you account for shared university supercomputers, and especially back when it took a supercomputer to do much of anything.

Generally, given there are far more less-powerful computers, accessible to much scrappier interests, one would expect more innovation to happen on them.


Weather simulations and forecasting are very useful to society and practically almost all available public weather forecasting datasets were computed in some supercomputer cluster.


Have people experimented with distributed training of parts of the model to avoid needing these absolutely massive GPUs? Anyone have pointers to large scale distributed training done recently?


Yes, certainly. One industry use-case that comes to mind is Baidu; white-paper link below [1]. Pretty much all the large model developers distribute their model training across hardware in some way, using a blend of GPU/TPU/FPGA accelerators across multiple CPU nodes. Moving all the data around is expensive though, in both power consumption and time, which is why NVIDIA's new system would be of interest.

[1] http://research.baidu.com/Public/uploads/5e76df66c467b.pdf


This is fantastic, thanks.


The DGX described here is a distributed system in the sense that many nodes, each with their own GPUs, exist and are part of the overall whole. They are connected over Infiniband and use RDMA in order to read/write memory across the cluster. Therefore training is also distributed among the nodes in the sense that each node takes part of the process.

The difference is that Nvidia's software and hardware stack combined makes all these systems, all these aggregate GPUs, look like One Really Big GPU. Not hundreds of small ones. That's not only good for users because they can take existing programs and migrate them to these big machines and get improved performance, but also good because it's generally much easier to program "one big machine" as opposed to programming and orchestrating many small ones. This is an attractive proposition for many but it requires an insane amount of integration to achieve.

So, the major differentiator here isn't the lack of or existence of many discrete machines connected together. It's the programming model, at this scale, that's different. And Nvidia is way ahead of everyone else here in terms of programming models; once full heterogeneous memory management for CUDA arrives in a stable consumer driver, it'll be a massive change for others to catch up with.

What you might also be referring to is the idea of "distributed training", or what is called "ensemble learning" where you individually train a bunch of small unique models that, when combined together, perform better than if they were one giant model (or at least are as accurate/efficient as a giant model.) It's "The P2P model" of training because you can take lots of small models and collectively aggregate them. That's an open problem people are attacking but not really relevant in the case of the DGX.

Many hyperscalers, such as Microsoft and their project "Brainwave", have very complex heterogeneous AI datacenter stacks consisting of GPUs, FPGAs, TPUs and CPUs. (Google "Microsoft Brainwave" for some papers.) This DGX is positioned as an alternative to that but also as a tool for their customers to use since many want to train large models efficiently.


It's not InfiniBand; they're using NVLink this time.


NVLink is the GPU interconnect and part of the memory fabric. InfiniBand at 400Gbps is the data and storage interconnect for the CPUs but also to build clusters bigger than 256 GPUs. It's all in the docs.



The issue is that distributed training needs high bandwidth and very low latency to be efficient. In a single computer you can fit about 8-10 GPUs, or if you go to extremes like in this system you might fit 16. To scale beyond that, you connect multiple computers in the same rack via InfiniBand (an optical-fibre networking technology; the system in the article comes with a 400G InfiniBand network adapter).

But systems that can host many GPUs tend to be expensive, and electricity is expensive, so at scale the expensive GPUs make sense. For a homebrew solution you can stick four consumer GPUs in a case and might save a buck.


There's also Petals: https://petals.ml/


this is distributed training with RDMA-aware and directly interconnected GPUs?


How far are we from fully modeling the human brain? I mean besides an easy way to identify all the neuronal connections...

This makes me feel like we're close to that one terrifying short-story.


We are massively far away from modeling the human brain. First of all, no one can agree what level of detail is necessary to model the brain, and that varies tremendously by scientific question. Personally, my lower limit would be something like the computational package NEURON, which models voltages across axon compartments and the distribution of ion channels. My upper confidence bound is that we don't care about anything subatomic.

At the upper bound: in molecular dynamics, which is used extensively in modern-day neuroscience to understand the function of ion channels and GPCRs, a single H100 can simulate roughly 70 ns/day for a ~1M-atom system. There are 8.64e+13 nanoseconds per day, and ~1e+26 atoms in a human brain. Therefore, a back-of-envelope upper limit is 1e+26 atoms / 1e+6 atoms per GPU * (8.64e+13 ns / 70 ns) ≈ 1.2e+32 H100 GPUs to simulate a whole brain at atomic resolution in real time.

Calculating the lower bound is more difficult, but let's start by saying you can get away with an fp16 for each synapse. Storing the weights of such a model for 100 trillion synapses is 200 terabytes, and if you figure roughly 4x the weight size to do anything useful, then this system is within spitting distance. Note that this example lower bound is massively less complex than the NEURON-style model I suggested, as the entire field of neuromodulators, homeostatic mechanisms, glia, and more is thrown out, all of which are important for modeling how the brain works under certain computational regimes.
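A quick back-of-envelope for both bounds (a sketch built on the figures above; the ~1M atoms at ~70 ns/day per H100 is my assumption for a typical MD workload):

    # Upper bound: atom-level molecular dynamics of a whole brain, in real time
    atoms_in_brain = 1e26
    atoms_per_gpu = 1e6           # ~1M-atom system per H100 (assumption)
    sim_ns_per_day = 70           # ~70 ns of simulated time per day
    real_ns_per_day = 8.64e13
    gpus = (atoms_in_brain / atoms_per_gpu) * (real_ns_per_day / sim_ns_per_day)
    print(f"{gpus:.1e} H100s")    # ~1.2e+32

    # Lower bound: one fp16 weight per synapse
    synapses = 100e12
    weights_tb = synapses * 2 / 1e12
    print(weights_tb, weights_tb * 4)   # 200 TB of weights, ~800 TB with working memory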


For the lower bound there is a dark horse factor that has spooked Geoffrey Hinton. He thinks that biological brains aren't able to do backpropagation effectively through multiple layers, and so differentiable programming frameworks are much more powerful than what the brain has, at an algorithmic level. In other words, he thinks that computers are able to learn more effectively than any neuron-based biological brain. Of course right now there are caveats. The brain appears to have more 'statistical efficiency' meaning it appears to learn more from less data, and the brain is obviously more energy-efficient. There is also the possibility that Geoffrey Hinton is just wrong.


Biological brains also don't really operate layer by layer and can have connections between random neurons, so it's probably a lot more space efficient. Impossible to say if any of that actually matters though.


Can you cite a reference for Geoffrey Hinton (and/or others) on this "dark horse factor" line of thought? I think it resonates with a lot of people, and having a Schelling point to refer to in discussions would be handy.

I believe the situation is a lot more extreme than the absurd efficiency of differentiable programming. I have been meaning to write up (but been too busy to do so) an insight where I believe training can be made ridiculously cheap computationally speaking (in a way that combines with differentiable programming, not replaces it). I am agnostic if this is what the brain does, but wouldn't be surprised at all if the brain does in fact do back-propagation (or uses the insight that I've been meaning to write up).


That is a spectacular response!

My bio knowledge is very basic, so forgive the naivety in these two questions.

First, I'm not asking you to go through the math on the spot, but I'm guessing that lower-bound capability is well understood in 'the field', but is it documented against various species? Perhaps mapping against current / projected GPU/compute systems capabilities? (I know there's a project to model a worm's brain, IIRC down to molecular level. But I'm picturing a 'we are 3 years away from being able to emulate a basset hound, 4 years for a border collie' - that kind of roadmap.)

Second, you said the upper bound is to ignore the sub-atomic. I thought we had proton and electron gradients, at least in metabolism. I believe a proton there is effectively a hydrogen ion, but electrons would imply some potential need to emulate at the sub-atomic level? Have I misunderstood the bounding / chemistry involved?


The electron and proton are still considered part of the realm of atomic physics. When physicists speak of sub-atomic, they tend to mean any physics below the Aufbau model of electrons and nuclei.


We will never simulate the entire brain atom by atom, and we won't need to, the same way we never simulate atom by atom, or place structural atoms by hand, when we build a bridge, a rocket, or a tree house; we can be far more intelligent than that [1]. In the limit, the entire thing could be even simpler than we can currently imagine [2]. But yes, before we start leveraging equations, we must first find the principle of gravitation for collective intelligence [3].

[1] https://en.wikipedia.org/wiki/Hodgkin%E2%80%93Huxley_model

[2] https://en.wikipedia.org/wiki/Reaction%E2%80%93diffusion_sys...

[3] Michael Levin | Cell Intelligence in Physiological and Morphological Spaces, https://www.youtube.com/watch?v=jLiHLDrOTW8


I think we're a long way away, if I understand this article about the difficulty of accurately modeling even a single biological neuron:

https://www.quantamagazine.org/how-computationally-complex-i...

> "If each biological neuron is like a five-layer artificial neural network, then perhaps an image classification network with 50 layers is equivalent to 10 real neurons in a biological network."

The complexity explodes quickly because each biological neuron's behavior is modulated by a large number of biochemical neurotransmitters, on top of all the dendritic connections (up to 15,000 each, apparently).


When AI becomes good enough, maybe we will stop thinking about trying to imitate human brains. If we viewed our brain's decision-making power objectively, we would find several flaws; for example, the heuristics that let us make quick decisions about mundane things are also our greatest weakness (short-sightedness). We are poor at incorporating data to make good decisions and are constantly biased by external stimuli.

Why would you want to make anything close to the brain? What real scientific, engineering, or humanitarian uses does doing that even have? AI is already, and going forward should strive to be, a ground-up redesign of intelligence.


> Why would you want to make anything close to the brain? What real scientific or engineering or humanitarian uses does doing that even have?

To have models of the human brain that we can poke at and change and tinker with and etc., so that we can get better ideas of how therapy techniques, medications, ... will impact the actual real people that might benefit from them.


You would need to define what you mean by 'modeling the human brain'. If it means AGI or anything similar, then we're very far.

To paraphrase an analogy I've heard somewhere (in a similar context): we're building better and better ladders, maybe even lifts, with this last push in the ML field. But the brain is on the moon - even the best lifts won't get us there.


Sorry if I'm missing an obvious reference, but what short story do you mean?

(edit: thanks for both replies already, and any others that might fit; I understand now the reference was likely to Asimov)




Wouldn't modelling the human brain mean we'd be using less power? We're using brute force to try to get similar results to what the brain does.


The question that interests me is how far are we from modelling a human cell, neuron or not? Because that's how we cure cancer.


Is human consciousness just a product of neural networks, or are there additional mechanisms we're not aware of?


The brain is analog and chemical, AI will be digital and silicon. We have no idea how to map from one to the other.


> The brain is analog and chemical, AI will be digital and silicon.

Says who?

Sure, if you assume that “AGI is just scaling up GPT”, it will be digital and silicon. But that’s a big assumption.

For all we know, AGI will only ever, if it exists, be analog and chemical.

> We have no idea how to map from one to the other.

Plus, even if we had an easy one-to-one mapping function between them, we don’t understand the source well enough to do the mapping.


It does not need to be, but today the computers we use are overwhelmingly based on silicon. Also OP mentioned AI, not AGI.


An analog-to-digital and digital-to-analog converter is less than a cup of coffee in some places [1].

[1] https://protosupplies.com/product/pcf8591-a-d-and-d-a-conver...


Oh god, my background is CE/ECE stuff and you managed to trigger me. I don't want to be rude... just bluntly saying you triggered me. Doing something really small for A/D and D/A at 8 bits, without worrying much about resolution and data loss, is one thing. For something at massive scale, the problem is a lot less trivial and a lot more mathematical.


Haha, sorry, was more of a tongue-in-cheek reply to "We have no idea how [to] map from [analog] to [digital]".


Is the Quantum computer hypothesis dead?


I don't see how quantum computers are relevant? We can't build them, and there certainly isn't any interesting quantum computation in the brain.


What do you mean we can’t build them?

To your second point, we have little to no ability to understand yet what quantum effects may or may not be active in brain/consciousness function. We certainly can’t exclude the possibility.


I mean lots of people have tried to build quantum computers, and so far no-one has succeeded in anything that's describable as a "computer", instead of "half a dozen logic gates". Perhaps in the future.

We can fairly well exclude the possibility of interesting quantum effects in human consciousness, because the human brain is a hot, dense environment that might as well have been literally designed to eliminate the possibility. It's the exact opposite of how you want a quantum computer to be built.

Which doesn't mean there aren't plenty of quantum effects involved in the molecular physics, but that isn't what is normally meant by 'quantum computer'. Transistors would also meet that definition.


I do not agree with either of your assertions:

1) That 433 qubits does not make a computer and is instead “a half dozen logic gates.” I agree a half dozen logic gates is not a computer. 433 qubits is not comparable in terms of information capacity or processing capacity to a half dozen logic gates. This number is also publicly doubling annually now — I would bet the systems we don’t know about are more complex. Importantly, a computer in this context is not something you would attach a monitor to — it is just an electronic device for storing and processing data.

2) That we have any good idea of the limits of how biological systems might be influenced by quantum effects within specific temperature ranges. You certainly wouldn’t construct a human brain to interface with quantum effects given the present state of our knowledge in constructing these kinds of systems. But then we can’t even construct a self-replicating cell yet, nevermind a brain. It’s hard to imagine we understand the limits at work there.


For many applications there is no need to fully model the human brain. An approximation of a particular aspect would be sufficient in most cases. We didn’t build aeroplanes by fully modeling a bird, we just need aerodynamics.


Wow, 480 GB per GPU! What happened to the end of Moore's law?

I hope this improvement translates to consumer GPUs as 24GB is a big limitation.


Moore's law stated that transistor density per square inch roughly doubles every 2 years.

I see no end in sight for that specific law, as we can always go vertical if needed. It had also held up as of 2020 [1].

The folks who conflated Moore's law with the compute capabilities of a CPU doubling every 2 years were wrong.

[1]https://en.wikipedia.org/wiki/Moore%27s_law


Per your link, Moore's law also doesn't state anything about density. Density is just one of the ways to achieve "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year"; i.e., Moore's law only ever stated that transistor count at a given price roughly doubled every couple of years.


Moore's original 1965 article on the topic [0], and the same with additional context in an interview from 2005 [1].

> "The original Moore’s Law came out of an article I published in 1965...I had no idea this was going to be an accurate prediction, but amazingly enough instead of ten doubling, we got 9 over the 10 years, but still followed pretty well along the curve. And one of my friends, Dr. Carver Mead, a Professor at Cal Tech, dubbed this Moore’s Law. So the original one was doubling every year in complexity now in 1975, I had to go back and revisit this... and I noticed we were losing one of the key factors that let us make this remarkable rate of progress... and it was one that was contributing about half of the advances were making. So then I changed it to looking forward, we’d only be doubling every couple of years, and that was really the two predictions I made. Now the one that gets quoted is doubling every 18 months...I think it was Dave House, who used to work here at Intel, did that, he decided that the complexity was doubling every two years and the transistors were getting faster, that computer performance was going to double every 18 months... but that’s what got on Intel’s Website... and everything else. I never said 18 months that’s the way it often gets quoted."

Anyway, see slide 13 here [2] (2021). "Pop-culture" Moore's law states that the number of transistors per area will double every n months. That's still happening. Besides, neither Moore's law nor Dennard scaling is even the most critical scaling law to be concerned about...

...that's probably Koomey's law[3][5], which looks well on track to hold for the rest of our careers. But eventually as computing approaches the Landauer limit[4] it must asymptotically level off as well. Probably starting around year 2050. Then we'll need to actually start "doing more with less" and minimizing the number of computations done for specific tasks. That will begin a very very productive time for custom silicon that is very task-specialized and low-level algorithmic optimization.

[2] Shows that Moore's law (green line) is expected to start leveling off soon, but it has not yet slowed down. It also shows Koomey's law (orange line) holding indefinitely. Fun fact, if Koomey's law holds, we'll have exaflop power in <20W in about 20 years. Which should be enough for people to create ChatGPT-4 in their pocket.

0: https://www.rfcafe.com/references/electronics-mag/gordon-moo...

1: https://cdn3.weka-fachmedien.de/media_uploads/documents/1429...

2: (Slide 13) https://www.sec.gov/Archives/edgar/data/937966/0001193125212...

3: "The constant rate of doubling of the number of computations per joule of energy dissipated" https://en.wikipedia.org/wiki/Koomey%27s_law

4: "The thermodynamic limit for the minimum amount of energy theoretically necessary to perform an irreversible single-bit operation." https://en.wikipedia.org/wiki/Landauer%27s_principle

5: https://www.koomey.com/post/14466436072

6: https://www.koomey.com/post/153838038643


Thank you for the detailed writeup with sources, I enjoyed this. I'd somehow never heard of Koomey's law despite working in tech, this is very interesting and directly relevant to the widespread deployment of AI (biological neural networks still blow silicon out of the water for computations per joule).


An implicit aspect of Moore's law has been that cost per transistor goes down as density increases. This doesn't seem to be the case anymore. The technology required to get higher transistor density is getting ridiculously expensive. We're not seeing the power benefit of scaling down transistors either, since leakage is starting to get too high. I guess there's one more trick in the pipeline with Gate-All-Around, but I don't see a path to better gate control after that. And if we don't get power consumption per transistor down, then stacking transistors in layers to increase density isn't going to be very viable for compute chips, since you need to get the heat out of the chip. IIRC, Intel is working on putting the power metal layers on the back side of the chip, which grows the chip vertically in the other direction, so to speak. It helps wick away heat as well, so it could open a path for a few layers of compute transistors. But all this adds a huge amount of complexity to manufacturing, so at some point it might not be worth the cost anymore.


I thought the power benefits of shrinking still hold up rather well, in contrast to cost. E.g. new Nvidia gaming cards have smaller GPUs for the same price as the respective old generation, meaning the cost per chip area doesn't stay constant for improved manufacturing nodes. So the price per transistor shrinks slower than the number of transistors per chip area grows. At some point in the future the price per transistor would go up rather than decrease. Then the value of shrinking structures could stem, at best, from lower power draw per transistor. For mobile devices. But even power draw per transistor may stop decreasing at some point. Then further shrinking the process nodes would be useless.


That's a common misconception, and I'm not surprised it made it into Wikipedia. Moore's 1965 paper was the first time anybody had pointed out the exponential nature of progress in miniaturizing transistors and packing more of them on the same integrated circuit. But it wasn't until 1975, at the same conference where Dennard presented his scaling laws, that the phrase "Moore's Law" was coined in an interview where someone was trying to explain Dennard scaling to a reporter. The original coining was ambiguous as to whether it meant more transistors, faster transistors, or smaller transistors, and that ambiguity remained in its usage because from 1975 until about 2005 they all went together, just as Dennard said they would.

And my lecture notes from a class I took on semiconductor physics in college had a photocopy of a memo from Moore himself endorsing this broader conception of Moore's Law.


To be fair, if you stack them, density is not going up - it only goes up if you ignore the number of stacked layers and count the total stacked transistors against the area of a single layer. Plus, stacking is great, but given the heat issues, isn't the industry moving toward many dielets with a massive interconnect?


Heat issues are very valid. But the "per square inch" density is relative to a square inch of fab wafer. So if it can be done on one wafer, it counts. If it's stacking discrete chiplets, not so much.


Stacking likely wouldn't save substantial cost compared to producing multiple separate wafers. It could even increase cost if it decreases yield. That's very different from making the transistors smaller, where the cost per transistor decreased exponentially in the past. People focus too much on Moore's law (transistors per area), when the only interesting quantities are 1) price per performance and 2) power draw per performance.


If measured per square inch, the 3rd and, more importantly, the 4th dimension are not accounted for and are basically free.

Another way to say it, channeling the famous Doc Brown: Moore wasn't thinking fourth-dimensionally.

For shame, really.


> I see no end in sight for that specific law. As we can always go vertically if needed.

Not a hardware person but heat dissipation becomes more of a problem when you go vertical IIRC.


Smaller transistors mean lower energy consumption; going vertical won't give you that.

That ceiling will be hit much earlier than what the process/technique would otherwise allow.


Smaller energy per transistor, yes, but if you're packing more into the package, the package's total consumption will go up. Also, I think leakage current (and the heat that comes with it) goes up as the feature size shrinks.


Not to be pedantic but wouldn't stacking transistors have no effect on density per square inch? Since it would only increase density per cubic inch.


We include multi-story buildings when we calculate population density, why not include multi-story chips?


Stacking transistors increases density per square inch if it can be done on a single wafer of silicon, because it's "per square inch of fab wafer silicon".


I view it as if you cut out 1 square inch of a motherboard: every 2 years you'd expect to see roughly double the number of transistors in that cut-out piece.

Scaling vertically would "technically" still meet that.


I don't think that's what they mean by per square inch. They mean in a plane, not a volume. If you add a third dimension the law stays the same, because a volume is just stacked planes and the density law applies to each plane independently. That's why node sizes are a single value, not a two-dimensional value. A 3nm node is 3nm feature sizes, regardless of dimensionality.


The quote given on Wikipedia is:

> The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.

If he was talking about the area of a single transistor, there would be more concise ways to put it.


Even a 1000ft thick motherboard?


That'll show Moore


More’s law: if there is more of it, it does more!


> As we can always go vertically if needed.

I don't think you understand how difficult non-planar transistors are to engineer at scale


How difficult is it? Difficult or really really difficult?


The end of Dennard scaling was the performance breakdown. It meant chip frequencies couldn't be cranked higher and higher, as heat dissipation became more and more of an issue: https://en.wikipedia.org/wiki/Dennard_scaling


Except for GPUs, which are for highly parallel tasks anyway.


A bit self-serving but GPU scaling supposedly follows Huang’s Law (from Jensen Huang of Nvidia) which claims GPUs more than double (~1.7x) every 2 years

https://en.m.wikipedia.org/wiki/Huang%27s_law#


> more than double (~1.7x) every 2 years

To clarify in case anyone else finds this confusing. The linked article suggests a 1.7x annual increase, which compounds to 2.89x every two years.


I'm pretty sure this law no longer holds, as new Nvidia GPUs show only meager performance improvements over their (in-class) predecessors. Though this could also be due to the price per chip area increasing faster than performance.


The 4090 is twice as fast as the 3090 in almost all metrics.


It came out two years later, so according to Huang's law we would still expect more than that. Moreover, most other models have seen much smaller improvements, like the 3060/4060.


There's a 24 GB limitation for consumer GPUs, because AMD and Intel aren't competitive.


It's 80 GB per GPU, and the consumer GPUs purposefully get less memory to induce demand for the server-grade parts.


On top of contracts strictly penalizing utilization of consumer GPUs in data centers, at that! Even with the memory, bandwidth etc. handicaps, servers with 4090/3090s would have been competitive for many ML tasks.


big limitations for what? AI models? It's certainly not for gaming and we're not quite to the point of consumers running huge AI models on their desktops. The HN crowd is, as always, not representative of the broader consumer market.


Aren’t we?

We’re seeing ChatGPT plug-ins for games, to provide intelligent conversation — and we’ve seen DNNs in StarCraft and similar.

To me, the “next gen” of gaming is intelligent NPCs, combining those features to create realistic behavior. That will require that consumer GPUs get closer to supercomputer GPUs:

More tensor cores and higher memory.


Other than the holographic projection, it feels like we're in reach of the holodeck - you ask for a scene with a character or general backstory, and you go in. Fun times. Now on that energy to matter and holographic projection part...


It's not a holodeck but Google has an interactive display now that feels like an open window. It doesn't even register in my mind as a display, it feels like looking through a literal portal to another location in physical space.


I tried to Google for more information, but I didn't find anything. Can you share a link?



That is amazing. Do you work for Google or are you part of the early access program? How does it work? I cannot believe we haven't seen this on HN before!



DLSS needs less memory than rendering in native resolution, not more.


Are there denser memory chips, or is it more of the same memory chips? Putting more memory chips on something doesn't have anything to do with increased density, which is what Moore's law is about.


A large system, so much higher chance of something breaking.

What happens if it loses a node or a link? Or some memory becomes unreliable? This thing needs some sophisticated fault tolerance.


Nvidia's enterprise GPUs are surprisingly unreliable. Working on a 128-GPU A100 cluster on AWS, one would fail every few days. I didn't have any insight into whether it was a hardware or software failure.


> Working on a 128 GPU A100 cluster on AWS, 1 would fail every few days

Define "fail".

> I didn't have any insight on whether it was a hardware or software failure.

Have scripts check nvidia-smi for ECC errors and dmesg for devices dropping off the PCI bus.

For the former, replace the card. For the latter, just perform a device reset (a power toggle of the device and a rescan of the bus is often enough to be back online within 5 seconds).
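Something like this minimal sketch, assuming nvidia-smi is on the PATH and the kernel log is readable (the match strings and follow-up actions are placeholders to adapt):

    import subprocess

    def ecc_report():
        # Dump per-GPU ECC counters; persistent aggregate errors suggest swapping the card.
        return subprocess.run(["nvidia-smi", "-q", "-d", "ECC"],
                              capture_output=True, text=True).stdout

    def pci_dropouts():
        # Heuristic: look for GPUs that fell off the PCI bus (reading dmesg may need root).
        log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        return [line for line in log.splitlines()
                if "NVRM" in line and "fallen off the bus" in line]

    if __name__ == "__main__":
        print(ecc_report())
        for line in pci_dropouts():
            print("possible device dropout:", line)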


What does an AWS user do with this advice?


They either figure out how to write scripts or ask AWS support how to get that done.


I know my comment is not directly related to the post, but it's not completely unrelated. Nvidia's CEO gave a commencement speech recently. In that, he said:

I contacted the CEO of Sega and suggested that they should find another partner. But I also needed Sega to pay us in full or Nvidia would be out of business.

Just as with failures, there are lessons to learn behind every success. Does anyone here have any insight into how Nvidia came out of that embarrassing, incompetent phase and became a path-breaking, trend-setting powerhouse?


I don't have a direct answer, but the "Acquired" podcast did a two-part series on Nvidia in 2022 Q1; it's a good, deep answer to why they are the way they are.

part 1: https://www.acquired.fm/episodes/nvidia-the-gpu-company-1993...

part 2: https://www.acquired.fm/episodes/nvidia-the-machine-learning...

In short, Jensen has an almost Elon-like appetite for "bet the company" tier risk. He's never been comfortable with a plateau, and is always looking for the next mountain to jump to, before the plate tectonics of the industry come around to form it.

There aren't a lot of CEOs and companies that oversee a company- or industry-transforming shift more than once; he's definitely in that camp.

But that's a gross oversimplification, the story is more interesting. Check it out!


Thank you for your response and those links.

It seems Jensen is either reserved or media-shy. He does not make as many public appearances as his contemporaries like Musk/Jobs/Gates/Bezos.

Also, there has not been any book on him. But there are a couple on the way - https://www.amazon.com/s?k=Jensen+Huang&ref=nb_sb_noss . I hope he'll choose to publish a 2,000-3,000-page biography. I love reading the life stories of successful leaders.


The trouble with all this is that people are forced to buy these insanely expensive systems merely for the benefit of fitting all that stuff in the VRAM, even if they don't end up using the compute cores on these machines (which, let's be honest, aren't really all that much better than the gaming GPUs).

The needs are more in line with commodity server hardware with user choice of CPU/RAM, etc. Sounds to me like there's a market for disruption. Pity that the deep-learning community is under the chokehold of Nvidia's software.


Eh, time will tell. Nvidia is killing it because they have "semi-decent" software toolchains like CUDA, while just about every other hardware player botched their software. That said, there's a lot of interesting development on the XLA and PyTorch 2.0 sides that lowers straight down to LLVM, bypassing Nvidia's CUDA moat.


The same is also true for https://github.com/ROCmSoftwarePlatform/rocBLAS and https://github.com/ROCmSoftwarePlatform/hipBLASLt, although the build stack and distribution leave a lot to be desired, and it's otherwise quite unstable.


Why do we still call those processors GPU when they are not meant to process graphics?


A GPU specializes in vector math, which is what gaming graphics uses; hence the name Graphics Processing Unit and its original use case. It just so happens that LLMs are powered by vector math much like graphics applications are.


So it should be called VMPU then, not GPU.


CPUs are also arguably not the centre of processing in such systems.


Good point. "AI accelerators" is a thing now that competes directly with "GPUs used for AI as its sole purpose".


The technical term is GPGPU, general-purpose computing on GPUs, but I like to call them GPUs for short.


So, a GPPU.


That’s very interesting - why such an increase in memory capacity? I hope they can translate this to cheaper cards too


It's not an increase in capacity of per-GPU "GPU memory" (the HBM directly connected to the H100 here is up to 96GB, where the previous generation was 80GB), but rather reflects the product of two things:

1. Each node here is a more tightly-coupled CPU+GPU two-chip pairing, and the CPU side has a significantly larger pool of 480GB of LPDDR ("regular" RAM). So each GPU is part of a node that includes up to 480+96GB of total memory.

2. There are way more nodes: 256, up from 8.
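Putting rough numbers on that (a sketch; the per-node figures are the ones above):

    hbm_per_gpu = 96       # GB of HBM3 on the Hopper side (up to)
    lpddr_per_cpu = 480    # GB of LPDDR5X on the Grace side
    nodes = 256

    print((hbm_per_gpu + lpddr_per_cpu) * nodes / 1000)   # ~147 TB addressable system-wide
    print(lpddr_per_cpu * nodes / 1000)                   # ~123 TB of that is CPU-side LPDDR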


>480+96GB of total memory

Is this memory unified like Apple Silicon? Meaning, can a model be deployed onto 576GB of total memory? Can the GPU read memory directly from the 480GB pool? Same question for the CPU being able to directly access the 96GB.


It should be mapped as one address space, so yes to the question about loading across multiple GPUs. It's not fully unified though; at this scale of computer it's simply impossible to put 100s of GB on an SoC like that. Instead, the GPU and CPU have DMA over PCIe and NVLink, which is plenty fast for AI and scientific compute purposes. "Unified memory" doesn't make much sense for supercomputers this large.


`Nvidia discovers DMA`


This device has a fully switched fabric allowing comms between any of the 256 "superchip" nodes at 900GB/s. That is dramatically faster than a direct host-to-GPU 32-lane PCIe connection (which is crazy), and obviously dwarfs any existing machine-to-machine connectivity. The actual usability of shared memory across the array is improved significantly.

I mean... Nvidia has obviously been using DMA for decades. This isn't just DMA.
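Headline-number comparison, for scale (a sketch; note the NVLink figure is usually quoted as aggregate bidirectional bandwidth while PCIe is usually quoted per direction, so it isn't perfectly apples-to-apples):

    pcie_gen4_x16 = 32     # GB/s per direction
    pcie_gen5_x16 = 64     # GB/s per direction
    nvlink_gh200 = 900     # GB/s, the per-GPU figure quoted for the NVLink fabric
    print(nvlink_gh200 / pcie_gen5_x16)   # ~14x the Gen5 x16 headline number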


Parent discovers the difference between DMA and RDMA


No, I mean the fact that Nvidia is now claiming that the memory the CPU has access to can be counted as memory for the GPU. The fabric is neat; the "we have 500 GB of RAM per GPU" claim is questionable.


NVLink provides cache-coherent load/store access, so the point is actually that it's not DMA.


They do make PCI hardware, don't they?


It is surely driven by the gargantuan increase in demand for training and running massive models.


How does this compare to Cerebras? https://www.cerebras.net/


I think this has a lot more memory than Cerebras. Their site doesn't say how much memory they can attach to each Cerebras chip and I've gotta imagine that's because it doesn't look good vs their competitors.


I watched Huang's Computex presentation and was pretty impressed by the hardware they're launching, but it's worth noting some caveats. While it was announced to be in "full production," according to the blog post, DGX GH200 won't be available until the end of the year, which puts it about on the same timeline as AMD's delivery of MI300.

Also, while the 1 exaFLOPS topline number is impressive, it has some asterisks. Each GH200's H100 GPU only does 34 TFLOPS of FP64 according to the data sheet. [1] At 256 nodes, this is a mere 8.7 petaFLOPS, or 0.0087 exaFLOPS. You only get to the 1 exaFLOPS number (from looking at the data sheet) if you are doing sparse FP8 Tensor Core FLOPS (3968 TFLOPS/GH, non-sparse is halved). It's worth keeping this in mind when comparing to something like Frontier (1.1 exaFLOPS) [2] or the upcoming El Capitan (expected 2 exaFLOPS) [3] - Top500 uses LINPACK, which benchmarks FP64 FLOPS. Of course, for AI training, FP8 or BF16 are probably the most relevant numbers for perf/W and perf/$... Frontier and El Capitan are each estimated to cost ~$600M, and while exact numbers weren't given, I'd expect a full 256-node DGX GH200 to come in between $50-100M.

AMD will be holding a "Data Center and AI" event on June 13th, so we'll soon get to see how competitive they are (the announced MI300 specs are 24 Zen 4 cores with a CDNA3 architecture that is 8X faster than the MI250X (383 FP16/BF16 TFLOPS), so about 3000 TFLOPS, with 128GB of unified HBM3 on an 8192-bit bus (6.5TB/s theoretical), which is in the same ballpark as Grace Hopper). I think it'll mostly come down to software, but with drop-in PyTorch support, OpenAI's Triton, etc., I'm somewhat optimistic that it will be worth it for big players to do some software lift (that others can benefit from), if the cost competitiveness of the hardware is there. For details already made public, see: https://www.tomshardware.com/news/new-amd-instinct-mi300-det...

[1] https://resources.nvidia.com/en-us-dgx-gh200/nvidia-grace-ho...

[2] https://en.wikipedia.org/wiki/Frontier_(supercomputer)

[3] https://en.wikipedia.org/wiki/El_Capitan_(supercomputer)
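For reference, the arithmetic behind those totals (a sketch; per-GPU figures as read from the data sheet above):

    gpus = 256
    fp64_tflops = 34           # vector FP64 per Hopper GPU
    fp8_sparse_tflops = 3968   # sparse FP8 Tensor Core (halve for dense)

    print(gpus * fp64_tflops / 1e3)        # ~8.7 PFLOPS of FP64
    print(gpus * fp8_sparse_tflops / 1e6)  # ~1.0 EFLOPS of sparse FP8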


I can finally play Crysis on highest settings.


Shame they don't seem to be keeping pace with AMD, even after the Mellanox acquisition.


imagine a beowulf cluster of them


How does it compare with Tesla Dojo?


okay, but can it play crysis at 60fps


But can it play Doom?


Hell, with a developer emulation force of 13 MegaCarmacks it can rewrite 50,000 Doom per second!

(But that number drops to only 10/s if the rewrites are in Rust)


> (But that number drops to only 10/s if the rewrites are in Rust)

Because of the slow toolchain, or because of the trademark lawsuits?


Yes, you will be able to train an AI model that is capable of beating Doom using this machine.


I think you're more likely to train a model capable of writing Doom using this machine.


If it can infer 35 fps, then it can be Doom.


You mean millions of Doom?


At this point, I'm convinced that someone is here which could make that cluster play trillions of Doom. I mean, just how many pregnancy tests worth of compute is this?


Megadoom


With each doom instance being played by an AI?


Chrome will still find a way to eat all that up and lag.


Epic meme sir, here's your updoot


Best comment of the day!


That RAM is distributed among the GPUs right?

256 x 450 W = 115 kW ≈ 82 MWh/month ≈ $82,000/month at peak EU prices this winter, roughly $1/kWh (which will be normal next winter).

For what, something that gets everything wrong?


It's an agnostic system. They could use this thing for curing cancer or predicting earthquakes, but... just you watch and see what the Free Market uses it for.


Seeing how badly both of those are going, and will go in the future, I guess it's "progress".

But a correction to my comment above: these become one global memory... how slow it is and how much corruption they will have is unknown, but holy hell...


Celebrating new ways to further burn up the planet rather than discovering more efficient and better ways for training, inference and fine-tuning AI systems without needing to scale up more GPUs, TPUs, data centers and water for the same purpose.

The end result of this announcement is another expensive system only available to the same incumbent tech giants with tens of billions at their disposal.


> The end result of this announcement is another expensive system only available to the same incumbent of tech giants with tens of billions at their disposal.

Certainly wasn't the case when a public research university partnership seeded by a generous donation from Nvidia co-founder/UF alumnus Chris Malachowsky was formally announced[1] shortly after DGX A100 launch[2] several years ago, never mind the handful of other academic early adopters mentioned in the press release.

Of course, we tend to conveniently forget such exogenous details.

[1] https://news.ufl.edu/2020/07/nvidia-partnership/

[2] https://nvidianews.nvidia.com/news/nvidias-new-ampere-data-c...


We found a way: nuclear fission reactors. So that problem is solved.


> We found a way: nuclear fission reactors.

Nope. I'm talking about efficient methods for training, inference, and fine-tuning these AI models that don't require lots of data centers, TPUs, GPUs, etc. You're talking about something else.

Petrol and diesel cars are already burning up the planet, but the main difference is that there are efficient alternatives available today, like electric cars, to use instead.

AI (deep learning), however, does not have any viable, efficient methods for training and fine-tuning these models at all [0][1], and it wastes a tremendous amount of resources, all to keep up with scalability.

So that problem is still NOT solved after a decade of using GPUs; the wastage is getting worse.

[0] https://gizmodo.com/chatgpt-ai-water-185000-gallons-training...

[1] https://www.independent.co.uk/tech/chatgpt-data-centre-water...


> efficient methods in training, inferencing and fine-tuning these AI models that doesn't require lots of data centers, TPUs, GPUs, etc.

Exactly the types of problems future AI models could solve.

Dire climate alarms are based on predictions made using models. As modelling advances as a field, both predictions and solutions become more and more voluminous and accurate, while also revealing mistakes and failures of prior models.

Anyone concerned with climate should rally behind this kind of general progress. Further, it simply is progressing, and fields that don't embrace it will be left behind. We're in the midst of an unprecedented revolution which touches everything.


> efficient methods in training, inferencing and fine-tuning these AI models

Which can also be achieved by training more with the same amount of spent energy.

Why learn about training ("make training more efficient") on old hardware, which is more energy inefficient?


It goes deeper than that, into the algorithms: it should not take tens of billions of dollars and multiple data centers to train, fine-tune, and do inference with these AI models. A decade later, there are no viable alternatives other than the costly replacement of hardware with even more expensive hardware.

Add scalability on top of that and you realize that training AI models scales terribly with more data, as it is very energy- and time-inefficient. Even if you replaced all the hardware in the data centers, it still wouldn't reduce emissions, and replacing it costs billions either way. That is my entire point.

So that does nothing to solve the issue. Only ignores and prolongs it.


> A decade later, there are no viable alternatives to solve that instead of the costly replacement of hardware with more expensive hardware.

I mean, that's the root of scaling as a principle, right?

You could viably start training an AI on your cell phone. It would be completely useless, lack meaningful parameter saturation and take months to reach an inferencing checkpoint, but you could do it. Nvidia is offering a similar system to people, but at a scale that doesn't suck like a cellphone does. Businesses can then choose how much power they need on-site, or rent it from a cloud provider.

If a product like this convinces some customers to ditch older and less efficient training silicon, I don't see how it's any more antagonistic than other CPU designers with perennial product updates.


Yet the climate is still changing. Inflation is rising. The world has definitely advanced, but it never became a better place.


The world has never been better. This is an incredible time to be alive.


I don't feel that way personally... I have Nth level anxiety, maybe even N+1 level anxiety about losing my job to AI. Maybe in the future whoever comes next can enjoy things, but this literally keeps me up at night. With talks of extinction, job loss, etc. I feel like I wish I wasn't alive at this time.



