
Why don't they just release a basic GPU with 128GB RAM and eat NVidia's local generative AI lunch? The network effect of all devs porting their LLMs etc. to that card would instantly make them a major threat to CUDA. But the beancounters running the company would never get such an idea...


Disclosure: HPC admin who works with NVIDIA cards here.

Because, no. It's not as simple as that.

NVIDIA has a complete ecosystem now. They have cards. They have cards of cards (platforms), which they produce, validate and sell. They have NVLink crossbars and switches which connect these cards on their cards of cards at very high speed and low latency.

For inter-server communication they have libraries which coordinate cards, workloads and computations.

They bought Mellanox, but that can be used by anyone, so there's no lock-in for now.

As a tangent, NVIDIA has a whole set of standards for pumping tremendous amounts of data in and out of these meshes of cards, be it GPU-Direct Storage or specialized daemons which handle data transfers on and off the cards.

If you think you can connect n cards to a PCIe bus, just send workloads to them and solve problems magically, you'll hurt yourself a lot, both performance-wise and psychologically.

You have to build a stack which can perform these things with the maximum possible performance to be able to compete with NVIDIA. It's not just about emulating CUDA anymore, especially on the high end of the AI spectrum (GenAI, multi-card, multi-system, etc.).

For other, lower-end, multi-tenant scenarios, they have card virtualization, MIG, etc. for card sharing. You have to compete on that, too, for cloud and smaller applications.


I have been hacking on local llama 3 inference software (for the CPU, but I have been thinking about how I would port it to a GPU) and would like to do a rebuttal:

https://github.com/ryao/llama3.c

Inference workloads are easy to parallelize to N cards with minimal connectivity between them. The Nvlink crossbars and switches just are not needed.

In particular, inference can be divided into two distinct phases, which are input processing (prompt processing) and output generation (token generation). They are remarkably different in their requirements. Input processing is compute bound via GEMM operations while output generation is memory bandwidth bound via GEMV operations. Technically, you can do the input processing via GEMV too by processing 1 token at a time, but that is slow, so you do not want to do that. Anyway, these phases can be further subdivided into the model’s layers. You can have 1 GPU per layer with the activations passing from GPU to GPU in a pipeline. The GPUs just need the layer’s weights and the key-value cache for all of the tokens in that layer in memory to be able to work effectively. For llama 3.1 405B, there are 126 layers, so that is up to 126 GPUs.
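
To make the pipeline idea concrete, here is a minimal sketch (hypothetical, not taken from the repository above) of assigning contiguous blocks of layers to GPUs; the layer count is the llama 3.1 405B figure from above and the pipeline depth is an arbitrary assumption:

  #include <stdio.h>

  /* Sketch: split num_layers transformer layers into contiguous blocks,
   * one block per GPU. Each GPU only needs the weights and KV cache for
   * its own block; only a small activation vector crosses each boundary. */
  int main(void) {
      const int num_layers = 126; /* llama 3.1 405B */
      const int num_gpus = 8;     /* assumed pipeline depth */

      for (int gpu = 0; gpu < num_gpus; gpu++) {
          int start = gpu * num_layers / num_gpus;
          int end = (gpu + 1) * num_layers / num_gpus; /* exclusive */
          printf("GPU %d: layers %d..%d (%d layers)\n",
                 gpu, start, end - 1, end - start);
      }
      return 0;
  }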

That is of course slightly slower than if you just had 1 GPU with an incredible amount of VRAM, but you can always have more than one query in flight to get better than 1 GPU’s worth of performance from this pipeline approach. There are other ways of doing parallelization too, such as having output processing use GEMM to do multiple queries in parallel. This would be what others call batching, although I am only interested in doing 1 query at a time right now, so I have not touched it.

In essence, you can connect n cards on PCIe and have them solve inferencing problems magically, with the right software. Training is a different matter and I cannot comment on it as I have not studied it yet.


I presume the counterargument is that inference hosting is commoditized (sort of like how stateless CPU-based containerized workload hosts are commoditized); there’s no margin in that business, because it is parallelizable, and arbitrarily schedulable, and able to be spread across heterogeneous hardware pretty easily (just route individual requests to sub-cluster A or B), preventing any kind of lock-in and thus any kind of rent-extraction by the vendor.

Which therefore means that cards that can only do inference are fungible. You don't want to spend CapEx on getting into a new LOB just to sell something fungible.

All the gigantic GPU clusters that you can sell a million at a time to a bigcorp under a high-margin service contract, meanwhile, are training clusters. Nvidia’s market cap right now is fundamentally built on the model-training “space race” going on between the world’s ~15 big AI companies. That’s the non-fungible market.

For Intel to see any benefit (in stock price terms) from an ML-accelerator-card LOB, it’d have to be a card that competes in that space. And that’s a much taller order.


Intel does make cards aimed at this space too:

https://www.intel.com/content/www/us/en/products/details/pro...

Coincidentally, it has 128GB of RAM. However, it is not a GPU, is designed to do training too and uses expensive HBM.

Modern GPUs can do more than inference/training and the original poster asked about a GPU with 128GB of RAM, not a card that can only do inferencing as you described. Interestingly, Qualcomm made its own card targeted at only inferencing with 128GB of RAM without using HBM:

https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...

They do not sell it through PC parts channels so I do not know the price, but it is exactly what you described and it has been built. Presumably, a GPU with the same memory configuration would be of interest to the original poster.


Back in January, someone on Reddit claimed the list price was $16k.


It's competing against Nvidia H100s, which cost $25k. It's cheap, at least by the norms of the space.


You are both correct. AI inference is a comparatively easy problem from the perspective of parallelization when compared to most HPC problems.


Facebook did a technical paper where they described their training cluster and the sheer amount of complexity is staggering. That said, the original poster was interested in inferencing, not training.

https://arxiv.org/pdf/2407.21783


Training is close to traditional HPC in many ways. Inference is far simpler since it's a simple forward-going pipeline of a relatively small working set.


What kind of bandwidth/latency between GPUs would one need in that setup to avoid becoming the bottleneck? What you're describing sounds quite forgiving. Is it forgiving enough that we could potentially connect those GPUs over a LAN, or even a remote, decentralized cloud of host computers?

From my understanding, that's certainly possible to do without the latency hurting much, given large batching between inference layers.


This depends on:

  * the model dimension
  * how many bits per variable are used by your quantization
  * how many tokens are being processed per step (input processing can do all N input tokens simultaneously and output processing can only do 1 at a time when doing a single query)
  * how many times you split the model layers across multiple GPUs
The model dimensions are:

  * 4096 for llama 3/3.1 8B.
  * 8192 for llama 3/3.1 70B.
  * 16384 for llama 3.1 405B.
The model layers are:

  * 32 for llama 3/3.1 8B
  * 80 for llama 3/3.1 70B
  * 126 for llama 3.1 405B
The amount of data that needs to be transferred for each split is surprisingly small. Each time you move the calculation of a subsequent layer to a different GPU, you need to transfer an array that is of size model_dimension * num_tokens * bits_per_variable. Then this reduces to a classic network transfer time problem, where you consider both time until the first byte arrives and the transfer time until the last byte arrives. Reality will likely be longer than that idealized scenario, especially since you need to send a signal saying to begin computing.

Input processing can tackle so many tokens simultaneously that it probably is not worth thinking too much about this penalty there. Output processing is where the penalty is more significant, since you will incur these costs for every token. Let’s say we are doing fp16 or bf16 on llama 3 8B. Then we need to transfer 8KB every time we move the calculation for another layer to another GPU. If you use RDMA and do this over 10GbE, the transfer time would be 6.4 microseconds. If we assume the time to first byte and time to do a signal to begin processing is 3.6 microseconds combined (chosen to round things up), then we get a penalty of 10 microseconds per split, per token. If you are doing 60 tokens per second and split things across 4 GPUs over the network, you have a penalty of 30 microseconds per token. Against the ~16.7 millisecond per-token budget at 60 tokens per second, that works out to roughly 0.2% slower, and you are not going to notice it at all. Assuming 10GbE with RDMA is somewhat idealized, although I needed to pick something to give some numbers.

In any case, the factor by which generation slows down is 1 / (1 + per-token transfer-and-trigger overhead divided by the baseline per-token time). That means that even in a much less ideal situation, say a 5 millisecond penalty per token against a one-second-per-token baseline, the calculation runs at ~0.995 times the speed it would have had on a hypothetical single GPU with all of the VRAM needed but otherwise the same specifications. That penalty is also not very noticeable.
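
A small sketch of the arithmetic above, using the same assumptions (llama 3 8B at fp16, idealized 10GbE with RDMA, a 3.6 microsecond fixed overhead per hop, 4 GPUs in the pipeline, 60 tokens per second baseline):

  #include <stdio.h>

  int main(void) {
      /* Assumptions taken from the discussion above. */
      const double model_dim = 4096;        /* llama 3 8B */
      const double bytes_per_var = 2;       /* fp16/bf16 */
      const double link_bits_per_s = 10e9;  /* idealized 10GbE with RDMA */
      const double fixed_overhead = 3.6e-6; /* first byte + "go" signal, assumed */
      const int num_splits = 3;             /* 4 GPUs -> 3 boundaries */
      const double tokens_per_s = 60.0;     /* baseline generation speed */

      double bytes_per_split = model_dim * 1 * bytes_per_var; /* one token per step */
      double transfer_time = bytes_per_split * 8.0 / link_bits_per_s;
      double penalty_per_token = num_splits * (transfer_time + fixed_overhead);
      double baseline_token_time = 1.0 / tokens_per_s;
      double speed_factor = baseline_token_time
                            / (baseline_token_time + penalty_per_token);

      printf("bytes per split:   %.0f\n", bytes_per_split);
      printf("penalty per token: %.1f us\n", penalty_per_token * 1e6);
      printf("relative speed:    %.4f of the single-GPU baseline\n", speed_factor);
      return 0;
  }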

In conclusion, you are right. I actually recall seeing a random YouTube video about a software project that does clustered inferencing, so people are already doing this. Unfortunately, I do not remember the project name, channel name or video name.


I think he's mostly referring to inference and not training, which I entirely agree with - a 4x version of this card for workstations would do really well - even some basic interconnect between the cards a la nvlink would really drive this home.

The training can come after, with some inference and runtime optimizations on the software stack.


Most of the above infra is predicated on limiting RAM so that you need so much communication between cards. Bump the RAM up and you could do single card inference and all those connections become overhead that could have gone to more ram. For training there is an argument still, but even there the more RAM you have the less all that connectivity gains you. RAM has been used to sell cards and servers for a long time now, it is time to open the floodgates.


Correct for inference - the main use of the interconnect is RDMA requests between GPUs to fit models that wouldn't otherwise fit.

Not really correct for training - training has a lot of all-to-all problems, so hierarchical reduction is useful but doesn't really solve the incast problem - Nvlink _bandwidth_ is less of an issue than perhaps the SHARP functions in the NVLink switch ASICs.


All of that is highly relevant for training but what the poster was asking for is a desktop inference card.


You use at least half of this stack for desktop setups. You need copying daemons, the ecosystem support (docker-nvidia, etc.), some of the libraries, etc. even when you're on a single system.

If you're doing inference on a server; MIG comes into play. If you're doing inference on a larger cloud, GPU-direct storage comes into play.

It's all modular.


It's possible you're underestimating the open source community.

If there's a competing platform that hobbyists can tinker with, the ecosystem can improve quite rapidly, especially when the competing platform is completely closed and hobbyists basically are locked out and have no alternative.


> It's possible you're underestimating the open source community.

On the contrary. You don't know how much I love and prefer open source and a more level playing field.

> If there's a competing platform that hobbyists can tinker with...

AMD's cards are better from a hardware and software architecture standpoint, but the performance is not there yet. Plus, the ROCm libraries are not that mature, though they're getting there. Developing high-performance, high-quality code is deceptively expensive, because it's very theory-heavy and you fly very close to the metal. I did that for my Ph.D., so I know what it entails. It requires more than a couple (hundred) hobbyists to pull off (see the development of the Eigen linear algebra library, or any high-end math library).

Some big guns are pouring money into AMD to implement good ROCm libraries, and it started paying off (Debian has a ton of ROCm packages now, too). However, you need to be able to pull it off in the datacenter to be able to pull it off on the desktop.

AMD also needs to be able to enable ROCm on desktop properly, so people can start hacking it at home.

> especially when the competing platform is completely closed...

NVIDIA gives a lot of support to universities, researchers and institutions who play with their cards. Big cards may not be free, but know-how, support and first steps are always within reach. Plus, their researchers dogfood their own cards, and write papers with them.

So, as long as papers get published, researchers get to do their research, and something gets invented, many people don't care how open source the ecosystem is. This upsets me a ton. Closed-source AI companies, and researchers who leave crucial details out of their papers so that what they did can't be reproduced, don't care about open source, because they think like NVIDIA: "My research, my secrets, my fame, my money".

It's not about sharing. It's about winning, and it's ugly in some aspects.


Yep, this thread has a good compilation of ROCm woes: https://news.ycombinator.com/item?id=34832660

That said, for hobbyist inference on large pretrained models, I think there is an interesting set of possibilities here: maybe a number of operations aren't optimized, and it takes 10x as long to load the model into memory... but all that might not matter if AMD were to be the first to market for 128GB+ VRAM cards that are the only things that can run next-generation open-weight models in a desktop environment, particularly those generating video and images. The hobbyists don't need to optimize all the linear algebra operations that researchers need to be able to experiment with when training; they just need to implement the ones used by the open-weight models.

But of course this is all just wishful thinking, because as others have pointed out, any developments in this direction would require a level of foresight that AMD simply hasn't historically shown.


IDK, I found a two-year-old post with links to running llama and SD on an Arc [0] (although it might be Linux only). I feel like a cheap, huge-RAM card would create a 'critical mass' for optimization to start, and then over the longer term Intel could promise and deliver 'scale up' improvements.

It would be a huge shift for them: to go from chasing some (sometimes not quite reached) metric to, perhaps rightly, playing the 'reformed underdog' and commoditizing big-memory, ML-capable GPUs, even if they aren't quite as competitive as the top players at first.

Will the other players respond? Yes. But ruin their margins. I know that sounds cutthroat [1], but hey, I'm trying to hypothetically sell this to whoever is taking the reins after Pat G.

> NVIDIA gives a lot of support to universities, researchers and institutions who play with their cards. Big cards may not be free, but know-how, support and first steps are always within reach. Plus, their researchers dogfood their own cards, and write papers with them.

Ideally they need to do that too. Ideally they have some 'high powered' prototypes to share as well (e.g. let's say they decide a 2-GPU-per-card design with an interlink is feasible for some reason). This may not be entirely ethical [1] in this example of how a corp could play it out; again, it's a thought experiment, since Intel has NOT announced or hinted at a larger-memory card anyway.

> AMD also needs to be able to enable ROCm on desktop properly, so people can start hacking it at home

AMD's driver story has always been a hot mess. My desktop won't behave with both my onboard video and 4060 enabled, and every AMD card I've had winds up with some weird firmware quirk one way or another... I guess I'm saying their general level of driver quality doesn't lend much hope that they'll fix dev tools any time soon...

[0] - https://old.reddit.com/r/LocalLLaMA/comments/12khkka/running...

[1] - As you said, it's about winning and it can get ugly.


ROCm doesn't really matter when the hardware is almost the same as Nvidia's cards. AMD is not selling a "cheaper" card with a lot of RAM, which is what the original poster was asking for (and a reason why people who like to tinker with large models are using Macs).


You're writing as if AMD cares about open source. If they would only actually open source their driver the community would have made their cards better than nvidia ones long ago.

I'm one of those academics. You've got it all wrong. So many people care about open source. So many people carefully release their code and make everything reproducible.

We desperately just want AMD to open up. They just refuse. There's nothing secret going on and there's no conspiracy. There's just a company that for some inexplicable reason doesn't want to make boatloads of money for free.

AMD is the worst possible situation. They're hostile to us and they refuse to invest to make their stuff work.


> If they would only actually open source their driver the community would have made their cards better than nvidia ones long ago.

Software wise, maybe. But you can't change AMD's hardware with a magic wand, and that's where a lot of CUDA's optimizations come from. AMD's GPU architecture is optimized for raster compute, and it's been that way for decades.

I can assure you that AMD does not have a magic button to press that would make their systems competitive for AI. If that was possible it would have been done years ago, with or without their consent. The problem is deeper and extends to design decisions and disagreement over the complexity of GPU designs. If you compare AMD's cards to Nvidia on "fair ground" (eg. no CUDA, only OpenCL) the GPGPU performance still leans in Nvidia's favor.


RDNA and UDNA have progressed more towards compute-unified architecture than Vega&CDNA's raster-first architecture.


That would require competently produced documentation. Intel can't do that for any of their side projects because their MBAs don't get a bonus if the tech writers are treated as a valuable asset.


Innovation is a bottom up process. If they sell the hardware the community will spring up to take advantage.


No. I've been reading up. I'm planning to run Flux 12b on my AMD 5700G with 64GB RAM. The CPU will take 5-10 minutes per image, which will be fine for me tinkering while writing code. Maybe I'll be able to get the GPU going on it too.

The point of the OP is that this is entirely possible even with an iGPU, if only we have the RAM. Nvidia should be irrelevant for local inference.


The Ryzen 5700G is one of the APUs tested on the Debian ROCm CI [1]. It works quite well with the Debian / Ubuntu system packages.

[1]: http://ci.rocm.debian.net/


No, you don't need much bandwidth between cards for inference.


Copying daemons (gdrcopy) are about pumping data in and out of a single card. docker-nvidia and the rest of the stack are enablement for using the cards.

GPU-Direct is about pumping data from storage devices to cards, esp. from high speed storage systems across networks.

MIG actually shares a single card to multiple instances, so many processes or VMs can use a single card for smaller tasks.

Nothing I have written in my previous comment is related to inter-card or inter-server communication; it is all related to disk-GPU, CPU-GPU or RAM-CPU communication.

Edit: I mean, I know it's not OK to talk about downvoting, and downvote as you like, but I install and enable these cards for researchers. I know what I'm installing and what it does. C'mon now. :D


Mostly, I think, we don’t really understand your argument that Intel couldn’t easily replicate the parts needed only for inference.


Yeah, for example llama.cpp runs on Intel GPUs via Vulkan or SYCL. The latter is actively being maintained by Intel developers.

Obviously that is only one piece of software, but it's certainly a useful one if you are using one of the many LLMs it supports.


I've run inference on Intel Arc and it works just fine, so I am not sure what you're talking about. I certainly didn't need Docker! I've never tried to do anything on AMD yet.

I had the 16GB Arc, and it was able to run inference at the speed I expected, but twice as many per batch as my 8GB card, which I think is about what you'd expect.

Once the model is on the card, there's no "disk" anymore, so having more VRAM to load the model and the tokenizer and whatever else onto means there's no disk, and realistically, when I am running loads on my 24GB 3090, the CPU is maybe 4% over idle usage. My bottleneck, as it stands, to running large models is VRAM, not anything else.

If I needed to train (from scratch or whatever) I'd just rent time somewhere, even with a 128GB card locally, because obviously more tensors is better.

And you're getting downvoted because there are literally LM Studio, llama.cpp and sd-webui, which run just fine for inference on our non-DC, non-NVLink, 1/15th-the-cost GPUs.


Inferencing is much simpler than you think:

See the precompute_input_logits() and forward() functions here:

https://github.com/ryao/llama3.c/blob/master/run.c#L520

As a preface, precompute_input_logits() is really just a generalized version of the forward() function that can operate on multiple input tokens at a time to do faster input processing, although it can be used in place of the forward() function for output generation just by passing only a single token at a time.

Also, my apologies for the code being a bit messy. matrix_multiply() and batched_matrix_multiply() are wrappers for GEMM, which I ended up having to use directly anyway when I needed to do strided access. Then matmul() is a wrapper for GEMV, which is really just a special case of GEMM. This is a work in progress personal R&D project that is based on prior work others did (as it spared me from having to do the legwork to implement the less interesting parts of inferencing), so it is not meant to be pretty.

Anyway, my purpose in providing that link is to show what is needed to do inferencing (on llama 3). You have a bunch of matrix weights, plus a lookup table for vectors that represent tokens, in memory. Then your operations are:

  * memcpy()
  * memset()
  * GEMM (GEMV is a special case of GEMM)
  * sinf()
  * cosf()
  * expf()
  * sqrtf()
  * rmsnorm (see the C function for the definition)
  * softmax (see the C function for the definition)
  * Addition, subtraction, multiplication and division.
I specify rmsnorm and softmax for completeness, but they can be implemented in terms of the other operations.

If you can do those, you can do inferencing. You don’t really need very specialized things. Over 95% of time will be spent in GEMM too.
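
For the two named helpers, here are minimal sketches in the spirit of the llama2.c-style kernels the repository builds on; the real run.c definitions may differ in details such as the epsilon:

  #include <math.h>

  /* RMSNorm: scale x by 1/sqrt(mean(x^2) + eps), then by learned weights. */
  static void rmsnorm(float *out, const float *x, const float *weight, int size) {
      float ss = 0.0f;
      for (int i = 0; i < size; i++) ss += x[i] * x[i];
      ss = 1.0f / sqrtf(ss / size + 1e-5f);
      for (int i = 0; i < size; i++) out[i] = weight[i] * (ss * x[i]);
  }

  /* Softmax in place, subtracting the max first for numerical stability. */
  static void softmax(float *x, int size) {
      float max_val = x[0];
      for (int i = 1; i < size; i++) if (x[i] > max_val) max_val = x[i];
      float sum = 0.0f;
      for (int i = 0; i < size; i++) { x[i] = expf(x[i] - max_val); sum += x[i]; }
      for (int i = 0; i < size; i++) x[i] /= sum;
  }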

My next steps likely will be to figure out how to implement fast GEMM kernels on my CPU. While my own SGEMV code outperforms the Intel MKL SGEMV code on my CPU (Ryzen 7 5800X where 1 core can use all memory bandwidth), my initial attempts at implementing SGEMM have not fared quite so well, but I will likely figure it out eventually. After I do, I can try adapting this to FP16 and then memory usage will finally be low enough that I can port it to a GPU with 24GB of VRAM. That would enable me to do what I say is possible rather than just saying it as I do here.
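
For reference, a naive single-threaded SGEMV (y = W*x) looks like the sketch below; it is nothing like the optimized MKL kernels being compared against, but it shows why the operation is memory-bandwidth bound: every weight is read exactly once and used exactly once.

  #include <stddef.h>

  /* Naive SGEMV baseline: y = W * x, with W stored row-major (rows x cols). */
  static void sgemv_naive(float *y, const float *W, const float *x,
                          int rows, int cols) {
      for (int r = 0; r < rows; r++) {
          float acc = 0.0f;
          for (int c = 0; c < cols; c++)
              acc += W[(size_t)r * cols + c] * x[c];
          y[r] = acc;
      }
  }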

By the way, the llama.cpp project has already figured all of this out and has things running on both GPUs and CPUs using just about every major quantization. I am rolling my own to teach myself how things work. By happy coincidence, I am somehow outperforming llama.cpp in prompt processing on my CPU but sadly, the secrets of how I am doing it are in Intel’s proprietary cblas_sgemm_batch() function. However, since I know it is possible for the hardware to perform like that, I can keep trying ideas for my own implementation until I get something that performs at the same level or better.


I am favoriting this comment for reference later when I start poking around in the base level stuff. I find it pretty funny how simple this stuff can get. Have you messed with ternary computing inference yet? I imagine that shrinks the list even further - or at least reduces the compute requirements in favor of brute force addition. https://arxiv.org/html/2410.00907


No. I still have a long list of things to try doing with llama 3 (and possibly later llama 3.1) on more normal formats like fp32 and in the future, fp16. When I get things running on a GPU, I plan to try using bf16 and maybe fp8 if I get hardware that supports it. Low bit quantizations hurt model quality, so I am not very interested in them. Maybe that will change if good quality models trained to use them become available.


> but sadly, the secrets of how I am doing it are in Intel’s proprietary cblas_sgemm_batch() function.

Perhaps you can reverse engineer it?


My plan is to make more attempts at rolling my own. Reverse engineering things is not something that I do, since that would prevent me from publishing the result as OSS.


Rather than tackling the entire market at once, they could start with one section and build from there. NVIDIA didn't get to where it was in a year, it took many strategic acquisitions. (All the networking and other HPC-specialized stuff I was buying a decade ago has seemingly been bought by NVIDIA).

Start by being a "second vendor" for huge customers of NVIDIA that want to foster competition, as well as a few others willing to take risks, and build from there.


Intel has already bought and killed everything they need to compete here. They seem incapable of sticking to any market that isn’t x86. Likely because when they were making those acquisitions they were drunk on margin and didn’t want to focus anywhere else.


> They seem incapable of sticking to any market that isn’t x86

Even within that they seem to have difficulty expanding into new areas. Remember Edison?


Killing QLogic’s infiniband business after buying it was a major loss for the industry.


On servers you're right: but for local LLM inference, I think you're wrong. For local LLMs most people are bottlenecked by not having enough VRAM: pretty much no one is running a 70b model on Nvidia GPUs locally, just due to the expense. You don't need maximum performance: you need it to run at all, which most people can't do for the good models — at least, not without heavy quantization that pretty badly lobotomizes them.

Apple is the king right now of local LLM inference, just because of their unified memory architecture meaning that people can get large amounts of "VRAM" (since all RAM is VRAM). They're not as fast as Nvidia — not even close to an H100, for example. But they don't need to be. No consumer can afford an H100, but they can afford a Mac.


So what you're saying is Intel, or any other would-be NVIDIA competitor, needs to put out fast interconnects, not just compute cards. This is true.

I'm not sure your argument stands when it comes to OP's idea of a single card with 128GB VRAM. This would be enough to run ~180B models with reasonable quantization --we're not near maxing out the capability of 180B yet (see the latest 32B models performing near public SOTA).

This indeed would push rapid and wide adoption and be quite disruptive. But sure, it wouldn't instantly enable competitive training of 405B models.


> NVIDIA has a complete ecosystem now. They have cards. They have cards of cards

Nvidia is starting to sound like a house of cards to me.


You are of course right WRT datacenter use of GPUs. The OP spoke about local generation though. It is of course a smaller market, but a market nevertheless, and the more amateurs and students are using your product, the more of them would consider applying it in more professional settings.

Sun (Sparc) and HP (PA-RISC) used to own most of the server market in 1990, but lost most of it to x86 by 2000. Few people had a Sun box with Solaris, but tons of people had access to a PC with Linux, which was inferior in many ways, but well-known and much less locked-up.


Let's see how quickly that changes if Intel releases cards with massive amounts of RAM for a fraction of the cost.


Yes, but if Intel provided the base, people would flock to it and build software for it. Intel doesn't even need to be involved, but they should.


But all the tricks of developing a solid software stack to support the HW are already out there, no? The basic principles are known; to my understanding, the main challenges are not doing this development in tandem with the HW people, and the requirement to support older legacy devices, which make it harder for AMD, for example, to compete. The only challenges Intel is prepped to face are logistics and fabs. On a separate note, projects like JAX are aiming to circumvent the abstraction layer CUDA adds to Nvidia, so having decent hardware competition is definitely an option. Just some time ago, vLLM gained full support for AMD GPUs! We need more competition.


> ... local generative AI lunch

I agree; just for my PC, something that'd enable small devs to create interesting foundation model apps that deploy to users who use these local AI cards to run the new apps.


There might be a chicken-and-egg problem if the apps end up requiring a 128GB AI accelerator card. You only get the card if there are apps to run, and you only develop the apps if the cards are widespread. With so much RAM, the cards will not be let's-throw-them-into-a-cheap-build cheap.

I think there have to be a couple of killer apps that run "OK" with CPU or GPU, but would run tremendously better with such a card.


I have a question for you, since I’m somewhat entering the HPC world. In the EU, the EuroHPC JU is building what they call AI factories; afaict these are just batch processing (Slurm, I think) clusters with GPUs in the nodes. So I wonder where you’d place those cards of cards. Are you saying there are other, perhaps better, ways to use massive amounts of these cards? Or is that still in the “super powerful workstation” domain? Thanks in advance.


View it as a Raspberry Pi for AI workloads. The initial stage is for enthusiasts who would develop the infra, figure out what is possible and spread the word. The next phase would be SME industry adoption, making it commercially interesting while bypassing Nvidia completely. At some point it would take on a life of its own and the big players would jump in. A classic disruption strategy via low-cost, unique offerings.


Pretty sure they are talking about inference in the post you are responding to. Training the model obviously needs far more compute, but running them locally is what most devs are interested in.


Off topic: I think, in the long term, inference should be done along with some kind of training.


How does any of this make money?


When this walled garden is the only way to use GPUs with high efficiency, everybody is using this stack, and NVIDIA controls the supply of these "platform boards" to OEMs, they don't just make money, they literally print it.

However, AMD is coming for them, because a couple of high-profile supercomputer centers (LUMI, Livermore, etc.) are using Instinct cards and pouring money into AMD to improve their cards and stack.

I have not used their (Instinct) cards yet, but their Linux driver architecture is way better than NVIDIA's.


I think he was asking how HPC makes money. The answer as far as I know is that it often does not for anyone other than the vendors.

I wonder where you got your information on AMD’s “Linux driver architecture”. It is reportedly a mess:

https://news.ycombinator.com/item?id=34832660

So far, I have been very happy with Nvidia’s Linux drivers.


HPC is used in research, which is often not expected to make money. The hope is that the research will result in something that makes money. One example would be drug discovery. Another would be weather prediction, which is not so much a way to make money, but to minimize losses.


Having the complete ecosystem affords them significant margins.


Against what?


As of today, they have SaaS-company margins as a hardware company, which is practically unheard of.


I mean, do they?


What?

It's like the most profitable set of products in tech. You have companies like Meta, MSFT, Amazon, Google etc spending $5B every few years buying this hardware.


Stale money is moving around. Nothing changed.


What is stale money?


Hmm. There is a lot of money that exists, doing nothing. I consider that stale money.

Edit: I can’t sort this out. Where did all the money go?


Have you looked under your sofa?


Sadly it’s just stale goldfish and magnetites.


> Disclosure: HPC admin who works with NVIDIA cards here.
> Because, no. It's not as simple as that.

Wow what he said is way above your head! Please reread what he wrote.


Just how "basic" do you think a GPU can be while having the capability to interface with that much DRAM? Getting there with GDDR6 would require a really wide memory bus even if you could get it to operate with multiple ranks. Getting to 128GB with LPDDR5x would be possible with the 256-bit bus width they used on the top parts of the last generation, but would result in having half the bandwidth of an already mediocre card. "Just add more RAM" doesn't work the way you wish it could.


M3/M4 Max MacBooks with 128GB RAM are already way better than an A6000 for very large local LLMs. So even if the GPU were as slow as the one in the M3/M4 Max (below a 3070), and used some basic RAM like LPDDR5x, it would still be way faster than anything from NVidia at that model size.


The M4 Max needs an enormous 512bit memory bus to extract enough bandwidth out of those LPDDR5x chips, while the GPUs that Intel just launched are 192/160bit and even flagships rarely exceed 384bit. They can't just slap more memory on the board, they would need to dedicate significantly more silicon area to memory IO and drive up the cost of the part, assuming their architecture would even scale that wide without hitting weird bottlenecks.


The memory controller would be bigger, and the cost would be higher, but not radically higher. It would be an attractive product for local inference even at triple the current price and the development expense would be 100% justified if it helped Intel get any kind of foothold in the ML market.


Man, I'm old enough to remember when 512-bit was a thing for consumer cards, back when we had 4-8GB of memory.

Sure, that was only GDDR5 and not GDDR6 or LPDDR5, but I would have bet we'd be up to 512-bit again 10 years down the line...

(I mean, supposedly HBM3 has done 1024-2048-bit buses, but that seems to be more for research or super high-end cards, not consumer.)


Rumor is the 5090 will be bringing back the 512bit bus, for a whopping 1.5TB/sec bandwidth.


> They can't just slap more memory on the board

Why not? It doesn't have to be balanced. RAM is cheap. You would get an affordable card that can hold a large model and still do inference e.g. 4x faster than a CPU. The 128GB card doesn't have to do inference on a 128GB model as fast as a 16GB card does on a 16GB model, it can be slower than that and still faster than any cost-competitive alternative at that size.

The extra RAM also lets you do things like load a sparse mixture of experts model entirely into the GPU, which will perform well even on lower end GPUs with less bandwidth because you don't have to stream the whole model for each token, but you do need enough RAM for the whole model because you don't know ahead of time which parts you'll need.


To get 128GB of RAM on a GPU you'd need at least a 1024-bit bus. GDDR6X tops out at 16Gbit per chip with a 32-bit interface, so you'd need 64 GDDR6X chips, and good luck even trying to fit those around the GPU die, since traces need to be the same length and you want to keep them as short as possible. There's also a good chance you can't run a clamshell setup, because the 32 GDDR6X chips on the back of the board would kick off way too much heat to be cooled there, so you'd have to double the bus width to 2048 bits. Such a ridiculous setup would obviously be extremely expensive and would use way too much power.

A more sensible alternative would be going with HBM, except good luck getting any capacity for that since it's all being used for the extremely high-margin data center GPUs. HBM is also extremely expensive, both in terms of the cost of buying the chips and due to its advanced packaging requirements.


You do not need a 1024-bit bus to put 128GB of some DDR variant on a GPU. You could do a 512-bit bus with dual rank memory. The 3090 had a 384-bit bus with dual rank memory and going to 512-bit from that is not much of a leap.

This assumes you use 32Gbit chips, which will likely be available in the near future. Interestingly, the GDDR7 specification allows for 64Gbit chips:

> the GDDR7 standard officially adds support for 64Gbit DRAM devices, twice the 32Gbit max capacity of GDDR6/GDDR6X

https://www.anandtech.com/show/21287/jedec-publishes-gddr7-s...
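
A quick sanity check of that configuration (the 32Gbit density is the near-future assumption above; 32-bit-wide devices and dual-rank population as described):

  #include <stdio.h>

  int main(void) {
      const int bus_width_bits = 512;
      const int device_width_bits = 32; /* one GDDR channel per device */
      const int device_gbit = 32;       /* assumed near-future density */
      const int ranks = 2;              /* dual rank, as on the 3090 */

      int devices = (bus_width_bits / device_width_bits) * ranks; /* 32 chips */
      int total_gb = devices * device_gbit / 8;                   /* 128 GB */
      printf("%d devices -> %d GB on a %d-bit bus\n",
             devices, total_gb, bus_width_bits);
      return 0;
  }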


Yeah, the idea that you're limited by bus width is kind of silly. If you're using ordinary DDR5 then consider that desktops can handle 192GB of memory with a 128-bit memory bus, implying that you get 576GB with a 384-bit bus and 768GB at 512-bit. That's before you even consider using registered memory, which is "more expensive" but not that much more expensive.

And if you want to have some real fun, cause "registered GDDR" to be a thing.


> They can't just slap more memory on the board, they would need to dedicate significantly more silicon area to memory IO and drive up the cost of the part,

In the pedantic sense of just literally slapping more on existing boards? No, they might have one empty spot for an extra BGA VRAM chip, but not enough for the gains we're talking about. But this is absolutely possible, trivially so for someone like Intel/AMD/NVidia that has full control over the architectural and design process. Is it a switch they flip at the factory 3 days before shipping? No, obviously not. But if they had intended this ~2 years ago, when this was just a product on the drawing board? Absolutely. There is zero technical/hardware/manufacturing reason they couldn't do this. And considering the "entry level" competitor product is the M4 Max, which starts at at least $3,000 (for a 128GB-equipped one), the pricing margin more than exists to cover a few hundred extra in RAM and the extra overhead of higher-layer, more populated PCBs.

The real impediment is what you landed on at the end there, combined with the greater ecosystem not having support for it. Intel could drop a card that is, by all rights, far better performing hardware than a competing Nvidia GPU, but Nvidia's dominance in APIs, CUDA, networking, fabric switches (NVLink, Mellanox, BlueField), etc. for the past 10+ years, and all of the skilled labor that is familiar with it, would largely render a 128GB Arc GPU a dud on delivery, even if it were priced as a steal. The same thing happened with the Radeon VII: a killer compute card that no one used, because while the card itself was phenomenal, the rest of the ecosystem just wasn't there.

Now, if Intel committed to that card, poured their considerable resources into that ecosystem, and continued to iterate on that card/family, then we're talking. But yeah, you can't just 10x the VRAM on a card that's currently a non-player in the GPGPU market and expect anyone in the industry to really give a damn. Raise an eyebrow or make a note to check back in a year? Sure. But raise the issue to get a green light on the corpo credit line? Fat chance.


A 128GB VRAM Intel Arc card at a low price would be an OSS developer’s dream come true. It would be the WRT54G of inference.


Of course. A cheap card with oodles of VRAM would benefit some people, I'm not denying that. I'm tackling the question of whether it would benefit Intel (as the original question was "why doesn't Intel do this"), and the answer is: profit/loss.

There's a huge number of people in that community who would love to have such a card. How many are actually willing and able to pony up >=$3k per unit? How many units would they buy? Given all of the other considerations that go into making such cards useful and easy to use (as described), the answer is - in Intel's mind - nowhere near enough, especially when the financial side of the company's jimmies are so rustled that they sacked Pat G without a proper replacement and nominated some finance bros as interim CEOs. Intel is ALREADY taking on a big risk and financial burden trying to get into this space in the first place, and they're already struggling, so the prospect of betting the house like that just isn't going to fly for the finance bros who can't see past the next 2 quarters.

To be clear, I personally think there is huge potential value in trying to support the OSS community to, in essence, "crowd source" and speedrun some of that ecosystem, by supplying (compared to the competition) "cheap" cards that eschew the artificial segmentation everyone else is doing and by investing in that community. But I'm not running Intel, so while that'd be nice, it's not really relevant.


I suspect that Intel could hit a $2000 price point for a 128GB ARC card, but even at $3000, it would certainly be cheaper than buying 8 A770 LE ARC cards and connecting them to a machine. There are likely tens of thousands of people buying multiple GPUs to run local inferencing on Reddit’s local llama, and that is a subset of the market.

In Q1 2023, Intel sold 250,000 ARC cards. Sales then collapsed the next quarter. I would expect sales to easily exceed that and be maintained. The demand for high memory GPUs is far higher than many realize. You have professional inferencing operations such as the ones listed at openrouter.ai that would gobble up 128GB VRAM ARC cards for running smaller high context models, much like how you have businesses gobbling up the Raspberry Pi for low end tasks, without even considering the local inference community.


> The M4 Max needs an enormous 512bit memory bus to extract enough bandwidth out of those LPDDR5x chips

Does M4 Max have 64-byte cache lines?

If they can fetch or flush an entire cache line in a single memory-bus transaction, I wonder if that opens up any additional hardware / performance optimizations.


> Does M4 Max have 64-byte cache lines?

on the CPU side: 64 bytes at L1, 128 byte cachelines at L2


A single memory transaction is almost always a 16n burst for LPDDR5x.


Apple could do it. Why can’t Intel?


Because Apple isn't playing the same game as everyone else. They have the money and clout to buy out TSMCs bleeding-edge processes and leave everyone else with the scraps, and their silicon is only sold in machines with extremely fat margins that can easily absorb the BOM cost of making huge chips on the most expensive processes money can buy.


Bleeding edge processes is what Intel specializes in. Unlike Apple, they don’t need TSMC. This should have been a huge advantage for Intel. Maybe that’s why Gelsinger got the boot.


> Bleeding edge processes is what Intel specializes in. Unlike Apple, they don’t need TSMC.

Intel literally outsourced their Arrow Lake manufacturing to TSMC because they couldn't fabricate the parts themselves - their 20A (2nm) process node never reached a production-ready state, and was eventually cancelled about a month ago.


OK, so the question becomes: TSMC could do it. Why can’t Intel?


Intel is maybe a year or two behind TSMC right now. They might or might not catch up, since it is a moving target, but I don't think there is anything TSMC is doing today that Intel won't be doing in the near future.


They have been trying … for like 10 years.


Wasn't it cancelled in favor of 18A?


It was, but that only puts them further away from shipping product.


These days, Intel merely specializes in bleeding processes. They spent far too many years believing the unrealistic promises from their fab division, and in the past few years they've been suffering the consequences as the problems are too big to be covered up by the cost savings of vertical integration.


Intel's foundry side has been floundering so hard that they've resorted to using TSMC themselves in an attempt to keep up with AMD. Their recently launched CPUs are a mix of Intel-made and TSMC-made chiplets, but the latter accounts for most of the die area.


I'm not certain this is quite as damning as it sounds. My understanding is that the foundry business was intentionally walled off from the product business, and that the latter wasn't going to be treated as a privileged customer.


No, in fact, it sounds even more damning, because the client side was able to pick whatever was best on the market, and it wasn't Intel. The client side could learn and customize their designs to use another company's processes (an extremely hard thing to do, by the way) faster than Intel Foundry could even get its pants on in the morning.

Intel Foundry screwed up so badly that Nokia's server division was almost shut down because of Intel Foundry's failure. (Imagine being so bad at your job that your clients go out of business.) If Intel's client side had chosen to use Foundry, there just wouldn't be any chips to sell.


Intel Arc hardware is manufactured by TSMC, specifically on N6 and N5 for this latest announcement.

Intel doesn't currently have nodes competitive with TSMC or excess capacity in their better processes.


Serious question, why don't they have excess capacity? They aren't producing many CPUs...that people want.


Hard to have excess capacity when your latest fabs aren't able to produce anything reliably.

They don't even have competitive capacity for all their CPU needs. They have negative spare capacity overall.


Because their production capacity is much smaller than TSMC, it's declined over the past few years, and their newer nodes are having yield issues.


If you think Intel has bleeding-edge processes: that hasn't been the case for over 6 years...


> and their silicon is only sold in machines with extremely fat margins

Like the brand new Mini that cost 600 USD and went to 500 during Black week.


The $600 mini is the M4 with the anemic 10-core GPU. This is for grandma or your 10-year-old.

The good one, which is still slower than the M4 Max, is $2200.

If you want the Max you need at least a MacBook Pro starting at $3200, and if you want the better one with 128GB RAM it starts at about $5k.


No matter how few GPU cores the M4 has, it is still an extremely potent product on the whole.

Better than most of the PCs out there.


Do you mean subjectively because you find Mac pleasant to use? The $600 mini certainly isn't more performant than 85% of desktops.


> The $600 mini certainly isn't more performant than 85% of desktops.

The average desktop isn't exactly a gaming machine. But a corporate box or a low end home desktop with a Core i5 and using the iGPU.


Transistor IO logic scaling died a while ago, which is what prompted AMD to go with a chiplet architecture. Being on a more advanced process does not make implementing a 512-bit memory bus any easier for Apple. If anything, it makes it more expensive for Apple than it would be for Intel.


Because LPDDR5x is soldered-on RAM.

Everyone else wants configurable RAM that scales both down (to 16GB) and up (to 2TB), to cover smaller laptops and bigger servers.

GPUs with soldered-on RAM have 500GB/sec bandwidths, far in excess of Apple's chips. So the 8GB or 16GB offered by NVidia or AMD is just far superior at video game graphics (where textures are the priority).


> GPUs with soldered on RAM has 500GB/sec bandwidths, far in excess of Apples chips.

Apple is doing 800GB/sec on the M2 Ultra and should reach about 1TB/sec with the M4 Ultra, but that's still lagging behind GPUs. The 4090 was already at the 1TB/sec mark two years ago, the 5090 is supposedly aiming for 1.5TB/sec, and the H200 is doing 5TB/sec.


HBM is kind of not fair, lol. But a 4096-bit bus is gonna have more bandwidth than any competitor.

It's pretty expensive though.

The 500GB/sec number is for a more ordinary GPU like the B580 Battlemage in the $250ish price range. Obviously the $2000ish 4090 will be better, but I don't expect the typical consumer to be using those.


But an on-package memory bus has some of the advantages of HBM, just to a lesser extent, so it's arguably comparable as an "intermediate stage" between RAM chips and HBM. Distances are shorter (so voltage drop and capacitance are lower, so it can be driven at lower power), and routing is more complex but can be worked around with more layers, which increases cost but over a significantly smaller area than required for DIMMs. The DIMM connections themselves can also hurt performance (reflections from poor contacts, optional termination makes things more complex, and the expectation of mix-and-match across DIMM vendors and products likely reduces fine-tuning possibilities).

There's pretty much an inverse relationship between flexibility and performance: DIMMs > soldered RAM > on-package RAM > die interconnects.


The question is why Intel GPUs, which already have soldered memory, aren't sold with more of it. The market here isn't something that can beat enterprise GPUs at training, it's something that can beat desktop CPUs at inference with enough VRAM to fit large models at an affordable price.


Intel also has Lunar Lake CPUs with on-package RAM. They could have added more memory channels like Apple did.


It doesn't matter if the "cost is driven up". Nvidia has proven that we're all lil pay pigs for them. The 5090 will be $3000 for 32GB of VRAM. Screenshot this now, it will age well.

We'd be happy to pay $5000 for 128GB from Intel.


You are absolutely correct, and even my non-prophetic ass echoed exactly the first sentence of the top comment in this HN thread ("Why don't they just release a basic GPU with 128GB RAM and eat NVidia's local generative AI lunch?").

Yes, yes, it's not trivial to have a GPU with 128GB of memory, with cache tags and so on, but is that really in the same universe of complexity as taking on Nvidia and their CUDA / AI moat any other way? Did Intel ever give the impression they don't know how to design a cache? There really has to be a GOOD reason for this, otherwise everyone involved with this launch is just plain stupid or getting paid off not to pursue this.

Saying all this with infinite love for, and 100% commercial support of, OpenCL since version 1.0, and as a great enjoyer of the A770 with 16GB of memory: I live to laugh in the face of people who have claimed for over 10 years that OpenCL is deprecated on macOS (which I cannot stand and will never use, yet the hardware it runs on...) while it still routinely crushes powerful desktop GPUs, in reality and practice, today.


Both Intel and AMD produce server chips with 12-channel memory these days (that's 12x64-bit for 768 bits), which combined with DDR5 can push effective socket bandwidth beyond 800GB/s, well into the territory occupied by single GPUs these days.

You can even find some attractive deals on motherboard/ram/cpu bundles built around grey market engineering sample CPUs on aliexpress with good reports about usability under Linux.

Building a whole new system like this is not exactly as simple as just plugging a GPU into an existing system, but you also benefit from upgradeability of the memory, and not having to use anything like CUDA. llamafile, as an example, really benefits from AVX-512 available in recent CPUs. LLMs are memory bandwidth bound, so it doesn't take many CPU cores to keep the memory bus full.

Another benefit is that you can get a large amount of usable high bandwidth memory with a relatively low total system power usage. Some of AMD's parts with 12 channel memory can fit in a 200W system power budget. Less than a single high end GPU.
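
For a rough sense of the numbers: peak theoretical socket bandwidth is channels x bytes per transfer x transfer rate. The sketch below uses illustrative DIMM speeds (assumptions, not specs for any particular SKU); the 800GB/s-class figures need the fastest ones.

  #include <stdio.h>

  int main(void) {
      /* Peak theoretical socket bandwidth for 12 x 64-bit DIMMs at a few
       * illustrative DDR5/MRDIMM transfer rates. */
      const int dimms = 12;
      const int bytes_per_transfer = 8; /* 64-bit DIMM */
      const double rates_mts[] = { 4800, 6400, 8800 };

      for (int i = 0; i < 3; i++) {
          double gb_per_s = dimms * bytes_per_transfer * rates_mts[i] * 1e6 / 1e9;
          printf("%.0f MT/s: %.0f GB/s peak\n", rates_mts[i], gb_per_s);
      }
      return 0;
  }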


My desktop machine has had 128GB since 2018, but for the AI workloads currently commanding almost infinite market value, it really needs the 1TB/s bandwidth and teraflops that only a bona fide GPU can provide. An early AMD GPU with these characteristics is the Radeon VII with 16GB of HBM, which I bought for 500 eur back in 2019 (!!!).

I'm a rendering guy, not an AI guy, so I really just want the teraflops, but all GPU users urgently need a 3rd market player.


That 128GB is hanging off a dual-channel memory bus with only 128 total bits of width, which is why you need the GPU. The Epyc and Xeon CPUs I'm discussing have 6x the memory bandwidth, and will trade blows with that GPU.


At a mere 20x the cost or something, to say nothing about the motherboard etc :( 500 eur for 16GB of 1TB/s with tons of fp32 (and even fp64! The main reason I bought it) back in 2019 is no joke.

Believe me, as a lifelong hobbyist-HPC kind of person, I am absolutely dying for such a HBM/fp64 deal again.


$1,961.19: H13SSL-N Motherboard And EPYC 9334 QS CPU + DDR5 4*128GB 2666MHZ REG ECC RAM Server motherboard kit

https://www.aliexpress.us/item/3256807766813460.html

Doesn't seem like 20x to me. I'm sure spending more than 30 seconds searching could find even better deals.


Isn't 2666 MHz ECC RAM obscenely slow? 32 cores without the fast AVX-512 of Zen5 isn't what anyone is looking for in terms of floating point throughput (ask me about electricity prices in Germany), and for that money I'd rather just take a 4090 with 24GB memory and do my own software fixed point or floating point (which is exactly what I do personally and professionally).

This is exactly what I meant about Intel's recent launch. Imagine if they went full ALU-heavy on latest TSMC process and packaged 128GB with it, for like, 2-3k Eur. Nvidia would be whipping their lawyers to try to do something about that, not just their engineers.


Yes and no. I have been developing some local llama 3 inference software on a machine with 3200MT/s ECC RAM and a Ryzen 7 5800X:

https://github.com/ryao/llama3.c

My experience is that input processing (prompt processing) is compute bottlenecked in GEMM. AVX-512 would help there, although my CPU’s Zen 3 cores do not support it and the memory bandwidth does not matter very much. For output generation (token generation), memory bandwidth is a bottleneck and AVX-512 would not help at all.


I don't think anyone's stopping you, buddy. Great chat. I hope you have a nice evening.


12 channel DDR5 is actually 12x32-bit. JEDEC in its wisdom decided to split the 64-bit channels of earlier versions of DDR into 2x 32-bit channels per DIMM. Reaching 768-bit memory buses with DDR5 requires 24 channels.

Whenever I see DDR5 memory channels discussed, I am never sure if the speaker is accounting for the 2x 32-bit channels per DIMM or not.


The question is whether there's enough overall demand for a GPU architecture with 4x the VRAM of a 5090 but only about 1/3rd of the bandwidth. At that point it would only really be good for AI inferencing, so why not make specialized inferencing silicon instead?


I genuinely wonder why no one is doing this? Why can't I buy this specialized AI inference silicon with plenty of VRAM?


Intel and Qualcomm are doing this, although Intel uses HBM and their hardware is designed to do both inference and training while Qualcomm uses more conventional memory and their hardware is only designed do inference:

https://www.intel.com/content/www/us/en/products/details/pro...

https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...

They did not put it into the PC parts supply chain for reasons known only to them. That said, it would be awesome if Intel made high memory variants of their Arc graphics cards for sale through the PC parts supply chains.


I guess that would be an NPU combined with LPDDR. Basically any Windows Copilot Plus approved device.


Me too, probably 2x. They'd sell like hot cakes.


If their memory IO supports multiple ranks like the RTX 3090 (it used dual rank) did, they could do a new PCB layout and then add more memory chips to it. No additional silicon area would be necessary.


That would basically mean Intel doubling the size of their current GPU die, with a different memory PHY. They're clearly not ready to make that an affordable card. Maybe when they get around to making a chiplet-based GPU.


Are you suggesting that Intel 'just' release a GPU at the same price point as an M4 Max SOC? And that there would be a large market for it if they did so? Seems like an extremely niche product that would be demanding to manufacture. The M4 Max makes sense because it's a complete system they can sell to Apple's price-insensitive audience, Intel doesn't have a captive market like that to sell bespoke LLM accelerator cards to yet.

If this hypothetical 128GB LLM accelerator was also a capable GPU that would be more interesting but Intel hasn't proven an ability to execute on that level yet.


Nothing in my comment says anything about pricing it at the M4 Max level. Apple charges that much because they can (typing this on an $8000 M3 Max). 128GB of LPDDR5 is dirt cheap these days; Apple just adds its premium because it likes to. Nothing prevents Intel from releasing a basic GPU with that much RAM for under $1k.


You're asking for a GPU die at least as large as NVIDIA's TU102 that was $1k in 2018 when paired with only 11GB of RAM (because $1k couldn't get you a fully-enabled die to use 12GB of RAM). I think you're off by at least a factor of two in your cost estimates.


If Intel came out with an ARC GPU with 128GB VRAM at a $2000 price point, I and many others would likely buy it immediately.


Though Intel should also identify say the top-100 finetuners and just send it to them for free, on the down low. That would create some market pressure.


HBM plz.


Intel has Xeon Phi, which was a spin-off of their first attempt at a GPU, so they already have a lot of tech in place they can reuse. They don't need to go with GDDRx/HBMx designs that require large dies.


I don't want to further this discussion, but maybe you don't realise that some of the people who replied to you either design hardware for a living or have been in the hardware industry for more than 20 years.


While it is not a GPU, Qualcomm already made an inferencing card with 128GB RAM:

https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...

It would be interesting if those saying that a regular GPU with 128GB of VRAM cannot be made would explain how Qualcomm was able to make this card. It is not a big stretch to imagine a GPU with the same memory configuration. Note that Qualcomm did not use HBM for this.


For some reason Apple did it with the M3/M4 Max, likely designed by folks who are also on HN. The question is how many of those years spent designing HW were also spent educating oneself on the latest and best ways to do it.


>For some reason.....

They already replied with an answer.


Even LPDDR requires a large die. It only takes things out of the realm of technologically impossible to merely economically impractical. A 512-bit bus is still very inconveniently large for a single die.


> release a GPU at the same price point as an M4 Max SOC

Why would it need to be introduced at Apple's high-margin pricing?


It's also impossible and it would need to be a CPU.

CPUs and GPUs access memory very differently.


Thank you, Wtallis. Somewhere along the line, this basic "knowledge" of hardware was completely lost. I wouldn't have expected this to need explaining in any comment section on the old Anandtech. It seems hardware enthusiasts have mostly disappeared; I guess that is also why Anandtech closed. We now live in a world where most sites are just BS rumours.


Qualcomm made an AI inferencing card with 128GB RAM without using HBM:

https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...

Would someone with “basic ‘knowledge’ of hardware” explain why a GPU cannot be made with the same memory configuration?


That's because Anand Lal Shimpi is a CompE by training.

Not too many hardware enthusiast site editors have that academic background.

And while fervor can sometimes substitute for education... probably not in microprocessor / system design.


The Real World Technologies forum is still an absolute gold mine for hardware discussion.

As for articles, IMO, Chips and Cheese is the closest thing we have to Anandtech or RWT in their peak.


It is possible to have multiple memory ranks to reduce the bus width requirements for a given amount of memory. Nvidia has demonstrated that this is doable with GDDR6X on the RTX 3090. The RTX 3090 has a 384-bit bus with 24 memory ICs, despite only needing 12 to reach 384-bit. That means it has every two chips sharing one 32-bit interface, which is a dual rank configuration. If you look at the history of computer memory, you can find many examples of multi-rank configurations. I also recall LR-DIMMs as being another way of achieving this.

Achieving 128GB VRAM with a 256-bit bus (which seems like a reasonable bus width) would mean some multiple of 8 chips. If Micron, Samsung or SK Hynix made 128Gb GDDR7 chips, then 8 would suffice. The best right now seems to be 24Gb, although 32Gb seems likely to follow (and it would likely come sooner if a large customer such as Intel asked for it), so they would just need 32 chips in a quad rank configuration to achieve 128GB.
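
As a sanity check on that arithmetic, a small sketch (the 32Gb density is my assumption about a future part, not an announced product):

    #include <stdio.h>

    /* How many GDDR chips and ranks does a target capacity need?
       Each GDDR chip presents a 32-bit interface; density is in Gbit. */
    int main(void) {
        const int target_gb    = 128;  /* desired VRAM */
        const int bus_bits     = 256;  /* hypothetical bus width */
        const int density_gbit = 32;   /* assumed future 32Gb GDDR7 */

        int channels = bus_bits / 32;                /* 8 channels */
        int chips    = target_gb * 8 / density_gbit; /* 32 chips */
        int ranks    = chips / channels;             /* 4 = quad rank */

        printf("%d channels, %d chips, %d ranks per channel\n",
               channels, chips, ranks);
        return 0;
    }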

This assumes that there is no limit in the GDDR7 specification that prevents quad rank configurations. If there is and it still supports dual rank like GDDR6X did, then a 512-bit bus could be done. It would likely be extremely pricy and require a new chip tape out that has much more IO logic transistors to handle the additional bus width (and IO logic transistor scaling is dead, so the die area would be huge), but it is hypothetically possible. Given how much people are willing to pay for more VRAM, it could make business sense to do.

Even if there is no limit in the GDDR7 specification that prevents quad rank, their memory IO logic would need to support it, and if it does not, they would need to redesign it and do a new chip tape out in addition to a new board design. This would also be very expensive, although not as expensive as going to a 512-bit memory interface.

In summary, adding more memory would cost more to do and it would not improve competitiveness in the target market for these cards, which I imagine is the main reason that they do not do it.

By the way, the reason that Nvidia implemented support for 2 chips per channel is because they wanted to be able to reach 48GB VRAM on the workstation variant of the 3090 that is known as the RTX A6000 (non-Ada). I do not know why they used 24x 8Gb chips rather than 12x 16Gb on the 3090, although if I had to guess, it had something to do with rank interleaving.


Having four chips per channel is exactly why this is implausible. DDR5 can barely operate with four ranks per channel, at severely reduced speeds. Pulling that off with GDDR6 or GDDR7 is not something we can presume to be possible without specific evidence. The highest-density configurations possible for LPDDR5x are dual-rank and byte mode (one chip per 8 bits of the memory bus, so two chips ganged together to populate a 16-bit channel) — and that still operates at less than half the speed of GDDR6.

I've not seen any proposals for buffering LPDDR or GDDR, so an analog to LRDIMMs is not a readily-available technology.

GDDR is the memory technology that operates at the edge of what's possible for per-pin bandwidth. Loading that memory bus down with many ranks is not something we can expect to be achievable by just putting down more pads on the PCB.


> DDR5 can barely operate with four ranks per channel, at severely reduced speeds.

That is objectively false. See, for instance, V-color's Threadripper RAM [0]. If 96GB quad-rank modules @ 6000MHz in octo-channel count as "barely operating", maybe we have different definitions of operation requirements.

As a side note, their quad-channel 3-rank RAM [1] hits 8000MHz, out of the box. Admittedly only 24GB modules, but still.

[0] https://v-color.net/products/ddr5-ocrdimm-amd-wrx90-workstat... [1] https://v-color.net/products/ddr5-oc-rdimm-amd-trx50-worksta...


You linked to registered/buffered memory modules. I already addressed that case; it doesn't apply to LPDDR or GDDR.


In that case, we need a 512-bit memory bus to do this using the 32Gbit GDDR7 chips that should be on the market in the near future. This would be very expensive, but it should be possible, or do you see a reason why that cannot be done either?

That said, I am not an electrical engineer (although I work alongside one and have had a minor role in picking low-end components for custom PCBs), but I think that if Intel were to make a GPU with 128GB VRAM using GDDR7 in the next year or two, the engineer who does the trace routing to make it possible should set up a GoFundMe page for people to send beer money.


I think the goalposts may have shifted a bit, from why hasn't Intel made such a card to why is Intel not (publicly) working on such a card to be released in a year or two.

In terms of what would have been feasible for Intel to bring to market in 2024, the cheapest option for 128GB capacity would probably have been ~8.5Gb/s LPDDR5x on a 256-bit bus, but to at least match the bandwidth of the chip they just launched, it would have made more sense to use a 512-bit bus and bump the die size back up to ~half the reticle limit like their previous generation die with a 256-bit bus. So they would have had a quite slow but high-capacity GPU with a manufacturing cost equal to at least an RTX 4080, before adding in the cost of all that DRAM. And if they had started working on that chip as soon as LLaMA went public, they might have been able to deliver it by now.

It's no surprise at all that such a risky niche product did not emerge from a division of Intel that is lucky to not have been liquidated yet.


In hindsight, I misread you as saying that 128GB of RAM on a “basic GPU” is not technically feasible. My reply was to say it is feasible.

Intel is rumored to have a B770 GPU in development, but it was running late and then was delayed to next year since it had yet to tape out, so they are launching their B580 and B570 graphics cards, which had been ready to go for a while, now. That is why the bus size appears to have dropped across generations. Presumably, if they made a 512-bit bus version, it would be a 9 series card. They certainly left room for it in their lineup, but as far as leaks have been concerned, there is not much hope for one. I do not expect them to use anything other than GDDR7 on their battlemage cards.

As for a high memory ARC card, I am of the opinion that such a product would sell well among the local llama community. There might even be more sales of a high memory ARC card for inference than of the regular ARC cards for gaming given that their discrete graphics sales peaked at 250,000 in Q1 2023 before collapsing, which can be confirmed using the data here:

https://www.tomshardware.com/pc-components/gpus/discrete-gpu...

The market for high memory GPUs is surely bigger than that. That said, Intel is likely pricing their ARC GPUs at a loss after R&D costs are considered. This is likely intended to help them break into a new market, although it has not been going well for them so far. I would guess that they are at least a generation away from profitability.

Intel intends for its Gaudi 3 accelerators to be used for this rather than the ARC line. Those coincidentally have 128GB of RAM, but they use HBM rather than a DDR variant. Qualcomm on the other hand made its own accelerator with 128GB of LPDDR4x RAM:

https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...

If my math is right, Qualcomm went with a 1024-bit memory bus and some incorrect rounding (rounding 137.5 to 138 before multiplying by 4) to reach their stated bandwidth figure. Qualcomm is not selling it through the PC parts supply chain, so I have no clue how much it costs, but I assume that it is expensive. I assume that they used LPDDR4x to be able to build a product since they were too late in securing HBM supply and even if they did, they would not be able to scale production to meet demand growth since Nvidia is buying all of the HBM that it can.


What if they put 8 identical GPUs in the package, each with 1/8 the memory? Would that be a useful configuration for a modern LLM?


GPU inference is always a balancing act, trying to avoid bottlenecks on memory bandwidth (loading data from the GPU's global memory/VRAM to the much smaller internal shared memory, where it can be used for calculations) and compute (once the values are loaded).

Splitting the model up between several GPUs would add a third much worse bottleneck – memory bandwidth between the GPUs. No matter how well you connect them, it'll be slower than transfer within a single GPU.

Still, the fact that you can fit an 8× larger model might be worth it to you. It's a trade-off that's almost universally made while training LLMs (sometimes even with the model split down both its width and length), but is much less attractive for inference.


At least for LLMs and transformers this isn't relevant. Having 8x the chips and 8x the memory bandwidth is always better. Interchip communication for matrix multiplication against a constant left matrix with a tiny right matrix isn't bandwidth bound, only latency bound.


> Splitting the model up between several GPUs would add a third much worse bottleneck – memory bandwidth between the GPUs.

What if you allowed the system to only have a shared memory between every neighboring pair of GPUs?

Would that make sense for an LLM?



Last I've heard, the architecture makes that difficult. But my information may be outdated, and even if it isn't, I'm not a hardware designer and may have just misunderstood the limits I hear others discuss.


The K80 was basically two K40s glued together, but their interconnect was barely faster than PCIe, so it didn't have much benefit: one had to move data between the two internal GPUs anyway.


Inference workloads likely won't care very much. For llama 3.1 405B with bf16, when you split the workload across GPUs by layer, you only need to do a 32KB memory copy before the next GPU can begin processing. That can be done incredibly quickly over PCIe.
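
Back-of-the-envelope on where that 32KB figure comes from, assuming llama 3.1 405B's hidden dimension of 16384 and that only the current token's activation vector crosses the link (a sketch, not code from my repo):

    #include <stdio.h>

    /* Per-token data that crosses GPUs in a layer-wise pipeline split:
       just the hidden-state vector, not the weights or the KV cache. */
    int main(void) {
        const long hidden_dim = 16384;   /* llama 3.1 405B hidden size */
        const long elem_bytes = 2;       /* bf16 */
        const double link_bps = 32e9;    /* ~PCIe 5.0 x8, illustrative */

        long bytes = hidden_dim * elem_bytes;  /* 32 KiB */
        printf("activation per token: %ld KiB\n", bytes / 1024);
        printf("transfer time at 32 GB/s: %.2f us\n", bytes / link_bps * 1e6);
        return 0;
    }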


It could work, but would it be cost-competitive?


Also, cooling.


> "Just add more RAM" doesn't work the way you wish it could.

Re: Tomasulo's algorithm the other day: https://news.ycombinator.com/item?id=42231284

Cerebras WSE-3 has 44 GB of on-chip SRAM per chip and it's faster than HBM. https://news.ycombinator.com/item?id=41702789#41706409

Intel has HBM2e off-chip RAM in Xeon CPU Max series and GPU Max;

What is the difference between DDR, HBM, and Cerebras' 44GB of on-chip SRAM?


How do architectural bottlenecks due to modified Von Neumann architectures' debuggable instruction pipelines limit computational performance when scaling to larger amounts of off-chip RAM?

Tomasulo's algorithm also centralizes on a common data bus (the CPU-RAM data bus) which is a bottleneck that must scale with the amount of RAM.

Can in-RAM computation solve for error correction without redundant computation and consensus algorithms?

Can on-chip SRAM be built at lower cost?

Von Neumann architecture: https://en.wikipedia.org/wiki/Von_Neumann_architecture#Von_N... :

> The term "von Neumann architecture" has evolved to refer to any stored-program computer in which an instruction fetch and a data operation cannot occur at the same time (since they share a common bus). This is referred to as the von Neumann bottleneck, which often limits the performance of the corresponding system. [4]

> The von Neumann architecture is simpler than the Harvard architecture (which has one dedicated set of address and data buses for reading and writing to memory and another set of address and data buses to fetch instructions).

Modified Harvard architecture > Comparisons: https://en.wikipedia.org/wiki/Modified_Harvard_architecture

C-RAM: Computational RAM > DRAM-based PIM Taxonomy, See also: https://en.wikipedia.org/wiki/Computational_RAM

SRAM: Static random-access memory https://en.wikipedia.org/wiki/Static_random-access_memory :

> Typically, SRAM is used for the cache and internal registers of a CPU while DRAM is used for a computer's main memory.


For whatever reason Hynix hasn't turned their PIM into a usable product. LPDDR based PIM is insanely effective for inference. I can't stress this enough. An NPU+LPDDR6 PIM would kill GPUs for inference.


How many TOPS/W and TFLOPS/W? (Tera [Float] Operations Per Second per Watt, or should that be per watt-hour?)

/? TOPS/W and FLOPS/W: https://www.google.com/search?q=TOPS%2FW+and+FLOPS%2FW :

- "Why TOPS/W is a bad unit to benchmark next-gen AI chips" (2020) https://medium.com/@aron.kirschen/why-tops-w-is-a-bad-unit-t... :

> The simplest method therefore would be to use TOPS/W for digital approaches in future, but to use TOPS-B/W for analogue in-memory computing approaches!

> TOPS-8/W

> [ IEEE should spec this benchmark metric ]

- "A guide to AI TOPS and NPU performance metrics" (2024) https://www.qualcomm.com/news/onq/2024/04/a-guide-to-ai-tops... :

> TOPS = 2 × MAC unit count × Frequency / 1 trillion

- "Looking Beyond TOPS/W: How To Really Compare NPU Performance" (2023) https://semiengineering.com/looking-beyond-tops-w-how-to-rea... :

> TOPS = MACs * Frequency * 2

> [ { Frequency, NNs employed, Precision, Sparsity and Pruning, Process node, Memory and Power Consumption, utilization} for more representative variants of TOPS/W metric ]


Is this fast enough for DDR or SRAM RAM? "Breakthrough in avalanche-based amorphization reduces data storage energy 1e-9" (2024) https://news.ycombinator.com/item?id=42318944


GDDR isn't like the RAM that connects to a CPU; it's much more difficult and expensive to add more. You can get up to 48GB with some expensive stacked GDDR, but if you wanted to add more stacks you'd need to solve some serious signal-timing headaches that most users wouldn't benefit from.

I think the high-memory local inference stuff is going to come from "AI enabled" CPUs that share the memory in your computer. Apple is doing this now, but cheaper options are on the way. As a shape it's just suboptimal for graphics, so it doesn't make sense for any of the GPU vendors to do it.


As someone else said, I don't think you have to have GDDR; surely there are other options. Apple does a great job of it on their APUs with up to 192GB, and even an old AMD Threadripper chip can do quite well with its DDR4/5 performance.


For AI inference you definitely have other options, but for low-end graphics? The LPDDR that Apple (and Nvidia in Grace) use would be super expensive to get to a comparable bandwidth (think $3+/GB, and to get 500GB/sec you need at least 128GB).

And that 500GB/sec is pretty low for a GPU; it's like a 4070, but the memory alone would add $500+ to the cost of the inputs, not even counting the advanced packaging (getting those bandwidths out of LPDDR needs an organic substrate).

It's not that you can't, it's just that when you start doing this it stops being like a graphics card and becomes more like a CPU.


They can use LPDDR5x, it would still massively accelerate inference of large local LLMs that need more than 48GB RAM. Any tensor swapping between CPU RAM and GPU RAM kills the performance.


I think we don't really disagree; I just think that this shape isn't really a GPU, it's just a CPU, because it isn't very good for graphics at that point.


That's why I said "basic GPU". It doesn't have to be very fast, but it should still be way faster than a regular CPU. Intel already has Xeon Phi, so a lot of the pieces were developed already (like the memory controller, heavily parallel dies, etc.).


I wonder at that point you'd just be better served by CPU with 4 channels of RAM. If my math is right 4 channels of DDR5-8000 would get you 256GB/s. Not as much bandwidth as a typical discrete GPU, but it would be trivial to get many hundreds of GB of RAM and would be expandable.

Unfortunately I don't think either Intel or AMD makes a CPU that supports quad channel RAM at a decent price.


4 channels of DDR5-8000 would give you 128GB/sec. DDR5 has 2x32-bit channels per DIMM rather than 1x64-bit channel like DDR4 did. You would need 8 channels of DDR5-8000 to reach 256GB/sec.
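
Spelled out, since the 32-bit sub-channel is what usually trips people up (a quick sketch):

    #include <stdio.h>

    /* DDR5 bandwidth: each channel is 32 bits (4 bytes) wide, two channels
       per DIMM, and the transfer rate is in MT/s. */
    int main(void) {
        const double mts           = 8000e6;  /* DDR5-8000 */
        const double channel_bytes = 4;       /* one 32-bit DDR5 channel */

        double per_channel = mts * channel_bytes;  /* 32 GB/s */
        printf("per channel: %.0f GB/s\n", per_channel / 1e9);
        printf("4 channels:  %.0f GB/s\n", 4 * per_channel / 1e9);
        printf("8 channels:  %.0f GB/s\n", 8 * per_channel / 1e9);
        return 0;
    }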


I think all the people saying "just use a CPU" massively underestimate the speed difference between current CPUs and current GPUs. There's like four orders of magnitude. It's not even in the same zip code. Say you have a 64-core CPU at 2GHz with 512-bit 1-cycle FP16 instructions. That gives you 32 ops per cycle, 2048 across the entire package, so 4TFlops.

My 7900 XTX does 120TFlops.

To match that, you would need to scale that CPU up to either 2048 cores, 2KB per register (still one-cycle!) or 64GHz.

I guess if you had 1024-bit registers and 8GHz, you could get away with only 240 cores. Good luck dissipating that heat, btw. To reverse an opinion I'm seeing in this thread, at that point your CPU starts looking more like a GPU by necessity.


Usually, you can do 2 AVX-512 operations per cycle and using FMADD (fused multiply-add) instructions, you can do two floating point operations for the price of one. That would be 128 operations per cycle per core. The result would be 16TFlops on a 2GHz 64 core CPU, not 4 TFlops. This would give a 1 order of magnitude difference, rather than 4 orders of magnitude.
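
For anyone who wants to redo that estimate with their own core count and clock, here is the peak-FLOPS arithmetic as a sketch (it assumes two AVX-512 FMA pipes per core, which not every CPU has):

    #include <stdio.h>

    /* Theoretical peak FP16 throughput of a hypothetical AVX-512 CPU:
       flops/cycle/core = vector lanes * 2 (FMA) * FMA pipes. */
    int main(void) {
        const double cores = 64, ghz = 2.0;
        const double lanes = 512.0 / 16.0;  /* 32 fp16 lanes per vector */
        const double pipes = 2;             /* FMA units per core (assumed) */

        double flops_per_cycle = lanes * 2 * pipes;  /* 128 */
        double tflops = cores * ghz * 1e9 * flops_per_cycle / 1e12;
        printf("%.0f flops/cycle/core, %.1f TFLOPS peak\n",
               flops_per_cycle, tflops);
        return 0;
    }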

For inference, prompt processing is compute intensive, while token generation is memory bandwidth bound. The differences in memory bandwidth between CPUs and GPUs tend to be more profound than the difference in compute.


That's fair. On the other hand, there's like exactly one CPU with FP16 AVX-512 anyway, and 64-core parts aren't exactly commonplace either. And even with all those advantages, using a datacenter CPU, you're still a factor of 10 off from a GPU that isn't even consumer top-end. With a normal processor, say 16 cores and 16 float ops, even with fused ops and dispatching two ops per cycle you're still only at 2T and ~50x. In consumer spaces, I'm more optimistic about dedicated coprocessors. Maybe even an iTPU?


Zen 6 is supposed to add FP16 AVX512 support if AMD’s leaked slides mean what I think they mean. Here is a link to a screenshot of the leaked slides MLID published:

https://overclockers.ru/st/legacy/blog/428111/424644_O.jpg

I have been working on doing inference on a Ryzen 7 5800X lately and I have had good results:

https://github.com/ryao/llama3.c/blob/master/run.c

Running on a GPU like my 3090 Ti will likely outperform it by two orders of magnitude, but I have managed to push the needle slightly on the state of the art performance for prompt processing on my CPU. I suspect an additional 15% improvement is possible, but I do not expect to be able to realize it. In any case, it is an active R&D project that I am doing to learn how these things work.

Finally to answer your question, I have no good answers for you (or more specifically, answers that I like). I have been trying to think of ways to do fast local inference on high end models cost effectively for months. So far, I have nothing to show for it aside from my R&D into CPU llama 3 inference since none of my ideas are able to bring hardware costs needed for llama 3.1 405B below $10,000 with performance at an acceptable level. My idea of an acceptable performance level is 10 tokens per second for token generation and 4000 tokens per second for prompt processing, although perhaps lower prompt processing performance is acceptable with prompt caching.


This is only relevant for the flash attention part of the transformer, but an NPU is an equally suitable replacement for a GPU for flash attention.

Once you have offloaded flash attention, you're back to GEMV having a memory bottleneck. GEMV does a single multiplication and addition per parameter. You can add as many EXAFLOPs as you want, it won't get faster than your memory.


Out of interest, how does that look for diffusion?


I guess it's hard to know how well this would compete with integrated GPUs, especially at a reasonable price point. If you wanted to spend $4000+ on it, it could be very competitive and might look something like Nvidia's Grace Hopper superchip, but if you want the product to be under $1k I think it might be better just to buy separate cards for your graphics and AI stuff.


It is not stacked. It is multirank. Stacking means putting multiple layers on the same chip. They are already doing it for HBM. They will likely do it for other forms of DRAM in the future. Samsung reportedly will begin doing it in the 2030s:

https://www.tomshardware.com/pc-components/dram/samsung-outl...

I am not sure why they can already do stacking for HBM, but not GDDR and DDR. My guess is that it is cost related. I have heard that HBM reportedly costs 3 times more than DDR. Whatever they are doing to stack it now is likely much more expensive than their planned 3D fabrication node.


Meta comment: "why don't they just" phrase usually indicates significant ignorance about a subject, it's better to learn a little bit before dispensing criticism about beancounters or whatnot.

In this case, the die's I/O limits preclude more than a reasonable number of DDR channels.


The OP asked a question and got a bunch of answers explaining why they couldn't "just" do that. I think that's a win.


Or literally "just" take the "just" out of the question and 90% of the time the tone becomes inquisitive instead of rhetorical.


HBM3E memory is at least 3x the price of DDR5 (it requires 3x the wafer area of DDR5), and capacity is sold out for all of 2025 already... that's the price and production bottleneck.

High speed, low latency server grade DDR5 is around $800-$1600 for 128GB. Triple that for $2400 - $4800 just for the memory. Still need the GPUs/APUs, card, VRMs, etc.

Even the nVidia H100 with "only" 94GB starts at $30k...


Nvidia's $30,000 is a 90% margin product at scale. They could charge 1/3 that and still be very profitable. There has rarely been such a profitable large corporation in terms of the combo of profit & margin.

Their last quarter was $35b in sales and $26b in gross profit ($21.8b op income; 62% op income margin vs sales).

Visa is notorious for their extreme margin (66% op income margin vs sales) due to being basically a brand + transaction network. So the fact that a hardware manufacturer is hitting those levels is truly remarkable.

It's very clear that either AMD or Intel could accept far lower margins to go after them. And indeed that's exactly what will be required for any serious attempt to cut into their monopoly position.


Visa doesn't actually make a ton of money off each transaction, if you divide out their revenue against their payment volume (napkin math)...

They processed $12T in payments last year (almost a billion payments per day), with a net revenue of $32B. That's a gross transaction margin of 0.26% and their GAAP net income was half that, about 0.14%. [1]

They're just a transaction network, unlike say Amex which is both an issuer and a network. Being just the network is more operationally efficient.

[1] https://annualreport.visa.com/financials/default.aspx


That’s a weird way to account for their business size. There isn’t a significant marginal cost per transaction. They didn’t sell $12T in products. They facilitated that much in payments. Their profits are fantastic.


If you have no clue how profit margins are calculated then you're better off staying quiet.

It's quite simple. Divide revenue minus costs by revenue. Transaction volume isn't revenue. Visa only gets the transaction fee.

Even if I give you the benefit of the doubt and do a proper interpretation of the number you've arrived at, its meaning is quite different and quite off topic from this discussion. What you have calculated is the total share of costs that Visa represents in that 12 trillion dollar part of the economy. It is like saying Visa's share of GDP is 0.1%.


I didn't say that was their profit margin, that's their transaction margin.

As my mother used to say if you have nothing nice to say you're better off staying quiet ;)


> And indeed that's exactly what will be required for any serious attempt to cut into their monopoly position.

You misunderstand why and how Nvidia is a monopoly. Many companies make GPUs, and all those GPUs can be used for computation if you develop compute shaders for them. This part is not the problem, you can already go buy cheaper hardware that outperforms Nvidia if price is your only concern.

Software is the issue. That's it - it's CUDA and nothing else. You cannot assail Nvidia's position, and moreover their hardware's value, without a really solid reason for datacenters to own them. Datacenters do not want to own GPUs because once the AI bubble pops they'll be bagholders for Intel and AMD's depreciated hardware. Nvidia hardware can at least crypto mine, or be leased out to industrial customers that have their own remote CUDA applications. The demand for generic GPU compute is basically nonexistent; the reason this market exists at all is because CUDA exists, and you cannot turn over Nvidia's foothold without accepting that fact.

The only way the entire industry can fuck over Nvidia is if they choose to invest in a complete CUDA replacement like OpenCL. That is the only way that Nvidia's value can be actually deposed without any path of recourse for their business, and it will never happen because every single one of Nvidia's competitors hate each other's guts and would rather watch each other die in gladiatorial combat than help each other fight the monster. And Jensen Huang probably revels in it, CUDA is a hedged bet against the industry ever working together for common good.


I feel people are exaggerating the impossibility of replacing CUDA. Adopting CUDA is convenient right now because yes it is difficult to replace it. Barrier to entry for orgs that can do that is very high. But it has been done. Google has the TPU for example.


The TPU is not a GPU nor is it commercially available. It is a chip optimized around a limited featureset with a limited software layer on top of it. It's an impressive demonstration on Google's behalf to be sure, but it's also not a shot across the bow at Nvidia's business. Nvidia has the TSMC relations, a refined and complex streaming multiprocessor architecture and actual software support their customers can go use today. TPUs haven't quite taken over like people anticipated anyways.

I don't personally think CUDA is impossible to replace - but I do think that everyone capable of replacing CUDA has been ignoring it recently. Nvidia's role as the GPGPU compute people is secure for the foreseeable future. Apple wants to design simpler GPUs, AMD wants to design cheaper GPUs, and Intel wants to pretend like they can compete with AMD. Every stakeholder with the capacity to turn this ship around is pretending like Nvidia doesn't exist and whistling until they go away.


I don’t disagree with what you are saying but I want to point out that the fact that the TPU is not a GPU is not really relevant. In the end what matters most is whether or not it can accelerate PyTorch.


They're not exaggerating it. The more things change, the more they stay the same. Nvidia and AMD had the exact same relationship 15 years ago that they do today: the AMD crowd going on about their better efficiencies, and the Nvidia crowd having grossly superior drivers/firmware/hardware, including unique PhysX stuff that STILL has not been matched since 2012 (remember Planetside 2 or Borderlands 2 physics? Pepperidge Farm remembers...)

So many billions of dollars and no one is even 1% close to displacing CUDA in any meaningful way. ZLUDA is dead. ROCm is a meme, Scale is a meme. Either you use CUDA or you don't do meaningful AI work.


CUDA is not the issue. AMD have already reimplemented like 80% of it, and honestly that part of it mostly works fine. Pytorch supports it, (almost) all the big frameworks support it, if you're not doing really arcane things it just works. It's the drivers! They took like two years after the release of their flagship card to stop randomly crashing. Everything geohot has ever said about AMD drivers is 100% true. They just cannot stop shooting themselves in the foot.


What did he say?


Geohot (temporarily) giving up. https://github.com/ROCm/ROCm/issues/2198#issuecomment-157438... Sadly most of the real spicy Twitter messages are gone since he deleted all his content, but there was a really fun one where he went off on a beautifully cryptic commit message in the driver. He also begged AMD to opensource the firmware so he could debug it. Sadly, AMD promised to do it and then nothing happened, as is typical for AMD promises. That's why tinygrad nowadays is aiming to just bypass the driver and firmware entirely.


Who is tinygrad?


tinygrad = George Hotz. (It's his company, __tinygrad__ is basically his "work account".)


> The only way the entire industry can fuck over Nvidia is if they choose to invest in a complete CUDA replacement like OpenCL. That is the only way that Nvidia's value can be actually deposed without any path of recourse for their business, and it will never happen because every single one of Nvidia's competitors hate each other's guts and would rather watch each other die

Intel seems to have thrown their weight behind SYCL, which is an open standard intended to compete with CUDA. It's not clear there has been much interest from other hardware vendors, though.


I do not misunderstand why Nvidia has a monopoly. You jumped drastically beyond anything I was discussing and incorrectly assumed ignorance on my part. I never said why I thought they had one. I never brought up matters of performance or software or moats at all. I simply stated, as a matter of fact, that they had a monopoly; you assumed the rest.

It's impossible to assail their monopoly without utilizing far lower prices, coming up under their extreme margin products. It's how it is almost always done competitively in tech (see: ARM, or Office (dramatically undercut Lotus with a cheaper inferior product), or Linux, or Huawei, or Chromebooks, or Internet Explorer, or just about anything).

Note: I never said lower prices are all you'd need. Who would think that? The implication is that I'm ignorant of the entire history of tech, which is a poor approach to a discussion with another person on HN, frankly.


Nvidia's monopoly is pretty much detached from price at this point. That's the entire reason why they can charge insane margins - nobody cares! There is not a single business squaring Nvidia up with serious intent to take down CUDA. It's been this way for nearly two decades at this point, with not a single spark of hope to show for it.

In the case of ARM, Office, Linux, Huawei, and ChromeOS, these were all actual alternatives to the incumbent tools people were familiar with. You can directly compare Office and Lotus because they are fundamentally similar products; ARM had a real chance against x86 because it wasn't a complex ISA to unseat. Nvidia is not analogous to these businesses because they occupy a league of their own as the provider of CUDA. It's not an exaggeration to say that they have completely seceded from the market of GPUs and can sustain themselves on demand from crypto miners and AI pundits alone.

AMD, Intel and even Apple have bigger things to worry about than hitting an arbitrary price point, if they want Nvidia in their crosshairs. All of them have already solved the "sell consumer tech at attractive prices" problem but not the "make it complex, standardize it and scale it up" problem.


It is cheaper to pay Nvidia than it is to roll your own solution and no one else is competitive. That is the reason Nvidia can charge so much per card.


Thank you for laying it out. It's so silly to see people in the comments act like Intel or Nvidia can't EASILY add more VRAM to their cards. Every single argument against it is all hogwash.


Even 24 or 32 GB for an accessible price would sell out fast. NVIDIA wants $2000 for the 5090 to get 32.


Because you can't stack that much RAM on a GPU without sufficient channels to do so. You could probably do 64GB on GDDR6, but you can't do 128GB on GDDR6 without more memory channels. 2GB per chip per channel is the current limit for GDDR6; this is why HBM was invented.

It is why you can only see GPUs with 24GB of memory at the moment.

HBM2 can handle 64GB ( 4 x 8GB Stack ) ( Total capacity 128GB )

HBM3 can handle 192GB ( 4 x 24GB Stack ) ( Total capacity 384GB )

You can not do this with GDDR6.


Look at the RTX 3090 and RTX A6000 (the non-Ada one). They both have 24 memory chips with a 384-bit memory bus, but one has 24GB of VRAM and the other has 48GB of VRAM. They both have two chips per channel. This breaks the 24GB VRAM limit that you claim to exist.


Nope, it doesn't. That is 12 channels with 2 modules per channel; 2GB is the largest each module gets, and those modules drastically increase the cost.

The problem you run into with GDDR6 and channels is width. You can go higher in capacity, but you'll never be able to fill the memory fast enough to be cost effective. That is also why HBM exists. Let's say the top is 960GB/s for an A6000's GPU memory; HBM3 is 3.35 TB/s.

If Intel wanted to make better GPUs, it would need to switch to HBM.


> Nope it doesn't that is 12 channels 2 modules per channel each module at 2GB

You say nope and then proceed to say exactly what I said. Even if the ICs are limited to 16Gbit density for now, the 24GB limit you mentioned was wrong, since you assumed single rank was the highest possible.

As for HBM, that is earmarked for their gaudi 3 line. There is no chance of it being put into a consumer product, as it would be like selling gold at pyrite pricing.


Pretty obvious stuff, right? I mean, you don't even need HBM for that, you just need a TON of memory channels. Sure, that kind of setup would only be efficient for highly coalesced reads/writes, but that's what you need these days for inference - highly coalesced reads and writes. You could even get by with 64GB of DDR5. DDR5-4800 (rather modest) is 38.4GB/s per channel. To get 1TB/s you'd only need 26 channels. With the more expensive DDR5-6400 you'd only need 20. That doesn't at all sound insurmountable for a company of Intel's caliber. Heck, break up the dies (and the channels) across several chiplets even, if the interconnect is decent it'll still run really well.


48 GB is at the tail end of what's reasonable for normal GPUs. The IO requires a lot of die space. And Intel's architecture is not very space-efficient right now compared to Nvidia's.


> The IO requires a lot of die space.

And even if you spend a lot of die space on memory controllers, you can only fit so many GDDR chips around the GPU core while maintaining signal integrity. HBM sidesteps that issue but it's still too expensive for anything but the highest end accelerators, and the ordinary LPDDR that Apple uses is lacking in bandwidth compared to GDDR, so they have to compensate with ginormous amounts of IO silicon. The M4 Ultra is expected to have similar bandwidth to a 4090 but the former will need a 1024bit bus to get there while the latter is only 384bit.
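
The rough numbers behind that comparison, taking 21Gb/s GDDR6X for the 4090 and LPDDR5x-8533 for a hypothetical M4 Ultra (the latter is speculation since no such part has shipped):

    #include <stdio.h>

    /* Bandwidth = per-pin data rate * bus width / 8. All figures assumed. */
    int main(void) {
        double m4_ultra = 8.533e9 * 1024 / 8;  /* LPDDR5x-8533 on a 1024-bit bus */
        double rtx4090  = 21e9    * 384  / 8;  /* 21 Gb/s GDDR6X on a 384-bit bus */
        printf("M4 Ultra (est.): %.0f GB/s\n", m4_ultra / 1e9);
        printf("RTX 4090:        %.0f GB/s\n", rtx4090 / 1e9);
        return 0;
    }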


Going off of how the 4090 and 7900 XTX are arranged, I think you could maybe fit one or two more chips around the die beyond their 12, but that's still a far cry from 128. That would probably just need a shared bus like normal DDR, as you're not fitting that much with 16Gbit density.


Look at the 3090, which uses 24 chips (12 on one side and 12 on another). Pushing it to 32 is doable. 32 is all you need to reach 128GB VRAM with the 32Gbit GDDR7 chips that should be on the market in the near future.


Where would you route the connection to the additional 4 groups of chips around the die? The PCIe connection needs to be there too, and they also may not like power delivery going through them


Nvidia has done a 512-bit bus in the past. The 3090 has 4 groups of 3 on each side of the card. Switching to 4 groups of 4 should be doable visually. That said, I would not want to be the one responsible for doing the trace routing.


What if we did what others suggested was the practical limit - 48GB. Then just put 2-3 cards in the system and maybe had a little bridge over a separate bus for them to communicate?


I believe that would need some software work from Intel, where they're lacking a bit now with their delayed start. I'm not sure how the frameworks themselves split up the inference work to avoid crossing GPUs, as the bandwidth is horrible there.

If we're being reasonable and say that you're not using a modern HEDT CPU that costs a couple thousand, the best a consumer motherboard can get right now would be 2x 8x PCIe gen 5 at 32GB/s and one chipset x8 PCIe gen 4 at 16GB/s. I'm not sure if a motherboard like that actually exists, but Intel's chipset should allow it; AMD only does x4 to the chipset, so the third slot is limited by that.


After carefully reviewing all of the other comments explaining the many technical and organizational reasons why they should not "just do that," I have come to the conclusion that it was a big missed opportunity by Intel.


Totally agree. Someone needs to exploit the lack of available GPU memory in graphics cards for model runners. Even training tensors tends to run into memory issues with the current cards.


They don't need to do 128GB; 48GB+ would eat their lunch. Intel and AMD are sleeping.


Qualcomm built a card designed to do inferencing with 128GB of RAM:

https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...

I have no idea how much it costs. They do not sell it via PC parts channels.


I think a better idea would be an NPU with slower memory, or one tied to the system DDR. I don't think consumer inference (possibly even training) applications would need the memory bandwidth offered by GDDR/HBM. Inference on my 7950X is already stupid fast (all things considered).

The deeper problem is that the market for this is probably incredibly niche.


This GPU has a 192-bit memory bus. At 32 bits of GDDR bus width per chip (well, with GDDR6 it's really 2x 16-bit data channels per chip), that means you have 6 channels. With regular GDDR6, the largest produced size is 16Gb (2GB), so 12GB is what you get. You could double that with a beefed-up PCB, if the memory controller supports it, to get up to 24GB (in the way workstation cards like the W7900 and A6000 do).
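
The same arithmetic in a few lines (16Gb GDDR6 assumed; a second rank is the beefed-up-PCB case):

    #include <stdio.h>

    /* VRAM capacity = channels * ranks * chip density (Gbit / 8). */
    int main(void) {
        const int bus_bits     = 192;  /* 192-bit bus as on this card */
        const int density_gbit = 16;   /* largest common GDDR6 chip */
        int channels = bus_bits / 32;  /* 6 */

        printf("single rank: %d GB\n", channels * 1 * density_gbit / 8);  /* 12 GB */
        printf("dual rank:   %d GB\n", channels * 2 * density_gbit / 8);  /* 24 GB */
        return 0;
    }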

Beyond that, you'd have to move to GDDR7 (which has 24Gb/3GB chips incoming) or to HBM stacks, but at that point you're well beyond a "basic GPU". I think the only way you could get to 128GB would be either using regular (LP)DDR or HBM.

Note that Apple M chips have weak GPUs with decent memory bandwidth and large memory capacities (up to 192GB @ 800GB/s for the M2 Ultra, launched mid-2023) and have not been a major CUDA threat, so I don't think your hypothesis actually stands up.


Judging by the number of 16 GB laptops I see around, 128 GB of RAM would probably cost a bajillion dollars


One of the great things about having a desktop is being able to get that much for under $300 instead of the price of a second laptop.


Desktop PC RAM doesn't have the bandwidth and latency required for running AI.


Not the laptop RAM. It costs pennies; Apple is just charging $200 for 12GB because they can. It's way too slow, though.

And Nvidia doesn't want to cannibalize its high-end chips by putting more memory into consumer ones.


[flagged]


They all look like amusement parks nowadays.


You can get a clean Fractal Design case with 0 lights or RGB. You are showing your ignorance. :)

Also, RGB means RGB... just set the color to white for a more clean look? But reading the manual is not for you.


Nope, because you can't get those at the mall downtown; you have to custom build, ordering from online shops.


Considering how HN professedly struggles to use non-reversible cables, it wouldn't surprise me if someone started to complain about getting skillchecked by a Molex or SATA.

And that CPU, with the little gold tab you gotta line up? That's a whopping three ways you could go wrong! Forget about it, might as well just phone Dell or Apple and pay the idiot tax.


Given my experience with PC parts in the past, most folks would be skill-checked when getting any card for their slots that doesn't exactly match the motherboard firmware, bus protocol version or speed; even an x.1 variation makes all the difference.

Also, better check whether the UEFI firmware understands the brand and version of the nice NVMe SSD that was bought.

Be sure not to fill all the slots either, and ensure the electromagnetic radiation from each board doesn't cause issues with its neighbour; it might be a possible reason for random freezes.

Custom PCs aren't IKEA, and my time building them is long gone. I have better things to do.


> Also, better check whether the UEFI firmware understands the brand and version of the nice NVMe SSD that was bought.

That's funny! I haven't heard of this being an issue since 2009. And back then, it was an issue with a select few manufacturers that took it upon themselves to modify the stock ACPI tables to behave differently from how the OEM intended it to behave.

Not naming any names, but it really was a pain in the ass for the people that bought commercial dual-boot software and couldn't run ostensibly recent OSes. Something you didn't encounter if you braved the impossible gauntlet of building a PC from scratch, as there was no "we know better than you" interruption from companies with a sizable investment in watching UEFI burn. Of course, you still have to put up with other fierce trials such as "which way does the DIMM line up" and "how do screws work" which will triumph over so-called hackers that thought they were smart enough to build a PC.

Just a thought. Custom PCs will never consume the world, but we are going to point and laugh if you buy a preconfigured rackmount or tower at the sticker price. That's like watching a car enthusiast proudly exit the used Honda dealership after paying three times the Blue Book price for their ride. Puts you up there with Elon and Zuckerberg in the pantheon of people that let FOMO drive irrational hardware purchases.

> Custom PCs aren't IKEA,

Correct, IKEA furniture actually demands literacy from the person installing it.

> I have better things to do.

You're a Hacker News user with 14 years of seniority and enough karma to make the Buddha blush. We both know that if you had anything better to do, you wouldn't be writing paragraph-length replies to arguments you supposedly don't take seriously in the first place. Whatever "better things" exist on your agenda, they must have quite the lengthy refractory period for you to find the time in your horribly busy life for this and the other 63,548 comments you've submitted to this site.

If you can't prove that PC building is inconvenient, you can't prove that your time is worth something and you can't prove that your alternative is better value, what are you trying to tell me? That you're lazier than a middle schooler and too dumb to shop at Radioshack? Touché and well met, your rhetoric moves me.


Isn't it the case that nobody is actually making anything close to a profit in the AI space? When your entire business model is propped up by VC money, making the bill of materials that you use to make negative profit cheaper is probably not very high on your list of priorities.


They are probably held back by the same reason that's preventing AMD and Nvidia from doing it.


The reason AMD and Nvidia don't is that they don't want to cannibalize their high-end AI market. Intel doesn't have a high-end AI market to protect.


There are products like this one: https://www.intel.com/content/www/us/en/products/sku/232592/...

As far as I understand it, it gives you 64 GiB of HBM per socket.


That's a CPU. We are talking about GPUs here, for highly parallel matrix multiplication problems. Two different beasts.


This CPU is a bit of a hybrid because it has integrated matrix multiplication accelerators: https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions


NVidia and AMD make $$$ on datacenter GPUs, so it makes sense that they don't want to discount their own high end. Intel has nothing there, so they can happily go for commoditization of AI hardware, like what Meta did when releasing LLaMA to the wild.


Is Nvidia or AMD offering 128GB cards in any configuration?


They aren't "cards" but MI300x has 192GB and MI325x has 256GB.


You can run an AMD APU with 128GB of shared RAM.


It's too slow and not very compatible. Most BIOSes also don't allow sharing that much memory with GPU (max like 16GB).


Isn't that setting just a historical thing, and an integrated GPU is able to access any system memory that is mapped by the IOMMU? I assume this is how it works for people using the NVIDIA Jetson AGX Orin 64GB Developer Kit to do inference. I do not know why it would be different for AMD APUs.


I remember somebody complaining about it on Reddit, unable to overcome some BIOS limitation on an AMD G processor. Even on the M3 Max, one had to issue a special command to enable the GPU to access more memory.


The command on the M3 Max is a sysctl command to adjust an operating-system-enforced limit. That is different from the aperture setting in the BIOS. The limitation on AMD is more interesting and more relevant.


You can do that with Nvidia too, but it takes you from 6 tok/s to 6 s/token or worse (not even exaggerating).


The size of the local inference market is too small. Maybe a couple of thousand LLM enthusiasts? It's not enough to make a profit or even break even on the development costs for the hardware.


For now. This might very well change once the general public realizes they can be movie directors (or generative world gamers) just by downloading some model and plugging in an eGPU. The potential inference market is huge


AMD has a 192GB GPU. I don’t see them eating NVidia’s lunch with it.


They are charging as much as Nvidia for it. Now imagine they offered such a card for $2k. Would that allow them to eat Nvidia's lunch?


Let's say for the sake of argument that you could build such a card and sell it for less than $5k. Why would you do it? You know there's huge demand, in the tens of billions per quarter, for high-end cards. Why undercut that market so heavily? To overthrow NVidia? You'd end up with a way lower profit margin, and then your shareholders would eat you alive.


AMD would be selling it at a loss. Given that HBM costs 3x the price of desktop DRAM and a 192GB kit costs $600 at Newegg, the memory alone would cost 90% of the price. The GPU die, PCB, power circuitry, etc likely costs more than $200 to make.
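
Putting those figures into a couple of lines (all rough, using the $600/192GB retail kit and the 3x HBM premium already mentioned):

    #include <stdio.h>

    /* Back-of-the-envelope memory cost for a 192GB HBM card. */
    int main(void) {
        double ddr_per_gb = 600.0 / 192.0;      /* $600 for a 192GB desktop kit */
        double hbm_per_gb = ddr_per_gb * 3.0;   /* HBM ~3x the price of desktop DRAM */
        double memory_cost = hbm_per_gb * 192;  /* ~$1800 */
        printf("memory alone: $%.0f (%.0f%% of a $2000 card)\n",
               memory_cost, memory_cost / 2000 * 100);
        return 0;
    }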

This does not consider that the board of directors would crucify Lisa Su if she authorized the use of HBM on a consumer product while it is supply constrained and there is enterprise demand for products using it. AMD can only get a limited amount of it and what they do get is not enough for enterprise demand where AMD has extremely healthy margins.

Even if they by some miracle turned a profit on a $2000 consumer card with 192GB HBM, every sale would have a massive opportunity cost and effectively would be a loss in the eyes of the board of directors.

Meanwhile, Nvidia would be unaffected because AMD could not produce very many of these.


NVidia would be dramatically affected, just not overnight.

If Intel or AMD sold a niche product with 48GB RAM even at a loss, but hit high-end consumer pricing, there would be a flood of people doing various AI work to buy it. The end result would be that parts of NVidia's moat would start draining rather quickly, and AMD / Intel would be in a stronger position for AI products.

I use NVidia because when I bought AMD during the GPU shortage, ROCm simply didn't work for AI. This was a few years back, but I was burned badly enough that I'm unlikely to risk AMD again for a long, long time. Unused code sits broken, and no ecosystem gets built up. A few years later, things are gradually improving for AMD for the kinds of things I wanted to do years ago, but all my code is already built around NVidia, and all my computers have NVidia cards. It's a project with users, and all those users are buying NVidia as well (even if just for surface dependencies, like dev-ops scripts which install CUDA). That, times thousands of projects, is part of NVidia's moat.

If I could build a cheap system with around 200GB, that would be incentive for me to move the relatively surface dependencies to work on a different platform. I can buy a motherboard with four PCI slots, and plug in four 48GB cards to get there. I'd build things around Intel or AMD instead.

The alternative is NVidia would start shipping competitive cards. If they did that, their high-end profit margins would dissolve.

The breakpoints for inference functionality are really at around 16GB, 48GB, and 200GB, for various historical reasons.


> If I could build a cheap system with around 200GB,

Even if AMD dropped the price to $2000, you would not be able to build a system with one of these cards. You cannot buy these cards at their current 5 digit pricing. The idea that you could buy one if they dropped the price to $2000 is a fantasy, since others would purchase the supply long before you have a chance to purchase one, just like they do now.

AMD is already selling out at the current 5 digit pricing and Nvidia is not affected, since Nvidia is selling millions of cards per year and still cannot meet demand while AMD is selling around 100,000. AMD dropping the price to $2000 would not harm Nvidia in the slightest. It would harm AMD by turning a significant money maker into a loss leader. It would also likely result in Lisa Su being fired.

By the way, the CUDA moat is overrated since people already implement support for alternatives. llama.cpp for example supports at least 3. PyTorch supports alternatives too. None of this harms Nvidia unless Nvidia stops innovating and that is unlikely to happen. A price drop to $2000 would not change this.


Let's compare to see if it's really the same market:

HGX B200: 36 petaflops at FP16. 14.4 terabytes/second bandwidth.

RTX 4060 (similar to Intel): 15 teraflops at FP16. 272 gigabytes/second bandwidth

Hmmm.... Note the prefixes (peta versus tera)

A lot of that is apples-to-oranges, but that's kind of the point. It's a different market.

A low-performance high-RAM product would not cut into the server market since performance matters. What it would do is open up a world of diverse research, development, and consumer applications.

Critically, even if it were, Intel doesn't play in this market. If what you're saying were to happen, it should be a no-brainer for Intel to launch a low-cost alternative. That said, it wouldn't happen. What would happen is a lot of business, researchers, and individuals would be able to use ≈200GB models on their own hardware for low-scale use.

> By the way, the CUDA moat is overrated since people already implement support for alternatives.

No. It's not. It's perhaps overrated if you're building a custom solution and making the next OpenAI or Anthropic. It's very much not overrated if you're doing general-purpose work and want things to just work.

https://www.nvidia.com/en-us/data-center/hgx/ https://www.techpowerup.com/gpu-specs/geforce-rtx-4060.c4107


What treprinum suggested was AMD selling their current 192GB enterprise card (MI300X) for $2000, not a low end card. Everything you said makes sense but it is beside the point that was raised above. You want the other discussion about attaching 128GB to a basic GPU. I agree that would be disruptive, but that is a different discussion entirely. In fact, I beat you to saying that would be disruptive by about 16 hours:

https://news.ycombinator.com/item?id=42315309


If you want to load up 405B @ FP_16 into a single H100 box, how do you do it? You get two boxes. 2x the price.

Models are getting larger, not smaller. This is why H200 has more memory, but the same exact compute. MI300x vs. MI325x... more memory, same compute.


We would also need to imagine AMD fixing their software.


I think plenty of enthusiastic open source devs would jump at it and fix their software if the software was reasonably open. The same effect as what happened when Meta released LLaMA.


It is open and they regularly merge PRs.

https://github.com/ROCm/ROCm/pulls?q=is%3Apr+is%3Aclosed


AMD GPUs aren't very attractive to ML folks because they don't outshine Nvidia in any single aspect. Blasting lots of RAM onto a GPU would make it attractive immediately, drawing lots of attention from devs otherwise occupied with more interesting things.


Does the 7900 XT outperform the 3090 Ti? If so, there's already a market, because those are the same price. I don't mean in theory; are there any workloads that the 7900 XT can actually do better? Even if they're practically equal in performance, you get a warranty and support with your new 7900 XT.

Also, I didn't know there was a 192GB AMD GPU.


MI300X already leads in VRAM as it has 192 GB.

For local inference, 7900 XTX has 24 GB of VRAM for less than $1000.

At what threshold of VRAM would you start being interested in MI?


Problem with MI300x is the price. Problem with 7900XTX is that it's at best as good as Nvidia with the same RAM for a similar price. If 7900XTX had e.g. 64GB of RAM, was 2x slower than 4080, and kept its price, it would sell like crazy.


I have a 7900 XTX. Honestly, I regret it. It took two years for the driver to stop randomly crashing under very pedestrian ROCm loads. And there's no future in AMD support now that they're getting out of the high-end dual-use GPU game anyway. I should have gone with NVidia.


Who manufactures the type of RAM and can they buy enough capacity? I know nVidia bought up the high bandwidth memory supply for years to come.


Honestly, I don't think we would need "this type of RAM." The confused part of this discussion is the belief that we need obscene bandwidth.

If I need 300GB/s memory bandwidth for my workload, that can be accomplished with:

* One RAM chip with 300GB/s

* Two RAM chips with 150GB/s each

* Four RAM chips with 75GB/s each

Etc.

Stepping up from 16GB to 196GB, the bandwidth requirements for each chip go down 10-fold, and you can use much cheaper RAM as a result. And all the signalling requirements relax too.
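
As a tiny illustrative sketch (the numbers are made up, not tied to any specific part):

    # Illustrative arithmetic only: a fixed aggregate-bandwidth target spread over more chips.
    target_gb_per_s = 300
    for chips in (1, 2, 4, 8, 16):
        per_chip = target_gb_per_s / chips
        print(f"{chips:>2} chips -> {per_chip:6.1f} GB/s per chip")
    # 1 chip needs 300 GB/s; 16 chips only need ~19 GB/s each, which is
    # commodity-DRAM territory rather than high-end GDDR/HBM.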

Much of this discussion presumes a 200GB card would need the same per-chip bandwidth as a 12GB card. This is just false. An A770 or 4060-grade card couldn't keep up with that much data. And if I'm using a small model, I can get the same total bandwidth by properly distributing it among RAM chips (which most hardware does automatically).

An A770 or 4060-grade card, with the same total memory bandwidth as we have today but 200GB of RAM, would allow us to run high-quality LLMs locally or do high-resolution renders. That wouldn't have the same performance as a $200k card, obviously, but for many inference uses, that's just not very important.

If I were buying for my own uses, I'd want 12x 32GB DDR4-3200 DIMMs for a total of 384GB of RAM at $600 for the RAM (say $2k total), with an individual throughput of 25GB/sec and a total throughput of 300GB/sec. I'd be okay with 4060-grade performance. My own uses are a bit niche, and I think for most other people's uses, something with a little more throughput and a little less capacity (48-196GB) might make more sense. But you definitely don't need the same throughput as existing GPU RAM.
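
And as a rough sense of what 300GB/sec buys for token generation (assuming generation is memory-bandwidth-bound and streams roughly the full weights per token; this ignores KV-cache traffic and compute limits):

    # Crude upper bound on single-stream token rate for a bandwidth-bound setup:
    # output generation has to stream (roughly) the full active weights per token.
    def max_tokens_per_sec(model_gb: float, bandwidth_gb_per_s: float) -> float:
        return bandwidth_gb_per_s / model_gb

    print(max_tokens_per_sec(70 * 2.0, 300))   # 70B @ FP16 (~140 GB): ~2 tokens/s
    print(max_tokens_per_sec(70 * 0.5, 300))   # 70B @ 4-bit (~35 GB): ~8.5 tokens/s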


armchair hardware enthusiast opinion: because silicon in the high-end is expensive, and not just a matter of slapping things together.

besides the clear limitation of the memory technology they are using compared to nvidia's enterprise solution, for such large GPU chips that could really make use of such memory, they need to make binning viable by selling cut-down versions of them as well.

nvidia can pull this off because they can sell lower-end chips at the same time. intel is barely making a dent in sales, and making a high-end chip would be very risky, with the upside of potentially benefiting only a niche crowd.

> put them as a major CUDA threat

that is a software/ecosystem problem, which hardware alone cannot solve. for all the devs that use Macs, even in AI it is only about inference at the moment. nobody is coming at CUDA for training in the near future. amd tried and failed plenty already.


The question is: is there a real market for this? I do think it could bootstrap from local inference enthusiasts, but it is not clear-cut.

Rather than go all in with 128GB, they could test the waters easily with a cheap 32GB offering and take it from there.


I would think someday we can use AI to port the software. Maybe even use AI to design the cards.


I've said this for a while...

I do think one challenge is, AFAIK with most GDDR5/6 there's a density issue that requires either larger memory bus paths or other additional complexity to support large sizes.

That said, the lack of even a 16GB variant is sus.

I'll take some copium in that maybe they're trying to solve the 'size' issue somehow and are just making sure whatever system they use isn't gonna be an i820 MTH debacle before they pull the trigger on announcing it.


Because they can’t


This is a gaming card. Look at benchmarks.


Because if they could just do that and rival what NVidia has, they would have done it already.

But obviously they don't.

And for good reason: NVidia has worked on CUDA for ages; do you believe anyone could just replace that whole thing in no time?


Does CUDA even matter that much for LLMs? Especially inference? I don't think software would be the limiting factor for this hypothetical GPU. After all, it would be competing with Apple's M chips, not with the 4090 or Nvidia's enterprise GPUs.


It's the only thing that matters. Folks act like AMD support is there because suddenly you can run the most basic LLM workload. Try doing anything actually interesting (i.e., try running anything cool from the mechanistic interpretability or representation/attention-engineering world) with AMD and suddenly everything is broken, nothing works, and you have to spend millions of dollars' worth of AI engineer time trying to salvage a working solution.

Or you can just buy Nvidia.


llama.cpp and its derivatives say yes.


This is the most script kiddy comment I've seen in a while.

llama.cpp is just inference, not training, and the CUDA backend is still the fastest one by far. No one is even close to matching CUDA on either training or inference. The closest is AMD with ROCm, but there's likely a decade of work to be done to be competitive.


Yes, and inference is a huge market in itself and potentially larger than training (gut feeling haven’t run numbers)

Keep NVIDIA for training and Intel/AMD/Cerebras/… for inference.


The funny thing about Cerebras is that it doesn't scale well at all for inference and if you talk to them in person, they are currently making all their money on training workloads.


Inference is still a lot faster on CUDA than on CPU. It's fine if you run it at home or on your laptop for privacy, but if you're serving those models at any scale, you're going to be using GPUs with CUDA.

Inference is also a much smaller market right now, but it will likely overtake training later, as more people will be using the models than competing to train the best one.


NVidia Blackwell is not just a GPU. It's a rack with an interconnect through a custom Nvidia-based network.

And it needs liquid cooling.

You don't just plug in Intel cards 'out of the box'.


You're not wrong, but technically llama.cpp does have training (both raw model training and fine-tuning). And it's been around for a long time. Back around the ggml->gguf switch, I used llama.cpp to train a tiny 0.9B llama 1 through the early, fast part of the loss reduction on 3GB of IRC logs with 64 tokens of context, over about a month. It eventually produced some gpt2-like IRC lines within its very short context.

Would anyone choose llama.cpp's training tools to do serious work? No. Do they exist and work? Yes.


Inference on very large LLMs where model + backprop exceed 48GB is already way faster on a 128GB MacBook than on NVidia unless you have one of those monstrous Hx00s with lots of RAM which most devs don't.


No one is running LLMs on consumer NVidia GPUs or Apple MacBooks.

A dev who wants to run local models probably runs something which just fits on a proper GPU. For everything else, everyone uses an API key from whatever provider, because it's fundamentally faster.

Whether an affordable Intel GPU would be relevantly faster for inference is not clear at all.

A 4090 is at least double the speed of Apple's GPU.


The 4090 is 5x faster than an M3 Max 128GB according to my tests, but it can't even run inference on LLaMA-30B. The moment you hit that memory limit, inference is suddenly 30x slower than on the M3 Max. So a basic GPU with 128GB of RAM would trash the 4090 on those larger LLMs.


Quantized 30B models should run in 24GB VRAM. A quick search found people doing that with good speed: [1]

    I have a 4090, PCIe 3x16, DDR4 RAM.
    
    oobabooga/text-generation-webui
    using exllama
    I can load 30B 4bit GPTQ models and use full 2048 context
    I get 30-40 tokens/s
[1] https://old.reddit.com/r/LocalLLaMA/comments/14gdsxe/optimal...
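
Rough arithmetic for why the 4-bit version fits (the cache/overhead allowance below is an illustrative guess, not a measured number):

    # Rough fit check: 30B parameters at 4-bit quantization on a 24 GB card.
    params = 30e9
    weights_gb = params * 0.5 / 1e9    # 4 bits = 0.5 bytes/param -> ~15 GB
    kv_and_overhead_gb = 3             # illustrative guess for 2048-token context + overhead
    print(weights_gb + kv_and_overhead_gb)   # ~18 GB, under 24 GB
    # At FP16 the same model is ~60 GB of weights alone, which is why it
    # spills out of a 24 GB card without quantization.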


Quantized, sure, but there is some loss of variability in the output that one notices quickly with 30B models. If you want to use the fp16 version, you are out of luck.


Do you have the code for that test?


I ran some variation of llama.cpp that could handle large models by running a portion of them on the GPU and, if too large, the rest on the CPU, and those were the results. Maybe I can dig it up from some computer at home, but it was almost a year ago when I got the M3 Max with 128GB RAM.


Because the CPU has to load parts of the model for every cycle, so you're spending a lot of time on IO, and that offsets the processing.
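
A toy model of that effect (the bandwidth numbers are placeholders, not benchmarks of any particular card):

    # Toy model: layers that fit in VRAM stream at VRAM speed; the overflow is
    # limited by PCIe/system-RAM bandwidth. All numbers are placeholders.
    def tokens_per_sec(model_gb, vram_gb, vram_bw_gb_s=1000.0, spill_bw_gb_s=30.0):
        in_vram = min(model_gb, vram_gb)
        spilled = max(model_gb - vram_gb, 0.0)
        per_token_s = in_vram / vram_bw_gb_s + spilled / spill_bw_gb_s
        return 1.0 / per_token_s

    print(tokens_per_sec(20, 24))   # fits entirely in VRAM: ~50 tokens/s
    print(tokens_per_sec(60, 24))   # 36 GB spilled: well under 1 token/s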

You're talking about completely different things here.

It's fine if you're doing a few requests at home, but if you're actually serving AI models, CUDA is the only reasonable choice other than ASICs.


My comment was about Intel having a starter project, getting an enthusiastic response from devs, building network effects, and iterating from there. They need a way to threaten Nvidia, and just focusing on what they can't do won't get them there. There is one route where they can disturb Nvidia's high end over time, and that's a cheap basic GPU with lots of RAM. Like how 1st-gen Ryzen, whose single-core performance was two generations behind Intel's, trashed Intel by providing 2x as many cores for cheap.


It would be a good idea to start with some basic understanding of GPU, and realizing why this can't easily be done.


That's a question the M3 Max with its integrated GPU already answered. And it's not like I've never done HPC or CUDA work; I'm not completely clueless about how GPUs work, though I haven't written those libraries myself.


What have you implemented in CUDA?


A fraction of CUDA capabilities.


Sufficient for LLMs and image/video gen.


A fraction of what a GPU is used for.


FLUX.1 D generation is about a minute at 20 steps on a 4080, but takes 35 minutes on the CPU.


Yep. Any large GenAI image model (beyond SD 1.5) is hideously slow on Macs irrespective of how much RAM you cram in, whereas I can spit out a 1024x1024 image from the Flux.1 Dev model in ~15 seconds on an RTX 4090.


A 4080 won't do video due to low RAM. The GPU doesn't have to be as fast there; it can be 5x slower, which is still way faster than a CPU. And Intel can iterate from there.


It won't be 5x slower; it would be 20-50x slower if you implemented it as you said.

You can't just "add more RAM" to GPUs and have them work the same way. Memory access is completely different from that on CPUs.


Not even close. Llama.cpp isn't even close to a production-ready LLM inference engine, and it runs overwhelmingly faster when using CUDA.



