No, with limited VRAM you could offload the model partially, splitting it across CPU and GPU. And since the CPU side has swap, you could run the absolute largest model. It's just really, really slow.
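Roughly what I mean, as a sketch with llama-cpp-python (the model path and layer count are placeholders, not a recommendation; tune n_gpu_layers to whatever fits your VRAM):

```python
# Sketch of partial GPU offload with llama-cpp-python.
# Model path and layer count below are hypothetical placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-r1-32b-q4_k_m.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,  # layers that fit in VRAM go to the GPU...
    n_ctx=4096,       # ...the rest stay in system RAM (and swap, worst case)
)

out = llm(
    "Explain the difference between unified and discrete GPU memory.",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```

Anything that doesn't fit in the 10 GB card spills to system RAM, and from there to swap, so it always "runs" in the technical sense.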
The difference between Deepseek-r1:70b (edit: actually 32b) running on an M4 Pro (48 GB unified RAM, 14 CPU cores, 20 GPU cores) and on an AMD box (64 GB DDR4, 16-core 5950X, RTX 3080 with 10 GB of VRAM) is more than a factor of 2.
The M4 Pro was able to answer the test prompt twice (once on battery and once on mains power) before the AMD box was able to finish processing.
The M4's prompt evaluation took significantly longer, but its token generation was significantly faster.
Having the memory close to the cores that matter makes a big difference.
You're adding detail that's not relevant to anything I said. I was saying this statement:
> VRAM is what takes a model from "can not run at all" to "can run" (even if slowly), hence the emphasis.
is false. Regardless of how much VRAM you have, if the criterion is "can run, even if slowly", then every machine can run every model, because you have swap. It's unusably slow, but that's not the difference OP was claiming.
The purchase criterion for anybody actually trying to use it is "runs slowly but acceptably" vs. "runs so slowly as to be unusable".
My memory was wrong; it was the 32b. I'm running the 70b against a similar prompt now, and the 5950X is probably going to take over an hour for what the M4 managed in about 7 minutes.
edit: an hour later and the 5950X isn't even done thinking yet. Token generation is, generously, around 1 token/s.
edit edit: final statistics. The M4 Pro managed 4 tokens/s prompt eval and 4.8 tokens/s token generation; the 5950X managed 150 tokens/s prompt eval and 1 token/s generation.
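As a back-of-the-envelope illustration of why the 5950X's fast prompt eval doesn't save it (the prompt and response sizes below are made-up placeholders; R1's actual thinking length varied between runs):

```python
# Rough wall-clock estimate from the measured rates above.
# Prompt/response token counts are hypothetical placeholders.
prompt_tokens, output_tokens = 600, 2000

machines = {
    "M4 Pro":        {"prompt_tps": 4,   "gen_tps": 4.8},
    "5950X/RTX3080": {"prompt_tps": 150, "gen_tps": 1.0},
}

for name, r in machines.items():
    seconds = prompt_tokens / r["prompt_tps"] + output_tokens / r["gen_tps"]
    print(f"{name}: ~{seconds / 60:.0f} minutes")
# At these made-up sizes: M4 Pro ~9 minutes, 5950X ~33 minutes.
# Generation speed dominates once the response gets long.
```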
Perceptually, I can live with the M4's performance: it's a set-the-prompt, do-something-else, come-back sort of thing. The 5950X/RTX 3080 is too slow to be even remotely usable with the 70b-parameter model.