No, with limited VRAM you could offload the model partially, splitting it across CPU and GPU. And since the CPU side has swap, you could run the absolute largest model. It's just really, really slow.
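Roughly what I mean, as a sketch with llama-cpp-python (the model path and layer count are placeholders, not a recommendation; tune n_gpu_layers to whatever fits your VRAM):

```python
# Sketch of partial GPU offload with llama-cpp-python.
# Model path and layer count below are hypothetical placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-r1-32b-q4_k_m.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,  # layers that fit in VRAM go to the GPU...
    n_ctx=4096,       # ...the rest stay in system RAM (and swap, worst case)
)

out = llm(
    "Explain the difference between unified and discrete GPU memory.",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```

Anything that doesn't fit in the 10 GB card spills to system RAM, and from there to swap, so it always "runs" in the technical sense.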
The difference between Deepseek-r1:70b (edit: actually 32b) running on an M4 Pro (48 GB unified RAM, 14 CPU cores, 20 GPU cores) and on an AMD box (64 GB DDR4, 16-core 5950X, RTX 3080 with 10 GB of VRAM) is more than a factor of 2.
The M4 Pro was able to answer the test prompt twice (once on battery and once on mains power) before the AMD box was able to finish processing.
The M4's prompt evaluation took significantly longer, but its token generation was significantly faster.
Having the memory close to the cores that matter makes a big difference.
You're adding detail that's not relevant to anything I said. I was saying this statement:
> VRAM is what takes a model from "can not run at all" to "can run" (even if slowly), hence the emphasis.
is false. Regardless of how much VRAM you have, if the criterion is "can run, even if slowly", then every machine can run every model, because you have swap. It's unusably slow, but that's not the difference OP was claiming.
The purchase criterion for anybody actually trying to use it is "runs slowly but acceptably" vs. "runs so slowly as to be unusable".
My memory was wrong; it was the 32b. I'm running the 70b against a similar prompt now, and the 5950X is probably going to take over an hour for what the M4 managed in about 7 minutes.
edit: an hour later and the 5950X isn't even done thinking yet. Token generation is, generously, around 1 token/s.
edit edit: final statistics. The M4 Pro managed 4 tokens/s prompt eval and 4.8 tokens/s token generation; the 5950X managed 150 tokens/s prompt eval and 1 token/s generation.
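As a back-of-the-envelope illustration of why the 5950X's fast prompt eval doesn't save it (the prompt and response sizes below are made-up placeholders; R1's actual thinking length varied between runs):

```python
# Rough wall-clock estimate from the measured rates above.
# Prompt/response token counts are hypothetical placeholders.
prompt_tokens, output_tokens = 600, 2000

machines = {
    "M4 Pro":        {"prompt_tps": 4,   "gen_tps": 4.8},
    "5950X/RTX3080": {"prompt_tps": 150, "gen_tps": 1.0},
}

for name, r in machines.items():
    seconds = prompt_tokens / r["prompt_tps"] + output_tokens / r["gen_tps"]
    print(f"{name}: ~{seconds / 60:.0f} minutes")
# At these made-up sizes: M4 Pro ~9 minutes, 5950X ~33 minutes.
# Generation speed dominates once the response gets long.
```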
Perceptually, I can live with the M4's performance: it's a set-the-prompt, do-something-else, come-back sort of thing. The 5950X/RTX 3080 is too slow to be even remotely usable with the 70b-parameter model.