llama.cpp and its derivatives say yes.

m00x · 2024-12-03T19:17:50 1733253470

This is the most script kiddy comment I've seen in a while.

llama.cpp is just inference, not training, and the CUDA backend is still the fastest one by far. No one is even close to matching CUDA on either training or inference. The closest is AMD with ROCm, but there's likely a decade of work to be done to be competitive.

yumraj · 2024-12-03T19:24:43 1733253883

Yes, and inference is a huge market in itself and potentially larger than training (gut feeling haven’t run numbers)

Keep NVIDIA for training and Intel/AMD/Cerebras/… for interference.

latchkey · 2024-12-03T21:09:56 1733260196

The funny thing about Cerebras is that it doesn't scale well at all for inference and if you talk to them in person, they are currently making all their money on training workloads.

m00x · 2024-12-03T19:35:24 1733254524

Inference is still a lot faster on CUDA than on CPU. It's fine if you run it at home or on your laptop for privacy, but if you're serving those models at any scale, you're going to be using GPUs with CUDA.

Inference is also a much smaller market right now, but will likely be overtaken later as we have more people using the models than competing to train the best one.

Muskyinhere · 2024-12-03T19:32:53 1733254373

NVidia Blackwell is not just a GPU. Its a Rack with a interconnect through a custom Nvidia based Network.

And it needs liquid cooling.

You don't just plugin intel cards 'out of the box'.

superkuh · 2024-12-04T02:12:44 1733278364

You're not wrong, but technically llama.cpp does have training (both raw model and fine tuning). And it's been around for a long time. Back around the ggml->gguf switch I used llama.cpp to train a tiny 0.9B llama 1 through the early fast parts of the loss reduction on 3GB of IRC logs with 64 tokens of context over about a month. It eventually produced some gpt2-like IRC lines within it's very short context.

Would anyone choose llama.cpp's training tools to do serious work? No. Do they exist and work, yes.

treprinum · 2024-12-03T19:20:57 1733253657

Inference on very large LLMs where model + backprop exceed 48GB is already way faster on a 128GB MacBook than on NVidia unless you have one of those monstrous Hx00s with lots of RAM which most devs don't.

Muskyinhere · 2024-12-03T19:42:15 1733254935

No one is running LLMs on consumer NVidia GPUs or apple MacBooks.

A dev, if they want to run local models, probably run something which just fits on a proper GPU. For everything else, everyone uses an API key from whatever because its fundamentaly faster.

IF a affordable intel GPU would be relevant faster for inferencing, is not clear at all.

A 4090 is at least double the speed of Apples GPU.

treprinum · 2024-12-03T19:49:29 1733255369

4090 is 5x faster than M3 Max 128GB according to my tests but it can't even inference LLaMA-30B. The moment you hit that memory limit the inference is suddenly 30x slower than M3 Max. So a basic GPU with 128GB RAM would trash 4090 on those larger LLMs.

skirmish · 2024-12-04T06:49:38 1733294978

Quantized 30B models should run in 24GB VRAM. A quick search found people doing that with good speed: [1]

    I have a 4090, PCIe 3x16, DDR4 RAM.
    
    oobabooga/text-generation-webui
    using exllama
    I can load 30B 4bit GPTQ models and use full 2048 context
    I get 30-40 tokens/s

[1] https://old.reddit.com/r/LocalLLaMA/comments/14gdsxe/optimal...

treprinum · 2024-12-04T09:41:38 1733305298

Quantized sure but there is some loss of variability of the output one can notice quickly with 30B models. If you want to use the fp16 version you are out of luck.

m00x · 2024-12-03T21:55:11 1733262911

Do you have the code for that test?

treprinum · 2024-12-03T22:10:34 1733263834

I ran some variation of llama.cpp that could handle large models by running portion of them on GPU and if too large, the rest on CPU and those were the results. Maybe I can dig it from some computer at home but it was almost like a year ago when I got M3 Max with 128GB RAM.

m00x · 2024-12-03T19:31:29 1733254289

Because the CPU has to load the model in parts for every cycle so you're spending a lot of time on IO and it offsets processing.

You're talking about completely different things here.

It's fine if you're doing a few requests at home, but if you're actually serving AI models, CUDA is the only reasonable choice other than ASICs.

treprinum · 2024-12-03T20:05:50 1733256350

My comment was about Intel having a starter project, getting enthusiastic response from devs, network effects and iterate from there. They need a way to threaten Nvidia and just focusing on what they can't do won't bring them there. There is one route where they can disturb Nvidia's high end over time and that's a cheap basic GPU with lots of RAM. Like Ryzen 1st gen whose single core performance was two generations behind Intel trashed Intel by providing 2x as many cores for cheap.

m00x · 2024-12-03T21:53:38 1733262818

It would be a good idea to start with some basic understanding of GPU, and realizing why this can't easily be done.

treprinum · 2024-12-03T22:09:02 1733263742

That's a question M3 Max with its internal GPU already answered. It's not like I didn't do any HPC or CUDA work in the past to be completely clueless about how GPUs work though I haven't created those libraries myself.

m00x · 2024-12-04T07:06:49 1733296009

What have you implemented in CUDA?

pjmlp · 2024-12-03T19:16:22 1733253382

A fraction of CUDA capabilities.

treprinum · 2024-12-03T19:17:44 1733253464

Sufficient for LLMs and image/video gen.

pjmlp · 2024-12-03T21:07:32 1733260052

A fraction of what a GPU is used for.

m00x · 2024-12-03T19:22:02 1733253722

FLUX.1 D generation is about a minute at 20 steps on a 4080, but takes 35 minutes on the CPU.

vunderba · 2024-12-04T01:09:14 1733274554

Yep. Any large GenAI image model (beyond SD 1.5) is hideously slow on Mac's irrespective of how much RAM you cram in - whereas I can spit out a 1024x1024 image from Flux.1 Dev model in ~15 seconds on a RTX 4090.

treprinum · 2024-12-03T19:59:46 1733255986

4080 won't do video due to low RAM. The GPU doesn't have to be as fast there, it can be 5x slower which is still way faster than a CPU. And Intel can iterate from there.

m00x · 2024-12-03T21:52:45 1733262765

It won't be 5x slower, it would be 20-50x slower if you would implement it as you said.

You can't just "add more ram" to GPUs and have them work the same way. Memory access is completely different than on CPUs.

Der_Einzige · 2024-12-03T20:52:52 1733259172

Not even close. Llama.cpp isn't even close to a production ready LLM inference engine, and it runs overwhelmingly faster when using CUDA