
Excuse me for my ignorance, but can someone explain why llama.cpp is so popular? Isn't it possible to port the PyTorch Llama to any environment using ONNX or something?


You can run it on an RPi or any old hardware, limited only by the RAM and your patience. It is a lean code base that's easy to get up and running, and it's designed to be interfaced from any app without sacrificing performance.

They are also innovating (or at least implementing innovations from papers) in different ways to fit bigger models on consumer HW, making them run faster and with better outputs.

PyTorch and other libs (bitsandbytes) can be horrible to set up with correct versions, and updating the repo is painful. PyTorch projects require a hefty GPU or enormous CPU+RAM resources, while llama.cpp is flexible enough to use a GPU but doesn't require one, and it runs smaller models well on any laptop.

ONNX is a generalized ML platform meant to let researchers create new models with ease. Once your model is proven to work, there are many optimizations still left on the table. At least for distributing an application that relies on an LLM, it would be easier to add llama.cpp than ONNX.


In Stable Diffusion land, ONNX performance was not that great compared to ML compilers and some other implementations (at least when I tried it).

Also, llama.cpp is excellent at splitting the workload between the CPU and accelerators, since it's so CPU-focused. You can run a 13B or 33B model on a 6GB GPU and still get some acceleration.
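The split works layer by layer: llama.cpp lets you offload however many whole transformer layers fit in VRAM (the `-ngl` / `--n-gpu-layers` option) and runs the rest on the CPU. A rough sketch of the arithmetic, with hypothetical round numbers for the layer sizes:

```python
# Back-of-the-envelope for llama.cpp-style layer offloading:
# the model is a stack of transformer layers; offload as many
# whole layers as fit in VRAM, run the remainder on the CPU.
# Layer and overhead sizes below are illustrative, not measured.

def layers_that_fit(vram_gib: float, n_layers: int,
                    layer_gib: float, overhead_gib: float = 0.5) -> int:
    """How many whole layers fit in VRAM after reserving overhead."""
    usable = vram_gib - overhead_gib
    return max(0, min(n_layers, int(usable // layer_gib)))

# e.g. a ~4-bit 13B model: assume ~40 layers at ~0.18 GiB each
n_gpu_layers = layers_that_fit(vram_gib=6.0, n_layers=40, layer_gib=0.18)
print(n_gpu_layers)  # most of the stack lands on the GPU
```

With those assumed sizes, a 6 GiB card takes 30 of the 40 layers, so most of the compute still gets accelerated even though the full model doesn't fit.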

Also, as said above, quantization. That is make-or-break. There is no reason to run a 7B model at fp16 when you can run 13B or 30B in the same memory pool at 2-5 bits.
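The memory math behind that claim is simple: weight storage is roughly parameter count times bits per weight. A quick sketch (approximate, since real GGML quant formats add small per-block scale overheads, and runtime also needs KV cache and activation memory):

```python
# Rough memory-footprint arithmetic for quantized LLM weights.
# Ignores quantization block overhead, KV cache, and activations,
# so treat these as lower bounds on real usage.

def weight_memory_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

print(f"7B  @ fp16  : {weight_memory_gib(7, 16):.1f} GiB")
print(f"13B @ 4-bit : {weight_memory_gib(13, 4):.1f} GiB")
print(f"30B @ 4-bit : {weight_memory_gib(30, 4):.1f} GiB")
```

A 13B model at ~4 bits takes less than half the memory of a 7B model at fp16, and even 30B at 4 bits fits in roughly the same pool as 7B fp16, which is why quantization is the deciding factor on consumer hardware.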


Should be similar performance, but the GGML author built it with what he knows best, and the biggest selling point is a single binary.


ONNX doesn't support the same level of quantization as GGML.

So basically GGML will run on hardware with less memory.


Or alternatively, bigger models in the same memory (just quantized harder).


Python Torture Chamber is, of course, an eminently viable tool, but I gather some people prefer a more streamlined toolchain like that of C. Or COBOL.


Llama.cpp runs better than pytorch on a much wider variety of hardware, including mobile phones, Raspberry Pis and more.



