
Arguably the distinction between a .gguf file and a .gguf file with a llama.cpp runner slapped in front of it is negligible. But it does raise an interesting point the article glosses over:
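
To make the "runner slapped in front" point concrete, here is a minimal sketch using llama.cpp's bundled llama-server, which exposes an OpenAI-compatible HTTP endpoint (the model path and port here are placeholders):

    # serve a local GGUF model behind an OpenAI-compatible HTTP API
    ./llama-server -m ./model.gguf --port 8080

    # query it like any hosted endpoint
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello"}]}'

That really is all it takes for one user on one machine, which is exactly why the interesting part is everything else: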

There is a lot happening between a model file sitting on disk and serving it as an API with an attached playground, billing, abuse handling, etc., all while carrying the load of thousands or millions of users calling these incredibly demanding programs. It takes a lot of clever software and good hardware, right down to acquiring buildings and dealing with the order backlog for backup diesel generators.

Improvements in that layer were a large part of what allowed OpenAI to go from the relative obscurity of GPT-3.5 to generating massive hype with a ChatGPT anyone could try at a whim. As a more recent example, x.ai seems to be struggling with that layer a lot right now. Grok 3 is pretty good, but has almost daily partial outages. The promised 1M-context model never rolls out; instead, on some days the served context size is even less than the usual 64k. And they haven't even started making it available via the API.

All of this will become easy once everyone can run powerful LLMs on their own device, but for now, just having a 400B-parameter model sitting on your hard drive doesn't get your business very far.



