I'm pretty much certain the cost of training and running large LLMs is going to come down, because it's only a matter of time before truly customized chips come out for these.
GPUs really aren't that. They're massively parallel vector processors that turn out to be generally better than CPUs at running these models, but they're still not the ideal chip for LLMs. That would be an even more specialized parallel processor where almost all the silicon is dedicated to exactly the types of operations used in large LLMs, one that natively supports quantization formats like those found in the ggml/llama.cpp world. Being able to natively run and train on those formats would let gigantic 100B+ models run with more reasonable amounts of RAM, and at higher speed, since inference is largely memory-bandwidth-bound and smaller weights mean fewer bytes to stream per token.
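To make the memory/bandwidth point concrete, here's a back-of-the-envelope sketch: weight storage for a 100B-parameter model at different bit widths, and the resulting ceiling on single-stream decode speed (each generated token has to stream essentially all the weights through memory once). The 800 GB/s bandwidth figure is an illustrative assumption, not any particular chip's spec.

```python
# Back-of-the-envelope: weight memory and bandwidth-bound decode speed
# for a 100B-parameter model at different quantization levels.
# All hardware numbers here are illustrative assumptions, not vendor specs.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory needed just for the weights, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def decode_tokens_per_sec(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream decode speed: every generated
    token must read essentially all weights from memory once."""
    return bandwidth_gb_s / weight_gb

if __name__ == "__main__":
    bandwidth = 800.0  # GB/s -- assumed accelerator memory bandwidth
    for bits in (16, 8, 4):
        gb = weights_gb(100, bits)
        tps = decode_tokens_per_sec(gb, bandwidth)
        print(f"{bits:>2}-bit: {gb:6.1f} GB weights, ~{tps:5.1f} tok/s ceiling")
```

Going from fp16 to 4-bit cuts the weight footprint from 200 GB to 50 GB and, for the same memory bandwidth, roughly quadruples the decode-speed ceiling, which is why native support for these formats matters so much.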
These chips, when they arrive, will be a lot cheaper than GPUs when compared in dollars per LLM performance. They'll be available for rent in the cloud and for purchase as accelerators.
I'd be utterly shocked if lots of chip companies don't have projects working on these chips, since at this point it's clear that LLMs are going to become a permanent fixture of computing.
It takes a while to design an ASIC and it hasn't been that long since the hype wave really arrived for these things. I would bet on LLM chips showing up in 2024-2025.
There's also a run on foundries right now which might delay things further. The new US foundries being built under the CHIPS Act in Arizona and Ohio probably won't be online until 2025.
Ah, that's an excellent point about the foundries, I hadn't considered that. Also, on reflection, I'm being pretty handwavy with the "as soon as it was determined there was real money involved" timeline.
Honestly, I wouldn't be surprised to learn that the gap between "big commercial interest" and "custom chips appear" turns out to be pretty similar in both cases.