Performance of 70B models is on the order of one token every few seconds, and that's with the whole model fitting in system RAM, not swap. It's interesting because some of the larger models are quite good, but they're too slow to be practical for most use cases.
The Mixtral models run surprisingly well. They can run at better than 1 token per second, depending on quantization. Still slow, but approaching a more practical level of usefulness.
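For a rough sense of what "depending on quantization" means in practice, here's a minimal sketch using llama-cpp-python to load a quantized Mixtral GGUF and measure tokens per second. The model filename, quantization level (Q4_K_M), thread count, and prompt are all placeholders, not a recommendation; substitute whatever file and settings you actually have.

```python
# Minimal sketch: run a quantized Mixtral GGUF on CPU and measure throughput.
# Assumes llama-cpp-python is installed and a GGUF file exists at the given path.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mixtral-8x7b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_ctx=2048,   # context window
    n_threads=8,  # CPU threads; tune to your machine
)

prompt = "Explain the difference between RAM and swap in one paragraph."

start = time.time()
output = llm(prompt, max_tokens=128)
elapsed = time.time() - start

generated = output["choices"][0]["text"]
n_tokens = output["usage"]["completion_tokens"]
print(generated)
print(f"~{n_tokens / elapsed:.2f} tokens/sec")
```

Lower quantization levels (e.g. Q3 or Q4 variants) trade some quality for less memory and faster CPU inference, which is where the "better than 1 token per second" range tends to come from on ordinary hardware.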
Though if you're planning on accomplishing real work with LLMs, the practical solution for most people is probably to rent a GPU in the cloud.