The backends du jour are either llama.cpp frontends (I use Kobold.cpp at the moment) or oobabooga as the guide specifies, but with the ExLlamaV2 backend.
If you are serving a bunch of people, run a vLLM backend instead since it supports batching, and host it on the Horde if you are feeling super nice: https://lite.koboldai.net/#
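For reference, spinning up vLLM's OpenAI-compatible server looks roughly like this (the model name and port are just example placeholders, and flags change between releases, so check the vLLM docs):

```shell
# Install vLLM (needs a CUDA-capable GPU); flags below are a sketch, not gospel
pip install vllm

# Launch the OpenAI-compatible API server -- continuous batching is on by default,
# so concurrent users' requests get batched together automatically.
# Model name and port are example placeholders.
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.1 \
    --port 8000
```

Any OpenAI-style client can then point at `http://localhost:8000/v1` and you get batched throughput for free.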
Technically only vLLM will work with this new model at the moment, but I'm sure cpp/ooba support will be added within days.
This comment will probably be obsolete within a month, when llama.cpp gets batching, MLC gets a better frontend, or some other breakthrough happens :P