llama.cpp is great; if it fits your needs, use it. At this point I think llama.cpp is effectively a platform that's been hardened for production.
In its current form, I think of gemma.cpp as more of a direct model implementation (somewhere between the minimalism of llama2.c and the generality of ggml).
I tend to think of 3 modes of usage:
- hacking on inference internals - there's very little indirection and no IRs; the model is just code, so if you want to add your own runtime support for sparsity/quantization/model compression/etc. and demo it working with gemma, there are minimal barriers to doing so
- implementing experimental frontends - I'll add some examples of this in the very near future, but you're free to get pretty creative with terminal UIs, code that interacts with model internals like the KV cache, accepting/rejecting tokens, etc. (see the sketch after this list)
- interacting with the model locally with a small program - of course there are other options for this, but hopefully this is one way to play with gemma with minimal fuss
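To give a flavor of the frontend angle: generation is driven by callbacks, so a custom frontend is mostly a small main() around the generate call. The sketch below is approximate, not copy-paste code - the real constructor and GenerateGemma take extra arguments (loader/inference args, thread pools, verbosity), and names like gcpp::Path and gcpp::Model::GEMMA_2B are from memory, so treat the exact signatures as assumptions and check gemma.h / run.cc for the current API.

```cpp
// Rough sketch of a tiny custom frontend. Signatures are approximate
// and will drift; see gemma.h / run.cc for the real API.
#include <cstdio>
#include <random>
#include <string>
#include <vector>

#include "gemma.h"  // gcpp::Gemma, gcpp::GenerateGemma (assumed names)
#include "hwy/contrib/thread_pool/thread_pool.h"

int main() {
  hwy::ThreadPool pool(/*num_threads=*/8);

  // Assumed constructor shape: tokenizer + compressed weights + model type.
  gcpp::Gemma model(gcpp::Path{"tokenizer.spm"}, gcpp::Path{"2b-it-sfp.sbs"},
                    gcpp::Model::GEMMA_2B, pool);

  std::vector<int> prompt;
  model.Tokenizer()->Encode("Tell me a story.", &prompt);

  // stream_token sees each token as it's generated - this is where a
  // terminal UI, logger, or streaming server would hook in.
  auto stream_token = [&](int token, float /*prob*/) {
    std::string piece;
    model.Tokenizer()->Decode(std::vector<int>{token}, &piece);
    std::fputs(piece.c_str(), stdout);
    std::fflush(stdout);
    return true;  // returning false stops generation early
  };

  // accept_token lets you veto candidate tokens - the hook you'd reach
  // for in constrained decoding or accept/reject experiments.
  auto accept_token = [](int /*token*/) { return true; };

  std::mt19937 gen(std::random_device{}());
  // Extra args (inference config, inner pool, verbosity) elided here.
  gcpp::GenerateGemma(model, /*max_generated_tokens=*/256, prompt,
                      /*start_pos=*/0, pool, stream_token, accept_token, gen);
  return 0;
}
```

The point of the two callbacks is that frontend experiments don't have to touch the transformer code at all: as I understand it, everything from a fancy TUI to token-level accept/reject logic can live entirely on the caller's side of that boundary.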