Yes, exactly. You can split the available context into "slots" (chunks) so the server can handle multiple requests concurrently. The number of slots is configurable.
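For llama.cpp's bundled server, the slot count is set with `-np`/`--parallel`, and the total context given by `-c` is divided evenly across the slots:

```shell
# Total context of 8192 tokens split into 4 slots,
# so each concurrent request gets a 2048-token window.
llama-server -m ./model.gguf -c 8192 -np 4
```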
Currently it runs as a single in-memory instance, so there is no state to transfer. HA is on the roadmap; only then will it need some kind of distributed state store.
Agents installed alongside llama.cpp report local state to the load balancer, so instances can be added and removed dynamically; no central configuration is needed.
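A minimal sketch of what such an agent report could look like; the field names and `SlotReport` type are hypothetical, not the actual wire format:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SlotReport:
    """Hypothetical heartbeat an agent sends to the load balancer."""
    instance_id: str
    total_slots: int
    free_slots: int

def heartbeat_payload(report: SlotReport) -> str:
    # The balancer adds or drops an instance based on whether its
    # heartbeats keep arriving, so no central config file is needed.
    return json.dumps(asdict(report))

print(heartbeat_payload(SlotReport("node-1", total_slots=4, free_slots=3)))
```

Because discovery is push-based, scaling out is just starting another llama.cpp instance with its agent; the balancer picks it up on the first heartbeat.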