A good rule of thumb is that a prefill token costs about 1/6th the compute of a decode token, and that you can get about 15k prefill tokens per second on Llama3 8B on a single H100. Bigger models will require more compute per token; quantization like FP8 or FP4 will require less.
1. Balance / schedule incoming requests to the right backend
2. Model server replicas that can run on multiple hardware topologies
3. Prefix caching hierarchy with well-tested variants for different use cases
So it's a 3-tier architecture. The biggest difference from Dynamo is that llm-d uses the inference gateway extension - https://github.com/kubernetes-sigs/gateway-api-inference-ext... - which brings Kubernetes-owned APIs for managing model routing, request priority and flow control, LoRA support, etc.
That's a good example - I can at least answer about why it's a difference: different target user.
As I understand the Dynamo SDK it is about simplifying and helping someone get started with Dynamo on Kubernetes.
From the user set we work with (large inference deployers) that is not a high priority - they already have mature deployment opinions or a set of tools that would not compose well with something like the Dynamo SDK. Their comfort level with Kubernetes is moderate to high - either they use Kubernetes for high scale training and batch, or they are deploying to many different providers in order to get enough capacity and need a standard orchestration solution.
llm-d focuses on achieving efficiency dynamically at runtime, based on changing traffic or workload on Kubernetes - some of the things the Dynamo SDK encodes are static and upfront, which would conflict with that objective. Also, large deployers running serving typically have significant batch and training workloads as well, and they are looking to maximize capacity use without impacting their prod serving. That requires the orchestrator to know about both workloads at some level - which the Dynamo SDK would make more difficult.
llm-d would make sense if you are running a very large production LLM serving setup - say 5+ full H100 hosts. The aim is to be much more focused than kserve is on exactly the needs of serving LLMs. It would of course be possible to run alongside kserve, but the user we are targeting is not typically a kserve deployer today.
Do you think https://github.com/openai/CLIP could be run on it? LLM makes me think of chatbots, but I suppose because it's inference-based it would work. Somewhat unclear on the difference between LLMs and inference - I think inference is the type of compute LLMs use.
Inference is the process of evaluating a model ("inferring" a response to the inputs). LLMs are uniquely difficult to serve because they push the limits on the hardware.
CPU tracking is provided by the metrics API, which either reads kubelet metrics directly (the original, old, but simplest way) or goes through a metrics adapter that reads metrics from a third-party collector and implements the API.
The behavior is supported by the v1 API rules, so no, it’s extremely unlikely to be moved out.
That said, with GPU workloads gaining steam I wouldn’t be surprised if we added new “supported everywhere” metrics at some point.
Ecosystem support for GPU is very strong and so smaller teams might find TPUs to have a steeper learning curve. Sophisticated users can definitely get wins today on TPU if they can get capacity.
And it’s not as if Nvidia is standing still, so how the strengths of each change over future generations isn’t set either. I.e., TPUs are “simple” matrix multipliers optimized for operation at scale, while GPUs are more general purpose and have strong ecosystem power.
Disclaimer - I work on GKE on enabling AI workloads.
There is a lot of work to make the actual infrastructure and lower level management of lots and lots of GPUs/TPUs open as well - my team focuses on making the infrastructure bit at least a bit more approachable on GKE and Kubernetes.
The actual training is still a bit of a small pool of very experienced people, but it's getting better. And every day serving models gets that much faster - you can often simply draft on Triton and TensorRT-LLM or vLLM and see significant wins month to month.
This principle (always shut down uncleanly) was a significant point of design discussion in Kubernetes, another one of the projects that adapted lessons learned inside Google to the outside world (and had to change as a result).
All of the core services (kubelet, apiserver, etc.) mostly expect to shut down uncleanly, because as a project we needed to handle unclean shutdowns anyway (and could fix bugs when they happened).
But quite a bit of the software run by Kubernetes (both early and today) doesn’t necessarily behave that way - most notably, Postgres in containers in the early days of Docker behaved badly when KILLed (where Linux terminates the process without giving it a chance to react).
So, faced with the expectation that Kubernetes would run a wide range of software where a Google-specific principle didn’t hold and couldn’t be enforced, Kubernetes always (modulo bugs, or helpful contributors regressing under-tested code paths) sends TERM, waits a few seconds, then KILLs.
And the lack of graceful Go http server shutdown (as well as it being hard to do correctly in big complex servers) for many years also made Kube apiservers harder to run in a highly available fashion for most deployers. If you don’t fully control the load balancing infrastructure in front of every server the way Google does (because every large company already has a general load balancer approach built from Apache or nginx or haproxy or F5 or Cisco or …), or enforce that all clients handle all errors gracefully, you tend to prefer draining servers via code rather than letting those errors escape to users. We ended up having to retrofit graceful shutdown onto most of Kube’s server software after the fact, which was more effort than doing it from the beginning.
In a very real sense, Google’s economy of software scale is that it can enforce and drive consistent tradeoffs and principles across multiple projects where making a tradeoff saves effort in multiple domains. That is similar to the design principles in a programming language ecosystem like Go or orchestrator like Kubernetes, but is more extensive.
But those principles are inevitably under-communicated to users (because who reads the docs before picking a programming language to implement a new project in?) and under-enforced by projects (“you must be this tall to operate your own Kubernetes cluster”).
"As part of the evaluation process, on a popular benchmark, HellaSwag (Zellers et al., 2019), we find that an additional hundred finetuning steps on specific website extracts corresponding to the HellaSwag training set (which were not included in Gemini pretraining set) improve the validation accuracy of Gemini Pro to 89.6% and Gemini Ultra to 96.0%, when measured with 1-shot prompting (we measured GPT-4 obtained 92.3% when evaluated 1-shot via the API). This suggests that the benchmark results are susceptible to the pretraining dataset composition. We choose to report HellaSwag decontaminated results only in a 10-shot evaluation setting. We believe there is a need for more robust and nuanced standardized evaluation benchmarks with no leaked data."
Also, I should point out that a set of machines hosting TPUs is referred to as a "pod", which is not the same thing as a Kubernetes pod (also referenced in this doc).
Kubernetes chose "pod" to represent a set of co-scheduled containers, like a "pod of whales". Other systems like Mesos and Google's Borg https://storage.googleapis.com/pub-tools-public-publication-... use "task" to refer to a single container but didn't have a concept for heterogeneous co-scheduled tasks at the time.
Somewhat ironically, it now means TPUs on GKE are confusing because we have TPU hosts organized into "pods", and "pods" for the software using the TPUs.
A Kubernetes pod using a TPU lands on a host which is part of a slice of a TPU pod.
True. The pod’s monotonic and atomic lifecycle across containers is a significant difference, but you can broadly accomplish similar behaviors with an alloc for sharing resources.