A good rule of thumb is that a prefill token costs about 1/6th the compute of a decode token, and that you can get about 15k prefill tokens per second on Llama3 8B on a single H100. Bigger models will require more compute per token; quantization like FP8 or FP4 will require less.
1. Balance / schedule incoming requests to the right backend
2. Model server replicas that can run on multiple hardware topologies
3. Prefix caching hierarchy with well-tested variants for different use cases
So it's a 3-tier architecture. The biggest difference from Dynamo is that llm-d uses the inference gateway extension - https://github.com/kubernetes-sigs/gateway-api-inference-ext... - which brings Kubernetes-owned APIs for managing model routing, request priority and flow control, LoRA support, etc.
That's a good example - I can at least answer about why it's a difference: different target user.
As I understand the Dynamo SDK it is about simplifying and helping someone get started with Dynamo on Kubernetes.
From the user set we work with (large inference deployers) that is not a high priority - they already have mature deployment opinions or a set of tools that would not compose well with something like the Dynamo SDK. Their comfort level with Kubernetes is moderate to high - either they use Kubernetes for high scale training and batch, or they are deploying to many different providers in order to get enough capacity and need a standard orchestration solution.
llm-d focuses on achieving efficiency dynamically at runtime, based on changing traffic or workload on Kubernetes - some of the things the Dynamo SDK encodes are static and upfront, which would conflict with that objective. Also, large deployers running serving typically have significant batch and training workloads as well, and they are looking to maximize capacity use without impacting their prod serving. That requires the orchestrator to know about both workloads at some level - which the Dynamo SDK would make more difficult.
llm-d would make sense if you are running a very large production LLM serving setup - say 5+ full H100 hosts. The aim is to be much more focused than kserve is on exactly the needs of serving LLMs. It would of course be possible to run alongside kserve, but the user we are targeting is not typically a kserve deployer today.
Do you think https://github.com/openai/CLIP could be run on it? LLM makes me think of chatbots, but I suppose because it's inference-based it would work. Somewhat unclear on the difference between LLMs and inference - I think inference is the type of compute LLMs use.
Inference is the process of evaluating a model ("inferring" a response to the inputs). LLMs are uniquely difficult to serve because they push the limits on the hardware.
CPU tracking is provided by the metrics API, which either reads kubelet metrics directly (the original, old, but simplest way) or goes through a metrics adapter that reads metrics from a third-party collector and implements the API.
The behavior is supported by the v1 API rules, so no, it’s extremely unlikely to be moved out.
That said, with GPU workloads gaining steam I wouldn’t be surprised if we added new “supported everywhere” metrics at some point.
Ecosystem support for GPU is very strong and so smaller teams might find TPUs to have a steeper learning curve. Sophisticated users can definitely get wins today on TPU if they can get capacity.
And it’s not as if Nvidia is standing still, so how the strengths of each change over future generations isn’t set either. I.e., TPUs are “simple” matrix multipliers optimized for operation at scale, while GPUs are more general purpose and have strong ecosystem power.
Disclaimer - I work on GKE on enabling AI workloads.
There is a lot of work to make the actual infrastructure and lower level management of lots and lots of GPUs/TPUs open as well - my team focuses on making the infrastructure bit at least a bit more approachable on GKE and Kubernetes.
The actual training is still a bit of a small pool of very experienced people, but it's getting better. And every day serving models gets that much faster - you can often simply draft on Triton and TensorRT-LLM or vLLM and see significant wins month to month.
This principle (always shut down uncleanly) was a significant point of design discussion in Kubernetes, another one of the projects that adapted lessons learned inside Google to the outside world (and had to change as a result).
All of the core services (kubelet, apiserver, etc.) mostly expect to shut down uncleanly, because as a project we needed to handle unclean shutdowns anyway (and could fix bugs when they happened).
But quite a bit of the software run by Kubernetes (both early and today) doesn’t necessarily behave that way - most notably, Postgres in containers in the early days of Docker behaved badly when KILLed (where Linux terminates the process without giving it a chance to react).
So, faced with the expectation that Kubernetes would run a wide range of software where a Google-specific principle didn’t hold and couldn’t be enforced, Kubernetes always (modulo bugs, or helpful contributors regressing under-tested code paths) sends TERM, waits a few seconds, then KILLs.
And the lack of graceful Go http server shutdown (as well as it being hard to do correctly in big complex servers) for many years also made Kube apiservers harder to run in a highly available fashion for most deployers. If you don’t fully control the load balancing infrastructure in front of every server the way Google does (because every large company already has a general load balancer approach built from Apache or nginx or haproxy or F5 or Cisco or …), or enforce that all clients handle all errors gracefully, you tend to prefer draining servers via code rather than letting those errors escape to users. We ended up having to retrofit graceful shutdown onto most of Kube’s server software after the fact, which was more effort than doing it from the beginning.
In a very real sense, Google’s economy of software scale is that it can enforce and drive consistent tradeoffs and principles across multiple projects where making a tradeoff saves effort in multiple domains. That is similar to the design principles in a programming language ecosystem like Go or orchestrator like Kubernetes, but is more extensive.
But those principles are inevitably under-communicated to users (because who reads the docs before picking a programming language to implement a new project in?) and under-enforced by projects (“you must be this tall to operate your own Kubernetes cluster”).
"As part of the evaluation process, on a popular benchmark, HellaSwag (Zellers et al., 2019), we find that an additional hundred finetuning steps on specific website extracts corresponding to the HellaSwag training set (which were not included in Gemini pretraining set) improve the validation accuracy of Gemini Pro to 89.6% and Gemini Ultra to 96.0%, when measured with 1-shot prompting (we measured GPT-4 obtained 92.3% when evaluated 1-shot via the API). This suggests that the benchmark results are susceptible to the pretraining dataset composition. We choose to report HellaSwag decontaminated results only in a 10-shot evaluation setting. We believe there is a need for more robust and nuanced standardized evaluation benchmarks with no leaked data."
Also, I should point out that a set of machines hosting TPUs is referred to as a "pod", which is not the same thing as a Kubernetes pod (also referenced in this doc).
Kubernetes chose "pod" to represent a set of co-scheduled containers, like a "pod of whales". Other systems like Mesos and Google's Borg https://storage.googleapis.com/pub-tools-public-publication-... use "task" to refer to a single container but didn't have a concept for heterogeneous co-scheduled tasks at the time.
Somewhat ironically, it now means TPUs on GKE are confusing because we have TPU hosts organized into "pods", and "pods" for the software using the TPUs.
A Kubernetes pod using a TPU lands on a host which is part of a slice of a TPU pod.
True. The pod’s monotonic and atomic lifecycle across containers is a significant difference, but you can broadly accomplish similar behaviors with an alloc for sharing resources.