
It seems like it's token caching, not model caching.

That’s what this is. It caches the state of the model after the prompt tokens have been processed, which reduces latency and cost dramatically. The cache usually has a 5-minute TTL.
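For concreteness, here is a minimal sketch of what explicit, provider-side prompt caching can look like, assuming an Anthropic-style API where you mark a long reusable prefix with cache_control (the model name and prefix text are placeholders; some other providers cache matching prefixes automatically with no API changes):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # A large, reusable prefix: system instructions, documents, few-shot examples, etc.
    long_context = "You are a support agent. Here is the full product manual: ..."

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=256,
        system=[
            {
                "type": "text",
                "text": long_context,
                # Marks the prefix up to this block as cacheable; later requests that
                # send the identical prefix within the TTL are served from the cache.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "Summarize the manual in one sentence."}],
    )
    print(response.content[0].text)

The cache only hits when the marked prefix is byte-identical across requests and the follow-up arrives within the TTL, which is where the ~5 minute figure comes from.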

Interesting! I’m wondering: does caching the model state mean the tokens are no longer directly visible to the model? I.e., if you asked it to print out the input tokens perfectly (assuming there’s no security layer blocking this, and assuming it has no ‘tool’ available to pull in the input tokens), could it do it?

The model state encodes the past tokens (in some lossy way that the model has chosen for itself). You can ask it to try, and assuming its attention is well-trained, it will probably do a pretty good job. Being able to refer to what is in its context window is an important part of predicting the next token, after all.

It makes no difference.

There’s no difference between feeding an LLM the whole prompt and feeding it half the prompt, saving the state, restoring the state, and feeding it the other half of the prompt.

I.e. the data processed by the LLM is prompt P.

P can be composed of any number of segments.

Any number of segments can be cached, as long as all preceding segments are cached.

The final input is P, regardless.

So, tl;dr: yes. Anything you can do with a prompt, you can still do, because it’s just a prompt.
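Here’s a rough sketch of that claim using Hugging Face transformers (not from the thread; "gpt2" is just an arbitrary small model): processing the whole prompt in one pass and processing it in two halves while carrying the key/value cache across the split give the same next-token logits.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompt = "The quick brown fox jumps over the lazy dog because"
    ids = tok(prompt, return_tensors="pt").input_ids
    split = ids.shape[1] // 2

    with torch.no_grad():
        # 1) The full prompt in a single pass.
        full_logits = model(ids).logits[:, -1]

        # 2) First half, "saving the state" (the KV cache)...
        first = model(ids[:, :split], use_cache=True)
        # ...then "restoring" it and feeding the other half.
        second = model(ids[:, split:],
                       past_key_values=first.past_key_values,
                       use_cache=True)
        cached_logits = second.logits[:, -1]

    # Same prediction either way, up to floating-point noise.
    print(torch.allclose(full_logits, cached_logits, atol=1e-4))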


Isn't the state of the model exactly the previously generated text (i.e. the prompt)?

When the prompt is processed, there is an internal key-value cache that gets updated with each token processed, and is ultimately used for inference of the new token. If you process the prompt first and then dump that internal cache, you can effectively resume prompt processing (and thus inference) from that point more or less for free.

https://medium.com/@plienhar/llm-inference-series-3-kv-cachi...
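As a concrete sketch of that dump-and-resume idea with Hugging Face transformers (not the linked article's code; "gpt2" and the prompts are placeholders), you can run a shared prefix through the model once, keep its past_key_values, and reuse them for several different continuations:

    import copy
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prefix = "System: you are a terse assistant.\nUser: "
    prefix_ids = tok(prefix, return_tensors="pt").input_ids

    with torch.no_grad():
        # Process the shared prefix once and keep the internal key/value cache.
        prefix_cache = model(prefix_ids, use_cache=True).past_key_values

    for question in ["What is 2+2?", "Name a prime number."]:
        suffix_ids = tok(question, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(
                suffix_ids,
                # Copy so each request resumes from the pristine prefix cache.
                past_key_values=copy.deepcopy(prefix_cache),
                use_cache=True,
            )
        # Only the suffix tokens were processed; the prefix came "for free".
        print(question, "->", tok.decode(out.logits[0, -1].argmax().item()))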


Can someone explain how to use prompt caching with Llama 4?

Depends on what front end you use. But in text-generation-webui, for example, Prompt Caching is simply a checkbox under the Model tab that you can select before you click "load model".

I basically want to interface with llama.cpp via an API from Node.js
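One option (a sketch, not from the thread): llama.cpp's bundled llama-server exposes an HTTP API whose /completion endpoint takes a cache_prompt flag, so the server reuses its KV cache for a matching prompt prefix (check the server README for the exact options in your build). Shown here in Python for illustration, assuming the server is already running on the default port 8080 with a placeholder prompt; the same JSON POST works from Node.js with fetch().

    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={
            "prompt": "You are a helpful assistant.\nUser: What is prompt caching?\nAssistant:",
            "n_predict": 128,
            "cache_prompt": True,  # reuse the server-side KV cache for a matching prefix
        },
        timeout=120,
    )
    print(resp.json()["content"])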

What are some of the best coding models that run locally today? Do they have prompt caching support?
