
It seems like it's token caching, not model caching.

That’s what this is. It caches the state of the model after the prompt tokens have been processed, which reduces latency and cost dramatically. The cache usually has a 5-minute TTL.
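For concreteness, here is a minimal sketch of what explicit, provider-side prompt caching can look like, assuming an Anthropic-style API where you mark a long reusable prefix with cache_control (the model name and prefix text are placeholders; some other providers cache matching prefixes automatically with no API changes):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # A large, reusable prefix: system instructions, documents, few-shot examples, etc.
    long_context = "You are a support agent. Here is the full product manual: ..."

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=256,
        system=[
            {
                "type": "text",
                "text": long_context,
                # Marks the prefix up to this block as cacheable; later requests that
                # send the identical prefix within the TTL are served from the cache.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "Summarize the manual in one sentence."}],
    )
    print(response.content[0].text)

The cache only hits when the marked prefix is byte-identical across requests and the follow-up arrives within the TTL, which is where the ~5 minute figure comes from.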

Interesting! I’m wondering: does caching the model state mean the tokens are no longer directly visible to the model? I.e., if you asked it to print out the input tokens perfectly (assuming there’s no security layer blocking this, and assuming it has no ‘tool’ available to pull in the input tokens), could it do it?

The model state encodes the past tokens (in some lossy way that the model has chosen for itself). You can ask it to try, and assuming its attention is well-trained, it will probably do a pretty good job. Being able to refer to what is in its context window is an important part of predicting the next token, after all.

It makes no difference.

There’s no difference between feeding an LLM the whole prompt and feeding it half the prompt, saving the state, restoring the state, and feeding it the other half of the prompt.

I.e. the data processed by the LLM is prompt P.

P can be composed of any number of segments.

Any number of segments can be cached, as long as all preceding segments are cached.

The final input is P, regardless.

So, tl;dr: yes. Anything you can do with a prompt, you can still do, because it’s just a prompt.
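Here’s a rough sketch of that claim using Hugging Face transformers (not from the thread; "gpt2" is just an arbitrary small model): processing the whole prompt in one pass and processing it in two halves while carrying the key/value cache across the split give the same next-token logits.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompt = "The quick brown fox jumps over the lazy dog because"
    ids = tok(prompt, return_tensors="pt").input_ids
    split = ids.shape[1] // 2

    with torch.no_grad():
        # 1) The full prompt in a single pass.
        full_logits = model(ids).logits[:, -1]

        # 2) First half, "saving the state" (the KV cache)...
        first = model(ids[:, :split], use_cache=True)
        # ...then "restoring" it and feeding the other half.
        second = model(ids[:, split:],
                       past_key_values=first.past_key_values,
                       use_cache=True)
        cached_logits = second.logits[:, -1]

    # Same prediction either way, up to floating-point noise.
    print(torch.allclose(full_logits, cached_logits, atol=1e-4))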


Isn't the state of the model exactly the previously generated text (i.e. the prompt)?

When the prompt is processed, there is an internal key-value cache that gets updated with each token processed, and is ultimately used for inference of the new token. If you process the prompt first and then dump that internal cache, you can effectively resume prompt processing (and thus inference) from that point more or less for free.

https://medium.com/@plienhar/llm-inference-series-3-kv-cachi...
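As a concrete sketch of that dump-and-resume idea with Hugging Face transformers (not the linked article's code; "gpt2" and the prompts are placeholders), you can run a shared prefix through the model once, keep its past_key_values, and reuse them for several different continuations:

    import copy
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prefix = "System: you are a terse assistant.\nUser: "
    prefix_ids = tok(prefix, return_tensors="pt").input_ids

    with torch.no_grad():
        # Process the shared prefix once and keep the internal key/value cache.
        prefix_cache = model(prefix_ids, use_cache=True).past_key_values

    for question in ["What is 2+2?", "Name a prime number."]:
        suffix_ids = tok(question, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(
                suffix_ids,
                # Copy so each request resumes from the pristine prefix cache.
                past_key_values=copy.deepcopy(prefix_cache),
                use_cache=True,
            )
        # Only the suffix tokens were processed; the prefix came "for free".
        print(question, "->", tok.decode(out.logits[0, -1].argmax().item()))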


Can someone explain how to use prompt caching with Llama 4?

Depends on what front end you use. But in text-generation-webui, for example, Prompt Caching is simply a checkbox under the Model tab that you can select before you click "load model".

I basically want to interface with llama.cpp via an API from Node.js
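One option (a sketch, not from the thread): llama.cpp's bundled llama-server exposes an HTTP API whose /completion endpoint takes a cache_prompt flag, so the server reuses its KV cache for a matching prompt prefix (check the server README for the exact options in your build). Shown here in Python for illustration, assuming the server is already running on the default port 8080 with a placeholder prompt; the same JSON POST works from Node.js with fetch().

    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={
            "prompt": "You are a helpful assistant.\nUser: What is prompt caching?\nAssistant:",
            "n_predict": 128,
            "cache_prompt": True,  # reuse the server-side KV cache for a matching prefix
        },
        timeout=120,
    )
    print(resp.json()["content"])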

What are some of the best coding models that run locally today? Do they have prompt caching support?
