
There are two main issues that make the KV cache as-is unsuitable for caching across requests. First, it requires the cached tokens to be at exactly the same positions in the sequence, which means it's mostly only useful for autoregressive generation where the prefix is always the same. Second, it is extremely big, so without some sort of compression, the cost of storing it between requests and the time required to transfer the data to the GPU will outweigh any compute savings.
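
To put rough numbers on both points, here is a back-of-the-envelope sketch in Python. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the 25 GB/s host-to-GPU bandwidth are assumptions picked for illustration, not measurements, and the function names are made up for this example.

```python
from typing import Sequence


def reusable_prefix_len(cached: Sequence[int], request: Sequence[int]) -> int:
    """Number of leading token ids shared by the cached and new sequences.

    Only this prefix can reuse cached K/V entries as-is, because each
    entry depends on the token's position and on every token before it.
    """
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """Uncompressed KV cache size: one K and one V vector per layer,
    per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(n_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time to move the cache to the GPU at an assumed sustained bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


# Assumed (hypothetical) configuration: 32 layers, 32 KV heads,
# head_dim 128, fp16 (2 bytes per element), 4096 cached tokens.
size = kv_cache_bytes(32, 32, 128, 4096, 2)
print(f"cache size: {size / 2**30:.1f} GiB")  # 2.0 GiB under these assumptions

# Assumed 25 GB/s effective host-to-GPU bandwidth.
print(f"transfer time: {transfer_seconds(size, 25e9) * 1e3:.0f} ms")

# A change at token index 2 means only the first 2 tokens are reusable.
print(reusable_prefix_len([1, 2, 3, 4], [1, 2, 9, 4]))  # 2
```

Under these assumptions that works out to 0.5 MiB per token. Whether loading it beats recomputing the prefix depends on the hardware and the model, which this sketch doesn't try to model.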


