> Automatically detecting what should be prefix cached? Yuck!

Why don't you like that? I absolutely love it.

I meant that this is the only way you can "control" prefix caching with OpenAI. I consider explicit control a serious feature - if I were building an application around prefix caching I would not consider OpenAI at all, since I can't control what gets cached or for how long.

Wouldn't you want to give more power to the developer? Prefix caching seems like an important enough concept to leak to the end user.


Gemini's approach to prefix caching requires me to pay per hour for keeping the cache populated. I have to do pretty sophisticated price modeling and load prediction to use that effectively.
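
For example (a back-of-envelope sketch; the per-token and per-hour prices below are made-up placeholders, not Google's actual rates), whether an explicit cache pays off depends entirely on request rate:

    # Break-even model for pay-per-hour explicit caching (Gemini-style).
    # All prices are hypothetical placeholders, not real rates.
    INPUT_PRICE = 0.15 / 1e6     # $/input token, uncached (assumed)
    CACHED_PRICE = 0.0375 / 1e6  # $/input token read from cache (assumed)
    STORAGE_PRICE = 1.00 / 1e6   # $/token/hour to keep the cache alive (assumed)

    def cache_saves_money(prefix_tokens: int, requests_per_hour: float) -> bool:
        # Without a cache: every request pays full price for the prefix.
        no_cache = requests_per_hour * prefix_tokens * INPUT_PRICE
        # With a cache: discounted reads plus an hourly storage fee.
        with_cache = (requests_per_hour * prefix_tokens * CACHED_PRICE
                      + prefix_tokens * STORAGE_PRICE)
        return with_cache < no_cache

    # With these numbers a 100k-token prefix only pays off above ~9 req/hour:
    for rph in (1, 5, 10, 50):
        print(rph, cache_saves_money(100_000, rph))

Getting the request rate wrong flips the sign of the savings, which is exactly the load-prediction burden I mean.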

Anthropic requires me to add explicit cache breakpoints to my prompts, and charges for writes to the cache. If I get that wrong it can end up more expensive than leaving caching turned off entirely.
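
Roughly like this (a sketch of Anthropic's cache_control breakpoints; check their docs for current pricing and minimum cacheable prompt lengths - the document string here is a placeholder):

    # Anthropic explicit cache breakpoint (sketch).
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    big_reference_document = "...long, stable context goes here..."  # placeholder

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": big_reference_document,
            # Everything up to this breakpoint is written to the cache
            # (billed at a premium) and read back at a discount on later
            # calls that share the same prefix.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": "First question about the doc"}],
    )

If the prefix changes between calls you pay the write premium every time and never get the cheap reads - that's the failure mode where caching costs more than no caching.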

With OpenAI I don't have to do any planning or optimistic guessing at all: if my app gets a spike in traffic the caching kicks in automatically and saves me money.
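
The only lever there is prompt structure: keep the stable part at the front so repeated requests share a prefix (their docs say caching only applies past a minimum prompt length). A sketch, assuming the cached_tokens usage field that recent API versions report:

    # OpenAI automatic prefix caching: no cache API at all, just keep
    # the shared prefix first. The system prompt here is a placeholder.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the env
    STABLE_SYSTEM_PROMPT = "...long instructions that never change..."  # placeholder

    def ask(question: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": STABLE_SYSTEM_PROMPT},  # shared prefix
                {"role": "user", "content": question},                # variable suffix
            ],
        )
        details = resp.usage.prompt_tokens_details
        # cached_tokens > 0 means the automatic prefix cache kicked in.
        print("cached:", details.cached_tokens if details else 0)
        return resp.choices[0].message.content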


that's fair - i have some app ideas for which i'd want control over prefix caching. for example, you might want to prompt cache entire chunks of enterprise data that don't change too often. a whole RAG application could be built on this concept - paying per hour for caching is sensible there.

>With OpenAI I don't have to do any planning or optimistic guessing at all: if my app gets a spike in traffic the caching kicks in automatically and saves me money.

i think these are completely different use cases. how is this different from just having a redis sitting in front of the LLM provider? (rough sketch below)
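
something like this (sketch; call_llm is a hypothetical stand-in for the provider call) - an exact-match cache only helps when the *entire* prompt repeats verbatim, whereas prefix caching reuses server-side state when prompts merely share a prefix:

    # Exact-match response cache - saves money only when the whole
    # prompt repeats verbatim.
    import hashlib
    import redis

    r = redis.Redis()

    def call_llm(prompt: str) -> str:
        # hypothetical stand-in for the real provider call
        return "model answer for: " + prompt[:40]

    def cached_completion(prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        hit = r.get(key)
        if hit is not None:
            return hit.decode()       # identical prompt seen before
        answer = call_llm(prompt)
        r.set(key, answer, ex=3600)   # keep for an hour
        return answer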

fundamentally i feel like prompt caching is something i want to control, not have happen automatically; i want to use the information i have about my (future) access patterns to save costs. for instance i might prompt cache a whole PDF and then ask multiple questions against it. if i choose to prompt cache the PDF, i save a non-trivial number of processed tokens. how does OpenAI's automatic approach help me here?
