
The context cache (or KV cache) is where intermediate attention results (keys and values) are stored, one set per token processed. Its size depends on the model architecture and dimensions.

KV cache size = 2 * batch_size * context_len * num_key_value_heads * head_dim * num_layers * element_size. The "2" is for the two parts, key and value. Element size is the precision in bytes. This model uses grouped query attention (GQA), which reduces num_key_value_heads compared to a multi-head attention (MHA) model.

With batch size 1 (for low-latency single-user inference), 32k context (recommended in the model card), fp16 precision:

2 * 1 * 32768 * 8 * 128 * 36 * 2 bytes = 4.5 GiB.

I think, anyway. It's hard to keep up with this stuff. :)
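The arithmetic checks out. A small sketch of the formula, using the comment's example dimensions (8 KV heads, head_dim 128, 36 layers are assumptions about this particular model, not read from any config):

```python
# Hypothetical helper: estimate KV cache size from the formula above.
def kv_cache_bytes(batch_size, context_len, num_kv_heads, head_dim,
                   num_layers, element_size):
    # The leading 2 accounts for storing both keys and values per layer.
    return (2 * batch_size * context_len * num_kv_heads * head_dim
              * num_layers * element_size)

# batch 1, 32k context, fp16 (2-byte) elements:
size = kv_cache_bytes(batch_size=1, context_len=32768, num_kv_heads=8,
                      head_dim=128, num_layers=36, element_size=2)
print(f"{size / 2**30:.1f} GiB")  # 4.5 GiB
```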



Yes, but you can quantise the KV cache too, just like you can the weights.
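Since element_size is a linear factor in the formula, halving the precision of the cached keys and values (e.g. fp16 to an 8-bit quantised cache) halves the cache size, independent of the weights' precision. A sketch, reusing the same assumed model dimensions:

```python
# Same formula as above; only element_size changes with KV cache precision.
def kv_cache_bytes(batch_size, context_len, num_kv_heads, head_dim,
                   num_layers, element_size):
    return (2 * batch_size * context_len * num_kv_heads * head_dim
              * num_layers * element_size)

fp16 = kv_cache_bytes(1, 32768, 8, 128, 36, element_size=2)  # 2-byte cache
int8 = kv_cache_bytes(1, 32768, 8, 128, 36, element_size=1)  # 1-byte cache
print(f"fp16: {fp16 / 2**30:.2f} GiB, int8: {int8 / 2**30:.2f} GiB")
```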



