The context cache (or KV cache) stores intermediate attention results, the key and value projections for every token in the context, so they don't have to be recomputed for each new output token. Its size depends on the model's architecture and dimensions.
KV cache size = 2 * batch_size * context_len * num_key_value_heads * head_dim * num_layers * element_size. The "2" accounts for the two parts, key and value; element size is the precision in bytes. This model uses grouped-query attention (GQA), which reduces num_key_value_heads compared to a multi-head attention (MHA) model.
With batch size 1 (for low-latency single-user inference), 32k context (recommended in the model card), and fp16 precision (2 bytes per element):
2 * 1 * 32768 * 8 * 128 * 36 * 2 = 4,831,838,208 bytes = 4.5 GiB.
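If it helps, here's the same arithmetic as a small Python sketch. The dimensions are just the numbers from the example above (8 KV heads, head_dim 128, 36 layers), which I'm assuming come from this model's config:

```python
def kv_cache_bytes(batch_size, context_len, num_key_value_heads,
                   head_dim, num_layers, element_size):
    # 2x because both the key and the value tensor are cached per layer
    return (2 * batch_size * context_len * num_key_value_heads
              * head_dim * num_layers * element_size)

# batch 1, 32k context, 8 KV heads, head_dim 128, 36 layers, fp16 (2 bytes)
size = kv_cache_bytes(1, 32768, 8, 128, 36, 2)
print(size, size / 2**30)  # 4831838208 bytes, 4.5 GiB
```

Swapping in int8 KV cache quantization (element_size = 1) halves that to 2.25 GiB, which is why quantized caches are popular for long contexts.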
I think, anyway. It's hard to keep up with this stuff. :)