
I suspect you might be confusing the numbers: 12B (the very first number they give) is not the context length; it's the parameter count.

Parameter count is used because the final size in GB depends on quantization. A 12B model at 8-bit parameter width would be about 12 GB (plus some percentage of overhead), while at 16-bit it would be about 24 GB.
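As a rough back-of-envelope sketch (the 5% overhead figure here is just an illustrative assumption, not a real measurement):

  # Estimate weight size in GB from parameter count and quantization width.
  def model_size_gb(params_billion, bits_per_param, overhead=0.05):
      bytes_per_param = bits_per_param / 8
      # billions of params * bytes per param gives GB directly
      return params_billion * bytes_per_param * (1 + overhead)

  print(model_size_gb(12, 8))   # ~12.6 GB at 8-bit
  print(model_size_gb(12, 16))  # ~25.2 GB at 16-bit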

Context length here is 128k, which is orthogonal to model size. You'll notice they specify both parameter count and context size, because you need both to characterize an LLM.

It's also interesting to know what parameter width it was trained at, because you cannot get more information back by "quantizing wider" -- it only makes sense to quantize into a narrower parameter width to save space.



Ah, yes.

Thanks, I confused those numbers!




