This is the original paper: https://arxiv.org/abs/1911.02150 . The idea is that with a transformer you have many heads, say 64 for LLaMA, and for each head you have one "query" vector, one "key" vector, and one "value" vector per token. Most of the cost of inference (especially incremental, token-by-token decoding) is loading the cached key and value vectors from GPU memory into the GPU's compute units. The idea behind MQA is that instead of having 64 queries, 64 keys, and 64 values, you have 64 queries, 1 key, and 1 value ("Multi-Query" as opposed to "Multi-Head", the original name). This means there is much less data to load from GPU memory during inference.
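
A minimal PyTorch sketch of the shape difference (dimensions are made up for illustration, not LLaMA's actual config): all 64 query heads attend against a single shared key/value head, so the per-token KV cache shrinks by a factor of the head count.

```python
import torch

# Hypothetical sizes for illustration only.
n_heads, d_head, d_model, seq_len = 64, 128, 8192, 16

x = torch.randn(seq_len, d_model)

# Multi-query attention: full-width query projection, but a single
# head-sized key projection and value projection shared by all heads.
w_q = torch.randn(d_model, n_heads * d_head)
w_k = torch.randn(d_model, d_head)
w_v = torch.randn(d_model, d_head)

q = (x @ w_q).view(seq_len, n_heads, d_head)  # 64 query heads per token
k = x @ w_k                                   # 1 shared key per token
v = x @ w_v                                   # 1 shared value per token

# The single K/V is broadcast across all 64 query heads.
scores = torch.einsum('qhd,kd->hqk', q, k) / d_head ** 0.5
attn = scores.softmax(dim=-1)
out = torch.einsum('hqk,kd->qhd', attn, v).reshape(seq_len, n_heads * d_head)

# KV cache per token per layer: 2 * n_heads * d_head values for standard
# multi-head attention vs. 2 * d_head for MQA -- a 64x reduction here.
print(2 * n_heads * d_head, "vs", 2 * d_head)
```

(A causal mask and output projection are omitted to keep the broadcast of the shared key/value visible.)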