
> Using H20 to serve DeepSeek V3 / R1 is just SUPER inefficient. Like, R1 is the most anti-H20 model released ever.

Why? Any chance you have some links to read about why that's the case?



MLA trades extra flops for reduced memory-bandwidth demand; the H20 has plenty of memory bandwidth but very little compute. MLA makes sense on the H100/H800, but on the H20, GQA-based models are a much better option.
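A quick back-of-the-envelope way to see the argument is the roofline "machine balance": compute divided by memory bandwidth, in FLOPs per byte. A kernel whose arithmetic intensity exceeds that ratio is compute-bound on that GPU. The spec numbers below are publicly quoted datasheet figures used as assumptions, not measurements:

```python
# Hedged sketch: compare the compute-to-bandwidth "machine balance" of the
# H100 SXM vs the H20, using approximate spec-sheet numbers (assumptions).
# A kernel needs arithmetic intensity above this ratio to be compute-bound.

gpus = {
    # name: (dense BF16 TFLOP/s, HBM bandwidth TB/s) -- approximate specs
    "H100 SXM": (989.0, 3.35),
    "H20":      (148.0, 4.0),
}

for name, (tflops, tbps) in gpus.items():
    balance = tflops / tbps  # FLOPs per byte at the roofline ridge point
    print(f"{name}: ~{balance:.0f} FLOPs/byte to stay compute-bound")
```

The H20's ridge point sits roughly 8x lower than the H100's, so a flop-hungry attention scheme like MLA hits the compute wall on the H20 long before it would on an H100, while a bandwidth-hungry GQA cache walk maps well onto the H20's ample HBM.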


Not sure what you are referring to; do you have a pointer to a technical writeup, perhaps? In both training and inference, MLA uses far fewer flops than MHA, which is the gold standard, and gets much better accuracy (model quality) than GQA (see the comparisons in the DeepSeek papers, or try DeepSeek models vs. Llama for long context).

More generally, on whatever hardware you use, you can optimize throughput for your main goal (initially training, later inference) by balancing the other parameters of the model architecture. Even if training ends up suboptimal, if you want a public model to have global impact, you design for the next generation of NVIDIA inference hardware.


Didn't DeepSeek figure out how to train with mixed precision and so get much more out of the cards, with many of the training steps able to run at precisions traditionally reserved for post-training quantization (block-scaled formats)?
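The key idea behind block-scaled low precision is storing one scale factor per small tile of values instead of one per tensor, so outliers in one tile don't destroy the precision of the rest. A minimal sketch of that scheme (not DeepSeek's actual kernels; the 128-element tile size and FP8 E4M3 max of 448 are the commonly cited figures, used here as assumptions):

```python
# Hedged sketch of block-wise quantization: one scale per 128-element tile.
# Pure Python for clarity; real kernels would store tiles in FP8 on the GPU.
import random

BLOCK = 128
QMAX = 448.0  # max representable magnitude of FP8 E4M3

def block_quantize(xs):
    """Return (scales, quantized) with one scale per BLOCK-sized tile."""
    scales, quant = [], []
    for i in range(0, len(xs), BLOCK):
        tile = xs[i:i + BLOCK]
        scale = max(abs(v) for v in tile) / QMAX or 1.0
        scales.append(scale)
        # Simulate low-precision storage by rounding onto a coarse grid.
        quant.append([round(v / scale) for v in tile])
    return scales, quant

def block_dequantize(scales, quant):
    out = []
    for scale, tile in zip(scales, quant):
        out.extend(v * scale for v in tile)
    return out

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(512)]
scales, q = block_quantize(xs)
recon = block_dequantize(scales, q)
max_err = max(abs(a - b) for a, b in zip(xs, recon))
print(f"max reconstruction error: {max_err:.4f}")
```

Because each tile's scale tracks its own maximum, the worst-case rounding error stays at half a grid step per tile rather than being dictated by the largest value anywhere in the tensor.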


MLA as in multi-head latent attention?


Yes


Ah, gotcha. Thank you



