
> Using H20 to serve DeepSeek V3 / R1 is just SUPER inefficient. Like, R1 is the most anti-H20 model released ever.

Why? Any chance you have some links to read about why that's the case?



MLA trades extra flops for reduced memory-bandwidth demand; the H20 has plenty of memory bandwidth but very little compute. MLA makes sense on the H100/H800, but on the H20, GQA-based models are a much better option.
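A quick back-of-the-envelope way to see the argument is the roofline "machine balance": compute divided by memory bandwidth, in FLOPs per byte. A kernel whose arithmetic intensity exceeds that ratio is compute-bound on that GPU. The spec numbers below are publicly quoted datasheet figures used as assumptions, not measurements:

```python
# Hedged sketch: compare the compute-to-bandwidth "machine balance" of the
# H100 SXM vs the H20, using approximate spec-sheet numbers (assumptions).
# A kernel needs arithmetic intensity above this ratio to be compute-bound.

gpus = {
    # name: (dense BF16 TFLOP/s, HBM bandwidth TB/s) -- approximate specs
    "H100 SXM": (989.0, 3.35),
    "H20":      (148.0, 4.0),
}

for name, (tflops, tbps) in gpus.items():
    balance = tflops / tbps  # FLOPs per byte at the roofline ridge point
    print(f"{name}: ~{balance:.0f} FLOPs/byte to stay compute-bound")
```

The H20's ridge point sits roughly 8x lower than the H100's, so a flop-hungry attention scheme like MLA hits the compute wall on the H20 long before it would on an H100, while a bandwidth-hungry GQA cache walk maps well onto the H20's ample HBM.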


Not sure what you are referring to; do you have a pointer to a technical writeup, perhaps? In both training and inference, MLA uses far fewer flops than MHA, which is the gold standard, and gets much better accuracy (model quality) than GQA (see the comparisons in the DeepSeek papers, or try DeepSeek models vs. Llama for long context).

More generally, on whatever hardware you use, you can optimize throughput for your main goal (initially training, later inference) by balancing the other parameters of the model architecture. Even if training ends up suboptimal, if you want a public model to have global impact, you design for the next generation of NVIDIA inference hardware.


Didn't DeepSeek figure out how to train with mixed precision and so get much more out of the cards, with many of the training steps able to run at precisions traditionally reserved for post-training quantization (block-scaled formats)?
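The key idea behind block-scaled low precision is storing one scale factor per small tile of values instead of one per tensor, so outliers in one tile don't destroy the precision of the rest. A minimal sketch of that scheme (not DeepSeek's actual kernels; the 128-element tile size and FP8 E4M3 max of 448 are the commonly cited figures, used here as assumptions):

```python
# Hedged sketch of block-wise quantization: one scale per 128-element tile.
# Pure Python for clarity; real kernels would store tiles in FP8 on the GPU.
import random

BLOCK = 128
QMAX = 448.0  # max representable magnitude of FP8 E4M3

def block_quantize(xs):
    """Return (scales, quantized) with one scale per BLOCK-sized tile."""
    scales, quant = [], []
    for i in range(0, len(xs), BLOCK):
        tile = xs[i:i + BLOCK]
        scale = max(abs(v) for v in tile) / QMAX or 1.0
        scales.append(scale)
        # Simulate low-precision storage by rounding onto a coarse grid.
        quant.append([round(v / scale) for v in tile])
    return scales, quant

def block_dequantize(scales, quant):
    out = []
    for scale, tile in zip(scales, quant):
        out.extend(v * scale for v in tile)
    return out

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(512)]
scales, q = block_quantize(xs)
recon = block_dequantize(scales, q)
max_err = max(abs(a - b) for a, b in zip(xs, recon))
print(f"max reconstruction error: {max_err:.4f}")
```

Because each tile's scale tracks its own maximum, the worst-case rounding error stays at half a grid step per tile rather than being dictated by the largest value anywhere in the tensor.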


MLA as in multi-head latent attention?


Yes


Ah, gotcha. Thank you



