
Actually no! If we take their paper at face value, the crucial innovations that give a strong model with good inference efficiency are the much smaller KV cache and their MoE approach:

- Where a standard model has to store two large vectors (k and v) per token at inference time, and repeatedly load/store them from memory, DeepSeek V3/R1 stores only one smaller vector c, a "compression" from which the full k and v vectors are decoded on the fly.

- They use a fairly standard Mixture-of-Experts (MoE) approach, which works well in training thanks to their tricks, but whose inference-time advantage is immediate and the same as for any other MoE technique: of the ~85% of the 600B+ parameters that sit in the MoE layers, the model activates only a small fraction at each token inference step. This cuts FLOPs and memory IO by a large factor compared to a so-called dense model, where all weights are used for every token (cf. Llama 3 405B). A rough sketch of both ideas follows below.
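To make the two points above concrete, here is a minimal, purely illustrative sketch in plain NumPy with made-up dimensions. It is not DeepSeek's actual architecture (which adds per-head decompression, RoPE handling, shared experts, load-balancing tricks, etc.); it only shows (1) caching a small latent c per token and decoding k and v from it on the fly, and (2) routing each token through only the top-k of many experts.

  import numpy as np

  rng = np.random.default_rng(0)
  d_model, d_latent, d_head = 1024, 64, 128   # hypothetical sizes

  # --- KV-cache compression: store a small latent c instead of full k, v ---
  W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # compress h -> c
  W_uk   = rng.standard_normal((d_latent, d_head)) * 0.02    # decode c -> k
  W_uv   = rng.standard_normal((d_latent, d_head)) * 0.02    # decode c -> v

  def cache_token(h):
      """Only the small latent c (d_latent floats) goes into the KV cache."""
      return h @ W_down

  def decode_kv(c):
      """Reconstruct k and v on the fly from the cached latent."""
      return c @ W_uk, c @ W_uv

  h = rng.standard_normal(d_model)
  c = cache_token(h)            # 64 floats cached instead of 2 x 128
  k, v = decode_kv(c)

  # --- Top-k expert routing, as in a standard MoE layer ---
  n_experts, top_k, d_ff = 64, 6, 2048
  experts_in  = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
  experts_out = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02
  router      = rng.standard_normal((d_model, n_experts)) * 0.02

  def moe_forward(h):
      """Only top_k of n_experts run per token, so FLOPs and memory IO stay small."""
      scores = h @ router
      chosen = np.argsort(scores)[-top_k:]                  # pick the top-k experts
      weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
      out = np.zeros_like(h)
      for w, e in zip(weights, chosen):
          out += w * (np.maximum(h @ experts_in[e], 0) @ experts_out[e])
      return out

  y = moe_forward(h)

The memory saving shows up in the cache: per token you hold d_latent floats instead of two d_head vectors per attention head, and the MoE loop touches only top_k expert weight matrices instead of all n_experts.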


