I think their analysis relies on overlooking the obvious use for SRAM - caching DRAM data.
SRAM is for data that needs to be read/written/used very frequently - for example, read in 1 out of 10 clock cycles.
LLM weights are certainly not this. If a GPU is calculating 200 tokens per second, then most weights are only used 200 times per second. For a 1 GHz GPU, you're only using the data for 1 cycle out of 5,000,000! The rest of the time, that SRAM is wasted power, wasted silicon area, and eventually wasted dollars.
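To make that ratio concrete, here is the back-of-the-envelope arithmetic. The 1 GHz clock and 200 tokens/s figures are the ones assumed above; the assumption that each weight is read roughly once per generated token is mine, added for illustration.

```python
# Back-of-the-envelope: how often is a given weight actually touched?
# Assumptions (illustrative): 1 GHz clock, 200 tokens/s decode rate,
# and each weight read roughly once per generated token.
clock_hz = 1e9               # GPU clock: 1 GHz
tokens_per_s = 200           # decode throughput
uses_per_s = tokens_per_s    # one read of each weight per token
cycles_per_use = clock_hz / uses_per_s
print(f"Each weight is touched once every {cycles_per_use:,.0f} cycles")
# -> Each weight is touched once every 5,000,000 cycles
```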
Instead they should use SRAM for the intermediate results (i.e. the accumulators) of matrix multiplication - those end up being read/written every few cycles.
Weights should be streamed in from in-package DRAM. Activations too (but they are often used multiple times in quick succession, so it might make sense to cache them in SRAM).
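To make the data-movement pattern concrete, here is a minimal NumPy sketch of a tiled matrix multiply. It is not any real kernel, just an illustration of which buffers are hot (the accumulator tile, which you would keep in SRAM) and which are streamed through once (the weight tiles, which would come from in-package DRAM).

```python
import numpy as np

def tiled_matmul(activations, weights, tile=128):
    """Illustrative tiled matmul (not a real kernel).

    acc : read/modified on every k-step  -> hot, belongs in SRAM
    w_t : each weight tile read once     -> cold, streamed from DRAM
    """
    m, k = activations.shape
    k2, n = weights.shape
    assert k == k2
    out = np.zeros((m, n), dtype=np.float32)
    for j0 in range(0, n, tile):
        j1 = min(j0 + tile, n)
        # Accumulator tile: touched on every inner iteration.
        acc = np.zeros((m, j1 - j0), dtype=np.float32)
        for k0 in range(0, k, tile):
            k1 = min(k0 + tile, k)
            # Weight tile: read exactly once per output tile.
            w_t = weights[k0:k1, j0:j1]
            # Activation tile: reused again for the next output tile.
            a_t = activations[:, k0:k1]
            acc += a_t @ w_t
        out[:, j0:j1] = acc
    return out
```

The point is just the access pattern: `acc` is touched on every k-step, each `w_t` is read exactly once per output tile, and the activation tiles get reused across output tiles, which matches the caching suggestion above.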
I think it's all about the performance-to-cost ratio. The reason you need a cache is that you want to reduce the latency and power of accessing data. DRAM can also be thought of as a cache for disk drives - so why don't people use cheap disk drives for deep learning? Because they're way too slow.
Keeping weights in SRAM is more expensive than keeping them in DRAM; however, the latency and energy of streaming weights in from DRAM are even more expensive than that. LLM inference is heavily memory bound, and I guess that's why they use an expensive but faster memory.
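As a rough sanity check on how memory bound decoding is, here is a back-of-the-envelope sketch. The model size, precision, and decode rate are hypothetical round numbers assumed for illustration, not figures from this thread.

```python
# Rough illustration of why single-stream LLM decoding is memory bound.
# All numbers below are hypothetical round figures for illustration.
params = 70e9            # assume a 70B-parameter model
bytes_per_param = 2      # fp16/bf16 weights
tokens_per_s = 20        # assumed single-stream decode rate

# Every generated token reads (roughly) every weight once.
bytes_per_token = params * bytes_per_param
required_bw = bytes_per_token * tokens_per_s
print(f"Weight traffic per token: {bytes_per_token / 1e9:.0f} GB")
print(f"Bandwidth needed for {tokens_per_s} tok/s: {required_bw / 1e12:.1f} TB/s")
# ~140 GB per token, ~2.8 TB/s at 20 tok/s - on the order of an entire
# high-end accelerator's HBM bandwidth, so memory rather than FLOPs
# sets the speed limit.
```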
This might only make sense for companies like Google and Microsoft, who really need to run LLMs at millions of tokens per second and really care about the performance-to-cost ratio.