A key architectural feature to achieve this is the ability to fit all model parameters inside the on-chip SRAMs of the chiplets to eliminate bandwidth limitations. Doing so is non-trivial as the amount of memory required is very large and growing for modern LLMs.
...
On-chip memories such as SRAM have better read latency and read/write energy than external memories such as DDR or HBM but require more silicon per bit. We show this design choice wins in the competition of TCO per performance for serving large generative language models but requires careful consideration with respect to the chiplet die size, chiplet memory capacity and total number of chiplets to balance the fabrication cost and model performance (Sec. 3.2.2). We observe that the inter-chiplet communication issues can be effectively mitigated through proper software-hardware co-design leveraging mapping strategies such as tensor and pipeline model parallelism.
SRAM has essentially stopped scaling, going by TSMC's upcoming N3E specs and their planned N2 node. So if models are tens of GB in size, I don't see how their proposed chips can be built economically.
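To put rough numbers on that, here is a sketch with illustrative assumptions: ~0.02 um^2-class HD SRAM bit cells (which barely shrink at N3E), and real macro density a few times worse than the raw bit cell.

```python
# Sketch: how much silicon it takes to hold model weights in on-chip SRAM.
# All constants are illustrative assumptions, not vendor specs.

BITCELL_UM2 = 0.021      # ~N5/N3E-class HD SRAM bit cell, which barely scales further
MACRO_OVERHEAD = 3.0     # assume arrays + periphery make real density ~3x worse
RETICLE_MM2 = 850        # roughly the maximum single-die size

def sram_area_mm2(gigabytes):
    bits = gigabytes * 8e9
    return bits * BITCELL_UM2 * MACRO_OVERHEAD / 1e6   # um^2 -> mm^2

for gb in (20, 70, 140):  # "tens of GB" of weights
    area = sram_area_mm2(gb)
    print(f"{gb:4d} GB -> ~{area:>9,.0f} mm^2 of SRAM, ~{area / RETICLE_MM2:.0f} reticle-size dies")
```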
Also, a GPU is already an ASIC but with a fancy name.
Nowadays GPUs have sacrificed some performance for better programmability. ASICs always trade programmability for better performance and energy efficiency; it's really about how 'specific' you want the chip to be. For applications as important and popular as LLMs, we probably want a very 'specific' chip.
I actually think chip-level HW-SW co-design is a good idea. It opens up more opportunities to mitigate the communication issues than just optimizing the mapping for a fixed chip and system design.
For example, the number of GPUs per server limits the maximum tensor model parallelism degree; you don't want to do tensor parallelism across servers because of the low bandwidth between them.
Here the number of chips per server depends on chip size, cooling, etc., so you probably want to do the co-design - you have the chance here. It's difficult, though.
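To make the bandwidth pressure concrete, here's a rough sketch of the per-device all-reduce traffic Megatron-style tensor parallelism generates during decoding; all the model and system numbers are made-up illustrations, not from the paper.

```python
# Sketch: per-device all-reduce traffic from Megatron-style tensor parallelism
# during decoding. Model and system numbers are illustrative, not from the paper.

hidden = 12288          # hidden size (~GPT-3-class)
layers = 96             # transformer layers
batch = 32              # sequences decoded together
bytes_per_elem = 2      # fp16 activations
tp = 8                  # tensor-parallel degree (devices sharing each layer)
steps_per_sec = 100     # decode steps per second for the batch

# Two all-reduces per layer in the forward pass (after attention, after the MLP),
# each over a [batch, 1, hidden] activation slice per decode step.
msg_bytes = batch * hidden * bytes_per_elem
ring_bytes = 2 * (tp - 1) / tp * msg_bytes          # bytes moved per device per all-reduce
per_step = 2 * layers * ring_bytes                  # per device, per decode step

print(f"per decode step: {per_step / 1e6:.0f} MB moved per device")
print(f"at {steps_per_sec} steps/s: {per_step * steps_per_sec / 1e9:.1f} GB/s sustained per device")
# Comfortable over NVLink-class intra-server links, painful over typical
# inter-server networks - and that's ~200 latency-sensitive all-reduces per step.
```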
Having hardware and software talk to each other before tape out is a really good idea. The early Graphcore work was done on a whiteboard with people from both sides writing on it.
There are still a lot of compromises and tradeoffs to be made:
> We observe that the inter-chiplet communication issues can be effectively mitigated through proper software-hardware co-design
Doubtful, especially given it's all vapourware. Co-design is not magic enough to handwave this one away.
I think their analysis relies on missing the obvious use for SRAM: caching DRAM data.
SRAM is for data that needs to be read/written/used very frequently - for example, read on 1 out of every 10 clock cycles.
LLM weights are certainly not this. If a GPU is calculating 200 tokens per second, then most weights are only used 200 times per second. For a 1 GHz GPU, you're only using the data for 1 cycle out of 5,000,000! The rest of the time, that SRAM is wasted power, wasted silicon area, and eventually wasted dollars.
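The same arithmetic in code, with batch size as the knob that changes the picture (numbers are the same illustrative ones):

```python
# The weight-reuse arithmetic above, spelled out. Same illustrative numbers:
# a 1 GHz part generating 200 tokens/s.

clock_hz = 1e9
tokens_per_sec = 200
batch = 1                # sequences sharing each weight read

reads_per_sec = tokens_per_sec * batch     # each weight is read ~once per token per sequence
cycles_between_reads = clock_hz / reads_per_sec
print(f"each weight is touched once every ~{cycles_between_reads:,.0f} cycles")   # ~5,000,000

# Batching changes the picture: at batch = 256 the same weight read is reused
# 256 times, which is how GPUs amortize DRAM traffic without on-chip weights.
```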
Instead they should use SRAM for the intermediate results (i.e. the accumulators) of matrix multiplication - those will end up being read/written every few cycles.
Weights should be streamed in from in-package DRAM. Activations too (but they are often used multiple times in quick succession, so it might make sense to cache them in SRAM).
I think it’s all about the performance-to-cost ratio. The reason you need a cache is that you want to reduce the latency and power of accessing data. DRAM can also be thought of as a cache for disk drives - so why don't people use cheap disk drives for deep learning? Because they're way too slow.
Keeping weights in SRAM is more expensive than keeping them in DRAM; however, the latency and energy of streaming weights in from DRAM are even more expensive. LLM inference is heavily memory bound, and I guess that's why they use an expensive but faster memory.
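A rough sense of why "memory bound" bites, assuming a hypothetical 70B-parameter model with 8-bit weights:

```python
# Sketch: DRAM bandwidth needed to stream all weights every decode step.
# Model size, precision, and rates are illustrative assumptions.

params = 70e9            # hypothetical 70B-parameter model
bytes_per_weight = 1     # 8-bit weights
weight_bytes = params * bytes_per_weight

for steps_per_sec in (10, 50, 200):
    bw = weight_bytes * steps_per_sec
    print(f"{steps_per_sec:4d} decode steps/s -> {bw / 1e12:5.1f} TB/s of weight traffic")

# Large batches amortize this (every sequence in the batch shares one pass over
# the weights); for small batches or tight latency you either need enormous HBM
# bandwidth or weights that already sit on-chip.
```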
This might only make sense for companies like Google and Microsoft, who really need to serve LLMs at millions of tokens per second and really care about the performance-to-cost ratio.
Large language models would need tens or hundreds of gigabytes of SRAM. Pretty sure the enormous cost of that much SRAM makes the approach economically unfeasible.
At high-end production nodes it is impossible to get an entire wafer free of defects. Chips, let alone wafers, already include circuitry to disable parts of themselves if those parts have defects. Cerebras must have spent a ton of effort getting this to work across a full wafer. You also have problems like wafer-level variability, which you're less sensitive to when you put thousands of chips on a single wafer rather than just one, since each chip covers a smaller area.
Look at how successful AMD's chiplet strategy has been. Chiplets sidestep the yield problems. Wafer scale amplifies them a hundred- or thousand-fold.
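The usual back-of-envelope here is a Poisson yield model, Y = exp(-A * D0); the defect density below is an illustrative guess, not a foundry number.

```python
# Poisson yield model: Y = exp(-A * D0) with die area A in cm^2 and defect
# density D0 in defects/cm^2. D0 below is an illustrative guess for a mature node.
import math

D0 = 0.1  # defects per cm^2 (illustrative)

def die_yield(die_mm2):
    return math.exp(-(die_mm2 / 100.0) * D0)

for die_mm2 in (75, 150, 800):   # small chiplet, large chiplet, reticle-limit die
    print(f"{die_mm2:4d} mm^2 die -> ~{die_yield(die_mm2):.0%} expected yield")

# A full 300 mm wafer is ~70,000 mm^2, so exp(-70) is effectively zero - which is
# why a wafer-scale part needs pervasive redundancy and defect bypass instead of
# hoping for a clean wafer.
```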
Nothing in the industry is designed to work with wafer-scale products, so everything has to be custom-made. Yes, this is a chicken-and-egg problem, but it's going to be expensive to get any sort of momentum. The silicon industry is extremely conservative.
It's sexy and enticing. If someone can make it work that's awesome. I will remain skeptical though.
> Cerebras achieves 100% yield by designing a system in which any manufacturing defect can be bypassed – initially Cerebras had 1.5% extra cores to allow for defects, but we’ve since been told this was way too much as TSMC's process is so mature.
I'm well aware of chip defect rates and how they affect chips.
> It's sexy and enticing. If someone can make it work that's awesome. I will remain skeptical though.
But Cerebras has made it work since 2019, as @cubefox pointed out. They're already on their second generation and have been shipping to customers for years.