A key architectural feature to achieve this is the ability to fit all model parameters inside the on-chip SRAMs of the chiplets to eliminate bandwidth limitations. Doing so is non-trivial as the amount of memory required is very large and growing for modern LLMs.
...
On-chip memories such as SRAM have better read latency and read/write energy than external memories such as DDR or HBM but require more silicon per bit. We show this design choice wins in the competition of TCO per performance for serving large generative language models but requires careful consideration with respect to the chiplet die size, chiplet memory capacity and total number of chiplets to balance the fabrication cost and model performance (Sec. 3.2.2). We observe that the inter-chiplet communication issues can be effectively mitigated through proper software-hardware co-design leveraging mapping strategies such as tensor and pipeline model parallelism.
SRAM has essentially stopped scaling, going by TSMC's upcoming N3E specs and their planned N2 node. So if models are tens of GB in size, I don't see how their proposed chips can be built economically.
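To put rough numbers on that, here is a sketch with illustrative assumptions: ~0.02 um^2-class HD SRAM bit cells (which barely shrink at N3E), and real macro density a few times worse than the raw bit cell.

```python
# Sketch: how much silicon it takes to hold model weights in on-chip SRAM.
# All constants are illustrative assumptions, not vendor specs.

BITCELL_UM2 = 0.021      # ~N5/N3E-class HD SRAM bit cell, which barely scales further
MACRO_OVERHEAD = 3.0     # assume arrays + periphery make real density ~3x worse
RETICLE_MM2 = 850        # roughly the maximum single-die size

def sram_area_mm2(gigabytes):
    bits = gigabytes * 8e9
    return bits * BITCELL_UM2 * MACRO_OVERHEAD / 1e6   # um^2 -> mm^2

for gb in (20, 70, 140):  # "tens of GB" of weights
    area = sram_area_mm2(gb)
    print(f"{gb:4d} GB -> ~{area:>9,.0f} mm^2 of SRAM, ~{area / RETICLE_MM2:.0f} reticle-size dies")
```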
Also, a GPU is already an ASIC but with a fancy name.
Nowadays GPUs have sacrificed some performance for better programmability. ASICs always trade programmability for better performance and energy efficiency; it's really about how 'specific' you want the chip to be. For applications as important and popular as LLMs, we probably want a very 'specific' chip.
I actually think chip-level HW-SW co-design is a good idea. It opens up more opportunities to mitigate the communication issues than just optimizing the mapping for a fixed chip and system design.
For example, the number of GPUs per server limits the maximum tensor model parallelism degree; you don't want to do tensor parallelism across servers because of the low bandwidth between them.
Here the number of chips per server depends on chip size, cooling, etc., so you probably want to do the co-design - you have the chance here. It's difficult, though.
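To make the bandwidth pressure concrete, here's a rough sketch of the per-device all-reduce traffic Megatron-style tensor parallelism generates during decoding; all the model and system numbers are made-up illustrations, not from the paper.

```python
# Sketch: per-device all-reduce traffic from Megatron-style tensor parallelism
# during decoding. Model and system numbers are illustrative, not from the paper.

hidden = 12288          # hidden size (~GPT-3-class)
layers = 96             # transformer layers
batch = 32              # sequences decoded together
bytes_per_elem = 2      # fp16 activations
tp = 8                  # tensor-parallel degree (devices sharing each layer)
steps_per_sec = 100     # decode steps per second for the batch

# Two all-reduces per layer in the forward pass (after attention, after the MLP),
# each over a [batch, 1, hidden] activation slice per decode step.
msg_bytes = batch * hidden * bytes_per_elem
ring_bytes = 2 * (tp - 1) / tp * msg_bytes          # bytes moved per device per all-reduce
per_step = 2 * layers * ring_bytes                  # per device, per decode step

print(f"per decode step: {per_step / 1e6:.0f} MB moved per device")
print(f"at {steps_per_sec} steps/s: {per_step * steps_per_sec / 1e9:.1f} GB/s sustained per device")
# Comfortable over NVLink-class intra-server links, painful over typical
# inter-server networks - and that's ~200 latency-sensitive all-reduces per step.
```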
Having hardware and software talk to each other before tape out is a really good idea. The early Graphcore work was done on a whiteboard with people from both sides writing on it.
There are still a lot of compromises and tradeoffs to be made:
> We observe that the inter-chiplet communication issues can be effectively mitigated through proper software-hardware co-design
Doubtful, especially given it's all vapourware. Co-design is not magic enough to handwave this one away.
I think their analysis relies on missing the obvious use for SRAM: caching DRAM data.
SRAM is for data that needs to be read/written/used very frequently - for example, read on 1 out of every 10 clock cycles.
LLM weights are certainly not this. If a GPU is calculating 200 tokens per second, then most weights are only used 200 times per second. For a 1 GHz GPU, you're only using the data for 1 cycle out of 5,000,000! The rest of the time, that SRAM is wasted power, wasted silicon area, and eventually wasted dollars.
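The same arithmetic in code, with batch size as the knob that changes the picture (numbers are the same illustrative ones):

```python
# The weight-reuse arithmetic above, spelled out. Same illustrative numbers:
# a 1 GHz part generating 200 tokens/s.

clock_hz = 1e9
tokens_per_sec = 200
batch = 1                # sequences sharing each weight read

reads_per_sec = tokens_per_sec * batch     # each weight is read ~once per token per sequence
cycles_between_reads = clock_hz / reads_per_sec
print(f"each weight is touched once every ~{cycles_between_reads:,.0f} cycles")   # ~5,000,000

# Batching changes the picture: at batch = 256 the same weight read is reused
# 256 times, which is how GPUs amortize DRAM traffic without on-chip weights.
```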
Instead they should use SRAM for the intermediate results (i.e. the accumulators) of matrix multiplication - those will end up being read/written every few cycles.
Weights should be streamed in from in-package DRAM. Activations too (but they are often used multiple times in quick succession, so it might make sense to cache them in SRAM).
I think it’s all about the performance-to-cost ratio. The reason you need a cache is that you want to reduce the latency and power of accessing data. DRAM can also be thought of as a cache for disk drives - so why don't people use cheap disk drives for deep learning? Because they're way too slow.
Keeping weights in SRAM is more expensive than keeping them in DRAM; however, the latency and energy of streaming weights in from DRAM are even more expensive. LLM inference is heavily memory bound, and I guess that's why they use an expensive but faster memory.
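A rough sense of why "memory bound" bites, assuming a hypothetical 70B-parameter model with 8-bit weights:

```python
# Sketch: DRAM bandwidth needed to stream all weights every decode step.
# Model size, precision, and rates are illustrative assumptions.

params = 70e9            # hypothetical 70B-parameter model
bytes_per_weight = 1     # 8-bit weights
weight_bytes = params * bytes_per_weight

for steps_per_sec in (10, 50, 200):
    bw = weight_bytes * steps_per_sec
    print(f"{steps_per_sec:4d} decode steps/s -> {bw / 1e12:5.1f} TB/s of weight traffic")

# Large batches amortize this (every sequence in the batch shares one pass over
# the weights); for small batches or tight latency you either need enormous HBM
# bandwidth or weights that already sit on-chip.
```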
This might only make sense for companies like Google and Microsoft, who really need to serve LLMs at millions of tokens per second and really care about the performance-to-cost ratio.
Large language models would need tens or hundreds of gigabytes of SRAM. Pretty sure the enormous cost of that much SRAM makes the approach economically unfeasible.
At high-end production nodes it is impossible to get an entire wafer free of defects. Chips, let alone wafers, already include circuitry to disable parts of themselves if those parts have defects. Cerebras must have spent a ton of effort getting this to work across a full wafer. You also have problems like wafer-level variability, which you're less sensitive to when you put thousands of chips on a single wafer rather than just one, since each chip covers a smaller area.
Look at how successful AMD's chiplet strategy has been. Chiplets sidestep the yield problems. Wafer scale amplifies them a hundred- or thousand-fold.
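The usual back-of-envelope here is a Poisson yield model, Y = exp(-A * D0); the defect density below is an illustrative guess, not a foundry number.

```python
# Poisson yield model: Y = exp(-A * D0) with die area A in cm^2 and defect
# density D0 in defects/cm^2. D0 below is an illustrative guess for a mature node.
import math

D0 = 0.1  # defects per cm^2 (illustrative)

def die_yield(die_mm2):
    return math.exp(-(die_mm2 / 100.0) * D0)

for die_mm2 in (75, 150, 800):   # small chiplet, large chiplet, reticle-limit die
    print(f"{die_mm2:4d} mm^2 die -> ~{die_yield(die_mm2):.0%} expected yield")

# A full 300 mm wafer is ~70,000 mm^2, so exp(-70) is effectively zero - which is
# why a wafer-scale part needs pervasive redundancy and defect bypass instead of
# hoping for a clean wafer.
```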
Nothing in the industry is designed to work with wafer-scale products, so everything has to be custom-made. Yes, this is a chicken-and-egg problem, but it's going to be expensive to get any sort of momentum. The silicon industry is extremely conservative.
It's sexy and enticing. If someone can make it work that's awesome. I will remain skeptical though.
> Cerebras achieves 100% yield by designing a system in which any manufacturing defect can be bypassed – initially Cerebras had 1.5% extra cores to allow for defects, but we’ve since been told this was way too much as TSMC's process is so mature.
I'm well aware of chip defect rates and how they affect chips.
> It's sexy and enticing. If someone can make it work that's awesome. I will remain skeptical though.
But Cerebras has made it work since 2019, as @cubefox pointed out. They're already on their second generation and have been shipping to customers for years.