This seems to be more achievable and cost-effective than Cerebras. Some comments mention that Cerebras costs millions for each 'die'.


Yes, that's because Cerebras's "chip" is actually an entire wafer that many GPUs would normally be carved out of.

The extra stuff TSMC must do to pull that off is probably expensive... but I can't imagine it being, say, 10x more expensive than a wafer full of reticle-sized dies (like Nvidia does). And that's setting aside the massive IO advantage of Cerebras's mega-die.


I think it's all about the performance-to-cost ratio. The reason you need a cache is that you want to reduce the latency and power of accessing data. DRAM can also be thought of as a cache for disk drives, so why don't people use cheap disk drives for deep learning? They're way too slow. Weights in SRAM are more expensive than weights in DRAM, but the latency and energy of streaming weights in from DRAM are even more expensive than that. LLM inference is so memory-bound, and I guess that's why they use an expensive but faster memory. This might only make sense for companies like Google and Microsoft, who really need to run LLMs at millions of tokens per second and really care about the performance-to-cost ratio.
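To make the memory-bound point concrete, here is a rough back-of-the-envelope sketch (the model size, precision, and bandwidth numbers below are my own illustrative assumptions, not from the article): at small batch sizes every generated token has to stream the full set of weights, so decode throughput is roughly memory bandwidth divided by weight bytes.

    # Rough sketch with illustrative numbers: at batch size 1, each generated
    # token streams all the weights, so decode speed ~= bandwidth / weight bytes.
    def decode_tok_per_sec(params_billion, bytes_per_param, bandwidth_tb_s):
        weight_bytes = params_billion * 1e9 * bytes_per_param
        return bandwidth_tb_s * 1e12 / weight_bytes

    print(decode_tok_per_sec(175, 2, 2.0))    # 175B in fp16 from ~2 TB/s HBM: ~5.7 tok/s
    print(decode_tok_per_sec(175, 2, 100.0))  # same weights from ~100 TB/s SRAM: ~285 tok/s

Batching amortizes the weight traffic, but the basic point stands: for generation, the bytes you have to move per token matter more than the FLOPs.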


Nowadays GPUs have sacrificed some performance for better programmability. ASICs always trade programmability for better performance and energy efficiency; it's really about how 'specific' you want the chip to be. I guess for applications as important and popular as LLMs, we probably want a very 'specific' chip.


Maybe 74 TFLOPS is the best they've achieved, but not all 16 GPUs can consistently hit that number? Just guessing... The 211 tokens/sec throughput on GPU is just insane; it's even better than what a TPU can do on PaLM 540B.


Well, LM-175B is 540/175 = 3.08x smaller, so it makes sense you would get better performance. Also, in Table D.4 it takes them 9.614 s to process (128 input + 8 output tokens = 136 tok) * 256 batches = 34,816 tokens with 24 A100s, which is ~150 tok/s/A100. It feels totally plausible that they could hit 211 tok/s even with a bigger model. I think 211 tok/s is in fact a pretty poor showing from them, and you could do significantly better.
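Quick sanity check of that arithmetic, using the same numbers quoted above:

    # Same figures as quoted from Table D.4, just redoing the division:
    total_tokens = (128 + 8) * 256           # 136 tok/sequence * 256 sequences = 34,816
    tok_per_a100 = total_tokens / 9.614 / 24 # 9.614 s wall clock across 24 A100s
    print(tok_per_a100)                      # ~150.9 tok/s per A100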


Looks like Figure 8 of paper [1] says it is 18 tokens/s.


~200MB to 1GB per ASIC, from Table 2 on page 10.

