This seems like a pretty bad paper. Their headline claim of being 300x faster than an A100 at serving GPT-3 uses obviously wrong numbers for how fast A100s can run GPT-3. They seem to have misread the DeepSpeed Inference paper and claim that the best throughput on GPT-3-sized models was 18 tok/s, but if you look at figure 8 on page 11 of that paper [1], it shows they are able to achieve ~74 teraflops when serving LM-175B (GPT-3), which is about 211 tok/s. To calculate TCO: 211 tok/s is roughly 760,000 tok/hour, and A100s rent for about $1/hr, so TCO per 1k tokens using that paper's numbers is about $0.0013, much lower than the $0.02 they claim. That reduces the claimed TCO advantage from 94x to about 6.2x. Combine this with the fact that the paper they compare against is a year old and inference methods have gotten more efficient since then, and the speedup probably drops even lower, maybe to 3x. And that's without even looking at the chip design itself, whose costs are probably far underestimated.
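Here's the back-of-the-envelope version of that math, assuming the usual ~2 FLOPs per parameter per generated token and ~$1 per A100-hour (both my assumptions, not numbers taken from either paper):

    # assumed: 2 FLOPs/param/token for a 175B-parameter decoder, $1 per A100-hour
    params = 175e9
    flops_per_token = 2 * params                  # ~350 GFLOPs per generated token
    achieved_flops = 74e12                        # Fig. 8 of the DeepSpeed paper [1]
    tok_per_s = achieved_flops / flops_per_token  # ~211 tok/s
    tok_per_hr = tok_per_s * 3600                 # ~760,000 tok/hr
    tco_per_1k_tok = 1.00 / (tok_per_hr / 1000)   # ~$0.0013 per 1k tokens
    # their assumed $0.02 is ~15x higher, so the claimed 94x advantage drops to ~6x
    print(tok_per_s, tok_per_hr, tco_per_1k_tok)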
The whole thing is imaginary: "In this paper, we propose Chiplet Cloud, a chiplet-based ASIC AI-supercomputer architecture that optimizes total cost of ownership (TCO) per generated token for serving large generative language models to reduce the overall cost to deploy and run these applications in the real world."
So they are comparing actual implementations with a theoretical implementation. Never mind that they got the A100 figures wrong, they are still in the 'wouldn't it be nice if we had X' stage. This looks like a paper whose sole purpose is to raise funds for a research project that will probably ultimately go nowhere, and they needed a reason that looks good on paper to increase their chances of getting funded. An A100 can already be had for $0.87/hour, so even their theoretical advantage is under significant pressure, and even assuming they got everything else right, by the time the project has run its course the market will have overtaken them. This is what usually happens to application-specific chips.
Sure, to me it is more of an extra item than the main one, but it is one that you can readily verify, because most of the other claims are far more vague. If they're willing to fudge on that one then I have much less confidence in the rest of their claims.
Yeah, it is an architectural simulation study, which is what is usually done right at the beginning, before resources are allocated to go deep on an idea. So in that sense it is imaginary; but this is how new ideas get incubated.
Take three ideas that are hot (chiplets, cloud, and LLMs), remix them into the title of a paper that describes a hypothetical machine... academia playing catch-up and trying to stay relevant, to my cynical eye.
I asked GPT for giggles and its comparison is much more thorough; it also covers performance-per-watt improvements, the benefits of denser packing, and the sustainability of moving toward more energy-efficient solutions.
I think the costs are from the Moonwalk model, which is a pretty good reference for estimating costs, although it might be low if you use all Google engineers to build the HW. =P
It shows 18 tokens per second, but I think that's the rate at which tokens are generated per sequence. The total number of tokens generated is that times the batch size, which appears to be 12? The graph is quite unclear and I didn't feel like reading the paper in more depth.
Seems to me the 18 tokens per second from [1] is the throughput and already includes the batch size, so I don't think they misread the DeepSpeed Inference paper. So the chiplet ASIC supercomputer paper would seem to show a decent performance/TCO benefit.
Of course, it's a first architectural study to illustrate the promise of the idea, lots more details to work out in a physical implementation and the final realized benefit is likely to be lower. But even a 3X is huge in this space.
If you look at the "metrics" section on Page 9, it says:
> 2) Metrics: We use three performance metrics: (i) latency, i.e., end-to-end output generation time for a batch of input prompts, (ii) token throughput, i.e., tokens-per-second processed, and (iii) compute throughput, i.e., TFLOPS per GPU.
This is somewhat confusing to me because at least two of these three definitions should be essentially the same thing, but I don't think there's any way to interpret their claim of ~74 teraflops achieved other than ~211 tokens/second of throughput.
Put another way, 18 tokens per second is about 2% FLOPs utilization, and we can obviously do better than that for bulk inference.
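For reference, the 2% figure falls out of the same assumptions I used above (2 FLOPs/param/token, and ~312 TFLOPS dense FP16/BF16 peak for an A100; both are my numbers, not the paper's):

    flops_per_token = 2 * 175e9      # ~350 GFLOPs per token for GPT-3
    achieved = 18 * flops_per_token  # 18 tok/s -> ~6.3 TFLOPS
    peak = 312e12                    # assumed A100 dense FP16/BF16 peak
    print(achieved / peak)           # ~0.02, i.e. ~2% of peak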
3x is not huge in this space because just using a 4090 instead of an A100 is a 5x gain.
Maybe 74 TFLOPS is the best they've achieved, but not all 16 GPUs can consistently hit that number? Just guessing. The 211 tokens/sec throughput on a GPU is just insane; it's even better than what a TPU can do on PaLM 540B.
Well, LM-175B is 540/175 = 3.08x smaller, so it makes sense you would get better performance. Also, in Table D.4 it takes them 9.614s to process (128 input + 8 output = 136 tokens) * 256 batches = 34,816 tokens with 24 A100s, which is ~150 tok/s per A100. It feels totally plausible that you could hit 211 tok/s on the smaller GPT-3. I think 211 tok/s is in fact a pretty poor showing from them and you could do significantly better.
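Redoing that Table D.4 arithmetic explicitly (the 128 + 8 token counts, batch size, and chip count are just my reading of the numbers quoted above, so treat them accordingly):

    tokens_per_seq = 128 + 8   # prompt + generated tokens per sequence
    batch = 256
    chips = 24
    latency_s = 9.614
    total_tokens = tokens_per_seq * batch              # 34,816 tokens per batch
    print(total_tokens / latency_s / chips)            # ~151 tok/s per chip on a 540B model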
One possible explanation is that they hit the teraflops number during the prefill stage, where you can process all the prompt tokens at once and the arithmetic intensity is much higher, so you can use more compute. Utilization usually drops during the token generation stage. The utilization of the TPU during token generation is 3% at batch size 16 (https://arxiv.org/pdf/2211.05102.pdf, table on the last page).
I mean maybe? This seems unlikely. I agree that decode is much more expensive and tok/s depends a lot on what your ratio of decode tokens to prefill tokens is.
This table was very helpful, by the way; I hadn't seen it before. To me it clearly shows that 211 tok/s/A100 is very plausible, and in fact kind of a poor showing: if you look at Table D.4, and specifically the results for BS=256 PP3/TP8, they achieve ~150 tok/s/A100 on a model that's 3x larger than GPT-3.
A key architectural feature to achieve this is the ability to fit all model parameters inside the on-chip SRAMs of the chiplets to eliminate bandwidth limitations. Doing so is non-trivial as the amount of memory required is very large and growing for modern LLMs
...
On-chip memories such as SRAM have better read latency and read/write energy than external memories such as DDR or HBM but require more silicon per bit. We show this design choice wins in the competition of TCO per performance for serving large generative language models but requires careful consideration with respect to the chiplet die size, chiplet memory capacity and total number of chiplets to balance the fabrication cost and model performance (Sec. 3.2.2). We observe that the inter-chiplet communication issues can be effectively mitigated through proper software-hardware co-design leveraging mapping strategies such as tensor and pipeline model parallelism.
SRAM has stopped scaling, based on TSMC's upcoming N3E specs and their planned N2 node specs. So if models are tens of GB in size, I don't see how their proposed chips can be done in an economical way.
Also, a GPU is already an ASIC but with a fancy name.
Nowadays GPUs have sacrificed some performance for better programmability. ASICs always trade programmability for better performance and energy efficiency; it's really about how 'specific' you want it to be. I guess for applications as important and popular as LLMs, we probably want a very 'specific' chip.
I actually think the chip-level HW-SW co-design is a good idea. It does open up more opportunities to mitigate communication issues than optimizing the mapping for a fixed chip and system design.
For example, the number of GPUs per server limits the maximum tensor model parallelism size; you don't want to do tensor parallelism across servers due to the low bandwidth between them.
Here the number of chips per server depends on chip size, cooling, etc., so you probably want to do the co-design -- you have the chance. It's difficult though.
Having hardware and software talk to each other before tape out is a really good idea. The early Graphcore work was done on a whiteboard with people from both sides writing on it.
There are still a lot of compromises and tradeoffs to be made:
> We observe that the inter-chiplet communication issues can be effectively mitigated through proper software-hardware co-design
Doubtful. Especially given it's all vapourware. Codesign is not adequately magic to handwave away this one.
I think their analysis relies on ignoring the obvious use for SRAM: caching DRAM data.
SRAM is for data that needs to be read/written/used very frequently - for example, read in 1 out of 10 clock cycles.
LLM weights are certainly not this. If a GPU is calculating 200 tokens per second, then most weights are only used 200 times per second. For a 1 GHz GPU, you're only using the data for 1 cycle out of 5,000,000! The rest of the time, that SRAM is wasted power, wasted silicon area, and eventually wasted dollars.
Instead they should use SRAM for the intermediate results (i.e. the accumulators) of matrix multiplication - those will end up being read/written every few cycles.
Weights should be streamed in from in-package DRAM. Activations too (but they are often used multiple times in quick succession, so it might make sense to cache them in SRAM).
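To put rough numbers on the reuse argument (the 1 GHz clock and 200 tok/s are the figures from above; the accumulator reuse rate is just my rough guess for illustration):

    clock_hz = 1e9
    tok_per_s = 200
    weight_reuse_cycles = clock_hz / tok_per_s  # each weight read once per ~5,000,000 cycles
    accum_reuse_cycles = 4                      # matmul accumulators are hit every few cycles (guess)
    print(weight_reuse_cycles, accum_reuse_cycles)
    # SRAM earns its area for the every-few-cycles data, not for once-per-token weights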
I think it's all about the performance-to-cost ratio. The reason you need a cache is that you want to reduce the latency and power of accessing data. DRAM can also be thought of as a cache for disk drives; why don't people use cheap disk drives for deep learning? They're way too slow.
Weights in SRAM are more expensive than weights in DRAM; however, the latency and energy of streaming weights in from DRAM cost even more. LLM inference is heavily memory bound, and I guess that's why they use an expensive but faster memory.
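A rough roofline for why decode is memory bound (fp16 weights and ~2 TB/s of A100-class HBM per chip are my assumptions; none of this comes from the Chiplet Cloud paper):

    weight_bytes = 175e9 * 2   # ~350 GB of fp16 weights for GPT-3
    hbm_bytes_per_s = 2e12     # assumed HBM bandwidth per chip
    # at batch size 1, every generated token re-reads the full weight set once,
    # so the per-chip ceiling is bandwidth / weights regardless of how you shard:
    print(hbm_bytes_per_s / weight_bytes)  # ~5.7 tok/s per chip before any compute limit
    # batching amortizes the weight reads; keeping weights in SRAM removes the traffic entirely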
This might only make sense for companies like Google and Microsoft, who really need to serve LLMs at millions of tokens per second and really care about the performance-to-cost ratio.
Large language models would need tens or hundreds of gigabytes of SRAM. Pretty sure the enormous cost for this makes the approach economically unfeasible.
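A very rough area sanity check (the ~0.021 um^2 high-density bitcell is roughly TSMC's published N5/N3E figure as I recall it; fp16 weights and a 2x macro overhead are my own assumptions):

    weight_bits = 175e9 * 16   # GPT-3 in fp16
    bitcell_um2 = 0.021        # ~N5/N3E high-density SRAM bitcell (from memory)
    macro_overhead = 2.0       # periphery, redundancy, routing (guess)
    area_mm2 = weight_bits * bitcell_um2 * macro_overhead / 1e6
    print(area_mm2)            # ~118,000 mm^2 of silicon
    # a 300 mm wafer is ~70,000 mm^2, so that's well over a full wafer of SRAM
    # per model copy before any compute logic -- hence the cost concern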
At high-end production nodes it is impossible to get an entire wafer free of defects. Chips, let alone wafers, already include circuitry to disable parts of themselves if those parts have defects. Cerebras must have spent a ton of effort on getting this done for a full wafer. You also have problems like wafer-level variability, which you're less sensitive to when you put thousands of chips on a single wafer rather than just one, since each chip covers a smaller area.
Look at how successful AMD's chiplet strategy has been. Chiplets sidestep the yield problems. Wafer scale amplifies them a hundred- or thousand-fold.
Nothing in the industry is designed to work with wafer scale products, so everything has to be custom made. Yes this is a chicken-egg problem, but it's going to be expensive to get any sort of momentum. The silicon industry is extremely conservative.
It's sexy and enticing. If someone can make it work that's awesome. I will remain skeptical though.
> Cerebras achieves 100% yield by designing a system in which any manufacturing defect can be bypassed – initially Cerebras had 1.5% extra cores to allow for defects, but we’ve since been told this was way too much as TSMC's process is so mature.
I'm well aware of chip defect rates and how they affect chips.
> It's sexy and enticing. If someone can make it work that's awesome. I will remain skeptical though.
But Cerebras has made it work since 2019 as @cubefox pointed out. They're on the second generation already and they have been shipping to customers for years.
They claim the investment will be justified over a 1.5-year life span of the system. But LLMs are changing and improving so fast that 1.5 years feels like centuries!
"Moving fast" may take on a whole new meaning and I'd put money on the rate of iteration soon being beyond the vast majority's comprehension (myself included).
We're already in the phase of AI self improvement, albeit through human mediation for the time being with copilot and other code generation tools that are used to get the next improved version out faster.
It's already beyond my comprehension. I've not lived very long, but in the time I have, I've never seen any technology develop so rapidly and at such a rapidly increasing pace. I assume this is what it must have felt like during the dawn of the age of computing.
It makes you wonder if those singularity proponents don't have a point, and it all depends on whether it keeps accelerating or whether it will slow down again. I hope for the latter and fear the former. Even if it does slow down eventually, a long enough period of such change is going to make the industrial revolution (whose negative effects we are still coming to terms with today!) look like a walk in the park.
Reality is biased towards the fast-moving scenario, so long as we aren’t running into the bounds of physics, which as far as I can tell we’re not. Kurzweil was much more right than he was wrong. The opposite is true of people who strongly disagreed with him and called him a quack.
The transhumanist movement has more than its fair share of quacks, but I think Kurzweil is enough of a scientist to take his arguments a bit more seriously. That said, I'm getting pretty tired of all the mind uploading, eternal life, and other afterlife nonsense. That to me is just religion in a new jacket.
Ya, there isn't enough talk of the more medium-term, practical uses of AI. I don't care about mind uploading or AI doom risk. Will AI make stuff cheaper and better? Will it incrementally improve my life? I think the answers are yes and yes. That's where the focus should be. Where and how can AI help in construction, finance, medicine, etc.?
A so-called singularity would require accelerated development in many more technological spheres, not just semiconductor fabrication and the related computation and AI advances: logistics, supply chains, mining, farming, manufacturing, energy, biotechnology. While the former may continue to accelerate development in the latter categories, the scale of such impact is purely speculative. I don't believe the advances will be proportional in these harder spheres, as their physicality cannot be as readily manipulated as information.
Advances in the harder spheres will come. It’s just a matter of time. Their transformation will happen in a step-wise fashion, unlike the curvy exponential growth you’re seeing in pure software.
AI cannot safely and cheaply be used to drive trucks. But as soon as it can with one truck, it can with millions of trucks. All at once.
AI and robots haven’t advanced enough to replace construction workers for fluid, dexterous tasks, but as soon as they do, robots can replace millions of construction workers and surpass them in sophistication and speed.
This will happen in our lifetime, and the change will be extremely transformative.
Seems to align with Google; they seem to release a new TPU every 1.5 to 2 years.
I think modern LLMs are powerful enough now that they will still be useful in a couple of years even if they aren't state-of-the-art. ChatGPT still lets you run their older model for cheaper than running GPT-4; I could see a world where GPT-4 is still available in 1.5 years even if there are better models out there.
1.5 years is actually not that bad. In fact, all the changes and improvements to LLMs since the original Transformer paper are just scale -- tensor dimensions, layers, etc. GPT-3, which is still widely used today, was proposed more than 3 years ago.
When the LLM wave first burst into public consciousness, I hoped that people would find a way to repurpose all the crypto-mining hardware for this -- alas, a different set of problems.
A 94x cost improvement over GPUs and 15x over TPUs is insane, but it fits right in with the kind of performance gains seen under Moore's Law.
This development presents a more compelling case that we are in fact on the cusp of larger LLMs being able to serve everyone cheaply. Still not really convinced by the AGI argument, but this does spook me. Overall though, very cool.
It's insane because it is theoretical. They haven't shown that it works; think of this paper as a prelude to a funding round or research grant, so they have to show some kind of advantage. I'm highly skeptical of it: usually when papers show this kind of improvement over SOTA it tends to be either a mistake or purposeful nonsense.
A group at my school recently got a 10MM grant for such a fantasy. All they had was an ISA - no RTL, no functional model, no compiler. A kid in my group (co-advised) is busy scrawling assembly on notebook paper, lol. Suffice it to say I don't have high hopes for a tapeout anytime soon.
Just skimmed the paper. Seems to me like this paper wants to optimize transformer inference e2e, i.e. from ASIC level all the way to cloud.
I'm not exactly convinced though, since all the results seem to be purely theoretical or simulated. I would've liked to see a prototype built across several FPGAs with clock speeds extrapolated for ASICs.
I think FPGAs would be an awesome prototype, but maybe too constraining in terms of resources? The extrapolation might be so far out as to be only as accurate as their simulated model...
It seems fine to say "others have proved that this math makes a good LLM, we have designed an ASIC that can do this math fast, therefore we can make a good fast LLM"
SRAM uses multiple transistors per bit and takes up a lot more area than DRAM, so it is inherently more expensive. The advantage is that it is fast and doesn't need refreshing like DRAM does. You can also put it on the same die as your computation logic, which is technically possible with DRAM too, but kinda silly since DRAM needs an optimized process to get the best out of it, and that process is quite bad for high-speed logic.
The problem with a hardware solution is the lack of flexibility. LLMs are not (yet) an established enough technology to warrant fixed-in-silicon solutions, compared to, say, GPUs.
Yes, that's because Cerebras's "chip" is actually an entire wafer that many GPUs would normally be carved out of.
The extra stuff TSMC must do to pull that off is probably expensive... but I can't imagine it being, say, 10x more expensive than a wafer full of reticle-sized dies (like Nvidia makes). And that's setting aside the massive I/O advantage of Cerebras's mega die.
[1]: https://arxiv.org/pdf/2207.00032.pdf