Chiplet ASIC supercomputers for LLMs like GPT-4 (arxiv.org)
142 points by kraken12 on July 12, 2023 | 84 comments



This seems like a pretty bad paper. Their headline claim that they are 300x faster than an A100 at serving GPT-3 uses obviously wrong numbers for how fast A100s can run GPT-3. They seem to have misread the DeepSpeed Inference paper and claim that the best throughput on GPT-3-sized models was 18 tok/s, but if you look at Figure 8 on page 11 of that paper [1], it shows they are able to achieve ~74 teraflops serving LM-175B (GPT-3), which is about 211 tok/s. To calculate TCO: 211 tok/s is about 760,000 tok/hour and A100s are about $1/hr, so TCO per 1k tokens using that paper's method is about $0.0013, much lower than the $0.02 that they claim, which reduces the claimed TCO advantage from 94x to about 6.2x. Combine this with the fact that the paper they used is a year old and there are more efficient inference methods now, and the speedup probably drops even lower, maybe to 3x. And this is without even looking at the chip design itself, whose costs are probably far underestimated.

[1]: https://arxiv.org/pdf/2207.00032.pdf
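
For reference, spelling that correction out as a quick back-of-the-envelope (assuming ~2 FLOPs per parameter per generated token and the $1/hr A100 price above):

    params = 175e9                                     # GPT-3
    flops_per_token = 2 * params                       # ~350 GFLOPs per generated token
    tok_per_s = 74e12 / flops_per_token                # ~211 tok/s from the ~74 TFLOPS achieved
    usd_per_1k_tok = 1.0 / (tok_per_s * 3600) * 1000   # ~$0.0013 at $1/hr per A100
    corrected_advantage = 94 * usd_per_1k_tok / 0.02   # ~6.2x instead of 94x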


The whole thing is imaginary: "In this paper, we propose Chiplet Cloud, a chiplet-based ASIC AI-supercomputer architecture that optimizes total cost of ownership (TCO) per generated token for serving large generative language models to reduce the overall cost to deploy and run these applications in the real world."

So they are comparing actual implementations with a theoretical implementation. Never mind that they got the A100 figures wrong, they are still in the 'wouldn't it be nice if we had X' stage. This looks like a paper whose sole purpose is to raise funds for a research project that will probably ultimately go nowhere, and they needed a reason that looks good on paper to increase their chances of getting funded. A100s can already be had for $0.87/hour, so even their theoretical advantage is under significant pressure, and assuming they got everything else right, by the time the project has run its course the market will have overtaken them. This is what usually happens to application-specific processors.


The $0.87/hour price you gave is theoretical, and besides, we know any compute price quoted in a paper is wrong by the time of publication.

Pragmatically the prices are closer to $2/hr according to this recent post here on Hacker News: https://llm-utils.org/Nvidia+H100+and+A100+GPUs+-+comparing+...

Although again prices change on a daily basis on spot providers.


> The $0.87/hour price you gave is theoretical

https://cloud.google.com/blog/products/compute/a2-vms-with-n...

That's as close as I got to verifying that price.


Another list that shows both fixed and spot pricing. The best GCP spot price is $1.10, but Jarvis seems to say its spot price for the 40GB A100 is $0.69:

https://fullstackdeeplearning.com/cloud-gpus/

I feel there are fairer criticisms of that paper than its inclusion of a snapshot price for a variably priced compute resource.


Sure, to me it's more of an extra item than the main one, but it is one that you can readily verify because most of the other claims are far more vague. If they're willing to fudge that one then I have much less confidence in the rest of their claims.


Seems like HN comments have determined that the cost number is not fudged.


Yeah, it is an architectural simulation study; this is what is usually done right at the beginning, before resources are allocated to go deep on an idea. So in that sense it is imaginary, but this is how new ideas get incubated.


Take three ideas that are hot right now (chiplets, cloud, and LLMs), remix them into the title of a paper that describes a hypothetical machine... academia playing catch-up and trying to stay relevant, to my cynical eye.


Using ChatGPT


I asked GPT for giggles and its comparison is much more thorough: it also wrote about performance-per-watt improvements, the benefits of denser packing, and the sustainability of moving toward more energy-efficient solutions.


I did the exact same thought exercise using ChatGPT, but I haven't produced a paper out of the chat session.


I think the costs are from the Moonwalk model, which is a pretty good reference for estimating costs, although it might be low if you use all Google engineers to build the HW. =P


Looks like figure 8 of paper [1] says it is 18 tokens/s


Doesn't that graph show a tok/s of 18? Or am I reading it wrong?


It shows 18 tokens per second, but I think that's the rate at which each sequence generates tokens. The total number of tokens generated is that times the batch size, which appears to be 12? The graph is quite unclear and I didn't feel like reading the paper in more depth.


Seems to me the 18 tokens per second from [1] is the throughput and already includes the batch size, so I don't think they misread the DeepSpeed Inference paper. So the chiplet ASIC supercomputer paper would seem to show a decent performance/TCO benefit.

Of course, it's a first architectural study to illustrate the promise of the idea; lots more details to work out in a physical implementation, and the final realized benefit is likely to be lower. But even a 3x is huge in this space.


If you look at the "metrics" section on Page 9, it says:

> 2) Metrics: We use three performance metrics: (i) latency, i.e., end-to-end output generation time for a batch of input prompts, (ii) token throughput, i.e., tokens-per-second processed, and (iii) compute throughput, i.e., TFLOPS per GPU.

This is somewhat confusing to me because at least two of these three definitions should be essentially the same thing, but I don't think there's any way to interpret their claim of ~74 teraflops achieved other than ~211 tokens/second of throughput.

Put another way, 18 tokens per second is 2% FLOPS utilization, which we are obviously capable of beating for bulk inference.
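
For reference, here's that utilization arithmetic, assuming an A100 peak of 312 dense FP16 TFLOPS (no sparsity):

    flops_per_token = 2 * 175e9                      # ~350 GFLOPs per generated token for GPT-3
    a100_peak_flops = 312e12                         # dense FP16/BF16 peak
    print(18 * flops_per_token / a100_peak_flops)    # ~0.02  -> ~2% utilization
    print(211 * flops_per_token / a100_peak_flops)   # ~0.24  -> ~24% utilization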

3x is not huge in this space because just using a 4090 instead of an A100 is a 5x gain.


Maybe 74 TFLOPS is the best they've achieved, but not all 16 GPUs can consistently hit that number? Just guessing. The 211 tokens/sec throughput on a GPU is just insane; it's even better than what a TPU can do on PaLM 540B.


Well, LM-175B is 540/175 = 3.08x smaller, so it makes sense you would get better performance. Also, in Table D.4 it takes them 9.614s to process (128 input + 8 output tokens =) 136 tokens per sequence * 256 batch = 34,816 tokens on 24 A100s, which is ~150 tok/s/A100. Given that, it feels totally plausible that they could hit 211 tok/s on a model 3x smaller. In fact I think 211 tok/s is a pretty poor showing and you could do significantly better.
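
Spelling out that Table D.4 arithmetic for anyone following along:

    tokens = (128 + 8) * 256          # 136 tokens per sequence * 256 sequences = 34,816
    latency_s = 9.614
    gpus = 24
    print(tokens / latency_s / gpus)  # ~150 tok/s per A100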


One possible explanation is that they hit the teraflops number during the prefill stage, where you can process all the prompt tokens at once and the operational intensity is higher, so you can use more compute. Utilization usually drops during the token generation stage. The utilization of the TPU during token generation is 3% at batch size 16 (https://arxiv.org/pdf/2211.05102.pdf, table on the last page).


I mean maybe? This seems unlikely. I agree that decode is much more expensive and tok/s depends a lot on what your ratio of decode tokens to prefill tokens is.

This table was very helpful by the way, I hadn't seen that before. To me it clearly shows that 211 tok/s/A100 is very plausible and in fact kind of a poor showing: if you look at Table D.4, specifically the results for BS=256 PP3/TP8, they achieve ~150 tok/s/A100 on a model that's 3x larger than GPT-3.


The 4090 has half the memory bandwidth, so it could not get a 5x gain; it would actually run slower on a memory-bound LLM like this.
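
A crude way to see the memory-bound point, ignoring that GPT-3 has to be sharded across several cards anyway, and assuming roughly the published bandwidth specs (~2 TB/s for an 80GB A100, ~1 TB/s for a 4090):

    weight_bytes = 175e9 * 2            # GPT-3 weights in FP16
    a100_bw, rtx4090_bw = 2.0e12, 1.0e12
    # upper bound on single-stream decode rate if every weight is read once per token
    print(a100_bw / weight_bytes)       # ~5.7 tok/s
    print(rtx4090_bw / weight_bytes)    # ~2.9 tok/s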


5x gain per dollar


The key point:

> A key architectural feature to achieve this is the ability to fit all model parameters inside the on-chip SRAMs of the chiplets to eliminate bandwidth limitations. Doing so is non-trivial as the amount of memory required is very large and growing for modern LLMs

...

> On-chip memories such as SRAM have better read latency and read/write energy than external memories such as DDR or HBM but require more silicon per bit. We show this design choice wins in the competition of TCO per performance for serving large generative language models but requires careful consideration with respect to the chiplet die size, chiplet memory capacity and total number of chiplets to balance the fabrication cost and model performance (Sec. 3.2.2). We observe that the inter-chiplet communication issues can be effectively mitigated through proper software-hardware co-design leveraging mapping strategies such as tensor and pipeline model parallelism


Judging by TSMC's upcoming N3E specs and its planned N2 node specs, SRAM has essentially stopped scaling. So if models are tens of GB in size, I don't see how their proposed chips can be built economically.

Also, a GPU is already an ASIC but with a fancy name.


Nowadays GPUs have sacrificed some performance for better programmability. ASICs always trade programmability for better performance and energy efficiency; it's really about how 'specific' you want the chip to be. I guess for applications as important and popular as LLMs, we probably want a very 'specific' chip.


Maybe they could do something like AMD's GPU memory stacking, which is good for scaling, and of course they are using many chips, not one chip.


Graphcore got up to a gigabyte or so of on-chip memory with the same plan of keeping the model in that memory. It does work really well if the data fits.

Recent x64 chips have about that amount of L3 cache, which might be pretty similar. I've lost track of GPU hardware specs.

That proper hardware-software co-design to mitigate communication? Viciously difficult, bordering on imaginary.


I actually think the chip-level HW-SW co-design is a good idea. It opens up more opportunities to mitigate communication issues than just optimizing the mapping for a fixed chip and system design. For example, the number of GPUs per server limits the maximum tensor model parallelism size; you don't want to do tensor parallelism across servers due to the low bandwidth between them. Here the number of chips per server depends on chip size, cooling, etc. So you probably want to do the co-design -- you have the chance. It's difficult though.


Having hardware and software talk to each other before tape out is a really good idea. The early Graphcore work was done on a whiteboard with people from both sides writing on it.

There's still a lot of compromises and tradeoffs to be made:

> We observe that the inter-chiplet communication issues can be effectively mitigated through proper software-hardware co-design

Doubtful. Especially given it's all vapourware. Codesign is not adequately magic to handwave away this one.


I think their analysis relies on missing the obvious use for SRAM: as a cache for DRAM data.

SRAM is for data that needs to be read/written/used very frequently, for example read once every 10 clock cycles or so.

LLM weights are certainly not this. If a GPU is calculating 200 tokens per second, then most weights are only used 200 times per second. For a 1 GHz GPU, you're only using the data for 1 cycle out of 5,000,000! The rest of the time, that SRAM is wasted power, wasted silicon area, and eventually wasted dollars.

Instead they should use SRAM for the intermediate results (i.e. the accumulators) of matrix multiplication, which end up being read/written every few cycles.

Weights should be streamed in from in-package DRAM. Activations too (but they are often used multiple times in quick succession, so it might make sense to cache them in SRAM).
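
To put rough numbers on the reuse argument above (the 200 tok/s and 1 GHz figures are the ones I used; the accumulator interval is just illustrative):

    clock_hz = 1e9
    tok_per_s = 200
    print(clock_hz / tok_per_s)   # ~5,000,000 cycles between uses of any given weight
    # vs. matmul accumulators, which get read/written again within a handful of cycles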


I think it's all about the performance-to-cost ratio. The reason you need a cache is that you want to reduce the latency and power of accessing data. DRAM can also be thought of as a cache for disk; why don't people use cheap disks for deep learning? They're way too slow. Keeping weights in SRAM is more expensive than keeping them in DRAM, but the latency and energy of streaming weights in from DRAM is even more expensive than that. LLM inference is so memory bound that I guess that's why they use an expensive but faster memory. This might only make sense for companies like Google and Microsoft, who really need to serve LLMs at millions of tokens per second and really care about the performance-to-cost ratio.


Large language models would need tens or hundreds of gigabytes of SRAM. Pretty sure the enormous cost of this makes the approach economically infeasible.


Cerebras Wafer Scale Engine has 40GB of onboard SRAM using TSMC 7nm. It uses the entire wafer as the chip.

Costs millions per chip though.

Source: https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...


40 GB wouldn't even be enough for LLaMA 65B with 8-bit quantization, let alone something like GPT-4.
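
For scale, weight storage alone (ignoring KV cache and activations):

    def weight_gb(params_billion, bits=8):
        return params_billion * bits / 8   # GB of weights only
    print(weight_gb(65))     # 65 GB for LLaMA 65B at 8-bit, vs 40 GB of on-wafer SRAM
    print(weight_gb(175))    # 175 GB for GPT-3, and GPT-4 is presumably much larger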


Wafer scale integration is a dead end technology. The engineering issues are just too great.


Do you have a source?


At high-end production nodes it is impossible to get an entire wafer free of defects. Chips, let alone wafers, already include circuitry to disable parts of themselves if those parts have defects. Cerebras must have spent a ton of effort on getting this to work for a full wafer. You also have problems like variability across the wafer, which you're less sensitive to when you put thousands of chips on a single wafer rather than just one, since each covers a smaller area.

Look at how successful AMD's chiplet strategy has been. Chiplets sidestep the yield problems. Wafer scale amplifies them a hundred- or thousand-fold.

Nothing in the industry is designed to work with wafer scale products, so everything has to be custom made. Yes this is a chicken-egg problem, but it's going to be expensive to get any sort of momentum. The silicon industry is extremely conservative.

It's sexy and enticing. If someone can make it work that's awesome. I will remain skeptical though.


> Cerebras achieves 100% yield by designing a system in which any manufacturing defect can be bypassed – initially Cerebras had 1.5% extra cores to allow for defects, but we’ve since been told this was way too much as TSMC's process is so mature.


I'm well aware of chip defect rates and how they affect chips.

>It's sexy and enticing. If someone can make it work that's awesome. I will remain skeptical though.

But Cerebras has made it work since 2019 as @cubefox pointed out. They're on the second generation already and they have been shipping to customers for years.

Here's a good overview of how they did it: https://www.anandtech.com/show/14758/hot-chips-31-live-blogs...


They are using many chips and taking advantage of the way data flows in LLMs to make it work, so it would be cost-effective, unlike Cerebras.


I mean... Cerebras has _plenty_ of on-wafer SRAM and is basically equivalent to a maximum-scale hypothetical chiplet.


They claim the investment will be justified over a 1.5-year life span of the system. But LLMs are changing and improving at such a pace that 1.5 years feels like centuries!


"Moving fast" may take on a whole new meaning and I'd put money on the rate of iteration soon being beyond the vast majority's comprehension (myself included).


We're already in the phase of AI self-improvement, albeit through human mediation for the time being, with Copilot and other code-generation tools being used to get the next improved version out faster.


It's already beyond my comprehension. I've not lived very long, but in the time I have, I've never seen any technology develop so rapidly and at such a rapidly increasing pace. I assume this is what it must have felt like during the dawn of the age of computing.


It makes you wonder if those singularity proponents don't have a point, and it all depends on whether it keeps accelerating or whether it will slow down again. I hope for the latter and I fear the former. Even if it does slow down eventually, a long enough period of such change is going to make the industrial revolution (whose negative effects we are still coming to terms with today!) look like a walk in the park.


Reality is biased towards the fast-moving scenario, so long as we aren’t running into the bounds of physics, which as far as I can tell we’re not. Kurzweil was much more right than he was wrong. The opposite is true of people who strongly disagreed with him and called him a quack.


The transhumanist movement has more than its fair share of quacks, but I think Kurzweil is enough of a scientist to take his arguments a bit more seriously. That said, I'm getting pretty tired of all the mind uploading, eternal life and other afterlife nonsense. That to me is just religion in a new jacket.


Yeah, there isn't enough talk of the more medium-term, practical uses of AI. I don't care about mind uploading or AI doom risk. Will AI make stuff cheaper and better? Will it incrementally improve my life? I think the answers are yes and yes. That's where the focus should be: where and how AI can help in construction, finance, medicine, etc.


A so-called singularity would require accelerated development across many more technological spheres, not just semiconductor fabrication and the related computation and AI advances: logistics, supply chains, mining, farming, manufacturing, energy, biotechnology. While the former may continue to accelerate development in the latter categories, the scale of such impact is purely speculative. I don't believe the advances will be proportional in these harder spheres, as their physicality cannot be as readily manipulated as information.


Advances in the harder spheres will come. It’s just a matter of time. Their transformation will happen in a step-wise fashion, unlike the curvy exponential growth you’re seeing in pure software.

AI cannot safely and cheaply be used to drive trucks. But as soon as it can with one truck, it can with millions of trucks. All at once.

AI and robots haven’t advanced enough to replace construction workers for fluid, dexterous tasks, but as soon as they do, robots can replace millions of construction workers and surpass them in sophistication and speed.

This will happen in our lifetime, and the change will be extremely transformative.


Materials science and genetics are prime candidates for a different approach, and AI techniques are already paying dividends in those domains.


Uploading Ray Kurzweil to the cloud in 5, 4, 3…


Seems to align with Google; they seem to release a new TPU every 1.5 to 2 years.

I think modern LLMs are powerful enough now that they will still be useful in a couple of years even if they aren't state-of-the-art. ChatGPT still lets you run their older model for cheaper than running GPT-4; I could see a world where GPT-4 is still available in 1.5 years even if there are better models out there.


1.5 years is actually not that bad. In fact, all the changes and improvements to LLMs since the original Transformer paper are just size -- tensor dimensions, layers, etc. GPT-3, which is still widely used today, was proposed more than 3 years ago.


So one day we'll be buying LLM cartridges like we used to buy cartridges for the Atari.


Your ChatGPT6 cartridge is empty. Please replace your ChatGPT6 cartridge.


When the LLM wave first burst into public consciousness, I hoped that people would find a way to repurpose all the crypto-mining hardware for this -- alas, a different set of problems.


ASIC stands for Application Specific Integrated Circuit. So by definition, they cannot be repurposed.


A 94x cost improvement over GPUs and 15x over TPUs is insane, but fits right in with the performance gains seen under Moore's Law.

This development presents a more compelling case that we are in fact on the precipice of larger LLMs being able to serve everyone for cheap. Still not really convinced by the AGI argument, but this does spook me. Overall though very cool.


It's insane because it is theoretical. They haven't shown that it works; think of this paper as a prelude to a funding round or research grant, so they have to show some kind of advantage. Which I'm highly skeptical of: usually when papers show this kind of improvement over SOTA it tends to be either a mistake or purposeful nonsense.


>prelude to a funding round or research grant

A group at my school recently got a 10MM grant for such a fantasy. All they had was an ISA: no RTL, no functional model, no compiler. A kid in my group (co-advised) is busy scrawling assembly on notebook paper, lol. Suffice it to say I don't have high hopes for a tapeout anytime soon.


Yeah, a preliminary architectural study to sanity check if an idea could potentially pay off.


Just skimmed the paper. Seems to me like this paper wants to optimize transformer inference e2e, i.e. from ASIC level all the way to cloud.

I'm not exactly convinced though, since all the results seem to be purely theoretical or simulated. I would've liked to see a prototype built across several FPGAs with clock speeds extrapolated for ASICs.


I think FPGAs would make an awesome prototype, but maybe too constricting in terms of resources? The extrapolation might be so far out that it's only as accurate as their simulated model...


It seems fine to say "others have proved that this math makes a good LLM, we have designed an ASIC that can do this math fast, therefore we can make a good fast LLM"


Yes, but saying that shouldn't be mistaken for "we can make an ASIC that runs some model fast". There's a wide implementation void between the two.


Yep, it's a research paper in comp arch, the initial proof-of-concept study before you go and spend real money on it.


How much have they optimized the software here? Is it tinygrad level optimization?

Also does this lower total cost depend on SRAM being available for DRAM prices?

What makes SRAM so much more expensive than DRAM?


SRAM uses multiple transistors per bit and takes up a lot more die area than DRAM, so it is inherently more expensive. The advantage is that it is fast and doesn't need refreshing like DRAM does. You can also put it on the same die as your computation logic, which is technically possible with DRAM too but kinda silly, since you need an optimized DRAM process to get the best out of it, and that process is then quite bad for high-speed logic.


Do you think the price of including SRAM might be reduced somewhat if the big foundries optimize for including lots of SRAM in these types of ASICs?


Unlikely. SRAM is already used heavily in ASICs today and lots of R&D goes, and has gone, into optimizing it already.


The problem with a hardware solution is lack of flexibility. LLMs are not (yet) an established enough technology to warrant fixed in-silicon solutions, compared to, say, GPUs.


And what about the performance on GPT-4 and Falcon 40B?

If the design cannot serve models at this level, there will be no economic interest.

And a comparison with Jim Keller's Tenstorrent AICloud?


Can anyone share how much SRAM their proposed ASIC actually has? I skimmed the paper but that number didn't jump out to me.


~200MB to 1GB per ASIC, from Table 2 on page 10.
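
So, purely as an illustration (my arithmetic, not the paper's), holding a GPT-3-class model at 8-bit weights would need on the order of:

    model_gb = 175                      # GPT-3 at 8 bits per parameter, weights only
    for sram_gb in (0.2, 1.0):          # per-chiplet SRAM range from Table 2
        print(model_gb / sram_gb)       # ~875 down to ~175 chiplets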


... Isn't this basically the Cerebras WSE-2? Each "die" has 40GB of SRAM, and they have a fast interconnect.


This seems more achievable and cost-effective than Cerebras. Some comments mention Cerebras costs millions for each 'die'.


Yes, that's because Cerebras's "chip" is actually an entire wafer that many GPUs would normally be carved out of.

The extra stuff TSMC must do to pull that off is probably expensive... But I can't imagine it being, say, 10x more expensive than a wafer full of reticle-sized dies (like Nvidia does). And that's setting aside the massive IO advantage of Cerebras's mega die.


Was expecting something else I suppose. Is this kind of stuff for potential future investors?


Yep, pretty unique system. I imagine anything that runs LLMs better than we currently can would pique the interest of VCs.


Nice, soon I will have my pocket best friend.



