At the end of the day, my question is simply: why does anyone care about the drama over this, one way or another?
Either the research is as much of a breakthrough as claimed, and Google is about to pull way ahead of all these other "idiots" who can't replicate the method even when it is described to them in detail; or the research is flawed and overblown, and not as effective as claimed. This seems like exactly the sort of question the market will decide over the next couple of years, and not one worth arguing over.
Why do a non-zero number of people hold seemingly religious beliefs about this topic, on one side or the other?
The reason Jeff Dean cares is that his team's improvement over standard EDA tools was marginal at best, and may have been overfitted to a certain class of chips. He is defending the research because it is not widely accepted. The open-source code has been out for years, and in that time the EDA companies have largely built their own ML-based approaches that do not match his. He attributes this not to failings in his own research but to detractors at these companies not giving it a fair chance.
The guys at EDA companies care because Google's result, taken at face value, makes them look like idiots, even though it advances the state of the art only a bit. They have been working hard for marginal improvements, and the idea that some team of ML people can come in and make a big splash with something like this is offensive to them. Furthermore, the result is not that impressive and does not generalize enough to be useful to them (and competent teams at these companies have absolutely checked).
The fact that the result is so minor is the reason that this is so contentious.
The result is minor AND Google spent a (relatively) large amount of money to achieve it (especially in the eyes of the new CFO). Jeff Dean is desperately trying to salvage the prestige of the research (in a very insular, Google-y way) because he wants to preserve the 2017-era, economically-not-viable, blue-sky culture in which TensorFlow and the TPU flourished and the transformer was born. But the reality is that Google's core businesses are under attack (antitrust, Jedi Blue, etc.), the TPU now has zero chance versus NVIDIA, and Google is literally no longer growing ads. His funding is about to pop in the next 1-2 years.
What makes you say the TPU has zero chance against a growing NVIDIA?
If anything, now is the best time for the TPU to grow, and I'd say investing in the TPU gave Google an edge. Every other large-scale LLM has been trained on NVIDIA GPUs; Gemini is the only exception. Every big company is scrambling to build its own hardware in the AI era, while Google already has it.
Everyone I know who has worked with TPUs loves how well they scale. Sure, JAX has a learning curve, but it's not a problem, especially given the performance advantages it brings.
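To make "scales well" concrete: in the pmap-style programming model, one function runs on every core, and cross-device communication is a single collective. A minimal sketch (toy model, invented names; not anyone's production code) that runs on however many TPU cores, or CPUs, JAX can see:

    from functools import partial
    import jax
    import jax.numpy as jnp

    def loss(params, x, y):
        # Toy least-squares model standing in for a real network.
        return jnp.mean((x @ params - y) ** 2)

    @partial(jax.pmap, axis_name="devices")
    def train_step(params, x, y):
        grads = jax.grad(loss)(params, x, y)
        # One all-reduce averages gradients across every device.
        grads = jax.lax.pmean(grads, axis_name="devices")
        return params - 0.1 * grads

    n = jax.local_device_count()  # e.g. 8 TPU cores, or 1 on a laptop
    params = jax.device_put_replicated(jnp.zeros(4), jax.local_devices())
    x = jnp.ones((n, 32, 4))      # one shard of 32 examples per device
    y = jnp.ones((n, 32))
    params = train_step(params, x, y)  # all devices step in lockstep

Roughly the same code runs from one core up to a pod slice (multi-host setups need a bit more ceremony), which is most of what people mean by the scaling story.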
Besides the many CAPEX-vs-OPEX tradeoffs that are simply unavailable because you cannot buy physical TPU pods, there are inherent Google-y risks: the TPU product and/or its support getting killed, fragmented, or deprecated (very, very common with Google); your data and traffic being locked in to Google's pricing; and having to indefinitely put up with, and negotiate with, Google Cloud people (in my experience at multiple companies: the worst customer support ever).
Google does indeed lock in their own ROI by deciding not to compete with AMD, Graphcore, etc., but that also rooflines their total market. If they were to come up with a compelling Android-based, Jetson-like edge product, and if demand for that product eclipsed total GPU demand (a robotics explosion?), then they might have a ramp to compete with NVIDIA. But the USB TPUs and phone accelerators of today are just toys. And toys go to the Google graveyard, because Googlers don't build gardens; they treat everything like toys and throw them away when they get bored.
> Why do a non-zero number of people hold seemingly religious beliefs about this topic, on one side or the other?
Because lots of engineers are being asked by managers, "Why aren't we using that tool?" and a bunch of engineers are stuck saying "Because it doesn't actually work," a.k.a. "Google is lying through their teeth," to which the response is "Oh, so you know better than Google?" to which the response is "Yeah, actually, I fucking do. Now piss off and let me finish timing closure on this goddamn block that is already 6 weeks late."
Now can you understand why this is a bit contentious?
Marketing "exaggerations" from authority can cause huge amounts of grief.
In my little corner of the world, I had to defend against the lies that a startup with famous designers was putting out about power consumption while we were designing similar chips in the same space. I had to go toe to toe with Senior VPs over it, and I had to stand my ground and defend my team, who had analyzed things dead-on. All this in spite of the fact that the startup had no silicon. On top of that, I knew the famous designers involved would happily lie straight to your face: I had worked with them before, been lied to straight to my face, and had to clean up the mess when they left the company.
To be fair, it is also the only time I have had a Senior VP remember the kerfuffle and apologize: when said startup finally delivered silicon, not only were the real numbers not what they had claimed, they weren't even close to the ones we were getting.
And do you believe that that is what's happening in this case?
If you have personal experience with Jeff Dean et al that you're willing to share, I'd be interested in hearing about it.
From where I'm sitting it looks like, "Google spent a fortune on deep learning, and got a small but real win. People who don't like Google failed to follow Google's recipe and got a large and easily replicated loss."
It's not even clear that Google's approach is feasible right now for companies not named Google. It is not clear that it works on other classes of chip. It is not clear that the technique will grow beyond what Google already got. It is really not clear that anyone should be jumping on this.
But there is a world of difference between that, and concluding that Google is lying.
> From where I'm sitting it looks like, "Google spent a fortune on deep learning, and got a small but real win. People who don't like Google failed to follow Google's recipe and got a large and easily replicated loss."
From where I'm sitting, it looks like Google cooked the books maximally, barely beat humans (let alone state-of-the-art algorithms), published a crappy article in Nature because it would never have passed editorial muster at something like DAC or an IEEE journal, and now has to browbeat the people who are calling them out on it.
And that's the best interpretation we can cough up.
I'll go further: we don't even have any raw data showing that they actually beat the humans. Some of the humans I know who run P&R are REALLY good at what they do. The data could be completely made up. Given how much scientific fraud has come out lately, I'm amazed at the number of people defending Google on this.
Where I'm from, we call what Google is doing both "lying" and "bullying".
Look, Google can easily defuse this in all manner of ways. Publish their raw data. Run things on testbenches and benchmarks that the EDA tools vendors have been running on for years. Run things on the open source VLSI designs that they sponsored.
What I suspect happened is that Google's AI group has gotten used to making hyperbolic marketing claims that are difficult to verify. They poked at place and route, failed, and published an article anyway because someone's promotion was tied to it. They expected that everybody would swallow their glop just like every other time, that it would be mostly ignored, and that the people involved could get their promotions and move on.
Unfortunately, Google is shoveling bullshit around something that has objective answers; real money is at stake; and they're getting rightfully excoriated for it.
Look, either the follow-up article did pretraining or it didn't. Jeff Dean is claiming that the Nature paper mentioned the importance of pretraining 37 times and the follow-up still didn't do it. That sounds easy to verify.
Likewise, the claim of spending 20x as much compute on the training portion seems easy to verify, and significant.
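To spell out the methodological dispute, here is a hypothetical sketch (all names and numbers invented for illustration; this is not the AlphaChip code or the Cheng et al. setup) of the two protocols being compared, where, per Dean's complaint, the reproduction skipped the pretraining loop and used a fraction of the training budget:

    # Hypothetical sketch of the disputed protocols; the train() stub
    # just records what was done, standing in for real RL training.

    def train(policy, design, steps):
        return policy + [(design, steps)]

    def nature_paper_protocol(target, prior_designs, steps=20_000):
        policy = []                          # fresh policy network
        for d in prior_designs:              # pretrain on earlier blocks
            policy = train(policy, d, steps)
        return train(policy, target, steps)  # then fine-tune on target

    def reproduction_protocol(target, steps=1_000):
        policy = []                          # no pretraining corpus, and
        return train(policy, target, steps)  # ~20x less training, per Dean

    print(nature_paper_protocol("new_block", ["prior_block_a", "prior_block_b"]))
    print(reproduction_protocol("new_block"))

Whether the second protocol is a fair test of the first is exactly the kind of thing that should be checkable from the papers and the released code.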
That they would fail to properly test against industry-standard testbenches seems reasonable to me. This is a bunch of ML specialists who know nothing about chip design; their background is beating everyone at Go and setting a new state of the art for protein folding, not chip design. If you dismiss those particular past accomplishments as hyperbolic marketing, that's your decision, but you aren't going to find a lot of people in these parts who agree with you.
If you think that those were real, but that a bunch of more recent accomplishments are BS, I haven't been following closely enough to have an opinion. The stuff that crossed my radar since AlphaFold is mostly done at places like OpenAI, and not Google.
Regardless, the truth will out. And what Google is claiming for itself here really isn't all that impressive.
Reading those papers and looking at the code, it doesn't look easy. But let's imagine that the Cheng et al. team comes back with pretraining results a few months from now, and that those results support the conclusions of their earlier paper. What should they do to help everyone reach a conclusion?
"If Cheng et al. had reached out to the corresponding authors of the Nature paper, we would have gladly helped them to correct these issues prior to publication" (https://arxiv.org/pdf/2411.10053)
That's how you actually do a reproduction study: you reach out to the corresponding authors and make sure you do everything exactly the same. But at this point, it's hard to imagine the AlphaChip folks having much patience with them.
> published a crappy article in Nature because it would never have passed editorial muster at something like DAC or an IEEE journal, and now has to browbeat the people who are calling them out on it.
I don't think it's easier to get into DAC / an IEEE journal than Nature.
Their human baseline was the TPU physical design team, with access to the best available tools (rdcu.be/cmedX), and this is still the baseline to beat in order to get used in production, which has happened for multiple generations of TPUs.
The TPU is export-controlled and super confidential (multi-billion-dollar IP!), so I don't see the raw data coming out anytime soon.
Nature papers get retracted every year. I have not heard of DAC papers being retracted.
If the Nature paper had made it clear that RL is not seriously expected to work on non-TPU chips, it would probably have been rejected. If RL works on many other chips, then evidence should be easy to publish.
When Google published the Nature article, Nature included a rosy intro article by a leading expert in chip design. His name was Andrew Kahng, and he apparently liked Google's work at the time. But when he dug into Google's code (released well after publication), he retracted his intro and co-authored the Cheng et al. article. You see how your theory breaks down here.
As Andrew Kahng was one of the co-authors of Cheng et al., all of the issues with his reproduction still matter here. And the Nature paper went through an investigation and a second round of peer review.
AlphaChip is used to make real chips in production. Google publicly announced its use in multiple generations of TPUs and Axion CPUs, and MediaTek said they've built on it as well.
"Pulling way ahead" sounds sufficient, not necessary. Can we prove it's not the case? Say someone claims that's why Gemini inference is so cheap. Can we show that's wrong?