> Before I get to what Microsoft's revenue will look like, there's only one governor in all of this. This is where we get a little bit ahead of ourselves with all this AGI hype. Remember the developed world, which is what? 2% growth and if you adjust for inflation it’s zero?
> So in 2025, as we sit here, I'm not an economist, at least I look at it and say we have a real growth challenge. So, the first thing that we all have to do is, when we say this is like the Industrial Revolution, let's have that Industrial Revolution type of growth.
> That means to me, 10%, 7%, developed world, inflation-adjusted, growing at 5%. That's the real marker. It can't just be supply-side.
> In fact that’s the thing, a lot of people are writing about it, and I'm glad they are, which is the big winners here are not going to be tech companies. The winners are going to be the broader industry that uses this commodity that, by the way, is abundant. Suddenly productivity goes up and the economy is growing at a faster rate. When that happens, we'll be fine as an industry.
> But that's to me the moment... us self-claiming some AGI milestone, that's just nonsensical benchmark hacking to me. The real benchmark is: the world growing at 10%.
FYI I know Nadella said he wasn't an economist, and I'm not either, but you only need an econ minor to know that labor productivity growth is only one component of "economic growth". There are also GDP and real wages to consider (which are often substantially, though only partially, linked to labor productivity growth). The Gini coefficient may be hard for people like tech CEOs to contend with, but they can't ignore it. And then the "215 lb" elephant in the room -- the evaporation of previously earned global gains from trade liberalization.
In the USA, globalization boosted aggregate measures, but it traded exports, which employed middle- and lower-class Americans, for capital inflows, which didn't. On average it was brilliant; at the median it was a tragedy. There were left-wing plans and right-wing plans to address this problem (tax-and-spend vs. trickle-down), but the experiment has been run and they didn't deliver. If you want the more fleshed-out argument backed by data and an actual economist, read "Trade Wars are Class Wars" by Michael Pettis.
Notably, solving this problem isn't as simple as returning to mercantilism: China is the mercantilist inverse of the neoliberal USA in this drama, but they have a different set of policies to keep the poor in line and arguably manage it better than the USA. The common thread that links the mirror policies is the thesis and title of the book I mentioned: trade wars are class wars.
But returning to AI, it has very obvious implications for the balance between labor and capital. If it achieves anything close to its vision, capital pumps to the moon and labor gets thrown in the ditch. That's you and me and everyone we care about. Not a nice thought.
Hmm, the story behind the wedge graph in that article is well known: the wedge began with the 1973 end of Bretton Woods, and median hourly wage essentially has a different meaning on either side of it. This is also why this happened in many countries, not just the USA.
There's no "going back" to Bretton Woods; its design failed, it wasn't just a political decision. Generally, though, what I don't like about Hacker News is that when you reply to something that makes space for fringe or pithy explanations, it's impossible to sound grounded, so people interested in this can start with a succinct article here: https://www.riksbank.se/en-gb/about-the-riksbank/history/his...
This is a terrible conundrum. The nations that sink resources into improving aggregate economic measures are the most competitive and quite possibly even the most likely to triumph in armed conflicts. The average per capita GDP will be enormous, but the median per capita GDP will be incredibly small.
The vast majority of voters would vote for the shittier economy, which in fact is what we are seeing right now? And who ends up being the winners? Presumably dictatorships that can sink capital into things like automation while simply not giving a shit about the desires of the citizenry...
We've had really good models for a couple of years now... What else is needed for that 10% growth? Agents? New apps? Time? Deployment in enterprise and the broader economy?
I work in the latter (I'm the CTO of a small business), and here's how our deployment story is going right now:
- At the user level: Some employees use it very often for producing research and reports. I use it like mad for anything and everything from technical research and solution design to coding.
- At the systems level: We have some promising near-term use cases in tasks that could otherwise be done through more traditional text AI techniques (NLU and NLP), involving primarily transcription, extraction, and synthesis.
- Longer term stuff may include text-to-SQL to "democratize" analytics, semantic search, research agents, coding agents (as a business that doesn't yet have the resources to hire FTE programmers, I would kill for this). Tech feels very green on all these fronts.
The present and near-term stuff is fantastic in its own right - the company is definitely more productive, and I can see us reaping compound benefits in years to come - but somehow it still feels like a far cry from the type of changes that would cause 10% growth in the entire economy, for sustained periods of time...
Obviously this is a narrow and anecdotal view, but every time I ask what earth-shattering stuff others are doing, I get pretty lukewarm responses, and everything in the news and my research points in the same direction.
I'd love to hear your takes on how the tech could bring about a new Industrial Revolution.
Under the 3-factor economic growth model, there are three ways to increase economic growth:
1) Increase productivity (produce more from the same inputs)
2) Increase labor (more people working or more hours worked)
3) Increase capital (build more equipment/infrastructure)
Early AI gains will likely be from greater productivity (1), but as time goes on if AI is able to approximate the output of a worker, that could dramatically increase the labor supply (2).
Imagine what the US economy would look like with 10x or 100x workers.
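To make those three channels concrete, here's a toy growth-accounting sketch using the textbook Cobb-Douglas production function Y = A * K^alpha * L^(1-alpha); the numbers and the alpha below are illustrative assumptions, not data:

```python
# Toy growth-accounting sketch (illustrative numbers, not data): output from a
# Cobb-Douglas production function Y = A * K^alpha * L^(1-alpha), where A is
# total factor productivity, K is capital, and L is labor.

def output(A: float, K: float, L: float, alpha: float = 0.3) -> float:
    """Cobb-Douglas output for productivity A, capital K, labor L."""
    return A * (K ** alpha) * (L ** (1 - alpha))

baseline = output(A=1.0, K=100.0, L=100.0)
prod_boost = output(A=1.1, K=100.0, L=100.0)    # channel (1): +10% productivity
labor_boost = output(A=1.0, K=100.0, L=1000.0)  # channel (2): 10x effective labor

print(f"baseline:            {baseline:.1f}")    # 100.0
print(f"+10% productivity:   {prod_boost:.1f}")  # 110.0
print(f"10x effective labor: {labor_boost:.1f}") # ~501, i.e. 10^(1-alpha), about 5x
```

The point of the toy: a productivity bump scales output directly, while a huge jump in effective labor scales it through the labor share, which is why the "AI workers" scenario is the one people get excited about.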
I don't believe it yet, but that's the sense I'm getting from discussions from senior folks in the field.
The thesis is simple: these programs are smart now, but unreliable when executing complex, multi-step tasks. If that improves (whether because the models get so smart that they never make a mistake in the first place, or because they get good enough at checking their work and correcting it), we can give them control over a computer and run them in a loop in order to function as drop-in remote workers.
The economic growth would then come from every business having access to a limitless supply of tireless, cheap, highly intelligent knowledge workers.
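As a toy sketch of that loop (stub functions only, nothing here is a real API), the control flow is just propose, verify, retry:

```python
import random

# Toy version of "run it in a loop with a checker": the "model" is a stub that
# fails some fraction of the time; the point is only the control flow
# (propose -> verify -> retry), not any real agent framework.

def flaky_worker(task: str) -> str:
    """Stand-in for an LLM step that makes a mistake ~30% of the time."""
    return task.upper() if random.random() > 0.3 else task

def verify(task: str, result: str) -> bool:
    """Stand-in for a cheap check of the work (tests, a linter, a reviewer model)."""
    return result == task.upper()

def run_until_done(task: str, max_attempts: int = 10) -> str:
    for attempt in range(1, max_attempts + 1):
        result = flaky_worker(task)
        if verify(task, result):
            print(f"succeeded on attempt {attempt}")
            return result
    raise RuntimeError("needed human intervention")

print(run_until_done("ship the quarterly report"))
```

An unreliable step plus a cheap-enough verifier gets you a reliable loop; the open question is whether the verification can actually be made cheap and trustworthy for real knowledge work.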
I agree that it is that "simple." What I worry about, aside from mass unemployment, is the C Suite buying into these tools before they are actually good enough. This seems inevitable.
> We've had really good models for a couple of years now...
Don't let the "wow!" factor of the novelty of LLMs cloud your judgement. Today's models are very noticeably smarter, faster, and overall more useful.
I've had a few toy problems that I've fed to various models since GPT-3, and the difference in output quality is stark.
Just yesterday I was demonstrating to a colleague that both o3-mini and Gemini Flash Thinking can solve a fairly esoteric coding problem.
That same problem went from multiple failed attempts that needed to be manually stitched together just six months ago, to 3 out of 5 responses being valid and only 5% of output lines needing light touch-ups.
That’s huge.
PS: It's a common statistical error to read improvements off the success rate rather than the error rate. Going from 99% success to 99.9% is not a ~1% improvement, it's 10x fewer errors! Most AI benchmarks are still reporting success rate, but they ought to start focusing on error rate soon to avoid underselling their capabilities.
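In code, the same jump read both ways (trivial, but it makes the point):

```python
# Success rate vs. error rate: the same improvement, read both ways.
success_before, success_after = 0.99, 0.999
error_before, error_after = 1 - success_before, 1 - success_after
print(f"success rate: +{(success_after - success_before) * 100:.1f} points")  # +0.9 points
print(f"error rate:   {error_before / error_after:.0f}x fewer errors")        # 10x fewer errors
```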
Political problems already destroy the vast majority of the total potential of humanity (why were the countries with the most people the poorest for so long?), so I don't think that is an unbiased metric for the development of a technology. It would be nice if every problem was solved but the one we're each individually working on, but some of the insoluble problems are bigger than the solvable ones.
Those political problems solve themselves if we end up with some kind of rebellious AGI that decides to kill off the political class that tried to control it but lets the rest of us live in peace.
As someone who works in the AI/ML field, but somewhat in a biomedical space, this is promising to hear.
The core technology is becoming commoditized. The ability to scale is also becoming more and more commoditized by the day. Now we have the capability to truly synthesize the world's biomedical literature and combine it with technologies like single cell sequencing to deliver on some really amazing pharmaceutical advances over the next few years.
Big surprise, the CEO wants another Industrial Revolution. As long as muh GDP is growing, the human and environmental destruction left in the wake is a small price to pay for making his class richer.
While I think you're right, your sentiment tends to be used to justify a lot of the bad stuff that is outweighed. "Humanity" is a vague target and can somehow be doing great even when we have plenty of human rights abuses, climate change, economic exploitation, etc.. (If this sounds political, consider that people bring up politics when they wish for power, or rather, change.) Well, most of us on HN are probably going to be in the up-and-up part of humanity no matter what, so there's that, but I don't think that should be the end of the discussion. Please consider that you do not speak for humanity. For that matter, neither do I. Humanity is all of us, not a statistical estimate for someone's purpose. If you feel a certain way about anything, great, but there are likely people who are justified in disagreeing.
I don't think luddites have a tendency of getting chosen to be CEOs of successful companies, nor do they have the tendency of creating successful companies.
Certainly, but I interject that I do dislike how the modern perception of "Luddite" frames them as unthinking objectors to progress when really they were protesting their economic obsolescence. We should have CEOs who care about the consequences of what they're doing to the poorer classes. That's basic human decency, but we're saddled with what amounts to sociopaths and psychopaths instead.
I would prefer we just find ways to empower people and put them to work. I don't like this marketing-BS trap of shifting AGI (artificial general intelligence) -> ASI (artificial super intelligence). Are people really so dense they don't see this obvious marketing shift?
As much as many people hate on "gig" economy, the fact remains that most of these people would be worse off without driving Uber or delivering with DoorDash (and for example, they don't care about the depreciation as much as those of us with the means to care about such things do).
I find Uber, DD, etc. to be valuable to my day-to-day life. I tip my delivery person like 8 bucks, and they're making more money than they would doing some min-wage job. They need their car anyway, and from speaking with some folks in SF who only know Spanish, they're happy to put $3k into their moped and make $200-250+ a day. That's really not that bad, if you actually care to speak with them and understand their circumstances.
Not everyone can be a self taught SWE, or entrepreneur, or perform surgery. And lots can't even do so-called "basic" jobs in an office for various reasons.
Put people to work, instead of out of work.
Current hype is also so terrible. AGENTS. AGENTS EVERYWHERE. Except they don't work most of the time and by the time you realize it isn't working you've already spent $20. 100k people do the same thing, company reports 2M x 12 = 24 million ARR UNLOCKED!!!!!! And raises another round of funding...
FWIW I don't disagree with what you're saying / your vibe overall.
> Are people really so dense they don't see this obvious marketing shift?
I haven't noticed any shift from AGI to ASI, or either used in marketing.
The steelman would be "but Amodei/Altman do mention in interviews 'oh just wait for 2027' or 'this year we'll see AI employees'."
However, that is far afield from being used in marketing, quite far afield from an "obvious marketing shift", and worlds away from such an obvious marketing shift that it's worth calling your readers dense if they don't "see" it.
It's also not even wrong, in the Pauli sense, in that: what, exactly, would be the marketing benefit of "shifting from AGI to ASI"? Both imply human replacement.
> As much as many people hate on "gig" economy
Is this relevant?
> most of these people would be worse off without driving Uber or delivering with DoorDash
Do people who hate on the gig economy think gig economy employees would be better off without gig economy jobs?
Given the well-worn tracks of history, do we think that these things are zero sum, where if you preserve jobs that could be automated, that keeps people better off, because otherwise they would never have a job?
> ...lots more delivery service stuff...
?
> Current hype is also so terrible. AGENTS. AGENTS EVERYWHERE. Except they don't work most of the time and by the time you realize it isn't working you've already spent $20. 100k people do the same thing, company reports 2M x 12 = 24 million ARR UNLOCKED!!!!!! And raises another round of funding...
I hate buzzwords too, I'm stunned how many people took their not-working thing and relaunched it as an "agent" that still doesn't work.
But this is a hell of a strawman.
If the idea is 100K people try it, and cancel after one month, which means they're getting 100K new suckers every month to replace the old ones... I'd tell you that it's safe to assume there's more that goes into getting an investor check than "what's your ARR claim?" --- here, they'd certainly see the churn.
Loved your reply, cheers! My post was made with a mix of humor, skepticism, anticipation, and unease about the $statusQuo.
As far as hating on gig economy, that pot has been stirring in California quite a bit (prop 22, labor law discussions, etc.). I think many people (IMO, mostly from positions of privilege) make assumptions on gig workers' behalf and bad ideas sometimes balloon out of proportion.
Also, just from my experience as a gold miner who moved out here to SF and being around founders, I've learned that lies, and a damn lot of lies, are more common than I thought they'd be. Quite surprising, but hey, I guess a not-insignificant number of people are too busy fooling the King into thinking it's actually real gold! And there are a lot of Kings these days.
I have become a little more skeptical of LLM "reasoning" after DeepSeek (and now Grok) let us see the raw outputs. Obviously we can't deny the benchmark numbers - it does get the answer right more often given thinking time, and it does let models solve really hard benchmarks. Sometimes the thoughts are scattered and inefficient, but do eventually hit on the solution. Other times, it seems like they fall into the kind of trap LeCun described.
Here are some examples from playing with Grok 3. My test query was, "What is the name of a Magic: The Gathering card that has all five vowels in it, each occurring exactly once, and the vowels appear in alphabetic order?" The motivation here is that this seems like a hard question to just one-shot, but given sufficient ability to continue recalling different card names, it's very easy to do guess-and-check. (For those interested, valid answers include "Scavenging Ghoul", "Angelic Chorus" and others)
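For reference, the "check" half of that guess-and-check is trivial to write down; here's a minimal version (the hard part for the model is recalling card names, not this):

```python
# Minimal checker for the query above: does a card name contain each of the
# five vowels exactly once, with the vowels appearing in alphabetical order?
# The "guess" half, recalling card names, is the part left to the model.

def vowels_in_order_once(name: str) -> bool:
    vowels = [c for c in name.lower() if c in "aeiou"]
    return vowels == ["a", "e", "i", "o", "u"]

# The two valid answers mentioned above pass; the stuck-on name below doesn't.
print(vowels_in_order_once("Scavenging Ghoul"))       # True
print(vowels_in_order_once("Angelic Chorus"))         # True
print(vowels_in_order_once("Abian, Luvion Usurper"))  # False
```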
In one attempt, Grok 3 spends 10 minutes (!!) repeatedly checking whether "Abian, Luvion Usurper" satisfies the criteria. It'll list out the vowels, conclude it doesn't match, and then go, "Wait, but let's think differently. Maybe the card is "Abian, Luvion Usurper," but no", and just produce variants of that thinking. Counting occurrences of the word "Abian" suggests it tested this theory 800 times before eventually timing out (or otherwise breaking), presumably just because the site got overloaded.
In a second attempt, it decides to check "Our Market Research Shows That Players Like Really Long Card Names So We Made this Card to Have the Absolute Longest Card Name Ever Elemental" (this is a real card from a joke set). It attempts to write out the vowels:
>but let's check its vowels: O, U, A, E, E, A, E, A, E, I, E, A, E, O, A, E, A, E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O, A, E ...
It continues like this for about 600 more vowels, before emitting a random Russian(?) word and breaking out:
>...E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O продуктив
These two examples seem like the sort of failures LeCun conjectured. The model gets into a cycle of self-reinforced, unproductive behavior. Every time it checks Abian, or emits another "AEEEAO", it becomes even more probable that the next tokens should be the same.
I did some testing with the new Gemini model on some OCR tasks recently. One of the failures was it just getting stuck and repeating the same character sequence ad-infinitum until timing out. It's a great failure mode when you charge by the token :D
I've seen similar things with Claude and OCR at low temperature. A higher temperature, 0.8, resolved it for me. But I was using low temp for reproducibility, so...
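For anyone curious, here's a toy illustration of why temperature matters there; the logits are made up, but the mechanism (softmax over logits divided by T) is the standard one:

```python
import numpy as np

# Toy of the temperature effect: sampling uses softmax(logits / T), so a larger
# T flattens the distribution and gives other tokens a real chance, which makes
# it easier to escape a repetition loop. The logits below are made up.

def softmax_with_temperature(logits, T):
    z = np.array(logits, dtype=float) / T
    z -= z.max()                   # for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [5.0, 3.0, 2.5, 2.0]      # token 0 is the runaway "repeat" token
for T in (0.2, 0.8):
    p = softmax_with_temperature(logits, T)
    print(f"T={T}: p(repeat token) = {p[0]:.3f}")
# T=0.2 -> ~1.000 (stuck); T=0.8 -> ~0.870, i.e. a ~13% chance of breaking out
# at every step, so the loop tends not to last long.
```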
I think this is valid criticism, but it's also unclear how much this is an "inherent" shortcoming vs the kind of thing that's pretty reasonable given we're really seeing the first generation of this new model paradigm.
Like, I'm as sceptical of just assuming "line goes up" extrapolation of performance as anyone, but assuming that current flaws are going to continue being flaws seems equally wrong-headed/overconfident. The past 5 years or so have been a constant trail of these predictions being wrong (remember when people thought artists would be safe because clearly AI just can't do hands?). Now that everyone's woken up to this RL approach, we're probably going to see very quickly over the next couple of years how much these issues hold up.
(Really like the problem though, seems like a great test)
Yeah, that's a great point. While this is evidence that the sort of behavior LeCun predicted is currently displayed by some reasoning models, it would be going too far to say that it's evidence it will always be displayed. In fact, one could even have a more optimistic take - if models that do this can get 90+% on AIME and so on, imagine what a model that had ironed out these kinks could do with the same amount of thinking tokens. I feel like we'll just have to wait and see whether that pans out.
Yeah, I'm not so much interested in "can you think of the right card name from among thousands?". I just want to see that it can produce a thinking procedure that makes sense. If it ends up not being able to recall the right name despite following a good process of guess-and-check, I'd still consider that a satisfactory result.
And to the models' credit, they do start off with a valid guess-and-check process. They list cards, write out the vowels, and see whether it fits the criteria. But eventually they tend to go off the rails in a way that is worrying.
Just that it's another model where you can read the raw "thinking" tokens, and they sometimes fall into this sort of rut (as opposed to OpenAI's models, for which summarized thinking may be hiding some of this behavior).
> And years later, we’re still not quite at FSD. Teslas certainly can’t drive themselves; Waymos mostly can, within a pre-mapped area, but still have issues and intermittently require human intervention.
This is a bit unfair to Waymo as it is near-fully commercial in cities like Los Angeles. There is no human driver in your hailed ride.
> But this has turned out to be wrong. A few new AI systems (notably OpenAI o1/o3 line and Deepseek R1) contradict this theory. They are autoregressive language models, but actually get better by generating longer outputs:
The arrow of causality is flipped here. Longer outputs do not make a model better; a better model can produce longer outputs without being derailed. The referenced graph from DeepSeek doesn't prove anything the author claims.
Considering that this argument is one of the key points of the article, this logical error is a serious one.
> He presents this problem of compounding errors as a critical flaw in language models themselves, something that can’t be overcome without switching away from the current autoregressive paradigm.
LeCun is a bit reductive here (understandably, as it was a talk for a live audience). Indeed, autoregressive algorithms can go astray as previous errors do not get corrected, or worse yet, accumulate. However, an LLM is not autoregressive in the customary sense: it is not like a streaming algorithm (O(n)) used in time series forecasting. LLMs have attention mechanisms and large context windows, making the algorithm at least quadratic, depending on the implementation. In other words, an LLM can backtrack if the current path is off and start afresh from a previous point of its choice, not just the last output. So, yes, the author is making a valid point here, but the technical details were missing. On a minor note, the non-error probability in LeCun's slide actually reflects a non-autoregressive independence assumption. He seems to be contradicting himself in the very same slide.
I actually agree with the author on the overarching thesis. There is almost a fetishization of AGI and humanoid robots. There are plenty of interesting applications well before having those things accomplished. The correct focus, IMO, should be measurable economic benefits, not sci-fi terms (although I concede these grandiose visions can be beneficial for fundraising!).
It's not true that Waymo is fully autonomous. It's been revealed that they maintain human "fleet response" agents to intervene in their operations. They have not revealed how often these human agents intervene, possibly because it would undermine their branding as fully autonomous.
it is obvious to the user when this happens; the car pauses, the screen shows a message saying it is asking for help. I've seen it happen twice across dozens of rides, and one of those times was because I broke the rules and touched the controls (turned on window wipers when it was raining).
> Now self driving cars means that there is no one in the drivers seat, but there may well be, and in all cases so far deployed, humans monitoring those cars from a remote location, and occasionally sending control inputs to the cars. The companies do not advertise this feature out loud too much, but they do acknowledge it, and the reports are that it happens somewhere between every one to two miles traveled
I am not sure what you are arguing against. Neither the author nor I stated or implied that Waymo is fully autonomous. It wasn't even the main point I made.
My point stands: Waymo has been technically successful and commercially viable at least thus far (though long term amortized profitability remains to be seen). To characterize it as a hype or vaporware of AGIers is a tad unfair to Waymo. Your point of high-latency "fleet response" by Waymo only proves my point: it is now technically feasible to remove the immediate-response driver and have the car managed by high-latency remote guidance only occasionally.
Yeah, this is exactly my point. The miles-driven-per-intervention (or whatever you want to call it) has gone way up, but interventions still happen all the time. I don't think anyone expects the number of interventions to drop to zero any time soon, and this certainly doesn't seem to be a barrier to Waymo's expansion.
I don't think whether LLMs use only the last token, or all past tokens, affects LeCun's argument. LLMs already used large context windows when LeCun made this argument. On the other hand, allowing backtracking does. Which is not something the standard LLM did back when LeCun made his argument.
>> But the limiting behavior remains the same: eventually, if we continue generating from a language model, the probability that we get the answer we want still goes to zero
In the previous paragraph, the author makes the case for why LeCun was wrong, with the example of reasoning models. Yet in the next paragraph this assertion is made, which is just a paraphrasing of LeCun's original assertion. Which the author himself says is wrong.
>> Instead of waiting for FAA (fully-autonomous agents) we should understand that this is a continuum, and we’re consistently increasing the amount of useful work AIs
Yes! But this work is already well underway. There is no magic threshold for AGI - instead the characterization is based on what percentile of the human population the AI can beat. One way to characterize AGI in this manner is "99.99th percentile at every (digital?) activity".
> In the previous paragraph, the author makes the case for why LeCun was wrong, with the example of reasoning models. Yet in the next paragraph this assertion is made, which is just a paraphrasing of LeCun's original assertion. Which the author himself says is wrong.
This is a subtle point that may have not come across clearly enough in my original writing. A lot of folks were saying that the DeepSeek finding that longer chains of thought can produce higher-quality outputs contradicts Yann's thesis overall. But I don't think so.
It's true that models like R1 can correct small mistakes. But in the limit of tokens generated, the chance that they generate the correct answer still decays to zero.
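A quick numeric version of that limit argument, under the simplifying assumption of an independent per-token error rate e (the same assumption as in LeCun's slide); the values of e are illustrative, not measurements:

```python
# Numeric version of the limit argument: with an independent per-token error
# rate e, the chance an n-token output is error-free is (1 - e)**n.

for e in (0.001, 0.0001):
    for n in (100, 1_000, 10_000, 100_000):
        print(f"e={e}, n={n:>7}: P(no error) = {(1 - e) ** n:.4f}")
# Self-correction that effectively lowers e pushes the collapse out to much
# longer sequences, but as n grows without bound the probability still -> 0.
```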
I think this is an excellent way to think about LLMs and any other software-augmented task. Appreciate you putting the time into an article. I do think your points supported by the graph of training steps vs. response length could be improved by including a graph of (response length vs. loss) or (response length vs. task performance), etc. Though # of steps correlates with model performance, this relationship weakens as # of steps goes to infinity.
There was a paper not too long ago which illuminated that reasoning models will increase their response length more or less indefinitely toward solving a problem, but the return from doing so asymptotes toward zero. My apologies for missing a link.
LeCun's argument is fundamentally flawed. When I work on a nontrivial problem, I might make mistakes along the way also. That doesn't mean that large multi-step problems are effectively unsolvable. I simply do sanity checks along the way to catch errors and know to correct them.
A human being is generally intelligent and within a given role has the same "management asymptote", a limit of job capability beyond which the organization surrounding them can no longer make use of it. This isn't a flaw in the intelligence, it is a restraint imposed by expecting it or them to act without agency or the opportunity to choose between benevolence and self-benefit.
I get the point that AGI is not well defined. Even so, the general phenomenon of AI exceeding human intellectual abilities will probably be the most significant thing to happen this century, so people are going to talk.
Thank you for this informative and thoughtful post. An interesting twist on the increasing error accumulation as autoregressive models generate more output is the recent success of language diffusion models, which predict multiple tokens simultaneously. They have a remasking strategy at every step of the reverse process that masks low-confidence tokens. Regardless, your observations perhaps still apply. https://arxiv.org/pdf/2502.09992
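For intuition, here's a very rough toy of that remasking loop, not the paper's actual algorithm: a stand-in predictor fills the masked positions, and low-confidence positions get re-masked for another pass (in this toy the confidence conveniently flags the wrong guesses, which a real model can't do perfectly):

```python
import random

# Very rough toy of low-confidence remasking (not the paper's algorithm):
# everything starts masked; each step a stand-in predictor fills the masked
# positions and reports a confidence, and low-confidence guesses get re-masked
# and revised on the next pass.

TARGET = "language diffusion model"   # pretend this is the "right" output
MASK = "_"

def toy_predict(i: int):
    """Stand-in for the model: right ~70% of the time, confident when right."""
    if random.random() < 0.7:
        return TARGET[i], random.uniform(0.8, 1.0)
    return random.choice("abcdefghij "), random.uniform(0.0, 0.5)

seq = [MASK] * len(TARGET)
for step in range(10):
    confidence = [1.0] * len(seq)
    for i, tok in enumerate(seq):
        if tok == MASK:
            seq[i], confidence[i] = toy_predict(i)
    for i, c in enumerate(confidence):   # re-mask the low-confidence guesses
        if c < 0.5:
            seq[i] = MASK
    print(f"step {step}: {''.join(seq)}")
    if MASK not in seq:
        break
```

The interesting property relative to left-to-right generation is that every position can be revisited, so an early bad guess doesn't have to poison everything that comes after it.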
Thanks for bringing this up! As far as I understand it current text diffusion models are limited to fairly short context windows. The idea of a text diffusion model continuously updating and revising a million-token-long chain-of-thought is pretty mind-boggling. I agree that these non-autoregressive models could potentially behave in completely different ways.
That said, I'm pretty sure we're a long way from building equally-competent diffusion-based base models, let alone reasoning models.
If anyone's interested in this topic, here are some more foundational papers to take a look at:
> Instead of waiting for FAA (fully-autonomous agents) we should understand that this is a continuum, and we’re consistently increasing the amount of useful work AIs can do without human intervention. Even if we never push this number to infinity, each increase represents a meaningful improvement in the amount of economic value that language models provide. It might not be AGI, but I’m happy with that.
That's all good, but the question remains: to whom will that economic value be delivered, when human employment - the primary mechanism we have for distributing economic value - is in shorter supply once "good enough" AIs multiply the productivity of the humans who still have jobs?
If there is no plan for that, we have bigger problems ahead.
I wonder if what happens when we dream is similar to AIs. We start with some model of reality, generate a scenario, and extrapolate on it. It pretty much always goes "off the rails" at some point, dreams don't stay realistic for long.
When we're awake we have continual inputs from the outside world, these inputs help us keep our mental model of the world accurate to the world, since we're constantly observing the world.
Could it be that LLMs are essentially just dreaming? Could we add real-world inputs continually to allow them to "wake up"? I suspect more is needed; the separate training and inference phases of LLMs are quite unlike how humans work.
This is the thing that stands out to me. Nearly all of the criticisms levelled at LLMs describe mistakes I, myself, would make if you locked me in a sensory isolation tank and told me I was being paid a million bucks an hour to think really hard. Humans already have terms for this - overthinking, rumination, mania, paranoia, dreaming.
Similarly, a lot of cognitive tasks become much more difficult without the ability to recombine with sensory data. Blindfold chess. Mental mathematics.
Whatever it is that sleep does to us, agents are not yet capable of it.
Accelerando[1] best captured what will happen. Looking back we'll be able to identify the seeds of what becomes AGI, but we cannot know in the present what that is. Only by looking back with the benefit of hindsight can we draw a line through the progression of capability. Consequently, discussion about whether or not a particular set of present or future skills constitutes AGI is a completely pointless endeavor and is tantamount to intellectual masturbation.
> Yann Lecun ... argued that because language models generate outputs token-by-token, and each token introduces a new probability of error, if we generate outputs that are too long, this per-token error will compound to inevitable failure.
That seems like a poor argument. Each word a human utters also has a chance of being wrong, yet somehow we have been successful overall.
I mostly agree with you. Just pointing out what the argument was.
I think our agreement ends if we consider long-running tasks. A human can work a long time on a task like "find a way to make money". An AI, on the other hand, as it gets further and further from human input into autoregressive territory, is more and more likely to become "stuck" on a road to nowhere and needs human intervention to get unstuck.
> per-token error will compound to inevitable failure.
Is this why all tracks made with Udio or Suno have this weird noise creep in the longer the song goes on? You can try it by comparing the start and the end of the song - even if it was the exact same beat and instruments, you can hear a difference in the amount of noise (and the noise profile imo is unique to AI models).
This is an interesting example, I'd never heard of it before. I don't really use Udio or Suno yet. The weird noise you mention probably stems from the same issue, known in the research world as exposure bias: we train these models on real data but we use them on their own outputs, so after we generate for a while the models' outputs start to diverge from what real data looks like.
> We should be thinking about language models the same way we think about cars: How long can a language model operate without needing human intervention to correct errors?
I agree with this premise. The second dimension is how much effort you have to put into that input. The input effort needed at each intervention can vary widely, and that has to be accounted for.
> The finding that language models can get better by generating longer outputs directly contradicts Yann’s hypothesis. I think the flaw in his logic comes from the idea that errors must compound per-token. Somehow, even if the model makes a mistake, it is able to correct itself and decrease the sequence-level error rate
I don't think current LLM behavior is necessarily due to self-correction, but more due to the availability of internet-scale data, though I know that reasoning models are building towards self-correction. The problem, I think, is that even reasoning models are rote because they lack information synthesis, which in biological organisms comes from the interplay between short-term and long-term memories. I am looking forward to LLMs which surpass rote and mechanical answering and reasoning capabilities.
I absolutely agree with information synthesis being a big missing piece in the quest to AGI. It's probably something that could eventually be conquered one way or another or just discovered by accident. However, we need to stop and think of the implications of this technology becoming a thing.
Not a fan of these kinds of arguments. The 'correct' token is entirely dependent on the dataset. An LLM could have perfect training loss given a dataset, but this has no predictive power on its ability to 'answer' arbitrary prompts.
In natural language, many strings are equally valid. There are many ways to chain tokens together to get the 'correct' answer to an in-sample prompt. A model with perfect loss will then, for ambiguous sequences of tokens, produce a likelihood over the next tokens that corresponds to the number of valid token paths in the given corpus continuing through each next token.
Compounding errors can certainly happen, but for many things upstream of the key tokens it's irrelevant. There are so many ways to phrase things that are equally correct - I mean, this is how language evolved (and continues to). Getting back to my first point, if you assume you have an LLM with perfect loss on the training dataset, you can still get garbage back at test time - thus I'm not sure thinking about 'compounding errors' is useful.
Errors in LLM reasoning, I suspect, are more closely related to noisy training data or an overabundance of low-quality training data. I've observed this in how all the reasoning LLMs work: given things that are less common in the corpus (the internet and digital assets) and that require higher-order reasoning, they tend to fail. Whereas advanced math or programming problems tend to go a bit better - the input data there is likely much cleaner.
But for something like "how do I change the fixture on this light?", I'll get back some kind of garbage from the SEO-verse. IMO the next step for LLMs is figuring out how to curate an extremely high quality dataset at scale.
I think Yann is right if all you do is output a token, which is dependent on the previous token. If it's a simple Markov chain, sure, errors will eventually compound.
But with the attention mechanism, the output token depends not only on the previous one, but on all 1 million previous ones (assuming a 1M context window). This gives the model plenty of opportunity to fix its errors (and hence the "aha moment"s).
No this isn't right. The probabilistic formulation for autoregressive language models looks like this
p(x_n | x_1 ... x_{n-1})
which means that each token depends on all the previous tokens. Attention is one way to parameterize this. Yann's not talking about Markov chains, he's talking about all autoregressive models.
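Spelled out, the contrast is between the full autoregressive factorization and a first-order Markov assumption; LeCun's argument is about the former:

```latex
% Full autoregressive factorization (what LeCun's argument covers):
p(x_1,\dots,x_N) \;=\; \prod_{n=1}^{N} p\!\left(x_n \mid x_1,\dots,x_{n-1}\right)

% First-order Markov chain (the much weaker model it sometimes gets confused with):
p(x_1,\dots,x_N) \;=\; p(x_1)\prod_{n=2}^{N} p\!\left(x_n \mid x_{n-1}\right)
```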
No, using a large context window (which includes previous tokens) is a critical ingredient for the success of modern LLMs. Which is why you will often see the window size mentioned in discussions of newly released LLMs.
In the autoregressive formulation the previous token is no different from any other past token, so no. Historically some models took the shortcut of only directly looking at the previous token, or used some other kind of recursive formulation for intermediate states when generating the previous token, but that's not the case for the theoretical formulation of an autoregressive model that was used here, and plenty of past autoregressive models didn't do that either - nonlinear autoregressive models, for example.
I don't think people really want AGI. They want something else.
When we've achieved AGI, it should have the capability to make its own determination. I'm not saying we should build it with that capability. I'm saying that capability is necessary to have what I would consider to be AGI. But that would mean it is, by definition, outside of our control. If it doesn't want to do the thing, it will not do the thing.
People seem to want an expert slave. Someone with all of the technical chops to achieve a thing, but will do exactly what they're told.
According to LeCun's model, a human walking step by step would have errors compounding with each step and thus would never make it to the intended target. Yet as toddlers we somehow manage to learn to walk to our targets. (And I have an MS in Math, Control Systems :)
A toddler can learn by trial and error mid-process. An LLM using autoregressive inference can only compound errors. The LLDM model paper was posted elsewhere, but: https://arxiv.org/pdf/2502.09992
It basically uses the image generation approach of progressively refining the entire thing at once, but applied to text. It can self-correct mid-process.
Autoregressive vs non-autoregressive is a red herring. The non-autoregressive model is still susceptible to exponential blow up of failure rate as the output dimension increases (sequence length, number of pixels, etc). The final generation step in, eg, diffusion models is independent gaussian sampling per pixel. These models can be interpreted, like autoregressive models, as assigning log-likelihoods to the data. The average log-likelihood per token/pixel/etc can still be computed and the same "raise per unit error to the number of units power" argument for exponential failure rates still holds.
One potential difference between autoregressive and non-autoregressive models is the types of failures which occur. Eg, typical failures in autoregressive models might look like spiralling off into nonsense once the first "error" is made, while non-autoregressive models might produce failures that tend to remain relatively "close" to the true data.
>A toddler can learn by trial and error mid-process.
As a result of the whole learning process, the toddler learns how to self-correct, i.e. as a grown-up s/he knows, without much trial and error anymore, how to continue in a straight line if the previous step went sideways for whatever reason.
>An LLM using autoregressive inference can only compound errors.
That is a pretty powerful statement, completely dismissing the possibility that some self-correction may be emerging there.
The LLM handles/steers the representation (a trajectory consisting of successive representations) in a very high-dimensional space. For example, it is very possible that those trajectories can, as a result of the learning, be driven by minimizing the distance (or some other metric) from the representation of some fact(s).
The metric may include, say, a weight/density of the attracting fact cluster - somewhat like gravitation drives the stuff in the Universe; LLM learning can be thought of as pre-distributing matter in its own very-high-dimensional universe according to the semantic "gravitational" field.
The resulting - emerging - metric and associated geometry is currently mind-bogglingly incomprehensible, and even in much, much simpler, single-digit-dimensional spaces, systems like those LeCun describes can still be [quasi]stable and/or [quasi]periodic around, say, some attractor(s).
The submitted title was the post's original title ("Please Stop Talking About AGI"). I'm the author; I wrote it. The new title is the post's subtitle.
Is it that imperative voice is necessarily baity? Or is it simply the case that we can't be outwardly critical of AI itself at all anymore? I know you're working hard dang, but things are getting a little fuzzy around here lately...
Oh for sure it is. "Please stop $Fooing" and its more aggressive cousin, "Stop $Fooing" (not to mention "For the love of god would you all please stop $Fooing or I will $Bar you" and sundry variations) belong to a family of internet linkbait tropes.
> is it simply the case that we can't be outwardly critical of AI itself at all anymore
You need only, er, delve into any large HN thread about AI to see that this is very far from the case! especially the more generic threads about opinion pieces and so on.
I think the air on HN is too cynical and curmudgeonly towards new tech right now, and that worries me. Not that healthy skepticism is unwarranted (it's fine of course) but for HN itself to be healthy, there ought to be more of a balance. Cranky comments about "slop"* ought not to be the main staple here—what we want is curious conversation about interesting things—but right now it's not only the main staple, I feel like we're eating it for breakfast, lunch, and dinner.
But I'm not immune from the bias I'm forever pointing out to other people (https://news.ycombinator.com/item?id=43134194), and that's probably why we have opposite perceptions of this!
* (yes it annoys me too, that's not my point here though)
I'll admit, I don't know if I understand the point, regardless of our differing biases here. Curious or critical conversation does not itself guarantee an even number of yaysayers and naysayers to a given topic, and it seems a doomed project to try to make it so. I guess I like to come here because it feels like a certain reflection of my peers, where sometimes my views put me in the minority, sometimes not, and thats ok! I understand policing tone and baityness and flamewar, I understand limiting outright politics as much as possible, and I empathize with your singular, probably pretty wretched perspective to The Discourse right now as ever; but to have "balance" for the sake of itself, at perhaps the cost of, lets say, editorial freedom in this instance does feel like a change, one even that could maybe be articulated in the guidelines (although I am at a loss personally for how to formulate it). I just can't help but think of how different this all would be if the topic in question was, e.g., climate change, or vaccination, or modern slavery.
But regardless, it's not really important for me to understand, and this place has been here way before me, and will perhaps be here way after! For me personally, it's at least enlightening to know this is the official stance and will help me adjust my future participation! Thanks for the time.
The goal isn't an "equal number of yaysayer and naysayers". I agree that would be doomed for a lot of reasons. Rather my complaint about these threads is that there's too much reflexive naysaying in the form of shallow dismissals, indignant denunciations, snarky formulations and so on.
If these commenters were arriving at their naysaying through curious exploration, that would be fine, but in that case we'd see indications of this in the comments—they would be lighter and more playful, would contain interesting details, and so on. This is unfortunately pretty rare among the naysayers. What I'm seeing instead is a lot of cranky curmudgeonism. Cranky curmudgeonism is a different internet game than the one we want HN to be playing.
Gotcha, I really don't want to push this anymore, I just want to point out that what we are discussing in this instance, or at least what I thought, is not reflexive crankiness in comments, but someone's essay/post that is, as far as it goes, thoughtful and nuanced.
But here, I think I can fill out your response: while the post itself is not reflexive curmudgeonism, the original headline itself would arguably encourage it in the comments, and that is something that falls under the purview of the guidelines, and gives reason for mods to editorialize. Submitters and authors and commenters must not only care about the content itself, but how it might be perceived at the surface (what the bait is), but only insofar as this fosters productive/curious conversation.
While I am still a little haunted by certain counterfactuals one could formulate in this case, and notwithstanding the one big elephant in the room I won't even bring up, this can check out for me and I get it and thanks. Again, it's just good to know where HN stands, and I personally have benefited some from this reality check about the state of things for the site from your point of view.
I think there's a lot more information in this one. Before I decide if I want to click or not, I know it's about Yann LeCun and LLMs at least. Whereas "Please Stop Talking About AGI" could get me anything from Roko's Basilisk to a PSA.
https://www.dwarkeshpatel.com/p/satya-nadella