
"Orion’s problems signaled to some at OpenAI that the more-is-more strategy, which had driven much of its earlier success, was running out of steam."

So LLMs finally hit the wall. For a long time, more data, bigger models, and more compute to drive them worked. But that's apparently not enough any more.

Now someone has to have a new idea. There's plenty of money available if someone has one.

The current level of LLM would be far more useful if someone could get a conservative confidence metric out of the internals of the model. This technology desperately needs to output "Don't know" or "Not sure about this, but ..." when appropriate.



The new idea is inference-time scaling, as seen in o1 (and o3 and Qwen's QwQ and DeepSeek's DeepSeek-R1-Lite-Preview and Google's gemini-2.0-flash-thinking-exp).

I suggest reading these two pieces about that:

- https://www.aisnakeoil.com/p/is-ai-progress-slowing-down - best explanation I've seen of inference scaling anywhere

- https://arcprize.org/blog/oai-o3-pub-breakthrough - François Chollet's deep dive into o3

I've been tracking it on this tag on my blog: https://simonwillison.net/tags/inference-scaling/


I think the wildest thing is actually Meta's latest paper, where they show a method for LLMs to reason not in English, but in latent space

https://arxiv.org/pdf/2412.06769

I’ve done research myself adjacent to this (mapping parts of a latent space onto a manifold), but this is a bit eerie, even to me.


Is it "eerie"? LeCun has been talking about it for some time, and may also be OpenAI's rumored q-star, mentioned shortly after Noam Brown (diplomacybot) joining OpenAI. You can't hill climb tokens, but you can climb manifolds.


I wasn’t aware of others attempting manifolds for this before - just something I stumbled upon independently. To me the “eerie” part is the thought of an LLM no longer using human language to reason - it’s like something out of a sci fi movie where humans encounter an alien species that thinks in a way that humans cannot even comprehend due to biological limitations.

I am hopeful that progress in mechanistic interpretability will serve as a healthy counterbalance to this approach when it comes to explainability, though I kinda worry that at a certain point something resembling a scaling law will put an upper bound on even that.


Is it really alien or is it more similar to how we think? We don't think purely in language, it's more a kind of soup of language, sounds, images, emotions and senses that we then turn into language when we communicate with each other.


> it’s like something out of a sci fi movie where humans encounter an alien species that thinks in a way that humans cannot even comprehend due to biological limitations.

I've increasingly felt this since GPT-2 wrote that news piece about unicorns back in 2019. These models are still so mysterious when you think about it. They can often solve decently complex math problems, but routinely fail at counting. Many have learned surprising skills like chess, but only when prompted in very specific ways. Their emergent abilities constantly surprise us, and we have no idea how they really work internally.

So the idea that they reason using something other than human language feels unsurprising, but only because everything about it is surprising.


I remember (apocryphal?) reports of Microsoft's chatbots developing a pidgin to communicate with other chatbots. Every layer of the NN except the first and last already "thinks" in latent space; is this surprising?


Interesting paper on this. "Automated Search for Artificial Life" https://sakana.ai/asal/


> You can't hill climb tokens, but you can climb manifolds.

Could you explain this a bit please?


I imagine he means that when you reason in latent space the final answer is a smooth function of the parameters, which means you can use gradient descent to directly optimize the model to produce a desired final output without knowing the correct reasoning steps to get there.

When you reason in token space (like everyone is doing now), sampling after each token is a discrete, non-differentiable operation, so you have to use some kind of reinforcement learning algorithm to learn the weights.
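
A toy PyTorch sketch of that difference (the tiny GRU stand-in for a transformer block, and the way the hidden state is fed back, are my illustrative assumptions, not the paper's actual architecture):

    import torch
    import torch.nn as nn

    vocab, dim = 100, 32
    embed = nn.Embedding(vocab, dim)
    core = nn.GRUCell(dim, dim)          # stand-in for the transformer stack
    unembed = nn.Linear(dim, vocab)

    h0 = torch.zeros(1, dim, requires_grad=True)

    # Token-space reasoning: sampling a discrete token cuts the gradient.
    h1 = core(embed(torch.tensor([5])), h0)
    probs = unembed(h1).softmax(-1)
    token = torch.multinomial(probs, 1)  # discrete sample: no gradient flows through it
    x_next = embed(token.squeeze(0))     # a downstream loss can't reach `probs` this way

    # Latent-space reasoning: feed the hidden state straight back in.
    h2 = core(h1, h1)                    # the "thought" stays continuous
    loss = unembed(h2).sum()             # any downstream objective
    loss.backward()
    print(h0.grad is not None)           # True: every reasoning step is differentiable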


I think there's a subtlety here about what makes (e.g. English) tokens different to points in latent space. Everything is still differentiable (at least in the ML sense) until you do random sampling. Even then you can exclude the sampling when calculating the gradient (or is this equivalent to the "manifold"?).

I don't see a priori why it would be better or worse to reason with the "superposition" of arguments in the pre-sampling phase rather than concrete realizations of those arguments found only after choosing the token. It may well be a contingent rather than necessary fact.
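
For what it's worth, the usual way to "exclude the sampling" when calculating the gradient is the straight-through estimator: sample hard in the forward pass, but pretend it was the soft distribution in the backward pass. A minimal sketch (toy logits; whether this is what the parent means by the "manifold" is my guess):

    import torch

    logits = torch.randn(1, 100, requires_grad=True)
    probs = logits.softmax(-1)

    # Hard, non-differentiable sample in the forward pass...
    index = torch.multinomial(probs, 1)
    one_hot = torch.zeros_like(probs).scatter_(-1, index, 1.0)

    # ...rewired so the backward pass sees the soft probabilities instead.
    straight_through = one_hot + probs - probs.detach()

    loss = (straight_through * torch.arange(100.0)).sum()  # any downstream loss
    loss.backward()
    print(logits.grad.abs().sum() > 0)  # tensor(True): gradients reach the logits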


Links to Yann:

Title: "Objective Driven AI: Towards Machines that can Learn, Reason, and Plan"

Lytle Lecture Page: https://ece.uw.edu/news-events/lytle-lecture-series/

Slides: https://drive.google.com/file/d/1e6EtQPQMCreP3pwi5E9kKRsVs2N...

Video: https://youtu.be/d_bdU3LsLzE?si=UeLf0MhMzjXcSCAb


It's just concept space. The entire LLM works in this space once the embedding layer is done. It's not really that novel at all.


This was my thought. Literally everything inside a neural network is a "latent space", straight from the embeddings you use to map categorical features in the first layer.

Latent space is where the magic literally happens.


Completely agree. Have you seen this?

https://sakana.ai/asal/


kinda how we do it. language is just an io interface (but also neural, obv) on top of our reasoning engine.


It's not just a protocol buffer for concepts though (weak Sapir-Whorf, Lakoff's ubiquitous metaphors). Language itself is also a concept layer, and plasticity and concept development are bidirectional. But (I'm not very versed in the terminology here re 'latent space') I would imagine the forward pass through the layers converges towards near-token matches before output, so you have reasoning very similar to token/language reasoning even in latent/conceptual reasoning? Like the neurons that respond almost exclusively to a single token, for example.


A standard approach in AI research is to "move X into the latent space", where X is some useful function (e.g. diffusion) previously done in the "data" or "artefact" space. So this seems a very pedestrian step, not a wild one.


There are lots of papers that do this


> So LLMs finally hit the wall

Not really. Throwing a bunch of unfiltered garbage at the pretraining dataset, throwing in RLHF of questionable quality during post-training, and other current hacks - none of that was expected to last forever. There is so much low-hanging fruit that OpenAI left untouched and I'm sure they're still experimenting with the best pre-training and post-training setups.

One thing researchers are seeing is resistance to post-training alignment in larger models, but that's almost the opposite of a wall; they're figuring that out as well.

> Now someone has to have a new idea

OpenAI already has a few, namely the o* series, in which they discovered a way to bake chain of thought into the model via RL. Now we have reasoning models that destroy benchmarks they previously couldn't touch.

Anthropic has a post-training technique, RLAIF, which supplants RLHF, and it works amazingly well. Combined with countless other tricks we don't know about in their training pipeline, they've managed to squeeze so much performance out of Sonnet 3.5 for general tasks.

Gemini is showing a lot of promise with their new Flash 2.0 and Flash 2.0-Thinking models. They're the first models to beat Sonnet at many benchmarks since April. The new Gemini Pro (or Ultra? whatever they call it now) is probably coming out in January.

> The current level of LLM would be far more useful if someone could get a conservative confidence metric out of the internals of the model. This technology desperately needs to output "Don't know" or "Not sure about this, but ..." when appropriate.

You would probably enjoy this talk [0]; it's by an independent researcher who IIRC is a former employee of DeepMind or some other lab. They're exploring this exact idea. It's actually not hard to tell when a model is "confused" (just look at the probability distribution of likely tokens); the challenge is in steering the model to either get back on the right track or give up and say "you know what, idk".

[0] https://www.youtube.com/watch?v=4toIHSsZs1c
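
A crude sketch of that "look at the distribution" idea, using next-token entropy as the confusion signal (GPT-2 via Hugging Face only because it's small; the interpretation is mine, not necessarily the talk's exact method):

    # pip install torch transformers
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def next_token_entropy(prompt: str) -> float:
        """Entropy (nats) of the next-token distribution: higher = less sure."""
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]  # logits for the next token
        log_probs = logits.log_softmax(-1)
        return -(log_probs.exp() * log_probs).sum().item()

    print(next_token_entropy("The capital of France is"))    # low: confident
    print(next_token_entropy("The name of my neighbor is"))  # high: anyone's guess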


> Not really. Throwing a bunch of unfiltered garbage at the pretraining dataset, throwing in RLHF of questionable quality during post-training, and other current hacks - none of that was expected to last forever. There is so much low-hanging fruit that OpenAI left untouched and I'm sure they're still experimenting with the best pre-training and post-training setups.

Exactly! Llama 3 and its .x iterations have shown that, at least for now, the idea of using previous models to filter the pre-training datasets and using a small number of seeds to create synthetic post-training datasets still holds. We'll see with Llama 4 whether it continues to hold.


The problem is data.

GPT-3 was trained at a 4:1 ratio of training tokens to parameters, and for GPT-4 the ratio was 10:1. To continue the scaling trend, GPT-5 should be at 25:1. The parameter count jumped from 175B to 1.3T, which means GPT-5 should be 10T parameters and 250T training tokens. There is zero chance OpenAI has a training set of 250T high-quality tokens.

If I had to guess, they trained a model that was maybe 3-4T parameters, using 30-50T high-quality tokens and maybe 10-30T medium- and low-quality ones.

There is only one company in the world that stores the data that could get us past the wall.

The training cost of the scaled-up GPT-5 above is 150x that of GPT-4, which took 25k A100s for 90 days, with poor MFU.

Let’s assume they double MFU, it would mean 1M H100s. But let’s say they made algorithmic improvements, so maybe it’s only 250-500k H100s.

While the training cluster was 100k GPUs and later grew to 150k, a cluster of that size suggests a smaller model and less data.

But ultimately data is the bottleneck.
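
Spelling out that back-of-envelope extrapolation (all the ratios and parameter counts here are the parent's guesses, not published figures):

    # Parent's claimed tokens-per-parameter ratios and model sizes
    gpt3_params, gpt3_ratio = 175e9, 4
    gpt4_params, gpt4_ratio = 1.3e12, 10

    # Extrapolate one more generation at the same growth rates
    gpt5_ratio = gpt4_ratio * (gpt4_ratio / gpt3_ratio)      # 25:1
    gpt5_params = gpt4_params * (gpt4_params / gpt3_params)  # ~10T params
    gpt5_tokens = gpt5_params * gpt5_ratio                   # ~250T tokens

    print(f"params: {gpt5_params:.2e}, training tokens: {gpt5_tokens:.2e}")
    # params: 9.66e+12, training tokens: 2.41e+14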


We're also increasingly using synthetic data to train them on, though, and the race now is in coming up with better ways to generate it.


Links?


What wall? Not a week has gone by in recent years without an LLM breaking new benchmarks. There is little evidence to suggest it will all come to a halt in 2025.


Sure, but "benchmarks" here seems roughly as useful as "benchmarks" for GPUs or CPUs, which don't much translate to what the makers of GPT need, which is 'money making use cases.'


o3 has demonstrated that OpenAI needs 1,000,000% more inference-time compute to score 50% higher on benchmarks. If o3-high costs about $350k an hour to operate, making o4 score 50% higher would cost $3.5B (!!!) an hour. That's the scaling wall.


I used to run a lot of monte carlo simulations where the error is proportional to the inverse square root. There was a huge advantage of running for an hour vs a few minutes, but you hit the diminishing returns depressingly quickly. It would not surprise me at all if llms end up having similar scaling properties.
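
The classic illustration of that inverse-square-root law is a Monte Carlo estimate of pi (a toy stand-in, obviously not my actual simulations): 100x more samples buys only about 10x less error.

    import math
    import random
    import statistics

    def estimate_pi(n: int) -> float:
        """Fraction of random points in the unit square landing in the quarter circle."""
        hits = sum(random.random() ** 2 + random.random() ** 2 < 1 for _ in range(n))
        return 4 * hits / n

    for n in (100, 10_000, 1_000_000):
        errors = [abs(estimate_pi(n) - math.pi) for _ in range(10)]
        print(f"n={n:>9}: mean error {statistics.mean(errors):.4f}")
    # error shrinks roughly 10x for every 100x more samples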


And I suspect o3 does something like Monte Carlo: it generates tons of CoTs, most of them junk, but some hit the answer.


Sounds plausible, given that I've recently seen a ton of research papers in the space that incorporate MCTS in some way or another.


Yeah, any situation where you need O(n^2) runtime to obtain n bits of output (or bits of accuracy, in the Monte Carlo case) is pure pain. At every point it's still within your means to double the amount of output (by running 3x longer than you have so far), but it gradually becomes more and more painful, instead of there being a single point where you can call it off.


I'm convinced they're getting good at gaming the benchmarks, since 4 has deteriorated via ChatGPT. In fact, I've used 4-0125 and 4-1106 via the API and find them far superior to o1 and o1-mini at coding problems. GPT-4 is an amazing tool, but its true capabilities are being hidden from the public and/or intentionally neutered.


> I’ve used 4-0125 and 4-1106 via the API and find them far superior to o1 and o1-mini at coding problems

Just chiming in to say you're not alone. This has been my experience as well. The o# line of models just don't do well at coding, regardless of what the benchmarks say.


All the benchmarks provide substantial scaffolding and specification details, and that's if they are zero-shot at all, which they often are not. In reality, nobody wants to spend that much time providing details or examples just to get the AI to write the correct function, when the same time and effort would let you write it yourself.

Also, those benchmarks often run the model K times on the same question, and if any one run is correct, they say it passed. That could mean that if you re-ran the model 8 times, it might come up with the right answer only once. But now you have to waste your time checking whether it is right or not.

I want to ask: "Write a function to count unique numbers in a list" and get the correct answer the first time.

What you need to ask:

""" Write a Python function that takes a list of integers as input and returns the count of numbers that appear exactly once in the list.

The function should: - Accept a single parameter: a list of integers - Count elements that appear exactly once - Return an integer representing the count - Handle empty lists and return 0 - Handle lists with duplicates correctly

Please provide a complete implementation. """

And run it 8 times and if you're lucky it'll get it correct zero-shot.

Edit: I'm not even aware of a pass@1, zero-shot benchmark without detailed prompting (i.e. natural prompting). If anyone knows of one, let me know.
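
For reference, the pass@k numbers benchmarks report are usually computed with the unbiased estimator from OpenAI's Codex paper (n samples drawn, c of them correct). A small sketch of why pass@8 can look great while pass@1, the number a user actually experiences, does not:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Probability that at least one of k draws (from n samples, c correct) passes."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # One correct answer out of 8 runs:
    print(pass_at_k(8, 1, 8))  # 1.0   -- "solved" at pass@8
    print(pass_at_k(8, 1, 1))  # 0.125 -- what a single query actually gets you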


Wait a few months and they will have a distilled model with the same performance at 1% of the run cost.


A 100x efficiency improvement (doubtful) still means that costs grow 200x faster than benchmark performance.


Even assuming that past rates of inference cost scaling hold up, we would only expect a 2 OoM decrease after about a year or so. And 1% of $3.5B is still a very large number.


And to your point, "past performance is not indicative of future results". The extrapolate-to-infinity approach is the mind-fever of this field.


Not really. o3-low still stomps the benchmarks and isn't anywhere near that expensive, and o3-mini seems better than o1 while being cheaper.

Combine that with the fact that LLM inference costs have dropped by orders of magnitude over the last few years, and harping on the inference costs of a new release seems a bit silly.


If you are talking about the ARC benchmark, then o3-low doesn't look that special once you take into account that plenty of fine-tuned models achieved 40-50% results on the private set (not semi-private like o3-low) with much smaller resources.


- I'm not just talking about ARC. On Frontier Math, we have two scores, one with pass@1 and another with consensus voting over 64 samples. Both scores are much better than the previous SotA.

- Also, apparently the ARC result wasn't a special fine-tune; rather, some of the public training set was in the corpus for pre-training.


> On Frontier Math

that result is not verifiable or reproducible, and it's unknown whether it was leaked or how it was measured. It's kinda hype science.

> the ARC result wasn't a special fine-tune; rather, some of the public training set was in the corpus for pre-training.

The post says: Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details.

So, I guess we don't know.


>that result is not verifiable or reproducible, and it's unknown whether it was leaked or how it was measured. It's kinda hype science.

It will be verifiable when the model is released. OpenAI haven't released any benchmark scores that were later shown to be falsified, so unless you have an actual reason to believe they're outright lying, it's not something to take seriously.

Frontier Math is a private benchmark. Of its highest tier of difficulty, Terence Tao says:

“These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”

Unless you have a reason to believe answers were leaked then again, not interested in baseless speculation.


> OpenAI haven't released any benchmark scores

there are multiple research results demonstrating that various benchmarks have leaked heavily into GPT training data.

Whether that's intentional or not, we can't tell, but they have a very strong incentive to cheat to attract more investment.

> Unless you have a reason to believe answers were leaked then again, not interested in baseless speculation.

this is just scientific methodology: results have to be reproduced or confirmed before they're believed.


Again, Frontier Math is private. The benchmarks that leaked into GPT-4's training data were all public datasets on the internet. Frontier Math literally cannot leak that way.

If you don't want to take the benchmarks at face value, then good for you, but this entire conversation is pointless.


> Again, Frontier Math is private.

it's private for outsiders, but it was developed in "collaboration" with OAI, and GPT was tested on it in the past, so they have it in logs somewhere.

> If you don't want to take the benchmarks at face value, then good for you, but this entire conversation is pointless.

If you think this entire conversation is pointless, then why do you continue?


>it's private for outsiders, but it was developed in "collaboration" with OAI, and GPT was tested on it in the past, so they have it in logs somewhere.

They probably have logs of the questions, but that's not enough. Frontier Math isn't something that can be fully solved without gathering top experts in multiple disciplines. Even Tao says he only knows who to ask for the most difficult set.

Basically, what you're suggesting, at least with this benchmark in particular, is far more difficult than you're implying.

>If you think this entire conversation is pointless, then why do you continue?

There's no point arguing about how efficient the models are (the original point) if you won't even accept the results of the benchmarks. Why am I continuing? For now, it's only polite to clarify.


> Frontier Math isn't something that can be fully solved without gathering top experts

Tao's quote above referred to the hardest 20% of the problems; they have 3 levels of difficulty, and presumably the first level is much easier. Also, as I mentioned, OAI collaborated on creating the benchmark, so they could have access to all the solutions too.

> There's no point arguing

Lol, let me ask again: why are you arguing then? Yes, I have strong, reasonable (imo) doubt that those results are valid.


The lowest set is easier but still incredibly difficult. Top experts are no longer required, sure, but that's it. You'll still need the best of the best undergrads at the very least to solve it.

>Also, as I mentioned, OAI collaborated on creating the benchmark, so they could have access to all the solutions too.

OpenAI didn't have any hand in providing problems; why you assume they have the solutions, I have no idea.

>Lol, let me ask again: why are you arguing then? Yes, I have strong, reasonable (imo) doubt that those results are valid.

Are you just being obtuse, or what? I stopped arguing with you a couple responses ago. You have doubts? Good for you. They don't make much sense, but hey, good for you.

This is my last response here so have a nice day.


> You'll still need the best of the best undergrads at the very least to solve it.

Ok, so I hope you now admit that OAI could manually solve them?

> OpenAI didn't have any hand in providing problems

And you know this exactly how?

> I stopped arguing with you a couple responses ago

sure, of course, lmao


It is still not economical: on ARC it's at least $20 per task vs ~$3 for a human (avg MTurker) at the same performance.


Not necessarily. And this is the problem with ARC that people seem to forget.

- It's just a suite of visual puzzles. It's not like, say, GSM8K, where proficiency gives some indication of math proficiency in general.

- It's specifically a suite of puzzles that LLMs have shown particular difficulty in.

Basically, how much compute it takes to handle a task in this benchmark does not correlate with how much it will take LLMs to handle the tasks people actually want to use LLMs for.


If the benchmark is not representative of normal usage*, then the benchmark and the plot being shown are not useful at all from a user/business perspective, and the focus on the breakthrough scores of o3-low and o3-high on ARC-AGI would be highly misleading. The "representative" point is also really moot from the discussion's perspective (i.e. saying o3 stomps benchmarks, but the benchmarks aren't representative).

*I don't think that is the case, since you can at least draw relative conclusions (i.e. o3 vs the o1 series: o3-low is 4x to 20x the cost for ~3x the performance). Even if it is pure marketing, they expect people to draw conclusions from ARC's perf/cost plot.

PS: I know there are more benchmarks, like SWE-Bench and Frontier Math, but ARC is the only one showing data about o3-low/high costs, unless you count the CodeForces plot that includes o3-mini (that one does look interesting, though right now it's vaporware) but doesn't separate the compute scale modes.


>If the benchmark is not representative of normal usage*, then the benchmark and the plot being shown are not useful at all from a user/business perspective, and the focus on the breakthrough scores of o3-low and o3-high on ARC-AGI would be highly misleading.

ARC is a very hyped benchmark in the industry, so publicizing the results is something any company would do, whether or not they bear directly on normal usage.

>Even if it is pure marketing, they expect people to draw conclusions from ARC's perf/cost plot.

Again, people care about ARC; they don't care about doing the things ARC questions ask. That it is uneconomical to pay the price to use o3 for ARC does not mean it would be uneconomical for the tasks people actually want to use LLMs for. What does 3x the performance in, say, coding mean? You really think companies/users wouldn't put up with the increased price for that? You think they have MTurkers to turn to, like they do with ARC?

ARC is literally the quintessential 'easy for humans, hard for AI' benchmark. Even if you discard the 'difficulty and price won't scale the same way' argument, it makes no sense to use it for an economics comparison.


In summary: "stomps benchmarks" means nothing for anyone trying to make decisions based on that announcement (yet they show cost/perf info). It seems, well, hypey.


Unfortunately, the best they can do is "this is my confidence in what someone would say given the prior context".


What someone from the past would have said.


The new idea is already here and it's reasoning / chain of thought.

Anecdotally Claude is pretty good at knowing the bounds of its knowledge.


Anecdotally Claude is just as bad as every other LLM.

Step into more niche areas, e.g. I am trying to use it with Scala macros, and at least 90% of the time it gives code that either (a) fails to compile or (b) is just complete gibberish.

And at no point ever has it said it didn't know something.


Yep, get into any sufficiently deep niche (i.e. almost any non-trivial app) and the LLM magic wears off.

Yeah, sure, you can make a pong clone in html/js, mainly because the internet is full of pong clone demos. Ask how to constrain a statsmodels linear model in some non-standard way? It will gaslight you into believing it's possible and make you lose time in the process.


Making a pong clone by telling the LLM to make a pong clone is a cute trick that sometimes works, but that's not how anyone who understands how to properly use these tools is using them. You don't describe an app and hope the LLM builds it correctly. You have to know how to architect an application, and you use the LLM to build small pieces of code. For example, you tell it to build a function that does x, takes the inputs a, b, and c, and returns z.

LLMs don't turn non-coders into coders. They give actual coders superpowers.


No true Scotsman fallacy. I know how to use them, but using them "correctly" still produces many errors.

They suck at non-trivial code outside of standard-library usage and boilerplate coding: I gave an example, and the parent did as well. In that regard, I would at least change your phrase from "actual coders" to "actual senior coders", as any junior receiving bad advice (in the eternal loops LLMs like to produce) is only going to waste time and tokens.


My point is that while you do have to give them coding problems that would have appeared in their training set (I guess you could call those trivial), every coding problem becomes trivial when you break it down into its constituent parts. As you know, the biggest applications are just a lot of very simple building blocks working together. The point of using LLMs to code is not to solve complex problems. It's to write code you could have written yourself, at the speed of light, using a natural-language interface.

The way you described using LLMs to code seems like the approach someone who doesn't know how to build software might take, which is why I used the wording I did. From that angle, I agree with you: I can't even get Sonnet to create a working prototype of a basic game from a prompt. That said, I'm using it to build a far more complex enterprise web app step by step, in the way I mentioned above. It does work for these things, but you have to already know how to do what the LLM is doing.


I mentioned the pong example because that is what non-coder LLM users show off, and it's what the industry is proposing as the future of software development: no coding experience necessary.

> It does work for these things, but you have to already know how to do what the LLM is doing.

Yes, we totally agree. But even then, in my experience, using models "correctly" and breaking problems down for them only gets you so far; once you start using weird/niche APIs (probably even your own APIs, once your project gets big enough and you are no longer working with much boilerplate), the LLM will start getting individual concepts wrong.

And don't get me wrong, I see those as limitations of a tech that is still immensely useful in the right hands. My only issue is with how these products are actually being marketed: as junior devs' copilots or even replacements.


As a coder with some non-coder friends who have made some very impressive things with ChatGPT: you're selling it short.

It does both. It gives coders superpowers, and gives noncoders the ability to do things that would have previously taken them months, or another person.


Do you mind sharing what they've created with it?


They created a touchscreen GUI in tkinter, with more-than-trivial behavior, to use as an input frontend for a device they had built. They were able to describe what they wanted and, in less than two hours, have it working. This is someone with no software experience.

Three years ago, if I had been asked to create something like that, it would have taken me more than two hours, just because I'd never used tkinter and would have had to spend time reading the docs and figuring out how to make the different input boxes and lay them out properly.

I looked at the code, and no, it's not great. It's not designed "well" and isn't very extensible. But it works for him, doesn't need to be extended, and all in half a morning.


Not even close. I'm a programmer but also a guitarist. I love asking it to tab out songs for me, or asking it how many bars are in the intro of a song. It convincingly gives an answer that is always way off the mark.


> Now someone has to have a new idea. There's plenty of money available if someone has one.

I honestly do claim to have some ideas, and I see evidence that they might work (I am attempting to work privately on a prototype, if only out of curiosity and to see whether I am right). The bad news: these ideas very likely won't be helpful to the LLM companies, because they don't serve their agenda and follow a very different approach.

So no money for me. :-(

Let me put it this way:

Have you ever talked to a person whose intelligence is miles above yours? It can easily become very exhausting. Thus an "insanely intelligent" AI would not be of much use to most people; it would think "too differently" from them.

There do exist tasks in commerce for which an insane amount of intelligence would make a huge difference (in the sense of improving some important KPIs), but these are rare. I can imagine some applications of such (fictional) "super-intelligent" AIs in finance and in companies doing bleeding-edge scientific research, but these are niche applications (though potentially very lucrative ones).

If OpenAI, Anthropic & Co were really attempting to develop some "super-smart" AI, they would be working on those very lucrative niche applications where an insane amount of intelligence makes a huge difference, and where you can assume, and train for, an AI operator with "Fields-medal level" intelligence.


> So LLMs finally hit the wall. For a long time, more data, bigger models, and more compute to drive them worked

We can't say whether there is a wall, since we don't have any more data to train on.


I'm wondering whether o3 can be used to explore its own improvement or optimization ideas, or whether it hasn't reached that point yet.


The new idea is the o series, and it's clearly OpenAI's main focus now. It's advancing much faster than the GPT series did.


Seriously? All they do is produce a “confidence metric”


But how do they do that?


To output "don't know" a system needs to "know" too. Random token generator can't know. It can guess better and better, maybe it can even guess 99.99% of time, but it can't know, it can't decide or reason (not even o1 can "reason").



