
What we can reasonably assume from statements made by insiders:

They want a 10x improvement from scaling and a 10x improvement from data and algorithmic changes

The sources of public data are essentially tapped

Algorithmic changes will be an unknown to us until they release, but from published research this remains a steady source of improvement

Scaling seems to stall if data is limited

So with all of that taken together, the logical step is to figure out how to turn compute into better data to train on. Enter strawberry / o1, and now o3

They can throw money, time, and compute at thinking about and then generating better training data. If the belief is that N billion new tokens of high quality training data will unlock the leap in capabilities they’re looking for, then it makes sense to delay the training until that dataset is ready

With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field. OpenAI’s next moat may be the best synthetic training set ever.

At this point I would guess we get 4.5 with a subset of this - some scale improvement, the algorithmic pickups since 4 was trained, and a cleaned and improved core data set but without risking leakage of the superior dataset

When 5 launches, we get to see what a fully scaled version looks like with training data that outstrips average humans in almost every problem space

Then the next o-model gets to start with that as a base and reason? It's likely to be remarkable.



Great improvements and all, but they are still no closer (as of 4o regular) to having a system that can be responsible for work. In math problems it forgets which variable represents what; in coding questions it invents library fns.

I was watching a YouTube interview with a "trading floor insider". They said they were really being paid for holding risk. The bank has a position in a market, and it's their ass on the line if it tanks.

ChatGPT (as far as I can tell) is no closer to being accountable or responsible for anything it produces. If they don't solve that (and the problem is probably inherent to the architecture), they are, in some sense, polishing a turd.


> They said they were really being paid for holding risk.

I think that's a really interesting insight that has application to using 'AI' in jobs across the board.


This is underdiscussed. I don't think people understand just how worthless AI is in a ton of fields until it is able to be held liable and be sent to prison.

There are a lot of moral conundrums that are just not going to work out with this. It seems like an attempt to offload liability, and pretty much everybody has caught onto that as being its main selling point, which is probably the main thing that will keep it from ever being accepted for anything important.


> ChatGPT (as far as I can tell) is no closer to being accountable or responsible for anything it produces.

What does it even mean? How do you imagine that? You want OpenAI to take on liability just for kicks?


If an LLM can't be left to do the mowing by itself, and a human has to closely monitor it and intervene at every step, then it's just a super fast predictive keyboard, no?


But what if the human only has to intervene once every 100 hours? That's a huge productivity boost.


The point is you don't know when in those 100 hours that will be, so you still need to monitor the full 100-hour time span.

Can still be a boost. But definitely not the same magnitude.


And one might still wonder whether we need a general language model to mow the grass, or just a simpler solution to the problem of driving a mower over a fixed property line automatically. Something you could probably solve with WWII-era technology, honestly.


Obviously not. I want legislation which imposes liability on OpenAI and similar companies if they actively market their products for use in safety-critical fields and their product doesn’t perform as advertised.

If a system is providing incorrect medical diagnoses, or denying services to protected classes due to biases in the training data, someone should be held accountable.


Personal responsibility, not legal liability. In the way a child can be responsible for a pet.

ChatGPT was trained on benchmarks and user opinions - "throwing **** at the wall to see what sticks".

Responsibility means penalties for making mistakes, and, more importantly, having an awareness of those penalties (that informs its decision-making).


They would want to, if they thought they could, because doing so would unblock a ton of valuable use cases. A tax preparation or financial advisor AI would do huge numbers for any company able to promise that its advice can be trusted.


"With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field."

I highly doubt that. o3 is many orders of magnitude more expensive than paying subject matter experts to create new data. It just doesn't make sense to pay six figures in compute to get o3 to make data a human could make for a few hundred dollars.


Yes, I think they had to push this reveal forward because their investors were getting antsy with the lack of visible progress to justify continuing rising valuations. There is no other reason a confident company making continuous rapid progress would feel the need to reveal a product that 99% of companies worldwide couldn't use at the time of the reveal.

That being said, if OpenAI is burning cash at lightspeed and doesn't have to publicly reveal the revenue they receive from certain government entities, it wouldn't come as a surprise if they let the government play with it early on in exchange for some much needed cash to set on fire.

EDIT: The fact that multiple sites seem to be publishing GPT-5 stories similar to this one leads one to conclude that the o3 benchmark story was meant to counter the negativity from this and other similar articles that are just coming out.


Can SMEs deliver that data in a meaningful amount of time? Training data now is worth significantly more than data a year from now.


>churning out new thinking at expert level across every field

I suspect this is really, "churning out text that impresses management".


Seems to me o3 prices would be what the consumer pays, not what OpenAI pays. That would mean o3 could be more efficient in-house than paying subject-matter experts.


For every consumer there will be a period where they need both the SME and the o3 model for initial calibration and eventual handoff for actually getting those efficiencies in whichever processes they want to automate.

In other words, if you are diligent enough, you should at least validate your o3 solution with an actual expert for some time. You wouldn't just blindly trust OpenAI with your business-critical processes, would you? I would expect at least 3-6 months for large corps, and even more considering change management, re-upskilling, etc.

With all those considerations I really don't see the value prop at those prices and in those situations right now. Maybe if costs decrease ~1-3 orders of magnitude more for o3-low, depending on the processes being automated.


What is OpenAI's margin on that product?


That’s an interesting idea. What if OpenAI funded medical research initiatives in exchange for exclusive training rights on the research.


It would be orders of magnitude cheaper to outsource to humans.


Not as sexy to investors though


Wait didn't they just recently request researchers to pair up with them in exchange for the data?


Someone needs to dress up Mechanical Turk and repackage it as an AI company…..


That’s basically every AI company that existed before GPT3


Unless the quality of the human data is extraordinary, it seems, according to TFA, that it's not that easy:

> The process is painfully slow. GPT-4 was trained on an estimated 13 trillion tokens. A thousand people writing 5,000 words a day would take months to produce a billion tokens.

And if the human-generated data were so qualitatively good that it could be smaller by three orders of magnitude, then I can assume it would be at least as expensive as o3.
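
A quick back-of-the-envelope check of that pace (the ~1.3 tokens-per-word ratio is my own assumption; the exact figure varies by tokenizer):

    # Rough arithmetic for the "months to produce a billion tokens" claim.
    # Assumes ~1.3 tokens per English word (tokenizer-dependent).
    writers = 1_000
    words_per_day = 5_000
    tokens_per_word = 1.3

    tokens_per_day = writers * words_per_day * tokens_per_word   # ~6.5M tokens/day
    days_per_billion = 1_000_000_000 / tokens_per_day            # ~150 days

    print(f"{tokens_per_day:,.0f} tokens/day, ~{days_per_billion:.0f} days per billion")
    # At that rate, GPT-4's estimated 13T tokens would take this group
    # roughly 2 million days, i.e. several thousand years.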


Only a matter of time. The costs are aggressively going down. And with specialized inference hardware it will go further down.

Cost of coordination is also large. Immediate answers are an advantage/selling point.


> OpenAI’s next moat

I don't think oai has any moat at all. If you look around, QwQ from Alibaba is already pushing o1-preview performances. I think oai is only ahead by 3~6 months at most.


If their AGI dreams come true, a 3-month head start might be more than enough. They probably won't, but it's interesting to ponder what the next few hours, days, and weeks would look like for someone wielding AGI.

Like let's say you have a few datacenters of compute at your disposal and the ability to instantiate millions of AGI agents - what do you have them do?

I wonder if the USA already has a secret program for this under national defense. But it is interesting that once you do control an actual AGI you'd want to speed-run a bunch of things. In opposition to that, how do you detect an adversary already has / is using it and what to do in that case.


How many important problems are there where a 3 month head start on the data side is enough to win permanently and retain your advantage in the long run?

I'm struggling to think of a scenario where "I have AGI in January and everyone else has it in April" is life-changing. It's a win, for sure, and it's an advantage, but success in business requires sustainable growth and manageable costs.

If (random example) the bargain OpenAI strikes is "we spend every cent of our available capital to get AGI 3 months before the other guys do" they've now tapped all the resources they would need to leverage AGI and turn it into profitable, scalable businesses, while the other guys can take it slow and arrive with full pockets. I don't think their leadership is stupid enough to burn all their resources chasing AGI but it does seem like operating and training costs are an ongoing problem for them.

History is littered with first-movers who came up with something first and then failed to execute on it, only for someone else to follow up and actually turn the idea into a success. I don't see any reason to assume that the "first AGI" is going to be the only successful AGI on the market, or even a success at all. Even if you've developed an AGI that can change the world you need to keep it running so it can do that.

Consider it this way: Sam Altman & his ilk have been talking up how dangerous OpenAI's technology is. Are risk-averse businessmen and politicians going to be lining up to put their livelihood or even their lives in the hands of "dangerous technology"? Or are they going to wait 3-6 months and adopt the "safe" AGI from somebody else instead?


Well, that's the thought exercise. Is there something you can do with almost unlimited "brains" of roughly human capability but much faster, within a few days / weeks / months? Let's say you can instantiate 1 million agents for 3 months, and each of them is roughly 100x faster than a human; that means you have the equivalent of 100 million human brains working for three months to dump into whatever you want, as long as your plans don't require building too many real-world things that actually require moving atoms around. I think you could do some interesting things. You could potentially dump a few million hours into "better than AGI AI" to start off, for example, then move on to other things. If they are good enough, you might be able to find enough zero-days to disable any adversary through software, among other interesting things.


Where does "almost unlimited" come into the picture though? I see people talking like AGI will be unlimited when it will be limited by available compute resources, and like I suggested, being 'first' might come at the cost of the war chest you'd need to access those resources.

What does it take to instantiate 1 million agents? Who has that kind of money and hardware? Would they still have it if they burn everything in the tank to be first?


> Where does "almost unlimited" come into the picture though

>> Like let's say you have a few datacenters of compute at your disposal and the ability to instantiate millions of AGI agents - what do you have them do?

> has that kind of money and hardware?

Any hyperscaler plus most geopolitical main players. So the ones who matter.


Once you have AGI you use it to collect resources to cripple competitors and to build a snowball effect to make yourself unbeatable. 3 months of AGI is enough in the right hands to dominate the world economically.


Only if the AGI is cheaper than a human; if the AGI is more expensive than a human, there won't be any snowballing. And the most likely case is that the first AGI is more expensive to run than a human; a few months of overly expensive human-level AI bots won't disrupt the world at all.


That is why being #2 in technical product development can be great. Someone else pays to work out the kinks, copy what works and improve on it at a fraction of the cost. You see it time and time again.


I’m curious how, if at all, they plan to get around compounding bias in synthetic data generated by models trained on synthetic data.


Everyone's obsessed with new training tokens... It doesn't need to be more knowledgeable, it just needs to practice more. Ask any student: practice is synthetic data.


That leads to overfitting in ML land, which hurts overall performance.

We know that unique data improves performance.

These LLM systems are not students…

Also, which students graduate and are immediately experts in their fields? Almost none.

It takes years of practice in unique, often one-off, situations after graduation for most people to develop the intuition needed for a given field.


It's overfitting when you train too large a model on too many details. Rote memorization isn't rewarding.

The more concepts the model manages to grok, the more nonlinear its capabilities will be: we don't have a data problem, we have an educational one.

Claude 3.5 was safety trained by Claude 3.0, and it's more coherent for it. https://www.anthropic.com/news/claudes-constitution


Overfitting can be caused by a lot of different things. Having an overabundance of one kind of data in a training set is one of those causes.

It’s why many pre-processing steps for image training pipelines add copies of images with varied rotations, amounts of blur, and different cropping.
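
For what it's worth, a minimal sketch of that kind of augmentation pipeline using torchvision; the specific parameter values are just illustrative:

    # Minimal augmentation pipeline of the kind described above: each training
    # image is seen with varied rotation, blur, and cropping so the model can't
    # latch onto one canonical presentation. Parameter values are illustrative.
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomRotation(degrees=15),    # "weird rotations"
        transforms.GaussianBlur(kernel_size=3),   # varying amounts of blur
        transforms.RandomResizedCrop(size=224),   # different cropping
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    # Applied on the fly, so every epoch sees slightly different copies of each image.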

> The more concepts the model manages to grok, the more nonlinear its capabilities will be

These kinds of hand-wavy statements like “practice,” “grok,” and “nonlinear its capabilities will be” are not very constructive, as they don’t have solid meaning wrt language models.

So earlier when I was referring to compounding bias in synthetic data I was referring to a bias that gets trained on over and over and over again.

That leads to overfitting.


> These kinds of hand-wavy statements like “practice,” “grok,” and “nonlinear its capabilities will be” are not very constructive, as they don’t have solid meaning wrt language models.

So here's my hypothesis, as someone who is adjacent to ML but hasn't trained DNNs directly:

We don't understand how they work, because we didn't build them. They built themselves.

At face value this can be seen as an almost spiritual position, but I am not a religious person and I don't think there's any magic involved. Unlike traditional models, the behavior of DNNs is based on random changes that failed up. We can reason about their structure, but only loosely about their functionality. When they get better at drawing, it isn't because we taught them to draw. When they get better at reasoning, it isn't because the engineers were better philosophers. Given this, there will not be a direct correlation between inputs and capabilities, but some arrangements do work better than others.

If this is the case, high-order capabilities should continue to increase with training cycles, as long as they are performed in ways that don't interfere with what has been successfully learned. People lamented the loss of capability that GPT-4 suffered as they increased safety. I think Anthropic has avoided this by choosing a less damaging way to tune a well-performing model.

I think these ideas are supported by Wolfram's reduction of the problem at https://writings.stephenwolfram.com/2024/08/whats-really-goi...


Your whole argument falls apart at

> We don't understand how they work, because we didn't build them. They built themselves.

We do understand how they work; we did build them. The mathematical foundations of these models are sound. The statistics behind them are well understood.

What we don’t exactly know is which parameters correspond to what results as it’s different across models.

We work backwards to see which parts of the network seem to relate to what outcomes.

> When they get better at drawing, it isn't because we taught them to draw. When they get better at reasoning, it isn't because the engineers were better philosophers.

Isn’t this the exact opposite of reality?

They get better at drawing because we improve their datasets, topologies, and their training methods and in doing so, teach them to draw.

They get better at reasoning because the engineers and data scientists building training sets do get better at philosophy.

They study what reasoning is and apply those learnings to the datasets and training methods.

That’s how CoT came about early on.


> We do understand how they work; we did build them. The mathematical foundations of these models are sound. The statistics behind them are well understood.

We don't understand how they work in the sense that we can't extract the algorithms they're using to accomplish the interesting/valuable "intellectual" labor they're doing. i.e. we cannot take GPT-4 and write human-legible code that faithfully represents the "heavy lifting" GPT-4 does when it writes code (or pick any other task you might ask it to do).

That inability makes it difficult to reliably predict when they'll fail, how to improve them in specific ways, etc.

The only way in which we "understand" them is that we understand the training process which created them (and even that's limited to reproducible open-source models), which is about as accurate as saying that we "understand" human cognition because we know about evolution. In reality, we understand very little about human cognition, certainly not enough to reliably reproduce it in silico or intervene on it without a bunch of very expensive (and failure-prone) trial-and-error.


> We don't understand how they work in the sense that we can't extract the algorithms they're using to accomplish the interesting/valuable "intellectual" labor they're doing. i.e. we cannot take GPT-4 and write human-legible code that faithfully represents the "heavy lifting" GPT-4 does when it writes code (or pick any other task you might ask it to do).

I think English is being a little clumsy here. At least I’m finding it hard to express what we do and don’t know.

We know why these models work. We know precisely how, physically, they come to their conclusions (it’s just processor instructions as with all software)

We don’t know precisely how to describe what they do in a formalized general way.

That is still very different from say an organic brain, where we barely even know how it works, physically.

My opinions:

I don’t think they are doing much mental “labor.” My intuition likens them to search.

They seem to excel at retrieving information encoded in their weights through training and in the context.

They are not good at generalizing.

They also, obviously, are able to accurately predict tokens such that the resulting text is very readable.

Larger models have a larger pool of information, and that information is at a higher resolution, so to speak, since the larger, better-performing models have more parameters.

I think much of this talk of “consciousness” or “AGI” is very much a product of human imagination, personification bias, and marketing.


>We know why these models work. We know precisely how, physically, they come to their conclusions (it’s just processor instructions as with all software)

I don't know why you would classify this as knowing much of anything. Processor instructions? Really?

If the average user is given unfettered access to the entire source code of his/her favorite app, does he suddenly understand it? That seems like a ridiculous assertion.

In reality, it's even worse. We can't pinpoint which weights are contributing, how, and in what ways to basic things like whether a word should be preceded by 'the' or 'a', and it only gets more intractable as models get bigger and bigger.

Sure, you could probably say we understand these NNs better than brains but it's not by much at all.


> If the average user is given unfettered access to the entire source code of his/her favorite app, does he suddenly understand it? That seems like a ridiculous assertion.

And one that I didn’t make.

I don’t think when we say “we understand” we’re talking about your average Joe.

I mean “we” as in all of human knowledge.

> We can't pinpoint what weights, how and in what ways and instances are contributing exactly to basic things like whether a word should be preceded by 'the' or 'a' and it only gets more intractable as models get bigger and bigger.

There is research coming out on this subject. I read a paper recently about how llama’s weights seemed to be grouped by concept like “president” or “actors.”

But just the fact that we know that information encoded in weights affects outcomes and we know the underlying mechanisms involved in the creation of those weights and the execution of the model shows that we know much more about how they work than an organic brain.

The whole organic brain thing is kind of a tangent anyway.

My point is that it’s not correct to say that we don’t know how these systems work. We do. It’s not voodoo.

We just don’t have a high level understanding of the form in which information is encoded in the weights of any given model.


> If the average user is given unfettered access to the entire source code of his/her favorite app, does he suddenly understand it? That seems like a ridiculous assertion. And one that I didn’t make. I don’t think when we say “we understand” we’re talking about your average Joe. I mean “we” as in all of human knowledge.

It's an analogy. In understanding weights, even the best researchers are basically like the untrained average Joe with source code.

>There is research coming out on this subject. I read a paper recently about how llama’s weights seemed to be grouped by concept like “president” or “actors.”

>But just the fact that we know that information encoded in weights affects outcomes and we know the underlying mechanisms involved in the creation of those weights and the execution of the model shows that we know much more about how they work than an organic brain.

I guess I just don't see how "information is encoded in the weights" is some great understanding? It's as vague and un-actionable as you can get.

For training, the whole revolution of back-propagation and NNs in general is that we found a way to reinforce the right connections without knowing anything about how to form them or even what they actually are.

We no longer needed to understand how eyes detect objects to build an object detecting model. None of that knowledge suddenly poofed into our heads. Back-propagation is basically "reinforce whatever layers are closer to the right answer". Extremely powerful but useless for understanding.

Knowing the Transformer architecture unfortunately tells you very little about what a trained model is actually learning during training and what it has actually learnt.

"Information is encoded in a brain's neurons and this affects our actions". Literally nothing useful you can do with this information. That's why models need to be trained to fix even little issues.

If you want to say we understand models better than the brain then sure but you are severely overestimating how much that "better" is.


> It's as vague and un-actionable as you can get.

But it isn’t. Knowing that information is encoded in the weights gives us a route to deduce what a given model is doing.

And we are. Research is being done there.

> "Information is encoded in a brain's neurons and this affects our actions". Literally nothing useful you can do with this.

Different entirely. We don’t even know how to conceptualize how data is stored in the brain at all.

With a machine, we know everything. The data is stored in a binary format which represents a decimal number.

We also know what information should be present.

We can and are using this knowledge to reverse engineer what a given model is doing.

That is not something we can do with a brain because we don’t know how a brain works. The best we can do is see that there’s more blood flow in one area during certain tasks.

With these statistical models, we can carve out entire chunks of their weights and see what happens (interestingly not much. Apparently most weights don’t contribute significantly towards any token and can be ignored with little performance loss)

We can do that with these transformer models because we do know how they work.

Just because we don’t understand every aspect of every single model doesn’t mean we don’t know how they work.

I think we’re starting to run in circles and maybe splitting hairs over what “know how something works” means.

I don’t think we’re going to get much more constructive than this.

I highly recommend looking into LoRAs. We can make LoRAs because we know how these models work.

We can’t do that for organic brains.
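
For reference, a LoRA adapter is small enough to sketch here. This is a toy version of the idea (my own simplification, not any particular library's implementation): freeze the pretrained weight and learn only a low-rank correction on top of it.

    import torch
    import torch.nn as nn

    # Toy LoRA-style adapter: keep the pretrained weight W frozen and learn only
    # a low-rank update B @ A on top of it. Simplified for illustration.
    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                      # frozen pretrained weights
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op

        def forward(self, x):
            return self.base(x) + x @ (self.B @ self.A).T    # W x + (B A) x

The point being: we understand the mechanics well enough to bolt a trainable low-rank correction onto a frozen model and predict how it will behave.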


The thing that you are handwaving away as just "which parameters correspond to what results" is precisely the important, inexorable thing that defines the phenomenon, and it is exactly the thing we don't have access to, and which we did not and could not design, plan or engineer, but which emerged.


> which we did not and could not design, plan or engineer, but which emerged

We literally designed, planned, and engineered the environment and mechanisms which created those weights.

It’s just code. We can train models by hand too, it’d just take a lot longer.

It’s literally something we made, just from a higher order place.

Which exact weights correspond to which outputs will vary from model to model. There is research going into this subject for llama.

It’s not like we’re in the dark as to the principles that allow LLMs to make predictions.

My whole point is that to say “we don’t know how AI works” is just not true


Please, read the Wolfram blog


I gave it a fair skim, but I didn’t really feel like it refuted what I said.

Is there a specific section that comes to mind?


Other than we don't tell it how to get the right answer, or understand how it eventually computes correct answers?


I don’t really think you’re understanding my argument…


And who will tell the model whether its practice results are correct or not? Students practice against external evaluators; it’s not a self-contained system.


Synthetic data is fine if you can ground the model somehow. That's why o1/o3's improvements are mostly in reasoning, maths, etc.: you can easily tell whether the data is wrong or not.
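
A hypothetical sketch of that grounding step for math-style data: sample several attempts and keep only the ones a verifier accepts. The `generate` callable and the "Answer:" convention are placeholders, not anyone's actual pipeline.

    # Hypothetical grounding filter: keep only synthetic solutions whose final
    # answer can be checked programmatically. `generate` is a placeholder for
    # whatever sampling call the real pipeline uses.
    def extract_final_answer(text: str) -> str:
        # Toy convention: the solution ends with a line like "Answer: 42".
        return text.rsplit("Answer:", 1)[-1].strip()

    def build_grounded_dataset(generate, problems, attempts=8):
        kept = []
        for prompt, known_answer in problems:
            for _ in range(attempts):
                solution = generate(prompt)
                if extract_final_answer(solution) == known_answer:  # the grounding step
                    kept.append((prompt, solution))
                    break
        return kept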


That makes a lot of sense.

Binary success criteria leave very little room for bias.


> With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field. OpenAI’s next moat may be the best synthetic training set ever.

Even taking OpenAI and the benchmark authors at their word that it consumes at least tens of dollars per task to hit peak performance, how much would it cost to have it produce a meaningfully large training set?


That's the public API price, isn't it?


There is no public API for o3 yet; those are the numbers they revealed in the ARC-AGI announcement. Even if they were public API prices, we can't assume they're making a profit on those for as long as they're billions in the red overall every year; it's entirely possible that the public API prices are less than what OpenAI is actually paying.


I completely don't understand the use for synthetic data. What good is it to train a model basically on itself?


The value of synthetic data relies on having non-zero signal about which generated data is "better" or "worse". In a sense, this is what reinforcement learning is about. I.e., generate some data, have that data scored by some evaluator, and then feed the data back into the model with higher weight on the better stuff and lower weight on the worse stuff.

The basic loop is: (i) generate synthetic data, (ii) rate synthetic data, (iii) update model to put more probability on better data and less probability on worse data, then go back to (i).
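
In code, that loop might look something like this. `generate`, `score`, and `finetune_weighted` are placeholders for whatever sampler, evaluator, and weighted-update step a given lab actually uses; this is a schematic, not any particular system.

    # Schematic generate -> rate -> update loop described above; every callable
    # here is a placeholder.
    def synthetic_data_loop(model, reward_model, prompts, rounds=3):
        for _ in range(rounds):
            # (i) generate synthetic data
            samples = [(p, model.generate(p)) for p in prompts]
            # (ii) rate it with some evaluator (reward model, verifier, humans, ...)
            scored = [(p, out, reward_model.score(p, out)) for p, out in samples]
            # (iii) push probability mass toward higher-rated outputs and away
            # from lower-rated ones, e.g. by scaling each example's loss by its score
            model = finetune_weighted(model, scored)
        return model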


But who rates the synthetic data? If it is humans, I can understand that this is another way to get human knowledge into it, but if it's rated by AI, isn't it just a convoluted way of copying the rating AI's knowledge?


Many things are more easily scored than produced. Like, it's trivial to tell whether a poem rhymes, but writing one is a comparatively slow and difficult task. So, since scoring is easier and more discerning than generating, the idea is you can generate stuff, classify it as good or bad, and then retrain on the good stuff. It's kind of an article of faith for a lot of AI companies/professionals as well, since it prevents you from having to face a data wall, and it's analogous, in an appealing way, to a human student practicing and learning.

As far as I know it doesn't work very well so far. It is prone to overfitting, where it ranks some trivial detail of the output highly, e.g. "if a summary starts with a byline of the author, it's a sign of quality", and then starts looping on itself over and over, increasing the frequency and size of bylines until it's totally cranked off to infinity and just repeating a short phrase endlessly. Humans have good baselines and common sense that these ML systems lack; if you've ever seen one of those "deep dream" images, it's the same kind of idea. The "most possible dog" image can look almost nothing like a dog, in the same way that the "most possible poem" may look nothing like a poem.


This is the bit I've never understood about training AI on its own output; won't you just regress to the mean?


It's not trained on its own output. You can generate infinite correctly worked out math traces and train on those.


Thanks, that makes a lot more sense.


This is a good read for some examples https://arxiv.org/abs/2203.14465

> This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers

But there are a few others. In general, good data is good data. We're definitely learning more about how to produce good synthetic versions of it.
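
A condensed sketch of the STaR loop quoted above (simplified; `sample_rationale` and `finetune` are stand-ins for the paper's actual procedures):

    # Simplified STaR round: keep rationales that reach the right answer; if the
    # model fails, retry with the correct answer as a hint ("rationalization").
    def star_round(model, dataset):
        keep = []
        for question, correct_answer in dataset:
            rationale, answer = sample_rationale(model, question)
            if answer != correct_answer:
                # Rationalization: show the answer and ask for a justification.
                rationale, answer = sample_rationale(model, question, hint=correct_answer)
            if answer == correct_answer:
                keep.append((question, rationale, correct_answer))
        return finetune(model, keep)   # then repeat with the updated model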


One issue with that is that the model may learn to smuggle data. You as a human think that the plain reading of the words is what is doing the reasoning, but (part of) the processing is done by the exact comma placement and synonym choice etc.

Data smuggling is a known phenomenon in similar tasks.


I don't think data smuggling is relevant in STaR-style scenarios. You're still validating the final output. If it works on test data, what could even be smuggled?


> What good it's it to train a model basically on itself?

If the model generates data of variable quality, and if there's a good way to distinguish good data from bad data, then training on self-generated data might "bootstrap" a model to better performance.

This is common in reinforcement learning. Famously, AlphaGo Zero (https://en.wikipedia.org/wiki/AlphaGo_Zero) learned exclusively on self-play, without reference to human-played games.

Of course, games have a built-in critic: the better strategy usually wins. It's much harder to judge the answer to a math problem, or decide which essay is more persuasive, or evaluate restaurant recommendations.
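
That built-in critic is what makes game self-play so clean. Schematically (with `play_game` and `policy` as placeholder functions, not AlphaGo's actual code):

    # Toy self-play data collection: the game outcome is the only label, so no
    # human annotation is needed.
    def self_play_examples(policy, n_games=1000):
        examples = []
        for _ in range(n_games):
            moves, winner = play_game(policy, policy)       # policy plays itself
            for player, state, action in moves:
                outcome = 1 if player == winner else -1     # built-in critic
                examples.append((state, action, outcome))
        return examples   # feed back into training, then repeat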


If we get to a point where we have a model that, when fed a real-world stream of data (YouTube, surveillance cameras, forum data, cell phone conversations, etc.), can prune out a good training set for itself, then you’re at the point where the LLM is in a feedback loop where it can improve itself. That’s AGI for all intents and purposes.


There is an enormous "iceberg" of untapped non-public data locked behind paywalls or licensing agreements. The next frontier will be spending money and human effort to get access to that data, then transform it into something useful for training.


ah yes the beautiful iceberg of internal documentation, legal paperwork, and meeting notes.

the highest quality language data that exists is in the public domain



