Will we run out of ML data? Evidence from projecting dataset size trends (2022) (epochai.org)
66 points by kurhan on April 23, 2023 | 121 comments



The last gen of popular LLMs focuses on publicly accessible web text. But we have lots of other sources of "latent" or "hidden" text. For example, OpenAI's Whisper model can turn audio to text reliably. If you point Whisper at the world's podcasts, that's a whole new source of conversational text. If you point Whisper at YouTube, that's a whole new source of all sorts of text. And then there are all sorts of private sources of text, like UpToDate for doctors, LexisNexis for lawyers, and so forth. I suspect "running out" isn't a within-a-decade concern, especially since text or text-equivalent data grows exponentially in the present internet environment. I think the bigger challenge will be distinguishing human-generated from AI-generated data after 2023.
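
To give a sense of how little glue code that takes, here's a minimal sketch with the open-source whisper package (model size and file name are just placeholders):

    import whisper  # pip install openai-whisper

    # Load a pretrained speech-to-text model; "base" trades accuracy for speed.
    model = whisper.load_model("base")

    # Transcribe one podcast episode (placeholder path) into plain text.
    result = model.transcribe("episode_001.mp3")
    print(result["text"])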


> If you point Whisper at YouTube, that's a whole new source of all sorts of text.

A lot of YT videos already have autogenerated English subtitles, which are available as a VTT download, so you don't even need to run Whisper on a video to obtain them!
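
Roughly, with the yt-dlp Python API (option names from memory, so worth double-checking against the docs; the URL is a placeholder):

    from yt_dlp import YoutubeDL  # pip install yt-dlp

    opts = {
        "skip_download": True,        # we only want the caption file
        "writeautomaticsub": True,    # the auto-generated subtitles
        "subtitleslangs": ["en"],
        "subtitlesformat": "vtt",
        "outtmpl": "%(id)s.%(ext)s",
    }

    with YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/watch?v=EXAMPLE_ID"])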


But how much more data is required to make a big difference? Is doubling the dataset considered a dramatic improvement? Or is increasing the dataset by 10x needed?


Also, quality is likely important: will the models get better if we train them on YouTube comments?


Brute force approaches always hit some wall. ML will be no different. In the decades to come it is quite likely that algorithms will develop in directions orthogonal to current approaches. The idea that you improve performance by throwing gazillions of data points into gargantuan models might even come to be seen as laughable.

Keep in mind (pun) that the only real intelligence here is us, and we are pretty good at figuring out when a tool has exhausted its utility.


We won't hit the wall.

Somewhat counterintuitively, scaling datasets is the lazy and economical approach. If you have the compute already, you might as well dig up an OOM more text tokens.

But there are other sources of data, and slightly different ways to utilize it. Multimodality, in very large training runs, will almost inevitably increase sample efficiency (for obvious reasons of context richness); synthetic data is already very effective [1]; and other ways to do more under conditions of diminishing raw text resources exist or will be discovered. But a thorough abandonment of the scaling strategy is very unlikely.

Sutton's Bitter Lesson [2] points at a very powerful rule of thumb: we shouldn't turn AI engineering into a contest of smartness; we should allow complex smartness to emerge from generic low-level algorithms. What will be seen as laughable in decades to come is not the scaling strategy, but the Godlike conceit of people who thought they could devise generally applicable rules of reasoning from first principles.

1: https://arxiv.org/abs/2304.08466 2: http://www.incompleteideas.net/IncIdeas/BitterLesson.html


I don't think your "synthetic data on ImageNet" reference shows "synthetic data is already very effective". Since many people won't read the paper, here's what it says:

Training ResNet-50 on real ImageNet gives 73.09% top-1 accuracy, while training it on synthetic data (same resolution, same number of images) generated by this work gives 64.96%, which is SOTA compared to previous work's 63.02%. Therefore, synthetic data is worse than real data for now.

But synthetic data is not useless, because training on real data plus synthetic data is a bit better than either alone. (Accuracy here differs due to different methodology.) Using 1:1 real to synthetic data improves accuracy from 76.39% to 77.61%. But 1:2 is worse than 1:1 (77.16%), even though the dataset is 50% larger. With 1:4, the result is worse than not using synthetic data at all. So synthetic data can at best enlarge the dataset by about 5x, more likely just 2x.


I wonder how much you can improve that scaling factor by using data augmentation techniques (noise, rescaling, recropping, rotation, changing colors, using normal maps, etc).
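
For reference, a minimal torchvision-style pipeline covering a few of those augmentations (the parameters are arbitrary):

    import torch
    import torchvision.transforms as T

    # Each image is perturbed differently on every pass, which effectively
    # enlarges the dataset without needing any new labels.
    augment = T.Compose([
        T.RandomResizedCrop(224),       # rescaling + recropping
        T.RandomHorizontalFlip(),
        T.RandomRotation(degrees=15),   # rotation
        T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # color changes
        T.ToTensor(),
        T.Lambda(lambda x: x + 0.05 * torch.randn_like(x)),  # additive noise
    ])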


You are masquerading personal preferences (and possibly professional interests) as rules of nature. If anything, Godlike conceit definitely applies to some ML acolytes.

In any case, with your last point "we should allow complex smartness to emerge" you essentially agree with my point that new levels will emerge from orthogonal (new) directions.

The good thing about brute force is that it summons so many resources it primes the way for smarter approaches.

For those not conceited the objective is not some deus-ex-machina but "algorithms that work".


It is interesting that you don't even hide having a strong personal emotional preference at stake. Now, does this not suggest that your predictions are a priori less credible, by your own logic?

No, I don't think "orthogonal" directions will be fruitful.

I also disagree on evaluations. What you call brute force is not brute force at all, nor a deus ex machina; it is a lawful and honest method of algorithmic discovery of true regularities. "Smarter approaches", meanwhile, usually amount to stilted expressions of the narcissism of researchers overly proud of having come up with shallow tricks aping some aspect of explicit human reasoning. They're not actually smart, nor do they work far outside of the toy distribution for which they were developed.


But we can devise generally applicable rules of reasoning from first principles. It's called logic. I am pretty sure the next step is to combine machine learning and logic properly.


Seems unlikely; that never worked in the past. And humans don't actually use logic (especially formal logic) to come up with anything. They just use it to justify what they came up with.

Not even mathematicians think in terms of logic when trying to solve problems.


Of course mathematicians also think in terms of logic. It's what you learn when you study mathematics; you soak it up automatically, although few study logic explicitly. And before 2015, a machine beating the world's best Go player also seemed pretty unlikely.


I've studied mathematics.

You only do (formal) logic as an afterthought when communicating your proofs to other people or writing them down. Otherwise it's mostly intuition.


I've studied mathematics, too. Yes, formal logic is an afterthought when you do mathematics. But formal logic is just an explicit representation of what goes on internally in a mathematician. Or at least that's how I approach formal logic (most logicians don't). I would describe these internal processes inside a mathematician (and outside, when used for communication) as intuition + "logic to keep intuition in check". Sounds like ML + logic to me.


There are already tons of systems (for example Google Translate) that combine rule-based reasoning with probabilistic reasoning. Looks to be working to me.


Interesting. Do you have any sources on Google Translate using rule-based reasoning?


Machine Translation, by Thierry Poibeau, 2017.


Alas, that was around the time Google Translate switched to Neural Networks:

See https://blog.google/products/translate/found-translation-mor... and https://en.wikipedia.org/wiki/Google_Neural_Machine_Translat...

It doesn't look like they are using any rule-based reasoning anymore?

The blog post says:

> With this update, Google Translate is improving more in a single leap than we’ve seen in the last ten years combined. [...]

Which seems like pretty strong evidence to me that moving away from rule-based reasoning, or even from a hybrid approach that includes it, was a clear win?


It's not "strong evidence" but a strong claim.


First principles don't work in the space of systems geared towards extreme generalization such as LLMs. You need to be ready to compare anything with anything and build bridges between many principles. In fact there is a deep link between the progress of structuralism in mathematics culminating with homotopy type theory and its parallel (r)evolution in the humanities with the discovery of manuscripts by the founder of structural linguistics, Ferdinand de Saussure.

Identity is what provides the irreducible basis, in the sense that we cannot enter into the consideration of specific facts that are placed under this identity, and it is this identity that becomes for us the true concrete fact, beyond which there is nothing more.

...

For example, for a musical composition, compared to a painting. Where does a musical composition exist? It is the same question as to know where 'aka' exists. In reality, this composition only exists when it is performed; but to consider this performance as its existence is false. Its existence is the identity of the performances.

...

For each of the things we have considered as a truth, we have arrived through so many different paths that we confess we do not know which one should be preferred. To properly present the entirety of our propositions, it would be necessary to adopt a fixed and defined starting point. But what we are trying to establish is that it is false to admit in linguistics a single fact as defined in itself. There is, therefore, a necessary absence of any starting point, and if some reader is willing to follow our thoughts carefully from one end to the other of this volume, they will recognize, we are convinced, that it was, so to speak, impossible to follow a very rigorous order. We will allow ourselves to present, up to three or four times in different forms, the same idea to the reader because there really is no starting point more appropriate than another on which to base the demonstration.

...

As language offers no substance under any of its manifestations, but only combined or isolated actions of physiological, physical, and mental forces, and as nevertheless all our distinctions, our terminology, and all our ways of speaking are based on this involuntary assumption of a substance, we cannot refuse, first and foremost, to recognize that the most essential task of the theory of language will be to untangle what our primary distinctions are all about.

...

There are different types of identity. This is what creates different orders of linguistic facts. Outside of any identity relationship, a linguistic fact does not exist. However, the identity relationship depends on a variable point of view that one decides to adopt; therefore, there is no rudiment of a linguistic fact outside the defined point of view that presides over distinctions.

Source: http://www.revue-texto.net/docannexe/file/116/saussure255_6....

TL;DR: identity is equivalent to equivalence


There is no reason why logic cannot follow various different threads of reasoning, interweave them, merge them, split them again, etc. Logic constitutes a first principle of utmost generality; actually, I cannot imagine anything more general. Identity is not equivalent to equivalence: equivalence is a quotient of identity, consisting of two classes, those values which are identical to True and those which are not.


> Identity is not equivalent to equivalence

When talking about identity/equivalence of types in the context of homotopy type theory, yes. This is literally what the univalence axiom states.

Auggierose, I'm curious about your thoughts on how we can provide more rigor to LLMs when it comes to large-scale program transformations and proof synthesis. Given the complexity and versatility of these systems, what kind of foundational framework do you believe would enable GPT and similar models to synthesize and execute proofs rigorously? How can we ensure that they are both reliable and adaptable while dealing with various mathematical and logical domains?

More importantly, how would this relate to NLP tasks such as: alright, the story is good, but can you rewrite it in the style of Auggierose?


I am not a fan of HoTT, as nobody has managed to explain to me its supposed advantages in terms that didn't border on mysticism.

Anyway, your question is very interesting! :-)


AI had a winter of many decades because the hardware wasn't there and there were better alternatives, especially for neural nets. Now ChatGPT etc. comes out, with unbelievable results, decades in the making. And a couple of months in, we're already writing it off because of the next limitation? Maybe let's give it more than a month or two to figure out if we even need all that data. I hear they're already talking about trying to significantly reduce the number of model parameters, even though a large increase in model size is apparently the reason GPT-4 was so much better than GPT-3. Give it a minute, IMHO, before making generalizations like this so soon.


Well I imagine the commenter actually understands the domain, the techniques, and is making an informed opinion.

It is possible to form opinions by knowing the domain, rather than drawing an exponential curve of newspaper headlines which trails off "..."


On this note, the data-set available if you start collecting today is tainted with experimental AI content. Not the biggest issue right now but as time goes on this problem will get worse and we'll be basing our simulations of intelligence on the output of our simulations of intelligence, a brave new abstraction.


I see this take a lot and I think it's quite wrong, not fully, but at least missing a couple big points.

I think they already don't blindly feed it just all the garbage raw data they can find, but prefer high quality, well-prepared sources.

And aside from spam, we're not just blindly posting AI content either. We're putting in meaningful prompts, rejecting answers we don't like, and editing answers we do.


I think "high quality, well-prepared sources" would include blogs and articles, which are likely to become heavily influenced by AI (blogs and articles are high quality compared to Reddit posts for example, which were included in the past).

In fact, there's no reason to think that academic papers won't start using language models to write better.

Tainting your text with AI can be as simple as pasting a paragraph in and asking if there's anything to improve.


I honestly don’t understand why “tainting” is such a big deal. Can someone explain it to me?

I see two possible reasons, but neither seems to be worth the purity concern. The first is that AI can be wrong, make stuff up, be confidently incorrect. Anyone who has been on the internet knows this isn’t exactly a game changer.

Second is that we won’t be training AI to be like humans, but like humans + AI. Also doesn’t seem like a big deal. We’re already humans + writing + computers + internet and so on. This cutoff matters for anthropology, but I don’t see how it matters for trying to make a bot that can do my taxes.


I think the best explanation is to look at Google. Google's basic algorithm was that it could look at how people organically interacted on the web and use that as a heuristic for quality - if lots of people are linking to you, you're probably high quality and you'll appear at the top of Google. But that started to break down: (a) because people were gaming that metric for "SEO", (b) because the internet centralized so the organic interactions started to disappear, and (c) because people stopped clicking through links from different sites - why do that when you can just google what you want! Google basically broke this metric by using it.

In the same way, AI is trying to generate text that looks like its training data, but if its training data is AI generated text then it's simply being taught to be more like itself. It slowly starts to work less like a human and more like whatever its own idiosyncrasies are. It's a larger sort of version of the hallucinations it has today. If 50% of all the text on the internet becomes some part AI generated, then a huge part of the training for the next generation of AI will be the shortcomings of the current iteration of AI. And this will get worse as non-AI content moves to exclude itself from training.


> Second is that we won’t be training AI to be like humans, but like humans + AI.

LLMs weren't training AI to be like humans. They were training AI to be able to predict what humans (and other sources of common crawl data) will write next in their texts. This might seem like a small difference but it's not. Consider for example someone whose career is to research ant behavior. Their job in some sense is to be able to predict what an ant will do. Does this mean that in the course of their academic training and scientific research, this researcher is being trained to be like an ant?


> Does this mean that in the course of their academic training and scientific research, this researcher is being trained to be like an ant?

If they act out these predictions and are rewarded based on their accuracy, then yes. They're being trained to be like ants. Not entirely like ants in every way, but like them in specific ways.

There's a big difference with your analogy. Predicting tokens is essentially the same as generating tokens. There's no meaningful objective difference between the activities (I'm ignoring philosophy and focusing on observables). They both lead to a stream of tokens.

For contrast, consider any sport, maybe baseball. I could predict the winner of a game but not be able to win it myself. I could predict the next pitch but not be able to throw it or hit it. There's an execution aspect you can fail at. Being like an ant would also have this aspect. Token prediction doesn't have this, or if it does (maybe turning a vector into an API response?) it's a trivial part of the whole problem.

Maybe I'd be more clear to say "write like humans" instead of "be like humans", though.


Um, really? You think your average 'growth hacker' who is using ChatGPT to exponentially increase the amount of SEO junk they can churn out is checking each answer before they press publish?

Purity, accuracy and relevance of data collected from the internet is going to be a very hard problem.


It always has been; the internet is full of garbage. There are ways of finding the data that is useful to humans, like upvotes.


YouTube has tons of data that's mostly untainted by SEO spam.


> I think they already don't blindly feed it just all the garbage raw data they can find, but prefer high quality, well-prepared sources.

If by that you mean Common Crawl, Wikipedia etc, that's hardly "high quality, well prepared", and very subject to the biases and flaws of the creators who will vary widely in expertise, intelligence and ability.


> as time goes on this problem will get worse and we'll be basing our simulations of intelligence on the output of our simulations of intelligence, a brave new abstraction.

If we build a system where we feed the exhaust of an AI to another one at each step, should we call it the AI Centipede, like in the movies? https://m.imdb.com/list/ls064583741/


100% -- I also see this as a big and emerging problem that future researchers and practitioners will have to deal with.

Posted some thoughts previously here --

https://news.ycombinator.com/item?id=32577822

https://news.ycombinator.com/item?id=33869402


Text on the Internet isn't just tainted by the output of recent relatively smart models.

We had computers spit out text (especially spam) for ages now. You'd have to filter those out, too, if tainting actually was a problem.


We just are not thinking wide enough:

* Train on all of television history, and streaming content.

* Train on YouTube.

* I suspect at some point we'll have a recording of most of people's lives, e.g. live-streaming: https://en.wikipedia.org/wiki/Lifestreaming#Lifecasting


Exactly, put bots into the world with cameras and you have infinite training data. Humans also need a ton of data to train on, and have way more parameters than the biggest ML model today.


You can also gather arbitrarily more video data by just turning on some webcams and pointing them at the world.

In addition you can also feed your system from video games.


Microphones, too.


Yes. When I wrote video, I meant audio and visual.


If we play our cards right, AI could free people up for more valuable pursuits, and the pace of human information production would increase by orders of magnitude


> free people up for more valuable pursuits

It won't roll like that. AI will empower people to be more productive, but it won't free people up, because it makes mistakes, can't help itself, and cannot function autonomously. There is no LLM application that is safe for autonomous usage today. How can we go from 0 to 1? I don't see a path. Self-driving cars still can't reach L5 to completely remove the need for a driver.

But maybe this is a blessing in disguise. It will make AI more like a new ability of humans than of companies. Companies need people to unlock AI efficiencies. And AI tends to become open sourced, so everyone has access to the same tools. AI is not a moat for companies, and the human ability to hand-hold it is tied to individuals. That would make the transition easier. Solving that last 1% of accuracy might encounter exponential friction and last for a while.


Functioning autonomously is not the level needed to free up people.

If your department gets a bunch of entry-level hires or interns, that frees up people in your organization even if they make mistakes, require supervision and can't function autonomously. Similarly, if an AI system can do half of a particular job under human supervision, it can free up (or make redundant) half of the people doing that job.


Yeah, it's not that I think we'll get all the way there, it's a utopia. My expectation is that within 30 years we reduce the work week by a day or two for most people, compensate for our education system's decline, and avoid energy and food crises, and nothing else fundamentally changes


I can’t see us going from Microsoft and OpenAI stealing everyone’s work and selling it without attribution or respect for the GPL (for example) to technological utopia anytime soon.


Especially given the average prognosis of technological progress north of a certain point is dystopia. It's like that thing where you ask everyone how many jellybeans there are in the jar and when you average the guesses the average guess is accurate. Except in this case on a societal level you average the guesses and they come out as "dystopia" but the underlying distribution contains plenty of wildly inaccurate guesses of "eutopia".


Assuming this happens without any violence, which I truly doubt. Huge socioeconomic changes like this always come as a result of violence and uproar.


Why? E.g. in the latter half of the 20th century the US integrated women into the workforce (almost doubling the population eligible for participation in the labour force), without violence or uproar.

There was also remarkably little uproar or violence when the Czech Republic escaped the Iron Curtain and embraced capitalism.


Integrating women into the workforce is not even close to abolishing private capital and the need to work.

The violence would be between billionaires with infinite automation making infinite money, and common folk with no way to eat.


What you're responding to doesn't propose the abolition of obligatory work or private capital, it proposes a decrease in labor hours commensurate with, or conservative in comparison to, an expected increase in productivity


Yeah, sorry about that, but do you think it's realistic? I mean, productivity has been rising for a long time, yet we still have a 5- or 6-day workweek.


Now that's a reasonable debate we can have!

Gains from increases in productivity in the last hundred years[0] seem to be spread between more consumption and shorter working hours[1].

Some people expected that most gains would go towards decreased working hours instead of the spread we have actually seen. Not sure there's much significance behind that?

[0] Or any span of time you might want to pick.

[1] And bigger bureaucratic overheads, but you can count that either as a weird form of consumption or as just productivity not having increased quite as fast.


> Self driving cars still can't reach L5 to completely remove the need for driver.

This will probably be (or already has been) solved by large transformer models or their successor architectures.

What was missing was common sense reasoning about what they see. We now have that.


> the pace of human information production would increase by orders of magnitude

you mean boilerplate and spam right?


> free people

As opposed to what? Being "captive" in jobs for paying bills?


I mean... Yes.

What would you suggest as the alternative?


Erm, not causing mass unemployment by stealing data? Also, people go freely where there's pay. Seems like there aren't many opportunities, and there will be fewer.


So hopefully we play our cards right, by extracting benefits from AI that overcompensate for negative impacts like mass unemployment and the democratization of intellectual property. If the spoils are distributed in such a way that people's standard of living is maintained or improved, people have more leisure time, which the social sciences have shown will not mean people just stop working--they'll work less, but with higher productivity, on things promising a greater benefit to family, community, and society.

Forgive me if I'm misreading, but I'm having trouble with your line of reasoning. Your first reply to me scarequoting "captive" strongly implies an argument that the imperative to seek employment for survival is not a limiting factor on how people spend their time, and therefore that my suggestion that giving people more choice over how they apply their talents could be a good thing is irrelevant; but your child reply implies a concern that AI taking over some human labor will cause mass unemployment and explicitly states choice is declining.

I'm advocating that, since the genie is out of the bottle, AI could be used to free people from toil, just as other labor innovations like machinery and the 40-hour work week have done. Why the dismissive snark? In the abstract, do we not want the same thing?


Unfortunately real life doesn't work the way you describe. People won't be "free" to enjoy more "leisure" time, by "democratizing" the results of their work. Instead, all of this stolen data, will be used to "free" them from jobs and to consolidate corporate control. Say bye to microbusinesses, to freelancers, to indie developers. You know, those people that have been truly free. Similarly, office workers will be "free" to lose the jobs that they chose to perform, and will have nowhere else to go but unemployment lines. All thanks to "democratizing" ip by stealing data.


> All thanks to "democratizing" ip by stealing data.

That horse is already dead. Large models can learn everything, there's nothing that can be done to stop them from learning. It's too easy for them to do it. We can't hold any meaningful IP when models can generate 100 variations only different enough to pass the test. IP is dead. But on its corpse there will grow a new world of applications. We all got new skills; it depends on us whether we use them or not.


Sure. Not long ago people were saying that currency was dead because cryptocurrency had already replaced money, and people were already using it, and already [insert marketing statement]. I see they've now moved on to AI.


>currency is dead because crypto currency has replaced money already

Literally no one said that, you are being ridiculous


I agree! I actually wrote a blogpost about this recently[0], but the TLDR is that ownership is nothing without enforcement, and it has become increasingly difficult to enforce ownership of intellectual property in the modern world — first digital files, then the sharing of those digital files over the Internet, and now generative models that allow people to create high-quality ripoffs of any IP for zero marginal cost. The sheer volume will just be too much to contend with, because you can't sue everyone. In my view, this is a good thing and a long time coming!

[0] https://blog.kaichristensen.com/p/generative-ai-is-the-final...


I agree that not playing our cards right is the default scenario.


As a technology, AI can indeed free people in a productive manner. But it would appear that we started off on the wrong foot. Power will be consolidated in the hands of a few at a scale we haven't seen yet. It all depends, though, on whether we can regulate how data is collected, at least at the basic level of not infringing copyright.


I wonder if the better question is not how we get more training data but:

If we're running out of training data while hallucinations and performance remain so inadequate (per OpenAI's whitepaper), is an autoregressive transformer the right architecture?

Perhaps ongoing work in finetuning will take these models to the next level but ignoring the LLM hype it really does seem like things have plateaued for a while now (with expected gains from scaling).


There is still an order of magnitude more organic text. Ilya Sutskever recently said it was still OK. After that, we'll have to use reinforcement learning (agent GPTs with tools) to generate and self-validate more examples.

One "simple" application would be to build a full index of facts in the whole training corpus. Just pass each document to GPT and ask it to extract the facts. Then create an inverted index, with each fact and its references. This will allow us to generate a wikipedia-like corpus of exhaustive fact research. We can say if a fact is known or not, we can tell if it is settled or controversial, and if it is a preference we can tell what is the distribution. This has got to help with factuality and generate lots of text to feed the model. Basically only costs electricity and GPU. It nicely side-steps the problem of truth by simply modelling the empirical distribution in an explicit way. At least the model won't hallucinate outside the known facts.
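
As a sketch of what I mean (the extract_facts step stands in for a GPT prompt along the lines of "list the factual claims in this text, one per line"; here it's just a naive sentence split so the snippet runs):

    from collections import defaultdict

    def extract_facts(text):
        # Stand-in for an LLM call; naive sentence split for illustration only.
        return [s.strip() for s in text.split(".") if s.strip()]

    corpus = {"doc_001": "...", "doc_002": "..."}  # doc_id -> raw text (placeholders)

    # Inverted index: normalized fact -> documents asserting it.
    fact_index = defaultdict(list)
    for doc_id, text in corpus.items():
        for fact in extract_facts(text):
            fact_index[fact.lower()].append(doc_id)

    # A fact backed by many independent documents looks "settled";
    # one with few or conflicting backers looks controversial or unknown.
    settled = {f: refs for f, refs in fact_index.items() if len(refs) >= 3}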


After the low hanging fruit - the high quality data such as scientific papers, libgen, stackexchange, wikipedia, etc — has been exhausted, that’s it. There’s no more data of that kind. There’s not 9 other wikipedias or 9 other libgens. There is only a certain quantity of high-quality codified knowledge in existence and models need to be able to deal with that constraint. Feeding it more and more lower quality text is not going to improve performance because we already fed it all the text that we use. There’s a reason that PhDs don’t involve reading tumblr all day.


> There’s not 9 other wikipedias

By the way, I wonder how much you could get from "history" data: wikipedia history pages, talk pages, commits diffs on github, pull request discussions, etc.

AFAIK so far we've only been using the finished code "artifacts", but if we're desperate for more tokens to train on, we might get a lot of mileage from just "all different versions of this dataset over time".


There's a reason there are so many review papers - which are just synthesis of a topic in a certain period of time. Second order analysis is useful content, not junk. It can cross reference facts and detect inconsistencies. Combining multiple sources can lead to new insights and learning the trends.


>After that, we got to use reinforcement learning (agent GPTs with tools) to generate and self-validate more examples.

How would you "self-validate" against hallucinated facts?

What makes self-validation possible are hard external rules that can be evaluated independently and automatically. Like the rules of Chess or Go.

We don't have anything like that for LLMs and what people want to use them for.


RLHF seems to suggest that human feedback to tune the model after plain textual data pretraining is quite potent per sample. There might be some optimal ratio of data+model size:rlhf size that works quite favorably for us in getting hallucinations to a minimum. Furthermore there might be some “there” there, in the hallucinations, that has yet to be identified as valuable in itself. Either way it seems like our ability to wrangle these models is getting better


> There is still an order of magnitude more organic text.

Posing this as a thought experiment, agree we still have more data to go. That we are wondering about this suggests that the current approach may be inadequate, i.e. it should not take petabytes of data for a LLM to match the performance of a high school student (for the LLM = AGI folks).

> One "simple" application would be to build a full index of facts in the whole training corpus. Just pass each document to GPT and ask it to extract the facts.

Agree, KG+LLM is a good next step to explore and should address some hallucination issues (see DRAGON from Leskovec and Liang groups). But we're already now talking about architectural changes as I posited.

In any case, where do we get such knowledge graphs (or index of facts)? Some already exist (e.g. Wiki, UMLS) and were created by humans but are clearly inadequate in coverage.

The proposition of using GPT-like models to generate these (i.e. GraphGPT) seems conceptually flawed as GPT does not itself know if a statement is factual or not which is problematic even for humans.

Settled vs controversial is orders of magnitude more complex, how on earth do we do this without human annotation? You can't rely on frequency (i.e. some things were facts for 100 years but all of a sudden they're not anymore and this is not controversial by definition).

The only reason LLMs work as well as they do now is because sheer volume of data (and NTP) makes the noise seem hidden and by definition an autoregressive model should be somewhat impervious to singular factoids (vs a model being grounded by the garbage dump that is CommonCrawl/the internet).

> At least the model won't hallucinate outside the known facts.

Not sure this is a given, even if a model acts as a natural language database of factoids it is probable that it will hallucinate links unless you're strictly grounding output in which case we've just built a colossally over-engineered IR/STS tool.

> One "simple" application

I think what you've posited is actually harder to build than anything that's been achieved thus far with LLMs.


I think the model is almost ready for real-world learning, for example in programming - sure you can have a knowledge base built on documentation, public code, etc.

But at some point you can just give the model access to tools, tell it to solve some problems, build plans, generate logs of each approach and train on those outputs. Programming is ripe for this - all the tools are easily accessible to a digital actor, everything is suited to text based model, there's plenty of tooling to provide feedback and explanations for errors geared towards humans.

No need to fumble with robotics and physical world - you can create a superhuman programmer. Then make it build out the infrastructure for physical world learning. AGI apocalypse here we come !
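
A toy sketch of that loop (generate_solution is a hypothetical stand-in for sampling from the model; the unit test plays the role of the hard external check):

    def generate_solution(prompt):
        # Hypothetical stand-in for sampling a candidate program from the model.
        return "def add(a, b):\n    return a + b\n"

    def passes_tests(source):
        # Hard, automatic validation: execute the candidate and run a unit test.
        scope = {}
        try:
            exec(source, scope)
            return scope["add"](2, 3) == 5
        except Exception:
            return False

    training_examples = []
    prompt = "Write a function add(a, b) that returns the sum of its arguments."
    candidate = generate_solution(prompt)
    if passes_tests(candidate):
        # Only validated (prompt, solution) pairs get fed back as training data.
        training_examples.append((prompt, candidate))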


We also used to train neural networks over multiple 'epochs' of the same data.

Can't we just do that again?

We had techniques like drop-out and data augmentation to help.
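
E.g. a toy PyTorch loop that reuses the same data for several epochs, with dropout as the regularizer (shapes and hyperparameters are arbitrary):

    import torch
    from torch import nn

    X = torch.randn(512, 32)            # toy dataset, reused every epoch
    y = torch.randint(0, 2, (512,))

    model = nn.Sequential(
        nn.Linear(32, 64),
        nn.ReLU(),
        nn.Dropout(p=0.5),              # dropout discourages memorizing repeated data
        nn.Linear(64, 2),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(10):             # multiple passes over the same data
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()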


I don't think that the hallucinations have anything to do with the architecture, rather they come from optimizing a cost function where saying "I don't know" is as bad as being wrong. I do not think that RLHF as currently understood can fix this, since the reward model would struggle to distinguish fact from fiction.


I think you are mixing up layers of abstraction.

The network is most likely trained with something like a categorical cross entropy loss function. Those totally punish being wrong a lot more than saying "I don't know". See https://www.v7labs.com/blog/cross-entropy-loss-guide

It's just that saying "I don't know" means that your model is spreading the probability of what the next token in the text stream might be over many different outcomes. A very 'uniform' probability distribution, instead of a sharp prediction.

That looks very different to GPT literally outputting the words "I don't know".
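
To put numbers on it, here's a tiny illustration of how cross-entropy treats a confidently wrong prediction versus a spread-out "I don't know" distribution (toy 4-token vocabulary):

    import math

    def cross_entropy(pred, true_index):
        # Negative log-probability assigned to the token that actually came next.
        return -math.log(pred[true_index])

    confident_wrong = [0.97, 0.01, 0.01, 0.01]  # sharp mass on the wrong token
    uniform_unsure = [0.25, 0.25, 0.25, 0.25]   # "I don't know"

    true_index = 1  # the token that actually followed
    print(cross_entropy(confident_wrong, true_index))  # ~4.61, heavily punished
    print(cross_entropy(uniform_unsure, true_index))   # ~1.39, much milder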


Sorry if I was unclear. I know that the model is incentivised to accurately predict the probability distribution of the next token. I mean that the model is not being incentivised to literally produce the output tokens corresponding to "I don't know" when asked a question where it is uncertain.


Yes, exactly.

What I wanted to emphasize is that the training _does_ actually incentivize the model to say "I don't know" but on a lower level.


If only the OpenAI api gave us the token probabilities like it used to.
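
For the record, the older Completions endpoint did expose them; something along these lines with the pre-1.0 openai client (model name just as an example):

    import openai  # pre-1.0 client style

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt="The capital of France is",
        max_tokens=1,
        logprobs=5,  # return the top-5 token log-probabilities
    )
    print(response["choices"][0]["logprobs"]["top_logprobs"][0])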


This take is a bit silly in that it implies the problem with training models will be that we run out of data. It's more likely that the problem is that current models require too much data to reach convergence.

We've been trying to speed run neural networks science for the past decade but we still don't fully understand how they work. It's like being a bad programmer who doesn't understand algorithms so you compensate by spending money on hardware to make your programs run faster. At some point we will reach a limit where you can't buy your way out of the problem with more data or money and we'll all be forced to return to studying the foundations of the science rather than just trying to scale the existing models up.

I am certain when we get to that point everyone will realize we've been trying to feed these models too much data. It makes more sense that our current architectures are just not effective at assimilating the data they have.


There's gotta be entire libraries that haven't been digitized that can be mined for data.


This analysis misses the impact of AI models being deployed, like is happening rapidly right now. Production applications built on AI will provide ample (infinite?) additional training data to feed back into the underlying models.


Not sure that synthetic or LLM-generated training data is as useful as human generated text.

It seems "good enough" (for now) but synthetic makes up a very small proportion of the training set being used in current models that have been trained on it, if that proportion ends up being mostly synthetic we'll likely see whatever weird hallucinations and biases in the dominant backend (GPT4 or whatever) become amplified.

It's been shown repeatedly that garbage in = garbage out for training data.


Agree about synthetic data. My point is that AI-powered applications that are deployed in production generate more _real_ data which can be used for training. For example, self-driving cars generate tons of data about how their models perform, as a result of the cars driving around. Similarly, code-writing AI applications will generate feedback in the form of errors, logs, etc. which can be fed back into the models as training data.


I love how "running out of data" implies that AI companies have access to all the text we ever wrote on all platforms out there. I mean it's probably true...


This is dumb. An individual human takes in more data than modern LLMs do.

https://open.substack.com/pub/echoesofid/p/why-llms-struggle...


Blind kids don't though, and they still end up being smarter than GPT-4


They still take in a lot of sensory data, eg related to touch and proprioception.


True but surely that doesn't come to terabytes of data


I would expect similar amount of data (or more) compared to usual visual inputs.

After all, your skin is a pretty big organ. And your sense of balance and proprioception in all the joints is quite a few different channels with pretty high temporal resolution.


>Just the vision data of a baby’s first year easily adds up to petabytes

What encoding is this??


Uncompressed 2x 8k by 8k 24bpp 24FPS video. Comes at about 500GB per hour.


It’s always interesting to think about the exact technological analog for our biological sensors, but I believe that our vision would be way less than that in terms of raw data. We have a super high-res area at the center of vision (the fovea), but the rest is extremely low resolution (but with high movement and light sensitivity).

I think we could reasonably say that if an optic nerve has about 1 million neurons on average, and they can fire at 250 Hz at the most, that's 250 Mbit/s, or ~31 MB/s per eye, of uncompressed data as an upper bound.


I don't think you correctly calculate bandwidth in this case. You assume 1 bit per neuron per tick, but time when it fires within the tick also matters, and that information is missing from multipliers.

Also, there's no reason to use data from optical nerves as input, as it is already precompressed. You should be counting optical receptors instead (120 000 000).


I don’t think it matters that much. The firing itself takes a couple of milliseconds, and there’s a refractory period of a millisecond. I’m approximating 250 Hz as the maximum rate of firing. You’re arguing that the neuron can encode more information with the phase (e.g. fire, recover, wait 2ms, fire) but I think information theory tells us the 250 Hz actually still bounds the information. Maybe there’s a small constant factor, but I don’t think it changes the order.

I don’t believe it is precompressed as it hasn’t been processed by the visual cortex yet, no? Aren’t the optical receptors simply an artifact of the “sensor design”? E.g. if the refractory period of an optical receptor is 100x that of the neuron (or you simply need to cover a certain area, as you probably have tons of receptors attached to a single neuron outside the fovea and a small number per neuron inside the fovea), you’d hook up 100 optical receptors per neuron to use its full capacity. I think this is less compression and more combining a bunch of low information channels into a higher information channel.

All we really care about here is the amount of information reaching the brain not what your physical eye is capable of receiving, so I think using the nerve makes the most sense. There’s an interesting direct analogy: we don’t really care about the number of CCD sensors in the camera that took the image, we only care about how much information is in the video coming from the camera.


I disagree with the latter point, as unlike camera sensors the cells in question already include trainable parameters for every single one of 100M+ inputs.

But it matters little as even with 100x reduction the estimate blows GPT out of the water in the first year, making it very sample inefficient in comparison.

As for the signal, I am a layman at its most extreme here (only a mist-like idea about the information theory and frequency relationship), but don't the bandwidth limits only apply to fixed-rate measurements? E.g. there's a basically infinite (sans Planck limits) number of values between 4ms and 5ms, and as long as the receiver can separate them, they can encode information?

To put it in other words, if the neurons can control the impulse peak delay down to a nanosecond, then shouldn't the limit be measured based on 10^9Hz of that control vs 250Hz of max firing rate?


I don’t think the connections between the receptors (rods and cones) and the ganglion cells are “trainable”? You seem to be assuming Brain-like learning functionality inside the eye. I’m not a biologist, but I don’t think this is the case unless you’re considering evolution as training. If it were true, wouldn’t people have wildly varying fovea? I feel that these connections are anatomical and not learned in exactly the same way the number of arms, legs or teeth is not trainable.

Regarding the nanosecond point — I don’t believe that’s how information works, and there should be many obvious problems with the idea of an infinite information channel not to mention the obvious practical ones (propagation variability, lack of a reference point, etc.). There may be some optimizations, but generally the frequency (or frequency bandwidth, which is where the generic computing term comes from) determines the information capacity, and phase modulation doesn’t magically change this (it is actually what is used in many radio systems).


31mb/s = 14GB/hr (bits to bytes). 81TB per year, assuming 16 hours awake per day. Fits snugly on a large SSD ;)


Maybe if we store text data as sequences of 10k x 10k PNGs (one for each letter) and add an image recognition layer it would improve LLM perf


> > Just the vision data of a baby’s first year easily adds up to petabytes

Just to add to this, the human brain also encodes quite a lot of evolutionary lessons. We didn't have to learn edge detectors.


Only if you rely on dumb learning where it’s learning pure pattern matching rather than interacting and reinforcement learning based on the responses.


No.

AI can generate as much synthetic data as we need, on demand.

Many SOTA models, in fact, are already being trained with synthetic AI-generated data.

See https://en.wikipedia.org/wiki/Betteridge's_law_of_headlines


> AI can generate as much synthetic data as we need, on demand.

I don't think this is right.

Can I take an untrained LLM (a neural network with random parameters), and have it start generating garbage, and then train the network to produce more of the same and then have it bootstrap itself to intelligence? Of course not.

What if I train it just a little bit first? What if I train it until it produces gibberish, but does occasionally string two words together that are spelled correctly? Can I have it produce petabytes of gibberish and then train on that to reach GPT-4's level?

You seem to argue that at some point, the AI is able to improve by training on its own output. At what point does that arrive? Because so far we've never seen an AI improve based on its own output. (As far as I know?)


> Because so far we've never seen an AI improve based on its own output.

Maybe it's because AI is such an overloaded term, but this is pretty commonplace for (semi-)supervised learning algorithms.

Pseudo-labeling [1,2] is an example of this that has been around for decades. When done properly it does improve the performance of the original model, up to a certain limit (far from the singularity).

Moreover, it is apparently possible to improve a model's performance by augmenting its training set with synthetic examples generated by a second model [3].

Finally, boosting [4] can also be seen as iteratively leveraging the output of a model to train a slightly better model. In fact, a specific type of boosting often yields state of the art performance on tabular data.

[1] https://arxiv.org/abs/2101.06329

[2] https://stats.stackexchange.com/questions/364584/why-does-us...

[3] https://arxiv.org/abs/2304.08466

[4] https://en.m.wikipedia.org/wiki/Boosting_(machine_learning)
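
For the curious, a bare-bones version of the pseudo-labeling idea from [1,2] (scikit-learn, synthetic data purely for illustration):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, random_state=0)
    X_labeled, y_labeled = X[:200], y[:200]   # small labeled set
    X_unlabeled = X[200:]                     # pretend the rest is unlabeled

    model = LogisticRegression().fit(X_labeled, y_labeled)

    # Keep only the unlabeled points the model is confident about...
    proba = model.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) > 0.95
    pseudo_labels = proba.argmax(axis=1)[confident]

    # ...and retrain on labeled + pseudo-labeled data.
    X_aug = np.vstack([X_labeled, X_unlabeled[confident]])
    y_aug = np.concatenate([y_labeled, pseudo_labels])
    model = LogisticRegression().fit(X_aug, y_aug)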


This really only works well in resource limited settings and/or semisupervised tasks.

I've tried augmentation for LLM domain adaptation and it's very modest gains in the best of situations, and even still the augmented corpus is a very tiny fraction of the underlying training corpus.

I believe OP's question was getting at whether synthetic data is useful as a substantial corpus for unsupervised training of a language model (given the topic it's reasonable to disregard other areas of 'AI') and that answer appears to be no or at least unproven and non-intuitive.


Boosting is reminiscent of the wisdom of the crowd effect.


AlphaZero in fact improves based on its own output, but I agree it is a special case and probably not generalizable.


It's RL though. Its output comes, in part, from interaction with an environment. It also has a well defined objective (win games). GPT doesn't have a clear objective other than "do more of this".


My generative melody models have done it for over a year but that's with human curation, so it's not self-improving. It's typically easier to curate than to generate, and it's especially true for music and images. It's much simpler to recognize a good melody than to compose a new one. The same applies to writing, but to a lesser extent.


You're just sampling from an already sampled distribution.

This is not the same thing. There will still be value for fine tuning, but it's no substitute.


It's not a one-way consumer, just like humans are not. It can direct the long-term evolution of reasoning. For starters, it can be used to denoise/dedup/optimise the training set to be closer to optimum (to create smaller "copies" of itself).

There are instances of things that happened (history, what Paris Hilton said on the 22nd of April, etc. - a big database of mostly irrelevant facts) and truths (math, physics, chemistry, etc.) where AI can enhance discoveries by helping us see what we have not yet realised.

Both seem endless tbh, but personally I'm more interested in the latter.
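
Even a crude, non-AI dedup pass goes a long way; a minimal sketch that drops near-identical documents by hashing normalized text:

    import hashlib

    def normalize(text):
        # Lowercase and collapse whitespace so trivial variants hash identically.
        return " ".join(text.lower().split())

    def dedup(documents):
        seen, kept = set(), []
        for doc in documents:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
        return kept

    docs = ["The cat sat on the mat.", "the  cat sat on the mat.", "A different document."]
    print(dedup(docs))  # keeps two of the three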


To my knowledge no SOTA model has been trained on a significant proportion of synthetic data, has this changed?

The best examples I know of are instruction tuning sets but that is a minute amount of data compared to the unsupervised training data.


Lots of reasons this isn't universally true - it only works if you know enough about the data to simulate it, and you're stuck within some distribution + human-guess space that's not all-encompassing.

The easiest counterexample is training LLMs: how are you going to synthesize useful language examples if you want more? Some version of this is true for most applications.


Yeah the issue is you can generate data, but it won’t be good data. Training over random strings won’t make you learn language, but it’s technically data.


> AI can generate as much synthetic data as we need, on demand.

Doesn't work in the majority of domains. You need to know the generating process (e.g. game rules) and build a realistic simulation environment that emulates it, in order to generate data that is useful. Both of these things are out of reach for most applications.

I believe the next large step will be multi-modal, where text is contextualized by video so the LLM will be able to concretize what "sitting on a chair" actually means with a single example, without needing to see thousands of textual associations to infer the meaning from the text.



