On this note, the dataset available if you start collecting today is tainted with experimental AI content. It's not the biggest issue right now, but as time goes on this problem will get worse and we'll be basing our simulations of intelligence on the output of our simulations of intelligence - a brave new abstraction.



I see this take a lot and I think it's quite wrong, not fully, but at least missing a couple big points.

I think they already don't blindly feed it just all the garbage raw data they can find, but prefer high quality, well-prepared sources.

And aside from spam, we're not just blindly posting AI content either. We're putting in meaningful prompts, rejecting answers we don't like, and editing answers we do.


I think "high quality, well-prepared sources" would include blogs and articles, which are likely to become heavily influenced by AI (blogs and articles are high quality compared to Reddit posts for example, which were included in the past).

In fact, there's no reason to think that academic papers won't start using language models to write better.

Tainting your text with AI can be as simple as pasting a paragraph in and asking if there's anything to improve.


I honestly don’t understand why “tainting” is such a big deal. Can someone explain it to me?

I see two possible reasons, but neither seems to be worth the purity concern. The first is that AI can be wrong, make stuff up, be confidently incorrect. Anyone who has been on the internet knows this isn’t exactly a game changer.

Second is that we won’t be training AI to be like humans, but like humans + AI. Also doesn’t seem like a big deal. We’re already humans + writing + computers + internet and so on. This cutoff matters for anthropology, but I don’t see how it matters for trying to make a bot that can do my taxes.


I think the best explanation is to look at Google. Google's basic algorithm was that it could look at how people organically interacted on the web and use that as a heuristic for quality: if lots of sites are linking to you, you're probably high quality and you'll appear at the top of Google. But that started to break down, (a) because people were gaming that metric for SEO, (b) because the internet centralized, so the organic interactions started to disappear, and (c) because people stopped clicking through links from different sites - why do that when you can just google what you want! Google basically broke this metric by using it.
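
To make the link heuristic concrete, here's a toy PageRank-style power iteration over a made-up four-page link graph - just a sketch of the idea, not Google's production algorithm:

    import numpy as np

    # Hypothetical tiny web: links[i] = pages that page i links to.
    links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
    n = len(links)
    d = 0.85  # damping factor

    rank = np.full(n, 1.0 / n)
    for _ in range(50):
        new_rank = np.full(n, (1 - d) / n)
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += d * rank[page] / len(outlinks)
        rank = new_rank

    print(rank)  # pages with more incoming links from well-ranked pages score higher

Page 2 ends up on top here because almost everyone links to it - which is exactly the metric SEO farms learned to game.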

In the same way, AI is trying to generate text that looks like its training data, but if its training data is AI-generated text then it's simply being taught to be more like itself. It slowly starts to work less like a human and more like whatever its own idiosyncrasies are. It's a larger-scale version of the hallucinations it has today. If 50% of all the text on the internet becomes some part AI-generated, then a huge part of the training for the next generation of AI will be the shortcomings of the current iteration of AI. And this will get worse as non-AI content moves to exclude itself from training.
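
You can see the feedback loop in miniature with a toy simulation: fit a simple model to data, then train the next generation only on samples from the previous one. This is a sketch of the general phenomenon, not a claim about any real LLM:

    import numpy as np

    rng = np.random.default_rng(0)
    corpus = rng.normal(0.0, 1.0, size=50)  # generation 0: "human" data

    for gen in range(1, 31):
        mu, sigma = corpus.mean(), corpus.std()  # "train" on the current corpus
        corpus = rng.normal(mu, sigma, size=50)  # next corpus is pure model output
        if gen % 5 == 0:
            print(f"gen {gen:2d}: mean={mu:+.3f}, std={sigma:.3f}")

Each generation inherits the previous generation's sampling quirks, so the distribution tends to drift and narrow over time - the tails (the rare, interesting stuff) disappear first.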


> Second is that we won’t be training AI to be like humans, but like humans + AI.

LLMs weren't trained to be like humans. They were trained to predict what humans (and other sources of Common Crawl data) will write next in their texts. This might seem like a small difference but it's not. Consider, for example, someone whose career is to research ant behavior. Their job, in some sense, is to be able to predict what an ant will do. Does this mean that in the course of their academic training and scientific research, this researcher is being trained to be like an ant?
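
For clarity on what that objective actually is: the model is scored only on how well it assigns probability to the token a human actually wrote next. A minimal sketch, where `model` is a hypothetical function returning a probability distribution over the vocabulary:

    import math

    def next_token_loss(model, tokens):
        # model(prefix) -> dict mapping each candidate next token to a probability
        total = 0.0
        for i in range(1, len(tokens)):
            probs = model(tokens[:i])            # predict a distribution over the vocab
            total -= math.log(probs[tokens[i]])  # penalize surprise at the real next token
        return total / (len(tokens) - 1)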


> Does this mean that in the course of their academic training and scientific research, this researcher is being trained to be like an ant?

If they act out these predictions and are rewarded based on their accuracy, then yes. They're being trained to be like ants. Not entirely like ants in every way, but like them in specific ways.

There's a big difference between your analogy and this situation. Predicting tokens is essentially the same as generating tokens. There's no meaningful objective difference between the two activities (I'm ignoring philosophy and focusing on observables). They both lead to a stream of tokens.
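
Concretely: wrap any next-token predictor in a sampling loop and you have a generator, with no extra machinery. A sketch, where `predict_next` is a hypothetical function returning a {token: probability} dict:

    import random

    def generate(predict_next, prompt, n_tokens):
        tokens = list(prompt)
        for _ in range(n_tokens):
            dist = predict_next(tokens)  # the "prediction" step
            nxt = random.choices(list(dist), weights=list(dist.values()))[0]
            tokens.append(nxt)           # "generation" is just keeping the guess
        return tokens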

For contrast, consider any sport, maybe baseball. I could predict the winner of a game but not be able to win it myself. I could predict the next pitch but not be able to throw it or hit it. There's an execution aspect you can fail at. Being like an ant would also have this aspect. Token prediction doesn't have this, or if it does (maybe turning a vector into an API response?), it's a trivial part of the whole problem.

Maybe it would be clearer to say "write like humans" instead of "be like humans", though.


Um, really? You think your average 'growth hacker' who is using ChatGPT to exponentially increase the amount of SEO junk they can churn out is checking each answer before they press publish?

Ensuring the purity, accuracy, and relevance of data collected from the internet is going to be a very hard problem.


It always has been; the internet is full of garbage. But there are ways of finding the data that's useful to humans, like upvotes.
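
Something like this, as a crude sketch (the field names are made up):

    MIN_SCORE = 10  # arbitrary cutoff for this sketch
    posts = [
        {"text": "useful explanation", "score": 124},
        {"text": "spam link", "score": -3},
        {"text": "decent comment", "score": 12},
    ]
    training_data = [p["text"] for p in posts if p["score"] >= MIN_SCORE]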


YouTube has tons of data that's mostly untainted by SEO spam.


> I think they already don't blindly feed it just all the garbage raw data they can find, but prefer high quality, well-prepared sources.

If by that you mean Common Crawl, Wikipedia, etc., that's hardly "high quality, well prepared", and it's very subject to the biases and flaws of its creators, who vary widely in expertise, intelligence, and ability.


> as time goes on this problem will get worse and we'll be basing our simulations of intelligence on the output of our simulations of intelligence, a brave new abstraction.

If we build a system where we feed the exhaust of an AI to another one at each step, should we call it the AI Centipede, like in the movies? https://m.imdb.com/list/ls064583741/


100% -- I also see this as a big and emerging problem that future researchers and practitioners will have to deal with.

Posted some thoughts previously here --

https://news.ycombinator.com/item?id=32577822

https://news.ycombinator.com/item?id=33869402


Text on the Internet isn't tainted only by the output of recent, relatively smart models.

We've had computers spitting out text (especially spam) for ages now. You'd have to filter that out, too, if tainting actually were a problem.



