On this note, the dataset available if you start collecting today is tainted with experimental AI content. It's not the biggest issue right now, but as time goes on this problem will get worse and we'll be basing our simulations of intelligence on the output of our simulations of intelligence - a brave new abstraction.



I see this take a lot and I think it's quite wrong, not fully, but at least missing a couple big points.

I think they already don't blindly feed it just all the garbage raw data they can find, but prefer high quality, well-prepared sources.

And aside from spam, we're not just blindly posting AI content either. We're putting in meaningful prompts, rejecting answers we don't like, and editing answers we do.


I think "high quality, well-prepared sources" would include blogs and articles, which are likely to become heavily influenced by AI (blogs and articles are high quality compared to Reddit posts for example, which were included in the past).

In fact, there's no reason to think that academic papers won't start using language models to write better.

Tainting your text with AI can be as simple as pasting a paragraph in and asking if there's anything to improve.


I honestly don’t understand why “tainting” is such a big deal. Can someone explain it to me?

I see two possible reasons, but neither seems to be worth the purity concern. The first is that AI can be wrong, make stuff up, be confidently incorrect. Anyone who has been on the internet knows this isn’t exactly a game changer.

Second is that we won’t be training AI to be like humans, but like humans + AI. Also doesn’t seem like a big deal. We’re already humans + writing + computers + internet and so on. This cutoff matters for anthropology, but I don’t see how it matters for trying to make a bot that can do my taxes.


I think the best explanation is to look at Google. Google's basic algorithm was that it could look at how people organically interacted on the web and use that as a heuristic for quality: if lots of sites are linking to you, you're probably high quality and you'll appear at the top of Google. But that started to break down, (a) because people were gaming that metric for SEO, (b) because the internet centralized, so the organic interactions started to disappear, and (c) because people stopped clicking through links from different sites - why do that when you can just google what you want! Google basically broke this metric by using it.
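
To make the link heuristic concrete, here's a toy PageRank-style power iteration over a made-up four-page link graph - just a sketch of the idea, not Google's production algorithm:

    import numpy as np

    # Hypothetical tiny web: links[i] = pages that page i links to.
    links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
    n = len(links)
    d = 0.85  # damping factor

    rank = np.full(n, 1.0 / n)
    for _ in range(50):
        new_rank = np.full(n, (1 - d) / n)
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += d * rank[page] / len(outlinks)
        rank = new_rank

    print(rank)  # pages with more incoming links from well-ranked pages score higher

Page 2 ends up on top here because almost everyone links to it - which is exactly the metric SEO farms learned to game.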

In the same way, AI is trying to generate text that looks like its training data, but if its training data is AI-generated text then it's simply being taught to be more like itself. It slowly starts to work less like a human and more like whatever its own idiosyncrasies are. It's a larger-scale version of the hallucinations it has today. If 50% of all the text on the internet becomes some part AI-generated, then a huge part of the training for the next generation of AI will be the shortcomings of the current iteration of AI. And this will get worse as non-AI content moves to exclude itself from training.
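
You can see the feedback loop in miniature with a toy simulation: fit a simple model to data, then train the next generation only on samples from the previous one. This is a sketch of the general phenomenon, not a claim about any real LLM:

    import numpy as np

    rng = np.random.default_rng(0)
    corpus = rng.normal(0.0, 1.0, size=50)  # generation 0: "human" data

    for gen in range(1, 31):
        mu, sigma = corpus.mean(), corpus.std()  # "train" on the current corpus
        corpus = rng.normal(mu, sigma, size=50)  # next corpus is pure model output
        if gen % 5 == 0:
            print(f"gen {gen:2d}: mean={mu:+.3f}, std={sigma:.3f}")

Each generation inherits the previous generation's sampling quirks, so the distribution tends to drift and narrow over time - the tails (the rare, interesting stuff) disappear first.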


> Second is that we won’t be training AI to be like humans, but like humans + AI.

LLMs weren't trained to be like humans. They were trained to predict what humans (and other sources of Common Crawl data) will write next in their texts. This might seem like a small difference but it's not. Consider, for example, someone whose career is to research ant behavior. Their job, in some sense, is to be able to predict what an ant will do. Does this mean that in the course of their academic training and scientific research, this researcher is being trained to be like an ant?
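
For clarity on what that objective actually is: the model is scored only on how well it assigns probability to the token a human actually wrote next. A minimal sketch, where `model` is a hypothetical function returning a probability distribution over the vocabulary:

    import math

    def next_token_loss(model, tokens):
        # model(prefix) -> dict mapping each candidate next token to a probability
        total = 0.0
        for i in range(1, len(tokens)):
            probs = model(tokens[:i])            # predict a distribution over the vocab
            total -= math.log(probs[tokens[i]])  # penalize surprise at the real next token
        return total / (len(tokens) - 1)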


> Does this mean that in the course of their academic training and scientific research, this researcher is being trained to be like an ant?

If they act out these predictions and are rewarded based on their accuracy, then yes. They're being trained to be like ants. Not entirely like ants in every way, but like them in specific ways.

There's a big difference between your analogy and this situation. Predicting tokens is essentially the same as generating tokens. There's no meaningful objective difference between the two activities (I'm ignoring philosophy and focusing on observables). They both lead to a stream of tokens.
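
Concretely: wrap any next-token predictor in a sampling loop and you have a generator, with no extra machinery. A sketch, where `predict_next` is a hypothetical function returning a {token: probability} dict:

    import random

    def generate(predict_next, prompt, n_tokens):
        tokens = list(prompt)
        for _ in range(n_tokens):
            dist = predict_next(tokens)  # the "prediction" step
            nxt = random.choices(list(dist), weights=list(dist.values()))[0]
            tokens.append(nxt)           # "generation" is just keeping the guess
        return tokens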

For contrast, consider any sport, maybe baseball. I could predict the winner of a game but not be able to win it myself. I could predict the next pitch but not be able to throw it or hit it. There's an execution aspect you can fail at. Being like an ant would also have this aspect. Token prediction doesn't have this, or if it does (maybe turning a vector into an API response?), it's a trivial part of the whole problem.

Maybe it would be clearer to say "write like humans" instead of "be like humans", though.


Um, really? You think your average 'growth hacker' who is using ChatGPT to exponentially increase the amount of SEO junk they can churn out is checking each answer before they press publish?

Ensuring the purity, accuracy, and relevance of data collected from the internet is going to be a very hard problem.


It always has been; the internet is full of garbage. But there are ways of finding the data that's useful to humans, like upvotes.
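
Something like this, as a crude sketch (the field names are made up):

    MIN_SCORE = 10  # arbitrary cutoff for this sketch
    posts = [
        {"text": "useful explanation", "score": 124},
        {"text": "spam link", "score": -3},
        {"text": "decent comment", "score": 12},
    ]
    training_data = [p["text"] for p in posts if p["score"] >= MIN_SCORE]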


YouTube has tons of data that's mostly untainted by SEO spam.


> I think they already don't blindly feed it just all the garbage raw data they can find, but prefer high quality, well-prepared sources.

If by that you mean Common Crawl, Wikipedia, etc., that's hardly "high quality, well prepared", and it's very subject to the biases and flaws of its creators, who vary widely in expertise, intelligence, and ability.


> as time goes on this problem will get worse and we'll be basing our simulations of intelligence on the output of our simulations of intelligence, a brave new abstraction.

If we build a system where we feed the exhaust of an AI to another one at each step, should we call it the AI Centipede, like in the movies? https://m.imdb.com/list/ls064583741/


100% -- I also see this as a big and emerging problem that future researchers and practitioners will have to deal with.

Posted some thoughts previously here --

https://news.ycombinator.com/item?id=32577822

https://news.ycombinator.com/item?id=33869402


Text on the Internet isn't tainted only by the output of recent, relatively smart models.

We've had computers spitting out text (especially spam) for ages now. You'd have to filter that out, too, if tainting actually were a problem.



