It's pretty easy to confirm that copyrighted material is in the training data. See the NYT lawsuit against OpenAI, for example.



Part of that back-and-forth is the claim that "this specific text was copied a lot all over the internet, making it show up more in the output", which means the lawsuit isn't a useful guide to cases where a single copy was added to The Pile and not removed when training the model.

(Or worse, that Google already had a copy because of Google Books and didn't think "might training on this explode in our face like that thing with the Street View WiFi scanning?")



