
> LLMs feed on content and data scraped from web pages...

Not exclusively, no. As we march forward I'd expect:

- the number of web pages generated by AI will outpace the number generated by humans, probably by a few orders of magnitude, meaning that human-generated content will drown in a sea of machine-generated content. Furthermore, ad-supported web sites will all be AI generated, cutting off advertising funding for human-generated sites

- this will make web pages a very poor source of information to scrape overall, since most of what's out there will be LLM output

- sophisticated AI builders will start signing licensing deals with content creators that give them exclusive access for their AI. Think: medical and legal journals, large archives of historical works, stock market historical data, technical manuals, etc. This content isn't very consumer-friendly as-is, but could be used to generate consumer-friendly content that is technically more accurate.

> It's weird how you paint that it's inevitable, just to tell them "tough luck"? They don't need to fight "AI" they only need to make sure their work isn't stolen, and there are many options they can adapt to.

I think you're suggesting that people can continue publishing their content on publicly accessible web sites, that they can somehow detect when their site is being visited by an AI web scraper, and that in those cases they can feed "garbage" to the scraper. I'm suggesting that this will be a losing battle: detecting bots is already a forever-war that can't be won, and at the same time consumer behavior will change such that publishing pages on public web sites won't be an effective way to reach an audience, given the mountains of AI-generated content.
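To make the problem concrete, the detect-and-poison approach usually amounts to User-Agent filtering, something like this minimal sketch (Flask; GPTBot, CCBot, and ClaudeBot are real, publicly documented crawler tokens, but the decoy logic and bot list here are purely illustrative, not anyone's actual setup):

    # Hypothetical sketch of detect-and-poison: serve decoy text to
    # self-identified AI crawlers based on their User-Agent string.
    from flask import Flask, request

    app = Flask(__name__)

    # Publicly documented crawler tokens; illustrative, not exhaustive,
    # and a scraper can simply choose not to send any of these.
    AI_BOT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot")

    @app.before_request
    def poison_known_bots():
        ua = request.headers.get("User-Agent", "")
        if any(token in ua for token in AI_BOT_TOKENS):
            # Short-circuit the request: known bots get garbage.
            return "Lorem ipsum decoy text for scrapers.", 200

    @app.route("/")
    def index():
        return "The real article, served to everyone else."

Note that this only catches scrapers polite enough to announce themselves. Anything running a headless browser with a stock Chrome User-Agent sails straight through, which is exactly why I call it a forever-war.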
