> An LLM can be trained to find relevant knowledge online.
Why do you think ChatGPT lost its Web Search plugin lately? Copyright lawsuits. You can't even use copyrighted content in the prompt because it will make the model makers liable.
But this doesn’t make sense - how is using ChatGPT to find information different from using a search engine? Especially if ChatGPT clearly lists its sources?
Given the already huge cost of training, and the evident lack of concern the LLM folks seem to have for copyright, why wouldn't the AI groups purchase subs to scrape the paywalled content?
They would possibly need to apply some effort to appear human, but that should only throttle the rate, not stop their scraping altogether.
It's more difficult to scrape paywalled content, no?
Clearly, places like Reddit have wised up to this and are making API usage non-free, for example, so while it's not impossible, you can see the limitations being put into place already. Twitter is another one.
It seems like all this data is now considered gold and people lock up gold?