Given the already huge cost of training, and the evident lack of concern the LLM folks seem to have for copyright, why wouldn't the AI groups purchase subs to scrape the paywalled content?
They would possibly need to apply some effort to appear human, but that should only throttle the rate, not stop their scraping altogether.
It's more difficult to scrape paywalled content, no?
Clearly, places like Reddit have wised up to this and are making API usage non-free, for example, so while it's not impossible, you can see the limitations being put into place already. Twitter is another one.
It seems like all this data is now considered gold and people lock up gold?