My understanding is that the current web scraping situation is this:
* Scraping publicly accessible web pages is not, by itself, a CFAA violation. (EF Cultural Travel v. Zefer, hiQ Labs v. LinkedIn).
* Scraping a public website in "violation" of a clickthrough / click-in ToS does not constitute an enforceable breach of contract or trespass to chattels (i.e., incidental damage to a website due to access), or really mean anything at all. This is less clear once a user account or log-in process is involved. (Intel v. Hamidi, Ticketmaster v. Tickets.com)
* Publishing or using scraped data may still violate copyright, just as if the data had been acquired by any means other than scraping. (AP v. Meltwater, Facebook v. Power Ventures)
So this boils down to two fundamental questions that will need to get answered regardless of "scraping" being involved: "is GPT output copyrightable" and "is training a model on copyrighted data a copyright infringement."
Does training a model on second-hand data amount to laundering copyright? By second-hand data I mean data generated by a model that was itself trained on copyrighted content.
Let's say I train a diffusion model on ten million images generated by diffusion models that have seen copyrighted data. I make sure to remove near-duplicates from my training set. My model will only learn the styles, not the exact composition of the original dataset. So it won't be able to replicate any original work, because it has never seen any original work.
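For what it's worth, here is a rough sketch of what "remove near-duplicates" could look like. It assumes a simple average hash over small grayscale grids and a greedy Hamming-distance filter; the names `average_hash`, `hamming`, `drop_near_duplicates` and the `max_distance` threshold are made up for illustration, and a real pipeline would likely use a proper perceptual-hash or embedding-based approach.

```python
from typing import List, Sequence

# Assumption: each image is already downscaled to a small grayscale grid (values 0-255).
Image = Sequence[Sequence[int]]


def average_hash(img: Image) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the image mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions where the two hashes differ."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(images: List[Image], max_distance: int = 2) -> List[int]:
    """Greedy filter: keep an image only if its hash is far from every kept hash.

    Returns the indices of the images that survive.
    """
    kept_hashes: List[int] = []
    kept_indices: List[int] = []
    for i, img in enumerate(images):
        h = average_hash(img)
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept_hashes.append(h)
            kept_indices.append(i)
    return kept_indices
```

Note this greedy pass compares every image against everything kept so far, so it is quadratic; at ten million images you would want some kind of bucketing or approximate nearest-neighbour index instead. It also only catches duplicates within the generated set, not resemblance to the original copyrighted works.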
Is this a neat way of separating ideas from their expression? Copyright should only cover expression. This kind of information laundering follows the definition to the letter and only takes the part that is OK to take: the ideas, leaving the original expression behind.
If OpenAI tries to bring a legal claim against this, they will be reminded that their own model is trained on tons of unlicensed content scraped without consent. If their training is legal, then this is legal too.