I would think that, as a matter of practice, AI companies would try to detect long strings that appear frequently in their corpus and deduplicate them. There isn’t any value in training over and over again on the same data, and the copyright danger of a model being able to reproduce its training set verbatim is obvious. Perhaps they left the duplicates in intentionally, using the ability to reproduce copyrighted material as a way to get customers early on, knowing they would only have to pay a paltry fee for it later.
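
To make the deduplication idea concrete, here is a rough sketch of what "detect long strings that appear frequently and remove them" could look like. It is only an illustration under my own assumptions: the function names `frequent_spans` and `dedup`, the word-level windows, and the thresholds `n` and `min_docs` are all hypothetical, and I have no knowledge of what any AI company actually runs. At real corpus scale this would likely need something more scalable, such as suffix arrays or MinHash.

```python
from collections import Counter


def frequent_spans(docs, n=8, min_docs=2):
    """Return the set of n-word spans that occur in at least min_docs documents."""
    counts = Counter()
    for doc in docs:
        words = doc.split()
        # Count each span once per document so one repetitive document
        # cannot push a span over the threshold by itself.
        spans = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(spans)
    return {span for span, c in counts.items() if c >= min_docs}


def dedup(docs, n=8, min_docs=2):
    """Keep the first occurrence of each frequent span; drop later repeats."""
    frequent = frequent_spans(docs, n, min_docs)
    seen = set()
    cleaned = []
    for doc in docs:
        words = doc.split()
        drop = [False] * len(words)
        for i in range(len(words) - n + 1):
            span = tuple(words[i:i + n])
            if span not in frequent:
                continue
            if span in seen:
                # This span already appeared earlier: mark its words for removal.
                for j in range(i, i + n):
                    drop[j] = True
            else:
                seen.add(span)
        cleaned.append(" ".join(w for w, d in zip(words, drop) if not d))
    return cleaned


docs = ["the quick brown fox", "see the quick brown dog"]
print(dedup(docs, n=3))
```

The first document keeps the shared span and the second loses it, so each long repeated string survives only once in the cleaned corpus.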