
No. Making a single copy for your own use is still a copyright violation. There are exceptions (fair use, nominative use, etc.), but just because people are rarely sued for personal copying doesn't mean that copying is permitted. And trademark issues, such as the other commenter generating the Superman logo, are subject to a host of other rules.



Training a model isn’t making a copy for your own use; it’s not making a copy at all. It’s converting the original media into a statistical aggregate combined with a lot of other stuff. There’s no copy of the original, even if the model is able to produce a similar product to the original. That’s the specific thing: the aggregation and the lack of direct reproduction in any form mean it’s fundamentally not reproducing or copying the material. The fact that it can be induced to produce copyrighted material, just as a Xerox machine can be induced to reproduce copyrighted material, doesn’t make the model or its training a violation of copyright. If its sole purpose were reproduction and distribution of the material, or if it carried a copy of the original around and produced it on demand, that would be a different story. But it’s not doing any of that, not even remotely.

All this said, it’s a highly dynamic area: it depends on the local law, the media and medium, and the question hasn’t been fully explored. I’m wagering, though, that when it comes down to it, the model isn’t violating copyright for these reasons, but you can certainly violate copyrights using a model.


Copying into RAM during training is making a copy, and can be a copyright violation.

https://en.wikipedia.org/wiki/MAI_Systems_Corp._v._Peak_Comp....

However, it seems that there is a later case in the 2nd circuit:

https://en.wikipedia.org/wiki/Cartoon_Network,_LP_v._CSC_Hol....


MAI v. Peak was obviously wrong. It would mean that whenever you use someone else's computer and run licensed software, you're committing copyright infringement. The decision split hairs distinguishing between the current user and the licensee for purposes of the legality of making transient copies in memory as part of running the program.

Peak was a repair business. MAI built computers (as in assembled/integrated; I think they were PCs) and had packaged an OS and some software presumably written or modified in-house along with the computer. MAI serviced the whole thing as a unit. So did Peak. MAI sued Peak for copyright infringement because Peak was taking computer repair/maintenance business away from MAI, under the theory that Peak employees operating their clients' MAI computers and software was copyright infringement. (There were other allegations of Peak having unlicensed copies of MAI's software internally, but that's not central to the lawsuit.)

If you have a piece of IP to train an AI model with, and you have a legal right of access to use that piece of IP (for private purposes), MAI v. Peak doesn't cleanly apply.

MAI v. Peak is also 9th circuit only, and even without the poor reasoning, it should automatically be in doubt because the 9th circuit is notoriously friendly to IP interests, given that it covers Los Angeles.


I agree that MAI v Peak is crazy.

I was only pointing out that the law is of the opinion that a copy is a copy is a copy, regardless of where it's made, or how long it exists for.

Other decisions come into play to save us, like Authors Guild v Google, where they said search engines could make copies, bringing Fair Use into the picture.

Personally, I think that creating the model is Fair Use, but anything produced by the model would need to be checked for a violation. I would treat it the same as if I went to Google Book Search, and copied the snippet it returned into my new book.

The license associated with the training data then becomes insanely important. Having the model reference back to the source data is even more important.

For example, training data under a CC BY license would be treated very differently from CC BY-SA or CC BY-ND, and all of them require any work produced by the model to credit the original source in order to be publishable.

https://creativecommons.org/licenses/


The difference is that the copy is authorized, unless the work is being pirated.

When an artist displays their work on DeviantArt or Artstation or whatever, they are allowing the general public to load it into memory. It's part of the license agreement they accept when they sign up for these services.


The copy isn't authorized, the copy is allowed under Fair Use. There's a huge difference between the two.


Wrong.

Fair Use applies to instances that would otherwise be copyright violations, i.e. unauthorized distribution.

When you sign up for a social media site, you EXPLICITLY grant the site the right to distribute your work. You have expressly permitted it. That's a big difference!


The sources used for training these AIs are publicly available sources like Common Crawl. If having a copy in RAM is a copyright violation, then there are copyright violations occurring well before any AI ever sees it.


It is, and it's the same reason Blizzard can sue cheat makers: they're violating copyright law by using the memory of the game, etc.


How do search engines exist? The internet archive? Caching of image results? Web browser caches? CDNs?


Copies made by search engines don't need authorization, and can be unauthorized copies. Search engines are allowed to make copies under Fair Use since they are transformative - see Authors Guild, Inc. v. Google, Inc.

There hasn't been an explicit decision for ML training, but everyone's assuming that Authors Guild v Google applies.

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,.....

CDNs operate under the control of the copyright owner, so they would be authorized.

Web browser caches are under the control of the recipient who has authorization to make a copy.


Depends on the training. Copilot can output training code verbatim. And even if not an exact reproduction, using a small training set could often produce insufficiently transformative work that could still be legally considered a derived work. (IANAL)


> Training a model isn’t making a copy for your own use, it’s not making a copy at all. It’s converting the original media into a statistical aggregate combined with a lot of other stuff.

Devil’s advocate: That sounds like a derivative work, which would be infringement.


> converting the original media into a statistical aggregate

Devil's advocate: That sounds transformative, which wouldn't be infringement.


Good point. I’d imagine we’ll see arguments in both directions, given how grey the line is between purely derivative and transformative.

I think it’s fair to say that generative AI trained on copyrighted content will be an unmitigated win for IP attorneys all around.


> No. Making a single copy for your own use is still a copyright violation.

In some jurisdictions, perhaps, but not in all of them. There isn't one set of universal copyright law in the world. E.g., in New Zealand you are allowed to make a single copy of any sound recording for your own personal use, per device that you will play the sound recording on. I'm sure there are other examples in other countries.

https://www.consumer.org.nz/articles/copyright-law


This is the same in the UK (and not only for sound). If you own the copy, you can make personal copies. You can't share them, and you have to own the original.




