Conflating training a model with human learning is wrong. When training a model ...

bastawhiz · on June 1, 2023

I don't fundamentally disagree with you, but what you are saying doesn't hold water.

> a copy is made and reproduced numerous times when training.

Casually browsing the web creates millions of copies of what are likely the same images and text that models are trained on. Computers cannot move information; they can only copy it and delete the original. Splitting hairs over the semantics of what it means to "copy" isn't a strong argument.
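As a rough illustration of the copy-then-delete point, here is a minimal Python sketch. The helper name `move_by_copy` and the file names are hypothetical. The sketch deliberately avoids `os.rename`, which on a single filesystem can relabel a file without duplicating its bytes, so the claim applies most clearly when data crosses devices or a network.

```python
import os
import shutil
import tempfile


def move_by_copy(src: str, dst: str) -> None:
    """Hypothetical helper: 'move' a file as a copy followed by a delete."""
    shutil.copyfile(src, dst)  # step 1: a second copy of the bytes now exists
    os.remove(src)             # step 2: the original is deleted


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "original.txt")
        dst = os.path.join(tmp, "moved.txt")
        with open(src, "w") as f:
            f.write("some publicly accessible text")
        move_by_copy(src, dst)
        # The source path is gone and the destination holds the same bytes.
        print(os.path.exists(src), os.path.exists(dst))
```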

> where it is an authorized viewing

What exactly is an unauthorized viewing of a publicly accessible piece of content online that has been hyperlinked to? If we assume things like robots.txt are respected, what makes the access of that data improper?

> it may output material that competes with the original

An art student could create a forgery. I could craft for myself a replica of a luxury bag. But that's not a crime unless it's done with the intention of deceiving someone or profiting from the work. Intent, after all, is nine tenths of the law.

It's an important right that you should be able to do and create things, even if the sale or distribution of the outputs of those things is prohibited. A model's ability to produce content that couldn't be distributed shouldn't preempt its existence.

> So you may have copyright violation in distribution of the dataset or a model's output

And neither of those things is the act of training or distributing the model itself!

carom · on June 1, 2023

There is quite a bit of precedent for "making copies of digital things is copyright infringement". Look at lawsuits from the Napster era. [1]

What makes the use improper? Licenses. Terms of service. Mostly licenses though. For example, all the images on Flickr that were uploaded under Creative Commons licenses (e.g. non-commercial) have now been used in a commercial capacity by a company to create and sell a product.

Similarly, code on GitHub is published under specific licenses with specific terms. Copilot is a derivative work of that code, the license terms of that code (e.g. GPL, non-commercial) should extend to the new function that was derived from it.

The reason I mention competition with the original is the fair use test (USA). When courts decide whether something is fair use they consider a few aspects. Two important ones are whether it is commercial, and whether it is a substitute for the original. When art models output something in the style of a living artist, it is essentially a direct substitute for that person.

Sure, I can make a shirt with Spider Man on it and give it to my brother, but if a company were to use what I made or I tried to sell it, I would expect a cease and desist from Disney.

Training the model may very well be a copyright issue. The images have been copied, they are being used. Whether that falls under fair use will likely be determined on a case-by-case basis in court. I do not believe closed commercial models like Copilot or DALL-E will pass a fair use test.

There is a lot of money involved here though, so we will need to wait for years before we have answers.

1. https://www.theguardian.com/technology/2012/sep/11/minnesota...

bastawhiz · on June 1, 2023

> to create and sell a product.

This is not model training.

> Copilot is a derivative work of that code, the license terms of that code (e.g. GPL, non-commercial) should extend to the new function that was derived from it.

But the very act of training Copilot is not problematic. In fact, if GitHub never did anything with Copilot, the physical act of training the model would not be problematic at all. And that's what's at issue here. How Copilot is used is orthogonal to the article.

> Sure, I can make a shirt with Spider Man on it and give it to my brother, but if a company were to use what I made or I tried to sell it, I would expect a cease and desist from Disney.

Yes. And training the model isn't the part where you sell it. It's the part where you make it.

> Training the model may very well be a copyright issue. The images have been copied, they are being used.

What do you think "being used" means here? If I work for a company and download a bunch of text and save it to a flash drive, have I violated copyright? Of course not. If I put that data in a spreadsheet, is it copyright infringement? Of course not. If I use Excel formulas on that text, is it infringement? Still no.

And so how can you claim in any way that the creation of a model is anything more than aggregating freely available information?

I don't disagree with you about the use of a model. But training the model is just taking some information and running code against it. That's what's important here.

kelnos · on June 1, 2023

I'm glad you brought this up, as this tendency for people to anthropomorphize a learning algorithm really bothers me. The model training process is a mathematical function. It is not a human engaging in thought processes or forming memories. Attempting to equate the two feels wrong to me, and trying to use the comparison in arguments like this just feels irrelevant and invalid.

lmm · on June 1, 2023

> When training a model you are deriving a function that takes some input and produces an output. The issue with copyright and licensing here is that a copy is made and reproduced numerous times when training.

How's that any different from what happens inside a human's brain when learning?

> The model is not walking around a museum where it is an authorized viewing.

The training data could well be from an online museum. And the idea that viewing something public has to be "authorized" is very insidious.

> The further issue is that it may output material that competes with the original.

So might a human student.

carom · on June 1, 2023

It is different from a human brain in that it is not a human brain. It is a statistical function that produces some optimized outputs for some inputs.
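To make "a statistical function that produces some optimized outputs for some inputs" concrete, here is a toy sketch: fitting a straight line by gradient descent. The data, learning rate, step count, and the name `train_linear` are all illustrative assumptions, not anything taken from the models discussed here; real models have vastly more parameters, but the loop has the same basic shape.

```python
def train_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b (approximately) by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Nudge the parameters in the direction that reduces the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_linear(xs, ys)
print(round(w, 3), round(b, 3))
```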

I have made no mention of things being authorized in public. In the US you are allowed to take a photo of anything you want in public. These models are not being trained on datasets collected wholly in public though, it is very insidious to suggest that they are.

The internet is not "the public". It is a series of digital properties that define terms for interacting with them. Now, a lot of material is publicly accessible online, but that does not mean that it is not still governed by copyright. For example, my code on Github is publicly accessible, but that doesn't mean you can disregard the license.

If you use this copyrighted material to produce a product for commercial gain you will likely face a fair use test in court. If you use it for a non-commercial cause with public benefit you could probably pass that fair use test. Open source will do very well because of this.

The model is not a human though, and very often these are not "public" works that it is trained on.

lmm · on June 1, 2023

> It is a statistical function that produces some optimized outputs for some inputs.

So is a human mind.

> In the US you are allowed to take a photo of anything you want in public. These models are not being trained on datasets collected wholly in public though, it is very insidious to suggest that they are.

How so? What non-public training data are they using, and why does it matter?

> The internet is not "the public". It is a series of digital properties that define terms for interacting with them. Now, a lot of material is publicly accessible online, but that does not mean that it is not still governed by copyright. For example, my code on Github is publicly accessible, but that doesn't mean you can disregard the license.

It does mean you can read the code and learn from it without concern for the license (morally, if not legally).

bryanrasmussen · on June 1, 2023

>> When training a model you are deriving a function that takes some input and produces an output. The issue with copyright and licensing here is that a copy is made and reproduced numerous times when training.

> How's that any different from what happens inside a human's brain when learning?

I don't know, nor does anyone else. So let me ask you - how is that the same as what happens inside a human's brain when learning?

lmm · on June 1, 2023

> I don't know, nor does anyone else.

We don't know the details. But it's pretty implausible that the process of learning wouldn't involve the brain having some representation of the thing it's learning, or wouldn't involve repeatedly "copying" that representation. Every way we know of processing data works like that. (OK, there are theoretical notions of reversible computation, but that approach is more complex and less effective than the regular kind, so it seems very unlikely the brain would operate that way.)

And a human who has learned to perform a task has certainly "derived a function that takes some input and produces an output".

yazaddaruvala · on June 1, 2023

> But it's pretty implausible that the process of learning wouldn't involve the brain having some representation of the thing it's learning, or wouldn't involve repeatedly "copying" that representation.

I think you can easily make a stronger statement:

We do know that art students spend many hours literally tracing other images in order to learn to draw. We do know that repetition is how the brain improves over time.

"Learn to draw better by copying." - https://www.adobe.com/creativecloud/illustration/discover/le...

Based on that, it seems pretty clear to me that the other commenters here would agree (regardless of what the brain does internally) that, at a minimum, art students are violating copyright many, many times in order to learn.