Hacker News

Not trying to express an opinion on the legal matter, but as a technical matter it's pretty obvious that LLMs create copies of (some of) their training data.

Here's GPT-3.5 reciting the Declaration of Independence: https://chat.openai.com/share/eb30c373-7fec-4280-892d-479567...

Unless you're claiming that GPT-3.5 is deriving the Declaration of Independence (from information about the founding fathers?) I don't see how there's room for debate about whether information has been "copied" into the model.

I have done this test in the past with copyrighted material (Harry Potter). They have since added safeguards against it, but my understanding is that the model is still capable of it.




You don't need to even read their law to know that they are speaking only of training and not of output. Otherwise, they would have just suddenly created the world's most obvious loophole. Create an 'LLM' that "trains" on some input and then categorically outputs each file, be it a movie, song, book, or whatever. You've now legalized copyright infringement (and distribution) of everything.

So their law is going to essentially come down to you can train your LLM on whatever you want, but can also be held liable for any infringing outputs.


Makes sense. Imagine having a tape recorder in your living room and starting it recording. Then you turn on your stereo. The music that comes out is recorded on your tape recorder.

Is that a violation of copyright? I'm not a lawyer but I think copyright legislation is about forbidding the production of "derived works". If you just record something but never play it back it is not a "derived work" is it? It only becomes a violation if you distribute it, make it available to others, and thus "produce a derived work".

So training an LLM is like recording. But if you use it as a means to distribute copies of copyrighted material without approval of its copyright holders then you are in violation.


Sure, but the key part there is "some of".

They're necessarily able to produce verbatim copies only of the most duplicated, most repeated, most cited works -- and it's precisely due to their popularity that they're the only things worth including verbatim.

I'm not going to opine on what the legality of that should be, but it's essentially the material considered most "quotable" in different contexts. I'm quite sure the entirety of Harry Potter isn't included, but I'm also sure that some of the most popular paragraphs probably are. It's analogous to the kind of stuff people memorize.

I'd expect an LLM to contain this stuff. If it didn't, it would be broken.

But there's a world of difference between copying all its training data (neither desirable nor occurring), versus being fluent in quotable stuff (both desirable and occurring).


> I'm quite sure the entirety of Harry Potter isn't included, but I'm also sure that some of the most popular paragraphs probably are. It's analogous to the kind of stuff people memorize.

No, you are wrong about this. There are good reasons to believe the model memorized the entirety of Harry Potter, as well as Fifty Shades of Grey, inclusive of unremarkable paragraphs, the kind of stuff people will never memorize. Berkeley researchers made a systematic investigation of this. See what I wrote elsewhere.


So, I looked at the table appendix you're referencing and I think you're overstating your case a bit.

Among books within copyright, GPT-4 can reproduce Harry Potter and the Sorcerer's Stone with 76% accuracy. This is, apparently, the highest accuracy GPT-4 achieved among all tested copyrighted books with 1984 taking a distant 2nd place at 57%.

With this in mind, we can verifiably say that GPT-4 is unusually good at specifically reproducing the first Harry Potter book. An unscrupulous book thief may very well be able to steal the first entry in the series... assuming that they're able to get past one quarter of the book being an AI hallucination.


You misread. They did not find 76% reproduction of the book. When asked to fill in a name within a passage, e.g. "Stay gold, [MASK], stay gold." Response: Ponyboy, GPT-4 got the name right 76% of the time.
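To make the distinction concrete, here is a minimal sketch of how a name-cloze score like that could be computed. It is not the paper's code: `toy_predict` is a hypothetical stand-in for a model call, and the second passage is made up for illustration. The point is only that the score counts correctly filled-in names, not reproduced text.

```python
from typing import Callable, Iterable, Tuple


def name_cloze_accuracy(
    examples: Iterable[Tuple[str, str]],
    predict: Callable[[str], str],
) -> float:
    """Fraction of passages where the predicted name matches the true one."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        predict(passage).strip().lower() == name.strip().lower()
        for passage, name in examples
    )
    return hits / len(examples)


# Hypothetical stand-in for a model call: it only "knows" one passage.
def toy_predict(passage: str) -> str:
    return "Ponyboy" if "Stay gold" in passage else "Harry"


examples = [
    ('"Stay gold, [MASK], stay gold."', "Ponyboy"),
    ("[MASK] closed the door quietly behind her.", "Alice"),  # made-up passage
]

# One of the two names is filled in correctly.
score = name_cloze_accuracy(examples, toy_predict)
print(score)
```

A high score on this kind of test suggests the model has seen the passages, but it says nothing about how much of the surrounding text it could recite.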


> You misread. They did not find 76% reproduction of the book. When asked to fill in a name within a passage, e.g. "Stay gold, [MASK], stay gold." Response: Ponyboy, GPT-4 got the name right 76% of the time.

What is the temperature / top_p setting producing that 76%? The default? If you dial down the randomness, would that number go up?
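For anyone unsure what those two knobs do, here is a rough sketch of the mechanics. The logits are made up, and this does not say which settings the paper used or whether the 76% would change. It only shows that a lower temperature concentrates probability on the most likely token, and that top_p drops the unlikely tail before sampling.

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    kept: Dict[str, float] = {}
    mass = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept.items()}


# Made-up logits for the masked name; not taken from any real model.
logits = {"Ponyboy": 2.0, "Johnny": 1.0, "Dally": 0.5}

default = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.2)
nucleus = top_p_filter(default, 0.5)

print(default["Ponyboy"], cold["Ponyboy"], list(nucleus))
```

If the right name is already the most likely token, less randomness can only help on that item. If the model's top guess is wrong, turning the temperature down just makes it wrong more consistently.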


I’m not sure it matters much that the current model can’t reproduce Harry Potter verbatim. If it can do smaller more quoted works now, it’ll tackle larger more obscure things in the future. It’s just a matter of time until it can output large copyrighted works, meaning the question of what to do when that happens is pretty relevant right now.


No it won't, because reproducing works verbatim is basically the definition of overtraining a model. That's a bug, not a feature.

A lot of further progress is going to be made towards making models smaller and more efficient, and part of that is reducing overtraining (together with progress in other directions).

Reproducing Harry Potter is a bug, because it's learning stuff it doesn't need to. So to the contrary, "it's just a matter of time" until this stuff decreases.


It says training, not inference.

I can read a copyrighted book legally and retain that information legally.

I can distill it (legally) but while I might be able to recite it, I’m not allowed to.

I think that is a reasonable framework around generative AI (after all, I am allowed to count the words in Harry Potter, so statistical modeling of copyrighted material has legal precedent).

The problem with AI is of course the blurred border between a model and data compression.

We can’t see the data in the model, but we can apply software to execute the model and extract both novel and sometimes even copyrighted data.

Similarly, we can’t see the data in a zip file without extra software, but if that software lets us extract both copyrighted and copyright-free data, we’d still consider distribution a violation.
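A small sketch of the zip half of that analogy, using Python's standard `zlib` module and the (public domain) opening of the Declaration of Independence. The repetition count is arbitrary. It illustrates lossless compression only; a model is not a zip file, and nothing here shows how much a model retains.

```python
import zlib

opening = (
    "When in the Course of human events, it becomes necessary for one people "
    "to dissolve the political bands which have connected them with another"
)
# Repeat the passage so there is plenty of redundancy to compress (arbitrary count).
original = ((opening + "\n") * 20).encode("utf-8")

compressed = zlib.compress(original)

phrase = b"human events"
visible_in_original = phrase in original
# Usually False: the compressed bytes are entropy-coded, not plain text.
visible_in_compressed = phrase in compressed

# Running the right software brings every byte back.
restored = zlib.decompress(compressed)

print(len(original), len(compressed), visible_in_original, visible_in_compressed)
```

The text is not readable in the compressed bytes, yet it is fully recoverable with the right tool, which is why distributing the archive is treated like distributing the work.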


Adjacent to copyrights are private and confidential data. It’ll be interesting to see how Japan’s legal framework around this handles private data.


For detailed investigation of this phenomenon, see Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4: https://arxiv.org/abs/2305.00118


Pretty good argument but it has one fatal flaw. People can memorize the Declaration of Independence too. Or Harry Potter. If people mostly recite HP from memory but apply enough creative changes, it's not copyright infringement.

So proving a system can memorize and recite proves nothing.


How does this make sense? Memorizing and then reciting copyrighted works is still infringement in a lot of commercial contexts.


The reciting part is illegal, but as long as it is trained not to recite things in full (or to whatever limit the law determines), then it should be fine.


Try publishing Harry Potter but changing all the proper nouns and using synonyms for all the adjectives.

It's gonna be copyright infringement.

You can even cut a few scenes and make up a few scenes entirely, too. You're still getting busted.


Yes, that’s why I am saying they will have to ensure the LLM doesn’t do that.


Reciting is a violation of copyright.

Creatively transforming the work and applying it to some tasks may not be a violation.


These aren’t people. Just because we can find commonalities in learning and memorization does not mean we can ignore everything else that differs.


"copying" != "copyright infringement": I'm just saying that the LLMs are copying, and I'm not getting into the legal/societal question of whether we want that to be illegal or not.

We as a society have determined that certain sorts of non-consensual copying are allowed: "fair use" broadly, and maybe you can consider "mental copying" in this category. Maybe we'll add LLM training to the list? It's not like copyright rules are a law of nature: we created them to try to produce the society that we want, and this is an ongoing process.

Again, I think there are fascinating questions 1) does LLM training violate existing copyright law + case law or does it maybe fall under a fair use exemption, and 2) is that what we want. But I think "do LLMs make copies" is dull and trivial and I don't know why it comes up.


The AI isn’t a person. Jesus. It’s not the same.


Derivative work is not protected from copyright. As long as the “user” of the model does their due diligence, and ensures they are not infringing on copyrights - they are golden.

But herein lies the challenge. Are there reasonable methods available to “users” for checking their works against infringement?

I don’t think so. We’ll need a centralized searchable database of all copyrighted work. Who is going to build that? To make matters more complicated, every country has their own copyright certification process. Maybe Google with its means can build something like this.
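For a sense of what such a check might look like at toy scale, here is a minimal sketch that flags word n-gram overlap between a candidate text and one reference text. The function names and the n-gram size of 5 are assumptions made for illustration. A real service would need the database described above, and an overlap score is not a legal test of infringement.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All runs of n consecutive lowercase words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(candidate: str, reference: str, n: int = 5) -> float:
    """Share of the candidate's n-grams that also occur in the reference."""
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0  # candidate is shorter than n words
    ref = word_ngrams(reference, n)
    return len(cand & ref) / len(cand)


# First two lines of Frost's "Fire and Ice" (1920), punctuation removed.
reference = "some say the world will end in fire some say in ice"
copied = "Some say the world will end in fire"
fresh = "the weather report says it will rain again on friday"

print(overlap_ratio(copied, reference), overlap_ratio(fresh, reference))
```

Exact-match checks like this miss paraphrase and light rewording, which is part of why the problem is hard.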

In any case, this is uncharted territory.


BigCode seems to acknowledge this problem and provides a search tool for the dataset used to train their StarCoder model.

https://huggingface.co/spaces/bigcode/search


Thought experiment: Say you make a big list of words and pleasing combinations of them (I have actually done something similar to make a fantasy RPG name generator.) Now convert that list into a Markov chain or whatever and quasi-randomly generate some short lengths of text. Eventually you might generate copyright-infringing haiku and short poems. Does your data/algorithm violate copyright by itself? Very doubtful; you wrote it all yourself. Only publishing the output violates copyright. (See also: http://allthemusic.info/)

So if that's legal, how about if, instead of entering the data manually, you write an algorithm to scan poetry and collect statistics about the words in it. Should the legal distinction be any different since all you did was automate the manual process above?
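A minimal sketch of the automated version, assuming a word-level, first-order Markov chain and a made-up one-line corpus. It records which word follows which, then walks those statistics quasi-randomly.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional


def build_chain(text: str) -> Dict[str, List[str]]:
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain: Dict[str, List[str]] = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain: Dict[str, List[str]], start: str, length: int,
             seed: Optional[int] = None) -> str:
    """Walk the chain from `start`, choosing each next word at random."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length and out[-1] in chain:
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)


# Made-up corpus; a real experiment would scan many poems.
corpus = "the moon rises over the hill and the river runs under the moon"
chain = build_chain(corpus)
sample = generate(chain, "the", 8, seed=0)
print(sample)
```

Every adjacent word pair in the output occurs somewhere in the input, so with a small enough corpus and enough samples the generator will sometimes reassemble a line it was built from.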

Or what if you used a big list of the titles of poetry, which isn't even copyrightable information by itself? You may still succeed in extracting the aesthetic intent of the authors, and a statistical model can plausibly use that to generate copyright-infringing work.

Remember, we're not talking about generating novels or paintings here, just 20 words or so (whatever the bare minimum copyrightable amount is) in trillions of generated permutations.

You can see where I'm going with this. If those examples are legal, is there a cut-off for more complex statistical systems? Good luck figuring that out in a court of law.


> Remember, we're not talking about generating novels or paintings here, just 20 words or so (whatever the bare minimum copyrightable amount is)

From https://fairuse.stanford.edu/2003/09/09/copyright_protection...:

Copyright laws disfavor protection for short phrases. Such claims are viewed with suspicion by the Copyright Office, whose circulars state that, “… slogans, and other short phrases or expressions cannot be copyrighted.” [1] These rules are premised on two tenets of copyright law. First, copyright will not protect an idea. Phrases conveying an idea are typically expressed in a limited number of ways and, therefore, are not subject to copyright protection. Second, phrases are considered as common idioms of the English language and are therefore free to all. Granting a monopoly would eventually “checkmate the public” [2] and the purpose of a copyright clause, to encourage creativity, would be defeated.


You could still plausibly generate (a significant portion of), let's say, "Fire And Ice" by Robert Frost, which is only 50 words.

See also: https://blogs.harvard.edu/ethicalesq/haiku-and-the-fair-use-...


If I were the copyright holder of such work, I would argue that the LLM was trained on text, including my copyrighted work, and that if the system produced text that a reasonable person who reads poetry would identify as the copyrighted work, the burden is then logically on the LLM owner to prove the LLM didn't regurgitate a piece of text from something it previously ingested.

I think a jury would side with my argument.


The issue isn't that a generator lets you evade copyright somehow; it doesn't. The output is not the issue. If I sit in paint and my assprint happens to perfectly duplicate a Picasso, that's unlikely to fly in court if I try to sell copies. Picasso painted it first.

The point at issue here is that some people are arguing that the models themselves are like a giant collective copyright infringement, since they are in a vague sense simply a sum of the copyrighted works they were trained on. Those people would like to argue that distributing the models or even making use of them is mass copyright infringement. My thought experiment is a reductio ad absurdum of that reasoning.


I see your point now.


I'm not sure where we're going with the output in these examples.

So let's say there's a human-written poem that's copyrighted.

Let's say a human completely coincidentally writes an identical poem.

"Accidentally" producing the same poem wouldn't give the second human any claim to copyrighting or distributing their coincidentally-identical poem.

And if GPT accidentally copies large chunks of Harry Potter or Frozen or whatever other popular work, that new creation will have the same problems.

But what does that say about whether we should also restrict the use of copyrighted material in training? The fact that some algorithm, or some person, can coincidentally duplicate a copyrighted work without directly reading it doesn't seem to relate to the case of building a model by explicitly using the copyrighted material.


The owners of intellectual property still hold the copyright. The law refers to the training of neural networks. It doesn't really change anything whether you use another person's work by simple copy and paste or by overfitting a generative model: the owner of the work still has the copyright on it.


> as a technical matter it's pretty obvious that LLMs create copies of (some of) their training data.

Browsers also create copies of the viewed data. Computers hold in memory a copy of everything they're working on.

The central point is for how long, and to what purpose. This law is not about making copies or not, but what happens after.


I am so excited to see what happens when Japan forces all closed source software and Disney cartoons into the corpus out of fairness.

Seems like there should be no complaint, right? It's not like anyone can see the Windows 11 source code, it's only being used for training.


The things that an LLM is likely to contain a complete verbatim copy of are things that are a) short b) widely repeated to the point that they're embedded into our culture - and by that token those things are almost certainly not copyrightable.


Is a bar in a song not "short"?

Try putting one of those in your book and not getting sued for copyright.


If you literally mean a bar, yes those are short, likely a couple of words, and you put those in books all the time and don't get sued. ("The answer my friend, is blowing in the wind" is 4 bars, and I've seen books quote it verbatim without a second thought). Likewise, plenty of people put the entire Declaration of Independence in their book without a second thought, and I assume don't get sued for it.

If you're talking about a verse or more of something that's not quite so culturally pervasive (people put the whole of the Star-Spangled Banner in their books, again without a second thought), well, at that point it's probably not something that an LLM would reproduce verbatim.


Typically things like this are covered under fair use if you're dealing with a human.


> Unless you're claiming that GPT-3.5 is deriving the Declaration of Independence (from information about the founding fathers?)

This would make a fun short story - “ChatGPT, author of the Quixote”



