Hacker News

Size of LLM: <64 GB.

Size of training data: fuck knows, but The Pile alone is 880 GB. Public GitHub's gonna be measured in TB. A Common Crawl snapshot is about 250 TB.

There's physically not enough space in there to store everything it was trained on. The vast majority of the text the chatbot was exposed to cannot be pulled out of it, as this paper makes clear.
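A rough back-of-envelope makes the point, using only the approximate sizes quoted above (and ignoring GitHub entirely):

```python
# Rough capacity comparison, all figures approximate and taken from the
# sizes mentioned in this thread.
model_gb = 64            # upper bound on model size
pile_gb = 880            # The Pile
common_crawl_gb = 250_000  # ~250 TB for one Common Crawl snapshot

ratio = (pile_gb + common_crawl_gb) / model_gb
print(f"training text is roughly {ratio:.0f}x larger than the model")
```

Even on these loose numbers, the training text is thousands of times larger than the model, so verbatim storage of everything is arithmetically impossible.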

I'm guessing that the cases where great lumps of copyright text can be extracted verbatim are down to repetition in the training data? There's probably a simple fix for that.

(I'm only talking about training here. The initial acquisition of the data clearly involved massive copyright infringement).



> The initial acquisition of the data clearly involved massive copyright infringement.

I don't find this to be true in the USA, because Google already covered this ground and the doctrine of transformative Fair Use was born.


It's the download. I don't think you can download The Pile without infringing.

17 U.S. Code § 106 covers reproduction, not just redistribution (IANAL).

As I said, I'm separating the acquisition of the data from training on it, because I believe the first is an infringing act, while the second is (in the general case) not.


> Because Google already covered this ground and the doctrine of transformative Fair Use was born.

Fair Use, and the question of whether a work is transformative as one of its factors, is much older than Google; I'm not sure what specific Google precedent you think is relevant here.


You'd have a very hard time legally distinguishing this from "compressing a copyrighted work" though.


To get out the original data from a compressed file, you just need to know the algorithm used (and for almost all formats, the file tells you).

To get out the original data from an LLM, you need to supply... the original data. Or at least, a big chunk of it.

The actually copyrightable chunk of it, arguably, since what a LLM can generate on its own is only its most predictable, unoriginal, generic chunks. Things it's seen a thousand times.

Which may turn out to be an uncomfortably high % of most creative works.
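The asymmetry above can be sketched concretely. A minimal illustration, assuming nothing beyond Python's standard `zlib` module (the `model.generate` call at the end is purely hypothetical, standing in for whatever LLM you'd be probing):

```python
import zlib

text = b"It was the best of times, it was the worst of times. " * 100

# A compressed file reconstructs the original knowing only the algorithm;
# the file itself carries everything needed.
compressed = zlib.compress(text)
restored = zlib.decompress(compressed)  # no fragment of the original required
assert restored == text

# Extraction attacks on LLMs, by contrast, typically prompt with a chunk
# of the original text and hope the model continues it verbatim:
prefix, continuation = text[:64], text[64:]
# model.generate(prefix)  # hypothetical: does it emit `continuation`?
```

The decompression side needs zero knowledge of the content; the extraction side needs you to already hold a piece of the very work you're trying to recover.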


I wouldn't, because the vast majority of the copyrighted works it was trained on are not present in the model, and the model can't be persuaded to spit them out at any reasonable level of fidelity (as the paper points out).

The comparatively few that are should be fixed.

If you want to argue that the act of training is in itself infringing, even if it doesn't result in a copy... well, I'd enjoy seeing you make that argument.


I'd be happy to take "reasonable level of fidelity", plus the claim that the act of training itself is infringing, to a jury. I feel like it's going to look way more like "feeding into a copy machine" than "teaching a toddler" or whatever.


"the act of training is in itself infringing, even if it doesn't result in a copy"

"I'd be happy to take "reasonable level of fidelity" + yes, the act of training itself as infringing to a jury"

They're not the same thing. At all. The comparatively few that are [extractable] should be fixed. I already said that.

The vast majority of texts CANNOT BE EXTRACTED. Are they still infringing?


Oh, I'm aware that they're not the same. I suppose I'm thinking more like a "real-life lawyer."

At this stage, you can't just declare something "infringing or not"; that's the point of trials.

What I'm saying is -- you make good points -- but I like my chances in front of a jury with my explanation against yours.


This might be of interest, if you haven't already seen it: https://www.bbc.co.uk/news/articles/c77vr00enzyo

Could be reversed by a higher court of course, but it seems like it establishes that pirating and training are two different "crimes". (Or three - see the bit about "infringing knock-offs").



