Hacker News

Size of LLM: <64 GB.

Size of training data: fuck knows, but The Pile alone is 880 GB. Public GitHub's gonna be measured in TB. A Common Crawl snapshot is about 250 TB.

There's physically not enough space in there to store everything it was trained on. The vast majority of the text the chatbot was exposed to cannot be pulled out of it, as this paper makes clear.
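A rough back-of-envelope makes the point, using only the approximate sizes quoted above (and ignoring GitHub entirely):

```python
# Rough capacity comparison, all figures approximate and taken from the
# sizes mentioned in this thread.
model_gb = 64            # upper bound on model size
pile_gb = 880            # The Pile
common_crawl_gb = 250_000  # ~250 TB for one Common Crawl snapshot

ratio = (pile_gb + common_crawl_gb) / model_gb
print(f"training text is roughly {ratio:.0f}x larger than the model")
```

Even on these loose numbers, the training text is thousands of times larger than the model, so verbatim storage of everything is arithmetically impossible.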

I'm guessing that the cases where great lumps of copyright text can be extracted verbatim are down to repetition in the training data? There's probably a simple fix for that.

(I'm only talking about training here. The initial acquisition of the data clearly involved massive copyright infringement).



> The initial acquisition of the data clearly involved massive copyright infringement.

I don't find this to be true in the USA, because Google already covered this ground and the doctrine of transformative Fair Use was born.


It's the download. I don't think you can download The Pile without infringing.

17 U.S. Code § 106 covers reproduction, not just redistribution (IANAL).

As I said, I'm separating the acquisition of the data from training on it, because I believe the first is an infringing act, while the second is (in the general case) not.


> Because Google already covered this ground and the doctrine of transformative Fair Use was born.

Fair Use, and the question of whether a work is transformative as one of its factors, is much older than Google; I'm not sure what specific Google precedent you think is relevant here.


You'd have a very hard time legally distinguishing this from "compressing a copyrighted work" though.


To get out the original data from a compressed file, you just need to know the algorithm used (and for almost all formats, the file tells you).

To get out the original data from an LLM, you need to supply... the original data. Or at least, a big chunk of it.

The actually copyrightable chunk of it, arguably, since what a LLM can generate on its own is only its most predictable, unoriginal, generic chunks. Things it's seen a thousand times.

Which may turn out to be an uncomfortably high % of most creative works.
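The asymmetry above can be sketched concretely. A minimal illustration, assuming nothing beyond Python's standard `zlib` module (the `model.generate` call at the end is purely hypothetical, standing in for whatever LLM you'd be probing):

```python
import zlib

text = b"It was the best of times, it was the worst of times. " * 100

# A compressed file reconstructs the original knowing only the algorithm;
# the file itself carries everything needed.
compressed = zlib.compress(text)
restored = zlib.decompress(compressed)  # no fragment of the original required
assert restored == text

# Extraction attacks on LLMs, by contrast, typically prompt with a chunk
# of the original text and hope the model continues it verbatim:
prefix, continuation = text[:64], text[64:]
# model.generate(prefix)  # hypothetical: does it emit `continuation`?
```

The decompression side needs zero knowledge of the content; the extraction side needs you to already hold a piece of the very work you're trying to recover.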


I wouldn't, because the vast majority of the copyrighted works it was trained on are not present in the model, and the model can't be persuaded to spit them out at any reasonable level of fidelity (as the paper points out).

The comparatively few that are should be fixed.

If you want to argue that the act of training is in itself infringing, even if it doesn't result in a copy... well, I'd enjoy seeing you make that argument.


I'd be happy to take "reasonable level of fidelity", plus the claim that the act of training itself is infringing, to a jury. I feel like it's going to look way more like "feeding into a copy machine" than "teaching a toddler" or whatever.


"the act of training is in itself infringing, even if it doesn't result in a copy"

"I'd be happy to take "reasonable level of fidelity" + yes, the act of training itself as infringing to a jury"

They're not the same thing. At all. The comparatively few that are [extractable] should be fixed. I already said that.

The vast majority of texts CANNOT BE EXTRACTED. Are they still infringing?


Oh, I'm aware that they're not the same. I suppose I'm thinking more like a "real-life lawyer."

At this stage, you can't just declare something "infringing or not"; that's the point of trials.

What I'm saying is -- you make good points -- but I like my chances in front of a jury with my explanation against yours.


This might be of interest, if you haven't already seen it: https://www.bbc.co.uk/news/articles/c77vr00enzyo

Could be reversed by a higher court of course, but it seems like it establishes that pirating and training are two different "crimes". (Or three - see the bit about "infringing knock-offs").



