
There are many competing providers of commercial LLMs with comparable capabilities, so another vendor would probably be happy to serve a leading Western market of 83 million people.


Yeah? Which commercial provider’s model do you think was trained without using lyrics?


The point is that some other vendor will do the work to implement the filtering required by Germany even if OpenAI doesn't.


I would imagine providers who want to comply will scan the LLM's output and pay a license fee to the owner if it contains lyrics.
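
A minimal sketch of what that output scanning could look like, assuming a hypothetical corpus of licensed lyrics (the word-n-gram matching and the n=8 threshold are arbitrary illustrative choices, not any vendor's actual filter):

    def ngrams(text, n=8):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def needs_license(output, licensed_lyrics, n=8):
        # flag outputs that share a long word n-gram with any known lyric
        out = ngrams(output, n)
        return any(out & ngrams(lyric, n) for lyric in licensed_lyrics)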


They already scan for commercial works. Isn’t the law about training, not output?


Perhaps; I didn't read the court ruling.

But I'd be surprised if that was generally the case. It's easy to see why ChatGPT 1:1 reproducing a song's lyrics would be a copyright issue. But creating a derivative work based on the song?

What if I made a website that counts the number of alliterations in certain songs' lyrics? Would that be copyright infringement, because my algorithm uses the original lyrics to derive its output?

If this ruling really applied to any algorithm deriving content from copyright-protected works, it would be pretty absurd.

But absurd copyright laws would be nothing new, so I won't discount the possibility.


> But creating a derivative work based on the song?

1. it wouldn't matter, as a derivative work still needs a license for the original

2. except if it's not derivative but just inspired,

and the court case was about it being pretty much _the same work_

OpenAI's defense also wasn't that it's derived or inspired but, to quote:

> Since the output would only be generated as a result of user inputs known as prompts, it was not the defendants, but the respective user who would be liable for it, OpenAI had argued.

and the court order said, more or less:

- if it can reproduce the song lyrics it means it stored a copy of the song lyrics somehow somewhere (memorization), but storing copies requires a license and OpenAI has no license

- if it outputs a copy of the song lyrics, it's making another copy of them and giving it to the user, which is copyright infringement

and this makes sense: if a human memorizes a song and then writes it down when asked, it still is, and always has been, copyright infringement (else you could launder copyright by hiring people to memorize things and then write them down, which would be ridiculous).

and technically speaking, LLMs are at their core a lossy compressed store of their training content plus statistical models about it. And to be clear, that isn't some absurd around-five-corners reasoning; it's a pretty core aspect of their design, and it was well known even before LLMs became a big deal and OpenAI got huge investments. OpenAI pretty much knew this was a problem from the get-go. But like any recent big US "startup", following the law doesn't seem to matter.

it technically being an unusual form of lossy compressed storage means that the memorization counts as copyright infringement (under current law)

but I would argue the law should be improved here, so that under some circumstances "memorization" in LLMs is treated like "memorization" in humans (i.e. not an illegal copy until you make it one by writing it down). But you can't make it all circumstances, because as mentioned you can use the same tech for what is basically lossy file compression, and you don't want people to launder copyright by training an LLM on a single text/song/movie and then distributing that...


That seems like a really broad interpretation of "technically memorization" that could have unintended side effects (like, say, banning equations that could be used to generate specific lyrics), but I suppose some countries consider loading into RAM a copy already. I guess we're already at absurdity.


> but I suppose some countries consider loading into RAM a copy already. I guess we're already at absurdity

FYI, most do. Have a look at many software licenses. In particular Microsoft (who, as we know, invested a lot into OpenAI) will argue it is so.

I would also say it makes sense. If it weren't the case, you could just load a program onto lots of computers using only a single license/installation medium.


I think it's absurd. In my opinion the relevant copy is the copying of the usable part (e.g. installation).

Is running a program making a copy? If I run it on some distributed system is it then making more copies than allowed? This gets insane quickly.

I think it's just a bandaid for fixing removable drive installations. These should have had their own laws/rules/etc.

It has knock-on effects, like being able to enforce other IP law against someone you just licensed your software to.

Similarly, I think this is more of an "interpret the words to get the desired outcome instead of the likely spirit or meaning of the words" situation.


It _really_ isn't absurd.

The law doesn't care what technical trickery you use to encode/compress copyrighted material. If you take data and create an equation based on it which can reproduce the data trivially, then yes, IMHO obviously, this form of embedding copyrighted data is still embedding copyrighted data.

Think about it: if that weren't the case, I could just transform a video into an equation system and then distribute the latest movies, books, whatever to everyone without permission and without violating copyright, even though de facto I'm doing exactly what copyright law is supposed to prevent... (1)

Just because you come up with a clever technical trick to encode copyrighted content doesn't mean you can launder/circumvent copyright law, or any law for that matter. Law mostly doesn't care about technical tricks but about outcomes.

Maybe even more importantly, LLMs under the hood are basically compression systems at their core: by not giving them enough entropy to store the information outright, you force them to generalize, and with that they happen to create an illusion of sentience.

E.g. what is the simplest case of training a transformer? You put in data to create the transformer state (which has much smaller entropy), reproduce the data from that state, and then find a "transformation" where this works as well as possible for a huge amount of different data. That is a compression algorithm! And sure, in reality it's more complex: you don't train to compress one specific input, but more like a dictionary of "expected" input->output mappings, where the output parts need to be fully embedded, i.e. memorized, in the algorithm in some form.
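
To make that concrete, here is a toy sketch (plain PyTorch; the string, model shape and hyperparameters are made up for illustration, nothing like production training) that overfits a tiny next-character model on one string and then regenerates that string from the weights alone, i.e. the weights act as a stored copy:

    import torch
    import torch.nn as nn

    text = "imagine this string is a protected song lyric"
    vocab = sorted(set(text))
    stoi = {c: i for i, c in enumerate(vocab)}
    itos = {i: c for c, i in stoi.items()}
    ids = torch.tensor([stoi[c] for c in text])

    ctx = 8  # context window
    X = torch.stack([ids[i:i + ctx] for i in range(len(ids) - ctx)])
    Y = ids[ctx:]

    model = nn.Sequential(
        nn.Embedding(len(vocab), 16),
        nn.Flatten(),  # (batch, ctx * 16)
        nn.Linear(ctx * 16, 128), nn.ReLU(),
        nn.Linear(128, len(vocab)),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(500):  # overfit on purpose
        opt.zero_grad()
        nn.functional.cross_entropy(model(X), Y).backward()
        opt.step()

    # the weights now *store* the text: regenerate it from the first 8 chars
    out = list(text[:ctx])
    for _ in range(len(text) - ctx):
        x = torch.tensor([[stoi[c] for c in out[-ctx:]]])
        out.append(itos[model(x).argmax().item()])
    print("".join(out))  # reproduces the training string from the weights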

LLMs are basically obscure, multi-layered, hyper-dimensional lossy compression systems which compress a simple input->output mapping (i.e. a database) defined by all the entries in their training data. A compressed mapping which, due to the forced limited entropy, has to do its compression through generalization...

And since when does compression allow you to avoid copyright?

So if you want it to be handled differently by law because it isn't used as a compressed database, you have to special-case it in law.

But it is used as a compressed database; in this case it was used to look up lyrics based on some clues. That's basically a lookup in a lossy, compressed, obscure database system, no matter how you would normally think about LLMs.

(1): And in case it's not clear, this doesn't mean every RNG is a violation just because under some unknown seed it would probably reproduce copyrighted content. The RNG wasn't written "based on" the copyrighted content.


In regards to "the RNG wasn't written 'based on' the copyrighted content":

Does that mean I can distribute the seed if I find one and this RNG wasn't trained on that content?

Does it prevent me from sharing that number on the internet?

It seems like there's a lot of subjective intent here, which I'm extremely skeptical of.

For an LLM also:

If it's lossy enough that it needs RAG to fix the results is that okay?

-------------------

In my opinion, actually getting the output is where the infringement happens. Having and distributing the LLM weights shouldn't be infringement (in my head) because of the enforceability of results. Otherwise you risk banning RNGs, or everyone being forced to prove they didn't train on copyrighted content.


> If it's lossy enough that it needs RAG to fix the results is that okay?

but then the only way RAG can "fix" the result is if the RAG system stored the song text in its vector database

in which case the legal case, and the solutions to fix the issue, are much clearer

in a certain way an LLM which only encodes language but not knowledge, and then uses RAG and similar, is the most desirable (not just for copyright reasons but also for e.g. update-ability, traceability, and removability of misinformation)

sadly, AFAIK that doesn't work, as language and knowledge details are too interleaved
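
For illustration, a minimal sketch of why RAG doesn't sidestep the storage question: the vector store keeps the verbatim text right next to its embedding (embed() here is a random stand-in for a real embedding model, and all names are made up):

    import numpy as np

    store = []  # the "vector database": (embedding, original text) pairs

    def embed(text):
        # stand-in for a real embedding model
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        return rng.standard_normal(384)

    def index(doc):
        store.append((embed(doc), doc))  # the lyric itself is stored verbatim

    def retrieve(query):
        q = embed(query)
        return max(store, key=lambda entry: float(entry[0] @ q))[1]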

> Does that mean I can distribute the seed if I find one and this RNG wasn't trained on that content?

honestly, I think this falls outside the situations copyright law considers. But if you consider that copyright law mostly doesn't care about technical implementation details, and that the "spirit of the law" (the lawmaker's intent) matters in unclear cases, I have a best-guess answer:

Neither the RNG nor the seed by themselves are a copyright violation, but if you spread them with the intent to spread a non-licensed copy, you are still committing a copyright violation, and in that context the seed might, say, be taken down from sharing sites even though by itself it isn't a copyright violation.

The thing is, in the end you can transform _any_ digital content into

- "just a number"

- or "just an equation", an "equation system", etc.

- or an image, a matrix, a graph, human-readable text, or pretty much anything
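
The "just a number" case is easy to make literal: any bytes round-trip through a single integer, so distributing that integer is distributing the work (placeholder content, obviously):

    def to_number(data: bytes) -> int:
        return int.from_bytes(data, "big")

    def from_number(n: int, length: int) -> bytes:
        return n.to_bytes(length, "big")

    song = "entire lyric goes here".encode()  # placeholder content
    n = to_number(song)                       # one (huge) integer
    assert from_number(n, len(song)) == song  # an identical copy comes back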

so fundamentally you can't have a clean cut between what can and can't be a copyright violation

which is why it matters so much that law acts on a higher abstraction level than what exactly happens technically.

And why the intent of the law matters so much (in gray-area cases).

And why law really shouldn't be a declarative definition of strict mathematical rules.


> But creating a derivative work based on the song?

You need a license to create derivative works.


they clearly didn't do that properly, or we wouldn't have the current lawsuit

the lawsuit was also not about whether it is or isn't copyright infringement. It was about who is responsible (OpenAI, or the user who tries to bait it into making another illegal copy of song lyrics).

A model outputting song lyrics means it has them stored somehow, somewhere. Just because the storage is a lossy, compressed, obscure, hyper-dimensional transformation of some kind doesn't mean it didn't store an illegal copy; otherwise it wouldn't have been able to output them. _Technical details do not (in general) protect you from legal responsibilities._

you could (maybe should) add new laws which in some form treat things memorized by LLMs the same as things memorized by humans, but currently LLMs have no special legal treatment when it comes to storing copies of things.


No, it’s specifically about producing big chunks of lyrics (mostly) verbatim in the output. The court’s PR specifically mentioned memorization (retaining training data) multiple times.



