> I think you are on the right track but for me personally it really depends on how difficult it was to produce the result. Like if you enter "spit out harry potter and the philosophers stone" and it does, that's black and white. But if you have to torture a repeated prompt to force the model to ignore its constraints, that's not exactly using the system as intended.
Let me offer a different perspective. Having an LLM that is trained on copyrighted material, memorized (or lossily compressed) it, and then has some "safety" machinery that tries to avoid near-verbatim outputs of copyrighted material is fundamentally not distinguishable from simply having a plaintext database of copyrighted material with machinery for "fuzzy" data extraction from said material.
Suppose a company stores the whole of stack exchange in plaintext, then implements a chat-like interface that fuzzy matches on question, extracts answers from plain-text database, fuzzes top-rated/accepted answers together and outputs something, not necessarily quoting one distinct answer, but pretty damn close.
How much "fuzziness" is required for this to stop being a copyright violation? LLM advocates say that LLMs are "fuzzy enough" without clearly defining what "enough" means.
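To make the thought experiment concrete, here is a minimal sketch of such a system: a "chat" interface that is really just fuzzy retrieval over a plaintext answer store. The corpus, the helper name, and the similarity cutoff are all made up for illustration; the point is only that the fuzzy step in the middle does not change what comes out.

```python
import difflib

# Hypothetical plaintext store of question -> accepted answer.
CORPUS = {
    "how do i reverse a list in python": "Use lst[::-1] or lst.reverse().",
    "how do i read a file line by line": "Iterate over the file object: for line in f: ...",
}

def fuzzy_answer(question, cutoff=0.6):
    """Return the stored answer whose question best matches the input."""
    matches = difflib.get_close_matches(question.lower(), CORPUS, n=1, cutoff=cutoff)
    return CORPUS[matches[0]] if matches else None

print(fuzzy_answer("How do I reverse a list in Python?"))
# -> Use lst[::-1] or lst.reverse().
```

However "fuzzy" the matching is, the output is a near-verbatim copy of the stored answer, which is exactly the distinction the argument above is questioning.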
>Let me offer a different perspective. Having an LLM that is trained on copyrighted material, memorized (or lossily compressed) it, and then has some "safety" machinery that tries to avoid near-verbatim outputs of copyrighted material is fundamentally not distinguishable from simply having a plaintext database of copyrighted material with machinery for "fuzzy" data extraction from said material.
Right, so sort of like a search engine that caches thumbnails of copyrighted images to display quick search results? Something I have been using for years and have no issue with, where the legal arguments are framed more around where the links go, and how easy the search engine makes it for me to acquire the original image?
Would your argument be the same if it was a human? If a person memorizes a book verbatim, but uses common sense not to transcribe the book for others because that would be copyright infringement, do we disallow him from using the memorized information at all because he could duplicate it?
I’m saying that it doesn’t matter what humans do; this machine isn’t a human.
There is no reason to believe that humans and machines should be the same under the law.
The clearest example of this is that in the US it’s already been decided that AI-generated art can’t be copyrighted, because it was made by a computer rather than a person. Same as with the monkey selfie.