Yeah, it's surprising to me how many people who were previously skeptical of IP laws in general are all on board to use them to attack AI companies, advocating that a copyright holder should have the right to control exactly how their work can be used by any downstream system.
The nightmare scenario I see is that if this principle is established for AIs, it will not be restricted to AIs. Human artists too will eventually be required to create new commercial works only under license, based on the artistic works they have been exposed to. The result will be that it is impossible to create art commercially except by working for a large content company with a large existing IP library and cross-licensing agreements with other rights holders.
And if you think this is an impossible scenario, ask yourself when lawyers, lobbyists and politicians have ever shrunk from expanding legal rights for corporations at the expense of individuals, even when the vast majority of people would not agree to the expansion. Today's absurdity is tomorrow's legal reality, and the chance to fully control creative expression will be far too tempting to turn down.
Yeah, a lot of the objections I've heard about AI revolve around "it can copy an artist's style" or "they didn't intend it to be used that way", both of which are mostly legal when done by humans today, either because they qualify as fair use or because the thing copied isn't copyrightable.
I don't want to change this! Let's not use "AI" as an excuse to tighten copyright even further. Last thing I want is more "Blurred Lines" style lawsuits[0].
If I try to use Bing Chat for Enterprise to generate a picture of a cat in a top hat in Tim Burton's style, I receive an "Oops, you've violated our content policy" error message.
There is extensive case law to answer these questions, and I don't think they're the right questions to ask. The two big questions imo are "What does it mean that models can disgorge copyrighted material upon request?" and "Who owns material generated by a model?"
Aha. If I look at your credit card for ten minutes, then go shopping on eBay using the information and order stuff that I desire, and then sell what I bought for a profit, am I breaking the law?
It depends on how much I'm willing to deceive myself. Is it outright theft? Is it pure luck? Is it divine intervention?
The algorithm leaks 1:1 personal details of people, which means they didn't care to sanitize the data. That's an obvious oversight; you're not really doing a great job of playing devil's advocate, if you ask me.
> The algorithm leaks 1:1 personal details of people
Which would usually require prompting about certain people or about certain bodies of personal information. What does this have to do with copyright? Claiming that a model as a whole infringes (infringes on which works?) is separate from claiming that a particular model output infringes on a particular work. Under current copyright law I think that the question of infringement doesn't apply to models. How do you compare an apple tree to a pear? Copyright is about actual similarity between new works and actual existing works first, method of creation (including theoretical reproduction of style, style not being an actual work) second.
Pasting what I wrote in a different comment:
My argument is that similarity of new works (including AI model outputs) to actual works (not to non-existing works, existing styles, or theoretical aggregations of multiple works) is a prerequisite to infringement. From an article about the substantial similarity test in the US [1]:
> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.[1][3]
In order to get an infringing output, the user usually has to include a reference to an existing author or an existing work in the prompt. Sometimes that's not the case (which I've worried about with respect to Copilot), but in order to damn the model as a whole, you would have to establish that in over some percentage of cases the model produces infringing outputs for prompts which don't reference a particular author (whether individual or collective), a particular work, or a style strongly associated with a single or a few authors.
So does Google search. There was a whole Supreme Court case about this which said Google showing small excerpts from books in the search results was fair use.
If AI companies and their benefactors are so concerned about this, they should be pushing for shorter copyright terms and more fair use exemptions. AFAIK they are not. Probably because they're hoping to lock up the AI models and possibly their output under the same onerous copyright system they flouted when it was convenient for "training". Rules should be applied fairly. It's galling to see small creators get content taken down by DMCA strikes and then have large, well-funded players scoop up everything they can with no issue.
"Skeptical of IP law" does not mean I believe IP law should not exist.
I believe copyright terms in the US are too long and should be more like 20 years, but I do not believe everything should be a free for all where any person or any program can immediately reuse any idea.
I agree 100pc. I also think it's important to distinguish between IP violation for personal purposes and violation for profit.
If someone is using my IP for personal enjoyment or learning that's great but if they're using it to make megabucks like openAI and others intend to do then I want my cut.
I disagree. Protecting ideas for a limited duration is how you incentivize creation.
If someone comes up with a new machine, why publish it if it's legal for every factory and large company to start producing them without compensating the inventor? Why sell it yourself if any big company can immediately clone it and undercut your price?
As for "works", I also disagree. Consider music: two artists may have unique performances of a piece, but the composition exists separate from either.
> If someone comes up with a new machine, why publish it if it's legal for every factory and large company to start producing them without compensating the inventor? Why sell it yourself if any big company can immediately clone it and undercut your price?
Because you want to make money? There will always be a period in which you exclusively are manufacturing and marketing your machine, before your hypothetical evil omni-corp manages to clone it. Reverse engineering takes time, and (fair, enforced) competition on manufacturing means that we converge on the true cost to manufacture a good faster.
That period is a matter of hours. Unless you already own a plant, the factories that produce your goods will run off generic clones on the same tooling after hours.
New fashion trends have clones in online stores in days. Unofficial accessories and cases are available before product launches. Crowdfunded campaigns for hardware have clones on Alibaba before the funding campaign is over.
> If someone comes up with a new machine, why publish it if it's legal for every factory and large company to start producing them without compensating the inventor?
Sure, this is the goal of the patent system. You'll note that patents are intended for inventions, for ideas of practical value. Artworks without practical value are explicitly not protected by the patent system, nor are ideas without the necessary design components.
How would extending this system to ideas of pure form, rather than function, benefit society?
An even more serious risk is the establishment of a de facto "pay per thought" society: an economic state where mind/machine interfaces improve and proliferate as capabilities grow, and memory and thought itself become monetized by the totalist application of copyright laws. The endgame of copyright is exceptionally bleak.
“We should allow AI to have unilateral access to licensed works for free cause then we will have to restrict human artists otherwise”
We already have separate laws for digital systems and human artists, and the law has, thus far, not had much trouble avoiding the mistake of treating humans as CPUs or vice versa.
> AI is not reproducing it any more than someone inspired by it creating something new in the same style would be.
This argument implies that AI scientists have replicated the human mind. Despite the hype, they have not. The two processes are not the same, nor is the nature of the "training data" both entities have been trained on.
>> AI is not reproducing it any more than someone inspired by it creating something new in the same style would be.
> This argument implies that AI scientists have replicated the human mind.
No, MightyBuzzard was not necessarily making an argument about the way the human mind is. Suppose that an AI model is prompted to make an image in X artist's style, and a human is commissioned to make an image in the same X artist's style. The result from the AI model cannot be ex-ante assumed to be more of a reproduction of X artist's actual works than is the result from the human. What matters first is the actual similarity of the new works to one or more old works. The method of creation of the new work comes second. If the style of the new work is similar to the style of the old work but the new work is not actually substantially similar to any of the old works, then the method of creation doesn't matter.
From an article about the substantial similarity test in the US [1]:
> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.[1][3]
The key phrase is "the work". An actual work, not a style.
"Regular people writing and creating artwork are greedy asshats for wanting it to remain possible to be compensated for their creations" is really not a strong argument, especially against "massive, wealthy companies want to be able to create works that mimic the style of these artists and writers for profit with zero marginal cost in perpetuity".
The problem isn't so much "reading" (i.e., training on or ingesting) the copyrighted content, it is writing it out again.
Artists, writers, and coders are rightly angered when I can simply say "make a painting of XYZ in the style of Foo", and have a painting rendered that is nearly an exact copy of an item in the training set. Similar analogy for writing or code.
I see now that when trying to specify an image of an impossible object with ChatGPT4+DALL-E, it will balk if I say "like an Escher object". I have to be more verbose, and it will not render his signature style.
As long as particular artistic, writing, or coding styles aren't specifiable and generatable, this seems like a reasonable fair-use solution, and does not require us to suddenly become copyright maximalists. (Without such prohibitions on generation of similar works, we would have to become copyright maximalists.)
This is just a hit piece that tries to frame scraping the whole internet as-is as fair use, because it's profitable for the AI companies. That's all.
There are things on the internet which can be shared freely but not altered. Similarly, there are things which are for your eyes only (source-available licensed software, for example) and not to be built upon.
AI companies bundle all of them, say that their models learn like a human, that this is fair use, and that training data is non-reproducible. Then someone shows that their code, work, poem, whatever can be emitted as-is, damaging them; the same companies act surprised, add a couple of censor filters, and go on.
Whether it's fair use or not (under US law) is still very unclear to me. Consider the "fair use" criteria laid out in Folsom v. Marsh:
1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes
This is quite muddied. OpenAI is in some commercial and non-profit superposition, and many of its users are commercial and using the technology for commercial applications. But a huge swath of its users are using it for nonprofit educational purposes too. I use it primarily for learning, along with most people I know who use it. IMO there's no clear characterization here, given the information we have. Maybe a court could compel more information to clarify this.
2) the nature of the copyrighted work
I don't know enough about it to have an opinion.
3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
This is also unclear to me, because the models are effectively word-lossy compression, so are unlikely to reproduce substantial portions of works that impact the copyright holder. But they could in some cases.
4) the effect of the use upon the potential market for or value of the copyrighted work
Also unclear. I can easily imagine scenarios where the use of an LLM has a negative market impact on a copyright holder, but I can also imagine scenarios where it has a positive impact (ex "Give me some book recommendations"). What's the net impact? No idea.
> OpenAI is in some commercial and non-profit superposition
This is the big issue I have with it that no one has yet had a satisfactory (to me anyway) answer to. The internet was scraped for all manner of writing, images, etc. to train these models for research. And then once the research was done (enough), they took the model and began selling it for profit.
The question of whether using publicly available content as training data is fair use is an interesting one, but OpenAI has gone to market with a product on offer not merely without answering it, but seemingly quite deliberately avoiding answering it, and AI hype people seem completely fine with that, despite it being a make-or-break question for the concept. And it's less that I think OpenAI owes royalties to every single person they collected training data from, and more that I'm extremely uncomfortable with and skeptical of the notion that, from my perspective, yet another major player in the tech space is abusing the public square to make a buck, which is unfortunately far from a new story.
Maybe I'm just getting old, but the rallying cry of disruption rings hollow. These startup companies come in half-cocked, insisting an old industry is in dire need of innovation, but plot twist: the innovation is just app-powered slavery again. They get seventy billion dollars to build an app and a website, crush an existing industry under the weight of VC money, then either a) price the service so it's no goddamn cheaper than the thing it replaced, except now the people actually doing the work are somehow making even less money, or b) go out of business and leave behind a husk of an industry that barely functions and that nobody wants to start back up again.
Technology is great, but these massive corporations with the backing of unholy amounts of money just manifesting their destiny all over our society and economy and making us live with the consequences isn't.
Well, I was just now discussing Large Language Models (LLMs) as a technology, especially via this video [0], with a friend.
He made a striking remark. LLMs compress the information they ingest in a variably-lossy way, and when you query them, they rebuild a representation from whole or part of this compressed data in a statistical manner. As a result they're a lossy storage medium.
Not unlike some music formats which store audio data in a lossy way. That data can be restored with mathematical and statistical wizardry; you don't get everything you put in, but 97%-99% of it. And companies and the RIAA went bonkers for years, because even if it's not an exact reproduction, it was close enough to a reproduction, and thus a copyright violation.
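To make the analogy concrete, here's a toy sketch (purely illustrative; this is not a real audio codec, and it is certainly not how an LLM stores anything): coarsely quantize a "signal", reconstruct it, and measure how close the lossy copy comes to the original.

```python
# Lossy encoding throws information away; decoding rebuilds a close,
# but not exact, approximation of the original.

def encode(samples, step):
    # Lossy step: keep only the nearest multiple of `step`
    return [round(s / step) for s in samples]

def decode(codes, step):
    # Rebuild an approximation of the original samples
    return [c * step for c in codes]

original = [0.11, 0.48, 0.93, 0.27, 0.66]
codes = encode(original, 0.05)
restored = decode(codes, 0.05)

# Measure how faithful the lossy copy is to the original
errors = [abs(a - b) for a, b in zip(original, restored)]
accuracy = 1 - sum(errors) / sum(original)
print(f"fidelity: {accuracy:.1%}")  # high, but below 100% -- the loss is real
```

The analogy to an LLM is loose, of course: a model doesn't store per-work codes like this, but the "close but not bit-exact" property is the part the argument leans on.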
If an LLM can reproduce what I have written, or coded with 97%-99% accuracy without any license information, and I licensed this thing with less than permissive licenses, and sue the maker of that LLM, what will happen?
- Will it be a copyright infringement?
- Will it be fair use?
It's the first if you look at it fairly, but it'll probably be the latter, because money, fame and other corporate points will be at stake otherwise.
You seem to be conflating two things. Is the model itself a reproduction, or is it capable of making reproductions?
I think there is a strong argument for treating an ML model, which is a mathematical amalgamation of a huge variety of material and is thus transformative, as a new work.
Now, if you use this system to make a reproduction of one of the works it was trained on, that doesn't "wash" the IP; that reproduction would still face all the same tests for infringement as any other work.
Is an MP3 file itself a reproduction of the audio? Or is it merely capable of producing the reproduction?
Now, it's trivially true that an MP3 file is not, itself, physically audio. It is physically bits in a digital storage medium.
Without the right software and the right commands, there is no way to reproduce the recorded audio from the MP3 file. Now, that software is pretty ubiquitous today, so we have come to think of MP3 files as being the audio...but in a technical sense they do actually share a lot with the models behind LLMs and other generative ML projects. In another 10 years, the software to reproduce elements from the training set of an LLM may be as ubiquitous as MP3 players are today. (Well, that's actually pretty unlikely, given how many devices can play MP3s, but such software could well be baked into the OSes of major desktop and mobile operating systems by that time, anyway.)
I think it is an oversimplification to call an LLM a "lossy compression" of the training data, but being an oversimplification doesn't mean there's not some very real truth to it—possibly enough that the law would find it to be a reasonable analogy.
An MP3, or any similar format, is a piece of data that exists to create an audio reproduction of a specific work. This single and clearly intentional purpose is quite different from the varied purposes of a trained generalist ML model.
This "transformational" aspect and lack of intent make it much harder to argue that fair use doesn't apply to training the model.
> but in a technical sense they do actually share a lot with the models behind LLMs and other generative ML projects
How so? The "encoding", "decoding" and storage all seem very, very different.
> such software could well be baked into the OSes of major desktop and mobile operating systems by that time
That seems silly. However, if it were to come to pass, then for models trained with a clear intent that the main use be reproducing the works they were trained on, fair use in training the model becomes harder to argue.
I would argue that if the liability remains with the users of the model, then there will be demand for tools that check generated works for possibility of infringement.
However, the more the intended use(s) of the trained model differs from that of the copyrighted training material, the stronger that argument for fair use becomes.
I would argue that our approach to ML models should be more akin to a CD burner: something that could be used for piracy, but where we put the responsibility not on the manufacturers but on the users who choose to pirate. This makes sense because the fair use argument of a teacher using an ML model to generate content for their class is very different from that of a studio using an ML model to make content for a commercial work. If we make the model responsible for the possibility of producing infringing content, then we take away that flexibility.
The problem with likening taking other people's work as training data and reproducing it with an LLM to ripping a CD and sharing the files is the direction of power imbalance.
In the music analogy, the "owner" of the original content whose copyright is being potentially infringed is a megacorporation making billions of dollars every quarter, with more lawyers than God. The person potentially infringing that copyright is you. It's me. It's random people everywhere.
In the LLM case, the "owner" of the original content whose copyright is being potentially infringed is you. It's me. It's Neil Gaiman. It's Toby Fox. It's not just anybody, it's everybody. The entity potentially infringing that copyright is a megacorporation making billions of dollars every quarter, with more lawyers than God.
I hope this makes clear why I think one is a problem and the other isn't.
You're making huge assumptions there. Specifically that LLMs are only trained by megacorps (not currently true) or that copyright is only held by individuals (also not currently true.)
Besides those mistaken assumptions, I think you have the copyright issue exactly backwards.
Let's imagine the court decides that any computer processing of a work (that isn't intended to help present that work in a licensed manner) requires a special ML license from each copyright holder.
This automatically creates a huge moat for all the large AI companies that can afford that licensing, and maybe even lets them negotiate LLM exclusivity. Now you've created an even larger power imbalance by putting this dominant new technology under the control of large corporations.
Additionally, legal disputes around copyright already tend to predominantly favor those with large legal budgets. Who is going to be more successful in finding and suing LLMs that use unlicensed content? It's going to be the large megacorp copyright holders. CoughDisneyCough.
> Is an MP3 file itself a reproduction of the audio? Or is it merely capable of producing the reproduction?
Regardless, when the MP3 file is played using a program following the MP3 standard, the result will be a sound identical or substantially similar to the sound that was encoded into the MP3 file. The purpose of the MP3 standard is to encode audio into a file (which will be called an MP3 file) which when played using a program following the MP3 standard produces a sound similar to the original audio.
An AI model is meant to be created by aggregating multiple works, but the purpose of the model is not necessarily to produce something substantially similar to any work in the set. In my framing from the previous paragraph, the purpose of an AI model that produces images is to produce an image that can be opened using a JPG/PNG/WEBP/AVIF image viewer, where the new image isn't necessarily supposed to be similar to an existing work, though it can be. If you were to train an AI model on literally only one work, then you would get something analogous in purpose to an MP3 file, since why would you use the model to produce something dissimilar to the single work in the training set?
Pasting what I wrote in a different comment:
My argument is that similarity of new works (including AI model outputs) to actual works (not to non-existing works, existing styles, or theoretical aggregations of multiple works) is a prerequisite to infringement. From an article about the substantial similarity test in the US [1]:
> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.[1][3]
In order to get an infringing output, the user usually has to include a reference to an existing author or an existing work in the prompt. Sometimes that's not the case (which I've worried about with respect to Copilot), but in order to damn the model as a whole, you would have to establish that in over some percentage of cases the model produces infringing outputs for prompts which don't reference a particular author (whether individual or collective), a particular work, or a style strongly associated with a single or a few authors.
> but the purpose of the model is not necessarily to produce something substantially similar to any work in the set.
I think the point is that the purpose here is not relevant if the function includes the ability to do so. Even if the function also includes the ability to produce things that are, in their totality, substantially different from any given work in the training set.
> I think the point is that the purpose here is not relevant if the function includes the ability to do so.
What is your basis for asserting this? My understanding is that intent and new purposes are a key part of many copyright defenses, especially around fair use.
> I think the point is that the purpose here is not relevant if the function includes the ability to do so. Even if the function also includes the ability to produce things that are, in their totality, substantially different from any given work in the training set.
Before I get into the ability issue, keep in mind that regardless of the future of AI-related law, AI companies already have an incentive to prevent models from producing outputs substantially similar to existing copyrighted works, and the companies already do go out of their way to do so. And I reiterate that generating something similar to a work in the training set will usually require the user of the model to reference a person (or group), a copyrightable work, or a style strongly associated with a certain person (or group). If the user writes a prompt like that, then the user should bear the most responsibility if the result infringes on the copyright to a work in the training set. But more importantly, whether the output infringes doesn't matter until someone publishes it. That someone will be the user who submitted the prompt (and/or the model service provider, if the service provider automatically publishes the output by default).
Now I'll address the ability issue.
> I think the point is that the purpose here is not relevant if the function includes the ability to do so. Even if the function also includes the ability to produce things that are, in their totality, substantially different from any given work in the training set.
In the context of regulating AI, I reword `the ability to produce something substantially similar to a work in the training set` (using backticks because you didn't write those exact words) to `the ability to produce an output that infringes on copyright`, and I replace the main subject `model` with `program`.
How do you write a law to regulate
1. a program (the program, not the users of the program) with `the ability to produce an output that infringes on copyright`
without also regulating
2. a program with the ability to download a copyrighted work from a website that doesn't intend for users to download the work in such a way, while keeping in mind that users of this latter program can distribute the work in ways that infringe on copyright?
For example, how do you regulate an image generation model without also regulating youtube-dl? You might point out, in the former case the model produces an infringing work while in the latter case no infringement happens until the user distributes the downloaded work or an infringing derivative. But in the latter case as well, there isn't infringement until the user distributes something relating to the output. Don't confuse my claim with "it's not illegal if no one finds out". If you make a movie adaptation out of someone else's book without permission, you aren't committing copyright infringement unless and until you actually distribute or show your movie adaptation to someone!
My general point is that you can't treat the model's mere ability to infringe on copyright as justification to make the model or the model maker (rather than additionally or only a user of the model) liable for infringing outputs. To regulate a particular AI model in a way that preserves freedom of expression, you would have to demonstrate that the model consistently produces infringing outputs X percent of the time when given prompts which don't reference a person/group, a copyrightable work, or a characteristic/style strongly associated with a person/group. I don't know what X percent should be either, but surely 1% is not sufficient.
>This is also unclear to me, because the models are effectively word-lossy compression, so are unlikely to reproduce substantial portions of works that impact the copyright holder. But they could in some cases.
If you tried to submit schoolwork with the minimal amount of changes that even the best models spit out, you'd be expelled, dude.
Contrary to the implication of your statement, language models don’t actually understand what the words they spit out actually mean. They can regurgitate definitions.
Yeah, it's just copyright laundering. If this is deemed legal, then it's fair to return the favor by doing things like training models on proprietary source code to build free software. Effectively killing copyright would be the best possible outcome.
But it isn't. If the AI produces a work found to be a copyright violation, it isn't magically not one just because it was made by an AI. This nonsense has laid bare for the world just how little people actually understand copyright.
The only interesting question is whether the models themselves violate copyright and it's a tough sell without being able to actually show the copy. Because even if it's able to make something similar to your work it doesn't show the model actually contains a copy of it. If you coax a model to make an image that violates copyright it's gonna be hard to say that it's anything other than you using an advanced drawing tool to copy an image.
Let's imagine a hypothetical LLM which has been trained on books published by publishers (companies): thousands and thousands of books which are not out of their copyright terms, without telling their authors.
All of the things emitted by this LLM will be a remix of these books.
Will the outputs of this LLM be a violation of copyright? I bet it will be.
Let's train the same AI, on self-published, but again copyrighted material.
Will the outputs of this LLM be a violation of copyright? I bet it should be.
But in reality, in the discourse, we see that if it's company IP, then ordinary people are doing it wrong because it's copyright violation; and if it's people's IP, then ordinary people are doing it wrong because it's fair use.
Copyright is not a privilege given only to corporations. If it's fair use, I can train a voice model on Freddie Mercury's voice and make him the vocalist of my garage band, but the RIAA will end my life. Yet if a company samples my sound and reproduces it for a video without my consent, it'll be fair use, because I don't have 1000 lawyers.
Once you read the dictionary everything else is just a remix by your own definition.
> If it's fair use, I can train a voice model with Freddie Mercury's voice
This is where it falls down. If it's fair use, you can train your AI using clips of his voice. It does not give you the right to then produce a sample similar enough to one of his songs and use it in your own (without royalties). But you absolutely can use your FM voice to sing your own song, because voices can't be copyrighted.
That the legal system operates differently for those with greater wealth is a more general, certainly important problem, but not specific to LLMs or copyright.
One hopeful thought to consider: if big corporations like Disney use generative AI that has been trained on Joe Nobody's artwork, and they most likely will, then Joe Nobody can sue Disney, and he may be able to find attorneys, because Disney has enough money to make themselves a lucrative target for lawsuits. Maybe this isn't how the legal system works; I have very little knowledge of copyright law.
Yeah good luck going up against Disney. They are described as a law company with an entertainment side business. I am not even going to write everything I want to say because even as a random nobody on an anon account it is still far too dangerous.
> This is just a hit piece which tries to frame scraping whole internet as-is as fair-use, because it's profitable for the AI companies, that's all.
No, Matthew Lane's argument (Lane being the author of the article) is that the AI model as a whole cannot be assumed ex ante to be an infringement of the copyright on one or more works in the training set. Copyright is about actual works, and infringement is about the similarity between actual outputs of models and existing works by humans.
My argument is that similarity of new works (including AI model outputs) to actual works (not to non-existing works, existing styles, or theoretical aggregations of multiple works) is a prerequisite to infringement. From an article about the substantial similarity test in the US [1]:
> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.[1][3]
In order to get an infringing output, the user usually has to include a reference to an existing author or an existing work in the prompt. Sometimes that's not the case (which I've worried about with respect to Copilot), but in order to damn the model as a whole, you would have to establish that, in some meaningful percentage of cases, the model produces infringing outputs for prompts which don't reference a particular author (whether individual or collective), a particular work, or a style strongly associated with one or a few authors.
> AI companies bundle all of them, say that their models are learning like a human, that this is fair use, and that training data is non-reproducible. Then someone shows that their code, work, poem, whatever can be emitted as-is, damaging them; the same companies act surprised, add a couple of censor filters, and go on.
If it is fair use, why are they adding censor filter exceptions at all?
"Tim Burton's style" appears to be a banned term on Bing Chat Enterprise.
But Google didn't train an LLM which stripped all license, context and author information and provided a mishmash of information that can't be guaranteed to be true.
They just mapped the connection between the sites, and provided summaries of it, derived from the sites themselves verbatim.
They didn't create new articles from what they had seen, or generate new code from code they harvested while disregarding its license, etc.
> They just mapped the connection between the sites, and provided summaries of it, derived from the sites themselves verbatim.
a summary is an out-of-context, altered version of the source material. alteration of intent is pretty much a given. see quick answers
> They didn't create new articles from what they had seen, or generate new code from code they harvested while disregarding its license, etc.
see above. for the code side, the oracle lawsuit comes to mind. gpl-violations notwithstanding
if something is on the internet, it's in the public domain. whether you like it or not, it will be copied, altered, remixed, shared. that's why the internet is so great.
anyways, the original point was the indiscriminate scraping, which again, is common practice.
Not 20 years ago. Bard is on par with the GPT series, in my view. Equally unethical. I use neither.
> see quick answers
They are copied verbatim from the source material, and only from a single source.
> gpl-violations notwithstanding
Google is not a saint. GPL violations are egregious, too. But except for Bard, Google (the search engine) doesn't serve you source code stripped of its license.
> if something is on the internet, it's in the public domain. whether you like it or not, it will be copied, altered, remixed, shared. that's why the internet is so great
Tell this to publishers, RIAA, Hollywood and oh, Disney. I'm sure they will agree with you wholeheartedly. Also authors of Source-Available and xGPL licensed software will gleefully join you.
> anyways, the original point was the indiscriminate scraping, which again, is common practice.
Something being common practice doesn't make it legal. Jaywalking, downloading ripped music and movies from torrent trackers, and cracking licensed software come to mind.
I really don't see this as fundamentally different from digital sampling equipment for music that became popular in the 1980s.
Some new technology comes along that repurposes things created by others that allows people to use it as the implements of a new form of art. The digital sampler was used as a musical instrument in its own right - with high variations, skill, and taste applied to the arrangements of other people's copyrighted sounds.
There's a wide variety of opinion on digital sampling and it's really the same thing here. I'd be surprised if a particular person's views on the two are in conflict.
It depends. A lot of times it's just a new, say UMG song that samples another UMG song so it's just Hollywood accounting with cross charging.
When, say, Madonna sampled the Bee Gees, I'm sure it was a large ordeal.
But for low profit or no profit work (independent stuff), the answer is nobody cares.
Bob James, one of the most sampled artists in history, takes it in stride. He's happy that so many people are listening to his stuff. The Winstons (of the famous Amen break) were also happy the track got such wide acceptance.
Killing Joke, on the other hand, felt "Come as you are" was a rip off of "Eighties" and only dropped it upon the death of Kurt Cobain.
Or take Toni Basil's Mickey, which is actually a cover song of Racey's Kitty. Toni Basil has gone to court to secure pretty exclusive rights to the song. Racey does not get any of Toni's cash.
> Bob James, one of the most sampled artists in history, takes it in stride. He's happy that so many people are listening to his stuff.
Bob James gets paid. Like, have you ever listened to Bob James talk about people sampling his stuff? He's very clear he's in favour of it _if he gets paid_. Otherwise the lawyers get involved.
While I'm generally in favour of sampling and remix culture,
> The Winstons (of the famous Amen break) were also happy the track got such wide acceptance.
I wouldn't say that was the case:
> Neither he nor Coleman received royalties for the break, and Spencer was not aware of its use until 1996, when an executive contacted him asking for the master tape.[3] He was unable to take legal action, as the statute of limitations for copyright infringement is three years in the US.[1]
> Spencer condemned the sampling as plagiarism and said he "felt ripped off and raped".[2] He said in 2011: "[Coleman's] heart and soul went into that drum break. Now these guys copy and paste it and make millions."[3] However, in 2015, he said: "It's not the worst thing that can happen to you. I'm a black man in America and the fact that someone wants to use something I created – that's flattering."[2]
> Coleman died homeless and destitute in 2006.[2] Spencer said it was unlikely he was aware of the impact he had made on music. In 2015, a GoFundMe campaign set up for Spencer by the British DJs Martyn Webster and Steve Theobald raised £24,000 (US$37,000).[2] Spencer died in 2020.[9]
It's actually one of the greater travesties of modern culture, I think. The amen break is a fundamental part of today's musical culture, but its creators received no compensation and did not benefit from it. The one financial contribution that occurred was of no use to the break's creator -- as they were dead -- and its recipient died a few years later.
In any case, this is going to get much worse in the years to come. I'm very much in favour of AI being able to train on our societal output, but it's also extremely likely to worsen existing inequities. We're going to need to dramatically shift how society functions to accommodate this new reality, and it's not something you can solve with royalties or a training fee. When everyone can create, what will happen to the existing market of creators?
> Bob James, one of the most sampled artists in history, takes it in stride. He's happy that so many people are listening to his stuff. The Winstons (of the famous Amen break) were also happy the track got such wide acceptance.
Not true, at least not for me and some other electronic artists I know. Sampling anything is met with a lot of reservations from even very small labels, you need to clear rights to get it released anywhere.
Sure but are they the E-MU SP-12 or the samples themselves?
If they're selling just the models then it's a cleaner case, but lots of copyrighted sample discs had unlicensed samples or entire synth piano rolls, it was common.
The copyright system is a human institution that's layered on top. It can go a variety of directions.
Michael Jackson, for instance, took a sample from the Synclavier sampler disc, put it at the beginning of Beat It completely unmodified, and now the copyright for all intents and purposes is owned by Jackson.
Artists who make a lot of money commonly voluntarily license samples they use which are not used in a highly transformative way. Artists who were sampled by artists who did not license the sample may get some lawyers and threaten to sue. The threatened artist may settle.
For any cases that do make it to court, copyright generally only covers works in their entirety, and quoting is explicitly allowed under fair use. The four factors used to determine fair use are the length of the quote relative to the whole; whether the allegedly infringing use is commercial or nonprofit; whether the allegedly infringing use interferes with the market for the original work; and the nature of the quoted work (for example, facts are not copyrightable).
(This commentary is us-centric; other countries have different rules)
You're intentionally blurring the line between quoting, which is a single small piece of the whole, and sampling, which generally takes large quantities albeit split up.
People license stuff because they'd lose. Fair use isn't nearly as broad as most people think.
> In 1991, the songwriter Gilbert O'Sullivan sued the rapper Biz Markie after Markie sampled O'Sullivan's "Alone Again (Naturally)" on the album I Need a Haircut. In Grand Upright Music, Ltd. v. Warner Bros. Records Inc, the court ruled that sampling without permission infringed copyright. Instead of asking for royalties, O'Sullivan forced Markie's label Warner Bros. to recall the album until the song was removed.
As much as it was made out to be anti-hip hop, it was really anti-copyright abuse.
> You're intentionally blurring the line between quoting, which is a single small piece of the whole, and sampling, which generally takes large quantities albeit split up.
I don't think it's generally true that sampling takes large quantities. I think it's just as often or even more often the case that sampling takes a small quantity from the original song and uses it multiple times in the new work. A 2-second trumpet hit here. 7 seconds of a drumbeat there [1]. In the 1980s and 1990s, the devices for sampling could only sample a few seconds at a time anyway [2]:
> The E-mu SP-1200, released in 1987, had a ten-second sample length and a distinctive "gritty" sound, and was used extensively by East Coast producers during the golden age of hip hop of the late 1980s and early 90s.[40]
Even though a significant portion of the new work's audio uses the old song, the expression in the use of the sample is predominantly the remixer's expression, not the expression of the author of the original song.
At least before sampling cases like the Grand Upright case [3] you mentioned, hip hop sampling usually consisted of taking small pieces from many different songs and making a new song out of all of those small parts. Again, the vast majority of the expression is the remixer's, yet the remixer is the one who loses in court.
Grand Upright Music, Ltd. v. Warner Bros. Records Inc. (1991) [3] was a judicially dubious case in my opinion:
> Judge Duffy has been accused of bias in admonishing the defense and referring the defense for criminal prosecution.[2] Such criticism points out that Duffy's written opinion begins with one of the biblical ten commandments, "Thou shalt not steal." According to The Copyright Infringement Project of UCLA Law and Columbia Law School, Judge Duffy's opinion in Grand Upright v. Warner demonstrates "an iffy understanding on the part of this judge of the facts and issues before him in this case."[2]
And then there's the even more dubious case Bridgeport Music, Inc. v. Dimension Films (2005) [4]:
> The case centered on the 1990 N.W.A. track "100 Miles and Runnin'", which contains a manipulated two-second sample of the 1975 Funkadelic track "Get Off Your Ass and Jam".
> Bridgeport brought the issue before a federal judge, who ruled that the incident was not in violation of copyright law. The U.S. Court of Appeals for the Sixth Circuit reversed the decision and ruled that the sampling was in violation of copyright law. Their argument was that with a sound recording, an owner of the copyright on a work had exclusive right to duplicate the work. Under this interpretation of the copyright law, usage of any section of a work, regardless of length, is in violation of copyright unless the copyright owner gave permission. In its decision, the court wrote: "Get a license or do not sample. We do not see this as stifling creativity in any significant way."[1]
The Sixth Circuit in Bridgeport ignored that copyright is about expression [5] and not just about copying.
If you see bracketed citations within quoted paragraphs, ignore them. They are inline citations copied from Wikipedia, and I keep them in for Ctrl-F purposes.
I was curious and checked how much the drummer behind the Amen break, undoubtedly one of the most sampled recordings in the world, has made. The answer was nothing.
The difference is that with sampling, a handful of pieces of other songs are usually mixed into a larger original work, whereas ML generators (AI is a misnomer) instead chop hundreds or thousands of works into a fine slurry and reconstitute them into something that resembles the average of all of them; their works are composed entirely of the work of others.
If ML generators become actual AI (AGI) and become able to apply abstracted concepts and use non-trained observations like humans do, enabling them to create works that aren't solely composed of samples of existing work, the comparison to music sampling makes more sense. AGI will probably also want fair treatment, unlike its ML model predecessors, and as a result probably won't be the wealth generator many seem to be looking for, though…
Don't understand this discussion at all. There's no need to flip your position due to AI. Any produced work may or may not infringe copyright. If you produce a work that lifts say, characters or expressions straight out of a story you're likely violating copyright. If you used AI tools to produce that same work it still does. It's irrelevant how you technically produced the work, by pen, typewriter or chatbot api.
Asking if training a model violates copyright is unintelligible because 100 GB of weights don't resemble any work unless you produce and publish inferences from it.
I don't think that argument has a lot of weight. By the same logic a zip file of a book doesn't resemble the original work if you look at the raw bytes but you can extract it and get it back. It would be hard to argue that the copyright violation happens only when you extract the zip, and not, say, if you distribute the archive.
The same logic doesn't apply because you can't get the data back out of the model by unpacking it. It's theoretically not possible because the model is magnitudes smaller than the totality of the data.
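The "magnitudes smaller" claim can be made concrete with some back-of-the-envelope arithmetic. The figures below are illustrative assumptions (roughly the scale of a large open model), not measurements of any particular system:

```python
# Back-of-the-envelope: can the weights be a lossless archive of the corpus?
# All figures are illustrative assumptions, not measurements of a real model.

train_tokens = 2e12        # ~2 trillion training tokens
bytes_per_token = 4        # ~4 bytes of raw UTF-8 text per token
model_params = 70e9        # 70 billion parameters
bytes_per_param = 2        # fp16 weights

data_bytes = train_tokens * bytes_per_token    # ~8 TB of text
model_bytes = model_params * bytes_per_param   # ~140 GB of weights

ratio = data_bytes / model_bytes
print(f"training data is roughly {ratio:.0f}x larger than the weights")
```

General-purpose text compressors only manage single-digit ratios, so at roughly 57:1 under these assumptions the weights cannot losslessly contain the whole corpus; that said, passages repeated many times in the training data can still be memorized and regurgitated.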
To use, and invert, the example from the other commenter: you can even get an existing poem out of a generative model when that poem genuinely was not in the training data at all. This is because it's not an archive, which corresponds directly to one particular work, but a generative technology.
If I create an original file in Photoshop that's 8K, then produce a JPG of it, the JPG is both orders of magnitude smaller, and clearly a rendition of the same work. There's no way to get back to the original 587MB 8K PSD from the 87KB 1024x768 JPG, but that's irrelevant to whether the latter is a version of the former.
Just because the models are not perceptually the same as their training data does not mean that they are not effectively a lossy compression of that data.
Right, the LLM is more like the zip algorithm than the zip file. Yes, if you feed certain bits (the zip file) to the zip algorithm, you can get a copyrighted work out, but that means the zip file is violating, not the zip software.
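The analogy can be sketched in a few lines of Python, with the zlib module standing in for "the zip algorithm" and a placeholder string standing in for a copyrighted work:

```python
import zlib

# The algorithm is content-neutral: whether the output is a copyrighted
# work depends entirely on the input bits, not on zlib itself.
poem = b"Shall I compare thee to a summer's day?"  # placeholder "work"
packed = zlib.compress(poem)

# Feeding these particular bits back reproduces the work exactly.
assert zlib.decompress(packed) == poem
```

On this view, `packed` plays the role of the zip file that carries the work, while zlib, like the model's architecture and training code, reproduces nothing on its own.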
The approach I've seen is to prompt for people with unusual names; there's often only a single source image for them in the training data, and it gets reproduced by the AI.
I've seen examples with the AI "generated" images and the source image side by side - I'll try and find them.
> Asking if training a model violates copyright is unintelligible because 100 GB of weights don't resemble any work unless you produce and publish inferences from it.
I agree, but there are many who don’t, and assert that any training use should be banned. Those people are the ones who the article is talking about: people who suddenly became copyright maximalists due to their desire to halt AI training.
I don't want to halt AI training, I want corporations to fuck off from using my (A)GPL code to train their proprietary models which they then sell to people writing more proprietary code. I would be ok if the derived code is properly GPL licenced too.
I suspect many people feel in a similar way too (for example, artists whose art is used to train image generators without compensation).
I agree with this, but is this something that should be dealt with in the law or in the license? My gut feeling is that the just remedy is A) the GenAI models out there should get to do what they want as long as they are not violating licenses, B) the libre software world needs to hustle and release new versions of the appropriate licenses that specifically forbid use of the source code to train AI unless the AI itself is licensed permissively.
Note that in regard to A), I'm pretty sure the AI firms ARE violating copyright today; they have done this knowingly, and they should get a hard slap for that. But they are not violating any particular copyleft licensing provisions, to my knowledge.
> I'm pretty sure the AI firms ARE violating copyright today, they have done this knowingly, and they should get a hard slap for that.
Depends on the jurisdiction. Do note that many countries already passed laws indicating that training is NOT copyright infringement. The EU[1], for example. In which case, no license would matter.
[1] Yes, in the EU, you can opt-out (but only for commercial purposes). In other countries such as Singapore however, there are no legal mechanisms for opting out.
> I suspect many people feel in a similar way too (for example, artists whose art is used to train image generators without compensation).
Just to be a counter-voice, I don't. My code is AGPL too, but since the number of copy"righted" things outnumber the number of AGPL things, I'd rather anybody have the ability to train their own AI on all material. Conversely, if it was considered a derivative, only large corporations like Adobe or Microsoft would be able to train on it (e.g. they can just give themselves a license to do so via the ToS).
In other words, it's probably a bad idea to strengthen copyright law for the purpose of enforcing copyleft, due to possibilities of it backfiring on us.
What about people reading your code and learning from it before implementing their own code? What's the similarity level where that becomes a problem for you, if their code is closed-source or uses a license you disagree with?
Putting aside the fact that what we call AI today does not learn in the same way humans do, it operates on a VASTLY different scale. On a good week I can read a book. A single book. A massively parallelised data centre can do that billions or trillions of times faster. Scale of effect (lacking a better phrase) must be considered.
A rack of equipment does not need to sleep, eat, take care of itself, or earn a living while churning through millions of words a minute. An actual thinking and learning person has to choose what to spend their limited time, money and attention on, while reading at a pace of dozens of words a minute. Those are not the same things at all.
I mean, by that logic every fantasy story ever is a derivative work. Should everyone be paying J.R.R. Tolkien's estate royalties the moment they include elves in a story?
I'm not sure it's always true to say "100 GB of weights don't resemble any work".
If I train a model on a famous poem or something, and it turns out if you ask the model to quote the poem verbatim, it can, then the model contains the poem. Have I not "copied" the poem into the model?
You can simply copy the poem in much less space if that's what you want, but incidental replication is not the same as copying.
When you train a model, you use 1T…20T tokens, with deduplication to reduce direct memorization. The model is then a superposition of gradients from trillions of tokens, no longer just a copy of a poem; it can also write a commentary about the poem or compose new ones.
My understanding is that the law ultimately cares about the market impact on the copyright holder, and balancing that with the interests of the public. I don't think the courts will be swayed by arguments based on technicalities like this.
The law doesn't care about technology technicalities.
It cares about legal technicalities.
The law doesn't care, for instance, that the MP3 file of Beyoncé's latest hit is not the same bits in the same order as the original master file created when she recorded it. That holds true no matter how many times you re-encode it into different formats.
As things stand, there has been no definitive ruling on the matter of ML model training and copyright; I don't know how that will end up shaking out, and honestly, I'm not yet sure how I think things should shake out in any detail. I think that people whose works are being used to fuel massive profit-generating engines for the very wealthy should have some right to give or revoke consent, but I'm not at all sure offhand if there's a realistic way to do that without serious unintended consequences.
But I am sure that arguing that the precise technical nature of how the transformations are performed means that nothing is actually being "copied" should not, and is unlikely to, hold much legal water.
"Technicality" here means an argument that follows a strict/formal interpretation to arrive at a conclusion that is inconsistent with the principles and objectives of the law.
Copyright exists only on complete works, not on characters; however, any AI-produced material which uses trademarked characters would violate trademark. It's just that AI cannot violate copyright unless it produces an identical copy of an existing work.
Regarding copyright on characters in the United States [1]:
> US Copyright Statute of 1976 does not explicitly mention fictional characters as subject matter of copyright, and their copyrightability is a product of common law. Historically, the Courts granted copyright protection to characters as parts of larger protected work and not as independent creations. They were regarded as ‘components in a copyrighted works’ and eligible for protection as thus.[5]
But in practice, if you write a fanfiction story which uses Mickey Mouse as a character and get sued by Disney for copyright infringement, you will not win in court regardless of how different your fanfiction story is from any Disney story unless you can afford as many lawyers of comparable quality as Disney can. And even then, who knows?
> So defiant, in fact, that MRT had actually registered for copyright protection on the songs it was selling. "To make matters worse," says EMI's complaint, "Defendants recently sought to register their infringing sound recordings with the Copyright Office, apparently claiming that because they copied the sound recordings using their own computer system, they now own these digital copies and have the right to distribute them to the public."
I'm going to take the side of the IP argument based on whatever hurts big business most. If they want to "steal" from the little guy by locking everything up behind copyright (or patents or trademarks) then I will oppose it. If they want to "steal" from the little guy by ingesting his works into the AI machine then I will oppose it.
Notice how governments and companies became "worried" about AI when people were starting to emulate certain styles in the images generated. Suddenly they needed to be regulated and censored (even harder).
Good take. Big companies should suffer so that the regular person can thrive and compete. It is clear that big AI companies virtue-signal so that they can exploit regulations and become monopolies. In the end, the only thing enforcing these laws is a government that can only enforce through punishment and ultimately violence, and you can fend that off with enough money. AI training will continue to be done illegally and in the gray, by both regular parties and those who can pay for the privilege.
Most of the artists I follow draw styles and subjects that can't easily be replicated using AI. Anyone with a refined enough taste should not feel threatened or relieved. The continued technological progress serves those who adapt. What you should fear is surveillance tech.
Jokes aside the recent supreme court case involving his estate, as a non lawyer, feels like it's some sort of precedent towards what is and isn't transformation of IP.
Generative AI is almost always a service problem. If a model can offer customized art instantly 24/7 for free but the artist says you have to wait 3 months and it’ll cost 500 dollars and I only accept PayPal, then the model’s service is more valuable.
Or perhaps it’s time to reconsider whether the whole idea of “IP ownership” still makes sense. It was introduced because it was, on the whole, beneficial to society. Now that AI can make producing content much cheaper, do we still need that incentive?
> It was introduced because it was, on the whole, beneficial to society.
How do you know?
It was introduced; people have mostly not made the argument that it was beneficial to society.
You could argue that the fact that it was introduced shows that it was beneficial to society, but that theory has problems with laws that are repealed. Prohibition was introduced for the same reason, that it was beneficial to society. And it was repealed because that was beneficial to society too. Is that... true?
Do we want AI to be the only economical way for “content” to be created in the future? And for the corporate owners of AI to be the middle men on all creativity?
I’m no great fan of the current IP regime, but the economics of building, maintaining and operating LLMs have the potential to completely gut human creativity and replace it with a mechanical ouroboros eating its own tale.
Compares “using a computer” to “using an AI” as if they’re the same.
Makes AI companies out to be the victims of AI doomers, without offering any kind of solution to the people using copyright as a tool to protect their livelihoods. People are not IP maximalists on principle, I'd have to guess, but if copyright is the best available means to safeguard their work from being leveraged against them, they will use the means they have. Until that issue is addressed, articles like this fall on deaf ears.
> Compares “using a computer” to “using an AI” as if they’re the same.
No, the article's point is that "using an AI to do X" is a subset of "using a computer to do X", and it demonstrates that restricting the subset without affecting other elements of the larger set is hard in the context of statutory law.
> Makes AI companies out to be the victim of AI doomers
The article does not refer to the AI companies as victims, and doesn't focus on the impact to AI companies in particular.
The article's argument is a slippery slope: how do you write a law which restricts "using an AI to do X" without restricting other forms of computer tasks? What I have in mind are scraping, downloading (like youtube-dl), parsing, recoloring (can be accessibility-related), and otherwise interpreting (broad umbrella, but I'm thinking along the lines of OCR).
> Makes AI companies out to be the victims of AI doomers, without offering any kind of solution to the people using copyright as a tool to protect their livelihoods. People are not IP maximalists on principle, I'd have to guess, but if copyright is the best available means to safeguard their work from being leveraged against them, they will use the means they have. Until that issue is addressed, articles like this fall on deaf ears.
While it would be nice if the author of the article had a different proposed solution in mind, pointing out flaws in the solution "treat AI models as copyright infringement" is an important task. Moreover, keep in mind that the status quo is an option, and sometimes better than a change (especially when the First Amendment is involved). Would the AI regulation you have in mind be worth it, and (separate question) would your proposed solution be better than the trajectory of the status quo?