I feel like I’m reading the exact same argument someone posted 6 to 8 months ago about whether Stable Diffusion contained embedded images or was just “learning patterns” and purely generative.
People argued, oh so hard about it.
“But stable diffusion is GB in size, how can it have embedded full images of a dataset that is so much larger? It’s not possible.”
…but, it turns out it is possible (1) and it does, in fact, have embedded full copies of the source images.
Now. Let’s talk about code…
> and the problem is that it's synthesising code identical to existing code
There’s a word for that. It’s called copying.
Copying fragments, copying full text. It’s a black box that spits out content indistinguishable from the input (again, see Stable Diffusion: a lossy JPEG of an image is the same image even when it is not bit-for-bit identical).
That’s copying.
The black box might be a neural network, or a Python script that reads a file from disk. The process is irrelevant.
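To make that concrete, here is a minimal sketch (a made-up snippet and file name, not any real model’s internals): two boxes whose outputs are indistinguishable, one of which “generates” and one of which plainly reads from disk.

    # A hypothetical sketch: two black boxes with identical observable behaviour.
    SNIPPET = "def add(a, b):\n    return a + b\n"

    with open("snippet.py", "w") as f:   # stand-in for the "training data" on disk
        f.write(SNIPPET)

    def box_script():
        """A box that plainly copies: it reads the file and returns its contents."""
        with open("snippet.py") as f:
            return f.read()

    def box_model():
        """A box with the same text baked into its parameters, emitted on demand."""
        return SNIPPET

    print(box_script() == box_model())   # True: from the outside you cannot tell which one copied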
If you memorise a block of code and type it out by hand with no reference, you are copying it.
These models copy code.
These models have embedded full text copies of some code.
What we do with that is an ethical question, but that it is true is not in dispute. It is true. It’s documented.
If you think these models are not copying at least some training data as output you are factually, and provably incorrect.
> “But stable diffusion is GB in size, how can it have embedded full images of a dataset that is so much larger? It’s not possible.”
> …but, it turns out it is possible (1) and it does, in fact, have embedded full copies of the source images.
The _not possible_ assertion is in response to people arguing that generative image models literally copy and thus "steal" the images in their training set, or that such generative models "work" by infringing on said images. The assertion is not that the models cannot possibly contain _any_ copies of training data; the assertion is that the models are generalizing and cannot contain a majority of the training data.
Now, let's look at some lines from your linked article titled "Paper: Stable Diffusion 'memorizes' some images, sparking privacy concerns", which happens to lead with scare quotes around "memorizes" and already hedges with "some":
> Researchers only extracted 94 direct matches and 109 perceptual near-matches out of 350,000 high-probability-of-memorization images they tested . . . resulting in a roughly 0.03 percent memorization rate in this particular scenario.
I do not consider a 0.03% rate of full copies particularly damning. Instead, it sounds like a flaw in training which could be addressed, and the article does in fact attribute it to overfitting on images that are over-represented in the training data.
> the 160 million-image training dataset is many orders of magnitude larger than the 2GB Stable Diffusion AI model. That means any memorization that exists in the model is small, rare, and very difficult to accidentally extract.
Which side's claims do you feel this line supports?
Generative text models should, in theory, also be generalizing and not memorizing. In practice, however, it does appear far easier to get Copilot and the like to spit out exact copies of code, complete with license text. Presumably, there's just not enough code out there to sufficiently generalize from, so we're seeing overfitting as a matter of course.
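A toy illustration of that overfitting point (generic curve fitting with numpy, nothing to do with Copilot's actual architecture): once a model has as many parameters as it has data points, it reproduces its training data exactly rather than generalising.

    import numpy as np

    # Five training points and a degree-4 polynomial: capacity matches the data size.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

    coeffs = np.polyfit(x, y, deg=4)   # enough parameters to pass through every point
    y_hat = np.polyval(coeffs, x)

    print(np.allclose(y_hat, y))       # True: the "model" reproduces its training data exactly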
If you made 1,000,000 AI-generated MP3s available and 0.03% of them matched Sony’s copyrighted music catalog, then you would be liable for $250,000 * 300 in statutory damages. Arguing that infringement doesn’t occur because the incidence rate is low is wallpapering over a serious problem.
Really depends on the details. If the extractable works align with songs that are over-represented in the training data, they may largely consist of performances of public domain compositions. And if their extraction also requires prompting which explicitly requests an infringing result, then the actual liability might be something you could mount a defense against.
Or hell, the extracted potentially infringing and over-represented material might all be pop music set to variations of Pachelbel's Canon, and I'd pay to see that lawsuit.
Details matter here. For a given musical performance, there are at least 3 copyrights in action:
1. The copyright on the composition. This can also include arrangements - for instance, Gershwin's original piano version of Rhapsody in Blue is now public domain, but the orchestral version everyone knows is a later arrangement that is still under copyright.
2. The copyright on the sheet music: the actual layout, spacing, editorial notes, things like that. It's actually an insanely deep subject; I've got an 800 page book on it. The field is referred to as music engraving, as up until about 40 years ago it was literally done by engraving the plates by hand. It's a much, much harder problem than normal book-style text layout, as it's fully 2D, whereas text is basically 1D with occasional special cases. (NB: this copyright is really only relevant to the musicians, conductors, etc., but it does matter.)
3. The copyright of the particular recording. This is the really relevant one. A 5 year old recording of a 500 year old work is very much under copyright.
If that were the case, they'd be financing this stuff. AFAIK they're not. What if you cross-referenced the Sony copyrighted catalogue with 1 million traditional/public domain songs? I'd guess 0.03% would be within the rounding error.
I'll see if I can dig some up, but there have been similar examples to the image reproduction case that have come out from Copilot; all it took was using some more unique method names (i.e. ones that were very specific to a particular code base) and oh look, there comes the "totally not copied, honest gov'nor" code.
Any company that is using Copilot or similar is walking right into a massive legal minefield. Any engineer who chooses to use Copilot without explicit approval from the company should seriously worry about what target they're putting on their back. You don't want to put yourself in a position where your employer can say "They did this without our approval" or, worse, "despite us explicitly having told them not to", should there end up being legal action. Someone is going to establish a precedent one way or another at some point; don't let it be you.
"I'll see if I can dig some up, but there have been similar examples to the image reproduction case that have come out from codepilot, all it took was using some more unique method names "
If Edison goes over to Tesla’s house and looks at the latest invention on Tesla’s workbench, then goes to his factory and builds the same thing and sells it, then Edison has stolen and copied Tesla’s invention.
If Edison breaks into Tesla’s house at night, takes* the invention off his workbench, goes to his factory and uses it as a reference to build and sell the same thing, then Edison has stolen and copied Tesla’s invention.
In both cases we can say “stolen and copied”, but the actual sequence of events and the legal recourse that Tesla has are different in each case.
This difference matters because open source is like Tesla having his workbench in a public place where people are encouraged to watch him work, play around with the inventions, help him build, etc. In the “Tesla’s public workbench” situation, Tesla’s recourse is debatably impacted in the first scenario, but it is definitely unchanged in the second scenario. Arguing as though Copilot is doing “second scenario stealing and copying” rather than “first scenario stealing and copying” sidesteps that debate about whether open sourcing your code made it ‘up for grabs’ for natural and/or artificial intelligences to learn from and/or memorize. Now, it’s not clear what side will win that debate - I personally think “this has infringed intellectual property rights” has a decent chance of being the victor. But sidestepping that debate is setting off peoples’ “you’re trying to pull a fast one” alarm and making them unsympathetic. That’s what I’m getting at here.
*: to avoid “steal vs copy depriving of property”-type discussions, postulate that Edison returns the invention to the workbench before Tesla notices
As to the linked article, I would quote from the article itself: “[The paper] is dense with nuance that could potentially be molded to fit a particular narrative”. I could say more: both the article and the paper state there were zero byte-identical matches, and 203 direct or perceptual near matches out of 175 million generations (or out of the 160 million training images) is 0.0001%, which argues against this very strong claim of “factually and provably incorrect”. But I think the argument I will actually make is this: I said Copilot was synthesizing identical code from its knowledge base, which is completely compatible with the claim that it is outputting code from memory (in the human sense of memory). What it is not compatible with is the claim that it is outputting code from memory (in the computer sense of memory).
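(For what it's worth, a quick back-of-the-envelope check of that percentage, using the 94 + 109 figures quoted from the paper above:)

    matches = 94 + 109                       # direct matches + perceptual near-matches
    print(f"{matches / 175_000_000:.7%}")    # ~0.0001160% of the 175 million generations
    print(f"{matches / 160_000_000:.7%}")    # ~0.0001269% of the 160 million training images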
These assertions have very little weight - what matters is what the law says. And for now, in the jurisdictions that matter, the law is still silent. Maybe it will catch up via either legislation or case law, or maybe it won't. But for now there is no legal consensus on what these new things are doing.
Of course there is a moral/ethical component to this, and individuals are obviously free to hold personal views. But unless some global working consensus forms (unlikely) then I don't see how change will come from that.
I suspect that it's going to be difficult for politicians to find the will to do something about this issue, and then to get their heads around it enough to make good laws.
If y = f(x, T), where y is a member of T, prove that f() doesn’t copy a value from T.
This is possible. For example, many different functions return 0 for f(0); if we can derive a function g(x) that generates y without access to T (i.e. the training data), then we can plausibly argue that f(x, T) might indeed not be copying from T.
…but can we do so?
If so, we must admit that it is possible that f(x, T) = g(x) and no copying is taking place.
So, in the past 50 years of prior work can we come up with an example of a function g(x) that generates the exact code we see coming out of these LLMs?
I’ll grant, it’s not impossible.
For example, if stable diffusion generated sine wave patterns or fractals that exhibited “deep complex structure” and it happens that some artist had written code to do that exact thing, there would (I believe) be a fair case that stable diffusion was not copying the artist, it was simply a parallel implementation of the same generative code.
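To make that concrete, here is a minimal sketch of such a g(x) (a hypothetical pattern generator, not anything Stable Diffusion actually contains): it produces every pixel purely from a formula, with no training set T anywhere in sight.

    import math

    def g(width=64, height=64):
        """Generate an interference pattern purely from a formula; no training set T involved."""
        return [
            [math.sin(x * 0.3) * math.cos(y * 0.2) for x in range(width)]
            for y in range(height)
        ]

    pattern = g()
    # If some artist had independently published this exact pattern, g() would still
    # reproduce it pixel-for-pixel without ever having "seen" it.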
However, now my scepticism kicks in.
For code written by hand, that was not generated, we are suggesting that an LLM is a parallel, equivalent implementation of a human mind, and that it can, with purely “learnt patterns”, replicate not only the intent and structure of known code, but the exact code itself, repeatedly.
I’ll grant, it’s not impossible, and it’s very difficult to prove, but it seems fabulously, unbelievably, extraordinarily unlikely that a clean-room reimplementation would be identical, repeatedly.
We’re entering into the domain of assigning probability and then doing a zero knowledge proof here, but the solution is easy.
Just prove by counter example.
Train a model that generates such code without the code as training data.
You have now won the argument.
…on the bright side, you’ve also solved the 10 million dollar question of “how do I train my LLM when I don’t have enough training data” (because that’s what you just did; train it to do something without training data), so you’re now rich and give zero ducks what I think. Congrats.
> For example, if stable diffusion generated sine wave patterns or fractals that exhibited “deep complex structure” and it happens that some artist had written code to do that exact thing, there would (I believe) be a fair case that stable diffusion was not copying the artist, it was simply a parallel implementation of the same generative code.
I understand your pseudo equations and I understand [some] engineering.
Why are artists writing code? Is that to draw a parallel, as if artists did write code?
Unless artists really do write code, but that’s beyond the scope of what I’m asking.
The point was that some artist may have created a work representing a complex fractal, perhaps through code since it can be quite difficult to manually compute some fractal patterns.
And that in that case, g(x) that can produce a perfect copy of the artist's work without having the image in the training set provably exists, and is the actual mathematical fractal function itself.
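A minimal sketch of that idea (hypothetical code, not taken from the paper): two independently written implementations of the Mandelbrot membership test produce byte-identical output, so an identical result by itself does not prove that either one copied the other, or that the image sat in a training set.

    import hashlib

    def mandelbrot_a(cx, cy, limit=50):
        """Membership test written with Python's complex type."""
        z, c = complex(0, 0), complex(cx, cy)
        for _ in range(limit):
            z = z * z + c
            if z.real * z.real + z.imag * z.imag > 4:
                return 0
        return 1

    def mandelbrot_b(cx, cy, limit=50):
        """The same mathematical definition, independently written with plain floats."""
        zr, zi = 0.0, 0.0
        for _ in range(limit):
            zr, zi = zr * zr - zi * zi + cx, 2 * zr * zi + cy
            if zr * zr + zi * zi > 4:
                return 0
        return 1

    def render(f, n=100):
        return bytes(f(-2 + 3 * x / n, -1.5 + 3 * y / n) for y in range(n) for x in range(n))

    a, b = render(mandelbrot_a), render(mandelbrot_b)
    print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())   # True: identical bytes, no copying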
The law is generally less concerned with how something happened and more concerned with what happened. So I would suspect that the legal frameworks will be more restrictive towards code that is reproduced verbatim, regardless of if the code is directly copied or LLM generated.
The subject of the thread is "Will my code be regurgitated if I put it on GitHub? Should I avoid putting my code on GitHub if I don't want it copied?"
The legality aspect that you are injecting into the discussion is irrelevant.
Basically, if you take something and then feed new info into it, is the result a new thing, a derivative of that work, or something else? With these models, the data and the mathematical spline are so smeared across thousands of endpoints that it is hard to tell. However, I do not think the courts have to think about the details of how the models work. They can do something else so a jury can get its head around it.
But basically the courts will probably simplify it: copyrighted things go in one side, and things that sort of, kind of resemble the original come out the other. Do not worry about the details of how it is done. Is that new thing owned by the original copyright holder? Or by many holders, since it is smeared together with a hundred other items that may or may not have copyright holders? Or is it a new work? Or is it a derivative work that has just been recoloured?
This could go either way. As someone put it very nicely here a few weeks ago, these AIs are like the most amazing autocomplete you have ever seen. Now, in copyright, if you make something and I independently come up with the exact same thing, and I can prove it, I am in very little trouble. In this case, without that input code the AI model probably would not predict that exact string. But in some cases it does predict it exactly, while also being able to predict thousands of other things. Is that prediction copyrightable? What if it predicts part of the code mixed with something else? As a third party who did not create the model but just used it and got the code, what are my liabilities? And if there are liabilities, what are the consequences? The courts will have to decide eventually.
Does this storage argument also go for humans? If I ask a painter to paint paintings they've seen before, what will be their rate of copying the painting exactly?
Are you asking if it is okay for humans to plagiarize or otherwise violate copyright? For both the human and the algorithm, the answer is: it depends. Yes?
Are you trying to make money off of it? Is it for personal use? Was it protected or public domain?
It just all depends on the circumstances. Human or not.
“It depends on the circumstances” seems to me like cover for “we made up some random rules that made sense at the time”.
If you cover a song on YouTube, you can apparently be demonetised or attract copyright strikes from the original artist. But in software, you tell me an algorithm and I code it up, that’s an original work.
Oh, I used to be a lawyer. "It depends" is just always the answer, whether we like it or not. And, yes, it is pretty much because of “we made up some random rules that made sense at the time”.
Wrong. Literally the whole point of law - and especially of intellectual property law - is that the process is as relevant as, if not more relevant than, the outcome. This is why "code as law" efforts are plain suicidal. This is why you can't just print out a hex dump of a pirated MP3 file and claim it's not copyright violation because it's just a long number that your RNG spat out - it would have been a good argument if your RNG actually had done that, but it didn't, and that's what matters.
This is what it means when we say that, for just about everyone except computer scientists, bits have colour[0]. Lawyers and regulators - and even ordinary people - track provenance of things; how something came to be matters as much, and often more, than what the thing itself is.
This is what makes the generative AI case interesting. They're right there in the middle between two extremes: machines xeroxing their inputs, and humans engaging in creative work. We call the former "copying", and the latter "innovation" or "invention" or "creativity". The two have vastly different legal implications. Generative AI is forcing us to formalize what actually makes them different; until now, we didn't have a clear answer, because we didn't need one.
The Pirate Bay founders made essentially that process argument and lost fairly big. They argued that the prosecutor first had to prove that a specific copy had been made, and prosecute that, before arguing that The Pirate Bay somehow helped with that crime.
The court did not agree. It looked instead towards an anti-biker-gang law under which a biker bar can be found guilty of assisting with gang crime, even if no specific crime can be directly associated with the bar.
The defense team's argument - that prosecutors needed to prove that a crime had occurred - failed. The courts only required that the opposite not be believable, which, given all the facts around the case, was deemed sufficient. On that question the process doesn't matter. If the court does not think it believable that copying has not occurred, any argument about "machines xeroxing their inputs and humans engaging in creative work" will be ignored.
I don’t care if there’s a human in the box. If the box spits out training input as output it is copying it.
It doesn’t matter if there is a human doing it or not.
For your supposition to work, the input to the box would be only an abstract summary of the logical steps, and the output an exact copy of some other thing that was never an input.
In that case, yes, it would not be copying.
..but, is that the case? Hm? Is that what we care about? Is it possible to randomly generate the exact sequence of text with no matching training input? Along with the comments?
It seems fabulously unlikely, to the point of being totally absurd.
I'm a proponent of not restricting (well, or trying to restrict) machine learning models and of not considering them a lossy database, but it must be said: if humans can recreate copyrighted works from memory and publish them, they are in trouble too.
I agree. The problem is that a human has ethical deterrents against copying while a machine doesn’t, so we have to rely purely on legal incentives to prevent copies from being produced.
I think the best argument here is that having the work in memory is not illegal, and human brains are not bound to copyright even when they can also be considered lossy databases. The question is where do we draw the line for a lossy database.
If you transcribe a copyrighted book by hand, that doesn't give you the right to publish it. I don't think being a human currently gives you a legal loophole to copy works so why make the comparison?
Funnily, the copyright argument is shifting towards “if resemblance is substantial, that’s infringing anyway”, circumventing whole discussions around “is it infringing if it’s called learning” arguments.
Close to zero? It's too hard to create an identical picture from memory; even with the original in hand, fakes are not bit-perfect. Computers, on the other hand, are great at copy & paste.
Except they're not far? We're not talking about average images with three eyes or similarly obviously wrong code; the original comment is specifically about storing perfect copies. Human brains can't store perfect copies (and then reproduce them), especially at scale.
The human painter will undoubtedly be able to put out 1 billion exact copies of paintings based on their memory and skills in under 10 seconds, so there is no reason to consider any difference between the cases.
Why is registering a copyright relevant in cases of copyright infringement? Copyright infringement is the infringement of someone else's right against copying.
They do copy code sometimes, but most of the code they generate is new. I think some people (maybe not many, but some) have the impression that they literally search for similar code and copy it exactly, 100% of the time, which isn't true.
(1) - https://arstechnica.com/information-technology/2023/02/resea...