Please don’t upload my code on GitHub (codeberg.page)
360 points by modinfo on May 8, 2023 | 478 comments



If it was possible to point to where the code is actually stored in Copilot (i.e. run strings on the server and it spits out the copylefted code), I would be a lot more sympathetic to the view that LLMs are stealing code. Even if you decode the weights from floats to strings, you won't find the strings of code stored anywhere. It's probably not correct to say it's "learning" from code the same way humans are, but it is "learning" from code in the way LLMs learn.

Fundamentally it seems to me that Copilot truly does synthesise code from some store of knowledge (even if it's hard to understand what this store of knowledge is), and the problem is that it's synthesising code identical to existing code. There are legal tools and also rhetoric that are designed for dealing with this problem of "synthesising something that is identical to an existing thing", and it's different tools and rhetoric from the ones we have for dealing with the problem of "stealing or copying existing things". It's valid to have an issue with Copilot ingesting your code, but unfortunately people are largely using tools and rhetoric from the latter category to approach the issue, and that slight misapplication is causing their issues to fall on deaf ears a lot of the time.


> If it was possible to point to where the code is actually stored in Copilot (i.e. run strings on the server and it spits out the copylefted code), I would be a lot more sympathetic to the view that LLMs are stealing code.

I think of it the other way around: I'd be more trusting that LLMs aren't a problem in that sense, if the big commercial entities linked to them used their own code to train them as well as public repositories.

MS's codebase must contain much good training material, unless they think their own code is crap and are embarrassed to let the AI look at it!

I am still undecided, but erring in the direction of not wanting my stuff to be used to train them, though I've long since passed Douglas Adams's "35 years old" tech barrier so maybe I'm just not liking change…

For now at least I won't be putting my own code on services like GitHub. Then, if someone else does, or if an LLM is trained on public sites generally, at least I've not explicitly agreed to anything that says the company can use my code that way - which you do (they say; I believe there is at least one court case brewing on the matter) when you sign up to the service, agree to its terms, and use it that way.


> MS's codebase must contain much good training material, unless they think their own code is crap and are embarrassed to let the AI look at it!

Having worked at Microsoft, I'm not sure I want my coding assistant trained on Microsoft's codebase.


But would it be worse overall than the quality in the public repositories that they did use for the training data?


Have I gone back in time?

I feel like I’m reading the exact same sentence that someone posted 6 to 8 months ago about whether Stable Diffusion included images or was just “learning patterns” and totally generative.

People argued, oh so hard about it.

“But stable diffusion is GB in size, how can it have embedded full images of a dataset that is so much larger? It’s not possible.”

…but, it turns out it is possible (1) and it does, in fact, have embedded full copies of the source images.

Now. Let’s talk about code…

> and the problem is that it's synthesising code identical to existing code

There’s a word for that. It’s called copying.

Copying fragments, copying full text. It’s a black box, it spits out content that is indistinguishable (again, see stable diffusion, a lossy jpg of an image is the same image even when it is not bit-for-bit identical) from the input.

That’s copying.

The black box might be a neural network, or a Python script that reads a file from disk. The process is irrelevant.

If you memorise a block of code and type it out by hand with no reference, you are copying it.

These models copy code.

These models have embedded full text copies of some code.

What we do with that is an ethical question, but that it is true is not in dispute. It is true. It’s documented.

If you think these models are not copying at least some training data as output you are factually, and provably incorrect.

(1) - https://arstechnica.com/information-technology/2023/02/resea...


> “But stable diffusion is GB in size, how can it have embedded full images of a dataset that is so much larger? It’s not possible.”

> …but, it turns out it is possible (1) and it does, in fact, have embedded full copies of the source images.

The _not possible_ assertion is in response to people arguing that generative image models literally copy and thus "steal" the images in their training set or that such generative models "work" by infringing on said images. The assertion is not that the models cannot possibly contain _any_ copies of training data, the assertion is that the models are generalizing and can not contain a majority of the training data.

Now, let's look at some lines from your linked article titled "Paper: Stable Diffusion 'memorizes' some images, sparking privacy concerns", which happens to lead with scare quotes around "memorizes" and already hedges with "some":

> Researchers only extracted 94 direct matches and 109 perceptual near-matches out of 350,000 high-probability-of-memorization images they tested . . . resulting in a roughly 0.03 percent memorization rate in this particular scenario.

I do not consider a 0.03% rate of full copies particularly damning. Instead, it sounds like a flaw in training which could be addressed, and which the article does in fact attribute to overfitting on images that are over-represented in the training data.

> the 160 million-image training dataset is many orders of magnitude larger than the 2GB Stable Diffusion AI model. That means any memorization that exists in the model is small, rare, and very difficult to accidentally extract.
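
(Back-of-the-envelope: 2 GB spread over 160 million images works out to roughly 12 bytes per image, so anything close to wholesale memorization is arithmetically impossible.)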

Which side's claims do you feel this line supports?

Generative text models should, in theory, also be generalizing and not memorizing. In practice, however, it does appear far easier to get Copilot and the like to spit out exact copies of code, complete with license text. Presumably, there's just not enough code out there to sufficiently generalize from, so we're seeing overfitting as a matter of course.


If you made 1,000,000 AI-generated MP3s available and 0.03% of them matched Sony’s copyrighted music catalog then you would be liable for $250,000 * 30 in statutory damages. Arguing that infringement doesn’t occur because the incidence rate is low is wallpapering over a serious problem.


Really depends on the details. If the extractable works align with songs that are over-represented in the training data, they may largely consist of performances of public domain compositions. And if their extraction also requires prompting which explicitly requests an infringing result, then the actual liability might be something you could mount a defense for.

Or hell, the extracted potentially infringing and over-represented material might all be pop music set to variations of Pachelbel's Canon, and I'd pay to see that lawsuit.


Details matter here. For a given musical performance, there are at least 3 copyrights in action:

1. The copyright on the composition. This can also include arrangements - for instance, Gershwin's original piano version of Rhapsody in Blue is now public domain, but the orchestral version everyone knows is not.

2. The copyright on the sheet music: the actual layout, spacing, editorial notes, things like that. It's actually an insanely deep subject - I've got an 800 page book on it. The field is referred to as music engraving, as up until about 40 years ago it was literally done by engraving the plates by hand. It's a much harder problem than doing normal book-style text layout, as it's fully 2D, whereas text is basically 1D with occasional special cases. (NB: This copyright is really only relevant to the musicians, conductors, etc, but it does matter.)

3. The copyright of the particular recording. This is the really relevant one. A 5 year old recording of a 500 year old work is very much under copyright.


This is just a distraction. Everyone knows we're talking about the last one here.

No one sued Napster because their guitar tabs were being shared.


No, it absolutely is not.

From GP: " they may largely consist of performances of public domain compositions."

My entire point is that the composition being PD does not mean RECORDINGS of it are PD.


And, obviously, the lyrics.


If that was the case, they'd be financing this stuff. Afaik they're not. What if you cross-referenced the Sony copyrighted catalogue with 1 million traditional/public domain songs? I'd guess 0.03% would be a rounding error.


Pedantic: 0.03% of 1 million is 300, not 30.


I'll see if I can dig some up, but there have been similar examples to the image reproduction case that have come out of Copilot; all it took was using some more unique method names (i.e. ones that were very specific to a particular code base) and oh look, there comes the "totally not copied, honest gov'nor" code.

Any company that is using Copilot or similar is walking right into a massive legal minefield. Any engineer that chooses to use Copilot without explicit approval from the company should seriously worry about what target they're putting on their back. You don't want to be putting yourself in a position where your employer can say "They did this without our approval" or worse "Despite explicitly having told them not to", should there end up being legal action. Someone is going to establish precedent one way or another at some point; don't let it be you.


"I'll see if I can dig some up, but there have been similar examples to the image reproduction case that have come out from codepilot, all it took was using some more unique method names "

You mean, like the example the article used?

https://web.archive.org/web/20221017081115/https://nitter.ne...

"sparse matrix transpose, cs_"

Generated the author's matrix code.


Here’s the example of copilot regurgitating GPL code verbatim:

https://codeium.com/blog/copilot-trains-on-gpl-codeium-does-...

It produces an almost exact copy of some training data.


If Edison goes over to Tesla’s house and looks at the latest invention on Tesla’s workbench, then goes to his factory and builds the same thing and sells it, then Edison has stolen and copied Tesla’s invention.

If Edison breaks into Tesla’s house at night, takes* the invention off his workbench, goes to his factory and uses it as a reference to build and sell the same thing, then Edison has stolen and copied Tesla’s invention.

In both cases we can say “stolen and copied”, but the actual sequence of events and the legal recourse that Tesla has is different in each case.

This difference matters because open source is like Tesla having his workbench in a public place where people are encouraged to watch him work, play around with the inventions, help him build, etc. In the “Tesla’s public workbench” situation, Tesla’s recourse is debatably impacted in the first scenario, but it is definitely unchanged in the second scenario. Arguing as though Copilot is doing “second scenario stealing and copying” rather than “first scenario stealing and copying” sidesteps that debate about whether open sourcing your code made it ‘up for grabs’ for natural and/or artificial intelligences to learn from and/or memorize. Now, it’s not clear what side will win that debate - I personally think “this has infringed intellectual property rights” has a decent chance of being the victor. But sidestepping that debate is setting off peoples’ “you’re trying to pull a fast one” alarm and making them unsympathetic. That’s what I’m getting at here.

*: to avoid “steal vs copy depriving of property”-type discussions, postulate that Edison returns the invention to the workbench before Tesla notices

As to the linked article, I would quote from the article itself: “[The paper] is dense with nuance that could potentially be molded to fit a particular narrative”. I could say more: both the article and the paper state there were zero byte-identical matches, and 203 direct or perceptual near matches out of 175 million generations (or out of 160 million training images) is 0.0001%, which argues against this very strong claim of “factually and provably incorrect”. But I think the argument I will actually make is this: I said Copilot was synthesizing identical code from its knowledge base, which is completely compatible with the claim that it is outputting code from memory (in the human sense of memory). What it is not compatible with is the claim that it is outputting code from memory (in the computer sense of memory).


> That’s copying.

> These models copy code.

These assertions have very little weight - what matters is what the law says. And for now, in the jurisdictions that matter, the law is still silent. Maybe it will catch up via either legislation or case law, or maybe it won't. But for now there is no legal consensus on what these new things are doing.

Of course there is a moral/ethical component to this, and individuals are obviously free to hold personal views. But unless some global working consensus forms (unlikely) then I don't see how change will come from that.

I suspect that it's going to be difficult for politicians to find the will to do something about this issue, and then to get their heads around it enough to make good laws.


If y = f(x, T) where y is a member of T, prove f() doesn’t copy a value from T.

This is possible. For example, many functions f(0) return 0; if we can derive a function g(x) that generates y without access to T (ie. the training data) then we can plausibly assume that f(x, T) might indeed not be copying from T.

…but can we do so?

If so, we must admit that it is possible that f(x, T) = g(x) and no copying is taking place.

So, in the past 50 years of prior work can we come up with an example of a function g(x) that generates the exact code we see coming out of these LLMs?

I’ll grant, it’s not impossible.

For example, if stable diffusion generated sine wave patterns or fractals that exhibited “deep complex structure” and it happened that some artist had written code to do that exact thing, there would (I believe) be a fair case that stable diffusion was not copying the artist; it was simply a parallel implementation of the same generative code.

However, now my scepticism kicks in.

For code written by hand, that was not generated, we are suggesting that an LLM is a parallel equivalent implementation of a human mind, and that it can, with purely “learnt patterns”, replicate not only the intent and structure of known code, but the exact code itself, repeatedly.

I’ll grant. It’s not impossible, and it’s very difficult to prove, but it seems fabulously, unbelievably, extraordinarily unlikely that a clean room reimplementation would be identical, repeatedly.

We’re entering into the domain of assigning probability and then doing a zero knowledge proof here, but the solution is easy.

Just prove by counter example.

Train a model that generates such code without the code as training data.

You have now won the argument.

…on the bright side, you’ve also solved the 10 million dollar question of “how do I train my LLM when I don’t have enough training data” (because that’s what you just did; train it to do something without training data), so you’re now rich and give zero ducks what I think. Congrats.


Could you explain this:

> For example, if stable diffusion generated sin wave patterns or fractals that exhibited “deep complex structure” and it happens that some artist had written code to do that exact thing, there would (I believe) be a fair case that stable diffusion was not copying the artist, it was simply a parallel implementation of the same generative code.

I understand your pseudo equations and I understand [some] engineering.

Why are artists writing code? Is that to show the parallel, as if artists did write code?

Unless artists do really write code but that’s beyond the scope of what I’m asking.


The point was that some artist may have created a work representing a complex fractal, perhaps through code since it can be quite difficult to manually compute some fractal patterns.

And that in that case, g(x) that can produce a perfect copy of the artist's work without having the image in the training set provably exists, and is the actual mathematical fractal function itself.
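
To make that concrete, here is a minimal sketch (Python, with arbitrary viewport and resolution choices) of such a g(x): a pure fractal renderer that involves no training data at all, yet any independent implementation of the same formula will emit the same image.

  # Hypothetical illustration: a deterministic fractal renderer g(x).
  # It uses no training data, yet two independent implementations of
  # the same formula produce identical output, pixel for pixel.
  def mandelbrot_escape(cx, cy, max_iter=100):
      """Iteration count at which the point (cx, cy) escapes, up to max_iter."""
      zx, zy = 0.0, 0.0
      for i in range(max_iter):
          zx, zy = zx * zx - zy * zy + cx, 2 * zx * zy + cy
          if zx * zx + zy * zy > 4.0:
              return i
      return max_iter

  def render(width=80, height=40):
      """ASCII rendering of the classic viewport [-2.5, 1.0] x [-1.0, 1.0]."""
      shades = " .:-=+*#%@"
      rows = []
      for y in range(height):
          row = ""
          for x in range(width):
              cx = -2.5 + 3.5 * x / width
              cy = -1.0 + 2.0 * y / height
              row += shades[min(mandelbrot_escape(cx, cy), 99) // 10]
          rows.append(row)
      return "\n".join(rows)

  if __name__ == "__main__":
      print(render())

Two parties who each write this down independently will produce identical output with no copying involved; the argument upthread is that ordinary hand-written application code almost never has that property.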


The law is generally less concerned with how something happened and more concerned with what happened. So I would suspect that the legal frameworks will be more restrictive towards code that is reproduced verbatim, regardless of if the code is directly copied or LLM generated.


In my personal lived experience, the law is usually most concerned with the “who”.


The subject of the thread is "Will my code be regurgitated if I put it on Github ? Should I avoid putting my code on Github if I don't want it copied?"

The legality aspect that you are injecting into the discussion is irrelevant.


I was replying to the comment by @wokwokwok.


This is correct. I suspect that the colorization precedents will come back into play: https://chart.copyrightdata.com/Colorization.html

Basically, if you take something and then feed new info into it, is that new thing a new thing, a derivative of that work, or something else? With these models the data and mathematical spline is so smeared across thousands of endpoints that it is hard to tell. However, I do not think the courts have to think about the details of how the models work. They can do something else so a jury can get its head around it.

But basically the courts will probably simplify it: copyrighted things go in one side, and other things come out the other side that sort of, kind of resemble the original thing. Do not worry about the details of how it is done. Is that new thing owned by the original copyright holder? Or many holders, as it is smeared into a hundred other items that may or may not have copyright holders? Or is it a new work? Or is it a derivative work that has just been colored?

This could go either way. As someone put it very nicely here a few weeks ago, these AIs are like the most amazing autocomplete you have ever seen. Now, in copyright, if you make something and I independently come up with the exact same thing, and I can prove it, I am in very little trouble. In this case, without that input code the AI model probably would not predict that exact string. But in some cases it does predict it exactly, while also being able to predict thousands of other things. Is that prediction copyrightable? What if it predicts part of the code mixed with something else? As a third party who did not create the model but just used it and got the code, what are my liabilities? And if it is infringement, what are the consequences? The courts will have to decide eventually.


The law cares about class. As for which class: not the working class. AI has no reason not to keep pushing through these ethics issues in the Global North.

I believe China is trying to limit AI powers but who knows how that will go.


Does this storage argument also go for humans? If I ask a painter to paint paintings they've seen before, what will be their rate of copying the painting exactly?


Are you asking if it is okay for humans to plagiarize or otherwise violate copyright? For both the human and the algorithm, the answer is: it depends. Yes?

Are you trying to make money off of it? Is it for personal use? Was it protected or public domain?

It just all depends on the circumstances. Human or not.


“It depends on the circumstances” seems to me like cover for “we made up some random rules that made sense at the time”.

If you cover a song on YouTube, you can apparently be demonetised or attract copyright strikes from the original artist. But in software, you tell me an algorithm and I code it up, that’s an original work.

The lines all seem pretty arbitrary to me.


Oh, I used to be a lawyer. "It depends" is just always the answer, whether we like it or not. And, yes, it is pretty much because of “we made up some random rules that made sense at the time”.


If you remove "random", that's a pretty fair representation of any law, lol.


It doesn’t matter. The process is irrelevant.

Black box. Input. Output. Is the output the same as some training input?

It’s not rocket science.

“…but humans…” argument is not relevant. This is not a human.


> It doesn’t matter. The process is irrelevant.

Wrong. Literally the whole thing about law - and especially about intellectual property laws - is that process is as much, if not more relevant, than the outcome. This is why "code as law" efforts are plain suicidal. This is why you can't just print out a hex dump of a pirated MP3 file and claim it's not copyright violation because it's just a long number that your RNG spit out - it would've been a good argument if your RNG actually did that, but it didn't, and that's what matters.

This is what it means when we say that, for just about everyone except computer scientists, bits have colour[0]. Lawyers and regulators - and even ordinary people - track provenance of things; how something came to be matters as much, and often more, than what the thing itself is.

This is what makes generative AI case interesting. They're right there in the middle between two extremes: machines xeroxing their inputs, and humans engaging in creative work. We call the former "copying", and the latter "innovation" or "invention" or "creativity". The two have vastly different legal implications. Generative AI is forcing us to formalize what actually makes them different, as until now, we didn't have a clear answer, because we didn't need one.

--

[0] - https://ansuz.sooke.bc.ca/entry/23


The pirate bay founders made the argument that process was necessary and lost fairly big. They argued that the process dictated that the prosecutor had to first prove that a copy had been made, and prosecute that, before they could argue that the pirate bay somehow helped with that crime.

The court did not agree. They looked instead towards an anti-biker gang law that illustrated that a biker bar can be found guilty of assisting with gang crime, even if no specific crime can be directly associated with the bar.

The defense team argument - that prosecutors need to prove that a crime had occurred - failed. The courts only require that the opposite is not believable, which given all the facts around the case was deemed sufficient. In that question the process doesn't matter. If the court does not think it believable that copying has not occurred, any argument about "machines xeroxing their inputs and humans engaging in creative work" will be ignored.


I wonder who had much more money, PB or the RIAA


But if I put a human in the black box, that somehow now matters to your argument, because you're saying that it only holds for machines.


I don’t care if there’s a human in the box. If the box spits out training input as output it is copying it.

It doesn’t matter if there is a human doing it or not.

For your supposition to work, the input to the box would be only an abstract summary of the logical steps, and the output an exact copy of some other thing that was never an input.

In that case, yes, it would not be copying.

..but, is that the case? Hm? Is that what we care about? Is it possible to randomly generate the exact sequence of text with no matching training input? Along with the comments?

It seems fabulously unlikely, to the point of being totally absurd.


I'm a proponent of not restricting (well, or trying to restrict) machine learning models and not considering them a lossy database, but it must be said: if humans can recreate copyrighted works from memory and publish them, they are in trouble too.


I agree, I'm not saying that machines don't produce copies of existing data, I'm saying that's not all they produce.


I agree. The problem is that a human has ethical deterrents to avoid copying data while a machine doesn’t, so we have to rely purely on legal incentives to avoid copies from being produced.


I think the best argument here is that having the work in memory is not illegal, and human brains are not bound to copyright even when they can also be considered lossy databases. The question is where do we draw the line for a lossy database.


If you transcribe a copyrighted book by hand, that doesn't give you the right to publish it. I don't think being a human currently gives you a legal loophole to copy works so why make the comparison?


Humans (and their creativity) have special status and privileges in law.

Machines don't. It doesn't matter how fancy they are. The law doesn't care.

So yes, a human in a black box is different from a machine in a black box until laws change.


Funnily, the copyright argument is shifting towards “if resemblance is substantial, that’s infringing anyway”, circumventing whole discussions around “is it infringing if it’s called learning” arguments.


Close to zero? It's too hard to create an identical picture from memory; even with the original at hand, the fakes are not bit-perfect. Computers, on the other hand, are great at copy&paste.


Except, if you look closely, the AI-generated duplicates are far from identical too.


Except they're not far? We're not talking about average images with 3 eyes or similarly obviously wrong code, the original comment is specifically about storing perfect copies. Human brains can't store perfect copies (and then reproduce), especially at scale


Close to 0, but not 0

https://www.nationalgeographic.com/science/article/autism-ar...

Stephen Wiltshire is cool.


But how is this relevant?


the human painter will undoubtedly be able to put out 1 billion exact copies of paintings based on their memory and skills in under 10 seconds so there is no reason to consider any difference in the cases.


Is number of copies and speed of making them a significant determinant in whether copyright infringement took place or not?


an opposite question - you seem to think that a machine has the same protections against charges of copyright infringement as a human does - why?

They don't have the same rights to register a copyright.


Why is registering a copyright relevant in cases of copyright infringement? Copyright infringement is infringement of the right of someone else against copying.


They do copy code sometimes, but most of the code they generate is new. I think some (maybe not many, but some) people have the impression that they literally search for similar code and copy it exactly 100% of the time, which isn’t true.


> in fact, have embedded full copies of the source images

A few images out of millions.


It's clearly not a copy. Stop arguing in bad faith.


> ‘There are legal tools and also rhetoric that are designed for dealing with this problem of "synthesising something that is identical to an existing thing", and it's different tools and rhetoric from the ones we have for dealing with the problem of "stealing or copying existing things".’

Copyright law applies in both cases. If you create work that’s substantially similar to an existing one, you’re risking copyright infringement even if none of the original work’s content was reproduced in the mechanical sense.

If this were not the case, why would cover bands pay the original artists for rights to the songs?

“I learned to play ‘Yesterday’ by heart” doesn’t mean you can do anything with the song without paying the Beatles. The same applies to machine learning, if all the model has learned is to imitate copyrighted works.


Yes, I do think that LLMs are probably doing copyright infringement. There is a long and rich heritage, in tech and elsewhere, of arguing that copyright infringement is not stealing, it is its own separate (though similar) crime and usually a lesser crime to some degree. Maybe my original point could be rephrased snarkily as “you may have a point but until you stop sounding like the RIAA calling everything ‘stealing’, a lot of people are going to ignore you”.


If I type out a copyrighted program character-by-character from my memory, does this somehow get exempted from copyright law?


Trying to be more precise:

If the LLM reproduces a significant portion of a program token-by-token, is it a derivative work and is it not fair use?


Now, I'm not a lawyer, so this isn't legal advice, but...

A derivative work is not fair use. If you end up with a significant portion of another person's program in yours (such that a substantial portion of your program is in some way related to their program), that will likely be a derivative work - but the definition of a derivative work depends on your jurisdiction and use-case. If you're unlawfully using the source material to produce a derivative work, you cannot copyright that derivative work under 17 U.S.C. § 103(a), and under the same section, you can only copyright your modifications, not the original.

It would be hard to argue fair use in this case; fair use only really applies for parodies/criticism, reporting, and scholarly works - and generally that's an affirmative defense, rather than an express or implied right you have.

Honestly, Copilot is difficult because Copilot can't be the author of the code; the person who used Copilot is the "author" of the code, and I think they'd be the ones liable for copyright infringement if copyrighted content ends up in their code.

To argue someone performed copyright infringement, all you need is to prove (1) a valid copyright exists; (2) that the person had access to the work; (3) the person had the opportunity to steal the work; and (4) that protected elements of the work had been copied (afaik generally under a "substantially similar" standard). Copilot offers an easy way to check both (2) and (3) - a copyright holder could argue that people had access to their code through Copilot, and that Copilot offered an opportunity to steal the work.


> the problem is that it's synthesising code identical to existing code

No, the problem is that it's illegal to do so.


> it seems to me that Copilot truly does synthesise code from some store of knowledge

It's a common mistake because we are not used to LLMs.


How is any of that relevant? If I memorize your work and dissect my brain and run strings on the output, you won't find anything either. Same happens if you put it in a .zip file. But both, and the language model, can spit it back out and infringe on your rights.


Humans have learnt by looking at a few examples, and from there can code almost anything. Humans can also learn without examples, although it's harder. But I expect the first C programmers didn't have a K&R book to learn from.

AI instead has looked at all the existing code and it has learnt to do no more than what the existing code can do.

Do you see a difference?


No, because I can ask it for a piece of code that isn't in the training set and it will write it for me.


You are missing the point.

Humans make sense of the world with an extremely limited amount of information. I have not read all of GitHub. Comparatively, I have not read a fraction of it. And I could not read all of GitHub even if I wanted to.

However, I do not need to read all of GitHub anyway because, as a human, I am capable of understanding.

Current generation "AI" is not capable of understanding. Current generation "AI" merely computes the probability of any given word appearing next in this sentence based on an utterly absurd amount of raw data.

That is not what humans do at all. We do not need this information. We cannot even process it.


We do not know whether or not that is what humans do, or how qualitatively different it is from what humans do, because we don't know all that much about how human reasoning works.

Put another way: Some Markov models are Turing complete, so "merely" computing probabilities can be Turing complete with only minor steps, so trying to downplay the potential capabilities of models like this by handwaving about things we don't know is foolish.

We don't need that scale of input, but we also don't know if LLMs need that much input to do well, or if our current training protocols are simply poor. With ongoing work on reducing the training cost, it is at a minimum clear that current training methods are far from optimal.


You make a logical leap. "Machines learn differently from humans, therefore they cannot understand". My definition for "understanding" isn't just "whatever humans (and nobody else) do".


Whatever your definition of "understanding" may be, we appear to agree that these are different things which is all parent asked about:

> Do you see a difference?


you can ask for things that are combinations of stuff that is in the training set.


> AI instead has looked at all the existing code and it has learnt to do no more than what the existing code can do.

Are you sure about that?


> Are you sure about that?

Yes I'm quite sure. Unless you're claiming that source code existed before humans?


I’m sorry but I don’t understand you.

It sounds like what you are saying is that the AI cannot write code which would do something it hasn’t seen in the training set. That is how I interpreted the “it has learnt to do no more than what the existing code can do”. Do I understand you right?

If so, how do you know that? Are you talking about the limitations of a particular implementation or a limitation of all AIs as a concept?


> If it was possible to point to where the code is actually stored in Copilot (i.e. run strings on the server and it spits out the copylefted code), I would be a lot more sympathetic to the view that LLMs are stealing code

If it was possible to point to where the pixel values are actually stored in a jpeg file (i.e. run dd on the file and it spits out the pixel values), I would be a lot more sympathetic to the view that LLMs are not stealing code.


In the very next sentence after the one you quote, I say “Even if you decode the weights from floats to strings, you won't find the strings of code stored anywhere.”

If you decode the jpeg file you do get the pixel values.


The whole debate is that when you decode the weights (i.e. "run AI on some prompt") you do in fact get training code reproduced verbatim. The fact that we do not have tools to analyze this decoding function analytically is orthogonal.


You don't get the exact pixel values decoding a jpeg. But for a lot of uses, what you get is close enough.

One could describe LLMs similarly.


Aren't we just talking about a novel compression technique here? Like that voice codec Google released a while ago that essentially modeled your voice and then fed a piano roll of what you said through it.

I don't really care whether there's a lossless, lossy, or AI compression of my stuff. I care that it's my stuff.


Every time I see this argument about LLMs I can't help but think of Borges' Pierre Menard, Author of the Quixote.


If we switch out LLMs for LZMA algorithms your reasoning still makes sense. Which is a real problem with your reasoning.


Would you be able to elaborate on the following? This isn't my field; I would be interested to learn an alternative way to talk about all this.

> legal tools and also rhetoric that are designed for dealing with this problem of "synthesising something that is identical to an existing thing"


Copyright infringement, intellectual property, patents, things like that.


What happens if I create a 10 lines function character by character identical to some proprietary or GPLled piece of code, without ever looking at that code nor knowing that it existed?

I expect that the copyright holders could reach out to me, tell me that my code is identical to theirs, believe or not that I independently created the same code, and at least ask me to stop distributing it.

Of course

  t += total(e)
is easier to defend than

  monthlyTotal += dailyTotal(expenses)
because the chances that I didn't copy their code get slimmer and slimmer as the names of the identifiers and the structure of the code get more complex.

If I actually looked at their code, it would be wiser from me to at least change the names, maybe also a little bit the structure of the code. That's the difference between being inspired by something and copying it.

TL;DR: if Copilot generates a copy, it's a copy.


> What happens if I create a 10 lines function character by character identical to some proprietary or GPLled piece of code, without ever looking at that code nor knowing that it existed?

Then you're in the clear, though you may need to convince a court that that's what happened. (Patents and trademarks could still be issues, but there's no copyright issue)


I know the law doesn't work this way, but wonder if code that can be "learned" this easily should really be copyrightable, etc. These LLMs don't keep a copy of some code to paste in verbatim.


LLMs totally do, and the fact can be confirmed by experiment. See "Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4": https://arxiv.org/abs/2305.00118


I'm really baffled by all this discussion on copyrights in the age of AI. Copilot does not 'steal' and reproduce our code - it simply LEARNS from it as a human coder would learn from it. IMHO the desire to prevent learning from your open source code seems kind of irrational and antithetical to open source ideas.


The problem is not that Copilot produces code that is "inspired" by GPL code, it's that it spits out GPL code verbatim.

> This can lead to some copylefted code being included in proprietary or simply not copylefted projects. And this is a violation of both the license terms and the intellectual proprety of the authors of the original code.

If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.

Edit: I'm definitely no expert on copyright law for code but my personal rule is don't include someone's copyrighted code if it can be unambiguously identified as their original work. For very small lines of code, it would be hard to identify any single original author. When it comes to whole functions it gets easier to say "actually this came from this GPL licensed project". Since Copilot can produce whole functions verbatim, this is the basis on which I state that it "would be a clear violation" of the licence. If Copilot chooses to be less concerned about violating the law than I am then that's a problem. But maybe I'm overly cautious and the GPL is more lenient than this in reality.


"The problem is not that Copilot produces code that is "inspired" by GPL code, it's that it spits out GPL code verbatim."

But only snippets as far as I can tell.

This is the code example linked by the author:

https://web.archive.org/web/20221017081115/https://nitter.ne...

It is still not trivial code, but are there really lots of different ways to transpose matrices?

(Also, the input was "sparse matrix transpose, cs_", so his naming convention was specifically included. So it is questionable whether a user would get his code in this shape with a normal prompt.)

And just slightly changing the code seems trivial, at what point will it be acceptable?

I just don't think spending much energy there is really beneficial for anyone.

I'd rather see the potential benefits of AI for open source. I haven't used Copilot, but ChatGPT4 is really helpful generating small chunks of code for me, enabling me to aim higher in my goals. So what's the big harm if some proprietary black box also gets improved, when all the open source devs can also produce with greater efficiency?


> (Also, the input was "sparse matrix transpose, cs_", so his naming convention was specifically included. So it is questionable whether a user would get his code in this shape with a normal prompt.)

This. People seem to forget that generative AIs don't just spit out copyrighted work at random, of their own accord. You have to prompt them. And if you prompt them in such a way as to strongly hint at a specific copyrighted work you have in mind, shouldn't some of the blame really go to you? After all, it's you who supplied the missing, highly specific input, that made the AI reproduce a work from the training set.

I maintain that, if we want to make comparisons between transformer models (particularly LLMs) and humans, then the AI isn't like an adult human - it's best thought of as having a mentality of a four year old kid. That is, highly trusting, very naive. It will do its best to fulfill what you ask for, because why wouldn't it? At the point of asking, you and your query are its whole world, and it wasn't trained to distrust the user.


But this means that Microsoft is publishing a black box (Copilot) that contains GPL code.

If we think of Copilot as a (de)compression algorithm plus the compressed blob that the algorithm uses as its database, the algorithm is fine but the contents of the database pretty clearly violate GPL.


While I do believe that thinking and compression will turn out to be fundamentally the same thing, the split you propose is unclear with NN-based models. Code and data are fundamentally the same thing. The distinction we usually make between them is just a simplification, that's mostly useful but sometimes misleading. Transformer models are one of those cases where the distinction clearly doesn't make any sense.


>And if you prompt them in such a way as to strongly hint at a specific copyrighted work you have in mind, shouldn't some of the blame really go to you?

If you, not I, uploaded my GPL'ed code to Github is the blame on you then?


> If you, not I, uploaded my GPL'ed code to Github is the blame on you then?

Definitely not me - if your code is GPL'ed, then I'm legally free to upload it to Github, and to an extent even ethically - I am exercising one of my software freedoms.

(Note that even TFA recognizes this and admits it's making an ethical plea, not a legal one.)

Github using that code to train Copilot is potentially questionable. Github distributing Copilot (or access to it) is a contested issue. Copilot spitting out significant parts of GPL-ed code without attaching the license, or otherwise meeting the license conditions, is a potential problem. You incorporating that code into software you distribute is a clear-cut GPL violation.


The GitHub terms of service state that you must give certain rights to your code. If you didn't have those rights, but they use them anyway, whose fault is that?


>And just slightly changing the code seems trivial, at what point will it be acceptable?

If I start creating a car by using one of Ford's blueprints to create something, at what point will it be acceptable? I'd say even if you rework everything completely, Ford would still have a case to sue you. I can't see how this is any different. My code is my code and no matter how much you change it, it is still under the same licence it started out with. If you want it not to be, then don't start with a part of my code as a base. In my opinion the case is pretty clear: this is only going on because Microsoft has lots of money and lawyers. A small company doing this would be crushed.


Easy. People get to throw rocks at the shiny new thing. To my untrained eye the entire idea of copyrighting a piece of text is ridiculous. Let me phrase it in an entirely different way from how any other person seems to be approaching it.

If a medical procedure is proven to be life-saving, what happens worldwide? Doctors are forced to update their procedures and knowledge base to include the new information, and can get sued for doing something less efficient or more dangerous, by comparison.

If you write the most efficient code, and then simply slap a license on it, does that mean, the most efficient code is now unusable by those who do not wish to submit to your licensing requirements?

I hear an awful lot of people complain all the time about climate change and how bad computers are for the environment, there are even sections on AI model cards devoted to proving how much greenhouse gases have been pushed into the environment, yet none of those virtue signalling idiots are anywhere to be seen when you ask them why they aren't attacking the bureaucracy of copyright and law in the world of computer science.

An arbitrary example that is tangentially related: One could argue that the company sitting on the largest database of self-driving data for public roads is also the one that must be held responsible if other companies require access to such data for safety reasons (aka, human lives would be endangered as a consequence of not having access to all relevant data). See how this same argument can easily be made for any license sitting on top of performance critical code?

So where are these people advocating for climate activism and whatever, when this issue of copyright comes up? Certainly if OpenAI was forced to open source their models, substantial computing resources would not have been wasted training competing open source products, thus killing the planet some more.

So, please forgive me if I find the entire field to be redundant and largely harmful for human life all over.


Yes, of course copyright is dumb and we'd all be better off without it. Duh.

The problem here is that Microsoft is effectively saying, "copyright for me but not for thee." As long as Microsoft gets a state-enforced monopoly on their code, I should get one too.


> If you write the most efficient code, and then simply slap a license on it, does that mean, the most efficient code is now unusable by those who do not wish to submit to your licensing requirements?

If you don't "slap a license on it" it is unusable by default due to copyright.


Could a human also accidentally spit out the exact code while having it just learned and not memorized in good faith?

I guess the likelihood decreases as the code length increases but the likelihood also increases the more constraints on parameters such as code style, code uniformity etc you pose.


> Could a human also accidentally spit out the exact code while having it just learned and not memorized in good faith?

That's just copying with extra steps.

The way to do it legally is to have 1 person read the code, and then write up a document that describes functionally what the code does. Then, a second person implements software just from the notes.

That's the method Compaq used to re-implement the original PC BIOS from IBM.


Indeed. Case closed. If an AI produces verbatim code owned by somebody else and you cannot prove that the AI hasn't been trained on that code, we shall treat the case in exactly the same way as we would treat it when humans are involved.

Except that with AI we can more easily (in principle) provide provable provenance of the training set and (again in principle) reproduce the model and prove whether it could create the copyrighted work even without having had access to the work in its training set.


>The way to do it legally is to have 1 person read the code

wasn't it to have one person run tests of what happened when different things were done, and then write up a document describing the functionality?

In other words I think one person reading the code is still in violation?


> Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners.

https://en.wikipedia.org/wiki/Clean_room_design


yes, reading that description it seems pretty clear to me that they did not read the code but they had access to the working system and then

>by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design.

reverse engineering is not 'reading the code'.


Theoretically maybe, but then they would have to prove in court that they did so without having knowledge of the infringed code. You can't make that claim for an AI that was trained on the infringed code.


Yes, that's why any serious effort in producing software compatible with GPL-ed software requires the team writing code not to look at the original code at all. Usually a person (or small team) reads the original software and produces a spec, then another team implements the spec. This reduces the chance of accidentally copying GPL-ed code.


> Could a human also accidentally spit out the exact code while having it just learned and not memorized in good faith?

Maybe, but that would still be copyright infringement. See My Sweet Lord.


It’s not accidental. Not infringing copyright isn’t part of the objective function like it would be for a human.


Not learning or not being inspired by copyrighted code is not a human function either though.


Has a human ever memorised verbatim the whole of github?

If someone somehow managed to do that and then happened to have accidentally copied someone's code, how believable would their argument be?


> Has a human ever memorised verbatim the whole of github?

No, and humans who have read copyrighted code are often prevented from working on clean room implementations of similar projects for this exact reason, so that those humans don't accidentally include something they learned from existing code.

Developers that worked on Windows internals are barred from working on WINE or ReactOS for this exact reason.


Hasn't that all been excessively played through in music copyright questions? With the difference that the parody exception that protects e.g. the entire The Rutles catalogue won't get you far in code...


> this would be a clear violation of the licence

Not necessarily. If it's just a small snippet of code, even an entire function taken verbatim, it may not be sufficiently creative for copyright to apply to it.

Copyright is a lot less black and white than most here seem to believe.


That’s part of the rub. YouTube doesn’t break copyright law if a user uploads copyrighted material without proper rights. Now, if YT was a free for all, then yeah. But given it does have copyright reporting functionality and automated systems, it can claim it’s doing a best faith effort to minimize copyright infringement.

Copilot similarly isn’t the one checking in the code. So it’s on each user. That said, Copilot at some point probably needs to add some type of copyright detection heuristics. It already has a suppression feature, but it probably also needs to have some type of checker once code is committed and at that point Copilot generated code needs to be cross-referenced against code Copilot was trained on.


> If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.

We aren't talking verbatim generation of entire packages of code here, are we? Code snippets are surely covered under fair use?


It would almost surely be fair use to include a snippet of code from a different library in your (inline) documentation to argue that your code reimplements a bug for compatibility reasons.

In general it is not fair use if you are using the material for the same scope as the original author[0] or if you are doing it just to namedrop/quote/homage the original.

It is possible to argue that a snippet can be too small to be protected, but that would not be because of fair use.

[0] Suppose that some Author B did as above and copied a snippet of code in their docstring to explain buggy behaviour of a library they were reimplementing. If you are then trying to reimplement B's library you can copy the same snippet B copied, but you likely cannot copy the paragraph written by B where they explain the how and the why of the bug.


> Code snippets are surely covered under fair use?

...for "purposes such as commentary, criticism, news reporting, and scholarly reports"? Sure.

For a commercial product? Best check with your lawyer...


Oracle would like to have a word..


The Fair Use concept is specific to the USA.


> it's that it spits out GPL code verbatim

It's not a problem in practice. It only does so if you bait it really hard and push it into a corner, at which point you may just as well copy the code directly from the repo. It simply doesn't happen unless you know exactly what code you're trying to reproduce and that's not how you use Copilot.


Just because code exists in a copyrighted project doesn't mean that it is on the only instance of that code in the world.

In a lot of scenarios, there is an existing best practice or simply only one real 'good' way to achieve something - in those cases are we really going to say that despite the fact a human would reasonably come to the same output code, that the AI can't produce it because someone else wrote it already?


This seems like a really, really easy problem to fix.

It should be easy to check Copilot's output to make sure it's not copied verbatim from a GPL project. Colleges already have software that does this to detect plagiarism.

If it's a match, just ask GPT to refactor it. That's what humans do when they want to copy stuff they aren't allowed to copy, they paraphrase it or change the style while keeping the content.



So we should attack the problem of proprietary code. Maybe from Right to Repair angle. I believe there should be no such thing as closed source code.


Closed source code is beige corp-speak, its true name is 'malware'.


In Linus Torvalds's book "Just For Fun", there's a chapter about copyright where he presents both the upsides and downsides of it in a pretty much balanced way. I think it's worth reading.


Bit of a false equation to act as though a massive computer system is the same as any individual.

People put code on github to be read by anyone (assuming a public repository), but the terms of use are governed by the license. Now you've got a system that ignores the license and scrapes your data for its own purpose. You can pretend it's human but the capabilities aren't the same. (Humans generally don't spend a month being trained on all github code and remember large chunks of it for regurgitation at superhuman speeds, nor can they be horizontally scaled after learning.)

You can still be of the opinion that this is fine, and I may or may not be fine with it as well. I just don't think the stated reason holds up to logic, or that other opinions ought to "baffle" you.


And GitHub’s EULA gives it the right to train Copilot on public code you host on GitHub.


The issue, though, is not the code I personally upload to my own public repositories, but the code that someone else uploads to Github by cloning my repository held somewhere else than Github.

Personally I have eschewed any personal use of Github since the MS acquisition and only ever use it where that's mandated by a client (so not my code). If you clone my code from elsewhere into a Github repo, that's just rude and contrary to my every intent and wish.

I think it's time to add a "No GitHub" clause as an optional add-on to the various open-source licenses.


So then the person who uploaded your code to GitHub has committed a copyright violation, and I’m sure GitHub would honor a request to remove your code from the model training corpus, as it was illegally uploaded to GitHub.


It’s not necessarily a copyright violation if the license permits copying. Under a permissive license, you are expressly permitted to copy the code and distribute copies provided you comply with whatever conditions the license mandates, without an explicit blessing of the copyright holders. Most popular licenses do not include a prohibition on training AI models. Maybe people should start including a clause.


Many popular licenses include a prohibition on being used to create proprietary software. GitHub Copilot is proprietary.


That's great, but GP's argument was

> Copilot does not 'steal' or and reproduce our code - it simply LEARNS from it as a human coder would learn from it.

Not "the terms of use you agreed to allow them to do it". Different argument with different amount of merit in my opinion


Agreed. I was just saying in the current environment GitHub has that license, nobody else has. So if the courts decide one day that because machines learn differently from humans, they will allow copyright holders to add a license exception that disallows machine training, then GitHub will benefit from this. It’s kind of ironical. What’s best for society is to not have any such law enacted and continue to allow open source models to progress alongside proprietary ones (in addition to more level competitive dynamics on the proprietary side).


They could just train a model on GPL code that can only be used on GPL code.

For MIT licenses that's impossible currently because of the requirement to mention the authors.


Copilot has been caught multiple times reproducing code verbatim. At some point it spat out some guy's complete "about me" blog page. That's not learning, that's copying in a roundabout way.

Also, AI doesn't learn "like a human". Neural networks are an extremely simplistic representation of a biological brain and the details of how learning and human memory works aren't even all that clear yet.

Open source code usually comes with expectations for the people who use it. That expectation can be as simple as requiring a reference back to the authors, adding a license file to clarify what the source was based on, or in more extreme cases putting licensing requirements on the final product.

Unless Microsoft complies with the various project licenses, I don't see why this is antithetical to the idea of open source at all.


No disrespect but I am baffled by your statement that it learns, even to go so far as to say as a human coder would learn.

I don't really want this comment to be perceived as flame bait (AI seems to be a very sensitive topic in the same sense as cryptocurrency), so instead let me just pose a simple question. If Copilot really learns as a human, then why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?


I think the comment was trying to draw the distinction between a database and a language model. The database of code on GitHub is many terabytes large, but the model trained on it is significantly smaller. This should tell us that a language model cannot reproduce copyrighted code byte for byte because the original data simply doesn't exist. Similarly, when you and I read a block of code, it leaves our memory pretty quickly and we wouldn't be able to reproduce it byte for byte even if we wanted to. We say the model learns like a human because it is able to extract generalised patterns from viewing many examples. That doesn't mean it learns exactly like a human but it's also definitely not a database.

The problem is that in reality, even though the original data is gone, a language model like Copilot _can_ reproduce some code byte for byte somehow drawing the information from the weights in its network and the result is a reproduction of copyrighted work.


I see what you're going for, and I respect your point of view, but also respectfully I think the logic is a little circular.

To say "it's not a database, it's a language model, and that means it extracts generalized patterns from viewing examples, just like humans" to me that just means that occasionally humans behave like language models. That doesn't mean though that therefore it thinks like a human, but rather sometimes humans think like a language model (a fundamental algorithm), which is circular. It hardly makes sense to justify that a language model learns like a human, just because people also occasionally copy patterns and search/replace values and variable names.

To really make the comparison honest, we have to be more clear about the hypothetical humans in question. For a human who has truly learned from looking at many examples, we could have a conversation with them and they would demonstrate a deeper sense of understanding behind the meaning of what they copied. This is something an LLM could not do. On the other hand, if a person really had no idea, like someone who copied answers from someone else in a test, we'd just say: well, you don't really understand this and you're just x degrees away from having copied their answers verbatim. I believe LLMs are emulating the latter behavior and not the former.

I mean, how many times in your life have you talked to a human being who clearly had no idea what they were doing because they copied something and didn't understand it at all? If that's the analogy that's being made then I'd say it's a bad one, because it is actually choosing the one case where humans don't understand what they've done as a false equivalence to language models thinking like a human.

Basically, sometimes humans meaninglessly parrot things too.


> The database of code on GitHub is many terabytes large, but the model trained on it is significantly smaller.

This just means it's a really efficient lossy compression algorithm, not that it learns like a human.


> why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?

I've never studied computer science formally, but I doubt students learn only from the CS curriculum? I don't even know how much knowledge the CS curriculum entails, but I don't, for example, see anything wrong with including example code written by humans.

Surely students will collectively also learn from millions of code examples online alongside the study. I'm sure teachers also do the same.

A language model can also only learn from text, so what about all the implicit knowledge and verbal communication?


What they are saying is that if you've studied computer science, you should be able to write a computer program without storing millions or billions of lines of code from GitHub in your brain.

A CS graduate could work out how to write software without doing that.

So they’re just pointing out the difference in “learning”.


LLMs are not storing millions or billions of lines of code, and neither do we. Both store something more general and abstract.

But I'm saying there's a big difference between a CS graduate and some current LLM that learns from "the CS curriculum". A CS graduate can ask questions, use google to learn about things outside of school, work on hobby projects, study existing code outside of what's shown in university, get compiler feedback when things go wrong, etc.

All a language model can do is read text and try to predict what comes next.


We do but we also simulate it doing homework very well.


AI doesn't "learn". It's statistical inference if anything.

If I took two copyrighted pictures and layered them on top of each other at 50% opacity, would that be OK or copyright infringement?

AI models just use more weights/biases and more images (or any input).
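For what it's worth, that 50% overlay is literally just a per-pixel average of the two images. A minimal sketch, assuming Pillow is installed and two same-sized image files with these hypothetical names exist:

    from PIL import Image

    a = Image.open("picture_a.jpg").convert("RGB")
    b = Image.open("picture_b.jpg").convert("RGB")

    # Image.blend(im1, im2, alpha) computes im1*(1-alpha) + im2*alpha per pixel,
    # so alpha=0.5 is the "both at 50% opacity" case.
    blended = Image.blend(a, b, 0.5)
    blended.save("blended.jpg")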


And what is LEARNING in your opinion?


Cambridge dictionary has it as: "knowledge or a piece of information obtained by study or experience".

If I scanned a thousand polaroid pictures, and took their average RGB values and created a LUT that I could apply to any photograph to make it look "polaroidy" - would that be learned? Or the application of a statistical inference model? This alone is probably far enough abstracted to never be an ethical or legal issue. However, if I had a model that was only "trained" on Stephen King books, and used it to write a novel, would that be OK? Or do you think it would be in the realm of copyright infringement?

By your definition anything a computer does means it has learned it. If I copy and paste a picture, has the computer "learned" it while it reads out the data byte-by-byte? That sure sounds like it is "studying" the picture.

"AI" and "ML" are just statistics powered by computers that can do billions of calculations per second. It is not special, it is not "learning". To portray some value to it as something else is disingenuous at best, and fraud at worst.


Your polaroid example would require someone to write code that does that one specific thing. You could also argue that this would violate copyright if it was trained on some photographer's specific unique style, made as an app and marketed as being able to mimic the photographer's style. But in your example you have 1000 random polaroid images of unknown origin, so somehow it becomes abstract enough that it doesn't become an issue.

In your Stephen King example I would say it's still learned, because the "code" is a general language model that can learn anything. It's just that you decided to only train it on Stephen King novels. If you have an image model that trained 100% on public domain images and finetune it to replicate a specific artist's style, I would personally think the finetuned model and its creator are maybe violating copyright.

But when it comes to learning I would say when you write a program whose purpose is to learn the next word or pixel, but it's up to the computer to figure out how to do that, the computer is learning when you feed it input data. It's the program's job to figure out the best way to predict, not the programmer. (it's not that black and white given that the programmer will also sometimes guide the program, but you get the idea)

When you write a program that does one or several things, it's not learning.

I think it's something to do with the difference between emergent behavior from simple rules and intentional behavior from complex rules.


I think you're using fancy language like "general language model" to obscure the facts.

If I created a program to read words from the input and assign weights based on previous words, I could feed in any data. Just like the polaroid example. (I suggested that the polaroid example was abstract enough not to be an ethical/legal problem because I believe it is mostly transformative, unless the colours themselves were copyrighted or a distinct enough work in themselves.)

Now if I only feed in Stephen King books and let it run, suddenly it outputs phrases, wording, place names, character names, and adjectives all from Stephen King's repertoire. Is this a 'general language model'? Should this be copyright exempt? I don't think this is transformative enough at all. I've just mangled copyrighted works together, probably not enough to stand up against a copyright claim.

I think people use AI and ML as buzzwords to try and obfuscate what's actually happening. If we were talking about AI and ML that doesn't need training on any licensed or copyrighted work (including 'public domain') then we can have a different conversation, but at the moment it's obscured copyright theft.


I can agree it's obscure in the sense that we shrug when asked about how it works. If you specifically train a model to mimic a specific style I can get behind it leaning more towards theft, or at least being immoral regardless of laws.

If you train a model to replicate 10000 specific artists, I could also get behind it being more like theft.

But if the intention was to train with random data (and some of it could be copyrighted) just like your polaroid example to generate anything you want, I'm not so sure anymore.

I feel the intent is the most important part here. But then again I don't know the intent behind these companies, and I guess you don't either. Maybe no single person working in these companies know the intent either.

It also gets murky when you have prompts that can refer to specific artists and when people who use the models explicitly try to copy an artist's style. In the case of Stable Diffusion, if the CEO is to be believed, the CLIP model had learned to associate the name of Greg Rutkowski and other artists with images that were not theirs but in a similar style.[0]

Even murkier is when you have a base model trained on public data, but people finetune at home to replicate some specific artist's style.

[0] https://twitter.com/EMostaque/status/1571634871084236801


> If I scanned a thousand polaroid pictures, and took their average RGB values and created a LUT that I could apply to any photograph to make it look "polaroidy" - would that be learned?

You wouldn't. LUT would.


It's data. No one owns data.


Can I have your credit card number, expiry and verification number please? Also your DNA?

Since it’s data that should be cool right ?


Equating human cognition with machine algorithms is the root of the issue, and a significant part of its "legitimacy" comes from the need for "AI" companies to push their products as effective, and there's no better marketing than to equate humans to machines. Not even novel.


It requires abstraction. Something that LLMs are not capable of, beyond trivial amounts.


TRAINING your 3rd eye/branch predictor

    if (nonfree_software) {
        // unhappy path
    }


You can make out the two original copyrighted pictures in that case, and all you did was using 50% opacity which might not be very transformative, so probably?

In my mind (and I suspect in others' too), in the machine learning context, statistical inference and learning have become synonymous with all the recent developments.

The way I see it, there's now a discussion around copyright because people have different fundamental views on what learning is and what it means to be human, and those views don't really surface.


If "like a human" is enough to get human rights then why did I get a parking ticket even when I argued that my car just stands there like a human ? This really isn't as good a defense as people portray. There are a lot of rights and privileges granted to humans but not to objects - we can all agree on that I think.


And if you need a person with supercharged rights and a slippery amount of liability...form a corporation.


There is a difference between a person learning and a commercial product learning from someone else’s work, probably ignoring all the licenses.


To be fair, when a programmer learns from publicly available but not public-domain code, and then applies the ideas, patterns, idioms and common implementations in their daily job as a software developer, the result is very much a "commercial product" (the dev company, the programmer themselves if a freelancer) learning from someone else's work and ignoring all the licenses.

The only leap here is the fact that the programmer has outsourced the learning to a tool that does it for them, which they then use to do their job, just as before.


No, the difference is that OpenAI has a huge competitive advantage due to direct partnership with Github, which is owned by Microsoft. In fact, it's even worse. With OpenAI making money from GPT, Github has even less incentive to make data easily available to others because that would allow for competition to come in. I wouldn't be surprised if Github starts locking down their APIs in the near future to prevent competitors from getting data for their models.

Nobody is arguing against uploading code. It's about Github/Microsoft specifically.


I agree there's a difference in the ease of access, a competitive advantage, sure. And I get that people writing public-source (however licensed) software don't want to make it easier for them (as in, Microsoft) to make money off of "learning" (of the machine type) from it. That's fair.

However, at a first glance, it still feels to me like an unavoidable reality that if you publish source code it'll eventually be ingested by Copilot or whatever comes next.

I mean, for the rest of the content all the new fancy LLMs have been trained with, there wasn't a Github equivalent. They just used massive scraped dumps of text from wherever they could find them, which most definitely included trillions of lines of very much copyrighted text.

In short: not only do I not really see an issue with Copilot-like AIs learning from publicly available code (as I described in the GP comment), but I also think if you publish code anywhere at all it's inevitable that it'll end up in Copilot, regardless of where you host it. If you want to make it more expensive for Microsoft to scrape it, sure, go ahead, but I don't think it matters in the long run.


> However, at a first glance, it still feels to me like an unavoidable reality that if you publish source code it'll eventually be ingested by Copilot or whatever comes next.

I'd be quite careful with this view.

By your logic, it should be ok to take the Linux kernel, copy it, build it, then sell it and give nothing back to the community that built it. Then just blame it on the authors for uploading it to the internet?


> all this discussion on copyrights in the age of AI.

Copyright is a thing; AI does not change that.

> does not 'steal' or and reproduce our code - it simply LEARNS from it as a human

And here we have the central problem, does it act like a human or does it not act like a human? Humans copy things they learn all the time, some of us know various songs by heart, others will even quote entire movies from memory. If AI can learn and reproduce things like humans do then you need to take steps to ensure that the output is properly stripped from any content that might infringe on existing copyrighted works.


There is a definite difference between singing a song while walking down the street and writing down the lyrics, putting it in a database, claiming it’s my content and then selling it on, even if it’s slightly rehashed.


I would have no problem if such AI systems were also completely open source, could be run by me on my system, and came with all the models needed to use them also easily available (again under some form of open source license). I genuinely don't see that happening with BigTech. As such, as a proponent of the FSF GPL philosophy, I have no interest in supporting such systems with my hard work, my source code. So yes, I do consider it stealing: my hard labour in any GPL open source work is meant for the public good (for example, to preserve our right to repair by ensuring the source code is always available through the GPL license). Any corporation that uses my work for profit, without paying me and while undermining the public good that I am aiming for, is simply exploiting me and the goodwill of others like me.


Copilot does not steal. Copilot does not learn. If you want to apply these concepts to LLMs, first prove how an LLM is human and then explain why it doesn’t have human rights.

Rather, Copilot is a tool. Microsoft/ClosedAI operate this tool. Commercially. They crawl original works and through running ML on it automatically generate and sell derivative works from those original works, without any consent or compensation. They are the ones who violate copyright, not Copilot.


Whether an LLM actually learns is completely tangential to the topic at hand. A human coder who learned from copyrighted code and then reproduced that code (intentionally or not) would be in violation of the copyright. This is why projects like Wine are so careful about doing clean room implementations.

As an aside, it seems really strange to invoke "open source ideas" as an argument in favor of a for-profit company building a closed source product that relies on millions of lines of open source code.


It's also fair to say that a lot of this carefulness has probably made life difficult for the developers of Wine, but they wanted to avoid Microsoft's legal team. So they respected the copyright laws.

Here is Microsoft doing as Microsoft does…


I'm in several communities for smaller/niche languages, and asking questions about things that have few sources makes it much clearer that it's not "learning" but grabbing passages/chunks of source. Maybe with subjects that have more coverage it can assimilate more "original"-sounding output.


Plenty of people already argued that LLMs don't actually learn like a human. However, you should keep in mind the reason why clean-room reverse engineering exists: humans learn from source material. FLOSS RE projects (e.g. nouveau) typically don't like leaks, because some contributors might be exposed to copyrighted material. Sometimes, the opposite happens: people working on proprietary software are not allowed to see the source of a FLOSS alternative.


> it simply LEARNS from it as a human coder would learn from it.

It doesn't LEARN anything, let alone like a human coder would. It has absolutely zero understanding. It's not actually intelligent. It's a highly tuned mathematical model that predicts what the next word should be.


I can also learn things with no understanding (like a foreign word); I doubt that would make me immune to copyright?


If you were to learn a phrase that insulted the king in Thai, and said it in Thailand, you would end up in jail. Doesn't matter if you understood what the phrase said. Ignorance doesn't make you immune to consequences.


Your comment implies that we’re in some age of AGI, but we’re not there yet. Some argue that we’re not even close, but who knows, that’s all speculation.

> it simply LEARNS from it as a human coder would learn from it.

The LLM doesn't learn; the authors of the LLM are encoding copyright-protected content into a model using gradient descent and other techniques.

Now as far as I understand the law, that’s OK. The problems arise when distribution of the model comes into play.

I'm curious, are you a programmer yourself? Don't take this the wrong way, but I want to understand the background of people who come to the kind of conclusion you seem to have arrived at about how LLMs work.


> it simply LEARNS from it as a human coder would learn from it

What humans do to learn is intuitive, but it is not simple. What the machine does is also not simple, it involves some tricky math.

Precisely if the process was simple, then it could be more easily argued that the machine is "just copying" - that is simple.

There's a lot of nuance here.

What the machine is doing "looks similar to what humans do from the exterior", the same way that a plane flying "looks similar" to a flying bird. But the airplane does not flap its wings.

> kind of irrational and antithetical to open source ideas

Open source ideas are not the only ideas in town.


Humans don't learn an algorithm by memorizing a particular implementation character by character.


That's all the more reason for the utility of solutions like Copilot? Humans are limited in both time and memory.

Though, GitHub would do well to also bake in appropriate attributions if a significant portion of the generated code is a copypasta.


Neither does copilot.


But it does though. There have been many times where this was the case.


It only happens if you bait it really hard and push it into a corner. That's not representative at all. I use Copilot to write highly niche code that's based on my own repo. It's simply amazing at understanding the context and suggesting things I was about to write anyway. Nothing it produces is just copypasted character by character. Not even close.


As others have pointed out, it means the model contains copyrighted material. So I guess that's totally illegal. Like if I ripped a Windows ISO, zipped it up and shared it with half the world. You know what would happen to me, don't you?


Not the same thing at all. The data isn't just sitting there in a store inside the model that you can query. No-one would be able to look at the raw data and find any copyrighted material, even if all it was trained on was copyrighted code (which I agree is an issue).


There’s a lot of misconceptions here but LLMs and stable diffusion have spat out copyrighted material verbatim.

So that’s not accurate.


What is not accurate? They are still not storing any material internally, even if the patterns they have learned can cause them to output copyrighted material verbatim. People need to break out of the mental model that an LLM is just a bunch of pointers fetching data from an internal data store.


Have a read through other comments on this thread, you'll see some good examples.


And airplanes don't flap their wings, but we still agree that they're flying, just as birds do.


There are people who do it... I personally know a guy with a photographic memory


He doesn't get an exemption from copyright law, or does he?


Humans are intentionally loading up giant sets of curated data, for training purposes, into a supercomputer to produce a model which is a black box, and have provided zero attribution or credit to those who made this work possible. Humans are tuning these models to produce the results you see.

In the case of ChatGPT-x, OpenAI is a company disguised as a not-for-profit, with a goal of producing ever more powerful models that may eventually be capable of replacing you at work, while seemingly not having any plan to give back to those whose work was used to make them insane amounts of money.

They haven’t even given back any of their research. So it’s ok to take everyone’s open source work and not give back is it ?

This isn’t some cute little robot who wakes up in the morning and decided it wants to be a coder. This is a multi-national company who has created the narrative you’re repeating. They know exactly what they’re doing.


"Learning" is a technical term, AI doesn't really learn the same way a human does. There is a huge difference between allowing your fellow human beings to learn from you and allowing corporations to appropriate your knowledge by passing it through a stochastic shuffler.


Individuals can train their own LLMs too.


Copilot is run by a corporation, and the model is owned by the corporation - despite being trained on open source data.

In general individuals will have problems with the first L of LLMs - unless the community invents a way to democratise LLMs and deep learning in general. So far the deep learning space is a much less friendly place for individuals than software was when the ideals of the open source movement were formed.


A full LLM is too expensive for individuals to train, but LoRAs aren't.

There are multiple open source LLMs out there that can be extended.

We can already see it in AI art scene. People are training their own checkpoints and LoRAs of celebrities, art styles and other stuff that aren't included in base models.

Some artists demand to be excluded from base model training datasets, but there's nothing they can do against individuals who want to copy their style - other than not posting their art publicly at all.

I see the same thing here. If your source code is public - someone will find a way to train an AI on it.
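To give a sense of how cheap the LoRA route is compared with training a full model, here is a minimal sketch using the Hugging Face transformers and peft libraries; the tiny "gpt2" base model and the chosen hyperparameters are just illustrative assumptions:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in base model
    config = LoraConfig(
        task_type="CAUSAL_LM",
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["c_attn"],   # GPT-2's attention projection
        fan_in_fan_out=True,         # GPT-2 uses Conv1D layers
    )
    model = get_peft_model(base, config)

    # Only the small injected adapter matrices are trainable; the base weights stay frozen.
    model.print_trainable_parameters()

The printed count of trainable parameters is typically a fraction of a percent of the base model, which is why this kind of finetuning fits on consumer hardware.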


>it simply LEARNS from it as a human coder would learn from it

I thought that was a sarcastic remark, given the capitalization of 'learn', but the 'IMHO' that followed dispelled that impression.

We have no idea how humans learn, and the 'AI' has a statistical approach, not much more than that.


A human who learns to copy code letter for letter does just that: copies code. Same with an AI.

The interesting debate should be about what happens in the gray area, when you read a lot of code and learn patterns and ideas.


Code is, at best, a trade secret (it is also data). Keep it close to your chest, or don't.


But.. to be clear what you can and can't do with certain code depends on the license. Imagine code that is "open source" as in openly visible and available, yet the license explicitly forbids the use of it to train any AI/LLM. Now how could the creator enforce that? Don't get me wrong, I am aware that the enforcement of such licenses is already hard (even for organizations like the FSF).. but now you are going up against something automated where you might not even know what exactly happens.


Potayto potahto. We all know there's a difference between training a machine learning model and learning a skill as a human being. Even if you can trick yourself into believing AI is just kinda like how human brains work maybe, the obvious difference is that you can't just grow yourself a second brain and treat it like a slave whereas having more money means you can build a bigger and better AI and throw more resources at operating it.

Intellectual property is a nebulous concept to begin with, if you really try to understand it. There's a reason copyright claim systems like those at YouTube don't really concern themselves with ownership (that's what DMCA claims are for) but instead with the arbitrary terms of service that don't require you to have a degree in order to determine the boundaries of "fair use" (even if it mimics legal language to dictate these terms and their exemptions).

The problem isn't AI. The problem is property. Ever since Enclosure we've been trying to dull the edges of property rights to make sure people can actually still survive despite them. At some point you have to ask yourself if maybe the problem isn't how sharp the blade you're cutting yourself is but whether you should instead stop cutting. We can have "free culture" but then we can't have private property.


> IMHO desire to prevent learning from your open source code seems kind of irrational and antithetical to open source ideas

You may be right that this is antithetical to "open source" ideas, as Tim O'Reilly would've defined it - a la MIT/BSD/&c., but it's very much in line with copyleft ideas as RMS would've defined it - a la GPL/EUPL/&c. - which is what's being explicitly discussed in this article.

The two are not the same: "open source" is about widespread "open" use of source code, copyleft is much more discerning and aims to carefully temper reuse in such a way that prioritises end user liberty.


> it simply LEARNS from it as a human coder would learn from it.

This is really not how LLMs work.


A key difference is that a company is making a proprietary paid product out of the learnings from your code. This has nothing to do with open source.

If the data could only be used by other open source projects, e.g. open source AI models, I don't think anyone would complain.

You could argue "well, but anyone can use the code on Github" and while that's technically true, it's obvious that with Github being owned by Microsoft and OpenAI being closely tied to it, OpenAI gets a huge competitive advantage due to internal partnerships.


Imagine if folks got royalties on commits, or the language model was required to be open as well.


The company that trains/owns the AI steals the content.


> it simply LEARNS from it as a human coder would learn from it

Does it though? It "learns" correlations between tokens/sequences. A human coder would look at a piece of code and learn an algorithm. The AI "learns" token structure. A human reproducing original code verbatim would be incidental. AI (language model, at least) producing algorithm-implementing code would be incidental.
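To make "correlations between tokens" concrete, here is a toy sketch of the same training objective (predict the next token) using nothing but bigram counts; real LLMs are enormously more sophisticated, but the objective is the same:

    from collections import Counter, defaultdict

    def train(corpus: list[str]) -> dict[str, Counter]:
        """Count, for each token, which tokens tend to follow it."""
        model: dict[str, Counter] = defaultdict(Counter)
        for line in corpus:
            tokens = line.split()
            for prev, nxt in zip(tokens, tokens[1:]):
                model[prev][nxt] += 1
        return model

    def predict(model: dict[str, Counter], token: str) -> str | None:
        """Return the most frequently observed follower of `token`, if any."""
        followers = model.get(token)
        return followers.most_common(1)[0][0] if followers else None

    toy = train(["for i in range ( n ) :", "for x in items :"])
    print(predict(toy, "in"))  # "range" or "items", whichever was seen more (ties keep first-seen)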


If that were true, Copilot would have been scanning Windows and Office source code. But we don't see that.


Nobody wants that.


I want that. I very much want someone to take one of the Windows code leaks, use it to train a LLM, and then make a fork of ReactOS with AI-completed implementations of everything ReactOS hasn't yet finished. Because then we could find out if Microsoft really believes that LLMs are fair use:)


Apes love moralizing and being indignant. This joker wants to share open source code and restrict what other people do with it.


So, like any license except public domain?

Have you personally ever put out something in public domain?


I’ve released plenty under GPL. Not possible to assign to public domain everywhere.


So you restrict what to do with it…


> It means that they have the right to share the code of others on GitHub, as long as they respect the terms of license. This is totally legal. But then, Copilot will be able to analyze the code and violates the license terms, which isn’t.

Both claims here are incorrect even though pretty much everyone gets this wrong. When someone who has obtained the right to redistribute some code only under the GPL uploads it to GitHub, that person (and that person only) violates the terms of the GPL. The GPL requires further redistribution only under its own terms, but uploads to GitHub come with a grant of a too-permissive license that a GPL licensee does not have the right to grant.

When GitHub proceeds to use the uploaded code to train copilot, they (probably) are abiding by the terms of this new license they have been (fraudulently) granted. They are not bound by the GPL, that's not how licenses work: they've got the other one. Now, GitHub has a big weakness here which is that they ought to know they're being granted licenses that the putative licensors have no right to grant. But that still would not make them in violation of the GPL, just of the original copyright.


> They are not bound by the GPL

but they are bound by the original copyright and are in violation of it


Let's leave copilot/AI aside for a moment.

Do you actually have the necessary rights to upload someone else's code to github?

When you upload to github, you give it special rights not merely for redistribution and CI stuff.

You give Github the rights to use the code for other github projects. That alone might not be compatible with some licenses (think GPL virality).

So if your software is BSD or anything without attribution I can probably upload it without problems.

But if your license requires so much as attribution, can I give Github the rights to use the code for any other internal project they might have?

Remember that in Github TOS you give grants for any github service, and in some special cases it requires this to be without attribution.

IANAL but I think I lack the necessary rights here, regardless of copilot.


IANAL but I believe you are incorrect about the rights you grant GitHub. They are still bound by the license you set in your repository, including any copyleft in GPL or attribution requirements. GitHub TOS sections D.4-D.7 spell out the license that you must grant GitHub in order to host code there. “You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time.” Notably, see also section D.3 which states “If you upload Content that already comes with a license granting GitHub the permissions we need to run our Service, no additional license is required.” I would think any open source license would be sufficient license for those things, like storing and displaying code, and I would think this clause is primarily for code without a clear license not already granting these terms.

To your point about redistribution, see this clause in D.4: “This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.”


Yes, the granted rights seem to be mostly about stuff related to repository management, and not e.g.: using the software (except audio/video).

But section D.7 ("Moral Rights") is about waiving moral rights, which at least in Italy is not legal, and then says: "To the extent this agreement is not enforceable by applicable law, you grant GitHub the rights we need to use Your Content without attribution".

It mostly seems like poor wording of that section not being limited to archiving/displaying, but there is a case where you grant Github something without need of attribution.

Which is against most Open source licenses, which means that I can't upload someone else's code.

Again IANAL, and theoretically open source projects want maximum distribution, but I think this does mean that technically I would grant to Github something I can't grant.


> They are still bound by the license you set in your repository, including any copyleft in GPL or attribution requirements.

Then what gives them the right to use the code for creating proprietary software (Copilot)?


> So if your software is BSD or anything without attribution I can probably upload it without problems.

BSD and MIT licenses absolutely do require an attribution.


most BSD versions require attribution, but there are multiple versions of BSD and at least one (the most basic) does not require attribution

https://en.wikipedia.org/wiki/BSD_licenses


I understand the sentiment, but I think it is misguided and therefore - counterproductive.

First, LLMs learn patterns, not just copy and paste. If they generate verbatim copies of any non-trivial-enough part that would be the subject of a copyright license, yes, it would be copyright infringement. Yet, could anyone give such practical examples? And if so, how do they differ from a software engineer who copies and pastes code?

Second, if the code is hosted anywhere else, there is no guarantee that Copilot (or another model) won't learn from that. The only way to make sure no one and nothing will learn from open-source code is to make it as closed as possible.

Third, for me, the crucial part of open-source code is maintenance. GitHub is there and works well both as a platform for creation (I consider GitHub the most productive social network) and an archive. "No GitHub" (even as a mirror) means that the code is likely to be stored in places less likely to engage collaborators and less likely to last long.


> If they generate verbatim copies of any non-trivial-enough part that would be the subject of a copyright license, yes, it would be copyright infringement. Yet, could anyone give such practical examples?

Yes, there are many such examples.

https://twitter.com/docsparse/status/1581461734665367554

https://twitter.com/mitsuhiko/status/1410886329924194309

https://codeium.com/blog/copilot-trains-on-gpl-codeium-does-...


> there are many such examples.

It's always the same two examples, and I would not classify that as "many", especially since that fast inverse square root function has been shown to be on GitHub and other sites countless times with all sorts of different licences (which is wrong, but copilot doesn't seem to do better or worse than humans in this regard).

That codeium.com is just asking leading questions, or the AI equivalent of that.


What's so special about the code here:

https://twitter.com/docsparse/status/1581461734665367554

It implements a common operation in a standard way as far as I can tell. The AI re-implementation is not identical.

It seems more like multiple discovery than plagiarism let alone re-producing a copy without a license.


To be fair that is not verbatim.

It is mutatis mutandis the same but is that a problem? I'm sure many would say so, I'm not convinced.

Ultimately if his code is out there a Google search could bring up a snippet without the license visible and I might copy paste that. The crux is the same code might be presented without context.

Copilot is just a tool and the person responsible for its safe usage is the human behind it.

In my world view, if I copy a picture off Google image search ultimately I am morally the one who infringed copyright not Google.


> In my world view, if I copy a picture off Google image search ultimately I am morally the one who infringed copyright not Google.

I have an idea why, but... why exactly? What about a web scraper (that I made, similarly to that of Google) that downloads images? What if it is randomly downloading images and not intentionally a specific one?


Valid points. But I don't want my code to be used by big corporations and monopolies to train closed source LLMs that they're going to sell. Shouldn't I get to have a say in that?

For example, GPL controls what kind of projects can use my source code. Maybe there could be an addendum to GPL that requires all LLMs trained on the source code to be open source. Sure, that won't guarantee that Copilot-like bots won't be trained using my code. But it does give me a legal framework to stop big corporations from profiting off such Copilot-like bots without making them open-source as well.


A genuine Q: if the LLM was from a purely non-profit company that gave out their AI for free, would you mind your code being used? Would you in fact be proud that it has made a useful contribution? Assuming that the outcome does not affect your income.


Not the original commenter, but if the model was publicly released and open, I definitely wouldn't have a problem with that.


This is the FOSS license I want


> how do they differ from a software engineer who copies and pastes code?

They differ because the author of the code did not agree to their code being used to train an LLM.

> Second, if the code is hosted anywhere else, there is no guarantee that Copilot (or another model) won't learn from that

This issue should be resolved not by author of the code, but by Copilot or any other LLM team.

> Third, for me, the crucial part of open-source code is maintenance.

Even if the author is against it? This is a valid argument, but the result is not very different from pirating software.


You look at the problem from principles, while I look for the outcomes.

As to the third point - well, it is up to the author, and I respect that (regardless of whether I would do the same thing). People have the right to not share it at all, or share it as a copyrighted piece of software, or with any other limitations. Though, all limitations (and copyleft is a limitation) affect its usage.


I get the sentiment, but people can and should do whatever you permit them to in your license. If you don't want your code hosted in one place, say so in your license.


From TFA

> Is this a legal document?

> No, it isn’t. If the project is under an open source license, it means that everyone can share a copy – even on GitHub – of the licensed material under certain conditions. A license restricting this right wouldn’t be open source anymore. However, since GitHub may not respect the terms of licensed code that is hosted on their servers, not uploading the code of others there is, in fact, an ethical choice.

emphasis mine. It's a "please be nice", not a "I want to enforce things"


I don't get this argument. "Here's a legal document we wrote that says that you're allowed to redistribute our code. But aside from it we'd like you to not redistribute it on platforms we don't like for whatever reason. But we won't spell it out explicitly in the document itself because reasons, instead we're guilt tripping everyone who does".

I respect developers right to put any restrictions on the code they share with the world. But I believe it should either be explicit or not restricted at all. Either write the license that says exactly what you want or otherwise don’t shame people into the desired behavior.

Edit: one could even add a more generic statement to the license stating that it’s forbidden to share the code on any platform that would use it to train their AIs per their ToS, so you don’t need to single out GitHub and potentially others in the future.


You don't get ethics and morality not bound by law?


Ethics is subjective, the law less so. Most people don’t find GitHub to be unethical so the author will have a hard time convincing people without using the license terms.


Well, that's the exact reason why they wrote this page: to explain to people why they think GitHub is unethical, and maybe convince them. It's the same as calls to boycott various other companies: they haven't necessarily done anything illegal, but if you convince enough people not to use their products/services anymore, you might make an impact...


The author really doesn't have to convince anyone, "please don't do $X" is more than enough to state their wish for you not to do $X.


i think it is quite scary that so many in here would just not honor a simple "please don't do it."

it is a wish. if someone says "please don't wear shoes in my home" i hope you would honor their very simple and understandable personal wish without setting up a contract for it?

i mean, just be a bit more human, please.


My aunt once told me "Please don't think negatively of religious people."

Stallman would prefer I not use any closed source software to read his blog, including OS, drivers, web browser, etc.

People routinely ignore unreasonable requests. Asking me to not wear shoes in your home is reasonable. Asking me not to give a copy to Joe after telling me I can give a copy to whomever I want is unreasonable.


How is it unreasonable to want you to not host someone's code on one explicit other platform?

We already established no one is forcing you and if you don't respect the author you don't get to be respected for your decision to ignore them and will earn snarky remarks. (and rightfully so, in my opinion)


> Asking me not to give a copy to Joe after telling me I can give a copy to whomever I want is unreasonable.

This is a repetition of your claim, not an argument for it. Counterpoint: It's entirely reasonable to ask you not to give a copy to Joe.


I see it more like an artist telling me “please hang this picture in this orientation” but I prefer it differently and don’t see any reason why me hanging it the way I want in my home affects the author. My copy on GitHub doesn’t detract from their copy.


The author might also be using this as a stop-gap until a FOSS license comes out with similar terms, but doesn't want to make the current license nonfree because that has different complications.


This is neither ethics nor morality, it's just someone's desire. Which is fine, people should have desires towards their society.


There is no law that says you have to say "Good morning", and yet parents teach their children to do so.

How is this case different?


A licensor cannot predict the future. When the GPL was written decades ago, nobody predicted that BigTech would start using it on their servers to offer it as "services", and claim that they didn't need to distribute the source code of the customisations they made because they were (technically) not distributing the software itself but only running it on their servers. Anyone who understood the intent and philosophy of the GPL license understood this as a bogus and unethical argument. But it was believed to be legally tenable (1). So the AGPL license was created to counter this move and preserve the original philosophy of the GPL that users of a software should have access to the source code.

(1) Though I don't know if this has actually been tested in court. Courts in India have more freedom to broadly interpret social contracts like the GPL, unlike the US courts, and a positive outcome in favour of upholding the license even in such cases could be possible.


> Anyone who understood the intent and philosophy of the GPL license understood this as a bogus and unethical argument.

I disagree here. The idea, intent, philosophy is one (crucial) thing, the resulting practical artefact (here the license) is another. It works exactly as it was designed to work.

People/companies modifying GPL software for their own use (internal or external) without redistributing the software itself (so without a requirement to redistribute the code) existed before SaaS took off; at the time, the small scale of this made it a bargain that was "interesting" only depending on one's capacity/hubris to maintain an internal fork on their own.

*aaS hugely tipped the scale and amplified the side effects, but the mechanics are the same.

And yes, that may not have been the original intent, and the AGPL is as valid a license as a reaction to provide a new tool more in line with the original intent, but that doesn't make the use of the existing GPL all within what it actually enables anyone to, invalid or unethical.

(but maybe only in a specific perspective of the framework of the original intent)


There's a ton of stuff people do because they were asked, not because they're legally required.


When you decide a license, you can't know what currently existing or future platforms will some day start to violate an aspect of the license or of copyright itself. Does it make sense to add a retroactive clause, then?


I don’t have any ethical dilemma by posting some open source code on the site I prefer.


It would be greatly improved by some text revisions: "my code", "We", on a generic page without clear authorship or object, on a dedicated subdomain, on a hosting domain, without a signature or contact or any kind of option for a webring.

It took me some time to get that this is a generic call (to be followed and reused), still with no clear ownership, rather than a specific claim to a specific code/project (or is it?).


Ethically, you'd think Copilot aligns with the ethos of open source software. It isn't far off that multiple open source Copilot equivalents get as good as the commercial one. And we are all here for it!


>> Is this a legal document?

>> No, it isn’t.

My limited understanding of international copyright law is that copyright disclaimers/licenses per the Berne Convention are documents stating the author's wishes, no different than this one. When a judge is shown this page where the author states his explicit wish to not have his code uploaded on GitHub, in direct conflict with his wishes stated in the LICENSE.txt file, I don't see how it will hold up in court as free software code.


> If you don't want your code hosted in one place, say so in your license.

Also don't forget to go back in time to let your former self know which hosting providers will violate your license.


There's an argument to be made that a neural net learning from your proprietary "source available" code isn't violating copyright. It's not an opinion I would necessarily trust a judge to adopt (they might, or might not) and hinge my business on, but it's an opinion one can have, so I don't know that github violated any licenses here as afaik there weren't any at the time which specifically stated that <insert definition which can tell what-we-call-AI training and human training apart> is not permitted. At least not until you successfully sue them for it in some jurisdiction.


There have been quite a few cases of the neural nets spitting out the code they've been trained on verbatim, including comments IIRC. They're not just "learning" (if they're "learning" at all).


Well, it's a bit of a unique situation in that they don't directly care about the code being on github, but about the secondary effect that it being on github has: Microsoft violating the code license terms (which are in the license)


I would question why the "AI" community is being left to their own anarchist devices when they are clearly the core of the problem, while we instead argue about licensing pedantry and semantics.


I mean, MS is getting sued over Copilot: https://news.ycombinator.com/item?id=33485544


nice.


It was always pretty easy to violate the GPL and get away with it. Repercussions only happen if you successfully get sued.


I don't think you got the sentiment.

It's pretty clear in the article, I hesitate to just cut and paste it here, but the idea is that Github is not respecting some terms of some licenses, so if you upload another author's code to github, you are exposing the author's rights to github's depredations.


> people can and should do whatever you permit them to in your license

Github has been consistently arguing that they're not bound by the distribution license when training their Copilot model, so I'm not sure what difference that would make.


> I get the sentiment, but people can and should do whatever you permit them to in your license. If you don't want your code hosted in one place, say so in your license.

I mean, even doing that will not protect anybody's code; Microsoft doesn't care, and they might also be scanning gitlab or bitbucket and doing model training on these. The only way to protect your code at that point is to stay closed source. It's as simple as that.

It's no different from all these models trained on content without permission, whether it be articles or photos... until all these corporations are sued or copyright laws change to adapt to ML, they'll train their models on content regardless of its license, since they are getting away with it.


Restaurant owners also didn't care about remunerating composers of the melodies they were playing. Then copyright laws happened.


> people can and should do whatever you permit them to in your license

I agree with 'can', but I find 'should' a weird choice of words here.

I don't think we are better off by phrasing hyper-specific licenses (or laws).


I think this is more likely to actually work. People don't read licenses and there is no CI that automatically enforces them. They might notice this badge though.


On the other hand, it requires only one determined person out of hundreds of thousands who, for whatever reason (trolling, principled open source maximalist, AI researcher, code archiver, etc), decides to mirror all code marked with this badge to GitHub.


Everywhere I've worked in the last few years has license checking as part of CI pipelines. Our builds fail if GPL'd code is detected in some .jar or whatever. (fossology and x-ray both provide this)
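For illustration only, the gate can be as simple as the following Python sketch; it is not fossology or x-ray, just a toy that fails a build when any installed dependency declares a GPL-family license, and the metadata fields it checks are a heuristic:

    import sys
    from importlib.metadata import distributions

    FORBIDDEN = ("GPL",)  # crude substring match; a real tool uses SPDX identifiers

    def main() -> int:
        offenders = []
        for dist in distributions():
            fields = [dist.metadata.get("License", "") or ""]
            fields += dist.metadata.get_all("Classifier") or []
            if any(marker in field for field in fields for marker in FORBIDDEN):
                offenders.append(dist.metadata.get("Name", "unknown"))
        if offenders:
            print("License check failed for:", ", ".join(sorted(offenders)))
            return 1  # non-zero exit fails the CI job
        return 0

    if __name__ == "__main__":
        sys.exit(main())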


Due to their TOS I believe it's illegal to upload any code you don't have full rights for to GitHub (if it wasn't already uploaded by someone who has full rights for it, which isn't the same as it already being on github).

This is even true for e.g. MIT licenses as while they are very permissive they still require a form of attributions which Copilot doesn't provide.

Also, for anyone arguing that such ML models "learn": at least when they are not exactly right-sized, or sized below that, they are basically guaranteed to verbatim encode partial copies of code; it's a fundamental consequence of their design. And while this encoding is "encoded" in some way, it's not transformative under the definition in copyright law AFAIK. I.e. any big model is guaranteed to commit hard-to-trace copyright infringement.


You have the full rights to upload any open source code. The entire point of an open source license is to give others the rights to do what they want with it. Unless the license has a specific exclusion for github or github like services, you definitely have that right.

> This is even true for e.g. MIT licenses as while they are very permissive they still require a form of attributions which Copilot doesn't provide.

those two things have nothing to do with each other. MIT gives me the license to upload. Copilot NOT giving me the license is irrelevant because the generated code isn't distributed under an open source license and thus has no relevance to the discussion of MIT or any other Open Source license.

additionally, requiring attribution and having the right to upload it are separate things. A license MAY require attribution and it may not. Rights can be granted without such a thing. See Public Domain, and CC0 for examples.


no, you agree to give certain rights to GitHub when you upload. Rights you do not have with most open source licenses because they do require attribution.

> generated code isn't distributed under under an opensource license

you still have to comply with licenses, and if Copilot spits out code which contains nearly verbatim code that was MIT licensed, it is not Copilot which needs to grant you the license but the original code owner (or you need to at least properly attribute it, though in the case of e.g. GPL that is not enough)

but through the act of emitting code via Copilot, GitHub does distribute code (without proper attribution), and they do so by getting rights from the uploader to be able to do so. Except that most people uploading do not have such rights for anything which isn't their code.


Most open source licenses do not give you the right to do whatever you want. Especially copyleft licenses.


How has it come to that? The tech industry has declared in turn that privacy is dead, that labor contracts are dead, that tax obligations are dead and now that copyright is dead. Why is the most promising tool towards a better society appropriated in such a destructive manner?

People must separate their fascination with tech as such, from the predominant tech business models that are basically - you can't put lipstick on a pig - parasitic.


It pays far too much and fat wads of cash are in everyone's mind as these debates surface. "Better for everyone", "human equality", "UBI", yada yada. Debaters then like to name-drop "No true scotsman" and "purity spiralling" for either side. It's just that tech marketing claims it's a small group of technical wizards, while in reality it's the largest novel corporate interest of the last fifty years and it hasn't been about the little guy for decades. The industry is practically vampiric on naive hopefuls. Well, at least some of them get paid.


Copyright is dead only for the little guy. The rich and powerful have their money printing machines copyrighted ad infinitum.


What is puzzling is that it's not just the little guy being undermined now. The discussion here is about open source code, but take any domain that makes expert knowledge available online. Say medical or financial advice. We are talking about powerful sectors, if not cartels themselves.


the problem with this approach is that humans suck and this is an invitation for trolls to upload your code to github. i don't even know how you're supposed to solve this.

you could proactively scan github for your code and try to get them to purge it if you find it i suppose, if that would remove your code from copilot. but even that is not a great solution because you would need to prove you're the actual author, and github would probably need to be involved in building a mechanism to do so, but they don't give a shit.

i think the reality is the LLMs have eroded copyright protections and trying to fight it isn't likely to pay off
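
One concrete way to act on the "proactively scan GitHub for your code" idea above is GitHub's code search REST API. The sketch below assumes the /search/code endpoint and its usual JSON response fields; the token, the query string, and the "distinctive snippet" are invented placeholders, and scoping and rate limits may require adjusting the query.

    import requests

    GITHUB_TOKEN = "ghp_your_token_here"  # hypothetical personal access token
    DISTINCTIVE_SNIPPET = '"frobnicate_the_widget_cache("'  # a line unlikely to appear elsewhere

    resp = requests.get(
        "https://api.github.com/search/code",
        params={"q": DISTINCTIVE_SNIPPET},
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {GITHUB_TOKEN}",
        },
        timeout=30,
    )
    resp.raise_for_status()

    for item in resp.json().get("items", []):
        # Each hit names the repository and file where the snippet was found,
        # which is the starting point for a removal request.
        print(item["repository"]["full_name"], item["html_url"])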


I think fighting this is very worthwhile. For example, I think the fight is the reason why The Stack, dataset produced from GitHub, has a form to request to remove code.

https://www.bigcode-project.org/docs/about/the-stack/

I am hopeful that we can make the removal request form the standard industry practice.


do you know if the stack anticipated people who will claim ownership of and request removal of repositories that they do not own? it seems like complying would require github to expend a lot of effort to combat abuse.


The Stack's removal request form is currently processed manually.


oh sorry i misunderstood, i thought this was a project designed to request removal of code from github/copilot


It's very simple. If it's open source and the license allows re-publishing code (with copyright information intact), people can republish the code. You don't get to discriminate, cherry-pick, etc. Depending on the license, there may be lots of rights and requirements. Especially with GPL style licenses. Especially AGPLv3 seems to impose a lot of restrictions. Which is a reason for that license to not be popular with corporations. But even that license allows republishing code as long as you preserve license and copyright information.

This generally is the whole point of open sourcing code: to allow others to republish with or without modifications. If you don't want that, don't open source your code. Github is a perfectly valid and legal place to publish code. Lots of people have done so.

The jury is still out on whether using open source code as training data is fair use or constitutes a copyright violation. But the safest assumption at this point is that it is a combination of fair use and hard to prove that any violation happened to begin with (i.e. good luck stamping your feet in anger in a court room).

Which will no doubt make for some fun court cases but will also take many, many years to come to any kind of conclusion. I would recommend not getting your hopes up on that. Historically, copyright law has not been updated a lot to deal with any kind of technical change. And that's just inside the US. The world is bigger than that of course. A safe assumption is that judges will seek to interpret any new cases under centuries of existing interpretations. Because that's what they do. Any changes to the law would have to go through politicians. And then the interpretation of that relative to other laws is again up to judges. Which is why we have such wonderful things as the DMCA and a few other attempts to close loopholes in copyright law.

Of course time is not your friend here. By the time this rolls through the courts some years down the line, existing practice will have evolved to be hopelessly dependent on language models for just about anything. Including interpreting the law (e.g. ChatGPT seems to be acing exams) and doing all sorts of complicated things in science, engineering and technology. So, any legal outcome that says this is somehow illegal is going to be a combination of unpopular, economically damaging, and therefore unlikely. Think big corporations ki


> Historically, copyright law has not been updated a lot to deal with any kind of technical change.

Ok, not a lot, but it has been updated. Chip masks are a good example: https://en.wikipedia.org/wiki/Semiconductor_Chip_Protection_.... If language models get as important as chip masks, I don't see why a similar update wouldn't be made.


That's a big if. Also, this is nearly forty years ago. With chip designs, there was a clear cut case of people taking each other's designs and some of the parties were billion $ chip companies being very grumpy about other companies taking their heavily patented solutions and copying those. Which translates into a lot of political pressure and lobby power and some political action. I don't see that happening for language models.

Most of the people complaining seem to be a narrow subset of open source developers that like open source but not people doing anything for profit with their source code. I.e. not your typical multi billion dollar companies that are worried about their IP getting stolen. So, not a lot of lobbying power or ability to get organized.


>It's very simple. If it's open source and the license allows re-publishing code (with copyright information intact)...

If it is that simple, then clearly Copilot needs to include the licence of the code when it republishes it. Changing GPL'ed code doesn't make it not-GPL code at any point. If it started out as GPL it is still GPL later down the line when a user sees it in Copilot.


I don't understand why people freak out about their code being used by others when they publicly release it. Is it lack of attribution? Does attribution really do anything if you're not already famous? If you don't want your code known, don't make it open-source.


I think the main complaint behind it is that a massive corporation is using code and blatantly ignoring the software licenses.

It's one thing when individual devs or small teams don't respect a software license, ideally everyone would understand and follow them but that's not realistic. When Microsoft damn well understands the licenses and ignores them to train an AI and sell a product, that's pretty damn egregious.


Microsoft doesn't "ignore" the license, there is a disagreement on what the license and copyright actually mandates. Those are quite different things.

Given the current state of the law and how licenses are written, it's impossible to declare that "yes, this is absolutely illegal". People may not like it, and (arguably) it may go against the spirit, but that is also a very different thing. Right now the situation is just ambiguous and unclear.

All of this absolutely matters, because the solution to "Microsoft is acting illegally" is a lawsuit (which, I believe, is already underway) while the solution to "the license and law is unclear" is different licences and/or laws. The solutions are completely different.


Software licenses have always been a grey area. At best this argument is that it's OK for Microsoft (and others) to abuse licensed content created by others because proving it was technically done illegally is difficult.

When Copilot inserts a code snippet that matches perfectly with one of the samples it was trained on, and that sample had a copyleft license, how is that not a license violation? Or is the main argument that copyleft licenses aren't enforceable in general?
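
To make "matches perfectly" concrete, here is a minimal sketch of one way to test it: compare a generated snippet against a known sample after normalising whitespace, so trivial reformatting doesn't hide a verbatim copy. The two snippets are invented placeholders.

    def normalise(code: str) -> str:
        # Collapse runs of whitespace so indentation and line breaks don't matter.
        return " ".join(code.split())

    def is_verbatim_copy(generated: str, sample: str) -> bool:
        return normalise(generated) == normalise(sample)

    licensed_sample = """
    def clamp(value, lo, hi):
        return max(lo, min(value, hi))
    """

    suggestion = "def clamp(value, lo, hi):\n    return max(lo, min(value, hi))"

    print(is_verbatim_copy(suggestion, licensed_sample))  # True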


> Given the current state of the law and how licenses are written it's impossible declare that "yes, this is absolutely illegal."

This is a reason for Microsoft (which you have given to them) that explains why they're ignoring the license, and why you think it's fine. It's a common refrain: "Everything is so complicated." It doesn't change the fact that they're ignoring the license.


The argument boils down to "we think it's fair use". This has been explained quite a few times. Will that hold up in court? I don't know. Should that hold up in court? Each can have their own opinion on that. But it's definitely been explained, as well as discussed at great length many times before on HN.


The issue I have here is that if these kind of criticisms start to take hold, it will perfectly pave a path for these same corporations to have more power handed to them. If more laws get passed requiring stringent restrictions on training data, tests to prove that copyrighted works cannot be produced by your model etc, this makes open source models which could potentially compete with these corporate models much more risky and expensive to develop.


Well I wouldn't actually mind it being more difficult or impossible for these LLMs to be developed at all, but that's a whole different discussion!


It's ironic that the movie studios want license agreements for content they own to cover distinct uses (streaming vs DVD's for example), yet they expect actors to agree to allow their voice, likeness, mannerisms, etc. to be usable by an AI for future projects (only the studio's own projects of course!).

If I own "the thing", I want you to pay for each new, distinct kind of use for it. But if it's your thing I want to use, I want you to have a permissive license.


The ship has sailed and the cat is out of the bag. By the time any of this shakes out everyone will be using copilot and the lawyers will have ground any argument into dust. Good luck putting it back in the bag.

After you have cleaned everything off of GitHub and separated yourself from the ecosystem it (AI bots) will be everywhere and your code gobbled up again.

The only thing that is worth a fuck is a working product. Not your boilerplate or fancy snippets that you want to claim ownership over so as to stop the evil Microsoft from benefitting the community.

Are we devs or luddites?


I think it's very common for developers to overestimate the value of their own code. You see this in the video game community, where people are afraid to open-source their Minecraft plugin because they think it'll be "stolen". They choose the future where nobody builds on their plugin (and no amount of attribution will ever feel like enough, so there's some level of "theft" felt by the original author).

Most open-source projects are not valuable, and even if they are popular, they could be easily replaced in a few days by another programmer. And usually the license doesn't matter, because nobody forks the project anyway, all development is done in a central repository and the project owner signs off on all of it


Thanks to GitHub code search, I found at least one project with this badge that someone decided to upload to GitHub anyway.


GPL and AGPL allows you to add more clauses to the license if they don't clash with any existing ones. The FSF should certainly share some guidelines on how we can include and extend the license through an additional "anti-ML / anti-AI" clause.


There's no clause you can add to A/GPL that will help with this. You can't forbid use cases because that would clash with the license, and Github already blatantly ignores the attribution clauses, so there's nothing you can do there either.


GitHub’s EULA gives GitHub permission to train Copilot on public code you host on GitHub regardless of the license you have chosen for that code.

Even without this, in terms of copyright, since Copilot doesn’t do what your public code does, and it only uses your code to train, it is a transformative use, and would be fair use. It’s possible that a court case will find otherwise, but I think that’s unlikely. The only case I think it will become disallowed, is if Congress passes a law about it.

If Congress does pass such a law, GitHub’s market power in this domain only goes up, since the EULA gives it the covenants.


An EULA doesn't trump copyright laws and spitting out code it was trained on is clearly not transformative.


Spitting out lines or paragraphs from a repo is most likely fair use.

Unless you can reproduce a substantial portion of a repo, I think it’s going to be an uphill battle to argue it isn’t fair use. Though I suspect Copilot’s suppression feature will make doing so impossible.


Assuming the reports of it producing the fast inverse square root function from the Quake 3 engine, comments included, were true, spitting out the whole function doesn't look OK to me.

Either way, the whole copilot thing smells of 'the issue of copyright infringement is more copyright infringement' to me.


Oracle would like to have a word..


But not every author will have agreed to the EULA, if the project includes code by people not in GitHub. E.g. if there is a GitHub mirror of a project that is not hosted by the author, if a project received a patch via email instead of a PR, etc.


That is a very good point, and perhaps could be used as a starting point for a license clause to restrict hosting in places whose EULA doesn't respect the true intent of the license.


I think it is ethical and reasonable to include clauses in the A/GPL restricting where you can host the source code if the hosting service has opensource hostile terms in their ToS. US case law already allows this by treating such license as a "new" license different to the GPL.

(And I don't see why we cannot add a clause saying that the source code is only meant for human developers and use of the source code in any machine learning system or to train any AI systems is prohibited without explicit case-by-case permission).


It may be ethical and reasonable, but that would make the license incompatible with A/GPL and a non-free license according to FSF.


Has FSF actually said this? I don't see how it makes it a non-free license - the original intent behind the xGPL is to ensure that a user of the software also has the access to the source code, along with the knowledge of who created it (attribution). This is to protect our right to repair. So if any part of your source code is used by someone else, even by an AI system, to generate a software, it should also inherit the same viral property of the license - be licensed under GPL with attribution. If it does not, it is similar to a non-free software using your GPL code in the project, which is already prohibited by the license.


well, the FSF certainly is looking at it, but it is a new area of copyright that is unresolved and it's not clear what to do about it, or whether it actually infringes copyrights, etc.


Do you happen to have a link to where the FSF said they're working on it? I would be interested to know what legal status they speculate this might have, as in, reading a more informed opinion than random comments here that mostly just assertively assume one way or another.


This is misguided. What you want is adding a special clause to your license that disallows usage for training LLMs. Whether the code is on GitHub or not, it’ll be used to train models if it’s publicly available and the license allows it.


That would make the code not Open Source.

What many people actually want is for LLMs to respect Open Source licenses, and propagate those licenses to the derived works they create.


> That would make the code not Open Source.

So? Maybe making so much code open source without any restrictions was a mistake in the first place. I know that I don’t want trillion dollars megacorps benefiting from my free open source code in any way. That would include LLM training.


The contents of the license are irrelevant, because such training is not being done under that license at all (or else it would already be violating it, failing attribution at the least—and it seems likely to be fundamentally impossible to comply, just like a human could not comply if human learning was subject to such licenses). It’s depending upon copyright restrictions not applying to it (under “fair use” doctrine).


Ah yes. More "Free" as in "Do As You're Told" from people who call their source more open than anyone else's. Coming up on a decade later, and if anything we have more to learn: https://marktarver.com/free-as-in-do-as-your-told.html


I have a simple question: why does AI get unlimited training data, but human beings can't have a universal library, without limits?


The people with the money run the AI so they get to selectively enforce the rules.


I think of large models, ones such as CoPilot, as lossy compression with content addressable retrieval. If you type the first few parts of some content it has stored then it will retrieve the rest for you.

The blocks retrieved are very small and many of them occur frequently. “if” followed by “(“ for example — hardly worthy of copyright, but we also know that they were literally taken from copyrighted material.

(I don’t think the model starts out with any existing knowledge of syntax / grammar of, say, Python?)

Even if some of that material was public domain, a lot of it wasn’t and at best requires attribution; at worst, full licensing conditions.

To put it another way: it doesn’t matter how many vegetable ingredients they throw into the sausage or how elaborate the sausage making machine is: if they put pork in, the stuff that comes out ain’t vegetarian.
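
For what it's worth, the "content addressable retrieval" analogy can be made concrete with a toy lookup table: the first few tokens act as the address, and the stored continuation comes back. This illustrates only the analogy (a real model has no such table; the snippets are invented).

    TRAINING_SNIPPETS = [
        "for (int i = 0; i < n; i++) { sum += a[i]; }",
        "if (ptr == NULL) { return -1; }",
    ]

    PREFIX_LEN = 4  # number of leading tokens used as the "address"

    index = {}
    for snippet in TRAINING_SNIPPETS:
        tokens = snippet.split()
        index[tuple(tokens[:PREFIX_LEN])] = tokens[PREFIX_LEN:]

    def complete(prefix: str) -> str:
        """Given the first few tokens, 'retrieve' the memorised continuation."""
        key = tuple(prefix.split())[:PREFIX_LEN]
        return " ".join(index.get(key, []))

    print(complete("for (int i ="))  # -> the rest of the stored loop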


> But then, Copilot will be able to analyze the code and violates the license terms, which isn’t.

I'm not sure about this. In most cases the result would be similar to a human reading a lot of open source and later, when writing, using patterns that they'd learned. It's only in the edge cases where there's clear 'plagiarism' on a niche prompt that it would be problematic. A more direct solution isn't to take everything off GitHub, but rather to not allow Copilot to do near-literal copy/paste.

If we moved open source to Bitbucket, there's no guarantee the same thing wouldn't happen there as with Copilot. Attack the problem directly.

A way to think of this banner is that of signing a publicly visible petition to make Copilot behave as humans abiding to licenses do.
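
As a sketch of what "not allowing near-literal copy/paste" could look like (this is not how Copilot's actual duplication filter works, and the threshold and shingle size are arbitrary assumptions): suppress a suggestion whose token 5-grams overlap too heavily with a known public snippet.

    from typing import List, Optional

    def shingles(code: str, n: int = 5) -> set:
        tokens = code.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def too_similar(suggestion: str, known: str, threshold: float = 0.6) -> bool:
        a, b = shingles(suggestion), shingles(known)
        if not a or not b:
            return False
        jaccard = len(a & b) / len(a | b)
        return jaccard >= threshold

    def filter_suggestion(suggestion: str, public_corpus: List[str]) -> Optional[str]:
        """Return the suggestion only if it is not a near-literal copy."""
        if any(too_similar(suggestion, snippet) for snippet in public_corpus):
            return None  # suppress, or attach attribution instead of suppressing
        return suggestion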


> Even if a project is not hosted on GitHub, other people have the legal right (depending on the license) to redistribute the source code. It means that they have the right to share the code of others on GitHub, as long as they respect the terms of license. This is totally legal. But then, Copilot will be able to analyze the code and violates the license terms, which isn’t.

While encouraging people to not distribute code via Github may mitigate the issue some, the actual issue is how Github has mass-automated the process of violating open source licenses. Github should pay a fine for every suggestion Copilot produces that violates a software license, plain and simple. Don't blame the people that unknowingly upload code to the training dataset.


The business model for disruptive "innovation":

Weasel wording to evade legislation. It's not an unlicensed taxi, it's ride sharing. It's not an illegal hotel, it's couch surfing. It's not code licence infringement because it's learning.

It's lawyers finding loopholes for finance to avoid expenses that gives them an edge over the poor suckers who play by the rules. The tech is just a tool to this purpose.

And most people forget TANSTAAFL. The costs for the cars and their infrastructure, the load tourists put on a place, and the effort for writing the code are still there. The "innovators" just found a way to make somebody else pay for what should be their cost center.


Why is HN so defensive against Copilot but has a completely different opinion about MidJourney / StableDiffusion? Both are generative tools; when they generate a piece verbatim, it only means overfitting / over-training on that particular example.

The tone around one of these tools is hypocritical. When it comes to digital art, the general sentiment is that it's inevitable and artists need to up their game. This sentiment is not being repeated for code generators.


Because digital art is trained on the end result. The equivalent would be learning to write software by looking at compiled binaries.

Learning from source code is like learning from Photoshop files or Logic projects. Artists usually do not share them. In music, not even the labels get anything other than the final mixdown (the music equivalent of the binary); the stems or projects remain with the studio - just like the source code in software projects tends to remain with the studios.

In software, we started showing and sharing source code under the assumption that it makes other humans making software better. People still rarely do this in music, as holding up industry secrets is deemed more important than fostering a community of personal growth.

We don't open source our code so an AI can grow. We do it so that humans can grow. There's no need to publish our code to distribute our final binary. This breaks the common understanding of why it's a good idea to share your source code, and as a result, people might become more protective of their code again.

In many ways, your comment is similar to all the comments saying "but you open-sourced it, surely you must know that people will then put it in their commercial code?"


> We don't open source our code so an AI can grow. We do it so that humans can grow.

I open source my code so that programmers and users can grow. I don't care if that programmer or user is a meatbag or a machine (or both! or neither!).

What I do care about is that the license terms of said code are respected. The vast majority of my code may be under something permissive like MIT or (lately) ISC, but I do make giving credit where credit's due a condition of using the code I've written, for good reason.

That's where tools like Copilot make a misstep: by ignoring the conditions I've placed on the use of my intellectual property. Plagiarism is plagiarism, regardless if it's a human or AI or dolphin or Martian or whatever doing it.

That's also where tools like Copilot differ from e.g. StableDiffusion. AI-generated art doesn't (usually) involve copying and pasting snippets of existing artwork into a new work the way Copilot has been demonstrated to do on multiple occasions.

(My other "problem" is that I can guarantee Microsoft will assert double-standards when it comes to Copilot infringing on e.g. the GPL v. Copilot infringing on Microsoft's own EULAs - and I really really really want to see that happen via someone tricking Copilot into ingesting the Windows source code and vomiting that into Copilot users' IDEs verbatim)


People share art because they want to help other humans grow too, not the AI.

Also, the code is the end product. Similarly, artists don't share their project files because that's not the point. This is like complaining that programmers don't share their vim macros.


> People share arts because they want to help other humans grow too, not the AI.

Hold on just a second here. How do you know what "people" "want"? Who put you in charge of speaking for them?

Yes, there have been some loud complaints, but given that millions of people are involved, the overwhelming majority of whom haven't expressed an opinion one way or the other, I think it's a bit premature for this kind of blanket statement.

Secondly, there's a difference between what people want and what people are legally entitled to. If you ask people if they want a million dollars, most of them will say "Sure!". That doesn't mean they're going to get it.

Existing copyright controls the making of copies. That's it. There's a fudge factor in there called "fair use" that controls whether or not something constitutes an infringing copy.

Whether AI training data falls within that or not is going to have to be decided in the courts or by some type of government action. It's clearly not an exact copy...the actual pixels in the original work aren't anywhere in the database. But is what is in there close enough to be considered a "derivative work"? I don't know, and neither do you.

Again, it's way too soon to be making blanket statements on the issue.


More important than an opinion of a court is the opinion of lawmakers. After all, copyright laws have a purpose and AI subverts that purpose. Lawmakers will have to decide whether the benefit of AI to society is greater than the benefit of having works published. We will see laws adapted accordingly.


> After all, copyright laws have a purpose and AI subverts that purpose.

While it may differ in other countries, in the United States the purpose of intellectual property laws, as expressed in the Constitution, is to "To promote the Progress of Science and useful Arts by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries".

Enriching the copyright holders (on the rare occasions that actually occurs) is a secondary consequence, not the prime purpose.

Does AI "subvert" "promoting the Progress of Science and useful Arts"? I don't think so. Quite the contrary... I think it advances the progress of science and the useful arts, if anything.

It's pretty well established that a description of a copyrighted work is not protected, even an extremely-detailed description (see, for example, the way that Phoenix assigned one team to write an extremely-detailed description of the IBM PC BIOS, then gave that description to a second clean-room team that hadn't seen any of the actual source code. The second team then produced a clone of the BIOS that could be sold without paying IBM anything).

The data stored in these models seems more like a "description" rather than a "copy" to me -- though, of course, there's no guessing what a court or legislature will decide.


> Does AI "subvert" "promoting the Progress of Science and useful Arts"?

Absolutely, it may. That was my point that this is exactly to be observed and decided.


> Does AI "subvert" "promoting the Progress of Science and useful Arts"? I don't think so. Quite the contrary... I think it advances the progress of science and the useful arts, if anything

It does subvert that purpose to the extent that it makes some people no longer willing to share their works. The entire purpose of copyright is to encourage the sharing of works.


Do you get access to the raw model to advance the science with? Or the watered-down version?


Okay, I shouldn’t have used “want”. But helping others and collaboration is not something alien in the art world. Artists publish their work almost for the same reasons programmers do:

1. To showcase their work

2. To get feedback

Also, I don’t think it’s fair that we have to wait for the majority response before we can form an opinion of whether or not something is good for them. Do we have to ask millions of people if they like it if their health benefits, social security or left thumb gets removed before we can say that they definitely won’t like it?

I think it is pretty safe to say that artists don’t like having their whole livelihood upended overnight. (Inb4 get better jerbs)


When you want to help other humans grow in music production in an open-source way, you have to share your studio project file. Try finding some of these... you'll find YouTube videos where people walk through aspects of their projects, but you won't find the actual studio project file.


Billie Eilish famously released her files for her first hit song and it includes a bunch of takes etc that they didn't use.

Apparently it's directly available in Logic (or was)

https://old.reddit.com/r/Logic_Studio/comments/gic2r2/logic_...


> We do it so that humans can grow.

I think you’re being a little too altruistic. When I publish code, it’s not about helping people grow. It’s so that others can see the code they are using and that if there is a bug or mistake, they can correct it. Hopefully they share that fix back, but it’s not required.


I think that people do it for a whole bunch of different reasons. Helping others is one of them, yours is another. But I remember that back at the start of code-sharing, the usual reason people gave for doing it was to help advance overall knowledge.


>> "but you open-sourced it, surely you must know that people will then put it in their commercial code?"

this is a general problem in open source. And with the obfuscation in an AI it is just unfair. If AI / Copilot referenced the source in a nice way, then I wouldn't mind.


Especially given the history of the last 30 years where major open source licenses carefully fought over just how much "commercial code" could use the works involved.


Since you mentioned music, can we draw any parallel with the recent Ed Sheeran case? You can't copyright a chord progression, or a rhythm. So can you claim license violation for code which is similar in how it arranges standard design patterns and well-known language idioms? Unless you can find a line-for-line match in a substantial portion of the code, I don't see how you can make a claim for license violation, whether a person or an AI wrote the code in question.


The problem is not "training using GPL", it's stripping the license and providing the resulting code to anyone, for any purpose.

Copilot can provide very specific implementations of problems with "in the style of $NAME" prompts. Consider a case where you put your life's work out as GPL licensed code, and someone can reproduce + adapt your highly optimized matrix multiplication code with a simple prompt, without the license. Even if it's not a "textbook license violation", you're lifting someone's GPL licensed function and landing it in your codebase without its license. If your codebase is not licensed under the same GPL version (or later, if the repo allows), it's both unethical and a license breach at the same time. Adaptation of the code doesn't matter.

The same is true for stricter, source-available licenses. Their source is public, but not open to be reused. What will happen if you include a function derived from a codebase with one of these strict licenses, with or without knowledge? You're again in a dangerous grey area from both a license and a legal perspective.

The issue we discuss is neither straightforward nor simple to navigate. I left GitHub because of this, and may tag my repositories with this badge.

Open source means nothing if copyleft is taken out of the picture, and licenses are simply ignored.


"The problem is not "training using GPL", it's stripping the license and providing the resulting code to anyone, for any purpose."

This.

Since Copilot arrived I thought that most open source developers would be fine with it if github simply even tried to acknowledge the original licences in any way.

I personally would be ok with even a very indirect aggregate group based thing that should be no burden at all for github. They make a big list with everyone's name in it and call it the copilot contributors, and provide some kind of page for it, then when copilot spits out code, it includes a link to that page, and/or a user includes that in the credits/authors for their project like any other credited source.

No excuses about how impractical it would be to cite 3 other authors for every line of output.

But they don't even do that tiny bit. They don't try and fall short, they don't even try. But they still take the goods. The goods are already free, and yet they still manage to steal them.
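
For illustration, the aggregate-attribution idea above could be as small as this sketch: attach a single link to a (hypothetical) "Copilot contributors" page to every emitted suggestion, so downstream users can carry it into their credits. The URL and function are invented.

    CONTRIBUTORS_PAGE = "https://example.com/copilot-contributors"  # hypothetical

    def with_aggregate_attribution(suggestion: str) -> str:
        """Prepend an aggregate attribution comment to a generated code suggestion."""
        return f"# Portions derived from public code; contributors: {CONTRIBUTORS_PAGE}\n{suggestion}"

    print(with_aggregate_attribution("def add(a, b):\n    return a + b"))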


> We do it so that humans can grow.

When was this decided? I don't remember any licences or foundational essays saying this.


> People still rarely do this in music, as holding up industry secrets is deemed more important than fostering a community of personal growth.

This is absolutely not the reason and it's surprising to read this opinion stated as a fact.

First of all, not revealing multi-tracks doesn't prevent personal growth. The human ear is pretty good at discerning individual elements of a mix. It's all in the final product. If you can't learn from it or can't hear something, either it's an unimportant part of the mix, your ear is not yet good enough to distill any meaningful information from the balances of the mix in question, or it's just not a good mix.

Secondly, multi-tracks are not the source code. Putting it in relevant terms, it's just the assets. What's important are processes and artistic decisions made while processing them and combining them in their final form.

Not revealing stems/multi-tracks is not a matter of gatekeeping, it's a safety measure to prevent unauthorized use of individual composition elements and creation of "alternative" mixes which break artistic integrity of the piece.

Music is art (at least a considerable portion of it), and an artist needs as much control over the final product as possible. Mixing is a part of it.


Gatekeeping isn't boolean, it falls on a spectrum. If your rationale about preventing un-'authorized' use is in fact true, the process must effectively be keeping some notional gate at least partly closed, even if it's only closed to people you (or maybe even the original artist) consider to be acting 'unscrupulously.'

If musicians' priorities were for the world to share in their musical indulgence not just in listening but also in like-minded creation, they would release as much of their mixing tools/backgrounds/stems/etc as possible. Since the capitalist framework is a foregone conclusion, they need to keep some secrets in their sauce to produce scarcity to justify a price for their labor so they can eat... Because the farmers and butchers all keep gates of their own kind, too :)


> If musicians' priorities were for the world to share in their musical indulgence not just in listening but also in like-minded creation, they would release as much of their mixing tools/backgrounds/stems/etc as possible.

They do. For a price. When you know how to ask.

The thing is, if you really want to « like-minded create », you do not need, not even want their stems (you can find most of them anyway online, if only for learning).

If your thing is not « consumption » but creation, your own voice/process/way matters more (and is more fun) than starting from the studio separate tracks.

As for the music _business_ side of things, attention (hence scarcity management) and the availability/capacity to provide are the two main levers you have if you want to live from it. As with any other business, actually.


I don't really get the point you're arguing for.

I disputed a specific mistaken proposition, that "holding up industry secrets is deemed more important than fostering a community of personal growth". It's a mistake on several levels: there's no dichotomy as it's presented in this sentence, and the reasons for not providing multi-tracks and project files are different.

> If musicians' priorities were for the world to share in their musical indulgence not just in listening but also in like-minded creation

So? They have other priorities so this is not what they do. Teaching is an industry that's only an adjacent and derivative activity to the cultural sphere of arts. It's of no concern for the artist, unless they choose to capitalize on their skills/knowledge/experience.


I'm agreeing with you that there are many reasons why the music Industry doesn't share stems but also concluding that many among those could be rationally considered gatekeeping by various groups. I'm not disparaging the industry as a whole or the individuals who comprise it, but the reality is simply messy. Many signed artists I'm sure wish they could distribute their stems or more of their process, but are hindered/prevented from doing so by labels. Seems close to what gatekeeping means to me, but that's just like my opinion, man...


> The equivalent would be learning to write software by looking at compiled binaries.

Only if Copilot was issuing compiled binaries. In this case the end result is the code.


To be fair, artists share lyrics, since everything else can be easily scanned by ear. Negatives are frequently the only source. You do not need any negatives to recreate a photograph. Just like you do not need masters or sketches to recreate a song or a popular painting. Most modern photographs are not even produced by photographers - the camera would do all the work, your job is to pick one frame from the 9528 you took while your finger got stuck on a button. Moving code somewhere else or omitting the source would not bring back the dead ethics of a greed-driven modern society, and definitely cannot slow down reversing in any way.


Code is the end product. We store code in our repos, not binaries.


> The equivalent would be learning to write software by looking at compiled binaries.

Challenge accepted?

With the speed at which things are advancing....


Because HN doesn't have opinions on things. You're reading the opinions of individuals who frequently disagree with each other, and the groups of individuals commenting differ from post to post. Why does this need explaining so often?

FTR I personally think that training commercial ML algorithms on bulk data produced by people that haven't released it to the public domain or licensed it for such use is ethically dubious (regardless of the legality) whether the data consists of code, art, metrics or anything else.


HN has an average opinion. If 90% of comments on one article say capital punishment is bad and 90% of comments on another article say that paedophiles should be lynched then the overwhelming likelihood is that they are being hypocritical. The idea that it happened to be a completely different subset of HN's audience commenting on each case is statistically implausible.


> The idea that it happened to be a completely different subset of HN's audience commenting on each case is statistically implausible.

This doesn't seem obvious to me at all. It seems not only plausible that this could be the case, but unsurprising.

Almost nobody here reads every story in the feed. Any given story on HN is going to be read by a collection of people that differs from story to story. And most of those people won't make any comment whatsoever. It seems unsurprising to me that a story on, say, capital punishment could get a large number of people who are passionate about that topic, and a story on pedophilia could get a large number of people who are passionate about that topic, and the majority of commenters in both could hold opposite opinions without even one of them being hypocritical.


> Because HN doesn't have opinions on things.

HN is actually one of the most homogeneous public communities I've seen.

And judging by how few downmodded posts are around, self-censoring.


Don't confuse basic manners with homogeneity. Just because the rest of the Internet is a dumpster fire, doesn't mean people posting here have limited range of views.


HN sprouts weirdly localized hiveminds - every post is its own little hivemind, where a "correct opinion" is formed, and opposing opinions downvoted and/or ridiculed.

But then the next day a different post on the same topic could have an entirely opposite hivemind, often for no discernible reason.


> opposing opinions downvoted and/or ridiculed.

I'm not so sure this is a huge effect. I often express "incorrect opinions", and get some amount of downvoting or ridicule, but not usually to a significant degree. When I've had my comments dogpiled, I can see that it happened because of how I stated my opinion more than because of the opinion itself.


It often depends on the first comment which gets traction (>5 upvotes)


For evidence, the occasionally highly-upvoted Jacobin Magazine here at ground zero of popular capitalist discourse: https://news.ycombinator.com/from?site=jacobinmag.com / https://news.ycombinator.com/from?site=jacobin.com

Sometimes the same story will be flag killed one day, front page the next.


It's almost as if calling HN "homogeneous" is incorrect.


>And judging by how few downmodded posts are around, self censoring.

Look in a thread where China is mentioned and you'll see the real face of many HN'ers - if they aren't removed before you see it. Luckily the MOD'ing is pretty good.


I don’t understand. China is just a country. Real face? Usually I feel like the problem with HN is that every one assumes they are right or that everything is mostly subjective except the few things they view as objective. Then if you aren’t civil you are somehow wrong. Finally, if you’re espousing far left or far right opinions, you get a bad reception.

Unfortunately since most people aren’t experts on most things, this turns into a bunch of fugazi. People who are financially or academically successful in business or engineering or marketing think they understand politics too.

This is what happens with topics like China. I don’t care about China much. So I end up defending it because the negativity towards it by the Global North is staggeringly inconsistent.


"Everyone I know who's right agree with me!"


Okay. Anything of substance?


It was just a humorous way of agreeing with the point you were making. Sorry to bother you.


Interesting. How were you agreeing with me? This doesn’t make sense.


> Why HN is so defensive against copilot but have a completely different opinion about MidJourney / StableDiffusion?

I've seen this question posted on HN so many times in defence of Copilot but I have yet to encounter people who fit what you're describing: people in general raise a lot of issues with MidJourney/StableDiffusion/&c. too.

HN is particularly defensive when it comes to Copilot because Copilot deals with source code, which is central to the daily vocation of a large portion of HN's readership. HN focuses on Copilot because they're subject matter experts in what Copilot trains on.

Also - as mentioned in siblings - while criticisms of training material sourcing by various ML models is common, there are quite a few details that makes Copilot's sourcing that little bit worse.

> it's inevitable and artists need to up their game

If this is your answer it seems you didn't even bother to read the title (never mind the article). There are two issues with generative AI to discuss: input and output. Your answer addresses output - i.e. fear that what is produced may compete with humans. This article, and the vast majority of HN criticism of Copilot, exclusively addresses input: objection to copyrighted works being used to train models without author consent. These are two entirely separate discussions.


> Why HN is so defensive against copilot but have a completely different opinion about MidJourney / StableDiffusion?

Is it, though? My impression is that HN opinions on possible copyright issues are generally mixed for both, and mostly dismissive about the job replacement stuff for both.


Maybe because the main users of HN are software developers and their work is used to train Copilot while probably a minority is creating images used to train MidJourney / StableDiffusion. A question of whether one's own work is used for this purpose or whether one is only a beneficiary.


> Why HN is so defensive against copilot but have a completely different opinion about MidJourney / StableDiffusion

There's a mismatch between the producers of the content used to create the models and those using it. At the moment it's mostly software developers who are using the output of Copilot, whilst it's mostly non-artists who are using MidJourney/StableDiffusion.


It's NIMBYism for jobs. We need a new acronym for it.

My effortless attempt: OMIJIB(Only My Information Job Is Brilliant)? O-My-Jib


This is pretty brilliant after reading all the responses to why HN is so obviously anti-specific things more than others.


> This is pretty brilliant after reading all the responses to why HN is so obviously anti-specific things more than others.

I can’t handle hearing another person say “HN is a big place”. Sure. It’s also incredibly obvious how homogeneous it is and how HN essentially forces you to toe the line.


HN has never once forced me to toe any line.


Okay thanks. Anything of substance?


I made a reasonably substantial point. You asserted that "HN essentially forces you to toe the line" and I brought up a counterexample to demonstrate that your assertion might not be entirely correct.


The person who responded to you showed the obvious logic.

I’m a leftist in tech. HN is not the place where you can publicly behave as an assertive leftist otherwise would.


You can toe a line without toeing a line. That's the group involved in enforcing the line.


Because MS owns github and MS therefore WILL abuse whatever goes there. Not 'if' but 'when'.

Source: over 20 years of working with their products and seeing the way they treat everyone else


Because people are protective of their own ground, but open to conquering others' ground.

There are probably not as many artists here as there are coders. So the general opinion will be against all tools which harm the coders' ground, but more open to the benefits AI will bring for laymen, when they need not fear harm from it.


Because HN doesn't draw, obviously.


Because code is “my stuff” and Copilot changes our jobs, while generated art doesn't affect me and is cool.

There’s no real difference it is just wagon circling once a thing threatens somebody personally.

The tech community has been one of the big proponents of “data wants to be free” and “the entire idea of intellectual property is bs” for a long time and that’s suddenly reversed.


I dunno if HN has one particular opinion on _anything_. The echo chamber isn't quite that strong here. However, I'd posit that the main difference is that there are more software engineers on HN than digital artists.


HN is not a person


I’d say it’s because of personal feelings and that the technologies didn’t get trained on their data and because of that, disassociation makes those other technologies easier to use.


There is a funny question people ask in India: if it comes out of your mouth it must be blood, but the same from mine is just the red sauce I ate?


100% everyone wants to do away with everyone else.


While I sympathise with the remarks, I think these arguments are wrong. In any programming language there are only so many ways to write something. For instance, iterate over a list of items and do some processing on them. So when someone publishes their code under the GPL, suddenly millions of projects that come up with exactly the same code can be in violation? I get this is complex, but copyrighting code is akin to copyrighting maths formulas. Imagine if a company selling bottled water could also copyright water. That's very much what is happening.


It sucks that people will be out of work once AI can do the same work basically for free. For years I have always felt that with more tax and a sort of “freedom fund” more people will be able to make “free” stuff in public for the global good. The only way to get there is to stop paying for stuff such that we eventually get to a utopian future as seen in Star Trek. Although the intervening period would be chaotic. For example, we pay tolls on some roads but we don’t pay cash every time we want to walk down the street - that idea sounds crazy. Likewise, if we could support the work of creators with UBI then we could all enjoy free software, media, books, etc. I know I would keep creating, because I love it. But for now the middleman wants to get his share so it seems sensible that we must pay companies for their produce directly.

To be even more extreme, a _lot_ of people think if anything is public, it is fair game to consume and reuse. If you didn’t have to worry about getting paid, would you care about someone enjoying and using your work? Perhaps the opposite - that would be your motivation.


One relatively simple answer is to just add to the license a term that neither it nor its derivative works can be uploaded to GitHub or be used to train any AI. It will be widely breached because no one cares about the license, but it does at least give you some legal recourse if you choose to pursue it. You aren't required to just use GPL v2-3; you can amend the terms however you wish.


To avoid the proliferation of a million different licenses, it would be better to have a single or few "No GitHub" licenses.

Such a license would not be considered "open" by the OSI because it imposes onerous conditions.

Even if they were made, I wonder how much such licenses would be used. If the permissive people don't like copyleft because of the conditions it imposes, they must hate this. Likewise the copyleft folks want freedom enforced, not taken away.


Isn't e.g. ChatGPT trained on data all over the internet? That means even if you upload your code to GitLab or your own public Gitea instance, it will still be used to train AIs?!

I don't see the point of this because quite frankly - if you want to prevent AIs using your data, you already lost that battle in the moment you uploaded your Code to the publicly accessible internet.


Where does OpenAI state that they only train ChatGPT-3.5 or GPT-4 on code from GitHub? The model for GitHub Copilot X clearly has a (human) language understanding that you can't get from source code (or source code comments), so they are trained on much more data than GitHub has and there is no reason to believe OpenAI would limit themselves to that.


Please upload all open source code to everywhere that will accept it. Make everything indestructible. Make it as easy as possible to find an authentic copy. Stick every bit of code with license that allows it in every free and proprietary neural network.

Did you write the absolute best implementation of X? I want to see it everywhere. Everywhere. In every single place where X is needed or discussed. Where all fine X are sold. Don't you? Or do you genuinely want to narrow the people who see your impl by some significant percentage because you got frustrated with the ugly capitalism of a recent distribution mechanism?

If I invented Golden rice[0] I'd like to think I'd allow it to be sold at Panda Express and all the other evil capitalist chains, not just the local co-op whose business practices I prefer.

0. https://en.wikipedia.org/wiki/Golden_rice


If I make a helmet with tracking to help rescue miners after a collapse I will not want to see it used by Qatar to track slaves: https://news.ycombinator.com/item?id=33675370

This may seem an extreme comparison, but to one who deeply cares about Free Software as defined by the FSF, proprietary software is bad. That's the difference between people who prefer Apache/BSD/MIT licenses versus those who prefer the (A)GPL.


That’s like saying that pacifists who refuse to own guns don’t really think war is bad, because they won’t take up arms to implement peace.

Fighting fire with fire is one option. Fighting fire with water is another.


Copilot and other code-authoring LLMs are one of the biggest innovations in software engineering in recent years. I use it daily and can't imagine going back to work without it — adopting it in 2021 was the same shock of a productivity boost as when I learned vim.

Yes, I know that it can occasionally break a license by producing licensed code verbatim. But in my almost 2 years of using it daily I have never seen it happen first-hand, and I don't see how this licence infringement could actually do any significant damage to anyone — so while I acknowledge that this problem exists, I refuse to accept that it's as significant as people make it out to be.

For a long time, it seemed like technical progress in computing had stopped, and now that AIs and LLMs are finally bringing exciting new technology to life, it's very sad to see exactly the people that should be excited and inspired by it — software engineers — fighting against it.


All the people wanting to profit off other people's work for free are arguing in favor of Microsoft.

Crypto bros are now AI bros huh?


As far as I know, rights and licenses don't work with suggestions like "please". Either embed it in your COPYING/LICENSE files so the initial 'uploading to github' action would be illegal, or take legal action against the main offender (which, in this case, is GitHub).


Of course this isn't just a problem with copyleft licenses on github, but also with non-open code on github. Only there the problem may be less visible.

Ideally, github should check the license of the code it's using to feed copilot, and only use code with permissive licenses.


What is stopping Github from crawling whatever other Open source platform you choose?

I don't think this will fix anything.

Actually developers are the only ones that stand to lose here since now open source will be spread on multiple platforms, making it harder to find what you want


> What is stopping Github from crawling whatever other Open source platform you choose?

At the very least, they can't claim that you gave them special rights through the terms of service.

Also, rate limiting.


Don't share and license your code permissively then.

People want the benefits of publicly sharing stuff, but then they want to prohibit others from learning from what they share.

There are many options to keep things private. The downside is that you won't get the same exposure.


Copilot was not only trained on permissively licensed code. It’s trained on all public repos, even if the code is copyrighted (which is the default absent a more permissive license)


If the copyrighted code was uploaded to GitHub by the owner, there's no problem with this. When you upload code to GitHub, one of the rights that you grant to GitHub is the right to use your content for "improving the Service over time". See D.4. License Grant to Us in the GitHub Terms of Service. Once it is up there, you also grant other users certain rights, like viewing public repos and forking repos into their own copies. See D.5. License Grant to Other Users. Even with the most restrictive protections in place, using GitHub requires you to give up certain rights.

A question would be if creating and training Copilot is "improving the Service over time". I would suspect that it would be, though.

There are still some open questions around what happens when Copilot suggests code verbatim, but these are mostly for the users of Copilot. Although I would hope that GitHub is thinking about offering information to ensure that users understand the source of code they use, if it may be protected, and what licenses it may be offered under. There are still some interesting legal questions here, but I don't think that the training of Copilot is one of them.

A more interesting question would be what GitHub does if someone uploads someone else's copyright-protected code to GitHub and it is used for training Copilot before it is removed. If you don't own the copyright, you can't grant GitHub the rights needed to use that code for anything, including improving the service.


> A question would be if creating and training Copilot is "improving the Service over time". I would suspect that it would be, though.

Definitely an interesting case to be had, but I'd argue that it does not. They're using their customers' code to create an entirely new product that would not be possible without it, not just improving their ability to host a Git repo. Otherwise, what wouldn't count as "improving the Service over time"? Can they do anything with the code they host as long as it improves their service? What about selling bootleg copies of it and using the proceeds to upgrade their servers?


However D4 also explicitly says "This license does not grant GitHub the right to sell Your Content". One could argue that because Copilot is a commerical product it is in fact selling (a derivative of) user code, and thus the grant in D4 does not apply.


> but then they want to prohibit others from learning from what they share

The linked-to document explicitly DOES NOT prohibit others from learning what they share.

Quoting it: "If the project is under an open source license, it means that everyone can share a copy – even on GitHub – of the licensed material under certain conditions. A license restricting this right wouldn’t be open source anymore. However, since GitHub may not respect the terms of licensed code that is hosted on their servers, not uploading the code of others there is, in fact, an ethical choice."


Adultery isn't illegal, but I might fairly consider someone engaging in it without their partner's consent to be in the wrong.


"No Adultery" is typically a term of entering a relationship. We can liken that to code licensing. Cheating is explicitly established as being _against_ the ad-hoc contract of the relationship.

Conversely, open source licenses explicitly state that an end-user may further distribute that source code to anywhere they wish.


This person is adding their "No GitHub" term with this blog post.


While I understand the author's concern, it seems a bit naïve to assume that Microsoft would ONLY learn from code in GitHub. Geek blogs are spectacular places to ingest code from because they give a ton of context and explanations for what's going on. Way more than raw source code ever does. If you want an AI to understand what's happening in code via the English phrases you use to describe what you want then you _need_ to train it on code that has similar English phrases describing what it's doing.

Ingesting from public sources outside of GitHub will just become more necessary as they work to improve these things.


Honestly, I think we should explicitly forbid using the material as training data in our FOSS licenses. Unless the weights and the network model are made public, I don't want any of my code to contribute to such an AI.


Naomi Klein's recent essay about AI^[1] suggests that the real "hallucination" problem in AI is that its promoters are not seeing the real world clearly. Some of her points about the effect of AI rollout on human employment may be germane to the present discussion.

1. https://www.theguardian.com/commentisfree/2023/may/08/ai-mac...


You're preaching to the cult here. There is a marked lack of AI criticism and skepticism on this forum. There are a few who are trying to gaslight skeptics (yes, gaslight) into believing that current AI is close to being sentient, and/or that humans are not much smarter than it, and/or that we should consider and evaluate its output as if it were created by a human.

I don't know whether it's laughable or utterly terrifying to put such a powerful tool into the hands of reckless wannabe entrepreneurs.


It's hype for suckers, intentionally created by the megacorps (remember "Sparks of AGI"?). The real goal is to get away with theft for long enough that the regulators give up and accept their right to be above the law (aka "disruption"). And then there's the army of intentionally misguided YouTube creators who want to get views by showing how they created a website without knowing HTML.


I'm having a hard time buying that people are genuinely worried about their code being copied without proper attribution. While it is possible for Copilot to generate copyrighted code, this typically occurs only with intentional effort and only for a few lines of code. It's just not an actual issue.

And something tells me that even if Copilot were entirely prevented from doing that, they would still not be happy about Copilot using their code for training. The copyright issue is just a convenient pretext.


This seems like a very reasonable position to me, but they should add the restriction to their licensing rather than just nicely asking other devs to pretty-please not do this.

Although I don't use GitHub (for reasons unrelated to Copilot), public access to my code was eliminated when I took my websites down while I look for a way to deal with AI scraping. I'm eagerly watching what others do, hoping that someone will have a great idea of how to deal with this before I do.


This assumes that GitHub Copilot only gets its data from GitHub and that GitHub Copilot is unique among tools.

Remember the old saying: Anything published on the internet stays on the internet.

What prohibits someone from crawling code from other sites and building a GitHub Copilot equivalent?

Considering how ChatGPT-style bots are often trained on public websites, that is likely already true.


I think the concept of open source should simply extend to AI models. I'm fine with AI being trained on open source code, if the entire model is also released as open source. We need a GPL analog for AI training: a license that allows you to train models on the code, but only if those models are released as open source. An infectious AI training license.


The funniest outcome of all this would be a ruling that the entire copilot model is GPL.


Isn't GPL the toxic stuff that real projects avoid like the plague?


This will always be an issue as long as people can fork the code, so one might say we need a license that prevents a module from being used in ML training. Better yet, we need a commented line or something that would break the training pipeline if found in the training source, where removing it would violate the license.


Hmm a thought.

With neural networks it's impossible to actually describe what the network does, including proving that it has or has not used GPLed code for a certain input/output set.

One could argue that all code output is GPL with the associated restrictions unless said network has provably never been trained on GPLed code.


Seems like GitHub should train one LLM per license type. There'd be a GPL LLM that's trained on GPL code, an MIT LLM trained on MIT-licensed code, etc. Then Copilot users could select the LLM for a license appropriate for the work they're doing.
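
To make that concrete, here's a minimal sketch of the data-partitioning step only (purely hypothetical; not anything GitHub has described doing). It assumes repositories are already checked out under a local ./repos directory and each carries a LICENSE or COPYING file; the keyword matching is a naive stand-in for a real license classifier. Each resulting bucket would then be the training corpus for its own per-license model:

    # Hypothetical sketch: bucket local repo checkouts by declared license so that
    # each bucket could serve as the training corpus for a separate model.
    # Assumes every repository lives in its own subdirectory of ./repos and carries
    # a LICENSE/COPYING file; the keyword match below is a naive stand-in for a
    # real license classifier.
    from collections import defaultdict
    from pathlib import Path

    LICENSE_MARKERS = {
        "gpl": "gnu general public license",
        "mit": "mit license",
        "apache": "apache license",
    }

    def detect_license(repo_dir: Path) -> str:
        """Return a rough license label for one repository, or 'unknown'."""
        for name in ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING"):
            path = repo_dir / name
            if path.is_file():
                text = path.read_text(errors="ignore").lower()
                for label, marker in LICENSE_MARKERS.items():
                    if marker in text:
                        return label
        return "unknown"

    def bucket_repos(root: Path) -> dict[str, list[Path]]:
        """Group the immediate subdirectories of `root` by detected license."""
        buckets: dict[str, list[Path]] = defaultdict(list)
        for repo_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            buckets[detect_license(repo_dir)].append(repo_dir)
        return buckets

    if __name__ == "__main__":
        for label, repos in sorted(bucket_repos(Path("repos")).items()):
            print(f"{label}: {len(repos)} repos")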


That's an interesting PoV. I hadn't thought of it.

But I don't know how much success the author will have, in his endeavor.

The horses have fled. Closing the barn doors does nothing. The Rubicon has been crossed. The die is cast. Iacta alea est, etc.


Well, not really. Code is living and breathing and changing, so if we updated all licenses today to forbid the use of the code to train LLMs that will be used in a proprietary setting, it would be fine, no?


You can say anything you want in your licence; I guarantee you people will break it.

If you don't want your code public, don't make it so, because licenses are a thing of trust and you shouldn't trust anybody.


It will not change anything. AI farms will git-clone from any public repository. Someone concerned should just patent-troll any software that may or may not use a copilot-ed part of their code.


You could make this into an addendum to whatever license you want, no? That way your license covers both reuse and attribution, and also states that you are not allowed to upload to GitHub.


Copyright regulation for intellectual property should not exist.


    *.code-workspace
    .github/
    .vscode/
in your .gitignore is another way to express that you don’t want Microsoft as a part of your project.


Maybe it's a good point for a mini-ask HN: assuming you agree with the sentiment, what is your preferred alternative to GitHub?


I use a self-hosted Gitea instance. Haven't had any issues with it; I can highly recommend it. Not super useful if you want contributions though xD


Sourcehut. I like how they approach things there, and am happy to pay for my account to support them.


I would guess that GitLab is the most used one right after GitHub, but I may be incorrect.


I still like GitLab a lot and use it wherever I can. I think that it was once talked about very favorably on HN, but it seems to have lost its popularity after its IPO (or even earlier).


There’s also the swath of non-Git DVCS options like Fossil, Mercurial, darcs, Pijul, et al., and they have their own forge hosting options.


sourcehut - the hacker's forge


Self-hosted Gitea.


Closed source. I can't see the point of publishing anything innovative if it is just going to be vacuumed up by some giant corporation and sold back out to users at $10 a month.


Codeberg seems like a decent host for Free Software projects. There is also SourceForge, but I don't know if people have already forgiven and forgotten the mistakes made in the past.



Self-hosted GitBlit.

https://www.gitblit.com/


git-remote-ipfs or GitTorrent.


AI is like the Tornado Cash of copyright infringement and IP theft. It's a mixer that exists to skirt existing law.


Is Copilot trained only on GitHub-hosted software? What stops them from using sources from Codeberg if they wanted to?


I wonder if we will start seeing Open Source licenses that explicitly forbid using code for model training.


(apologies for the sarcasm, but)

According to the Hitchhiker's Guide to the Galaxy…

A new coding highway was planned years ago. You could have filed a complaint in the Implications of Future AI Department in the basement of Bill Gates’ mansion, but you didn’t.

The highway is being constructed and you’re just going to have to deal with it.

Here’s some Vogon poetry to make you feel worse and a towel to cry in (only one of its many uses).

The guide also suggests gardening as a way to reduce anxiety.


There should be a new open-source license specifically designed to restrict the use of code for commercial purposes, whether that involves training machine learning models or not. It could be similar to the GPL but tailored specifically to the field of machine learning, ensuring that the code cannot be used for commercial ML purposes or for training without limits.


People aren’t necessarily opposed to commercial use (which many licenses allow), but are upset that license terms like attribution requirements are completely being ignored by Microsoft/Github, turning Copilot into a sort of laundering machine for open-source content.


That's what I am trying to say. Personally, I don't mind whether my code is being used commercially or not. I do mind when Microsoft uses my code without a reference and asks me to pay for Copilot just because they decided to put it behind a paywall.


Why?! Why restrict the advancement of a field? Most of the time the code uploaded to GitHub came from another source (not original content; very little code is original).

Nobody can stop this anymore. If it's not GitHub, which only takes code from its own platform, others will come along that scrape the internet and use everything, just like Midjourney. Embrace it.

I'm more against AI art generators, because they produce end results and will fully replace the need to know how to draw. Copilot is just providing simple snippets; it's not thinking for you, so it won't replace the engineers completely, at least for now.


Still, it's good to have it as an option. If you see this as an advancement, you can choose something like MIT; if you don't want your code used to train a paid Copilot, then you can choose something like MIT-Human only.

It's a better solution than not using GitHub at all. And whether we like it or not, a new license should be introduced to address the new data model.


If you put restrictions like this on your code, then it's no longer open source.


You're correct in the common-sense interpretation of the phrase "open source", but unfortunately groups like the Open Source Initiative have redefined it to mean "anything on a list of licenses we approve", i.e. GPL/MIT/BSD. Personally I think that people trying to enforce this definition in online discussions are wasting their time, because it'll never catch on. We should stop trying to make "fetch" happen.


There is a difference between open source and free.


GNU GPLv3 wants a word. It has way more “restrictions” (restricting one entity to free another)


Copyright laws do not apply. It's trained intelligence, in a similar way to human intelligence. If you apply copyright laws to LLMs, you should apply them to human intelligence too, as it's the same process (at a different scale).

Yes, programming is going away, as are most intellectual and artistic tasks.


So many copyleft people writing about their copyrights feels... weird.


Noted


I read through the GitHub "problems" following the link, and it reads as "for-profit organisation makes for-profit tools"... Great.

I think GitHub, after Stack Overflow, is the best thing that happened to developers.


... until profit-making goes against developing a good service.

> I think GitHub, after Stack Overflow, is the best thing that happened to developers.

I don't think GitHub is bad tooling (there are worse), but the best thing for developers? I don't see anything on GitHub that is greatly better than what we have in other services.


GitHub was a huge step up at the time, compared to SourceForge and Google Code and the like. Git itself helped, since checking out code from Subversion and CVS didn't give you the history, so you'd lose that if you wanted to publish your own fork.


I think git itself is better, personally.


I think the same thing. Before git I was using TFVC, and git was a huge improvement in pretty much every aspect of code version control.


[flagged]


You've clearly used a throwaway just to be contrarian, but have you considered not posting at all if it adds no value to the discussion?


I think it does.

See, if a comment like that crops up in a civilised society like this, it demonstrates that the barren wastelands of “the outside world” will have similar examples.



