Please don’t upload my code on GitHub (codeberg.page)
360 points by modinfo on May 8, 2023 | 478 comments



If it was possible to point to where the code is actually stored in Copilot (i.e. run strings on the server and it spits out the copylefted code), I would be a lot more sympathetic to the view that LLMs are stealing code. Even if you decode the weights from floats to strings, you won't find the strings of code stored anywhere. It's probably not correct to say it's "learning" from code the same way humans are, but it is "learning" from code in the way LLMs learn.

Fundamentally it seems to me that Copilot truly does synthesise code from some store of knowledge (even if it's hard to understand what this store of knowledge is), and the problem is that it's synthesising code identical to existing code. There are legal tools and also rhetoric that are designed for dealing with this problem of "synthesising something that is identical to an existing thing", and it's different tools and rhetoric from the ones we have for dealing with the problem of "stealing or copying existing things". It's valid to have an issue with Copilot ingesting your code, but unfortunately people are largely using tools and rhetoric from the latter category to approach the issue, and that slight misapplication is causing their issues to fall on deaf ears a lot of the time.


> If it was possible to point to where the code is actually stored in Copilot (i.e. run strings on the server and it spits out the copylefted code), I would be a lot more sympathetic to the view that LLMs are stealing code.

I think of it the other way around: I'd be more trusting that LLMs aren't a problem in that sense, if the big commercial entities linked to them used their own code to train them as well as public repositories.

MS's codebase must contain much good training material, unless they think their own code is crap and are embarrassed to let the AI look at it!

I am still undecided, but erring in the direction of not wanting my stuff to be used to train them, though I've long since passed Douglas Adams's "35 years old" tech barrier so maybe I'm just not liking change…

For now at least I won't be putting my own code on services like GitHub. Then, if someone else does, or if an LLM is trained on public sites generally, at least I've not explicitly agreed to anything that says the company can use my code that way - which you do (they say; I believe there is at least one court case brewing on the matter) when you sign up to the service, agree to its terms, and use it that way.


> MS's codebase must contain much good training material, unless they think their own code is crap and are embarrassed to let the AI look at it!

Having worked at Microsoft, I'm not sure I want my coding assistant trained on Microsoft's codebase.


But would it be worse overall than the quality in the public repositories that they did use for the training data?


Have I gone back in time?

I feel like I’m reading the exact same sentence that someone posted 6 to 8 months ago about whether Stable Diffusion included images or was just “learning patterns” and totally generative.

People argued, oh so hard about it.

“But stable diffusion is GB in size, how can it have embedded full images of a dataset that is so much larger? It’s not possible.”

…but, it turns out it is possible (1) and it does, in fact, have embedded full copies of the source images.

Now. Let’s talk about code…

> and the problem is that it's synthesising code identical to existing code

There’s a word for that. It’s called copying.

Copying fragments, copying full text. It’s a black box, it spits out content that is indistinguishable (again, see stable diffusion, a lossy jpg of an image is the same image even when it is not bit-for-bit identical) from the input.

That’s copying.

The black box might be a neural network, or a Python script that reads a file from disk. The process is irrelevant.

If you memorise a block of code and type it out by hand with no reference, you are copying it.

These models copy code.

These models have embedded full text copies of some code.

What we do with that is an ethical question, but that it is true is not in dispute. It is true. It’s documented.

If you think these models are not copying at least some training data as output you are factually, and provably incorrect.

(1) - https://arstechnica.com/information-technology/2023/02/resea...


> “But stable diffusion is GB in size, how can it have embedded full images of a dataset that is so much larger? It’s not possible.”

> …but, it turns out it is possible (1) and it does, in fact, have embedded full copies of the source images.

The _not possible_ assertion is in response to people arguing that generative image models literally copy and thus "steal" the images in their training set or that such generative models "work" by infringing on said images. The assertion is not that the models cannot possibly contain _any_ copies of training data, the assertion is that the models are generalizing and can not contain a majority of the training data.

Now, let's look at some lines from your linked article titled "Paper: Stable Diffusion 'memorizes' some images, sparking privacy concerns", which happens to lead with scare quotes around "memorizes" and already hedges with "some":

> Researchers only extracted 94 direct matches and 109 perceptual near-matches out of 350,000 high-probability-of-memorization images they tested . . . resulting in a roughly 0.03 percent memorization rate in this particular scenario.

I do not consider a 0.03% rate of full copies particularly damning. Instead, it sounds like a flaw in training which could be addressed, and which the article does in fact attribute to overfitting on images that are over-represented in the training data.

> the 160 million-image training dataset is many orders of magnitude larger than the 2GB Stable Diffusion AI model. That means any memorization that exists in the model is small, rare, and very difficult to accidentally extract.
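
(Back-of-the-envelope: 2 GB spread over 160 million images works out to roughly 12 bytes per image, so anything close to wholesale memorization is arithmetically impossible.)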

Which side's claims do you feel this line supports?

Generative text models should, in theory, also be generalizing and not memorizing. In practice, however, it does appear far easier to get Copilot and the like to spit out exact copies of code, complete with license text. Presumably, there's just not enough code out there to sufficiently generalize from, so we're seeing overfitting as a matter of course.


If you made 1,000,000 AI-generated MP3s available and 0.03% of them matched Sony’s copyrighted music catalog then you would be liable for $250,000 * 30 in statutory damages. Arguing that infringement doesn’t occur because the incidence rate is low is wallpapering over a serious problem.


Really depends on the details. If the extractable works align with songs that are over-represented in the training data, they may largely consist of performances of public domain compositions. And if their extraction also requires prompting which explicitly requests an infringing result, then the actual liability might be something you could mount a defense for.

Or hell, the extracted potentially infringing and over-represented material might all be pop music set to variations of Pachelbel's Canon, and I'd pay to see that lawsuit.


Details matter here. For a given musical performance, there are at least 3 copyrights in action:

1. The copyright on the composition. This can also include arrangements - for instance, Gershwin's original piano version of Rhapsody in Blue is now public domain, but the orchestral version everyone knows is not.

2. The copyright on the sheet music: the actual layout, spacing, editorial notes, things like that. It's actually an insanely deep subject - I've got an 800 page book on it. The field is referred to as music engraving, as up until about 40 years ago it was literally done by engraving the plates by hand. It's a much harder problem than doing normal book-style text layout, as it's fully 2D, whereas text is basically 1D with occasional special cases. (NB: This copyright is really only relevant to the musicians, conductors, etc, but it does matter.)

3. The copyright of the particular recording. This is the really relevant one. A 5 year old recording of a 500 year old work is very much under copyright.


This is just a distraction. Everyone knows we're talking about the last one here.

No one sued Napster because their guitar tabs were being shared.


No, it absolutely is not.

From GP: " they may largely consist of performances of public domain compositions."

My entire point is that the composition being PD does not mean RECORDINGS of it are PD.


And, obviously, the lyrics.


If that was the case, they'd be financing this stuff. Afaik they're not. What if you cross-referenced the Sony copyrighted catalogue with 1 million traditional/public domain songs? I'd guess 0.03% would be a rounding error.


Pedantic: 0.03% of 1 million is 300, not 30.


I'll see if I can dig some up, but there have been similar examples to the image reproduction case that have come out of Copilot; all it took was using some more unique method names (i.e. ones that were very specific to a particular code base) and oh look, there comes the "totally not copied, honest gov'nor" code.

Any company that is using Copilot or similar is walking right into a massive legal minefield. Any engineer that chooses to use Copilot without explicit approval from the company should seriously worry about what target they're putting on their back. You don't want to be putting yourself in a position where your employer can say "They did this without our approval" or worse "Despite explicitly having told them not to", should there end up being legal action. Someone is going to establish precedent one way or another at some point; don't let it be you.


"I'll see if I can dig some up, but there have been similar examples to the image reproduction case that have come out from codepilot, all it took was using some more unique method names "

You mean, like the example the article used?

https://web.archive.org/web/20221017081115/https://nitter.ne...

"sparse matrix transpose, cs_"

Generated the author's matrix code.


Here’s the example of copilot regurgitating GPL code verbatim:

https://codeium.com/blog/copilot-trains-on-gpl-codeium-does-...

It produces an almost exact copy of some training data.


If Edison goes over to Tesla’s house and looks at the latest invention on Tesla’s workbench, then goes to his factory and builds the same thing and sells it, then Edison has stolen and copied Tesla’s invention.

If Edison breaks into Tesla’s house at night, takes* the invention off his workbench, goes to his factory and uses it as a reference to build and sell the same thing, then Edison has stolen and copied Tesla’s invention.

In both cases we can say “stolen and copied”, but the actual sequence of events and the legal recourse that Tesla has is different in each case.

This difference matters because open source is like Tesla having his workbench in a public place where people are encouraged to watch him work, play around with the inventions, help him build, etc. In the “Tesla’s public workbench” situation, Tesla’s recourse is debatably impacted in the first scenario, but it is definitely unchanged in the second scenario. Arguing as though Copilot is doing “second scenario stealing and copying” rather than “first scenario stealing and copying” sidesteps that debate about whether open sourcing your code made it ‘up for grabs’ for natural and/or artificial intelligences to learn from and/or memorize. Now, it’s not clear what side will win that debate - I personally think “this has infringed intellectual property rights” has a decent chance of being the victor. But sidestepping that debate is setting off peoples’ “you’re trying to pull a fast one” alarm and making them unsympathetic. That’s what I’m getting at here.

*: to avoid “steal vs copy depriving of property”-type discussions, postulate that Edison returns the invention to the workbench before Tesla notices

As to the linked article, I would quote from the article itself: “[The paper] is dense with nuance that could potentially be molded to fit a particular narrative”. I could say more: both the article and the paper state there were zero byte-identical matches, and 203 direct or perceptual near matches out of 175 million generations (or out of 160 million training images) is 0.0001%, which argues against this very strong claim of “factually and provably incorrect”. But I think the argument I will actually make is this: I said Copilot was synthesizing identical code from its knowledge base, which is completely compatible with the claim that it is outputting code from memory (in the human sense of memory). What it is not compatible with is the claim that it is outputting code from memory (in the computer sense of memory).


> That’s copying.

> These models copy code.

These assertions have very little weight - what matters is what the law says. And for now, in the jurisdictions that matter, the law is still silent. Maybe it will catch up via either legislation or case law, or maybe it won't. But for now there is no legal consensus on what these new things are doing.

Of course there is a moral/ethical component to this, and individuals are obviously free to hold personal views. But unless some global working consensus forms (unlikely) then I don't see how change will come from that.

I suspect that it's going to be difficult for politicians to find the will to do something about this issue, and then to get their heads around it enough to make good laws.


If y = f(x, T) where y is a member of T, prove f() doesn’t copy a value from T.

This is possible. For example, many functions f(0) return 0; if we can derive a function g(x) that generates y without access to T (ie. the training data) then we can plausibly assume that f(x, T) might indeed not be copying from T.

…but can we do so?

If so, we must admit that it is possible that f(x, T) = g(x) and no copying is taking place.

So, in the past 50 years of prior work can we come up with an example of a function g(x) that generates the exact code we see coming out of these LLMs?

I’ll grant, it’s not impossible.

For example, if stable diffusion generated sine wave patterns or fractals that exhibited “deep complex structure” and it happened that some artist had written code to do that exact thing, there would (I believe) be a fair case that stable diffusion was not copying the artist; it was simply a parallel implementation of the same generative code.

However, now my scepticism kicks in.

For code written by hand, that was not generated, we are suggesting that an LLM is a parallel equivalent implementation of a human mind, and that it can, with purely “learnt patterns”, replicate not only the intent and structure of known code, but the exact code itself, repeatedly.

I’ll grant. It’s not impossible, and it’s very difficult to prove, but it seems fabulously, unbelievably, extraordinarily unlikely that a clean room reimplementation would be identical, repeatedly.

We’re entering into the domain of assigning probability and then doing a zero knowledge proof here, but the solution is easy.

Just prove by counter example.

Train a model that generates such code without the code as training data.

You have now won the argument.

…on the bright side, you’ve also solved the 10 million dollar question of “how do I train my LLM when I don’t have enough training data” (because that’s what you just did; train it to do something without training data), so you’re now rich and give zero ducks what I think. Congrats.


Could you explain this:

> For example, if stable diffusion generated sin wave patterns or fractals that exhibited “deep complex structure” and it happens that some artist had written code to do that exact thing, there would (I believe) be a fair case that stable diffusion was not copying the artist, it was simply a parallel implementation of the same generative code.

I understand your pseudo equations and I understand [some] engineering.

Why are artists writing code? Is that to show the parallel, as if artists did write code?

Unless artists do really write code but that’s beyond the scope of what I’m asking.


The point was that some artist may have created a work representing a complex fractal, perhaps through code since it can be quite difficult to manually compute some fractal patterns.

And that in that case, g(x) that can produce a perfect copy of the artist's work without having the image in the training set provably exists, and is the actual mathematical fractal function itself.
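
To make that concrete, here is a minimal sketch (Python, with arbitrary viewport and resolution choices) of such a g(x): a pure fractal renderer that involves no training data at all, yet any independent implementation of the same formula will emit the same image.

  # Hypothetical illustration: a deterministic fractal renderer g(x).
  # It uses no training data, yet two independent implementations of
  # the same formula produce identical output, pixel for pixel.
  def mandelbrot_escape(cx, cy, max_iter=100):
      """Iteration count at which the point (cx, cy) escapes, up to max_iter."""
      zx, zy = 0.0, 0.0
      for i in range(max_iter):
          zx, zy = zx * zx - zy * zy + cx, 2 * zx * zy + cy
          if zx * zx + zy * zy > 4.0:
              return i
      return max_iter

  def render(width=80, height=40):
      """ASCII rendering of the classic viewport [-2.5, 1.0] x [-1.0, 1.0]."""
      shades = " .:-=+*#%@"
      rows = []
      for y in range(height):
          row = ""
          for x in range(width):
              cx = -2.5 + 3.5 * x / width
              cy = -1.0 + 2.0 * y / height
              row += shades[min(mandelbrot_escape(cx, cy), 99) // 10]
          rows.append(row)
      return "\n".join(rows)

  if __name__ == "__main__":
      print(render())

Two parties who each write this down independently will produce identical output with no copying involved; the argument upthread is that ordinary hand-written application code almost never has that property.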


The law is generally less concerned with how something happened and more concerned with what happened. So I would suspect that the legal frameworks will be more restrictive towards code that is reproduced verbatim, regardless of if the code is directly copied or LLM generated.


In my personal lived experience, the law is usually most concerned with the “who”.


The subject of the thread is "Will my code be regurgitated if I put it on Github ? Should I avoid putting my code on Github if I don't want it copied?"

The legality aspect that you are injecting into the discussion is irrelevant.


I was replying to the comment by @wokwokwok.


This is correct. I suspect that the colorization precedents will come back into play: https://chart.copyrightdata.com/Colorization.html

Basically, if you take something and then feed new info into it, is that new thing a new thing, a derivative of that work, or something else? With these models the data and mathematical spline is so smeared across thousands of endpoints that it is hard to tell. However, I do not think the courts have to think about the details of how the models work. They can do something else so a jury can get its head around it.

But basically the courts will probably simplify it: copyrighted things go in one side, and other things come out the other side that sort of, kind of resemble the original thing. Do not worry about the details of how it is done. Is that new thing owned by the original copyright holder? Or many holders, as it is smeared into a hundred other items that may or may not have copyright holders? Or is it a new work? Or is it a derivative work that has just been colored?

This could go either way. As someone put it very nicely here a few weeks ago, these AIs are like the most amazing autocomplete you have ever seen. Now, in copyright, if you make something and I independently come up with the exact same thing, and I can prove it, I am in very little trouble. In this case, without that input code the AI model probably would not predict that exact string. But in some cases it does predict it exactly, while also being able to predict thousands of other things. Is that prediction copyrightable? What if it predicts part of the code mixed with something else? As a third party who did not create the model but just used it and got the code, what are my liabilities? And if it is infringement, what are the consequences? The courts will have to decide eventually.


The law cares about class. As for which class: not the working class. AI has no reason not to keep pushing through these ethics issues in the Global North.

I believe China is trying to limit AI powers but who knows how that will go.


Does this storage argument also go for humans? If I ask a painter to paint paintings they've seen before, what will be their rate of copying the painting exactly?


Are you asking if it is okay for humans to plagiarize or otherwise violate copyright? For both the human and the algorithm, the answer is: it depends. Yes?

Are you trying to make money off of it? Is it for personal use? Was it protected or public domain?

It just all depends on the circumstances. Human or not.


“It depends on the circumstances” seems to me like cover for “we made up some random rules that made sense at the time”.

If you cover a song on YouTube, you can apparently be demonetised or attract copyright strikes from the original artist. But in software, you tell me an algorithm and I code it up, that’s an original work.

The lines all seem pretty arbitrary to me.


Oh, I used to be a lawyer. "It depends" is just always the answer, whether we like it or not. And, yes, it is pretty much because of “we made up some random rules that made sense at the time”.


If you remove "random", that's a pretty fair representation of any law, lol.


It doesn’t matter. The process is irrelevant.

Black box. Input. Output. Is the output the same as some training input?

It’s not rocket science.

“…but humans…” argument is not relevant. This is not a human.


> It doesn’t matter. The process is irrelevant.

Wrong. Literally the whole thing about law - and especially about intellectual property laws - is that process is as much, if not more relevant, than the outcome. This is why "code as law" efforts are plain suicidal. This is why you can't just print out a hex dump of a pirated MP3 file and claim it's not copyright violation because it's just a long number that your RNG spit out - it would've been a good argument if your RNG actually did that, but it didn't, and that's what matters.

This is what it means when we say that, for just about everyone except computer scientists, bits have colour[0]. Lawyers and regulators - and even ordinary people - track provenance of things; how something came to be matters as much, and often more, than what the thing itself is.

This is what makes generative AI case interesting. They're right there in the middle between two extremes: machines xeroxing their inputs, and humans engaging in creative work. We call the former "copying", and the latter "innovation" or "invention" or "creativity". The two have vastly different legal implications. Generative AI is forcing us to formalize what actually makes them different, as until now, we didn't have a clear answer, because we didn't need one.

--

[0] - https://ansuz.sooke.bc.ca/entry/23


The pirate bay founders made the argument that process was necessary and lost fairly big. They argued that the process dictated that the prosecutor had to first prove that a copy had been made, and prosecute that, before they could argue that the pirate bay somehow helped with that crime.

The court did not agree. They looked instead towards an anti-biker gang law that illustrated that a biker bar can be found guilty of assisting with gang crime, even if no specific crime can be directly associated with the bar.

The defense team argument - that prosecutors need to prove that a crime had occurred - failed. The courts only require that the opposite is not believable, which given all the facts around the case was deemed sufficient. In that question the process doesn't matter. If the court does not think it believable that copying has not occurred, any argument about "machines xeroxing their inputs and humans engaging in creative work" will be ignored.


I wonder who had much more money, PB or the RIAA


But if I put a human in the black box, that somehow now matters to your argument, because you're saying that it only holds for machines.


I don’t care if there’s a human in the box. If the box spits out training input as output it is copying it.

It doesn’t matter if there is a human doing it or not.

For your supposition to work, the input to the box would be only an abstract summary of the logical steps, and the output an exact copy of some other thing that was never an input.

In that case, yes, it would not be copying.

..but, is that the case? Hm? Is that what we care about? Is it possible to randomly generate the exact sequence of text with no matching training input? Along with the comments?

It seems fabulously unlikely, to the point of being totally absurd.


I'm a proponent of not restricting (well, or trying to restrict) machine learning models and not considering them a lossy database, but it must be said: if humans can recreate copyrighted works from memory and publish them, they are in trouble too.


I agree, I'm not saying that machines don't produce copies of existing data, I'm saying that's not all they produce.


I agree. The problem is that a human has ethical deterrents to avoid copying data while a machine doesn’t, so we have to rely purely on legal incentives to avoid copies from being produced.


I think the best argument here is that having the work in memory is not illegal, and human brains are not bound to copyright even when they can also be considered lossy databases. The question is where do we draw the line for a lossy database.


If you transcribe a copyrighted book by hand, that doesn't give you the right to publish it. I don't think being a human currently gives you a legal loophole to copy works so why make the comparison?


Humans (and their creativity) have special status and privileges in law.

Machines don't. It doesn't matter how fancy they are. The law doesn't care.

So yes, a human in a black box is different from a machine in a black box until laws change.


Funnily, the copyright argument is shifting towards “if resemblance is substantial, that’s infringing anyway”, circumventing whole discussions around “is it infringing if it’s called learning” arguments.


Close to zero? It's too hard to create an identical picture from memory; even with the original at hand, the fakes are not bit-perfect. Computers, on the other hand, are great at copy&paste.


Except, if you look closely, the AI-generated duplicates are far from identical too.


Except they're not far? We're not talking about average images with 3 eyes or similarly obviously wrong code, the original comment is specifically about storing perfect copies. Human brains can't store perfect copies (and then reproduce), especially at scale


Close to 0, but not 0

https://www.nationalgeographic.com/science/article/autism-ar...

Stephen Wiltshire is cool.


But how is this relevant?


the human painter will undoubtedly be able to put out 1 billion exact copies of paintings based on their memory and skills in under 10 seconds so there is no reason to consider any difference in the cases.


Is number of copies and speed of making them a significant determinant in whether copyright infringement took place or not?


an opposite question - you seem to think that a machine has the same protections against charges of copyright infringement as a human does - why?

They don't have the same rights to register a copyright.


Why is registering a copyright relevant in cases of copyright infringement? Copyright infringement is infringement of the right of someone else against copying.


They do copy code sometimes, but most of the code they generate is new. I think some (maybe not many, but some) people have the impression that they literally search for similar code and copy it exactly 100% of the time, which isn’t true.


> in fact, have embedded full copies of the source images

A few images out of millions.


It's clearly not a copy. Stop arguing in bad faith.


> ‘There are legal tools and also rhetoric that are designed for dealing with this problem of "synthesising something that is identical to an existing thing", and it's different tools and rhetoric from the ones we have for dealing with the problem of "stealing or copying existing things".’

Copyright law applies in both cases. If you create work that’s substantially similar to an existing one, you’re risking copyright infringement even if none of the original work’s content was reproduced in the mechanical sense.

If this were not the case, why would cover bands pay the original artists for rights to the songs?

“I learned to play ‘Yesterday’ by heart” doesn’t mean you can do anything with the song without paying the Beatles. The same applies to machine learning, if all the model has learned is to imitate copyrighted works.


Yes, I do think that LLMs are probably doing copyright infringement. There is a long and rich heritage, in tech and elsewhere, of arguing that copyright infringement is not stealing, it is its own separate (though similar) crime and usually a lesser crime to some degree. Maybe my original point could be rephrased snarkily as “you may have a point but until you stop sounding like the RIAA calling everything ‘stealing’, a lot of people are going to ignore you”.


If I type out a copyrighted program character-by-character from my memory, does this somehow get exempted from copyright law?


Trying to be more precise:

If the LLM reproduces a significant portion of a program token-by-token, is it a derivative work and is it not fair use?


Now, I'm not a lawyer, so this isn't legal advice, but...

A derivative work is not fair use. If you end up with a significant portion of another person's program in yours (such that a substantial portion of your program is in some way related to their program), that will likely be a derivative work - but the definition of a derivative work depends on your jurisdiction and use-case. If you're unlawfully using the source material to produce a derivative work, you cannot copyright that derivative work under 17 U.S.C. § 103(a), and under the same section, you can only copyright your modifications, not the original.

It would be hard to argue fair use in this case; fair use only really applies for parodies/criticism, reporting, and scholarly works - and generally that's an affirmative defense, rather than an express or implied right you have.

Honestly, Copilot is difficult because Copilot can't be the author of the code; the person who used Copilot is the "author" of the code, and I think they'd be the ones liable for copyright infringement if copyrighted content ends up in their code.

To argue someone performed copyright infringement, all you need is to prove (1) a valid copyright exists; (2) that the person had access to the work; (3) the person had the opportunity to steal the work; and (4) that protected elements of the work had been copied (afaik generally under a "substantially similar" standard). Copilot offers an easy way to check both (2) and (3) - a copyright holder could argue that people had access to their code through Copilot, and that Copilot offered an opportunity to steal the work.


> the problem is that it's synthesising code identical to existing code

No, the problem is that it's illegal to do so.


> it seems to me that Copilot truly does synthesise code from some store of knowledge

It's a common mistake because we are not used to LLMs.


How is any of that relevant? If I memorize your work and dissect my brain and run strings on the output, you won't find anything either. Same happens if you put it in a .zip file. But both, and the language model, can spit it back out and infringe on your rights.


Humans have learnt by looking at a few examples, and from there can code almost anything. Humans can also learn without examples, although it's harder. But I expect the first C programmers didn't have a K&R book to learn from.

AI instead has looked at all the existing code and it has learnt to do no more than what the existing code can do.

Do you see a difference?


No, because I can ask it for a piece of code that isn't in the training set and it will write it for me.


You are missing the point.

Humans make sense of the world with an extremely limited amount of information. I have not read all of GitHub. Comparatively, I have not read a fraction of it. And I could not read all of GitHub even if I wanted to.

However, I do not need to read all of GitHub anyway because, as a human, I am capable of understanding.

Current generation "AI" is not capable of understanding. Current generation "AI" merely computes the probability of any given word appearing next in this sentence based on an utterly absurd amount of raw data.

That is not what humans do at all. We do not need this information. We cannot even process it.


We do not know whether or not that is what humans do, or how qualitatively different it is from what humans do, because we don't know all that much about how human reasoning works.

Put another way: Some Markov models are Turing complete, so "merely" computing probabilities can be Turing complete with only minor steps, so trying to downplay the potential capabilities of models like this by handwaving about things we don't know is foolish.

We don't need that scale of input, but we also don't know if LLMs need that much input to do well, or if our current training protocols are simply poor. With ongoing work on reducing the training cost, it is at a minimum clear that current training methods are far from optimal.


You make a logical leap. "Machines learn differently from humans, therefore they cannot understand". My definition for "understanding" isn't just "whatever humans (and nobody else) do".


Whatever your definition of "understanding" may be, we appear to agree that these are different things which is all parent asked about:

> Do you see a difference?


you can ask for things that are combinations of stuff that is in the training set.


> AI instead has looked at all the existing code and it has learnt to do no more than what the existing code can do.

Are you sure about that?


> Are you sure about that?

Yes I'm quite sure. Unless you're claiming that source code existed before humans?


I’m sorry but I don’t understand you.

It sounds like what you are saying is that the AI cannot write code which would do something it hasn’t seen in the training set. That is how I interpreted the “it has learnt to do no more than what the existing code can do”. Do I understand you right?

If so, how do you know that? Are you talking about the limitations of a particular implementation or a limitation of all AIs as a concept?


> If it was possible to point to where the code is actually stored in Copilot (i.e. run strings on the server and it spits out the copylefted code), I would be a lot more sympathetic to the view that LLMs are stealing code

If it was possible to point to where the pixel values are actually stored in a jpeg file (i.e. run dd on the file and it spits out the pixel values), I would be a lot more sympathetic to the view that LLMs are not stealing code.


In the very next sentence after the one you quote, I say “Even if you decode the weights from floats to strings, you won't find the strings of code stored anywhere.”

If you decode the jpeg file you do get the pixel values.


The whole debate is that when you decode the weights (i.e. "run AI on some prompt") you do in fact get training code reproduced verbatim. The fact that we do not have tools to analyze this decoding function analytically is orthogonal.


You don't get the exact pixel values decoding a jpeg. But for a lot of uses, what you get is close enough.

One could describe LLMs similarly.


Aren't we just talking about a novel compression technique here? Like that voice codec Google released a while ago that essentially modeled your voice and then fed a piano roll of what you said through it.

I don't really care whether there's a lossless, lossy, or AI compression of my stuff. I care that it's my stuff.


Every time I see this argument about LLMs I can't help but think of Borges' Pierre Menard, Author of the Quixote.


If we switch out LLMs for LZMA algorithms your reasoning still makes sense. Which is a real problem with your reasoning.


Would you be able to elaborate on the following? This isn't my field; I would be interested to learn an alternative way to talk about all this.

> legal tools and also rhetoric that are designed for dealing with this problem of "synthesising something that is identical to an existing thing"


Copyright infringement, intellectual property, patents, things like that.


What happens if I create a 10 lines function character by character identical to some proprietary or GPLled piece of code, without ever looking at that code nor knowing that it existed?

I expect that the copyright holders could reach out to me, tell me that my code is identical to theirs, believe or not that I independently created the same code, and at least ask me to stop distributing it.

Of course

  t += total(e)
is easier to defend than

  monthlyTotal += dailyTotal(expenses)
because the chances that I didn't copy their code get slimmer and slimmer as the names of the identifiers and the structure of the code get more complex.

If I actually looked at their code, it would be wiser from me to at least change the names, maybe also a little bit the structure of the code. That's the difference between being inspired by something and copying it.

TL;DR: if Copilot generates a copy, it's a copy.


> What happens if I create a 10 lines function character by character identical to some proprietary or GPLled piece of code, without ever looking at that code nor knowing that it existed?

Then you're in the clear, though you may need to convince a court that that's what happened. (Patents and trademarks could still be issues, but there's no copyright issue)


I know the law doesn't work this way, but wonder if code that can be "learned" this easily should really be copyrightable, etc. These LLMs don't keep a copy of some code to paste in verbatim.


LLMs totally do, and the fact can be confirmed by experiment. See "Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4": https://arxiv.org/abs/2305.00118


I'm really baffled by all this discussion on copyrights in the age of AI. Copilot does not 'steal' and reproduce our code - it simply LEARNS from it as a human coder would learn from it. IMHO the desire to prevent learning from your open source code seems kind of irrational and antithetical to open source ideas.


The problem is not that Copilot produces code that is "inspired" by GPL code, it's that it spits out GPL code verbatim.

> This can lead to some copylefted code being included in proprietary or simply not copylefted projects. And this is a violation of both the license terms and the intellectual proprety of the authors of the original code.

If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.

Edit: I'm definitely no expert on copyright law for code but my personal rule is don't include someone's copyrighted code if it can be unambiguously identified as their original work. For very small lines of code, it would be hard to identify any single original author. When it comes to whole functions it gets easier to say "actually this came from this GPL licensed project". Since Copilot can produce whole functions verbatim, this is the basis on which I state that it "would be a clear violation" of the licence. If Copilot chooses to be less concerned about violating the law than I am then that's a problem. But maybe I'm overly cautious and the GPL is more lenient than this in reality.


"The problem is not that Copilot produces code that is "inspired" by GPL code, it's that it spits out GPL code verbatim."

But only snippets as far as I can tell.

This is the code example linked by the author:

https://web.archive.org/web/20221017081115/https://nitter.ne...

It is still not trivial code, but are there really lots of different ways to transpose matrices?

(Also, the input was "sparse matrix transpose, cs_", so his naming convention was specifically included. So it is questionable whether a user would get his code in this shape with a normal prompt.)

And just slightly changing the code seems trivial, at what point will it be acceptable?

I just don't think spending much energy there is really beneficial for anyone.

I'd rather see the potential benefits of AI for open source. I haven't used Copilot, but ChatGPT4 is really helpful generating small chunks of code for me, enabling me to aim higher in my goals. So what's the big harm if some proprietary black box also gets improved, when all the open source devs can also produce with greater efficiency?


> (Also, the input was "sparse matrix transpose, cs_", so his naming convention was specifically included. So it is questionable whether a user would get his code in this shape with a normal prompt.)

This. People seem to forget that generative AIs don't just spit out copyrighted work at random, of their own accord. You have to prompt them. And if you prompt them in such a way as to strongly hint at a specific copyrighted work you have in mind, shouldn't some of the blame really go to you? After all, it's you who supplied the missing, highly specific input, that made the AI reproduce a work from the training set.

I maintain that, if we want to make comparisons between transformer models (particularly LLMs) and humans, then the AI isn't like an adult human - it's best thought of as having a mentality of a four year old kid. That is, highly trusting, very naive. It will do its best to fulfill what you ask for, because why wouldn't it? At the point of asking, you and your query are its whole world, and it wasn't trained to distrust the user.


But this means that Microsoft is publishing a black box (Copilot) that contains GPL code.

If we think of Copilot as a (de)compression algorithm plus the compressed blob that the algorithm uses as its database, the algorithm is fine but the contents of the database pretty clearly violate GPL.


While I do believe that thinking and compression will turn out to be fundamentally the same thing, the split you propose is unclear with NN-based models. Code and data are fundamentally the same thing. The distinction we usually make between them is just a simplification, that's mostly useful but sometimes misleading. Transformer models are one of those cases where the distinction clearly doesn't make any sense.


>And if you prompt them in such a way as to strongly hint at a specific copyrighted work you have in mind, shouldn't some of the blame really go to you?

If you, not I, uploaded my GPL'ed code to Github is the blame on you then?


> If you, not I, uploaded my GPL'ed code to Github is the blame on you then?

Definitely not me - if your code is GPL'ed, then I'm legally free to upload it to Github, and to an extent even ethically - I am exercising one of my software freedoms.

(Note that even TFA recognizes this and admits it's making an ethical plea, not a legal one.)

Github using that code to train Copilot is potentially questionable. Github distributing Copilot (or access to it) is a contested issue. Copilot spitting out significant parts of GPL-ed code without attaching the license, or otherwise meeting the license conditions, is a potential problem. You incorporating that code into software you distribute is a clear-cut GPL violation.


The GitHub terms of service state that you must give certain rights to your code. If you didn't have those rights, but they use them anyway, whose fault is that?


>And just slightly changing the code seems trivial, at what point will it be acceptable?

If I start creating a car by using one of Ford's blueprints to create something, at what point will it be acceptable? I'd say even if you rework everything completely, Ford would still have a case to sue you. I can't see how this is any different. My code is my code and no matter how much you change it, it is still under the same licence it started out with. If you want it not to be, then don't start with a part of my code as a base. In my opinion the case is pretty clear: this is only going on because Microsoft has lots of money and lawyers. A small company doing this would be crushed.


Easy. People get to throw rocks at the shiny new thing. To my untrained eye the entire idea of copyrighting a piece of text is ridiculous. Let me phrase it in an entirely different way from how any other person seems to be approaching it.

If a medical procedure is proven to be life-saving, what happens worldwide? Doctors are forced to update their procedures and knowledge base to include the new information, and can get sued for doing something less efficient or more dangerous, by comparison.

If you write the most efficient code, and then simply slap a license on it, does that mean, the most efficient code is now unusable by those who do not wish to submit to your licensing requirements?

I hear an awful lot of people complain all the time about climate change and how bad computers are for the environment, there are even sections on AI model cards devoted to proving how much greenhouse gases have been pushed into the environment, yet none of those virtue signalling idiots are anywhere to be seen when you ask them why they aren't attacking the bureaucracy of copyright and law in the world of computer science.

An arbitrary example that is tangentially related: One could argue that the company sitting on the largest database of self-driving data for public roads is also the one that must be held responsible if other companies require access to such data for safety reasons (aka, human lives would be endangered as a consequence of not having access to all relevant data). See how this same argument can easily be made for any license sitting on top of performance critical code?

So where are these people advocating for climate activism and whatever, when this issue of copyright comes up? Certainly if OpenAI was forced to open source their models, substantial computing resources would not have been wasted training competing open source products, thus killing the planet some more.

So, please forgive me if I find the entire field to be redundant and largely harmful for human life all over.


Yes, of course copyright is dumb and we'd all be better off without it. Duh.

The problem here is that Microsoft is effectively saying, "copyright for me but not for thee." As long as Microsoft gets a state-enforced monopoly on their code, I should get one too.


> If you write the most efficient code, and then simply slap a license on it, does that mean, the most efficient code is now unusable by those who do not wish to submit to your licensing requirements?

If you don't "slap a license on it" it is unusable by default due to copyright.


Could a human also accidentally spit out the exact code while having it just learned and not memorized in good faith?

I guess the likelihood decreases as the code length increases but the likelihood also increases the more constraints on parameters such as code style, code uniformity etc you pose.


> Could a human also accidentally spit out the exact code while having it just learned and not memorized in good faith?

That's just copying with extra steps.

The way to do it legally is to have 1 person read the code, and then write up a document that describes functionally what the code does. Then, a second person implements software just from the notes.

That's the method Compaq used to re-implement the original PC BIOS from IBM.


Indeed. Case closed. If an AI produces verbatim code owned by somebody else and you cannot prove that the AI hasn't been trained on that code, we shall treat the case in exactly the same way as we would treat it when humans are involved.

Except that with AI we can more easily (in principle) provide provable provenance of the training set and (again in principle) reproduce the model and prove whether it could create the copyrighted work even without having had access to the work in its training set.


>The way to do it legally is to have 1 person read the code

wasn't it to have one person run tests of what happened when different things were done, and then write up a document describing the functionality?

In other words I think one person reading the code is still in violation?


> Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners.

https://en.wikipedia.org/wiki/Clean_room_design


yes, reading that description it seems pretty clear to me that they did not read the code but they had access to the working system and then

>by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design.

reverse engineering is not 'reading the code'.


Theoretically maybe, but then they would have to prove in court that they did so without having knowledge of the infringed code. You can't make that claim for an AI that was trained on the infringed code.


Yes, that's why any serious effort in producing software compatible with GPL-ed software requires the team writing code not to look at the original code at all. Usually a person (or small team) reads the original software and produces a spec, then another team implements the spec. This reduces the chance of accidentally copying GPL-ed code.


> Could a human also accidentally spit out the exact code while having it just learned and not memorized in good faith?

Maybe, but that would still be copyright infringement. See My Sweet Lord.


It’s not accidental. Not infringing copyright isn’t part of the objective function like it would be for a human.


Not learning or not being inspired by copyrighted code is not a human function either though.


Has a human ever memorised verbatim the whole of github?

If someone somehow managed to do that and then happened to have accidentally copied someone's code, how believable would their argument be?


> Has a human ever memorised verbatim the whole of github?

No, and humans who have read copyrighted code are often prevented from working on clean room implementations of similar projects for this exact reason, so that those humans don't accidentally include something they learned from existing code.

Developers that worked on Windows internals are barred from working on WINE or ReactOS for this exact reason.


Hasn't that all been excessively played through in music copyright questions? With the difference that the parody exception that protects e.g. the entire The Rutles catalogue won't get you far in code...


> this would be a clear violation of the licence

Not necessarily. If it's just a small snippet of code, even an entire function taken verbatim, it may not be sufficiently creative for copyright to apply to it.

Copyright is a lot less black and white than most here seem to believe.


That’s part of the rub. YouTube doesn’t break copyright law if a user uploads copyrighted material without proper rights. Now, if YT was a free for all, then yeah. But given it does have copyright reporting functionality and automated systems, it can claim it’s doing a best faith effort to minimize copyright infringement.

Copilot similarly isn’t the one checking in the code. So it’s on each user. That said, Copilot at some point probably needs to add some type of copyright detection heuristics. It already has a suppression feature, but it probably also needs to have some type of checker once code is committed and at that point Copilot generated code needs to be cross-referenced against code Copilot was trained on.


> If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.

We aren't talking verbatim generation of entire packages of code here, are we? Code snippets are surely covered under fair use?


It would almost surely be fair use to include a snippet of code from a different library in your (inline) documentation to argue that your code reimplements a bug for compatibility reasons.

In general it is not fair use if you are using the material for the same scope as the original author[0] or if you are doing it just to namedrop/quote/homage the original.

It is possible to argue that a snippet can be too small to be protected, but that would not be because of fair use.

[0] Suppose that some Author B did as above and copied a snippet of code in their docstring to explain buggy behaviour of a library they were reimplementing. If you are then trying to reimplement B's library you can copy the same snippet B copied, but you likely cannot copy the paragraph written by B where they explain the how and the why of the bug.


> Code snippets are surely covered under fair use?

...for "purposes such as commentary, criticism, news reporting, and scholarly reports"? Sure.

For a commercial product? Best check with your lawyer...


Oracle would like to have a word..


The Fair Use concept is specific to the USA.


> it's that it spits out GPL code verbatim

It's not a problem in practice. It only does so if you bait it really hard and push it into a corner, at which point you may just as well copy the code directly from the repo. It simply doesn't happen unless you know exactly what code you're trying to reproduce and that's not how you use Copilot.


Just because code exists in a copyrighted project doesn't mean that it is on the only instance of that code in the world.

In a lot of scenarios, there is an existing best practice or simply only one real 'good' way to achieve something - in those cases are we really going to say that despite the fact a human would reasonably come to the same output code, that the AI can't produce it because someone else wrote it already?


This seems like a really, really easy problem to fix.

It should be easy to check Copilot's output to make sure it's not copied verbatim from a GPL project. Colleges already have software that does this to detect plagiarism.

If it's a match, just ask GPT to refactor it. That's what humans do when they want to copy stuff they aren't allowed to copy, they paraphrase it or change the style while keeping the content.



So we should attack the problem of proprietary code. Maybe from Right to Repair angle. I believe there should be no such thing as closed source code.


Closed source code is beige corp-speak, its true name is 'malware'.


In Linus Torvalds's book "Just For Fun", there's a chapter about copyright where he presents both the upsides and downsides of it in a pretty much balanced way. I think it's worth reading.


Bit of a false equation to act as though a massive computer system is the same as any individual.

People put code on github to be read by anyone (assuming a public repository), but the terms of use are governed by the license. Now you've got a system that ignores the license and scrapes your data for its own purpose. You can pretend it's human but the capabilities aren't the same. (Humans generally don't spend a month being trained on all github code and remember large chunks of it for regurgitation at superhuman speeds, nor can they be horizontally scaled after learning.)

You can still be of the opinion that this is fine, and I may or may not be fine with it as well. I just don't think the stated reason holds up to logic, or that other opinions ought to "baffle" you.


And GitHub’s EULA gives it the right to train Copilot on public code you host on GitHub.


The issue, though, is not the code I personally upload to my own public repositories, but the code that someone else uploads to Github by cloning my repository held somewhere else than Github.

Personally I have eschewed any personal use of Github since the MS acquisition and only ever use it where that's mandated by a client (so not my code). If you clone my code from elsewhere into a Github repo, that's just rude and contrary to my every intent and wish.

I think it's time to add a "No GitHub" clause as an optional add-on to the various open-source licenses.


So then the person who uploaded your code to GitHub has committed a copyright violation, and I’m sure GitHub would honor a request to remove your code from the model training corpus, as it was illegally uploaded to GitHub.


It’s not necessarily a copyright violation if the license permits copying. Under a permissive license, you are expressly permitted to copy the code and distribute copies provided you comply with whatever conditions the license mandates, without an explicit blessing of the copyright holders. Most popular licenses do not include a prohibition on training AI models. Maybe people should start including a clause.


Many popular licenses include a prohibition on being used to create proprietary software. GitHub Copilot is proprietary.


That's great, but GP's argument was

> Copilot does not 'steal' or and reproduce our code - it simply LEARNS from it as a human coder would learn from it.

Not "the terms of use you agreed to allow them to do it". Different argument with different amount of merit in my opinion


Agreed. I was just saying in the current environment GitHub has that license, nobody else has. So if the courts decide one day that because machines learn differently from humans, they will allow copyright holders to add a license exception that disallows machine training, then GitHub will benefit from this. It’s kind of ironical. What’s best for society is to not have any such law enacted and continue to allow open source models to progress alongside proprietary ones (in addition to more level competitive dynamics on the proprietary side).


They could just train a model on GPL code that can only be used on GPL code.

For MIT licenses that's impossible currently because of the requirement to mention the authors.


Copilot has been caught multiple times reproducing code verbatim. At some point it spat out some guy's complete "about me" blog page. That's not learning, that's copying in a roundabout way.

Also, AI doesn't learn "like a human". Neural networks are an extremely simplistic representation of a biological brain and the details of how learning and human memory works aren't even all that clear yet.

Open source code usually comes with expectations for the people who use it. That expectation can be as simple as requiring a reference back to the authors, adding a license file to clarify what the source was based on, or in more extreme cases putting licensing requirements on the final product.

Unless Microsoft complies with the various project licenses, I don't see why this is antithetical to the idea of open source at all.


No disrespect but I am baffled by your statement that it learns, even to go so far as to say as a human coder would learn.

I don't really want this comment to be perceived as flame bait (AI seems to be a very sensitive topic in the same sense as cryptocurrency), so instead let me just pose a simple question. If Copilot really learns as a human, then why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?


I think the comment was trying to draw the distinction between a database and a language model. The database of code on GitHub is many terabytes large, but the model trained on it is significantly smaller. This should tell us that a language model cannot reproduce copyrighted code byte for byte because the original data simply doesn't exist. Similarly, when you and I read a block of code, it leaves our memory pretty quickly and we wouldn't be able to reproduce it byte for byte even if we wanted to. We say the model learns like a human because it is able to extract generalised patterns from viewing many examples. That doesn't mean it learns exactly like a human but it's also definitely not a database.

The problem is that in reality, even though the original data is gone, a language model like Copilot _can_ reproduce some code byte for byte somehow drawing the information from the weights in its network and the result is a reproduction of copyrighted work.


I see what you're going for, and I respect your point of view, but also respectfully I think the logic is a little circular.

To say "it's not a database, it's a language model, and that means it extracts generalized patterns from viewing examples, just like humans" to me that just means that occasionally humans behave like language models. That doesn't mean though that therefore it thinks like a human, but rather sometimes humans think like a language model (a fundamental algorithm), which is circular. It hardly makes sense to justify that a language model learns like a human, just because people also occasionally copy patterns and search/replace values and variable names.

To really make the comparison honest, we have to be more clear about the hypothetical humans in question. For a human who has truly learned from looking at many examples, we could have a conversation with them and they would demonstrate a deeper sense of understanding behind the meaning of what they copied. This is something an LLM could not do. On the other hand, if a person really had no idea, like someone who copied answers from someone else in a test, we'd just say: well, you don't really understand this and you're just x degrees away from having copied their answers verbatim. I believe LLMs are emulating the latter behavior and not the former.

I mean, how many times in your life have you talked to a human being who clearly had no idea what they were doing because they copied something and didn't understand it at all? If that's the analogy that's being made then I'd say it's a bad one, because it is actually choosing the one case where humans don't understand what they've done as a false equivalence to language models thinking like a human.

Basically, sometimes humans meaninglessly parrot things too.


> The database of code on GitHub is many terabytes large, but the model trained on it is significantly smaller.

This just means it's a really efficient lossy compression algorithm, not that it learns like a human.


> why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?

I've never studied computer science formally, but I doubt students learn only from the CS curriculum? I don't even know how much knowledge the CS curriculum entails, but I don't, for example, see anything wrong with including example code written by humans.

Surely students will collectively also learn from millions of code examples online alongside the study. I'm sure teachers also do the same.

A language model can also only learn from text, so what about all the implicit knowledge and verbal communication?


What they are saying is that if you've studied computer science, you should be able to write a computer program without storing millions or billions of lines of code from GitHub in your brain.

A CS graduate could work out how to write software without doing that.

So they’re just pointing out the difference in “learning”.


LLMs are not storing millions or billions of lines of code, and neither do we. Both store something more general and abstract.

But I'm saying there's a big difference between a CS graduate and some current LLM that learns from "the CS curriculum". A CS graduate can ask questions, use google to learn about things outside of school, work on hobby projects, study existing code outside of what's shown in university, get compiler feedback when things go wrong, etc.

All a language model can do is read text and try to predict what comes next.


We do but we also simulate it doing homework very well.


AI doesn't "learn". It's statistical inference if anything.

If I took two copyrighted pictures and layered them on top of each other at 50% opacity, would that be OK or copyright infringement?

AI models just use more weights/biases and more images (or any input).
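For what it's worth, that 50% overlay is literally just a per-pixel average of the two images. A minimal sketch, assuming Pillow is installed and two same-sized image files with these hypothetical names exist:

    from PIL import Image

    a = Image.open("picture_a.jpg").convert("RGB")
    b = Image.open("picture_b.jpg").convert("RGB")

    # Image.blend(im1, im2, alpha) computes im1*(1-alpha) + im2*alpha per pixel,
    # so alpha=0.5 is the "both at 50% opacity" case.
    blended = Image.blend(a, b, 0.5)
    blended.save("blended.jpg")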


And what is LEARNING in your opinion?


Cambridge dictionary has it as: "knowledge or a piece of information obtained by study or experience".

If I scanned a thousand polaroid pictures, and took their average RGB values and created a LUT that I could apply to any photograph to make it look "polaroidy" - would that be learned? Or the application of a statistical inference model? This alone is probably far enough abstracted to never be an ethical or legal issue. However, if I had a model that was only "trained" on Stephen King books, and used it to write a novel, would that be OK? Or do you think it would be in the realm of copyright infringement?

By your definition anything a computer does means it has learned it. If I copy and paste a picture, has the computer "learned" it while it reads out the data byte-by-byte? That sure sounds like it is "studying" the picture.

"AI" and "ML" are just statistics powered by computers that can do billions of calculations per second. It is not special, it is not "learning". To portray some value to it as something else is disingenuous at best, and fraud at worst.


Your polaroid example would require someone to write code that does that one specific thing. You could also argue that this would violate copyright if it was trained on some photographer's specific unique style, made as an app and marketed as being able to mimic the photographer's style. But in your example you have 1000 random polaroid images of unknown origin, so somehow it becomes abstract enough that it doesn't become an issue.

In your Stephen King example I would say it's still learned, because the "code" is a general language model that can learn anything. It's just that you decided to only train it on Stephen King novels. If you have an image model that trained 100% on public domain images and finetune it to replicate a specific artist's style, I would personally think the finetuned model and its creator are maybe violating copyright.

But when it comes to learning I would say when you write a program whose purpose is to learn the next word or pixel, but it's up to the computer to figure out how to do that, the computer is learning when you feed it input data. It's the program's job to figure out the best way to predict, not the programmer. (it's not that black and white given that the programmer will also sometimes guide the program, but you get the idea)

When you write a program that does one or several things, it's not learning.

I think it's something to do with the difference between emergent behavior from simple rules and intentional behavior from complex rules.


I think you're using fancy language like "general language model" to obscure the facts.

If I created a program to read words from the input and assign weights based on previous words, I could feed in any data. Just like the polaroid example. (I suggested that the polaroid example was abstract enough not to be an ethical/legal problem because I believe it is mostly transformative, unless the colours themselves were copyrighted or a distinct enough work in themselves.)

Now if I only feed in Stephen King books and let it run, suddenly it outputs phrases, wording, place names, character names, and adjectives all from Stephen King's repertoire. Is this a 'general language model'? Should this be copyright exempt? I don't think this is transformative enough at all. I've just mangled copyrighted works together, probably not enough to stand up against a copyright claim.

I think people use AI and ML as buzzwords to try and obfuscate what's actually happening. If we were talking about AI and ML that doesn't need training on any licensed or copyrighted work (including 'public domain') then we can have a different conversation, but at the moment it's obscured copyright theft.


I can agree it's obscure in the sense that we shrug when asked about how it works. If you specifically train a model to mimic a specific style I can get behind it leaning more towards theft, or at least being immoral regardless of laws.

If you train a model to replicate 10000 specific artists, I could also get behind it being more like theft.

But if the intention was to train with random data (and some of it could be copyrighted) just like your polaroid example to generate anything you want, I'm not so sure anymore.

I feel the intent is the most important part here. But then again I don't know the intent behind these companies, and I guess you don't either. Maybe no single person working in these companies know the intent either.

It also gets murky when you have prompts that can refer to specific artists and when people who use the models explicitly try to copy an artist's style. In the case of Stable Diffusion, if the CEO is to be believed, the CLIP model had learned to associate the name of Greg Rutkowski and other artists with images that were not theirs but in a similar style.[0]

Even murkier is when you have a base model trained on public data, but people finetune at home to replicate some specific artist's style.

[0] https://twitter.com/EMostaque/status/1571634871084236801


> If I scanned a thousand polaroid pictures, and took their average RGB values and created a LUT that I could apply to any photograph to make it look "polaroidy" - would that be learned?

You wouldn't. LUT would.


It's data. No one owns data.


Can I have your credit card number, expiry and verification number please? Also your DNA?

Since it’s data that should be cool right ?


Equating human cognition with machine algorithms is the root of the issue, and a significant part of its "legitimacy" comes from the need for "AI" companies to push their products as effective, and there's no better marketing than to equate humans to machines. Not even novel.


It requires abstraction. Something that LLMs are not capable of, beyond trivial amounts.


TRAINING your 3rd eye/branch predictor

    if (nonfree_software) {
        // unhappy path
    }


You can make out the two original copyrighted pictures in that case, and all you did was using 50% opacity which might not be very transformative, so probably?

In my mind (and I suspect in others' too), in the machine learning context, statistical inference and learning have become synonymous with all the recent developments.

The way I see it, there's now a discussion around copyright because people have different fundamental views on what learning is and what it means to be human, and those views don't really surface.


If "like a human" is enough to get human rights then why did I get a parking ticket even when I argued that my car just stands there like a human ? This really isn't as good a defense as people portray. There are a lot of rights and privileges granted to humans but not to objects - we can all agree on that I think.


And if you need a person with supercharged rights and a slippery amount of liability...form a corporation.


There is a difference between a person learning and a commercial product learning from someone else’s work, probably ignoring all the licenses.


To be fair, when a programmer learns from publicly available but not public-domain code, and then applies the ideas, patterns, idioms and common implementations in their daily job as a software developer, the result is very much a "commercial product" (the dev company, the programmer themselves if a freelancer) learning from someone else's work and ignoring all the licenses.

The only leap here is the fact that the programmer has outsourced the learning to a tool that does it for them, which they then use to do their job, just as before.


No, the difference is that OpenAI has a huge competitive advantage due to direct partnership with Github, which is owned by Microsoft. In fact, it's even worse. With OpenAI making money from GPT, Github has even less incentive to make data easily available to others because that would allow for competition to come in. I wouldn't be surprised if Github starts locking down their APIs in the near future to prevent competitors from getting data for their models.

Nobody is arguing against uploading code. It's about Github/Microsoft specifically.


I agree there's a difference in the ease of access, a competitive advantage, sure. And I get that people writing public-source (however licensed) software don't want to make it easier for them (as in, Microsoft) to make money off of "learning" (of the machine type) from it. That's fair.

However, at a first glance, it still feels to me like an unavoidable reality that if you publish source code it'll eventually be ingested by Copilot or whatever comes next.

I mean, for the rest of the content all the new fancy LLMs have been trained with, there wasn't a Github equivalent. They just used massive scraped dumps of text from wherever they could find them, which most definitely included trillions of lines of very much copyrighted text.

In short: not only do I not really see an issue with Copilot-like AIs learning from publicly available code (as I described in the GP comment), but I also think if you publish code anywhere at all it's inevitable that it'll end up in Copilot, regardless of where you host it. If you want to make it more expensive for Microsoft to scrape it, sure, go ahead, but I don't think it matters in the long run.


> However, at a first glance, it still feels to me like an unavoidable reality that if you publish source code it'll eventually be ingested by Copilot or whatever comes next.

I'd be quite careful with this view.

By your logic, it should be ok to take the Linux kernel, copy it, build it, then sell it and give nothing back to the community that built it. Then just blame it on the authors for uploading it to the internet?


> all this discussion on copyrights in the age of AI.

Copyright is a thing; AI does not change that.

> does not 'steal' or and reproduce our code - it simply LEARNS from it as a human

And here we have the central problem, does it act like a human or does it not act like a human? Humans copy things they learn all the time, some of us know various songs by heart, others will even quote entire movies from memory. If AI can learn and reproduce things like humans do then you need to take steps to ensure that the output is properly stripped from any content that might infringe on existing copyrighted works.


There is a definite difference between singing a song while walking down the street and writing down the lyrics, putting it in a database, claiming it’s my content and then selling it on, even if it’s slightly rehashed.


I would have no problem if such AI systems were also completely open source, could be run by me on my system, and came with all the models needed to use them also easily available (again under some form of open source license). I genuinely don't see that happening with BigTech. As such, as a proponent of the FSF GPL philosophy, I have no interest in supporting such systems with my hard work, my source code. So yes, I do consider it stealing: my hard labour in any GPL open source work is meant for the public good (for example, to preserve our right to repair by ensuring the source code is always available through the GPL license). Any corporation that uses my work for profit, without paying me and while undermining the public good that I am aiming for, is simply exploiting me and the goodwill of others like me.


Copilot does not steal. Copilot does not learn. If you want to apply these concepts to LLMs, first prove how an LLM is human and then explain why it doesn’t have human rights.

Rather, Copilot is a tool. Microsoft/ClosedAI operate this tool. Commercially. They crawl original works and through running ML on it automatically generate and sell derivative works from those original works, without any consent or compensation. They are the ones who violate copyright, not Copilot.


Whether an LLM actually learns is completely tangential to the topic at hand. A human coder who learned from copyrighted code and then reproduced that code (intentionally or not) would be in violation of the copyright. This is why projects like Wine are so careful about doing clean room implementations.

As an aside, it seems really strange to invoke "open source ideas" as an argument in favor of a for-profit company building a closed source product that relies on millions of lines of open source code.


It's also fair to say that a lot of this carefulness has probably made life difficult for the developers of Wine, but they wanted to avoid Microsoft's legal team. So they respected the copyright laws.

Here is Microsoft doing as Microsoft does…


I'm in several communities for smaller/niche languages, and asking questions about things that have few sources makes it much clearer that it's not "learning" but grabbing passages/chunks of source. Maybe with subjects that have more coverage it can assimilate more "original"-sounding output.


Plenty of people already argued that LLMs don't actually learn like a human. However, you should keep in mind the reason why clean-room reverse engineering exists: humans learn from source material. FLOSS RE projects (e.g. nouveau) typically don't like leaks, because some contributors might be exposed to copyrighted material. Sometimes, the opposite happens: people working on proprietary software are not allowed to see the source of a FLOSS alternative.


> it simply LEARNS from it as a human coder would learn from it.

It doesn't LEARN anything, let alone like a human coder would. It has absolutely zero understanding. It's not actually intelligent. It's a highly tuned mathematical model that predicts what the next word should be.


I can also learn things with no understanding (like a foreign word); I doubt that would make me immune to copyright?


If you were to learn a phrase that insulted the king in Thai, and said it in Thailand, you would end up in jail. Doesn't matter if you understood what the phrase said. Ignorance doesn't make you immune to consequences.


Your comment implies that we’re in some age of AGI, but we’re not there yet. Some argue that we’re not even close, but who knows, that’s all speculation.

> it simply LEARNS from it as a human coder would learn from it.

The LLM doesn't learn; the authors of the LLM are encoding copyright-protected content into a model using gradient descent and other techniques.

Now as far as I understand the law, that’s OK. The problems arise when distribution of the model comes into play.

I'm curious, are you a programmer yourself? Don't take this the wrong way, but I want to understand the background of people who come to the kind of conclusion you seem to have arrived at about how LLMs work.


> it simply LEARNS from it as a human coder would learn from it

What humans do to learn is intuitive, but it is not simple. What the machine does is also not simple, it involves some tricky math.

Precisely if the process was simple, then it could be more easily argued that the machine is "just copying" - that is simple.

There's a lot of nuance here.

What the machine is doing "looks similar to what humans do from the exterior", the same way that a plane flying "looks similar" to a flying bird. But the airplane does not flap its wings.

> kind of irrational and antithetical to open source ideas

Open source ideas are not the only ideas in town.


Humans don't learn an algorithm by memorizing a particular implementation character by character.


That's all the more reason for the utility of solutions like Copilot? Humans are limited in both time and memory.

Though, GitHub would do well to also bake in appropriate attributions if a significant portion of the generated code is a copypasta.


Neither does copilot.


But it does though. There have been many times where this was the case.


It only happens if you bait it really hard and push it into a corner. That's not representative at all. I use Copilot to write highly niche code that's based on my own repo. It's simply amazing at understanding the context and suggesting things I was about to write anyway. Nothing it produces is just copypasted character by character. Not even close.


As others have pointed out, it means the model contains copyrighted material. So I guess that's totally illegal. Like if I ripped a Windows ISO, zipped it up and shared it with half the world. You know what would happen to me, don't you?


Not the same thing at all. The data isn't just sitting there in a store inside the model that you can query. No-one would be able to look at the raw data and find any copyrighted material, even if all it was trained on was copyrighted code (which I agree is an issue).


There’s a lot of misconceptions here but LLMs and stable diffusion have spat out copyrighted material verbatim.

So that’s not accurate.


What is not accurate? They are still not storing any material internally, even if the patterns they have learned can cause them to output copyrighted material verbatim. People need to break out of the mental model that an LLM is just a bunch of pointers fetching data from an internal data store.


Have a read through other comments on this thread, you'll see some good examples.


And airplanes don't flap their wings, but we still agree that they're flying, just as birds do.


There are people who do it... I personally know a guy with a photographic memory


He doesn't get an exemption from copyright law, or does he?


Humans are intentionally loading up giant sets of curated data, for training purposes, into a supercomputer to produce a model which is a black box, and have provided zero attribution or credit to those who made this work possible. Humans are tuning these models to produce the results you see.

In the case of ChatGPT-x, OpenAI is a company disguised as a not-for-profit, with a goal of producing ever more powerful models that may eventually be capable of replacing you at work, while seemingly not having any plan to give back to those whose work was used to make them insane amounts of money.

They haven’t even given back any of their research. So it’s ok to take everyone’s open source work and not give back is it ?

This isn’t some cute little robot who wakes up in the morning and decided it wants to be a coder. This is a multi-national company who has created the narrative you’re repeating. They know exactly what they’re doing.


"Learning" is a technical term, AI doesn't really learn the same way a human does. There is a huge difference between allowing your fellow human beings to learn from you and allowing corporations to appropriate your knowledge by passing it through a stochastic shuffler.


Individuals can train their own LLMs too.


Copilot is run by a corporation, and the model is owned by the corporation - despite being trained on open source data.

In general individuals will have problems with the first L of LLMs - unless the community invents a way to democratise LLMs and deep learning in general. So far the deep learning space is a much less friendly place for individuals than software was when the ideals of the open source movement were formed.


A full LLM is too expensive for individuals to train, but LoRAs aren't.

There are multiple open source LLMs out there that can be extended.

We can already see it in AI art scene. People are training their own checkpoints and LoRAs of celebrities, art styles and other stuff that aren't included in base models.

Some artists demand to be excluded from base model training datasets, but there's nothing they can do against individuals who want to copy their style - other than not posting their art publicly at all.

I see the same thing here. If your source code is public - someone will find a way to train an AI on it.
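To give a sense of how cheap the LoRA route is compared with training a full model, here is a minimal sketch using the Hugging Face transformers and peft libraries; the tiny "gpt2" base model and the chosen hyperparameters are just illustrative assumptions:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in base model
    config = LoraConfig(
        task_type="CAUSAL_LM",
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["c_attn"],   # GPT-2's attention projection
        fan_in_fan_out=True,         # GPT-2 uses Conv1D layers
    )
    model = get_peft_model(base, config)

    # Only the small injected adapter matrices are trainable; the base weights stay frozen.
    model.print_trainable_parameters()

The printed count of trainable parameters is typically a fraction of a percent of the base model, which is why this kind of finetuning fits on consumer hardware.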


>it simply LEARNS from it as a human coder would learn from it

I thought that was a sarcastic remark, given the capitalization of 'learn', but the 'IMHO' that followed dispelled that impression.

We have no idea how humans learn, and the 'AI' has a statistical approach, not much more than that.


A human who learns to copy code letter for letter does just that: copies code. Same with an AI.

The interesting debate should be about what happens in the gray area, when you read a lot of code and learn patterns and ideas.


Code is, at best, a trade secret (it is also data). Keep it close to your chest, or don't.


But.. to be clear what you can and can't do with certain code depends on the license. Imagine code that is "open source" as in openly visible and available, yet the license explicitly forbids the use of it to train any AI/LLM. Now how could the creator enforce that? Don't get me wrong, I am aware that the enforcement of such licenses is already hard (even for organizations like the FSF).. but now you are going up against something automated where you might not even know what exactly happens.


Potayto potahto. We all know there's a difference between training a machine learning model and learning a skill as a human being. Even if you can trick yourself into believing AI is just kinda like how human brains work maybe, the obvious difference is that you can't just grow yourself a second brain and treat it like a slave whereas having more money means you can build a bigger and better AI and throw more resources at operating it.

Intellectual property is a nebulous concept to begin with, if you really try to understand it. There's a reason copyright claim systems like those at YouTube don't really concern themselves with ownership (that's what DMCA claims are for) but instead with the arbitrary terms of service that don't require you to have a degree in order to determine the boundaries of "fair use" (even if it mimics legal language to dictate these terms and their exemptions).

The problem isn't AI. The problem is property. Ever since Enclosure we've been trying to dull the edges of property rights to make sure people can actually still survive despite them. At some point you have to ask yourself if maybe the problem isn't how sharp the blade you're cutting yourself is but whether you should instead stop cutting. We can have "free culture" but then we can't have private property.


> IMHO desire to prevent learning from your open source code seems kind of irrational and antithetical to open source ideas

You may be right that this is antithetical to "open source" ideas, as Tim O'Reilly would've defined it - a la MIT/BSD/&c., but it's very much in line with copyleft ideas as RMS would've defined it - a la GPL/EUPL/&c. - which is what's being explicitly discussed in this article.

The two are not the same: "open source" is about widespread "open" use of source code, copyleft is much more discerning and aims to carefully temper reuse in such a way that prioritises end user liberty.


> it simply LEARNS from it as a human coder would learn from it.

This is really not how LLMs work.


A key difference is that a company is making a proprietary paid product out of the learnings from your code. This has nothing to do with open source.

If the data could only be used by other open source projects, e.g. open source AI models, I don't think anyone would complain.

You could argue "well, but anyone can use the code on Github" and while that's technically true, it's obvious that with Github being owned by Microsoft and OpenAI being closely tied to it, OpenAI gets a huge competitive advantage due to internal partnerships.


Imagine if folks got royalties on commits, or the language model was required to be open as well.


The company that trains/owns the AI steals the content.


> it simply LEARNS from it as a human coder would learn from it

Does it though? It "learns" correlations between tokens/sequences. A human coder would look at a piece of code and learn an algorithm. The AI "learns" token structure. A human reproducing original code verbatim would be incidental. AI (language model, at least) producing algorithm-implementing code would be incidental.
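To make "correlations between tokens" concrete, here is a toy sketch of the same training objective (predict the next token) using nothing but bigram counts; real LLMs are enormously more sophisticated, but the objective is the same:

    from collections import Counter, defaultdict

    def train(corpus: list[str]) -> dict[str, Counter]:
        """Count, for each token, which tokens tend to follow it."""
        model: dict[str, Counter] = defaultdict(Counter)
        for line in corpus:
            tokens = line.split()
            for prev, nxt in zip(tokens, tokens[1:]):
                model[prev][nxt] += 1
        return model

    def predict(model: dict[str, Counter], token: str) -> str | None:
        """Return the most frequently observed follower of `token`, if any."""
        followers = model.get(token)
        return followers.most_common(1)[0][0] if followers else None

    toy = train(["for i in range ( n ) :", "for x in items :"])
    print(predict(toy, "in"))  # "range" or "items", whichever was seen more (ties keep first-seen)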


If that were true, Copilot would have been scanning Windows and Office source code. But we don't see that.


Nobody wants that.


I want that. I very much want someone to take one of the Windows code leaks, use it to train a LLM, and then make a fork of ReactOS with AI-completed implementations of everything ReactOS hasn't yet finished. Because then we could find out if Microsoft really believes that LLMs are fair use:)


Apes love moralizing and being indignant. This joker wants to share open source code and restrict what other people do with it.


So, like any license except public domain?

Have you personally ever put out something in public domain?


I’ve released plenty under GPL. Not possible to assign to public domain everywhere.


So you restrict what to do with it…


> It means that they have the right to share the code of others on GitHub, as long as they respect the terms of license. This is totally legal. But then, Copilot will be able to analyze the code and violates the license terms, which isn’t.

Both claims here are incorrect even though pretty much everyone gets this wrong. When someone who has obtained the right to redistribute some code only under the GPL uploads it to GitHub, that person (and that person only) violates the terms of the GPL. The GPL requires further redistribution only under its own terms, but uploads to GitHub come with a grant of a too-permissive license that a GPL licensee does not have the right to grant.

When GitHub proceeds to use the uploaded code to train copilot, they (probably) are abiding by the terms of this new license they have been (fraudulently) granted. They are not bound by the GPL, that's not how licenses work: they've got the other one. Now, GitHub has a big weakness here which is that they ought to know they're being granted licenses that the putative licensors have no right to grant. But that still would not make them in violation of the GPL, just of the original copyright.


> They are not bound by the GPL

but they are bound by the original copyright and are in violation of it


Let's leave copilot/AI aside for a moment.

Do you actually have the necessary rights to upload someone else's code to github?

When you upload to github, you give it special rights not merely for redistribution and CI stuff.

You give Github the rights to use the code for other github projects. That alone might not be compatible with some licenses (think GPL virality).

So if your software is BSD or anything without attribution I can probably upload it without problems.

But if your license requires so much as attribution, can I give Github the rights to use the code for any other internal project they might have?

Remember that in Github TOS you give grants for any github service, and in some special cases it requires this to be without attribution.

IANAL but I think I lack the necessary rights here, regardless of copilot.


IANAL but I believe you are incorrect about the rights you grant GitHub. They are still bound by the license you set in your repository, including any copyleft in GPL or attribution requirements. GitHub TOS sections D.4-D.7 spell out the license that you must grant GitHub in order to host code there. “You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time.” Notably, see also section D.3 which states “If you upload Content that already comes with a license granting GitHub the permissions we need to run our Service, no additional license is required.” I would think any open source license would be sufficient license for those things, like storing and displaying code, and I would think this clause is primarily for code without a clear license not already granting these terms.

To your point about redistribution, see this clause in D.4: “This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.”


Yes, the granted rights seem to be mostly about stuff related to repository management, and not e.g.: using the software (except audio/video).

But section D.7 ("Moral Rights") is about waiving moral rights, which at least in Italy is not legal, and then says: "To the extent this agreement is not enforceable by applicable law, you grant GitHub the rights we need to use Your Content without attribution".

It mostly seems like poor wording of that section not being limited to archiving/displaying, but there is a case where you grant Github something without need of attribution.

Which is against most Open source licenses, which means that I can't upload someone else's code.

Again IANAL, and theoretically open source projects want maximum distribution, but I think this does mean that technically I would grant to Github something I can't grant.


> They are still bound by the license you set in your repository, including any copyleft in GPL or attribution requirements.

Then what gives them the right to use the code for creating proprietary software (Copilot)?


> So if your software is BSD or anything without attribution I can probably upload it without problems.

BSD and MIT licenses absolutely do require an attribution.


most BSD versions require attribution, but there are multiple versions of BSD and at least one (the most basic) does not require attribution

https://en.wikipedia.org/wiki/BSD_licenses


I understand the sentiment, but I think it is misguided and therefore - counterproductive.

First, LLMs learn patterns, not just copy and paste. If they generate verbatim copies of any non-trivial-enough part that would be the subject of a copyright license, yes, it would be copyright infringement. Yet, could anyone give such practical examples? And if so, how do they differ from a software engineer who copies and pastes code?

Second, if the code is hosted anywhere else, there is no guarantee that Copilot (or another model) won't learn from that. The only way to make sure no one and nothing will learn from open-source code is to make it as closed as possible.

Third, for me, the crucial part of open-source code is maintenance. GitHub is there and works well both as a platform for creation (I consider GitHub the most productive social network) and an archive. "No GitHub" (even as a mirror) means that the code is likely to be stored in places less likely to engage collaborators and less likely to last long.


> If they generate verbatim copies of any non-trivial-enough part that would be the subject of a copyright license, yes, it would be copyright infringement. Yet, could anyone give such practical examples?

Yes, there are many such examples.

https://twitter.com/docsparse/status/1581461734665367554

https://twitter.com/mitsuhiko/status/1410886329924194309

https://codeium.com/blog/copilot-trains-on-gpl-codeium-does-...


> there are many such examples.

It's always the same two examples, and I would not classify that as "many", especially since that fast inverse square root function has been shown to be on GitHub and other sites countless times with all sorts of different licences (which is wrong, but copilot doesn't seem to do better or worse than humans in this regard).

That codeium.com is just asking leading questions, or the AI equivalent of that.


What's so special about the code here:

https://twitter.com/docsparse/status/1581461734665367554

It implements a common operation in a standard way as far as I can tell. The AI re-implementation is not identical.

It seems more like multiple discovery than plagiarism let alone re-producing a copy without a license.


To be fair that is not verbatim.

It is mutatis mutandis the same but is that a problem? I'm sure many would say so, I'm not convinced.

Ultimately if his code is out there a Google search could bring up a snippet without the license visible and I might copy paste that. The crux is the same code might be presented without context.

Copilot is just a tool and the person responsible for its safe usage is the human behind it.

In my world view, if I copy a picture off Google image search ultimately I am morally the one who infringed copyright not Google.


> In my world view, if I copy a picture off Google image search ultimately I am morally the one who infringed copyright not Google.

I have an idea why, but... why exactly? What about a web scraper (that I made, similarly to that of Google) that downloads images? What if it is randomly downloading images and not intentionally a specific one?


Valid points. But I don't want my code to be used by big corporations and monopolies to train closed source LLMs that they're going to sell. Shouldn't I get to have a say in that?

For example, GPL controls what kind of projects can use my source code. Maybe there could be an addendum to GPL that requires all LLMs trained on the source code to be open source. Sure, that won't guarantee that Copilot-like bots won't be trained using my code. But it does give me a legal framework to stop big corporations from profiting off such Copilot-like bots without making them open-source as well.


A genuine Q: if the LLM was from a purely non-profit company that gave out their AI for free, would you mind your code being used? Would you in fact be proud that it has made a useful contribution? Assuming that the outcome does not affect your income.


Not the original commenter, but if the model was publicly released and open, I definitely wouldn't have a problem with that.


This is the FOSS license I want


> how do they differ from a software engineer who copies and pastes code?

They differ because the author of the code did not agree to their code being used to train an LLM.

> Second, if the code is hosted anywhere else, there is no guarantee that Copilot (or another model) won't learn from that

This issue should be resolved not by author of the code, but by Copilot or any other LLM team.

> Third, for me, the crucial part of open-source code is maintenance.

Even if the author is against it? This is a valid argument, but the result is not very different from pirating software.


You look at the problem from principles, while I look for the outcomes.

As to the third point - well, it is up to the author, and I respect that (regardless of whether I would do the same thing). People have the right to not share it at all, or share it as a copyrighted piece of software, or with any other limitations. Though, all limitations (and copyleft is a limitation) affect its usage.


I get the sentiment, but people can and should do whatever you permit them to in your license. If you don't want your code hosted in one place, say so in your license.


From TFA

> Is this a legal document?

> No, it isn’t. If the project is under an open source license, it means that everyone can share a copy – even on GitHub – of the licensed material under certain conditions. A license restricting this right wouldn’t be open source anymore. However, since GitHub may not respect the terms of licensed code that is hosted on their servers, not uploading the code of others there is, in fact, an ethical choice.

emphasis mine. It's a "please be nice", not a "I want to enforce things"


I don't get this argument. "Here's a legal document we wrote that says that you're allowed to redistribute our code. But aside from it we'd like you to not redistribute it on platforms we don't like for whatever reason. But we won't spell it out explicitly in the document itself because reasons, instead we're guilt tripping everyone who does".

I respect developers right to put any restrictions on the code they share with the world. But I believe it should either be explicit or not restricted at all. Either write the license that says exactly what you want or otherwise don’t shame people into the desired behavior.

Edit: one could even add a more generic statement to the license stating that it’s forbidden to share the code on any platform that would use it to train their AIs per their ToS, so you don’t need to single out GitHub and potentially others in the future.


You don't get ethics and morality not bound by law?


Ethics is subjective, the law less so. Most people don’t find GitHub to be unethical so the author will have a hard time convincing people without using the license terms.


Well, that's the exact reason why they wrote this page: to explain to people why they think GitHub is unethical, and maybe convince them. It's the same as calls to boycott various other companies: they haven't necessarily done anything illegal, but if you convince enough people not to use their products/services anymore, you might make an impact...


The author really doesn't have to convince anyone, "please don't do $X" is more than enough to state their wish for you not to do $X.


i think it is quite scary that so many in here would just not honor a simple "please don't do it."

it is a wish. if someone says "please don't wear shoes in my home" i hope you would honor their very simple and understandable personal wish without setting up a contract for it?

i mean, just be a bit more human, please.


My aunt once told me "Please don't think negatively of religious people."

Stallman would prefer I not use any closed source software to read his blog, including OS, drivers, web browser, etc.

People routinely ignore unreasonable requests. Asking me to not wear shoes in your home is reasonable. Asking me not to give a copy to Joe after telling me I can give a copy to whomever I want is unreasonable.


How is it unreasonable to want you to not host someone's code on one explicit other platform?

We already established no one is forcing you and if you don't respect the author you don't get to be respected for your decision to ignore them and will earn snarky remarks. (and rightfully so, in my opinion)


> Asking me not to give a copy to Joe after telling me I can give a copy to whomever I want is unreasonable.

This is a repetition of your claim, not an argument for it. Counterpoint: It's entirely reasonable to ask you not to give a copy to Joe.


I see it more like an artist telling me “please hang this picture in this orientation” but I prefer it differently and don’t see any reason why me hanging it the way I want in my home affects the author. My copy on GitHub doesn’t detract from their copy.


The author might also be using this as a stop-gap until a FOSS license comes out with similar terms, but doesn't want to make the current license nonfree because that has different complications.


This is neither ethics nor morality, it's just someone's desire. Which is fine, people should have desires towards their society.


There is no law that says you have to say "Good morning", and yet parents teach their children to do so.

How is this case different?


A licensor cannot predict the future. When the GPL was written decades ago, nobody predicted that BigTech would start using it on their servers to offer it as "services", and claim that they didn't need to distribute the source code of the customisations they made because they were (technically) not distributing the software itself but only running it on their servers. Anyone who understood the intent and philosophy of the GPL license understood this as a bogus and unethical argument. But it was believed to be legally tenable (1). So the AGPL license was created to counter this move and preserve the original philosophy of the GPL that users of a software should have access to the source code.

(1) Though I don't know if this has actually been tested in court. Courts in India have more freedom to broadly interpret social contracts like the GPL, unlike the US courts, and a positive outcome in favour of upholding the license even in such cases could be possible.


> Anyone who understood the intent and philosophy of the GPL license understood this as a bogus and unethical argument.

I disagree here. The idea, intent, philosophy is one (crucial) thing, the resulting practical artefact (here the license) is another. It works exactly as it was designed to work.

People/companies modifying GPL software for their own use (internal or external) without redistributing the software itself (so without a requirement to redistribute the code) existed before SaaS took off; at the time, the small scale of this made it a bargain that was "interesting" only depending on one's capacity/hubris to maintain an internal fork on their own.

*aaS hugely tipped the scale and amplified the side effects, but the mechanics are the same.

And yes, that may not have been the original intent, and the AGPL is as valid a license as a reaction to provide a new tool more in line with the original intent, but that doesn't make the use of the existing GPL all within what it actually enables anyone to, invalid or unethical.

(but maybe only in a specific perspective of the framework of the original intent)


There's a ton of stuff people do because they were asked, not because they're legally required.


When you decide a license, you can't know what currently existing or future platforms will some day start to violate an aspect of the license or of copyright itself. Does it make sense to add a retroactive clause, then?


I don’t have any ethical dilemma by posting some open source code on the site I prefer.


It would be greatly improved by some text revisions: "my code", "We", on a generic page without clear authorship or object, on a dedicated subdomain, on a hosting domain, without a signature or contact or any kind of option for a webring.

It took me some time to get that this is a generic call (to be followed and reused), still with no clear ownership, rather than a specific claim to a specific code/project (or is it?).


Ethically, you'd think Copilot aligns with the ethos of open source software. It isn't far off that multiple open source Copilot equivalents get as good as the commercial one. And we are all here for it!


>> Is this a legal document?

>> No, it isn’t.

My limited understanding of international copyright law is that copyright disclaimers/licenses per the Berne Convention are documents stating the author's wishes, no different than this one. When a judge is shown this page where the author states his explicit wish to not have his code uploaded on GitHub, in direct conflict with his wishes stated in the LICENSE.txt file, I don't see how it will hold up in court as free software code.


> If you don't want your code hosted in one place, say so in your license.

Also don't forget to go back in time to let your former self know which hosting providers will violate your license.


There's an argument to be made that a neural net learning from your proprietary "source available" code isn't violating copyright. It's not an opinion I would necessarily trust a judge to adopt (they might, or might not) and hinge my business on, but it's an opinion one can have, so I don't know that github violated any licenses here as afaik there weren't any at the time which specifically stated that <insert definition which can tell what-we-call-AI training and human training apart> is not permitted. At least not until you successfully sue them for it in some jurisdiction.


There have been quite a few cases of the neural nets spitting out the code they've been trained on verbatim, including comments IIRC. They're not just "learning" (if they're "learning" at all).


Well, it's a bit of a unique situation in that they don't directly care about the code being on github, but about the secondary effect that it being on github has: Microsoft violating the code license terms (which are in the license)


I would question why the "AI" community is being left to their own anarchist devices when they are clearly the core of the problem, while we instead argue about licensing pedantry and semantics.


I mean, MS is getting sued over Copilot: https://news.ycombinator.com/item?id=33485544


nice.


It was always pretty easy to violate the GPL and get away with it. Repercussions only happen if you successfully get sued.


I don't think you got the sentiment.

It's pretty clear in the article, I hesitate to just cut and paste it here, but the idea is that Github is not respecting some terms of some licenses, so if you upload another author's code to github, you are exposing the author's rights to github's depredations.


> people can and should do whatever you permit them to in your license

Github has been consistently arguing that they're not bound by the distribution license when training their Copilot model, so I'm not sure what difference that would make.


> I get the sentiment, but people can and should do whatever you permit them to in your license. If you don't want your code hosted in one place, say so in your license.

I mean, even doing that will not protect anybody's code; Microsoft doesn't care, and they might also be scanning gitlab or bitbucket and doing model training on these. The only way to protect your code at that point is to stay closed source. It's as simple as that.

It's no different from all these models trained on content without permission, whether it be articles or photos... until all these corporations are sued or copyright laws change to adapt to ML, they'll train their models on content regardless of its license, since they are getting away with it.


Restaurant owners also didn't care about remunerating composers of the melodies they were playing. Then copyright laws happened.


> people can and should do whatever you permit them to in your license

I agree with 'can', but I find 'should' a weird choice of words here.

I don't think we are better off by phrasing hyper-specific licenses (or laws).


I think this is more likely to actually work. People don't read licenses and there is no CI that automatically enforces them. They might notice this badge though.


On the other hand, it requires only one determined person out of hundreds of thousands who, for whatever reason (trolling, principled open source maximalist, AI researcher, code archiver, etc), decides to mirror all code marked with this badge to GitHub.


Everywhere I've worked in the last few years has license checking as part of CI pipelines. Our builds fail if GPL'd code is detected in some .jar or whatever. (fossology and x-ray both provide this)
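For illustration only, the gate can be as simple as the following Python sketch; it is not fossology or x-ray, just a toy that fails a build when any installed dependency declares a GPL-family license, and the metadata fields it checks are a heuristic:

    import sys
    from importlib.metadata import distributions

    FORBIDDEN = ("GPL",)  # crude substring match; a real tool uses SPDX identifiers

    def main() -> int:
        offenders = []
        for dist in distributions():
            fields = [dist.metadata.get("License", "") or ""]
            fields += dist.metadata.get_all("Classifier") or []
            if any(marker in field for field in fields for marker in FORBIDDEN):
                offenders.append(dist.metadata.get("Name", "unknown"))
        if offenders:
            print("License check failed for:", ", ".join(sorted(offenders)))
            return 1  # non-zero exit fails the CI job
        return 0

    if __name__ == "__main__":
        sys.exit(main())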


Due to their TOS I believe it's illegal to upload any code you don't have full rights for to GitHub (if it wasn't already uploaded by someone who has full rights for it, which isn't the same as it already being on github).

This is even true for e.g. MIT licenses as while they are very permissive they still require a form of attributions which Copilot doesn't provide.

Also, for anyone arguing that such ML models "learn": at least when they are not exactly right-sized, or sized below that, they are basically guaranteed to verbatim encode partial copies of code; it's a fundamental consequence of their design. And while this encoding is "encoded" in some way, it's not transformative under the definition in copyright law AFAIK. I.e. any big model is guaranteed to commit hard-to-trace copyright infringement.


You have the full rights to upload any open source code. The entire point of an open source license is to give others the rights to do what they want with it. Unless the license has a specific exclusion for github or github like services, you definitely have that right.

> This is even true for e.g. MIT licenses as while they are very permissive they still require a form of attributions which Copilot doesn't provide.

those two things have nothing to do with each other. MIT gives me the license to upload. Copilot NOT giving me the license is irrelevant because the generated code isn't distributed under an open source license and thus has no relevance to the discussion of MIT or any other Open Source license.

additionally, requiring attribution and having the right to upload it are separate things. A license MAY require attribution and it may not. Rights can be granted without such a thing. See Public Domain, and CC0 for examples.


no, you agree to give certain rights to GitHub when you upload. Rights you do not have with most open source licenses because they do require attribution.

> generated code isn't distributed under under an opensource license

you still have to comply with licenses, and if Copilot spits out code which contains nearly verbatim code that was MIT licensed, it is not Copilot which needs to grant you the license but the original code owner (or you need to at least properly attribute it, though in the case of e.g. GPL that is not enough)

but through the act of emitting code via Copilot, GitHub does distribute code (without proper attribution), and they do so by getting rights from the uploader to be able to do so. Except that most people uploading do not have such rights for anything which isn't their code.


Most open source licenses do not give you the right to do whatever you want. Especially copyleft licenses.


How has it come to that? The tech industry has declared in turn that privacy is dead, that labor contracts are dead, that tax obligations are dead and now that copyright is dead. Why is the most promising tool towards a better society appropriated in such a destructive manner?

People must separate their fascination with tech as such, from the predominant tech business models that are basically - you can't put lipstick on a pig - parasitic.


It pays far too much and fat wads of cash are in everyone's mind as these debates surface. "Better for everyone", "human equality", "UBI", yada yada. Debaters then like to name-drop "No true scotsman" and "purity spiralling" for either side. It's just that tech marketing claims it's a small group of technical wizards, while in reality it's the largest novel corporate interest of the last fifty years and it hasn't been about the little guy for decades. The industry is practically vampiric on naive hopefuls. Well, at least some of them get paid.


Copyright is dead only for the little guy. The rich and powerful have their money printing machines copyrighted ad infinitum.


What is puzzling is that it's not just the little guy being undermined now. The discussion here is about open source code, but take any domain that makes expert knowledge available online. Say medical or financial advice. We are talking about powerful sectors, if not cartels themselves.


the problem with this approach is that humans suck and this is an invitation for trolls to upload your code to github. i don't even know how you're supposed to solve this.

you could proactively scan github for your code and try to get them to purge it if you find it i suppose, if that would remove your code from copilot. but even that is not a great solution because you would need to prove you're the actual author, and github would probably need to be involved in building a mechanism to do so, but they don't give a shit.

i think the reality is the LLMs have eroded copyright protections and trying to fight it isn't likely to pay off
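
One concrete way to act on the "proactively scan GitHub for your code" idea above is GitHub's code search REST API. The sketch below assumes the /search/code endpoint and its usual JSON response fields; the token, the query string, and the "distinctive snippet" are invented placeholders, and scoping and rate limits may require adjusting the query.

    import requests

    GITHUB_TOKEN = "ghp_your_token_here"  # hypothetical personal access token
    DISTINCTIVE_SNIPPET = '"frobnicate_the_widget_cache("'  # a line unlikely to appear elsewhere

    resp = requests.get(
        "https://api.github.com/search/code",
        params={"q": DISTINCTIVE_SNIPPET},
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {GITHUB_TOKEN}",
        },
        timeout=30,
    )
    resp.raise_for_status()

    for item in resp.json().get("items", []):
        # Each hit names the repository and file where the snippet was found,
        # which is the starting point for a removal request.
        print(item["repository"]["full_name"], item["html_url"])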


I think fighting this is very worthwhile. For example, I think the fight is the reason why The Stack, dataset produced from GitHub, has a form to request to remove code.

https://www.bigcode-project.org/docs/about/the-stack/

I am hopeful that we can make the removal request form the standard industry practice.


do you know if the stack anticipated people who will claim ownership of and request removal of repositories that they do not own? it seems like complying would require github to expend a lot of effort to combat abuse.


The Stack's removal request form is currently processed manually.


oh sorry i misunderstood, i thought this was a project designed to request removal of code from github/copilot


It's very simple. If it's open source and the license allows re-publishing code (with copyright information intact), people can republish the code. You don't get to discriminate, cherry-pick, etc. Depending on the license, there may be lots of rights and requirements. Especially with GPL style licenses. Especially AGPLv3 seems to impose a lot of restrictions. Which is a reason for that license to not be popular with corporations. But even that license allows republishing code as long as you preserve license and copyright information.

This generally is the whole point of open sourcing code: to allow others to republish with or without modifications. If you don't want that, don't open source your code. Github is a perfectly valid and legal place to publish code. Lots of people have done so.

The jury is still out on whether using open source code as training data is fair use or constitutes a copyright violation. But the safest assumption at this point is that it is a combination of fair use and hard to prove that any violation happened to begin with (i.e. good luck stamping your feet in anger in a court room).

Which will no doubt make for some fun court cases but will also take many, many years to come to any kind of conclusion. I would recommend not getting your hopes up on that. Historically, copyright law has not been updated a lot to deal with any kind of technical change. And that's just inside the US. The world is bigger than that of course. A safe assumption is that judges will seek to interpret any new cases under centuries of existing interpretations. Because that's what they do. Any changes to the law would have to go through politicians. And then the interpretation of that relative to other laws is again up to judges. Which is why we have such wonderful things as the DMCA and a few other attempts to close loopholes in copyright law.

Of course time is not your friend here. By the time this rolls through the courts some years down the line, existing practice will have evolved to be hopelessly dependent on language models for just about anything. Including interpreting the law (e.g. ChatGPT seems to be acing exams) and doing all sorts of complicated things in science, engineering and technology. So, any legal outcome that says this is somehow illegal is going to be a combination of unpopular, economically damaging, and therefore unlikely. Think big corporations ki


> Historically, copyright law has not been updated a lot to deal with any kind of technical change.

Ok, not a lot, but it has been updated. Chip masks are a good example: https://en.wikipedia.org/wiki/Semiconductor_Chip_Protection_.... If language models get as important as chip masks, I don't see why a similar update wouldn't be made.


That's a big if. Also, this is nearly forty years ago. With chip designs, there was a clear cut case of people taking each other's designs and some of the parties were billion $ chip companies being very grumpy about other companies taking their heavily patented solutions and copying those. Which translates into a lot of political pressure and lobby power and some political action. I don't see that happening for language models.

Most of the people complaining seem to be a narrow subset of open source developers that like open source but not people doing anything for profit with their source code. I.e. not your typical multi billion dollar companies that are worried about their IP getting stolen. So, not a lot of lobbying power or ability to get organized.


>It's very simple. If it's open source and the license allows re-publishing code (with copyright information intact)...

If it is that simple, then clearly Copilot needs to include the licence of the code when it republishes it. Changing GPL'ed code doesn't make it not-GPL code at any point. If it started out as GPL it is still GPL later down the line when a user sees it in Copilot.


I don't understand why people freak out about their code being used by others when they publicly release it. Is it lack of attribution? Does attribution really do anything if you're not already famous? If you don't want your code known, don't make it open-source.


I think the main complaint behind it is that a massive corporation is using code and blatantly ignoring the software licenses.

It's one thing when individual devs or small teams don't respect a software license, ideally everyone would understand and follow them but that's not realistic. When Microsoft damn well understands the licenses and ignores them to train an AI and sell a product, that's pretty damn egregious.


Microsoft doesn't "ignore" the license, there is a disagreement on what the license and copyright actually mandates. Those are quite different things.

Given the current state of the law and how licenses are written, it's impossible to declare that "yes, this is absolutely illegal". People may not like it, and (arguably) it may go against the spirit, but that is also a very different thing. Right now the situation is just ambiguous and unclear.

All of this absolutely matters, because the solution to "Microsoft is acting illegally" is a lawsuit (which, I believe, is already underway) while the solution to "the license and law is unclear" is different licences and/or laws. The solutions are completely different.


Software licenses have always been a grey area. At best this argument is that it's OK for Microsoft (and others) to abuse licensed content created by others because proving it was technically done illegally is difficult.

When Copilot inserts a code snippet that matches perfectly with one of the samples it was trained on, and that sample had a copyleft license, how is that not a license violation? Or is the main argument that copyleft licenses aren't enforceable in general?
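
To make "matches perfectly" concrete, here is a minimal sketch of one way to test it: compare a generated snippet against a known sample after normalising whitespace, so trivial reformatting doesn't hide a verbatim copy. The two snippets are invented placeholders.

    def normalise(code: str) -> str:
        # Collapse runs of whitespace so indentation and line breaks don't matter.
        return " ".join(code.split())

    def is_verbatim_copy(generated: str, sample: str) -> bool:
        return normalise(generated) == normalise(sample)

    licensed_sample = """
    def clamp(value, lo, hi):
        return max(lo, min(value, hi))
    """

    suggestion = "def clamp(value, lo, hi):\n    return max(lo, min(value, hi))"

    print(is_verbatim_copy(suggestion, licensed_sample))  # True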


> Given the current state of the law and how licenses are written it's impossible declare that "yes, this is absolutely illegal."

This is a reason for Microsoft (which you have given to them) that explains why they're ignoring the license, and why you think it's fine. It's a common refrain: "Everything is so complicated." It doesn't change the fact that they're ignoring the license.


The argument boils down to "we think it's fair use". This has been explained quite a few times. Will that hold up in court? I don't know. Should that hold up in court? Each can have their own opinion on that. But it's definitely been explained, as well as discussed at great length many times before on HN.


The issue I have here is that if these kind of criticisms start to take hold, it will perfectly pave a path for these same corporations to have more power handed to them. If more laws get passed requiring stringent restrictions on training data, tests to prove that copyrighted works cannot be produced by your model etc, this makes open source models which could potentially compete with these corporate models much more risky and expensive to develop.


Well I wouldn't actually mind it being more difficult or impossible for these LLMs to be developed at all, but that's a whole different discussion!


It's ironic that the movie studios want license agreements for content they own to cover distinct uses (streaming vs DVD's for example), yet they expect actors to agree to allow their voice, likeness, mannerisms, etc. to be usable by an AI for future projects (only the studio's own projects of course!).

If I own "the thing", I want you to pay for each new, distinct kind of use for it. But if it's your thing I want to use, I want you to have a permissive license.


The ship has sailed and the cat is out of the bag. By the time any of this shakes out everyone will be using copilot and the lawyers will have ground any argument into dust. Good luck putting it back in the bag.

After you have cleaned everything off of GitHub and separated yourself from the ecosystem it (AI bots) will be everywhere and your code gobbled up again.

The only thing that is worth a fuck is a working product. Not your boilerplate or fancy snippets that you want to claim ownership over so as to stop the evil Microsoft from benefitting the community.

Are we devs or luddites?


I think it's very common for developers to overestimate the value of their own code. You see this in the video game community, where people are afraid to open-source their Minecraft plugin because they think it'll be "stolen". They choose the future where nobody builds on their plugin (and no amount of attribution will ever feel like enough, so there's some level of "theft" felt by the original author).

Most open-source projects are not valuable, and even if they are popular, they could be easily replaced in a few days by another programmer. And usually the license doesn't matter, because nobody forks the project anyway, all development is done in a central repository and the project owner signs off on all of it


Thanks to GitHub code search, I found at least one project with this badge that someone decided to upload to GitHub anyway.


GPL and AGPL allows you to add more clauses to the license if they don't clash with any existing ones. The FSF should certainly share some guidelines on how we can include and extend the license through an additional "anti-ML / anti-AI" clause.


There's no clause you can add to A/GPL that will help with this. You can't forbid use cases because that would clash with the license, and Github already blatantly ignores the attribution clauses, so there's nothing you can do there either.


GitHub’s EULA gives GitHub permission to train Copilot on public code you host on GitHub regardless of the license you have chosen for that code.

Even without this, in terms of copyright, since Copilot doesn’t do what your public code does, and it only uses your code to train, it is a transformative use, and would be fair use. It’s possible that a court case will find otherwise, but I think that’s unlikely. The only case I think it will become disallowed, is if Congress passes a law about it.

If Congress does pass such a law, GitHub’s market power in this domain only goes up, since the EULA gives it the covenants.


An EULA doesn't trump copyright laws and spitting out code it was trained on is clearly not transformative.


Spitting out lines or paragraphs from a repo is most likely fair use.

Unless you can reproduce a substantial portion of a repo, I think it’s going to be an uphill battle to argue it isn’t fair use. Though I suspect Copilot’s suppression feature will make doing so impossible.


Assuming the reports of it producing the fast inverse square root function from the Quake 3 engine, comments included, were true, spitting out the whole function doesn't look OK to me.

Either way, the whole copilot thing smells of 'the issue of copyright infringement is more copyright infringement' to me.


Oracle would like to have a word..


But not every author will have agreed to the EULA, if the project includes code by people not in GitHub. E.g. if there is a GitHub mirror of a project that is not hosted by the author, if a project received a patch via email instead of a PR, etc.


That is a very good point, and perhaps could be used as a starting point for a license clause to restrict hosting in places whose EULA doesn't respect the true intent of the license.


I think it is ethical and reasonable to include clauses in the A/GPL restricting where you can host the source code if the hosting service has opensource hostile terms in their ToS. US case law already allows this by treating such license as a "new" license different to the GPL.

(And I don't see why we cannot add a clause saying that the source code is only meant for human developers and use of the source code in any machine learning system or to train any AI systems is prohibited without explicit case-by-case permission).


It may be ethical and reasonable, but that would make the license incompatible with A/GPL and a non-free license according to FSF.


Has FSF actually said this? I don't see how it makes it a non-free license - the original intent behind the xGPL is to ensure that a user of the software also has the access to the source code, along with the knowledge of who created it (attribution). This is to protect our right to repair. So if any part of your source code is used by someone else, even by an AI system, to generate a software, it should also inherit the same viral property of the license - be licensed under GPL with attribution. If it does not, it is similar to a non-free software using your GPL code in the project, which is already prohibited by the license.


well, the FSF certainly is looking at it, but it is a new area of copyright that is unresolved and it's not clear what to do about it, or whether it actually infringes copyrights, etc.


Do you happen to have a link to where the FSF said they're working on it? I would be interested to know what legal status they speculate this might have, as in, reading a more informed opinion than random comments here that mostly just assertively assume one way or another.


This is misguided. What you want is adding a special clause to your license that disallows usage for training LLMs. Whether the code is on GitHub or not, it’ll be used to train models if it’s publicly available and the license allows it.


That would make the code not Open Source.

What many people actually want is for LLMs to respect Open Source licenses, and propagate those licenses to the derived works they create.


> That would make the code not Open Source.

So? Maybe making so much code open source without any restrictions was a mistake in the first place. I know that I don’t want trillion dollars megacorps benefiting from my free open source code in any way. That would include LLM training.


The contents of the license are irrelevant, because such training is not being done under that license at all (or else it would already be violating it, failing attribution at the least—and it seems likely to be fundamentally impossible to comply, just like a human could not comply if human learning was subject to such licenses). It’s depending upon copyright restrictions not applying to it (under “fair use” doctrine).


Ah yes. More "Free" as in "Do As You're Told" from people who call their source more open than anyone else's. Coming up on a decade later, and if anything we have more to learn: https://marktarver.com/free-as-in-do-as-your-told.html


I have a simple question: why does AI get unlimited training data, but human beings can't have a universal library, without limits?


The people with the money run the AI so they get to selectively enforce the rules.


I think of large models, ones such as CoPilot, as lossy compression with content addressable retrieval. If you type the first few parts of some content it has stored then it will retrieve the rest for you.

The blocks retrieved are very small and many of them occur frequently. “if” followed by “(“ for example — hardly worthy of copyright, but we also know that they were literally taken from copyrighted material.

(I don’t think the model starts out with any existing knowledge of syntax / grammar of, say, Python?)

Even if some of that material was public domain, a lot of it wasn’t and at best requires attribution; at worst, full licensing conditions.

To put it another way: it doesn’t matter how many vegetable ingredients they throw into the sausage or how elaborate the sausage making machine is: if they put pork in, the stuff that comes out ain’t vegetarian.
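
For what it's worth, the "content addressable retrieval" analogy can be made concrete with a toy lookup table: the first few tokens act as the address, and the stored continuation comes back. This illustrates only the analogy (a real model has no such table; the snippets are invented).

    TRAINING_SNIPPETS = [
        "for (int i = 0; i < n; i++) { sum += a[i]; }",
        "if (ptr == NULL) { return -1; }",
    ]

    PREFIX_LEN = 4  # number of leading tokens used as the "address"

    index = {}
    for snippet in TRAINING_SNIPPETS:
        tokens = snippet.split()
        index[tuple(tokens[:PREFIX_LEN])] = tokens[PREFIX_LEN:]

    def complete(prefix: str) -> str:
        """Given the first few tokens, 'retrieve' the memorised continuation."""
        key = tuple(prefix.split())[:PREFIX_LEN]
        return " ".join(index.get(key, []))

    print(complete("for (int i ="))  # -> the rest of the stored loop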


> But then, Copilot will be able to analyze the code and violates the license terms, which isn’t.

I'm not sure about this. In most cases the result would be similar to a human reading a lot of open source and later, when writing, using patterns that they'd learned. It's only in the edge cases where there's clear 'plagiarism' on a niche prompt that it would be problematic. A more direct solution isn't to take everything off GitHub, but rather to not allow Copilot to do near-literal copy/paste.

If we moved open source to Bitbucket, there's no guarantee the same thing wouldn't happen there as with Copilot. Attack the problem directly.

A way to think of this banner is that of signing a publicly visible petition to make Copilot behave as humans abiding to licenses do.
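
As a sketch of what "not allowing near-literal copy/paste" could look like (this is not how Copilot's actual duplication filter works, and the threshold and shingle size are arbitrary assumptions): suppress a suggestion whose token 5-grams overlap too heavily with a known public snippet.

    from typing import List, Optional

    def shingles(code: str, n: int = 5) -> set:
        tokens = code.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def too_similar(suggestion: str, known: str, threshold: float = 0.6) -> bool:
        a, b = shingles(suggestion), shingles(known)
        if not a or not b:
            return False
        jaccard = len(a & b) / len(a | b)
        return jaccard >= threshold

    def filter_suggestion(suggestion: str, public_corpus: List[str]) -> Optional[str]:
        """Return the suggestion only if it is not a near-literal copy."""
        if any(too_similar(suggestion, snippet) for snippet in public_corpus):
            return None  # suppress, or attach attribution instead of suppressing
        return suggestion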


> Even if a project is not hosted on GitHub, other people have the legal right (depending on the license) to redistribute the source code. It means that they have the right to share the code of others on GitHub, as long as they respect the terms of license. This is totally legal. But then, Copilot will be able to analyze the code and violates the license terms, which isn’t.

While encouraging people to not distribute code via Github may mitigate the issue some, the actual issue is how Github has mass-automated the process of violating open source licenses. Github should pay a fine for every suggestion Copilot produces that violates a software license, plain and simple. Don't blame the people that unknowingly upload code to the training dataset.


The business model for disruptive "innovation":

Weasel wording to evade legislation. It's not an unlicensed taxi, it's ride sharing. It's not an illegal hotel, it's couch surfing. It's not code licence infringement because it's learning.

It's lawyers finding loopholes for finance to avoid expenses that gives them an edge over the poor suckers who play by the rules. The tech is just a tool to this purpose.

And most people forget TANSTAAFL. The costs for the cars and their infrastructure, the load tourists put on a place, and the effort for writing the code are still there. The "innovators" just found a way to make somebody else pay for what should be their cost center.


Why is HN so defensive against Copilot but has a completely different opinion about MidJourney / StableDiffusion? Both are generative tools; when they generate a piece verbatim, it only means overfitting / over-training on that particular example.

The tone around one of these tools is hypocritical. When it comes to digital art, the general sentiment is that it's inevitable and artists need to up their game. This sentiment is not being repeated for code generators.


Because digital art is trained on the end result. The equivalent would be learning to write software by looking at compiled binaries.

Learning from source code is like learning from Photoshop files or Logic projects. Artists usually do not share them. In music, not even the labels get anything other than the final mixdown (the music equivalent of the binary); the stems or projects remain with the studio - just like the source code in software projects tends to remain with the studios.

In software, we started showing and sharing source code under the assumption that it makes other humans making software better. People still rarely do this in music, as holding up industry secrets is deemed more important than fostering a community of personal growth.

We don't open source our code so an AI can grow. We do it so that humans can grow. There's no need to publish our code to distribute our final binary. This breaks the common understanding of why it's a good idea to share your source code, and as a result, people might become more protective of their code again.

In many ways, your comment is similar to all the comments saying "but you open-sourced it, surely you must know that people will then put it in their commercial code?"


> We don't open source our code so an AI can grow. We do it so that humans can grow.

I open source my code so that programmers and users can grow. I don't care if that programmer or user is a meatbag or a machine (or both! or neither!).

What I do care about is that the license terms of said code are respected. The vast majority of my code may be under something permissive like MIT or (lately) ISC, but I do make giving credit where credit's due a condition of using the code I've written, for good reason.

That's where tools like Copilot make a misstep: by ignoring the conditions I've placed on the use of my intellectual property. Plagiarism is plagiarism, regardless if it's a human or AI or dolphin or Martian or whatever doing it.

That's also where tools like Copilot differ from e.g. StableDiffusion. AI-generated art doesn't (usually) involve copying and pasting snippets of existing artwork into a new work the way Copilot has been demonstrated to do on multiple occasions.

(My other "problem" is that I can guarantee Microsoft will assert double-standards when it comes to Copilot infringing on e.g. the GPL v. Copilot infringing on Microsoft's own EULAs - and I really really really want to see that happen via someone tricking Copilot into ingesting the Windows source code and vomiting that into Copilot users' IDEs verbatim)


People share art because they want to help other humans grow too, not the AI.

Also, the code is the end product. Similarly, artists don't share their project files because that's not the point. This is like complaining that programmers don't share their vim macros.


> People share arts because they want to help other humans grow too, not the AI.

Hold on just a second here. How do you know what "people" "want"? Who put you in charge of speaking for them?

Yes, there have been some loud complaints, but given that millions of people are involved, the overwhelming majority of whom haven't expressed an opinion one way or the other, I think it's a bit premature for this kind of blanket statement.

Secondly, there's a difference between what people want and what people are legally entitled to. If you ask people if they want a million dollars, most of them will say "Sure!". That doesn't mean they're going to get it.

Existing copyright controls the making of copies. That's it. There's a fudge factor in there called "fair use" that controls whether or not something constitutes an infringing copy.

Whether AI training data falls within that or not is going to have to be decided in the courts or by some type of government action. It's clearly not an exact copy...the actual pixels in the original work aren't anywhere in the database. But is what is in there close enough to be considered a "derivative work"? I don't know, and neither do you.

Again, it's way too soon to be making blanket statements on the issue.


More important than an opinion of a court is the opinion of lawmakers. After all, copyright laws have a purpose and AI subverts that purpose. Lawmakers will have to decide whether the benefit of AI to society is greater than the benefit of having works published. We will see laws adapted accordingly.


> After all, copyright laws have a purpose and AI subverts that purpose.

While it may differ in other countries, in the United States the purpose of intellectual property laws, as expressed in the Constitution, is to "To promote the Progress of Science and useful Arts by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries".

Enriching the copyright holders (on the rare occasions that actually occurs) is a secondary consequence, not the prime purpose.

Does AI "subvert" "promoting the Progress of Science and useful Arts"? I don't think so. Quite the contrary... I think it advances the progress of science and the useful arts, if anything.

It's pretty well established that a description of a copyrighted work is not protected, even an extremely-detailed description (see, for example, the way that Phoenix assigned one team to write an extremely-detailed description of the IBM PC BIOS, then gave that description to a second clean-room team that hadn't seen any of the actual source code. The second team then produced a clone of the BIOS that could be sold without paying IBM anything).

The data stored in these models seems more like a "description" rather than a "copy" to me -- though, of course, there's no guessing what a court or legislature will decide.


> Does AI "subvert" "promoting the Progress of Science and useful Arts"?

Absolutely, it may. That was my point that this is exactly to be observed and decided.


> Does AI "subvert" "promoting the Progress of Science and useful Arts"? I don't think so. Quite the contrary... I think it advances the progress of science and the useful arts, if anything

It does subvert that purpose to the extent that it makes some people no longer willing to share their works. The entire purpose of copyright is to encourage the sharing of works.


Do you get access to the raw model to advance the science with? Or the watered-down version?


Okay, I shouldn’t have used “want”. But helping others and collaboration is not something alien in the art world. Artists publish their work almost for the same reasons programmers do:

1. To showcase their work

2. To get feedback

Also, I don’t think it’s fair that we have to wait for the majority response before we can form an opinion of whether or not something is good for them. Do we have to ask millions of people if they like it if their health benefits, social security or left thumb gets removed before we can say that they definitely won’t like it?

I think it is pretty safe to say that artists don’t like having their whole livelihood upended overnight. (Inb4 get better jerbs)


When you want to help other humans grow in music production in an open-source way, you have to share your studio project file. Try finding some of these... you'll find YouTube videos where people walk through aspects of their projects, but you won't find the actual studio project file.


Billie Eilish famously released her files for her first hit song and it includes a bunch of takes etc that they didn't use.

Apparently it's directly available in Logic (or was)

https://old.reddit.com/r/Logic_Studio/comments/gic2r2/logic_...


> We do it so that humans can grow.

I think you’re being a little too altruistic. When I publish code, it’s not about helping people grow. It’s so that others can see the code they are using and that if there is a bug or mistake, they can correct it. Hopefully they share that fix back, but it’s not required.


I think that people do it for a whole bunch of different reasons. Helping others is one of them, yours is another. But I remember that back at the start of code-sharing, the usual reason people gave for doing it was to help advance overall knowledge.


>> "but you open-sourced it, surely you must know that people will then put it in their commercial code?"

this is a general problem in open source. And with the obfuscation in an AI it is just unfair. If AI / Copilot referenced the source in a nice way, then I wouldn't mind.


Especially given the history of the last 30 years where major open source licenses carefully fought over just how much "commercial code" could use the works involved.


Since you mentioned music, can we draw any parallel with the recent Ed Sheeran case? You can't copyright a chord progression, or a rhythm. So can you claim license violation for code which is similar in how it arranges standard design patterns and well-known language idioms? Unless you can find a line-for-line match in a substantial portion of the code, I don't see how you can make a claim for license violation, whether a person or an AI wrote the code in question.


The problem is not "training using GPL", it's stripping the license and providing the resulting code to anyone, for any purpose.

Copilot can provide very specific implementations of problems with "in the style of $NAME" prompts. Consider a case where you put your life's work out as GPL licensed code, and someone can reproduce + adapt your highly optimized matrix multiplication code with a simple prompt, without the license. Even if it's not a "textbook license violation", you're lifting someone's GPL licensed function and landing it in your codebase without its license. If your codebase is not licensed under the same GPL version (or later, if the repo allows), it's both unethical and a license breach at the same time. Adaptation of the code doesn't matter.

The same is true for stricter, source-available licenses. Their source is public, but not open to be reused. What will happen if you include a function derived from a codebase with one of these strict licenses, with or without knowledge? You're again in a dangerous grey area from both a license and a legal perspective.

The issue we discuss is neither straightforward nor simple to navigate. I left GitHub because of this, and may tag my repositories with this badge.

Open source means nothing if copyleft is taken out of the picture, and licenses are simply ignored.


"The problem is not "training using GPL", it's stripping the license and providing the resulting code to anyone, for any purpose."

This.

Since Copilot arrived I thought that most open source developers would be fine with it if github simply even tried to acknowledge the original licences in any way.

I personally would be ok with even a very indirect aggregate group based thing that should be no burden at all for github. They make a big list with everyone's name in it and call it the copilot contributors, and provide some kind of page for it, then when copilot spits out code, it includes a link to that page, and/or a user includes that in the credits/authors for their project like any other credited source.

No excuses about how impractical it would be to cite 3 other authors for every line of output.

But they don't even do that tiny bit. They don't try and fall short, they don't even try. But they still take the goods. The goods are already free, and yet they still manage to steal them.
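
For illustration, the aggregate-attribution idea above could be as small as this sketch: attach a single link to a (hypothetical) "Copilot contributors" page to every emitted suggestion, so downstream users can carry it into their credits. The URL and function are invented.

    CONTRIBUTORS_PAGE = "https://example.com/copilot-contributors"  # hypothetical

    def with_aggregate_attribution(suggestion: str) -> str:
        """Prepend an aggregate attribution comment to a generated code suggestion."""
        return f"# Portions derived from public code; contributors: {CONTRIBUTORS_PAGE}\n{suggestion}"

    print(with_aggregate_attribution("def add(a, b):\n    return a + b"))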


> We do it so that humans can grow.

When was this decided? I don't remember any licences or foundational essays saying this.


> People still rarely do this in music, as holding up industry secrets is deemed more important than fostering a community of personal growth.

This is absolutely not the reason and it's surprising to read this opinion stated as a fact.

First of all, not revealing multi-tracks doesn't prevent personal growth. The human ear is pretty good at discerning individual elements of a mix. It's all in the final product. If you can't learn from it or can't hear something, either it's an unimportant part of the mix, your ear is not yet good enough to distill any meaningful information from the balances of the mix in question, or it's just not a good mix.

Secondly, multi-tracks are not the source code. Putting it in relevant terms, it's just the assets. What's important are processes and artistic decisions made while processing them and combining them in their final form.

Not revealing stems/multi-tracks is not a matter of gatekeeping, it's a safety measure to prevent unauthorized use of individual composition elements and creation of "alternative" mixes which break artistic integrity of the piece.

Music is art (at least a considerable portion of it), and an artist needs as much control over the final product as possible. Mixing is a part of it.


Gatekeeping isn't boolean, it falls on a spectrum. If your rationale about preventing un-'authorized' use is in fact true, the process must effectively be keeping some notional gate at least partly closed, even if it's only closed to people you (or maybe even the original artist) consider to be acting 'unscrupulously.'

If musicians' priorities were for the world to share in their musical indulgence not just in listening but also in like-minded creation, they would release as much of their mixing tools/backgrounds/stems/etc as possible. Since the capitalist framework is a foregone conclusion, they need to keep some secrets in their sauce to produce scarcity to justify a price for their labor so they can eat... Because the farmers and butchers all keep gates of their own kind, too :)


> If musicians' priorities were for the world to share in their musical indulgence not just in listening but also in like-minded creation, they would release as much of their mixing tools/backgrounds/stems/etc as possible.

They do. For a price. When you know how to ask.

The thing is, if you really want to « like-minded create », you do not need, not even want their stems (you can find most of them anyway online, if only for learning).

If your thing is not « consumption » but creation, your own voice/process/way matters more (and is more fun) than starting from the studio separate tracks.

As for the music _business_ side of things, attention (hence scarcity management) and the availability/capacity to provide are the two main levers you have if you want to live from it. As with any other business, actually.


I don't really get the point you're arguing for.

I disputed a specific mistaken proposition, that "holding up industry secrets is deemed more important than fostering a community of personal growth". It's a mistake on several levels: there's no dichotomy as it's presented in this sentence, and the reasons for not providing multi-tracks and project files are different.

> If musicians' priorities were for the world to share in their musical indulgence not just in listening but also in like-minded creation

So? They have other priorities so this is not what they do. Teaching is an industry that's only an adjacent and derivative activity to the cultural sphere of arts. It's of no concern for the artist, unless they choose to capitalize on their skills/knowledge/experience.


I'm agreeing with you that there are many reasons why the music Industry doesn't share stems but also concluding that many among those could be rationally considered gatekeeping by various groups. I'm not disparaging the industry as a whole or the individuals who comprise it, but the reality is simply messy. Many signed artists I'm sure wish they could distribute their stems or more of their process, but are hindered/prevented from doing so by labels. Seems close to what gatekeeping means to me, but that's just like my opinion, man...


> The equivalent would be learning to write software by looking at compiled binaries.

Only if Copilot was issuing compiled binaries. In this case the end result is the code.


To be fair, artists share lyrics, since everything else can be easily scanned by ear. Negatives are frequently the only source. You do not need any negatives to recreate a photograph. Just like you do not need masters or sketches to recreate a song or a popular painting. Most modern photographs are not even produced by photographers - the camera would do all the work, your job is to pick one frame from the 9528 you took while your finger got stuck on a button. Moving code somewhere else or omitting the source would not bring back the dead ethics of a greed-driven modern society, and definitely cannot slow down reversing in any way.


Code is the end product. We store code in our repos, not binaries.


> The equivalent would be learning to write software by looking at compiled binaries.

Challenge accepted?

With the speed at which things are advancing....


Because HN doesn't have opinions on things. You're reading the opinions of individuals who frequently disagree with each other, and the groups of individuals commenting differ from post to post. Why does this need explaining so often?

FTR I personally think that training commercial ML algorithms on bulk data produced by people that haven't released it to the public domain or licensed it for such use is ethically dubious (regardless of the legality) whether the data consists of code, art, metrics or anything else.


HN has an average opinion. If 90% of comments on one article say capital punishment is bad and 90% of comments on another article say that paedophiles should be lynched then the overwhelming likelihood is that they are being hypocritical. The idea that it happened to be a completely different subset of HN's audience commenting on each case is statistically implausible.


> The idea that it happened to be a completely different subset of HN's audience commenting on each case is statistically implausible.

This doesn't seem obvious to me at all. It seems not only plausible that this could be the case, but unsurprising.

Almost nobody here reads every story in the feed. Any given story on HN is going to be read by a collection of people that differs from story to story. And most of those people won't make any comment whatsoever. It seems unsurprising to me that a story on, say, capital punishment could get a large number of people who are passionate about that topic, and a story on pedophilia could get a large number of people who are passionate about that topic, and the majority of commenters in both could hold opposite opinions without even one of them being hypocritical.


> Because HN doesn't have opinions on things.

HN is actually one of the most homogeneous public communities I've seen.

And judging by how few downmodded posts are around, self-censoring.


Don't confuse basic manners with homogeneity. Just because the rest of the Internet is a dumpster fire, doesn't mean people posting here have limited range of views.


HN sprouts weirdly localized hiveminds - every post is its own little hivemind, where a "correct opinion" is formed, and opposing opinions downvoted and/or ridiculed.

But then the next day a different post on the same topic could have an entirely opposite hivemind, often for no discernible reason.


> opposing opinions downvoted and/or ridiculed.

I'm not so sure this is a huge effect. I often express "incorrect opinions", and get some amount of downvoting or ridicule, but not usually to a significant degree. When I've had my comments dogpiled, I can see that it happened because of how I stated my opinion more than because of the opinion itself.


It often depends on the first comment which gets traction (>5 upvotes)


For evidence, the occasionally highly-upvoted Jacobin Magazine here at ground zero of popular capitalist discourse: https://news.ycombinator.com/from?site=jacobinmag.com / https://news.ycombinator.com/from?site=jacobin.com

Sometimes the same story will be flag killed one day, front page the next.


It's almost as if calling HN "homogeneous" is incorrect.


>And judging by how few downmodded posts are around, self censoring.

Look in a thread where China is mentioned and you'll see the real face of many HN'ers - if they aren't removed before you see it. Luckily the MOD'ing is pretty good.


I don’t understand. China is just a country. Real face? Usually I feel like the problem with HN is that every one assumes they are right or that everything is mostly subjective except the few things they view as objective. Then if you aren’t civil you are somehow wrong. Finally, if you’re espousing far left or far right opinions, you get a bad reception.

Unfortunately since most people aren’t experts on most things, this turns into a bunch of fugazi. People who are financially or academically successful in business or engineering or marketing think they understand politics too.

This is what happens with topics like China. I don’t care about China much. So I end up defending it because the negativity towards it by the Global North is staggeringly inconsistent.


"Everyone I know who's right agree with me!"


Okay. Anything of substance?


It was just a humorous way of agreeing with the point you were making. Sorry to bother you.


Interesting. How were you agreeing with me? This doesn’t make sense.


> Why HN is so defensive against copilot but have a completely different opinion about MidJourney / StableDiffusion?

I've seen this question posted on HN so many times in defence of Copilot but I have yet to encounter people who fit what you're describing: people in general raise a lot of issues with MidJourney/StableDiffusion/&c. too.

HN is particularly defensive when it comes to Copilot because Copilot deals with source code, which is central to the daily vocation of a large portion of HN's readership. HN focuses on Copilot because they're subject matter experts in what Copilot trains on.

Also - as mentioned in siblings - while criticisms of training material sourcing by various ML models is common, there are quite a few details that makes Copilot's sourcing that little bit worse.

> it's inevitable and artists need to up their game

If this is your answer it seems you didn't even bother to read the title (never mind the article). There are two issues with generative AI to discuss: input and output. Your answer addresses output - i.e. fear that what is produced may compete with humans. This article, and the vast majority of HN criticism of Copilot, exclusively addresses input: objection to copyrighted works being used to train models without author consent. These are two entirely separate discussions.


> Why HN is so defensive against copilot but have a completely different opinion about MidJourney / StableDiffusion?

Is it, though? My impression is that HN opinions on possible copyright issues are generally mixed for both, and mostly dismissive about the job replacement stuff for both.


Maybe because the main users of HN are software developers and their work is used to train Copilot while probably a minority is creating images used to train MidJourney / StableDiffusion. A question of whether one's own work is used for this purpose or whether one is only a beneficiary.


> Why HN is so defensive against copilot but have a completely different opinion about MidJourney / StableDiffusion

There's a mismatch between the producers of the content used to create the models and those using it. At the moment it's mostly software developers who are using the output of Copilot, whilst it's mostly non-artists who are using MidJourney/StableDiffusion.


It's NIMBYism for jobs. We need a new acronym for it.

My effortless attempt: OMIJIB(Only My Information Job Is Brilliant)? O-My-Jib


This is pretty brilliant after reading all the responses to why HN is so obviously anti-specific things more than others.


> This is pretty brilliant after reading all the responses to why HN is so obviously anti-specific things more than others.

I can’t handle hearing another person say “HN is a big place”. Sure. It’s also incredibly obvious how homogeneous it is and how HN essentially forces you to toe the line.


HN has never once forced me to toe any line.


Okay thanks. Anything of substance?


I made a reasonably substantial point. You asserted that "HN essentially forces you to toe the line" and I brought up a counterexample to demonstrate that your assertion might not be entirely correct.


The person who responded to you showed the obvious logic.

I’m a leftist in tech. HN is not the place where you can publicly behave as an assertive leftist otherwise would.


You can toe a line without toeing a line. That's the group involved in enforcing the line.


Because MS owns github and MS therefore WILL abuse whatever goes there. Not 'if' but 'when'.

Source: over 20 years of working with their products and seeing the way they treat everyone else


Because people are protective of their own ground, but open to conquering others' ground.

There are probably not as many artists here as there are coders. So the general opinion will be against all tools which harm the coders' ground, but more open to the benefits AI will bring for laymen, when they need not fear harm from it.


Because HN doesn't draw, obviously.


Because code is “my stuff” and Copilot changes our jobs, while generated art doesn't affect me and is cool.

There’s no real difference it is just wagon circling once a thing threatens somebody personally.

The tech community has been one of the big proponents of “data wants to be free” and “the entire idea of intellectual property is bs” for a long time and that’s suddenly reversed.


I dunno if HN has one particular opinion on _anything_. The echo chamber isn't quite that strong here. However, I'd posit that the main difference is that there are more software engineers on HN than digital artists.


HN is not a person


I’d say it’s because of personal feelings and that the technologies didn’t get trained on their data and because of that, disassociation makes those other technologies easier to use.


There is a funny question people ask in India: if it comes out of your mouth it must be blood, but the same from mine is just the red sauce I ate?


100% everyone wants to do away with everyone else.


While I sympathise with the remarks, I think these arguments are wrong. In any programming language there are only so many ways to write something. For instance, iterate over a list of items and do some processing on them. So when someone publishes their code under the GPL, suddenly millions of projects that come up with exactly the same code can be in violation? I get this is complex, but copyrighting code is akin to copyrighting maths formulas. Imagine if a company selling bottled water could also copyright water. That's very much what is happening.


It sucks that people will be out of work once AI can do the same work basically for free. For years I have always felt that with more tax and a sort of “freedom fund” more people will be able to make “free” stuff in public for the global good. The only way to get there is to stop paying for stuff such that we eventually get to a utopian future as seen in Star Trek. Although the intervening period would be chaotic. For example, we pay tolls on some roads but we don’t pay cash every time we want to walk down the street - that idea sounds crazy. Likewise, if we could support the work of creators with UBI then we could all enjoy free software, media, books, etc. I know I would keep creating, because I love it. But for now the middleman wants to get his share so it seems sensible that we must pay companies for their produce directly.

To be even more extreme, a _lot_ of people think if anything is public, it is fair game to consume and reuse. If you didn’t have to worry about getting paid, would you care about someone enjoying and using your work? Perhaps the opposite - that would be your motivation.


One relatively simple answer is to just add to the license a term that neither it nor its derivative works can be uploaded to GitHub or be used to train any AI. It will be widely breached because no one cares about the license, but it does at least give you some legal recourse if you choose to pursue it. You aren't required to just use GPL v2-3; you can amend the terms however you wish.


To avoid the proliferation of a million different licenses, it would be better to have a single or few "No GitHub" licenses.

Such a license would not be considered "open" by the OSI because it imposes onerous conditions.

Even if they were made, I wonder how much such licenses would be used. If the permissive people don't like copyleft because of the conditions it imposes, they must hate this. Likewise the copyleft folks want freedom enforced, not taken away.


Isn't e.g. ChatGPT trained on data all over the internet? That means even if you upload your code to GitLab or your own public Gitea instance, it will still be used to train AIs?!

I don't see the point of this because quite frankly - if you want to prevent AIs using your data, you already lost that battle in the moment you uploaded your Code to the publicly accessible internet.


Where does OpenAI state that they only train ChatGPT-3.5 or GPT-4 on code from GitHub? The model for GitHub Copilot X clearly has a (human) language understanding that you can't get from source code (or source code comments), so they are trained on much more data than GitHub has and there is no reason to believe OpenAI would limit themselves to that.


Please upload all open source code to everywhere that will accept it. Make everything indestructible. Make it as easy as possible to find an authentic copy. Stick every bit of code with license that allows it in every free and proprietary neural network.

Did you write the absolute best implementation of X? I want to see it everywhere. Everywhere. In every single place where X is needed or discussed. Where all fine X are sold. Don't you? Or do you genuinely want to narrow the people who see your impl by some significant percentage because you got frustrated with the ugly capitalism of a recent distribution mechanism?

If I invented Golden rice[0] I'd like to think I'd allow it to be sold at Panda Express and all the other evil capitalist chains, not just the local co-op whose business practices I prefer.

0. https://en.wikipedia.org/wiki/Golden_rice


If I make a helmet with tracking to help rescue miners after a collapse I will not want to see it used by Qatar to track slaves: https://news.ycombinator.com/item?id=33675370

This may seem an extreme comparison, but to one who deeply cares about Free Software as defined by the FSF, proprietary software is bad. That's the difference between people who prefer Apache/BSD/MIT licenses versus those who prefer the (A)GPL.


That’s like saying that pacifists who refuse to own guns don’t really think war is bad, because they won’t take up arms to implement peace.

Fighting fire with fire is one option. Fighting fire with water is another.


Copilot and other code-authoring LLMs are one of the biggest innovations in software engineering in recent years. I use it daily and can't imagine going back to work without it — adopting it in 2021 was the same shock of a productivity boost as when I learned vim.

Yes, I know that it can occasionally break a license by producing licensed code verbatim. But in my almost 2 years of using it daily I have never seen it happen first-hand, and I don't see how this licence infringement could actually do any significant damage to anyone — so while I acknowledge that this problem exists, I refuse to accept that it's as significant as people make it out to be.

For a long time, it seemed like technical progress in computing had stopped, and now that AIs and LLMs are finally bringing exciting new technology to life, it's very sad to see exactly the people that should be excited and inspired by it — software engineers — fighting against it.


All the people wanting to profit off other people's work for free are arguing in favor of Microsoft.

Crypto bros are now AI bros huh?


As far as I know, rights and licenses don't work with suggestions like "please". Either embed it in your COPYING/LICENSE files so the initial 'uploading to github' action would be illegal, or take legal action against the main offender (which, in this case, is GitHub).


Of course this isn't just a problem with copyleft licenses on github, but also with non-open code on github. Only there the problem may be less visible.

Ideally, github should check the license of the code it's using to feed copilot, and only use code with permissive licenses.


What is stopping Github from crawling whatever other Open source platform you choose?

I don't think this will fix anything.

Actually developers are the only ones that stand to lose here since now open source will be spread on multiple platforms, making it harder to find what you want


> What is stopping Github from crawling whatever other Open source platform you choose?

At the very least, they can't claim that you gave them special rights through the terms of service.

Also, rate limiting.


Don't share and license your code permissively then.

People want the benefits of publicly sharing stuff, but then they want to prohibit others from learning from what they share.

There are many options to keep things private. The downside is that you won't get the same exposure.


Copilot was not only trained on permissively licensed code. It’s trained on all public repos, even if the code is copyrighted (which is the default absent a more permissive license)


If the copyrighted code was uploaded to GitHub by the owner, there's no problem with this. When you upload code to GitHub, one of the rights that you grant to GitHub is the right to use your content for "improving the Service over time". See D.4. License Grant to Us in the GitHub Terms of Service. Once it is up there, you also grant other users certain rights, like viewing public repos and forking repos into their own copies. See D.5. License Grant to Other Users. Even with the most restrictive protections in place, using GitHub requires you to give up certain rights.

A question would be if creating and training Copilot is "improving the Service over time". I would suspect that it would be, though.

There are still some open questions around what happens when Copilot suggests code verbatim, but these are mostly for the users of Copilot. Although I would hope that GitHub is thinking about offering information to ensure that users understand the source of code they use, if it may be protected, and what licenses it may be offered under. There are still some interesting legal questions here, but I don't think that the training of Copilot is one of them.

A more interesting question would be what GitHub does if someone uploads someone else's copyright-protected code to GitHub and it is used for training Copilot before it is removed. If you don't own the copyright, you can't grant GitHub the rights needed to use that code for anything, including improving the service.


> A question would be if creating and training Copilot is "improving the Service over time". I would suspect that it would be, though.

Definitely an interesting case to be had, but I'd argue that it does not. They're using their customers' code to create an entirely new product that would not be possible without it, not just improving their ability to host a Git repo. Otherwise, what wouldn't count as "improving the Service over time"? Can they do anything with the code they host as long as it improves their service? What about selling bootleg copies of it and using the proceeds to upgrade their servers?


However D4 also explicitly says "This license does not grant GitHub the right to sell Your Content". One could argue that because Copilot is a commerical product it is in fact selling (a derivative of) user code, and thus the grant in D4 does not apply.


> but then they want to prohibit others from learning from what they share

The linked-to document explicitly DOES NOT prohibit others from learning what they share.

Quoting it: "If the project is under an open source license, it means that everyone can share a copy – even on GitHub – of the licensed material under certain conditions. A license restricting this right wouldn’t be open source anymore. However, since GitHub may not respect the terms of licensed code that is hosted on their servers, not uploading the code of others there is, in fact, an ethical choice."


Adultery isn't illegal, but I might fairly consider someone engaging in it without their partner's consent to be in the wrong.


"No Adultery" is typically a term of entering a relationship. We can liken that to code licensing. Cheating is explicitly established as being _against_ the ad-hoc contract of the relationship.

Conversely, open source licenses explicitly state that an end-user may further distribute that source code to anywhere they wish.


This person is adding their "No GitHub" term with this blog post.


While I understand the author's concern, it seems a bit naïve to assume that Microsoft would ONLY learn from code in GitHub. Geek blogs are spectacular places to ingest code from because they give a ton of context and explanations for what's going on. Way more than raw source code ever does. If you want an AI to understand what's happening in code via the English phrases you use to describe what you want then you _need_ to train it on code that has similar English phrases describing what it's doing.

Ingesting from public sources outside of GitHub will just become more necessary as they work to improve these things.


Honestly, I think we should explicitly forbid using the material as training data in our FOSS licenses. Unless the weights and the network model are made public, I don't want any of my code to contribute to such an AI.


Naomi Klein's recent essay about AI^[1] suggests that the real "hallucination" problem in AI is that its promoters are not seeing the real world clearly. Some of her points about the effect of AI rollout on human employment may be germane to the present discussion.

1. https://www.theguardian.com/commentisfree/2023/may/08/ai-mac...


You're preaching to the cult here. There is a marked lack of AI criticism and skepticism on this forum. There are a few who are trying to gaslight skeptics (yes, gaslight) into believing that current AI is close to being sentient, and/or that humans are not much smarter than it, and/or that we should consider and evaluate its output as if it were created by a human.

I don't know whether it's laughable or utterly terrifying to put such a powerful tool into the hands of reckless wannabe entrepreneurs.


It's hype for suckers, intentionally created by the megacorps (remember "Sparks of AGI"?). The real goal is to get away with theft for long enough that the regulators give up and accept their right to be above the law (aka "disruption"). And then there's the army of intentionally misguided YouTube creators who want to get views by showing how they created a website without knowing HTML.


I'm having a hard time buying that people are genuinely worried about their code being copied without proper attribution. While it is possible for Copilot to generate copyrighted code, this typically occurs only with intentional effort and only for a few lines of code. It's just not an actual issue.

And something tells me that even if Copilot were entirely prevented from doing that, they would still not be happy about Copilot using their code for training. The copyright issue is just a convenient pretext.


This seems like a very reasonable position to me, but they should add the restriction to their licensing rather than just nicely asking other devs to pretty-please not do this.

Although I don't use GitHub (for reasons unrelated to Copilot), public access to my code was eliminated when I took my websites down while I look for a way to deal with AI scraping. I'm eagerly watching what others do, hoping that someone will have a great idea of how to deal with this before I do.


This assumes that GitHub Copilot only gets its data from GitHub and that GitHub Copilot is unique among tools.

Remember the old saying: Anything published on the internet stays on the internet.

What prohibits someone from crawling code from other sites and building a GitHub Copilot equivalent?

Considering how ChatGPT-style bots are often trained on public websites, that is likely already true.


I think the concept of open source should simply extend to AI models. I'm fine with AI being trained on open source code, if the entire model is also released as open source. We need a GPL analog for AI training: a license that allows you to train models on the code, but only if those models are released as open source. An infectious AI training license.


The funniest outcome of all this would be a ruling that the entire copilot model is GPL.


Isn't GPL the toxic stuff that real projects avoid like the plague?


This will always be an issue as long as people can fork the code, so one might say we need a license that prevents a module from being used in ML training. Better yet, we need a commented line or something that would break the training pipeline if found in the training source, where removing it would violate the license.


Hmm a thought.

With neural networks it's impossible to actually describe what the network does, including proving that it has or has not used GPLed code for a certain input/output set.

One could argue that all code output is GPL with the associated restrictions unless said network has provably never been trained on GPLed code.


Seems like GitHub should train one LLM per license type. There'd be a GPL LLM that's trained on GPL code, an MIT LLM trained on MIT-licensed code, etc. Then Copilot users could select the LLM for a license appropriate for the work they're doing.
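
To make that concrete, here's a minimal sketch of the data-partitioning step only (purely hypothetical; not anything GitHub has described doing). It assumes repositories are already checked out under a local ./repos directory and each carries a LICENSE or COPYING file; the keyword matching is a naive stand-in for a real license classifier. Each resulting bucket would then be the training corpus for its own per-license model:

    # Hypothetical sketch: bucket local repo checkouts by declared license so that
    # each bucket could serve as the training corpus for a separate model.
    # Assumes every repository lives in its own subdirectory of ./repos and carries
    # a LICENSE/COPYING file; the keyword match below is a naive stand-in for a
    # real license classifier.
    from collections import defaultdict
    from pathlib import Path

    LICENSE_MARKERS = {
        "gpl": "gnu general public license",
        "mit": "mit license",
        "apache": "apache license",
    }

    def detect_license(repo_dir: Path) -> str:
        """Return a rough license label for one repository, or 'unknown'."""
        for name in ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING"):
            path = repo_dir / name
            if path.is_file():
                text = path.read_text(errors="ignore").lower()
                for label, marker in LICENSE_MARKERS.items():
                    if marker in text:
                        return label
        return "unknown"

    def bucket_repos(root: Path) -> dict[str, list[Path]]:
        """Group the immediate subdirectories of `root` by detected license."""
        buckets: dict[str, list[Path]] = defaultdict(list)
        for repo_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            buckets[detect_license(repo_dir)].append(repo_dir)
        return buckets

    if __name__ == "__main__":
        for label, repos in sorted(bucket_repos(Path("repos")).items()):
            print(f"{label}: {len(repos)} repos")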


That's an interesting PoV. I hadn't thought of it.

But I don't know how much success the author will have, in his endeavor.

The horses have fled. Closing the barn doors does nothing. The Rubicon has been crossed. The die is cast. Iacta alea est, etc.


Well, not really. Code is living and breathing and changing, so if we updated all licenses today to forbid the use of the code to train LLMs that will be used in a proprietary setting, it would be fine, no?


You can say anything you want in your licence; I guarantee you people will break it.

If you don't want your code public, don't make it so, because licenses are a thing of trust and you shouldn't trust anybody.


It will not change anything. AI farms will git-clone from any public repository. Someone concerned should just patent-troll any software that may or may not use a copilot-ed part of their code.


You could make this into an addendum to whatever license you want, no? That way your license covers both reuse and attribution, and also states that you are not allowed to upload to GitHub.


Copyright regulation for intellectual property should not exist.


    *.code-workspace
    .github/
    .vscode/
in your .gitignore is another way to express that you don’t want Microsoft as a part of your project.


Maybe it's a good point for a mini-ask HN: assuming you agree with the sentiment, what is your preferred alternative to GitHub?


I use a self-hosted Gitea instance. Haven't had any issues with it; I can highly recommend it. Not super useful if you want contributions though xD


Sourcehut. I like how they approach things there, and am happy to pay for my account to support them.


I would guess that GitLab is the most used one right after GitHub, but I may be incorrect.


I still like GitLab a lot and use it wherever I can. I think that it was once talked about very favorably on HN, but it seems to have lost its popularity after its IPO (or even earlier).


There’s also the swath of non-Git DVCS options like Fossil, Mercurial, darcs, Pijul, et al., and they have their own forge hosting options.


sourcehut - the hacker's forge


Self-hosted Gitea.


Closed source. I can't see the point of publishing anything innovative if it is just going to be vacuumed up by some giant corporation and sold back out to users at $10 a month.


Codeberg seems like a decent host for Free Software projects. There is also SourceForge, but I don't know if people have already forgiven and forgotten the mistakes made in the past.



Self-hosted GitBlit.

https://www.gitblit.com/


git-remote-ipfs or GitTorrent.


AI is like the Tornado Cash of copyright infringement and IP theft. It's a mixer that exists to skirt existing law.


Is Copilot trained only on GitHub-hosted software? What stops them from using sources from Codeberg if they wanted to?


I wonder if we will start seeing Open Source licenses that explicitly forbid using code for model training.


(apologies for the sarcasm, but)

According to the Hitchhiker's Guide to the Galaxy…

A new coding highway was planned years ago. You could have filed a complaint in the Implications of Future AI Department in the basement of Bill Gates’ mansion, but you didn’t.

The highway is being constructed and you’re just going to have to deal with it.

Here’s some Vogon poetry to make you feel worse and a towel to cry in (only one of its many uses).

The guide also suggests gardening as a way to reduce anxiety.


There should be a new open-source license specifically designed to restrict the use of code for commercial purposes, whether that involves training machine learning models or not. It could be similar to the GPL but tailored specifically to the field of machine learning, ensuring that the code cannot be used for commercial ML purposes or for training without limits.


People aren’t necessarily opposed to commercial use (which many licenses allow), but are upset that license terms like attribution requirements are completely being ignored by Microsoft/Github, turning Copilot into a sort of laundering machine for open-source content.


That's what I am trying to say. Personally, I don't mind whether my code is being used commercially or not. I do mind when Microsoft uses my code without a reference and asks me to pay for Copilot just because they decided to put it behind a paywall.


Why?! Why restrict the advancement of a field? Most of the time the code uploaded to GitHub came from another source (not original content; very little code is original).

Nobody can stop this anymore. If it's not GitHub, which only takes code from its own platform, others will come along that scrape the internet and use everything, just like Midjourney. Embrace it.

I'm more against AI art generators, because they produce end results and will fully replace the need to know how to draw. Copilot is just providing simple snippets; it's not thinking for you, so it won't replace the engineers completely, at least for now.


Still, it's good to have it as an option. If you see this as an advancement, you can choose something like MIT; if you don't want your code used to train a paid Copilot, then you can choose something like MIT-Human only.

It's a better solution than not using GitHub at all. And whether we like it or not, a new license should be introduced to address the new data model.


If you put restrictions like this on your code, then it's no longer open source.


You're correct in the common-sense interpretation of the phrase "open source", but unfortunately groups like the Open Source Initiative have redefined it to mean "anything on a list of licenses we approve", i.e. GPL/MIT/BSD. Personally I think that people trying to enforce this definition in online discussions are wasting their time, because it'll never catch on. We should stop trying to make "fetch" happen.


There is a difference between open source and free.


GNU GPLv3 wants a word. It has way more “restrictions” (restricting one entity to free another)


Copyright laws do not apply. It's trained intelligence, in a similar way to human intelligence. If you apply copyright laws to LLMs, you should apply them to human intelligence too, as it's the same process (at a different scale).

Yes, programming is going away, as are most intellectual and artistic tasks.


So many copyleft people writing about their copyrights feels... weird.


Noted


I read through the GitHub "problems" following the link, and it reads as "for-profit organisation makes for-profit tools"... Great.

I think GitHub, after Stack Overflow, is the best thing that happened to developers.


... until profit-making goes against developing a good service.

> I think GitHub, after Stack Overflow, is the best thing that happened to developers.

I don't think GitHub is bad tooling (there are worse), but the best thing for developers? I don't see anything on GitHub that is greatly better than what we have in other services.


GitHub was a huge step up at the time, compared to SourceForge and Google Code and the like. Git itself helped, since checking out code from Subversion and CVS didn't give you the history, so you'd lose that if you wanted to publish your own fork.


I think git itself is better, personally.


I think the same thing. Before git I was using TFVC, and git was a huge improvement in pretty much every aspect of code version control.


[flagged]


You've clearly used a throwaway just to be contrarian, but have you considered not posting at all if it adds no value to the discussion?


I think it does.

See, if a comment like that crops up in a civilised society like this, it demonstrates that the barren wastelands of “the outside world” will have similar examples.



