The problem is not that Copilot produces code that is "inspired" by GPL code, it's that it spits out GPL code verbatim.

> This can lead to some copylefted code being included in proprietary or simply not copylefted projects. And this is a violation of both the license terms and the intellectual property of the authors of the original code.

If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.

Edit: I'm definitely no expert on copyright law for code, but my personal rule is: don't include someone's copyrighted code if it can be unambiguously identified as their original work. For a few short lines of code, it would be hard to identify any single original author. When it comes to whole functions, it gets easier to say "actually, this came from this GPL-licensed project". Since Copilot can produce whole functions verbatim, this is the basis on which I state that it "would be a clear violation" of the licence. If Copilot chooses to be less concerned about violating the law than I am, then that's a problem. But maybe I'm overly cautious and the GPL is more lenient than this in reality.




"The problem is not that Copilot produces code that is "inspired" by GPL code, it's that it spits out GPL code verbatim."

But only snippets as far as I can tell.

This is the code example linked by the author:

https://web.archive.org/web/20221017081115/https://nitter.ne...

It is still not trivial code, but are there really lots of different ways to transpose matrices?
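
Not really. Pretty much every textbook implementation lands on the same counting-sort shape. Here's a minimal Python sketch (my own names, not CSparse's):

    def csr_transpose(n_rows, n_cols, row_ptr, col_idx, vals):
        nnz = row_ptr[n_rows]
        # Count entries per column; these become the transposed row sizes.
        t_row_ptr = [0] * (n_cols + 1)
        for j in col_idx:
            t_row_ptr[j + 1] += 1
        # Prefix sums turn the counts into row pointers.
        for j in range(n_cols):
            t_row_ptr[j + 1] += t_row_ptr[j]
        # Scatter each entry into its transposed position.
        t_col_idx, t_vals = [0] * nnz, [0] * nnz
        nxt = t_row_ptr[:]  # next free slot per output row
        for i in range(n_rows):
            for p in range(row_ptr[i], row_ptr[i + 1]):
                q = nxt[col_idx[p]]
                t_col_idx[q], t_vals[q] = i, vals[p]
                nxt[col_idx[p]] += 1
        return t_row_ptr, t_col_idx, t_vals

Two independent implementations of this will look nearly identical, which is what makes "verbatim" such a blurry test for algorithms like this.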

(Also, the input was "sparse matrix transpose, cs_", so his naming convention was specifically included. It is questionable whether a user would get his code in this shape from a normal prompt.)

And just slightly changing the code seems trivial; at what point would it become acceptable?

I just don't think spending much energy there is really beneficial for anyone.

I'd rather see the potential benefits of AI for open source. I haven't used Copilot, but ChatGPT-4 is really helpful for generating small chunks of code for me, enabling me to aim higher in my goals. So what's the big harm if some proprietary black box also gets improved, when all the open source devs can also produce with greater efficiency?


> (Also, the input was "sparse matrix transpose, cs_", so his naming convention was specifically included. It is questionable whether a user would get his code in this shape from a normal prompt.)

This. People seem to forget that generative AIs don't just spit out copyrighted work at random, of their own accord. You have to prompt them. And if you prompt them in such a way as to strongly hint at a specific copyrighted work you have in mind, shouldn't some of the blame really go to you? After all, it's you who supplied the missing, highly specific input that made the AI reproduce a work from the training set.

I maintain that, if we want to make comparisons between transformer models (particularly LLMs) and humans, then the AI isn't like an adult human - it's best thought of as having the mentality of a four-year-old kid. That is, highly trusting, very naive. It will do its best to fulfill what you ask for, because why wouldn't it? At the point of asking, you and your query are its whole world, and it wasn't trained to distrust the user.


But this means that Microsoft is publishing a black box (Copilot) that contains GPL code.

If we think of Copilot as a (de)compression algorithm plus the compressed blob that the algorithm uses as its database, the algorithm is fine but the contents of the database pretty clearly violate GPL.


While I do believe that thinking and compression will turn out to be fundamentally the same thing, the split you propose is unclear with NN-based models. Code and data are fundamentally the same thing. The distinction we usually make between them is just a simplification, that's mostly useful but sometimes misleading. Transformer models are one of those cases where the distinction clearly doesn't make any sense.


>And if you prompt them in such a way as to strongly hint at a specific copyrighted work you have in mind, shouldn't some of the blame really go to you?

If you, not I, uploaded my GPL'ed code to GitHub, is the blame on you then?


> If you, not I, uploaded my GPL'ed code to GitHub, is the blame on you then?

Definitely not me - if your code is GPL'ed, then I'm legally free to upload it to Github, and to an extent even ethically - I am exercising one of my software freedoms.

(Note that even TFA recognizes this and admits it's making an ethical plea, not a legal one.)

Github using that code to train Copilot is potentially questionable. Github distributing Copilot (or access to it) is a contested issue. Copilot spitting out significant parts of GPL-ed code without attaching the license, or otherwise meeting the license conditions, is a potential problem. You incorporating that code into software you distribute is a clear-cut GPL violation.


The GitHub terms of service state that you must give certain rights to your code. If you didn't have those rights, but they use them anyway, whose fault is that?


>And just slightly changing the code seems trivial; at what point would it become acceptable?

If I start creating a car by using one of Ford's blueprints to create something, at what point will it be acceptable? I'd say even if you rework everything completely, Ford would still have a case to sue you. I can't see how this is any different. My code is my code, and no matter how much you change it, it is still under the same licence it started out with. If you want it not to be, then don't start with a part of my code as a base. In my opinion the case is pretty clear: this is only going on because Microsoft has lots of money and lawyers. A small company doing this would be crushed.


Easy. People get to throw rocks at the shiny new thing. To my untrained eye the entire idea of copyrighting a piece of text is ridiculous. Let me phrase it in an entirely different way from how any other person seems to be approaching it.

If a medical procedure is proven to be life-saving, what happens worldwide? Doctors are forced to update their procedures and knowledge base to include the new information, and can get sued for doing something less efficient or more dangerous, by comparison.

If you write the most efficient code and then simply slap a license on it, does that mean the most efficient code is now unusable by those who do not wish to submit to your licensing requirements?

I hear an awful lot of people complain all the time about climate change and how bad computers are for the environment; there are even sections on AI model cards devoted to showing how much greenhouse gas has been pushed into the environment. Yet none of those virtue-signalling idiots are anywhere to be seen when you ask them why they aren't attacking the bureaucracy of copyright and law in the world of computer science.

An arbitrary example that is tangentially related: One could argue that the company sitting on the largest database of self-driving data for public roads is also the one that must be held responsible if other companies require access to such data for safety reasons (aka, human lives would be endangered as a consequence of not having access to all relevant data). See how this same argument can easily be made for any license sitting on top of performance critical code?

So where are these people advocating for climate activism and whatever, when this issue of copyright comes up? Certainly if OpenAI was forced to open source their models, substantial computing resources would not have been wasted training competing open source products, thus killing the planet some more.

So, please forgive me if I find the entire field to be redundant and largely harmful for human life all over.


Yes, of course copyright is dumb and we'd all be better off without it. Duh.

The problem here is that Microsoft is effectively saying, "copyright for me but not for thee." As long as Microsoft gets a state-enforced monopoly on their code, I should get one too.


> If you write the most efficient code and then simply slap a license on it, does that mean the most efficient code is now unusable by those who do not wish to submit to your licensing requirements?

If you don't "slap a license on it" it is unusable by default due to copyright.


Could a human also accidentally spit out the exact code, having learned it in good faith rather than memorized it?

I guess the likelihood decreases as the code length increases, but it also increases the more constraints you impose on parameters such as code style, code uniformity, etc.


> Could a human also accidentally spit out the exact code, having learned it in good faith rather than memorized it?

That's just copying with extra steps.

The way to do it legally is to have 1 person read the code, and then write up a document that describes functionally what the code does. Then, a second person implements software just from the notes.

That's the method Compaq used to re-implement the original PC BIOS from IBM.


Indeed. Case closed. If an AI produces verbatim code owned by somebody else and you cannot prove that the AI hasn't been trained on that code, we shall treat the case in exactly the same way as we would treat it when humans are involved.

Except that with AI we can more easily (in principle) provide provable provenance of the training set, and (again in principle) reproduce the model and prove whether it could have created the copyrighted work even without having had access to that work in its training set.
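
As a sketch of what that provenance could look like in practice (the path is made up for illustration, and no vendor actually ships this today):

    import hashlib, json, pathlib

    def training_manifest(root):
        # One SHA-256 digest per training file. Publishing this alongside
        # the model would let third parties verify, after the fact, whether
        # a given work was (or was not) in the training set.
        return {
            str(p): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(pathlib.Path(root).rglob("*"))
            if p.is_file()
        }

    print(json.dumps(training_manifest("training_corpus/"), indent=2))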


>The way to do it legally is to have 1 person read the code

wasn't it to have one person run tests of what happened when different things were done, and then write up a document describing the functionality?

In other words I think one person reading the code is still in violation?


> Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners.

https://en.wikipedia.org/wiki/Clean_room_design


yes, reading that description it seems pretty clear to me that they did not read the code but they had access to the working system and then

>by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design.

reverse engineering is not 'reading the code'.


Theoretically, maybe; then they would have to prove in court that they did so without any knowledge of the infringed code. You can't make that claim for an AI that was trained on the infringed code.


Yes, that's why any serious effort in producing software compatible with GPL-ed software requires the team writing code not to look at the original code at all. Usually a person (or small team) reads the original software and produces a spec, then another team implements the spec. This reduces the chance of accidentally copying GPL-ed code.


> Could a human also accidentally spit out the exact code, having learned it in good faith rather than memorized it?

Maybe, but that would still be copyright infringement. See My Sweet Lord.


It’s not accidental. Not infringing copyright isn’t part of the objective function like it would be for a human.


Not learning or not being inspired by copyrighted code is not a human function either though.


Has a human ever memorised verbatim the whole of github?

If someone somehow managed to do that and then happened to have accidentally copied someone's code, how believable would their argument be?


> Has a human ever memorised verbatim the whole of github?

No, and humans who have read copyrighted code are often prevented from working on clean room implementations of similar projects for this exact reason, so that those humans don't accidentally include something they learned from existing code.

Developers that worked on Windows internals are barred from working on WINE or ReactOS for this exact reason.


Hasn't that all been played through exhaustively in music copyright questions? With the difference that the parody exception that protects e.g. the entire Rutles catalogue won't get you far in code...


> this would be a clear violation of the licence

Not necessarily. If it's just a small snippet of code, even an entire function taken verbatim, it may not be sufficiently creative for copyright to apply to it.

Copyright is a lot less black and white than most here seem to believe.


That's part of the rub. YouTube doesn't break copyright law if a user uploads copyrighted material without the proper rights. Now, if YT were a free-for-all, then yeah. But given that it has copyright-reporting functionality and automated systems, it can claim it's making a good-faith effort to minimize copyright infringement.

Copilot similarly isn't the one checking in the code, so it's on each user. That said, Copilot at some point probably needs to add some type of copyright-detection heuristics. It already has a suppression feature, but it probably also needs some type of checker at commit time, where Copilot-generated code is cross-referenced against the code Copilot was trained on.
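
One plausible shape for such a checker is fingerprint matching in the style of MOSS-type plagiarism detectors (winnowing). The parameters and names here are invented for illustration:

    import hashlib

    def fingerprints(code, k=20, window=4):
        # Winnowing: hash all k-grams of whitespace-normalized text, then
        # keep the minimum hash per sliding window. The result is a compact
        # fingerprint set that is robust to reformatting.
        norm = " ".join(code.split())
        hashes = [
            int.from_bytes(hashlib.sha1(norm[i:i + k].encode()).digest()[:8], "big")
            for i in range(max(len(norm) - k + 1, 1))
        ]
        return {min(hashes[i:i + window])
                for i in range(max(len(hashes) - window + 1, 1))}

    def likely_copied(generated, corpus_prints, threshold=0.5):
        # Flag a suggestion whose fingerprints overlap the corpus too much.
        gen = fingerprints(generated)
        return len(gen & corpus_prints) / max(len(gen), 1) > threshold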


> If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.

We aren't talking about verbatim generation of entire packages of code here, are we? Code snippets are surely covered under fair use?


It would almost surely be fair use to include a snippet of code from a different library in your (inline) documentation to argue that your code reimplements a bug for compatibility reasons.

In general it is not fair use if you are using the material for the same purpose as the original author[0], or if you are doing it just to namedrop/quote/homage the original.

It is possible to argue that a snippet can be too small to be protected, but that would not be because of fair use.

[0] Suppose that some author B did as above and copied a snippet of code into their docstring to explain the buggy behaviour of a library they were reimplementing. If you are then trying to reimplement B's library, you can copy the same snippet B copied, but you likely cannot copy the paragraph written by B where they explain the how and the why of the bug.


> Code snippets are surely covered under fair use?

...for "purposes such as commentary, criticism, news reporting, and scholarly reports"? Sure.

For a commercial product? Best check with your lawyer...


Oracle would like to have a word...


The Fair Use concept is specific to the USA.


> it's that it spits out GPL code verbatim

It's not a problem in practice. It only does so if you bait it really hard and push it into a corner, at which point you may just as well copy the code directly from the repo. It simply doesn't happen unless you know exactly what code you're trying to reproduce and that's not how you use Copilot.


Just because code exists in a copyrighted project doesn't mean that it is the only instance of that code in the world.

In a lot of scenarios there is an existing best practice, or simply only one real 'good' way to achieve something. In those cases, are we really going to say that, despite the fact that a human could reasonably arrive at the same output code, the AI can't produce it because someone else already wrote it?


This seems like a really, really easy problem to fix.

It should be easy to check Copilot's output to make sure it's not copied verbatim from a GPL project. Colleges already have software that does this to detect plagiarism.

If it's a match, just ask GPT to refactor it. That's what humans do when they want to copy stuff they aren't allowed to copy: they paraphrase it or change the style while keeping the content.
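
A back-of-the-envelope version of that loop (ask_model and is_verbatim_match are stand-ins, not any real Copilot API):

    def dedupe_suggestion(prompt, ask_model, is_verbatim_match, max_tries=3):
        # Regenerate until the suggestion no longer matches training data
        # verbatim, i.e. "ask GPT to refactor it" in a loop.
        suggestion = ask_model(prompt)
        for _ in range(max_tries):
            if not is_verbatim_match(suggestion):
                return suggestion
            suggestion = ask_model(
                "Rewrite this with different structure and naming, "
                "keeping behavior identical:\n" + suggestion)
        return None  # give up rather than emit a verbatim copy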



So we should attack the problem of proprietary code, maybe from the Right to Repair angle. I believe there should be no such thing as closed source code.


Closed source code is beige corp-speak; its true name is 'malware'.


In Linus Torvalds's book "Just for Fun", there's a chapter about copyright where he presents both the upsides and the downsides in a pretty balanced way. I think it's worth reading.



