Should GitHub be sued for training Copilot on GPL code? (fosspost.org)
200 points by deadbunny on June 23, 2022 | 300 comments



The question is ultimately going to come down to - "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"

On a tangential note, I always find the discussions surrounding FOSS licenses and copyright rather amusing in a sad way. There's a certain kind of entitlement a lot of people feel towards FOSS that they certainly do not express towards proprietary software, and I imagine this is a great source of the resentment and burn-out FOSS maintainers feel.


>> "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"

IANAL, but isn't the concept of "derived data" pretty standard? You don't need to copy data for it to be infringing. I've tackled derived-data clauses regularly when negotiating data contracts at work, and there is always verbiage and discussion around them (e.g., are we allowed to publish an average of the purchased data?)


An average is not related to the artistic aspects of the data and so can't be a derivative in the copyright sense (based on international law - one of the principal Conventions is fully titled the "Berne Convention for the Protection of Literary and Artistic Works"; that's because copyright protects literary and artistic works).

Provided you have rights to access a body of statistics, then copyright has nothing to say -- save overreaching national caselaw (!) -- on your derivation of mathematical, technical, or scientific data from that work.

But a contractual clause, in general, doesn't care about copyright; if you've contracted not to derive data from a work then that's orthogonal to copyright.

IANA(IP)L, this is my opinion and unrelated to my employment.


> The question is ultimately going to come down to - "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"

Of course it isn't the same as a human programmer doing anything. It's a complex piece of software, which we happen to misuse the term "AI" to describe, but it is not intelligent.


Technically just advanced indexing of code snippets


Precisely. I don't see a difference between what Google indexes on its search engine and what CoPilot can recommend. Google has been, and still is, slapped on the wrist when they don't respond to takedown requests. This seems to be missing from CoPilot currently, and will open them up to a number of lawsuits in the future if it continues to operate as it does.


Except Google isn’t creating anything new, Copilot is. I’ve had it kick out some very interesting short stories based on Sherlock Holmes, so if I published those would they infringe?


You have no idea where they came from, or whether they are as new as you assume. Maybe they'd infringe, maybe they wouldn't!


Google returned nothing which contained exact matches of some of the more interesting dialogue. It would be a serious find - worthy of a paper - to disprove that GPT-3 is generating novel text/code.


It's easy to prove that it sometimes regurgitates text verbatim — just play with it for a while. Having certainty that any given span of text/code is novel is extraordinarily difficult.
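
For what it's worth, the asymmetry is easy to demonstrate. Here's a minimal sketch in Python (purely illustrative, not how anyone checks this in production): a single non-empty overlap against any known work is positive proof of regurgitation, while an empty result proves nothing, because nobody outside the vendor has the full training corpus to compare against.

    def ngram_set(text, n=8):
        # All n-word shingles of a text, whitespace-normalized.
        words = text.split()
        return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def verbatim_overlap(generated, known_work, n=8):
        # n-grams of the generated text that appear word-for-word
        # in a work we happen to have on hand.
        return ngram_set(generated, n) & ngram_set(known_work, n)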


Where does human creativity come from?


> I don't see a difference between what Google indexes on its search engine, and what CoPilot can recommend.

That is an extremely disingenuous take. It produces novel output, so is not merely an “index” in any sense of the word.


Each search results page is novel and unique. Even two different users making the same query will get different results thanks to the "search personalization" Google is doing these days.


Google search doesn't synthesize anything. It collects results and orders them according to an algorithm. Copilot and similar language models can synthesize new text. That's clearly different than just presenting existing text.


Copilot can't create novel concepts. In the end it is just a complex mathematical formula that returns a set of code references, with lengths determined by the math.

The illusion of creativity is similar to that of technology. Sufficiently advanced technology is indistinguishable from magic, and sufficiently advanced math is indistinguishable from intelligence. The relation between AI and math is the same as the relation between magic and technology.


All of the large language models can emit text that has never been seen in the training set (unless you go so far as to consider each character to be a snippet).


They can also emit text that is copied verbatim.

Infringement isn't about how the infringing system works, it's about the product of that work.


> Infringement isn't about how the infringing system works, it's about the product of that work.

Exactly this. It makes zero difference that you produced your infringing work with the help of a program that happens to be extremely complex and marketed as "AI".


So do smaller models and I have to note smaller models are better at that.


If it gives the same output as a human programmer for the same input, why would it be legally relevant whether one of the systems has the property of intelligence?


But that's the whole point of copyright. The same piece of code you copy from a Google search can legally be used by you if your developer came up with it, and not if Oracle came up with it. Where you copied it from is the entire point.


Of course it's not intelligent, but we still have to decide how the law applies to the actions of software, or otherwise re-frame the whole thing so that it's the Copilot developers doing the copyright infringement when they trained the model - not the current discussion, which gives agency to the IDE plug-in "choosing" a code snippet to paste.


I don't see how the way the software was built is particularly relevant.

It's just a tool used by the developer; the onus is on the developer to ensure they don't infringe the licenses of the source code they incorporate in their software. Since Copilot makes it impossible to know where it's barfing code up from and what license that code is under, a developer who cares about not getting sued probably needs to avoid using Copilot.


Eh, the law is about making a copy. If my IDE plug-in fills in some code, the question is, did I copy the code, did the robot copy the code, or did the developers that wrote "cp github.db ~/trainingset" copy the code?


The authors of the tool created something that can be used for copyright infringement.

The tool itself lacks agency, it did what it was programmed to do.

If you took the tool's suggestions and proceeded to publish a derivative work, you may have infringed.

This really doesn't feel any different from P2P filesharing services. Rightsholders have targeted tool publishers in the past, because they are the largest single target and not anonymous; but ultimately the infringement is performed by the end user.


This isn't complicated at all. You copied the code, which isn't an issue until you then go on to do something which infringes the license (e.g. publish under a different license, publish binaries without publishing source, publish without attribution, whatever it is that the license requires).


A law that uses the archaic terms "copy and paste", referring to a time when people would make an analog photocopy of a document written using a typewriter, trim it out with scissors or a knife, and glue it to their book with the pasty remains from boiling animal collagen cannot be trusted to apply word-for-word in a time when technology has obsoleted the glue, typewriter, xerox machine, and even the paper.

It is not the same as a human, no, but it's not hard to choose a definition of the word "intelligent" that can accurately describe something that can be done by a program.

When a human walks around a puddle, are they demonstrating intelligence? When a horse avoids stepping in a hole, is the horse intelligent? When a robotic vacuum avoids a stairway, is it intelligent? When a self-driving car avoids a bollard, is that intelligent?

Whether there's a being inside the device that believes it experiences consciousness or not, the same outcome happens. A Searle's Chinese Room that produces copies of Chinese IP, a trained monkey that does so, or a human that does the same thing, the outcome is very similar.


Perhaps it's a little bit like employing a human programmer with an eidetic memory who occasionally remembers entire largish functions.

If he were able to remember a large enough piece of copyrighted code, and reused it, then it still wouldn't be fair use, even if he changed a variable name here or there, or the license message.


Yeah, that's definitely the impression I get from the few Copilot examples I've seen. I've not personally used Copilot so I refrained from making absolute statements about its behavior in my top comment.

But I think the conclusion most people are settling on is that it's definitely infringing.


A possible response that I'd predict from GitHub would be to attribute much/all of the responsibility to the user.

The argument would be along the lines of: you as the user are the one who asked the eidetic programmer (nice terminology, @bencollier49) to produce code for your project; all we did is make the programmer available to you.


Relevant parts from the Copilot FAQ (https://github.com/features/copilot/):

Does GitHub own the code generated by GitHub Copilot?

GitHub Copilot is a tool, like a compiler or a pen. GitHub does not own the suggestions GitHub Copilot generates. The code you write with GitHub Copilot’s help belongs to you, and you are responsible for it. We recommend that you carefully test, review, and vet the code before pushing it to production, as you would with any code you write that incorporates material you did not independently originate.

Does GitHub Copilot recite code from the training set?

The vast majority of the code that GitHub Copilot suggests has never been seen before. Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set. Previous research showed that many of these cases happen when GitHub Copilot is unable to glean sufficient context from the code you are writing, or when there is a common, perhaps even universal, solution to the problem.


I've used Copilot for months and honestly it's become one of my most favorite inventions in all of programming- and this is key- even when it screws up (such as by suggesting Ruby-syntax code to autocomplete Elixir code). It tickles the "childlike joy" funnybone in me, the same one that got me into programming to begin with. I don't know how long it will take for typing "#ANSI yellow" (for example) and autocompleting to the right codes to get old, or every time it autocompletes anything considered "boilerplate," but it hasn't, yet!

You know, pretty much all of programming can be summed up as "tedious labor elimination," and this tool directs that same labor elimination at the work of programming itself (I no longer have to constantly google syntax idiosyncrasies etc.), and NOW coders are pissed? I don't get it. Eat your own dog food, people, because this is what it looks and tastes like.

As to the copyright infringement or licensing-violation claims, I have yet to see it autocomplete an entire algorithm correctly, or one copied verbatim from somewhere, although that could be mitigated. You still have to pay attention (kind of like Tesla autopilot), it's not going to eliminate your job.


No one is complaining about copilot making programming easier or automating it.

We're upset because it's quite literally infringing on intellectual property. Infringing on intellectual property that's been set aside for the exclusive use of the commons.


god bless AI for moving human society beyond silly notions like ideas-as-property

copyright was established to increase the innovation and creative will of the arts and sciences, what could increase that creative force more than an AI assistant who has seen every creative work ever made?


I'd be all for returning ALL code to the commons.

Except that is not what is happening here. The problem is that code which was provided to the commons, under the explicit condition that anything built with it is also released under the same terms, is now being fed to a magic mystery machine to produce code that can supposedly be legally withheld from the commons. The only code this affects is code that was already shared - you won't see Microsoft feeding Windows and Office source code into Copilot anytime soon.


Do you use it? Have you ever used it? How many people making negative comments about it here have actually used it? I don't actually believe many have. I suggest at least trying it out before lighting your torches.

If it infringes everyone equally and everyone equally benefits from the infringement, has a net wrong actually occurred? (which of course begs the "do the ends justify the means" question...)

I don't see how this is any different a form of "infringement" than me copying and pasting snippets of other people's code, and then modifying them to suit my particular context, without specific attribution, except that the latter is a much more laborious and time-consuming process than copilot autocomplete, and programming is all about tedium elimination


> If it infringes everyone equally and everyone equally benefits from the infringement, has a net wrong actually occurred?

It’s not done equally though. Copyleft code is extremely likely to be on GitHub somewhere, while internal proprietary code is often not. Copilot will thus have been trained more on the former than the latter.

> I don't see how this is any different a form of "infringement" than me copying and pasting snippets of other peoples' code, and then modifying it to suit my particular context, without specific attribution

It’s no different, but that is also copyright infringement.


> but that is also copyright infringement.

so basically all of Stackoverflow is copyright infringement, and has been for over a decade? Find me the programmer who has never either 1) copied and pasted directly from the internet, or 2) taken an idea found on the internet and massaged it for their own purposes. I mean... this is basically why programming is so lucrative IMHO. Everyone is piggybacking off of everyone else's work (at least in open source)


The tens of thousands of developers in a company I am familiar with have taken basic training on intellectual property concepts and software licenses.

A typical case mentioned in the training is that code from StackOverflow is (probably) licensed under CC BY-SA 4.0, and as such it can never be copied into their proprietary-licensed code base.


This is something I wish more companies would do. It’s sorely needed.


(Recent) StackOverflow contributions are licensed under CC BY-SA 4.0 by default (though the author can of course release it under any additional licenses they choose): https://stackoverflow.com/help/licensing

If the code is really sufficiently trivial (and I’d guess that most code samples you’ll find on StackOverflow are) you may have a fair use argument in the US. Generally speaking though (and especially for anything nontrivial) you need to respect the license. CC BY-SA 4.0 is one-way compatible with GPLv3, though, so that helps if you’re including it in a GPLv3 codebase: https://creativecommons.org/2015/10/08/cc-by-sa-4-0-now-one-...


Define "sufficiently trivial"


It's fuzzy and imprecise, as many legal concepts are. Small/unoriginal enough snippets may not even be copyrightable:

https://en.wikipedia.org/wiki/Threshold_of_originality

Then even if it is copyrightable, under some circumstances your use of it may be considered fair use anyway:

https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors

Or potentially de minimis:

https://blogs.library.unt.edu/copyright/2017/09/05/the-de-mi...

But when in doubt, ask for permission or ask a lawyer.


Even apart from the copyright aspect, it would be nice if we as programmers would improve our attitude towards attribution. If researchers can cite the work that has influenced theirs without legal threats, then so can we.


Github explicitly leaves out proprietary code bases, including the Microsoft Windows source code (Microsoft owns Github and uses it for their own products).

If Microsoft included their own source code when training copilot then at least they would be intellectually honest, but they don't. They only consider GPL and other free and open source code to be up for grabs.


This kind of reminds me of when someone reverse engineers a piece of software to document interfaces, protocols, or APIs for the purpose of writing compatible software. Then a second person, not involved in the RE process, implements compatible software from the documentation the first person wrote.

This is done to avoid any contamination and verbatim copies of code. Once you have read a piece of code there is a risk of "contamination" and you will be influenced by it. It does not matter if you directly copy it, write it out from memory, or use an AI to regurgitate it. It will be a copy of the code. To me this is very clear.


This sounds like “taint” in the M&A space. I’ve very limited experience of it and would be interested in hearing more from the better informed folks on this topic!

My limited experience: my then-employer opted not to acquire a company after doing due diligence. Ultimately we decided that the price of acquisition (both paid out, and also incurred in internal time) was below the cost of building a comparable product ourselves.

As the dev who did the tech portion of the due diligence I was now “tainted” by my knowledge of their system. As a result I could not work directly on the effort to build our own comparable solution.


Another example is Wine: Anyone who has seen the Windows source code is not allowed to contribute [0]

[0] https://wiki.winehq.org/Developer_FAQ#Who_can.27t_contribute...


A human who types out the fast inverse square root algorithm line by line won't be exempt from copyright/license infringement just because they remembered it off the top of their head. However, using the same concepts is likely to be fine outside silly jurisdictions where software patents are a thing.

The difference is that AI isn't able to grasp concepts, it's only capable of rehashing patterns. If it is able to understand concepts then it should be shut down and researched immediately, because it's either close to gaining consciousness or already has done so.

The core of copilot is a file or a block of memory laying out a bunch of floating-point numbers that get processed and turned into code. This arrangement of floats is derived from source code, with licenses and copyright notices.

I don't think it's any different from turning code into a compiled program. Any developer will understand that a compiled version of GPL code is a derived work and subject to the GPL license. Why would a compiler that turns code into floats be any different? Sure, those floats get mixed up with the floats from other source code, but linking to GPL'd code does something very similar and is also covered by the license.

It's possible to consider copilot similar to hashing: a SHA hash of a binary isn't subject to the binary's license, that'd be silly. However, hashes are inherently one-way, and copilot isn't.

A question I'd like to ask Microsoft is "if I steal the Windows source code and train an AI on it, can that AI be freely distributed and used for Wine/ReactOS/etc?" If Microsoft sticks to the stance that AI isn't subject to the licenses on software then a leaked source AI should be fine, but if they want to protect their intellectual property then they will send cease and desist letters to anyone even thinking about using such an AI model for code completion. My expectation is that Microsoft will act against such an AI.

Regardless, the fact that Github did not ask permission or provide an opt out before training started is a huge middle finger to all open source developers. Even if they can get away with this stuff legally, this approach has surely offended many open source developers who want big tech companies to abide by their code licenses. I don't do much open source work myself but I've been offended by the whole process from the day copilot rolled out and I don't believe I'm alone in this.


> A human who will type out the fast inverse square root algorithm line by line won't be exempt from copyright/license infringement just because he remembered it from the top of their head.

A human would probably try to defend against a copyright infringement suit over that by arguing something like the following.

There isn't sufficient creative expression in fast inverse square root (FISR) to be copyrightable. There is plenty of creativity in that thing, but it is in things that are not copyrightable such as the underlying mathematics that it is using. Copyright covers expression of ideas, not use of ideas (that's patents) or the ideas themselves.

The expression in FISR that they probably are copying from is pretty much all just in choosing the names of variables, and most implementations I've seen just use pretty normal names that follow normal naming conventions that people use when they aren't putting any thought into naming their variables.

That level of expression is arguably not creative enough to support copyright, at least in the US after Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991) [1].

(I'm assuming that the human didn't do anything stupid, like reproduce the comments too).

[1] https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R....


I think FISR is one of the few algorithms that I would actually consider creative enough to meet the creativity requirement. It's counterintuitive math that I would think the vast majority of programmers would never be able to come up with. It's an elegant bit-twiddling algorithm that requires one or two blog posts to truly understand; it's not something you read and think "oh, that makes sense, moving on".
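
For reference, this is the trick being discussed, ported to Python from the widely circulated C original (the port is illustrative; the counterintuitive part is the famous magic constant 0x5f3759df):

    import struct

    def fast_inverse_sqrt(x):
        # Reinterpret the float32's bits as a 32-bit integer.
        i = struct.unpack('<I', struct.pack('<f', x))[0]
        i = 0x5f3759df - (i >> 1)            # the magic constant
        y = struct.unpack('<f', struct.pack('<I', i))[0]
        return y * (1.5 - 0.5 * x * y * y)   # one Newton-Raphson refinement

    print(fast_inverse_sqrt(4.0))  # ~0.499, vs. the exact 1/sqrt(4) = 0.5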

Algorithms for generic mathematical operations such as the dot product or matrix multiplication are often trivial to deduce, though optimized vectorized versions perhaps less so. Most helper functions are unoriginal enough that no reasonable copyright law would protect them, which is also the case for (too) many instances of patented code.

The copyright question does ignore the code license question, though. If a complicated algorithm like FISR is not original enough, then what protects any boring old operating system code? What stands in the way of publicly hosting Microsoft's leaked sources, as clearly the code is all quite trivial? There is very little in an operating system that other operating system developers haven't thought of or wouldn't reasonably have come up with had they been constrained to the same restrictions.

The variable names are one thing, though they could be chosen much more descriptively. However, the system also output the comment "// what the fuck?", which is not only terribly nondescriptive, it's also something the system couldn't have come up with if it had learned from code in any practical form.

The suit you linked is about the difference between information and creativity. However, that case surrounds a data set, something simply factual, rather than a composed piece of information such as code or a book. Code listed on Github is not similar to the listings in a phone book. If it were, all software copyright, proprietary or otherwise, would go down the drain. I think that's impractical to say the least.


Algorithms are patentable, not copyrightable.

FISR could have been patented (and would be in the public domain by now anyway), but only its specific implementation in Quake III Arena is covered by copyright.

Also, your argument follows a composition fallacy: emergent properties exist, and thus you cannot simply say that because each individual piece of a whole is trivial, the whole is trivial. Heck, software pretty much by definition goes against that. For relevant precedent, there is no shortage of information that becomes classified when in aggregate. Knowing where a certain piece of infrastructure is isn't likely classified, but knowing where all the strategically important pieces of infrastructure are certainly is.

Which is why the question isn't whether the users of Copilot are infringing someone's GPL (they'd likely have a solid defense based on the individual piece not being sufficient to hold copyright protection), it's whether Copilot itself constitutes a derivative work of its input data, which it consumed as whole (copyrighted) works.


I'm curious as to what distinction you draw between rehashing patterns and grasping a concept.


That's a philosophical question that nobody can know a definitive answer to.

Personally I'd say the difference is understanding why a certain pattern works rather than blindly inserting whatever works. It's the classic Chinese Room thought experiment.


Just reading certain code is enough to taint a human programmer though. Some companies have policies against hiring developers with experience on some OSS projects because they have their own clean room implementation they want to protect.


> Some companies have policies against hiring developers with experience on some OSS projects

Can you please elaborate on this?


They're basically following this process to build their products: https://en.m.wikipedia.org/wiki/Clean_room_design


Season 1 of Halt and Catch Fire


Right. I don't see any way this is legal.


Never heard of a single one. I bet you just invented it.


Take the Windows NT source, train your local version, deploy it on the internet, and advertise that it does Windows code completion.

Then wait for Microsoft's lawyers to give you the answer to your original question.


On your tangential note: I always assumed many on the FLOSS side are actually against most applications of copyright to software, but since it is the regulatory standard, they put a strong emphasis on making it work for their purposes, thus the somewhat ironic "copyleft". It's a "don't hate the player, hate the game" situation for them.


This is definitely the orthodox take. If shared source code was the norm and software wasn't subject to copyright (or really if either of those two conditions were met), there'd be no need for FOSS as an ideology. The purpose of copyleft is to ensure that there's a permanent bulwark against code meant for the commons being co-opted by proprietary software vendors and having changes walled off from the community who created the software in the first place.


Source code is essential to FOSS; a public-domain, binary-only copy of Microsoft Windows definitely would not be FOSS. This is the second item of the Open Source Definition.

https://opensource.org/osd


Sure, that is a useful condition and is a no-brainer to add if you need to leverage copyright anyway.

But would it be enough to spur the open source movement on its own if you could legally decompile all binaries and redistribute that? Probably not.

It's not like source vs. binary is a clear distinction - between code obfuscation, generated code, transpilation, etc., there is a lot of wiggle room in what should or should not be OK.


The GPL makes it a pretty clear distinction: "preferred form for modification" is pretty clear, but decided on a case-by-case basis. Obfuscated code is not source, generated code is not source, transpilation is often not source but could be depending on how you use it afterwards, bitmap images are often not source but they can be, executables are usually not source but could be, videos are not source but could be. Some links discussing what source is here:

https://www.inventati.org/frx/essays/softfrdm/whatissource.h... https://b.mtjm.eu/source-code-data-fonts-free-distros.html https://wiki.freedesktop.org/www/Games/Upstream/#source https://compliance.guide/pristine https://opengameart.org/forumtopic/source-required-for-art-l... https://wiki.debian.org/rly-free-software


People on the FLOSS side are for software freedom, copyleft is just one of the tools we can use within the current regulatory framework of copyright. If copyright ever went away, we would have to use different tools but would have different opportunities too.


Those people these days are a vanishing minority. This is not the early-00s anymore.

The reality is that, nowadays, the overwhelming majority of developers touches FOSS code every day and just assumes they're entitled to use it as they see fit. The folks that came up with "copyleft" or care about licenses, are very much not in the driving seat. Blame FAANGs and their hatred for GPL.


I think the problem goes a bit deeper than that. From an IP perspective, I think it's reasonable to consider that training an AI on some form of work is using said work to build a new one, just like it would be if it was manually copied in or reproduced.

The problem is that, iirc, GPL didn't consider this at all and still uses language focused on copying code, so something like copilot might slip through the cracks of those definitions.

Then again, the license uses this language when it allows usage of the code in the first place, so one could say that either a) this usage is covered by the license, in which case all conditions apply, or b) it is not covered by the license, in which case... github wouldn't be allowed to use the code at all.

To give an analogy: I think feeding code into an AI is essentially analogous to compiling the code. A machine turns it into something more usable and the original human-written content isn't part of the result anymore, but the intellectual property gets dragged through the process nonetheless. Why would it be any different just because the mechanism of transforming the code into executable software gets a bit more complicated through the usage of AI?


> Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS?

It literally can't do it "in a non-infringing way", because it wasn't made to do it "in a non-infringing way".

People were able to get copy-pasted code verbatim. That means it does not know whether what it produces infringes the GPL or not.

Let's say you find a human who never knew anything about copyright, you show him a bunch of Disney movies, and you ask him to make you a movie, and he literally copies one of their movies. Does that make it non-infringing? (The funny thing is, even people aware of copyright infringe it... so yeah, it's hard to say even a machine could make only non-infringing content.)

The solution would be to at least make it aware of copyright and work with that, but first, is that even possible, and second, is it even enough...

Sadly nothing will ever be done, at least not until we feed it Disney movies and it starts to affect their bottom line.


> On a tangential note, I always find the discussions surrounding FOSS licenses and copyright rather amusing in a sad way. There's a certain kind of entitlement a lot of people feel towards FOSS that they certainly do not express towards proprietary software and I imagine this a great source of the resentment and burn-out FOSS maintainers feel.

Definitely. Many of my acquaintances complaining about Github Copilot without trying it themselves regularly pirate movies, shows and music. They also always cheer if there is some court ruling against Facebook or Google, no matter what the actual case is even about.

> The question is ultimately going to come down to - "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"

It seems to me that the regurgitation only happens if you paste in the first half of the code, expecting the second half. I imagine that the software sees how several hundred repositories (which are all forks) have a very similar pattern and gives you the best-fitting approximation of how they continue, which is again very similar.

In the future I can definitely see Github updating their license and some kind of exodus by FOSSers towards GitLab. But I believe that many open source projects will just put up with it, similar to how Youtubers and Twitch streamers want to stay on the premier platform.


An interesting (though not conclusive) test of whether Microsoft is confident that Copilot is not copyright-infringing is whether they’d be willing to release a mode trained only on proprietary Microsoft code.


I was wondering about something similar. Can we get Copilot to come up with some non-free algorithm under copyright from another big company (with lawyers), e.g. Oracle, Microsoft, Google, etc? This is a little difficult because it would need to be non-free but public, be specific enough that we can recognize it, and I think Copilot has protections against outputting code verbatim (but it could be made to output code with variable names changed or similar).

That is probably the way to kickstart a legal discussion about Copilot.


I really like this idea


People keep on talking about the GPL in these cases, but there’s absolutely nothing whatsoever special about the GPL: any code that is not public domain (or under a public-domain-equivalent license) is equally affected. Any mention of the GPL is a red herring.

Copilot depends completely on the legal theory that it is effectively exempt from copyright under the fair use doctrine; if that legal theory falls apart, the entire space (and a lot of other machine learning stuff) is utterly doomed.

Will it, won’t it, should it, shouldn’t it? Dunno.

(And when people say that it should just say what license the code it generates is under and what attribution or similar is required: Copilot can’t tell whether it’s reproducing copyrightable chunks of code, or indeed where what it produces came from, by the very nature of machine learning techniques. The whole verbatim reproduction issue demonstrates this—they’re trying to avoid such reproductions, which a cynic might say is because it weakens their fair use claim, but it’s not easy to do.)


The GPL license prohibits building a competitive solution from the licensed software, and 'derivative works' also require the GPL license.

So I think it is relevant here because there's a gray area around whether training a model is like linking to GPL-licensed software (not derivative, with caveats) or deriving from it.

By the way, free and open-source software licenses are not public domain (or "public-domain-equivalent"). The copyright holder of the software licenses it to whomever, but the holder still retains their copyright.


You’re missing the point: Copilot depends on being exempt from copyright restrictions, so that the license, any license, is irrelevant. Simplified, copyright law says “the creator/rights-holder owns the thing and you can’t do anything with it unless they let you (grant you license), or one of these general conditions holds”, and Copilot is not using your code under rights-holder permission, but under the general condition of “fair use”.

If the fair use doctrine fails and the license is relevant, there’s still nothing special about GPL, because almost all licenses would be being violated in some way (most commonly starting with attribution requirements). In this situation, Copilot will certainly be discontinued immediately.


I see what you mean. It'll probably come down to how the AI is trained then because Copilot reproducing entire code blocks verbatim from the source material would not meet the criteria for Fair Use.


> ‘derivative works’ require the GPL license

This keeps coming up, but if you look at the text of the GPL the word ‘derivative’ literally never appears. GPL in fact explicitly exempts code that is accessed over a web service from needing to be shared, as is the case with copilot.


It has the same concept:

To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work.

A "covered work" means either the unmodified Program or a work based on the Program.


Copyright law holds that derivative works don't require copyright permission. So again, copilot is in the clear. Copilot users may not be, but that's up to them to determine at time of commit.


Derivative works don't need to be referenced in the GPL as it's a concept from copyright law. See https://en.wikipedia.org/wiki/Derivative_work


That just further supports the claim that copilot is in the clear - it is clearly a separate work with many underlying works.

Of course the code copilot generates may violate GPL, but that’s up to the tool’s wielder to determine. Just as it is when searching for code on the internet, consulting books, recalling past knowledge, etc.

I don’t even use copilot (I had early access and discovered programming languages are better at unambiguously encoding logic than English, go figure). I’m just sick of all these supposed craftsmen blaming their tools rather than holding themselves accountable for what they commit.


Copyright law doesn't work that way. I'd bet this is going to be litigated to a final decision before anyone can say with certainty if/when a NN is violating copyright (i.e. via the training data) or not.


Copilot is not a competing solution; it's a knowledge base about text, like an encyclopedia. As for the snippets it produces, those might be copyrightable if they pass the copyrightability threshold. If it provides you kilobytes of text at once, that would be bad. A middle ground would be Copilot tracking how much code under incompatible licenses it has pasted and stopping at, say, 200 LOC.
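
In code, that middle ground might look like the following hypothetical guard (a Python sketch; the names and the 200-line budget are illustrative, not anything GitHub has announced):

    class LicenseBudget:
        # Hypothetical: cap the total lines of accepted suggestions that
        # were matched to incompatibly licensed training code.
        def __init__(self, max_lines=200):
            self.max_lines = max_lines
            self.used = 0

        def allow(self, suggestion, matched_incompatible_license):
            if not matched_incompatible_license:
                return True
            lines = suggestion.count('\n') + 1
            if self.used + lines > self.max_lines:
                return False  # budget exhausted: stop pasting such code
            self.used += lines
            return True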


GitHub wants to treat the GPL as special in this case. They chose not to use proprietary code for training copilot, for the obvious reason of getting sued by the companies that use github, and instead bet that using GPL and other FLOSS-licensed code won't cost them more than what they earn by developing copilot.

Copilot is thus completely dependent on this economic bet.

It is for those reasons that we could not write a "Cosinger" or "Comusician" trained on music found on youtube. It would be sued into oblivion the first time any 2-3 notes could be linked to a specific copyrighted song. If copilot survives long term we might see a similar project trained on Creative Commons music, including CC-NC no-derivs, but music labels might own a few of those and their guns would be quite large.


> Copilot is completely depending on the legal theory of being effectively exempt from copyright, under fair use doctrine; if that legal theory falls apart, the entire space (and a lot of other machine learning stuff) is utterly doomed.

That really depends on the country.

For example, Japan has a law[0] that allows the usage of any copyrighted materials for machine learning and other data analysis. You can also do it for commercial purposes. There are some limitations (you can't share the dataset itself, but you can share the model), but overall it sounds good.

[0] https://storialaw.jp/en/service/bigdata/bigdata-12


IANAL, but as far as I see it, the case of copilot could be described as an ML model that will sometimes output parts of the training dataset itself.


Did you know that the first of the two configuration options for Copilot is an option to allow or disallow it to suggest samples that match public code?

https://imgur.com/D2DDuY8


I did not. That is a curious option to place, given how it would seem to weaken the fair use argument (since they’re providing a way of consciously allowing probable copyright infringement, rather than just treating the issue as inherent in the nature of learning, or a temporary bug that they’re steadily fixing). But still, to defend my earlier parenthetical claim that Copilot can’t tell whether it’s reproducing copyrightable chunks of code: this option is consistent with what I intended to convey in the paragraph as a whole, since they are developing extra bits around the edges to mitigate the issue, but it’s not possible for them to do it completely—it’s more like a game of whack-a-mole, fixing this class of undesirable reproduction here, that instance there, without causing too much damage to the perceived-legitimate output.


> since they’re providing a way of consciously allowing probable copyright infringement

I think the claim that it's "probable copyright infringement" is nowhere near proven.

GitHub likely gave that option to satisfy users' lawyers who might have a higher threshold for "clean room" implementations or "no open source". Not as any kind of implication of copyright infringement.

Fair use works as an argument for the usability of Copilot.

Derivative works, per US Copyright law, are not infringement, either.


I’m talking in the context of verbatim reproduction, which “suggestions matching public code” sounds an awful lot like. At that point, I think “probable copyright infringement” is a fair description.


Your work as a whole may be innovative.

is 'if err != nil {' your original work? Or is it 'commonly accepted knowledge' as a Go programmer?


It looks like that filter does an exact comparison ignoring whitespace. To be more effective it would need to ignore things like variable renames and trivial transformations (for(;;) becoming while(true) or whatever).

In other words we're getting into cheat-detection software territory, which sounds difficult to get right in general.
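
To illustrate the rename-insensitive half, here's a sketch in Python (purely illustrative): canonicalize identifiers to positional placeholders so snippets differing only in names and whitespace compare equal. A trivial semantic rewrite like for(;;) vs. while(true) already escapes it, which is exactly why this gets hard.

    import re

    KEYWORDS = {'for', 'while', 'if', 'else', 'return', 'true', 'false'}

    def normalize(code):
        # Tokenize into identifiers and single punctuation characters, then
        # rename every non-keyword identifier to a positional placeholder.
        tokens = re.findall(r'[A-Za-z_]\w*|[^\sA-Za-z_]', code)
        names, out = {}, []
        for tok in tokens:
            if (tok[0].isalpha() or tok[0] == '_') and tok not in KEYWORDS:
                tok = names.setdefault(tok, 'v%d' % len(names))
            out.append(tok)
        return ' '.join(out)

    assert normalize('count += step;') == normalize('total+=delta;')  # caught
    assert normalize('for(;;){}') != normalize('while(true){}')       # missed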


It seems like the configuration option offered should be "allow/disallow other users to copy my code without attribution" rather than "allow/disallow me to copy other users' code without attribution"


I'm not buying the argument that copilot is infringing copyright. If someone learnt how to program via reading open source projects, you don't get to claim that their future work is derivative.


CoPilot didn't "learn how to program", it is reproducing blocks of code from other projects, some of which explicitly state that using their code requires attribution or other forms of acknowledgement. It is facilitating infringement of their licenses.


The line-by-line nature of CoPilot may make that difficult to establish.


You can open a sidebar (in both Rider and VSCode) which displays full blocks of code


That's a good point, but until it drops in whole blocks, I think the liability might lie more with the user. Kinda like all of Tesla's non-alpha driver assistance features where to use them the user has to opt-in and agree they will maintain full control of the car even when using these features.


> some of which explicitly state that using their code requires attribution or other forms of acknowledgement

US Copyright law states that fair use and derivative works are not infringing - and said law supersedes licensing.


> US Copyright law states that […] derivative works are not infringing

It says no such thing:

https://www.copyright.gov/title17/92chap1.html#106


This is only partially correct.

Since I have two complimentary months I decided to sign up, even though I'm not super thrilled with it (see previous comments). I was given two options:

1. allow code from public repositories
2. allow copilot to learn from my code

I disabled both of these options. Presumably I am now using an AI model which learns and suggests based on the context of my project.


It would be interesting to start verbatim copying some open source GitHub projects with these settings disabled and see if it magically knows what comes next (i.e., it does have prior knowledge of published code even with this turned off)


I'm not sure why my comment is getting downvoted.

I just gave it a test run. I have a function with this code:

  if (!card.IsFaceUp && !card.IsBlocked)
  {
    FlipTableauCard(card);
    card.SetIsBlocked(false);
    break;
  }
I then added this comment afterwards:

  // if the card is face up, flip it
And this is what copilot produced:

  if (card.IsFaceUp)
  {
    FlipTableauCard(card);
    card.SetIsBlocked(false);
  }
I'm pretty positive that is code generated based on my comment and the surrounding code.


> Presumably I am now using an AI model which learns and suggests based on the context of my project.

The "allow code from public repositories" doesn't do what you think it does. All it does is add an extra filtering step to avoid producing code found in it's training set verbatim. The model you are using was still trained on those repositories, it's not limited to your project.


Thank you, I am aware it was trained on those repositories. I am not OK with this business model.

But my comment still stands. You can turn off the verbatim copying feature that people keep talking about and the "AI model" will generate code based on your own codebase.

When I'm using it with Unity or a JS project that has NPM modules, does it use those as context to fill in some code as well? No clue.

Was it trained on open source code and is that ethically and legally shady? Yes.

Is it copying verbatim at this point? No.

Does it help me be a better programmer and will I pay for it? No and only if I forget to cancel my trial subscription.


> i'm not buying the argument that copilot is infringing copyright

I wrote some code and released it under the GPL, and my only expectation is that if you use my code in your product you make the source available to users (and the GPL does require you to tell the user how to get that source code). That one small requirement, the one thing that I'm asking you to do if you want to use my code, is not being respected by Copilot. It recommends my code, obfuscates it, and does not tell the user where it was synthesized from, nor does it provide a way to get to the original source. From a certain point of view, Copilot could be seen as a willful infringement machine. It will be interesting to see how this gets sorted out.


Except, y'know, when it regurgitates copyrighted code verbatim[0] which is not even derivative but just straight up copyright infringement.

[0] https://twitter.com/mitsuhiko/status/1410886329924194309


"I believe that all generally useful information should be free. By "free" I am not referring to price, but rather to the freedom to copy the information and to adapt it to one's own uses ... When information is generally useful, redistributing it makes humanity wealthier no matter who is distributing and no matter who is receiving" --Richard M. Stallman

Ironic that those who generally purport to champion FOSS fail to understand that Free Software was all about defeating copyright. The GPL was meant to turn copyright against itself.


The GPL is a tool for defeating copyright, or at least mitigate the worst of copyright’s effects on software development. If the GPL is weakened, it will be less helpful.


As a society IMO we should be fine with this


Even down to whitespace?


Okay, but Copilot isn't a human. It's a computer program. If I wrote an algorithm that spat out copyrighted code with the variables renamed, it would be absurd to say that constitutes an original work. Copilot is much closer to that than to a human programmer.


IANAL, but my understanding is that this is not clear-cut. Clean room implementations aren't strictly required by law, but the legal standard is how similar the new work is to the copyrighted work. If an employee reads open source code and then writes substantially similar code, as copilot sometimes does, that could be found to be infringement by a court.

For this reason, I've worked at places that forbid employees from even reading open-source code. If we were having difficulty with an open-source component to the point where we needed to look at the code, we'd hire a contractor, explain the problem, and then they'd explain a solution, and all the communications would go through a company lawyer.


That's a strawman.

If you learn to write novels by reading other authors, is that a crime? No.

If you reproduce their work, sometimes word for word, yes.


Take your example, but make it more to the point. I hire an anonymous ghostwriter to produce a novel for me based on some story premise that I made up. This ghostwriter decided to use a bunch of copyright-protected sections in their draft, because reasons. I think the ghostwriter isn't committing copyright violations, but when I publish the book, I almost certainly am.

I'm less worried about MS getting sued for this, and I'm approaching 100% certainty that users are opening themselves up to legal exposure. I can't see any legal department saying go ahead with using Copilot code, but by all means ask.


> a bunch of copyright protected sections in their draft

It depends on what this "bunch" means. It's not clear cut at what level of granularity the copyrighted parts become so small, and the sources so many, that the new work is considered transformative.


It has been shown to have reproduced code verbatim, including comments. If that code falls under a license that has requirements or restrictions, you are infringing. The problem is, how do you know when this is occurring or when copilot spliced many things together to create something sufficiently new?


> If that code falls under a license that has requirements or restrictions, you are infringing.

I think the interesting legal question here will be, are *you* (the user of the service) infringing, or is *copilot*?

I suspect Copilot's legal team has already worded the license terms such that they're passing the buck onto you.


You are correct.

Copilot is a tool, much like the "copy" command.

If you choose to use what it's suggesting, then the fault is completely yours.


> I suspect Copilot's legal team has already worked license terms such that they're passing the buck onto you.

it would seem reasonable that copilot should not be liable for anything that a user instructs it to do.


That’s a good question.


the part that I haven't seen decided yet is, how much do I have to copy before it's infringement?

maybe my co-pilot reproduces code verbatim from your GPL'd project because you and a dozen other developers all copied the same solution from stack overflow.


Remember that the SCO vs. IBM court case was over a very small number of lines of code, and if memory serves they only lost because it was deemed trivial. Triviality might be correlated with a low number of lines of code, but it's certainly not a given.


If I spent hours looking at the Linux kernel source, then wrote a kernel that had a lot of the same ideas and idioms, that would indeed be considered infringement.

Some open source developers are not allowed by their employers to read source code with a different license for fear of infringement.


Independent recreation of the allegedly infringing work is an absolute defense.

If you were to write all those ideas and idioms down and pass them to someone who had not seen the Linux source code, who then used it to reimplement similar functionality, neither of you would probably be guilty of copyright infringement. (there are still patents of course). https://en.m.wikipedia.org/wiki/Clean_room_design

Copyright doesn’t protect programming idioms and concepts. It protects against verbatim copying, more or less.

It all comes down to how you characterize what Copilot does. We will just have to wait for new caselaw or even legislation that accounts for autonomous systems in defining legal wrongdoing.


IANAL, but I don't see a double-ROT13 "machine process" as any defense against demonstrably identical code. Just because it passes through a process doesn't make it clean. There were numerous examples of whole functions and code blocks repeated verbatim from its source material. I'm not sure of the state of the application now, but the effort to prove an output snippet isn't copyrighted with restrictions by its source feels like an O(N) problem that nobody would ever want to do.

At best, if copilot told you explicitly (by doing the very hard work of identifying the likely sources of the code output), you could make some (more) informed decisions as to whether it's worth the risk to include it.


> If you were to write all those ideas and idioms down and pass them to someone who had not seen the Linux source code, who then used it to reimplement similar functionality, neither of you would probably be guilty of copyright infringement. (there are still patents of course). https://en.m.wikipedia.org/wiki/Clean_room_design

This response doesn't relate to the example provided by CameronNemo, as it's a different scenario.

At any rate, there is no clean room because copilot has "seen" the literal source. It is not comparable to clean room implementation in any fashion.


A training program reads source code and produces a model.

The software running the model does not have access to the source code.

As the parent said, it depends on how you characterize it, which is why this will be decided by whoever can afford the best lawyers.


Well, the problem with that characterization is that it's been shown to be false already.

Copilot has reproduced entire functions from existing codebases, which invalidates the idea that it doesn't have access to source code.


I would have to understand language transformer models better to back up my argument, but my impression is that copilot is not sitting on top of a SQL database ripping the most likely line of code out of a row of a table; rather, it is a lossy compression that happens to be able to reconstruct some data better than others.


Thanks for bringing up clean room design; I think it's a good analogy. The model, though it may reproduce verbatim source code, does not contain the original source as such, right?

But if that were the case, one could get away with copying music by merely compressing it - I am not copying the data, I have a totally different set of data that happens to get decoded into a similar performance.

That's roughly analogous to a machine learning model, isn't it? Compressing an enormous dataset into a "model" that is capable of being decoded in myriad ways depending on context.
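
The lossless version of that intuition is easy to demonstrate; a Python sketch (the quoted text is just a stand-in for a copyrighted work):

    import zlib

    work = (b"It is a truth universally acknowledged, that a single man in "
            b"possession of a good fortune, must be in want of a wife. ") * 4

    stored = zlib.compress(work)

    # The stored bytes are smaller and contain no legible run of the text...
    assert len(stored) < len(work) and work[:40] not in stored
    # ...yet the work is recoverable from them, bit for bit.
    assert zlib.decompress(stored) == work

Whether a lossy, many-source "compression" changes the legal analysis is exactly the open question.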


I'm not buying the argument that copilot is "learning to program". It's doing nothing more than rote memorization and recall.


Some cultures unironically use this approach to educate humans.


"Some cultures use rote memorization to educate humans" != "anything educated via rote memorization necessarily has developed a human-equivalent understanding of the material"


I do consider copilot to be infringing on my AGPL code.

That's not a fundamental statement about all machine learning systems. GPT-2 did a lot more direct regurgitation than GPT-3. GPT-3 tends to be much more transformative, but does still sometimes spit out code / text verbatim.

Copilot and Codex spit out code close enough to my own that it's clearly creating a derivative work, at least by my read.

This is untrodden legal ground, but I think that a lot of this comes down to issues of reasonableness. The reason I used the AGPL license was to create a commons. If copilot played in some reasonably friendly way around that commons, I might not feel bad about it.

However:

1) Copilot wants me to pay to use something derived from my own code, where I stuck a license designed precisely NOT to be in that position.

2) Copilot provides a competitive advantage to proprietary projects who are more likely to be able to afford it, over open-source / community ones. The reason I used an AGPL license was because I thought we needed this type of code to be open and transparent. I work in a domain where transparency is essential (I don't disclose domain, but you can think of transparency in government, education, voting, medical, police, etc.)

3) I have no way to have a conversation with anyone at github / Microsoft. They took my stuff, and they won't talk to me about how they use it. It's automated systems all the way down.

4) The whole OpenAI nonprofit -> for-profit transformation is just sleazeball. Given all the talk about ethical use of AI, something like this really leaves a sour taste in my mouth. I don't mind DeepMind, FAIR, etc., since they're honest about their goals. OpenAI feels like a Silicon Valley get-rich-quick scheme with a lot of nice marketing copy and legally-questionable tactics.

Juries, judges, and developers are swayed by common sense. People like me can be swayed to testify one way or another based on whether we feel cheated. What Microsoft / github / OpenAI did here wasn't very reasonable, friendly, or sensible.

TL;DR: I support the concept of co-pilot in essence. The specifics here feel illegal and sleazy.


> They took my stuff

I was about to admonish you for phrasing it this way when we all uploaded code willingly to github, giving up certain rights according to the ToS, but then I remembered microsoft straight up bought all of github, so "took my stuff" is pretty accurate. I would be interested to see a diff of the ToS since the purchase.


It's more complex than that.

A lot of code on github (albeit not mine) is uploaded without the original party's agreement. Richard Stallman doesn't use github, but a lot of his GPL-licensed code has been incorporated into projects hosted there. If the terms-of-service allowed github to violate GPL licenses, I think most projects would need to migrate to gitlab. It'd be nigh-impossible for project authors to know that no GPL code in their project came from someone who did not have a side-license with github.

Even if that argument fell apart somehow, their terms-of-service state (https://docs.github.com/en/site-policy/github-terms/github-t...):

    This license does not grant GitHub the right to sell Your Content. It also 
    does not grant GitHub the right to otherwise distribute or use Your Content 
    outside of our provision of the Service, except that as part of the right to 
    archive Your Content, GitHub may permit our partners to store and archive Your 
    Content in public repositories in connection with the GitHub Arctic Code Vault 
    and GitHub Archive Program.
github is now selling My Content. To add insult to injury, they're trying to sell it back to me!


If I pay a monthly fee for a search engine, and the search engine displays snippets of copyrighted work published online, is the search engine re-selling that content?


Yes.


why isn't copilot being trained on microsoft source code?


This would make a great argument in court if this ever is legally challenged. The only real argument would be because it could lead to lost income for Microsoft. In the UK, derived works which can be shown to harm the original work’s economic viability are much more likely to be seen as an infringement.


Because the 'snippets' would then be several pages long.

/joke


How do we know it hasn't?

Has anyone tried taking source from leaked copies of old MS code and tried to get copilot to reproduce it?


Whenever it comes to things like legal issues or regulations, people take such pride in being someone who understands their intricacies that the larger picture is missed. We should think about what outcome we actually want.

While technically copyright can be claimed on any snippet of code, it's sort of ridiculous to suggest the snippets that Copilot generates have any protectable value.

Practically Copilot is saving you the labor of writing fairly trivial but tedious code. I work on open source full time, there's no random snippet of code that I've written I'd feel upset about if someone copied.

Licenses on software are mostly performative.


> Licenses on software are mostly performative.

I seriously doubt you actually believe this, at least in an equal way. The code to Windows XP is available on the 'net, but I can't just go and compile that and start giving out copies without serious legal repercussions.

Copilot is "copyright for me but not for thee" and it's bullshit.


Copilot isn't automagically going to give you the Windows XP source code though, or anything even remotely close to a full project; it seems that even a full working/compiling function is asking too much.

It generates what an intern code monkey would generate after reading a few Stack Overflow posts and a few GitHub repos.


Yep. But if it regurgitates more than a few words, that's still plagiarism and probably a violation of copyright - or at least it would be if we did it with MS's code.

Again, it's copyright for them, but no protections for us.


> Copilot isn't automagically going to give you the windows XP source code tho

I mean, it might. That's the whole concern. It's scraping random bits of code from all over the place.


> Windows XP [...] I can't just go and [...] start giving out copies without serious legal repercussions

1 month ago, HN front page https://news.ycombinator.com/item?id=31458635


From the about page:

"Q: Why is nothing from the Windows XP source code leak added?

A: Even though Microsoft has only taken down a few Windows modifications, they will most definitely take Windows XP Delta Edition down if there is a reference to the source code inside it. The Windows XP source code is illegal to download, fork, and redistribute, so nothing from it will ever be added."


Nice, TIL. Thanks!


I think "performative" and "random snippet" are key here.

But let's think of an example that may push the line. What would happen if someone wrote a closed-source "Linux" kernel using the Linux kernel interface as a stub, and filled in the code using Copilot? You'd expect some of the generated code to come from Linux, since that's the best code for the stub. Linux's use of the GPL is not performative, and this could create multiple instances of copied code.

But for everyone else who isn't building a closed source replacement of GPL software, I have a hard time thinking you'd be impacted.


The 'inverse square root' Carmack trick is hardly trivial.


So luckily no one owns the idea that you can use both Newton's method and floating-point bit-hacking to produce a good estimate of a square root.
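
(For reference, the idea itself fits in a few lines. A sketch in Python rather than the original C, using the well-known magic constant; it approximates 1/sqrt(x), and sqrt(x) is just x times that.)

    import struct

    def approx_rsqrt(x: float) -> float:
        # Reinterpret the float32 bits as an integer...
        i = struct.unpack("<I", struct.pack("<f", x))[0]
        # ...use the bit hack to get a rough initial guess...
        i = 0x5F3759DF - (i >> 1)
        y = struct.unpack("<f", struct.pack("<I", i))[0]
        # ...then sharpen it with one iteration of Newton's method.
        return y * (1.5 - 0.5 * x * y * y)

    print(approx_rsqrt(4.0))  # ~0.499, vs. the exact 0.5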


So if a similarly non-trivial piece of code was in a non-public-domain codebase and copilot copied that wholesale, then you'd have a problem with it?


So your position is that you don't respect copyright in the first place, but your argument is that Copilot is effectively a case for fair use.


I wrote a whitepaper refuting GitHub's arguments that Copilot is fair use: https://gavinhoward.com/2021/10/my-whitepaper-about-github-c... .


I feel like the term "whitepaper" has lost all meaning. Does it just mean "thing typeset with latex" now?


It's just the term that the FSF used when I wrote it for them. For me, it means "academic paper."


I thought the distinction was: "white papers" are not necessarily peer-reviewed by an academic journal's editors; white papers tend to be internally-made documents within an organization, edited by that organization's members, for the purpose of knowledge/information.

Academic papers as a whole can include both "white papers" and "peer-reviewed journal articles" aka the "papers" that many think of. I would surely put a "white paper" onto my resume if I thought it was a great piece of work.


TIL, thank you. I think my paper is a whitepaper in that sense because it's not peer-reviewed, but it was meant as an academic-like paper.


"Whitepaper", for a long time, meant sort of the opposite of "academic paper." They weren't published in refereed conferences and journals and were usually published by corporations or governments but they resembled academic work in their contributions.

Something like the original BTC paper is a good example. It wasn't published in a conference or journal, but the level of rigor and the scale of the contribution is similar to what would be published in a conference or journal.

I think the crypto community was the end of it. Now "we are launching $CoolCoin" gets called "whitepaper."


That makes sense. Thank you for teaching me that.

I would still call my paper a whitepaper, though. While it's not rigorous, it's more about law where rigor doesn't apply in the same way. And it is not peer-reviewed, although it was reviewed and rejected by the FSF.


GitHub should not be sued for training on the data, but anyone using it should be liable for any copyright infringements it generates. That would effectively make it useless for business use cases, but it should be until the models understand copyright and plagiarism, which they do not yet.


If Microsoft asserts and represents that their tool doesn’t generate copyright-infringing code, then surely Microsoft is the party which should be liable, rather than the poor unlucky programmer who was lied to by the billion-dollar corporation’s marketing agency?


> surely Microsoft is the party which should be liable

unless microsoft is doing work-for-hire for you via copilot, i highly doubt they are liable.

You, as the person who is claiming to have produced the work (even though you were using a smart tool to help), must be the person who also is liable. Otherwise, could you not claim that the auto-correct on your word-processor is liable for copyright infringement?


An interesting exercise would be recreating an entire program/tool that is GPL using Copilot and releasing it under a less restrictive license. Could one argue that the effort put into cobbling together a knock-off is enough to constitute an original work?

Running a copyrighted movie through a neural network compression algorithm and uploading it on bittorrent isn't going to stop you from being sued. Even if the output is produced by an AI.


That's probably the right distinction.

If copilot allows you to type

// source code of linux kernel

And you get the whole code, then I would consider it unoriginal

The same way your movie example, if you told it

Avengers Endgame

And it gave you the whole movie, it would also be. But what if you type (like with DallE) Spiderman fighting Thanos and you get something different, but that resembles some Endgame scene. Would that infringe copyright, be fair use, or what?


Try to publish anything with a character looking like Spiderman, Thanos, IronMan, Captain America and see what Marvel does to you…


There are many videos like that on youtube and they seem to be doing fine

For example

https://www.youtube.com/watch?v=u8LvcJoAnms


The law isn't stupid. This is one of the advantages of having humans in the loop. People can and will say "I know what you are doing, knock it off."


A lot of people are asking "is this legal?", but to me the more interesting question is "should this be legal?". And I would like to debate the advantages and disadvantages of this.

If we go back to the very basics, the purpose of copyright is to reward and incentivize content creators.

The purpose of fair use is to allow certain uses of a work that are beneficial to society, and don't substantially interfere with rewarding content creators. Consider for instance a parody. The parody does not substitute for the original work, so it will not interfere with the reward.

Looking at Copilot through this framing, we can ask whether it will interfere with rewarding creators. If Alice writes a program, will Copilot trained on that product allow Bob to create a competing product more easily, essentially freeloading on Alice's creativity? Yes, to some extent it will.

On the other hand there are also advantages of allowing copilot. It promises to make software creation easier, for Bob, but also for Alice.

Now, as this is trained on code that is posted publicly but not on private code, it will advantage private code over publicly posted code. This has the potential to harm the free and open source software movement. This movement is a positive in the world, so harming it should be considered a bad thing.

But really, the big question is how big these effects are, and no one seems to have a good answer to that.


>>a form of laundering open source code into commercial works

If it is nothing but a glorified autocomplete that inserts SHORT snippets of code, as in a quote from a book in a book review, or the above quote, that's fine. But when it is inserting whole pages, that's beyond Fair Use, and basically a nicely scaled and laundered version of plagiarism.

Another key is how much it is modifying the code to suit the situation. Is it generating entirely new synthesized output like DALL-E 2 or GPT-3, such that the output is closer to generative creativity, or is it merely pasting in code blocks found in similar situations?

It comes down to the question of how transformative is the output, which is a key concept in copyright law.

>>“What is the difference between this and someone doing it manually? "

Another key question here. If it is merely cut/paste beyond a line or two, it's plagiarism, but if it is synthesizing new works, it's good. Same as manually: am I using your GPL work for inspiration to generate new works, or am I copy-pasting pages of code?

Does anyone have any extensive experience with Copilot to be able to highlight these differences?

EDIT: fmt, clarity


In my experience it's more generative. It responds to variable names in my code, formatting, patterns in how I'm writing (am I using ES6 features? async/await or raw promises? Etc). Granted it could still be copying, but when it's generating usually less than 5 lines at a time... I wonder if you could argue that all possible 5-line code snippets have already been written--with respect to abstract logic/variable relationships/patterns.

Also I think it's important to note that Copilot isn't an independent AI; it's a human in the loop system. In my experience I always make adjustments to the code it generates to better fit my needs. So on top of the transformation that copilot does, I'm also stacking transformation on top of that. So the final, potentially copyrightable product is very far removed from the training data.

Copilot's underlying model, OpenAI Codex, "is a descendant of GPT-3" under the hood. I think code is unique in that unlike text and images, there's much less variance in code. It's not as expressive as English, or visual art. So I think there might be a higher chance it'll generate something similar to its training data--but only because all code is inherently more similar.


Thanks - very helpful!!

Good points about you adding transformations on top of the Copilot output, and about less variation in code due to its structured nature.

Interesting also that it responds to variable names and patterns in your writing/code. Does it also respond to comments?

Your answer points me more towards the not-infringing argument. Perhaps the best solution would be a companion plagiarism-checker tool that examined your code against its training set of GPL/MIT-licensed code when you are nearly finished, to flag significant copying. Shouldn't be too hard, and it would avoid the whole problem (and also maybe sometimes point you to a library you should be using instead of rolling your own).
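
A minimal sketch of what such a checker could look like (the window size, the whitespace tokenization, and both function names here are arbitrary choices of mine; a real tool would need a much smarter index):

    def token_windows(code: str, n: int = 20) -> set[str]:
        # Every run of n consecutive tokens, normalized to single spaces.
        tokens = code.split()
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def flag_copies(generated: str, corpus: list[str], n: int = 20) -> set[str]:
        # Report any long token run in the generated code that appears
        # verbatim somewhere in the licensed corpus.
        index = set()
        for source in corpus:
            index |= token_windows(source, n)
        return token_windows(generated, n) & index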


Thanks for listening! The comment thread for anything copilot related is generally super polarised, and I really appreciated your calm comment and your calm response :)

If I understand your question correctly, it does respond to comments. One of the best ways to interact with copilot is to write a comment (eg "Read input until y/n is entered") and have it generate the resulting code. If you're asking if it matches commenting style, I think it does, but I haven't pushed it in that direction too much.

I think that's an interesting proposal, although it does place the onus on the developer. I would be curious to know how often developers would fail that check even without Copilot!

Overall I think there are definitely open questions, but I'm personally really excited by systems like GPT, Copilot, or DALL-E. The future is not clear, but I think these tools to some extent make the internet make sense. There's way too much data online to make sense of--be it code, text, or images. Unlike a search engine which just links to hopefully relevant material, these tools "learn" from all that data and respond with an answer of sorts--not a list of references. I think one huge improvement would be making these systems more explainable. So getting a certain response, you can also see the thousands of references that were used to generate that response. That would help a lot in providing transparency in whether there is plagiarism happening, and also just be an immensely useful tool for humanity and the internet. It feels like the next logical step for the internet. I would even say it feels like the internet was built for this!


Thx for the kind words and helping me get a better understanding! I think you are really onto something about GPT/Copilot/DALL-E being almost what the Internet was built for, or at least one of its fundamental growth stages - first communicate, then accumulate data/info online, then make it searchable, then start automatically "understanding" and synthesizing it... (with the quotes around "understanding" doing a lot of work).

And a good point you make about not relying on the programmers to run the plagiarism checks - the tool should do it itself.

It sounds like this is indeed closer to a copyright-OK generation of new code, rather than mere laundering; and if it isn't there yet, it seems like the copy/paste paradigm would be a juvenile phase, and it should improve and get more "creative" and less copy/paste-ish with further development.


EDIT: It took me a while, but I found proof of Copilot suggesting a full copyrighted algorithm. I then take back these arguments as I was under the assumption the tool couldn't do this: https://twitter.com/mitsuhiko/status/1410886329924194309

Old comment for documentation's sake:

Are licenses even enforceable by law? The idea of writing some mundane basic code and then wanting to sue someone for "stealing it" just sounds ludicrous to me. True copyright has barriers to make sure you actually invented what you're trying to patent.

That's not to say that there isn't a problem here -- there's definitely an ethical component to how this product works, but this whole code licensing thing never clicked for me. Does it hold any actual power?


Licenses are indeed enforceable by law. Most popular ones (GPL, MIT, Apache) have been enforced in court.

Also Copyright isn't patent.


> "Moreover, open source developers are already suffering burnouts because of gigantic multi-billion dollar corporations taking their free code and re-bundling it as a SaaS, hence, introducing this new feature takes even more from them than there was before."

The popular open source licenses explicitly give permission for people to resell your work, it's not even buried in the small print or anything. e.g.

GPL: "Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish)".

MIT License: "including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software".

Apache License: "each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work".

This should no more cause burnout than someone buying your old car and using it for an endurance race makes you tired. Where there's burnout involved it's more likely the demands for support and fixes that head back upstream without any associated money. More concerning is Copilot trained on code which isn't GPL licensed or similar. Sharing code doesn't automatically grant anyone any license to use it for anything at all.


Except in the case of the GPL license, your rights to reuse and redistribute the source code are contingent upon your derived work also being licensed under the GPL. This is not a hypothetical untested legalese requirement but a real legal requirement that's held up in the courts of several countries.

So yes, using a substantial portion of GPL code in your proprietary software product is copyright infringement. Or even using a substantial portion of GPL code and not licensing the code under an appropriate license - https://www.gnu.org/licenses/gpl-faq.html#WhatDoesCompatMean


From the MIT License:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

I'm not confident Copilot will comply with this part of my code's license.


It depends on the interpretation of "substantial portions". Classic challenge with legal documents.

Here's a discussion about that usage in the MIT license: https://opensource.stackexchange.com/a/2188


Me either, but that has little to do with burnout from other people making money from GPL licensed code.

(That a tool exists doesn't free one from responsibility for using it; CoPilot not citing original code and its license terms seems like it rules it out for anything beyond experimentation; that part is very arguable)


> GPL: "Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish)".

Yes, but if I use Copilot in a private codebase, how sure am I that it has not copied GPL code?


I think the risk is higher, but it's always been possible one of the developers would copy GPL code. There are tools that check your code to see if any GPL code exists. I haven't used them, but I suppose you could audit with those.


Imagine a situation where a chatbot AI is trained on a bunch of copyrighted novels. And then you ask it to start reciting portions of them. Should that be illegal? It doesn't seem like it to me.

If I were to turn around and resell it, then you could sue me; but that wouldn’t be a chat bot’s fault.

Just like it’s not the clipboard’s responsibility to ensure I’m not violating licenses, I don’t think it’s copilot’s responsibility either. Use it at your own risk.


Hm, you're suggesting that it ought to definitely be illegal if you sell the output of the chatbot trained on copyrighted material, but not if you give it away for free?

That is definitely not how US copyright law works (although you are welcome to argue that it should), that there is such a bright line around selling. It's possible for something to be a copyright violation even if you give it away for free (see torrent sites!), and it's possible for it to qualify as fair use and not be a violation even if you sell it.

Under USA copyright law (others are similar but not quite the same), the first step would be deciding if the output of this chatbot counts as a "copy" or "derivative work" at all. If it does not, then there is no copyright violation whether you sell it or not. If it does, then it is a copyright violation (whether you sell it or not) unless its use can count as "fair use". Whether the use is "commercial" is just part of one of four factors that are balanced to determine if it's fair use. For instance, if you are only using a tiny portion of the copyrighted work and it doesn't have much effect on the profits of the original copyright holder and the use is considered highly "transformative" too (sound like copilot?) -- it could well be fair use even if you are selling your output.

(Also... Github Copilot is literally selling it, right? You have to pay them to use Copilot! It would accordingly be considered a commercial use. If I make a copy of a hollywood movie and sell it to you, I'm probably violating copyright (unless I can convince a court it's fair use), regardless of what you do with it, it doesn't matter if you re-sell it or not. If you re-sell it or make another copy, you may be additionally violating copyright another time yourself.)

I do think there's a reasonable argument that copilot is fair use. It wouldn't mainly hinge on whether the output is sold or not. This is presumably the argument MS/Github would make if brought into court. Since Oracle v. Google, I've stopped trying to predict what courts will do on software-related copyright cases, the law seems to be pretty chaotic and the actions of courts unpredictable. (also i'm not a lawyer this is not legal advice).

In general, I am in favor of an expansive bounds of fair use, and think it serves "the people" to have such.


Corpus licensing is an existing issue.


The law is very big on intent. It's one thing if yeah, a corpus is full of copyrighted material, but is used for scientific, potentially humanity-advancement purposes, and completely different thing if it's being used straight for monetization while bypassing copyright holders.

A FOSS license doesn't mean it's all fucking freebies all the way down as any big tech company is quick to remind you (by attaching trademarks all over the place, for example), but if it's a big tech company taking your stuff and running with it, it's suddenly all fair game.

All in all, it's how companies have been behaving since about forever. Fuck little people all the way.


Again though, using something for scientific/scholarly purposes alone is not a 100% guaranteed defense against copyright infringement in the USA. It is one aspect of a four-part fair-use defense. You can definitely still be violating copyright even with a recognized scientific/scholarly non-profit purpose (and be engaging in fair use without it).

I have not heard what's already been discussed about the copyright legal issues around corpuses of Other People's Content; I am curious to read more if anyone has a link.

While I understand that in this case it seems like (or is!) a Big Corporation taking advantage of the Little People -- I would urge extreme caution in advocating for reducing and limiting "fair use" rights because in this case it will hurt the Big Corporation. It used to be clear to everyone that fair use helped the Little People against the Big Corporation copyright holders and should be encouraged. Lately, people (especially software devs, who of course write IP) have been excited about strengthening protections for copyright in order to somehow limit the Big Corporations (see all the stuff around Amazon and open source), but this is a dangerous game. Fair use is one of the only tools we have protecting us from a dystopia where you can't open your mouth or type on your keyboard without paying someone a license fee, which is what OTHER Big Corporations would love to see.


I don't know how blinkist gets away with it, honestly.


I'm not familiar with this, but would love to learn more. Links explaining what you mean by this and how it works welcome. Googling wasn't helping me.


It’s a good point. I was trained on a lot of copyright novels.

However I don’t quote verbatim which is what this tool has been doing.


How big does a snippet have to be to become verbatim?

Plenty of trigrams exist in many novels that are exactly the same, and I bet that there are plenty of n-grams in programming which would be construed as copies of each other.
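
Easy to see with a quick sketch: two unrelated snippets share plenty of trigrams without any copying having taken place.

    def ngrams(tokens: list[str], n: int = 3) -> set[tuple[str, ...]]:
        # Every run of n consecutive tokens in the sequence.
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    a = "for i in range ( len ( xs ) ) :".split()
    b = "for j in range ( len ( ys ) ) :".split()
    print(ngrams(a) & ngrams(b))  # several shared trigrams, zero copying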


I think you’d need a lawyer to answer that question. Which in itself is a problem.

The nature of using this tool could hang you in theory more than if you didn’t use it.


If I "trained" an "AI model" consisting of an executable that regurgitates its input 99% of the time on a Hollywood blockbuster and distributed it, that'd be copyright infringement. Probably still would be at 50% and 10%. So what's the threshold for you here?


> And then you ask it to start reciting portions of it. Should that be illegal

In several countries, humming a song - reciting a melody - is cut and dry de jure copyright infringement. It's just not rigorously punished.


I've seen this in reference to the birthday song but it is more like an urban legend.


I support any lawsuit against Microsoft. They deserve it.


I suspect Microsoft supports that lawsuit too.

Looking at the fact pattern in front of me, I can't imagine Copilot has gotten this far without Microsoft's lawyers being fairly confident that they have a compelling legal argument in their back pocket that not only protects their substantial investment in this project, but perhaps puts up a barrier on the limits of what can be claimed via a copyleft license.

This whole exercise screams "designed for win-win" to me. What that argument might be, I cannot guess.


I think a lot of people are focusing too much on whether an AI could ever be creative or transformative or not (FWIW I'm convinced they can't). But it's not relevant - a lazy human programmer could easily infringe by copying snippets of copyrighted code too.

If anyone shows an example of co-pilot automatically producing something too closely resembling copyrighted code that's too significant to be fair use, you've got a problem regardless of what technology got you there.

If that lazy programmer "toils" away behind a corporate wall (sadly too likely), you might never know what he's up to. But you've put his "work" on display and let anyone on the internet record his behavior with this thing.

At some point, co-pilot is going to be able to "suggest" something that's traceable and someone's going to sue. I think it will come down to whether what gets produced is copyrightable or fair use, not whether it was an AI or not that scraped and copied it.


And if they win, set a precedent that extends the copyright regime into entirely new domains? That would be a massive self-own. That would be one way to get me to vote for the Pirate Party, I guess.


I don't get it. If I spend some time reading through source code on GitHub and then produce code, how come I can be liable for that? Why do we hold humans and machines to different standards? My brain runs on neural networks too, and I can learn things from reading/observing. What is the difference?


If training Copilot was copyright infringement then essentially it'd make the majority of useful/interesting machine learning models illegal as they currently exist, and make it impossible for anyone except huge multi-billion corporations to train such models in the future, since only they'd be able to legally license enough data to do so.

Models like GPT-J or DALL-E mini could not legally exist anymore. So sure, please, make this illegal, regress the whole field of machine learning 20 years back, and make it something that only billion dollar corporations can do.

We should be striving to make copyright less draconian, not more.


Not necessarily, the issue here isn't that training on public code should be illegal, it's merely that the models trained on such data should be considered derivative works under the licenses which the code is released under. GPL code must also be released under GPL, so Copilot must not charge users to use a model that is trained on and regurgitates GPL code.

> We should be striving to make copyright less draconian, not more.

Agreed, and I'm not sure why you think forcing GPT and Copilot to respect open licenses will make them illegal instead of more open.


> Agreed, and I'm not sure why you think forcing GPT and Copilot to respect open licenses will make them illegal instead of more open.

I wasn't talking about Copilot. I was talking about the vast majority of other interesting models. It wouldn't make Copilot itself illegal because it was trained on explicitly licensed data (so it'd make it GPL-licensed), but it would make those other models illegal.

Take for example the GPT-J, which was trained on 825GB of data scraped off the Internet. If we assume the view that a machine learning model is a derivative work of its training data then that makes GPT-J illegal, because it was trained on a bunch of "all rights reserved" data, and there's no legal license under which it could be released. Most interesting models are like that.

> so Copilot must not charge users

That's not what the GPL says.


> Copilot must not charge users to use a model that is trained on and regurgitates GPL code.

I was with you until then. Charging for GPL code is perfectly within the licence as long as you make the source available.


Yea that's my mistake, but not really the point I was trying to make.


> If training Copilot was copyright infringement then essentially it'd make the majority of useful/interesting machine learning models illegal as they currently exist, and make it impossible for anyone except huge multi-billion corporations to train such models in the future, since only they'd be able to legally license enough data to do so.

Regarding the first part, I certainly wouldn't argue that training Copilot is copyright infringement. However, the code it spits out in its current state can in some situations be infringement.

Copyright infringement doesn't happen when you read War & Peace, it's when you take that and reproduce it verbatim or very close to it.

So to that point, your doom & gloom absolutist scenario could not play out if the product of the model was sufficiently different.

We saw Google sued by Oracle for infringement over copying APIs. Now imagine you're a not-Google sized company, are you going to take the chance that Copilot will spit out something that they consider copyrighted?

I think in terms of legal/business risk, it's just too high as it stands now.


> So to that point, your doom & gloom absolutist scenario could not play out if the product of the model was sufficiently different.

It would absolutely play out, because it is impossible to guarantee that such a model will always produce something "sufficiently different". These models are black boxes with billions of parameters. That's how they work. It's just as unrealistic as those politicians pushing through "lawful access to encrypted data" that'd effectively make strong end-to-end encryption illegal. There's no middle ground here. We either accept that such a model might sometimes output a snippet from its training data and benefit from the 99% of times it doesn't, or we can be copyright maximalists and ensure no one benefits. (Except ironically huge corporations like Microsoft, either because they can license the data for training, or because they have sufficiently well funded legal departments.)

> Now imagine you're a not-Google sized company, are you going to take the chance that Copilot will spit out something that they consider copyrighted?
>
> I think in terms of legal/business risk, it's just too high as it stands now.

This is a fair point, but you can say this about any other interesting model. Is that piece of text generated by GPT-NeoX-20B (which is a fully free and open model trained by essentially hobbyists) illegal to use because it might infringe on someone else's copyright? You don't know. And it was also trained on code from Github. Where are the posts calling for people to sue their authors because they're not respecting the GPL?

Here, I've just tried it and screenshotted it for you, spitting out GPL'd code: https://i.imgur.com/2T4uSJR.png

Again, this is not the Copilot. This is the free GPT-NeoX-20B model that anyone can download. The model's not under GPL, and yet it clearly "contains" GPL'd code. Anything which affects Copilot's legal status will also affect GPT-NeoX-20B, but even more severely since GPT-NeoX was also trained on a ton of "all rights reserved" data. So when you raise your pitchfork at Copilot you should also ask yourself the question - are you fine with also killing projects such as GPT-NeoX, or maybe a more lax copyright law is more beneficial to the society as a whole when it comes to machine learning?


> This is a fair point, but you can say this about any other interesting model. Is that piece of text generated by GPT-NeoX-20B (which is a fully free and open model trained by essentially hobbyists) illegal to use because it might infringe on someone else's copyright?

Again, you misunderstand. Copyright violations don't occur just by using the system, and as I noted, it's not a problem to use ML models. The actual output content is what matters.

In your example, yes, absolutely, if you used some editor feature that spit that code out verbatim, sans license adherence, you will be violating copyright. The issue isn't specific to copilot beyond the fact that Github has created and offered this model trained on source code and begun selling it as a product.

TL;DR: Is it illegal to use it? Of course not. Is it going to get you sued for blindly taking its output, packaging it and selling it? Possibly. If I were trying to manage the risk to my business, I certainly wouldn't be allowing developers to use it.


A lot of comments here keep drawing parallels to writing as for why this is right or wrong, but I think a more apt comparison would be music, where components like riffs or rhythms can be reused to make something wholly different. Many a musician has claimed IP infringement over another musician using similar melodies, but just like programming, if you look closely enough, you'll see that everyone is copying each other and there's not much that can be done to stop it.

Personally, I'm a bit bothered by this myself, but I'd be lying if I said I never once got any ideas by looking at the source code of a GPL project.


Ideas aren't copyrightable.


only ideas are copyrightable


Google "are ideas copyrightable".

Copyright protects the specific, tangible expression, not the idea or concept itself. So in your case, having "got ... ideas by looking at the source code of a GPL project" doesn't necessarily mean there is any violation of copyright. Employers will often require that employees haven't seen sensitive code at all (clean room) as it avoids any possibility of copying the code itself and provides powerful evidence if they're ever sued, but that's not a legal requirement, just a cautious practice. To read code, learn a concept from it and then apply that concept in your own code is fine (as far as copyright is concerned) as long as you're not copying the code itself.

I am not a lawyer and the above is not legal advice, ofc :-)


This article is from June last year. Has anything happened since then on this? Is anyone bringing legal actions against Copilot and Github? If not why not? I would think that if there was a case against it, that there are open source affiliated entities out there that would have a both the legal and monetary resources to go to court over this.

When I look at the discussion here, it looks like most of the arguments are based on common sense or morality. While that is nice, I would have loved to hear a perspective on this based on law instead. Is there someone here with a more legal perspective/background that could comment?


Absolutely! But this article may focus too much on ethics.

In practice, you usually need to demonstrate that you were harmed by an action to bring it to court.

So we would need the author of a GPL'd repository to sue, with proof that copilot copied and pasted their code.

The moment that happens, MS would simply omit that single repository from the copilot dataset, tell the judge that they fixed a bug in their "content filtering algorithm", and have the case dismissed. They might also ban the developer's account for good measure.


> In practice, you usually need to demonstrate that you were harmed by an action to bring it to court.

I think this is true for copyright infringement damages. But is a GPL violation a copyright or a contract issue? And unlike copyright, the fix isn't damages to compensate the rightsholder, but to uphold the license.

> So we would need the author of a GPL'd repository to sue, with proof that copilot copied and pasted their code.

What happens if multiple authors wait and see, and then jointly sue? At that point, removing a single repo isn't enough.


> And unlike copyright, the fix isn't damages to compensate the rightsholder, but to uphold the license.

That's not true. Courts won't consistently mandate that people relicense their code that includes GPLed stuff. They'll make you pay a fine and remove the code.


Should Stephen King be sued because he trained himself by reading other authors' copyrighted works?


This is not the same at all. An equivalent would be if Stephen King read a lot of other authors and then copied parts of their works into his own books for profit, even if they were changed slightly.


How large does the section have to be? A single sentence? Or two? You can currently set Copilot to not include suggestions where more than 150 characters appear in the training set.


It depends upon a judge. A letter or word doesn't carry that much meaning. Value is created when something meaningful is made. If Copilot gives code from the Linux kernel, I think it should follow the GPL license.

There are reasons why one open-source project (I forget the name; probably ReactOS) bans people from contributing if they have read pirated source code.


If you accidentally recite someone else's work, it's still plagiarism.


I am absolutely certain that I have written sentences that are word-for-word identical to sentences other people have written. Should I be sued?


You're describing something that happens regularly. Especially in music.


That's the point.

Accidentally including the same material in something new is not immediately infringement.


Stephen King is a human. This is a computer program — a complex one, but still a computer program. How does this analogy make any sense?


Stephen King is just a very complex program, being executed in the brain. There's no way to "look inside" the black box of Stephen King, and you can make the argument that there's no way to look inside the black box of Copilot. Therefore, you judge both by their outward actions using the same set of rules.


Building something we can't introspect isn't the same as building something intelligent.

Anyone who thinks Copilot is writing code with any comparable degree of novelty to Stephen King's prose clearly hasn't actually tried Copilot.


> Therefore, you judge both by the outward actions using the same set of rules.

No, you don't. You don't do that for kids, not for animals, not for corporations, not for.... . We don't even do that for patents vs copyright vs ....


Just looking at leaked source from Windows and then contributing to Wine has gotten people in trouble before. If Microsoft is willing to play that game it makes sense that it should go both ways.


Personally, I err on the side of suing them for breaking the license terms and demanding the GPL code being removed from the training data.

Although, I realize that it isn't feasible.


Then watch Disney sue some random ML practitioner for using Star Wars images to train a model, or Google sue Huggingface for using some YouTube videos.


Yeah, people forget there are 'transformative use' exceptions to copyright law. Without such an exception, search engines would be illegal.

Don't get me wrong, it's problematic when the model spits out moderately sized, copyrightable, chunks, verbatim.

But I think the bigger issue for the free and open source communities is that Copilot is proprietary, not that it exists. We should create a FOSS alternative. Presumably we can even train it on (source available) proprietary code.


So the key questions here I think are: Can an ML model "create"? Can it "own copyright"? Can it "transform"? If it can't do those things then I think the entire idea falls apart.

Separate from this, the violations are per instance, not the model itself. If I'm an author and write hundreds of non-infringing books over two decades and then suddenly write and sell a copy of an existing work, I'm not going to be sued for my other books or the reading I did to create any of it. I will be sued for the actual infringing book.


That's... actually a pretty good counterpoint.

Although Google suing over YouTube videos is somewhat different from Disney with Star Wars, as Google doesn't have an exclusive license on those.

But yeah, you changed my mind on that one.


They would need to remove all code that isn't put in the public domain from their training data. Permissive licenses like MIT require derivatives to propagate the copyright notice, which Copilot does not do.


This article was posted in 2021. Can the title be changed to reflect that, please?


Doesn't GitHub's own Terms of Service prevent them from doing something like Copilot?

https://docs.github.com/en/site-policy/github-terms/github-t...

>This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.

Copilot is not part of GitHub's services or archival efforts. At least, it was not when they were training the model. And the TOS don't mention Copilot at all (or the terms "train" or "model").


No. The generated code should be GPL'ed as the license requires.


I wrote a law school paper discussing the potential liability stemming from Copilot: https://nickvh.com/blog/archive/2022/02/copilot/copilot.html


IANAL or a law student.

Good paper, though I disagree with many points.

However, two things: you claim that Copilot might have infringed during training because of the GPL, but the GPL's clause only kicks in on distribution. If GitHub had trained Copilot and then did nothing else, that clause would not kick in.

Second, your Stack Overflow example of copying is not good because SO has an explicit license for all material given. You must agree to let SO and others have the material under that license, which is CC-BY-SA 4.0.

Don't get me wrong; I hate Copilot. But those two things you said were wrong.


Thanks for engaging with the paper.

I'm too lazy to check the GPL comment (I'll assume I made a mistake). But as far as I can tell my only reference to Stack Overflow was not about liability based on copying from SO. I was making a comment about a common industry practice.


We store confidential code on Github. How do I know it hasn't been used in CoPilot?


Assuming your confidential code is in private repos, you're fine. From their FAQ:

> [GitHub Copilot] has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.


You can't know (although they claim it was trained only on public code, presumably including proprietary public or leaked code). GPL folks are wondering if it is possible to convert proprietary code to GPL through CoPilot.


I think you can pretty safely know. People store sensitive information, passwords, keys, etc in private repos aaaall the time. It would be an entirely unnecessary complexity for copilot to try to filter out that sensitive information. It's not like there's a deficiency of public code on GitHub. It wouldn't make any sense to train on private repos.


All that stuff is in public repos too, so really they should filter out those things from public repos too. The only difference I can see is that private repos are usually proprietary and GitHub didn't want to anger their paying customers, who usually have proprietary code on GitHub.


This feels legal, but icky.

Like technically it's legal for multi billion dollar companies to turn MIT projects into hosted services.

They can do this without compensating the original devs, they can close source their additions, etc.

Just feels icky.

Copilot also feels icky. Back when it was free, it looked like a neat project. For $10 a month, one would hope the Microsoft overlords would offer some compensation.

Let's say you write a code snippet that Copilot copies a few million times. Maybe Microsoft can give you a bit of cash.

Then again, it's probably buried in GitHub's TOS that they can use tools like CoPilot to extract from anything you upload.


I wonder if the code CoPilot outputs is even protected by copyright at all. Would be pretty funny if it weren't, lots of proprietary codebases written using it would actually be public domain.


I suspect that upon "publishing" or the equivalent the company employing the developer would assume the copyright any liability attached to it that Copilot represents, rather than the idea it would all become public domain.


I'm talking about lack of ownership of the code, not about liability for generating the code.


Effectively, copyright is ownership here though.

Edit: I see the possible confusion point here:

> would assume the copyright any liability attached

Should read:

> would assume the copyright and liability attached


This is an entirely new type of product developed with a completely new type of technology that our existing laws were not designed to handle. A lot of folks seem to want Copilot to exist under one existing legal concept or another, but I think we really just need to define a new concept altogether for models and their output trained on publicly available data.


There are so many interesting legal questions that Copilot raises. The tiny snippets that Copilot recommends probably fall in the fair use bucket, but the use of entire application code for training, without making unmodified source available with instructions on how to get that source, will probably be an issue for Copilot. It will take years to sort it out.


No. IP sucks. Copyleft is just copyright in disguise.

Don't want people writing code similar to yours? Don't publish anything. Force people to sign contracts before having contact with your source code.


Microsoft bought github and then did its copilot thing without asking. Asking about copilot is to miss the microsoft elephant in the room. This is MS doing its thing all over again.


Should be tagged (2021)


How much code is covered under a license?

When I use GitHub I get some handy functions and sort of the style it is written in… but that’s it.

It’s not a lot of code.

Does someone own this?

    formatDateISOXXX() { a handful of lines of code here }


Microsoft claims that training Copilot is "fair use" but couldn't they still be sued in like Germany or somewhere else where fair use does not apply?


The Luddites: Let's add large unreachable lines of trash code to our open repos. (And on random days, dress up as road signs and walk around near roads.)


No. Why would we want to go that route? Such a lawsuit would slow down a ton of work in AI, certainly big language models and DALL-E for example.


What will happen when someone writes a website called "This Disney character does not exist"?


That they may have to take the "Disney" trademark out of the name.


Just need to reword it a bit, "This Disney character doesn't exist" can also be "This is not a Disney character (because it doesn't exist)". Sums up the assertion being made well too. Not a Disney character. Except their lawyers may disagree.


Making an example of Copilot (or, better, Copilot users) would slow down a ton of work in AI-automated copyright infringement: a good thing.


copyright, such a good thing


Of course not, just the code that is produced is all GPL, and under all the other licenses as well.


Yes, it must be. Even though Micro$oft will get away with it because it has huge pockets.


1. No. Why?

2. It's in the user agreement

3. You weren't harmed in any practical way

4. It's not a real tool and nobody's using it in the long run

5. If GPL fans want to waste money on lawyers, focus on making the GPL have teeth

6. No, of course you don't get to impose a new license on your work retroactively


Where is it in the user agreement?

https://docs.github.com/en/site-policy/github-terms/github-t...

> This license does not grant GitHub the right to sell Your Content.

Charging $10/month seems like selling it. The only question is whether it counts as your copyrighted content once it's been chewed up by an AI.


The GPL has teeth. Vendors are routinely sued over infringement. This is why a lot of things like routers and IoT gizmos you buy nowadays come with a notice about your rights under the GPL along with a link to a form to request source code.


> The GPL has teeth.

TiVo says it doesn't.

.

> Vendors are routinely sued over infringement.

The GPL has never one time been successfully prosecuted. Never. Show me a single specific case.


If they were good, they would release it for free.


for me the matter is simple:

if the resulting code is licensed under a GPL incompatible license: yes

if the resulting code is licensed under the GPL: no


Their contention is that this is fair use, so copyright based licensing doesn't apply at all. Under their legal theory, the GPL is irrelevant.


what about countries that don't allow for "fair use" in their copyright laws like germany?


I worry for the productivity of the field if Copilot were successfully sued. It would destroy innovation and gridlock progress.


How so? What would be prevented, other than systems exactly like copilot that copyright-launder random source code on the internet?


YES


The fact copilot is trained on GPL data makes it unusable for many developers.


(2021)


Betteridge's law says no, and so do I.

If code is GPL, it's free to read and learn from. That's kind of the point.


Muhahahah. If this has any legal grounds, machine learning as a whole is in a lot of danger.

It also shows in global competition. Here in Germany I have seen scientists trying to train on a GDPR-conforming data set...

Let's just say that you obviously get much better results the bigger your training set is.


My level of engagement with this: I was accepted into the technical preview of CoPilot, and while I did generate a few functions and messed around a bit, I don't do enough actual programming these days (because reasons) to have made good use of it. I won't be using it as a commercial product, I'm pretty sure, and I seem to have until August to use it for free, if I'm gonna. Anyways...

Remember Bill Gates' open letter to hobbyists in 1976, in which he blamed pirates for ruining everything? This discussion is much more interesting and subtle than the old debate about piracy's place (and it definitely has a place) in user communities, but I find it humorous that Gates's behemoth, all these years later, seems to have circled around to doing the same thing.

As one who started out by going to swap meets with my Commie 64 and completely disregarding the hand-wringing, moustache-twirling frets of the people RMS (and I) would later refer to as Hoarders, I am pretty sure that I land in the Don't Care category of this debate. Specifically, I think the most compelling point made in the article is when it points out that the tools which were used to create CoPilot are available for free to all, so anyone who is willing to put in the effort of learning how to train models is free to recreate CoPilot for themselves, and share it with anyone.

We all have copypasted code from github or SO at some point without thinking about the license, I am pretty sure, in many cases while at work or otherwise engaged in commercial software activities. CoPilot cannot write entire programs (yet), all it can really do at this point is automate the searching and CTRL-C+V part for snippets and functions, and that is a net good for anyone who writes software.

If there is going to be a lawsuit, I would hope that it hinges strictly on whether github has the right to charge money for it, because essentially, what they are doing by increasing the speed at which people can work is lowering the number of coders required to finish a project on a given timeline. This is both a boon to anyone who is working on something they intend to release FOSS, and simultaneously a device by which greedy Hoarders who want to pay as few people as possible while profiting from the work of those they do pay can increase their profits immensely while creating fewer good jobs for humans.

To put it more simply, when Hoarders use this tech, it is a transfer of wealth out of the pockets of working programmers, into the pockets of shareholders in whatever corporation is reselling this technology. There is good reason, in other words, for working programmers to dislike this, and for FOSS volunteers to like it.

The lawsuit I would like to see happen, if they must, is one which simply seeks to prevent anyone from charging for this, as that would at least keep the investor class from further draining the economy of people who actually work.

Given that the tools for training models are freely available, what I would prefer is for someone with the time to implement some form of free version of the same thing. The data they trained the Copilot model on is freely available to anyone with an internet connection, as are the ML tools. It's extremely clear to everyone that the code copypasta'd by the model does not belong to Microsoft/Github, so if they can sell it, nobody can argue that a free program can't automatically search and copy it for you for free.

There are some whom this solution, which I am pretty sure will come about, would not satisfy, and those are the people who can rightly be said to be against innovation and new technologies. It all hinges on whether money is changing hands, and whose hands the money is passing into.

Back in those Commie 64 days, we all pirated freely with zero consequence, with one exception that I can recall: one dude in my city started selling pirated software at $5 per floppy. As it happens, he was a friend of my uncle's and I knew him personally. He was a poor man with very little going for him in life, suffice it to say, but nonetheless, it only took a couple of months until the police were at his door, and they seized his 64 and all his software.

I disagreed with his actions, but I never hated the guy; he was too sad and pathetic to hate if you saw the life he actually lived. Many, many people did hate him, though, including all the local pirates, because we were firmly of the mindset that software was meant to be shared freely. We may have been wrong about that, but everyone, pirates and non-pirates alike, agreed that the dude taking money was 100% wrong.

I think the same dynamic is taking place here. You will never convince me that copying code from github or SO is wrong, but I agree that CoPilot should not be a commercial thing.


no


Imagine this:

I am a human programmer. I work a lot with FOSS. I learn a lot from FOSS.

I go on StackOverflow and answer a bunch of questions.

The answers I give are based on all the shit I learned working with FOSS.

Is SO infringing copyright by publishing my answers?

I would think that "NO" would be the obvious answer.

Copilot is no different. There is no way this would hold up in court.


That's not what Copilot does. It isn't parsing the code, learning what it does, then applying the concept to generate new structurally/functionally similar code.

We like to imagine that's what it's doing, because the utility of what it actually does can have a similar end result.

What Copilot actually does is read every line of code from a project, and keep a dataset built around what code tends to have been written adjacent to what other code.

Copilot's output is literally a translation of the code it was "trained on", using data about that code to do its translations.
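
To make that concrete, here is a toy sketch of my own (emphatically not Copilot's actual architecture -- Codex is a large transformer language model, not a lookup table) of completion driven purely by adjacency statistics: count which token followed which in a training corpus, then "complete" a prompt by greedily emitting the most frequent successor.

    /* Toy bigram completer: tally which token follows which in a tiny
       "training corpus", then complete a prompt by always emitting the
       most frequent successor. Illustrative only. */
    #include <stdio.h>
    #include <string.h>

    #define MAXTOK 64

    static const char *corpus[] = {
        "for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")",
    };
    static const int corpus_len = sizeof corpus / sizeof corpus[0];

    static const char *vocab[MAXTOK];
    static int nvocab;
    static int counts[MAXTOK][MAXTOK];  /* counts[a][b]: times b followed a */

    static int intern(const char *tok) {
        for (int i = 0; i < nvocab; i++)
            if (strcmp(vocab[i], tok) == 0) return i;
        vocab[nvocab] = tok;
        return nvocab++;
    }

    int main(void) {
        /* "Train": count adjacent token pairs. */
        for (int i = 0; i + 1 < corpus_len; i++)
            counts[intern(corpus[i])][intern(corpus[i + 1])]++;

        /* "Complete" from the prompt token "for". */
        int cur = intern("for");
        printf("%s", vocab[cur]);
        for (int step = 0; step < 8; step++) {
            int best = 0;
            for (int j = 1; j < nvocab; j++)
                if (counts[cur][j] > counts[cur][best]) best = j;
            if (counts[cur][best] == 0) break;  /* no known successor */
            cur = best;
            printf(" %s", vocab[cur]);
        }
        printf("\n");  /* prints: for ( int i = 0 ; i = */
        return 0;
    }

Even this crude version happily regurgitates the most common pattern in its training data verbatim, which is exactly the property at issue.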


> What Copilot actually does is read every line of code from a project, and keep a dataset built around what code tends to have been written adjacent to what other code.

No, it reads every line of code from EVERY project.

It is not reproducing code from individual coders, it is reproducing patterns deduced from the work of millions of coders.

It's like Smart Compose in Gmail. Copyright on anything I write is mine by default, but Gmail can of course take all the things written by everybody and then use them to train a thing that detects what I probably mean and offers sentence autocompletion in Gmail.


Here it is reproducing a function from the source code of Quake verbatim. https://twitter.com/mitsuhiko/status/1410886329924194309
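
For reference, the function in question, as it appears (give or take whitespace) in the GPL'd Quake III Arena source (q_math.c):

    float Q_rsqrt( float number )
    {
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * ( long * ) &y;                       // evil floating point bit level hacking
        i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
        y  = * ( float * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
    //  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

        return y;
    }

The linked recording shows Copilot emitting essentially this body, distinctive comments included, from nothing more than the two prompt lines.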


In that recording, prior to any auto completion the coder types:

// fast inverse square root

float Q_

I can get access to the same code if I search “// fast inverse square root” on Google:

https://www.google.com.au/search?q=%22%2F%2F+fast+inverse+sq...

The “float Q_” is a partial code match, so obviously that requires a more specialised search engine such as Copilot.

But all this example demonstrates is that Copilot is a great code search engine, allowing me to easily find code that is published publicly on the internet.


Without regard for license. You seem to have moved the goalposts; you originally said:

> It is not reproducing code from individual coders, it is reproducing patterns deduced from the work of millions of coders.

But now you say it is a search engine for finding code snippets from the internet. You really fail to see how this is a problem?

Sticking with your analogy:

> It's like Smart Compose in Gmail. Copyright on anything I write is mine by default, but Gmail can of course take all the things written by everybody and then use them to train a thing that detects what I probably mean and offers sentence autocompletion in Gmail.

How would you feel if I could reproduce your entire personal email verbatim just by starting it with "Hi <unique mother's name>" and letting Gmail complete it for me? How would that hold up in your analogy?


> But now you say it is a search engine for finding code snippets from the internet. You really fail to see how this is a problem?

Can’t it be both? If you already know a specific piece of code well enough that you could find it in a search engine, then it reproduces that code.

If not, it's reproducing general code patterns. I don't see why they can't both be valid use cases, and I don't see how the latter use case infringes the rights of the Quake coders any more than the search engines that already exist do.

> How would you feel if I could reproduce your entire personal email verbatim just by starting it with "Hi <unique mother's name>" and letting Gmail complete it for me? How would that hold up in your analogy?

I think to match what's happening in the Quake example provided, I would have had to publish a specific conversation online, and then you would have to start your text by quoting my conversation exactly, such that it couldn't match any other published conversation, and then it would provide the rest as an autocomplete.

This wouldn’t infringe on my rights any more than if you were to copy my conversation from the internet into an email and send it.

It would also be about as useful to you as the above Quake example, which is to say: not very.


You explicitly grant SO a licence (CC-BY-4.0) on your answers when you post them. If those answers contain copyrighted material, you must have the right to publish them and to relicence them under CC-BY-4.0. If you don't, you're the one who's broken the law, not StackOverflow.


But what if my answers don't contain any copyrighted material, but all of my expertise in a given subject comes from having learned copyrighted material and practised it until it is firm enough in my memory and understanding that I can produce the required solution?


Large language models, including Codex, are a transformative technology. So long as Microsoft and OpenAI are not selling usage of the model back to the open-source community, I think it's OK, though that is the bare minimum obligation; bi-directional fair use is probably the best result we can hope for. The model does 'steal' the 'essence' of the software, though, using inference instead of computation to generate code and documentation, and even to evaluate and explain it. The models are license-blind, and there is very little that could be done to prevent that -- much like what the invention of the camera did for art. So long as MS/OpenAI gives back to open source/free software, the most important thing is that free (libre) software developers are able to work with the language models directly, so that libre software can continue progressing into what I call Imaginary Programming. With a generative internet, all you really need is blockchain + prompting; traditional 'real' programming is like writing in assembly code -- it's completely outmoded. https://huggingface.co/spaces/mullikine/ilambda



