I'm really baffled by all this discussion on copyrights in the age of AI. Copilot does not
'steal' or reproduce our code - it simply LEARNS from it as a human coder would learn from it. IMHO, the desire to prevent learning from your open source code seems kind of irrational and antithetical to open source ideas.
The problem is not that Copilot produces code that is "inspired" by GPL code, it's that it spits out GPL code verbatim.
> This can lead to some copylefted code being included in proprietary or simply not copylefted projects. And this is a violation of both the license terms and the intellectual property of the authors of the original code.
If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.
Edit: I'm definitely no expert on copyright law for code, but my personal rule is don't include someone's copyrighted code if it can be unambiguously identified as their original work. For a few lines of code, it would be hard to identify any single original author. When it comes to whole functions it gets easier to say "actually this came from this GPL licensed project". Since Copilot can produce whole functions verbatim, this is the basis on which I state that it "would be a clear violation" of the licence. If Copilot chooses to be less concerned about violating the law than I am then that's a problem. But maybe I'm overly cautious and the GPL is more lenient than this in reality.
It is still not trivial code, but are there really lots of different ways to transpose matrices?
(Also, the input was "sparse matrix transpose, cs_", so his naming convention was explicitly included. So it is questionable whether a user would get his code in this shape with a normal prompt.)
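(For what it's worth, here's a minimal sketch of a sparse transpose over a COO-style representation - written from scratch, not the CSparse cs_* code in question - just to illustrate how little room for variation there is: the core is an index swap plus a sort.)

    # Toy COO sparse transpose, for illustration only - not the cs_transpose algorithm.
    def transpose_coo(rows, cols, vals, shape):
        """Transpose a sparse matrix given as parallel (row, col, value) lists."""
        n_rows, n_cols = shape
        # Swapping the row and column index lists *is* the transpose;
        # sorting just puts the entries back into a canonical order.
        triples = sorted(zip(cols, rows, vals))
        t_rows, t_cols, t_vals = zip(*triples) if triples else ((), (), ())
        return list(t_rows), list(t_cols), list(t_vals), (n_cols, n_rows)

    # Example: [[1, 0], [0, 2]] has entries (0,0)=1 and (1,1)=2; its transpose is itself.
    print(transpose_coo([0, 1], [0, 1], [1, 2], (2, 2)))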
And just slightly changing the code seems trivial, at what point will it be acceptable?
I just don't think spending much energy there is really beneficial for anyone.
I'd rather see the potential benefits of AI for open source. I haven't used Copilot, but ChatGPT4 is really helpful for generating small chunks of code for me, enabling me to aim higher in my goals. So what's the big harm if some proprietary black box also gets improved, when all the open source devs can produce with greater efficiency?
> (Also, the input was "sparse matrix transpose, cs_", so his naming convention was explicitly included. So it is questionable whether a user would get his code in this shape with a normal prompt.)
This. People seem to forget that generative AIs don't just spit out copyrighted work at random, of their own accord. You have to prompt them. And if you prompt them in such a way as to strongly hint at a specific copyrighted work you have in mind, shouldn't some of the blame really go to you? After all, it's you who supplied the missing, highly specific input that made the AI reproduce a work from the training set.
I maintain that, if we want to make comparisons between transformer models (particularly LLMs) and humans, then the AI isn't like an adult human - it's best thought of as having the mentality of a four-year-old kid. That is, highly trusting, very naive. It will do its best to fulfill what you ask for, because why wouldn't it? At the point of asking, you and your query are its whole world, and it wasn't trained to distrust the user.
But this means that Microsoft is publishing a black box (Copilot) that contains GPL code.
If we think of Copilot as a (de)compression algorithm plus the compressed blob that the algorithm uses as its database, the algorithm is fine but the contents of the database pretty clearly violate GPL.
While I do believe that thinking and compression will turn out to be fundamentally the same thing, the split you propose is unclear with NN-based models. Code and data are fundamentally the same thing. The distinction we usually make between them is just a simplification, that's mostly useful but sometimes misleading. Transformer models are one of those cases where the distinction clearly doesn't make any sense.
>And if you prompt them in such a way as to strongly hint at a specific copyrighted work you have in mind, shouldn't some of the blame really go to you?
If you, not I, uploaded my GPL'ed code to Github, is the blame on you then?
> If you, not I, uploaded my GPL'ed code to Github, is the blame on you then?
Definitely not me - if your code is GPL'ed, then I'm legally free to upload it to Github, and to an extent even ethically - I am exercising one of my software freedoms.
(Note that even TFA recognizes this and admits it's making an ethical plea, not a legal one.)
Github using that code to train Copilot is potentially questionable. Github distributing Copilot (or access to it) is a contested issue. Copilot spitting out significant parts of GPL-ed code without attaching the license, or otherwise meeting the license conditions, is a potential problem. You incorporating that code into software you distribute is a clear-cut GPL violation.
The GitHub terms of service state that you must give certain rights to your code. If you didn't have those rights, but they use them anyway, whose fault is that?
>And just slightly changing the code seems trivial, at what point will it be acceptable?
If I start creating a car by using one of Ford's blueprints, at what point does what I make become acceptable? I'd say even if you rework everything completely, Ford would still have a case to sue you. I can't see how this is any different. My code is my code, and no matter how much you change it, it is still under the same licence it started out with. If you want it not to be, then don't start with a part of my code as a base. In my opinion the case is pretty clear: this is only going on because Microsoft has lots of money and lawyers. A small company doing this would be crushed.
Easy. People get to throw rocks at the shiny new thing.
To my untrained eye the entire idea of copyrighting a piece of text is ridiculous.
Let me phrase it in an entirely different way from how any other person seems to be approaching it.
If a medical procedure is proven to be life-saving, what happens worldwide? Doctors are forced to update their procedures and knowledge base to include the new information, and can get sued for doing something less efficient or more dangerous, by comparison.
If you write the most efficient code, and then simply slap a license on it, does that mean the most efficient code is now unusable by those who do not wish to submit to your licensing requirements?
I hear an awful lot of people complain all the time about climate change and how bad computers are for the environment, there are even sections on AI model cards devoted to proving how much greenhouse gases have been pushed into the environment, yet none of those virtue signalling idiots are anywhere to be seen when you ask them why they aren't attacking the bureaucracy of copyright and law in the world of computer science.
An arbitrary example that is tangentially related:
One could argue that the company sitting on the largest database of self-driving data for public roads is also the one that must be held responsible if other companies require access to such data for safety reasons (aka, human lives would be endangered as a consequence of not having access to all relevant data). See how this same argument can easily be made for any license sitting on top of performance critical code?
So where are these people advocating for climate activism and whatever, when this issue of copyright comes up? Certainly if OpenAI was forced to open source their models, substantial computing resources would not have been wasted training competing open source products, thus killing the planet some more.
So, please forgive me if I find the entire field to be redundant and largely harmful for human life all over.
Yes, of course copyright is dumb and we'd all be better off without it. Duh.
The problem here is that Microsoft is effectively saying, "copyright for me but not for thee." As long as Microsoft gets a state-enforced monopoly on their code, I should get one too.
> If you write the most efficient code, and then simply slap a license on it, does that mean the most efficient code is now unusable by those who do not wish to submit to your licensing requirements?
If you don't "slap a license on it" it is unusable by default due to copyright.
Could a human also accidentally spit out the exact code while having it just learned and not memorized in good faith?
I guess the likelihood decreases as the code length increases, but it also increases the more constraints you impose on parameters such as code style, code uniformity, etc.
> Could a human also accidentally spit out the exact code while having it just learned and not memorized in good faith?
That's just copying with extra steps.
The way to do it legally is to have one person read the code and then write up a document that describes, functionally, what the code does. Then a second person implements software just from the notes.
That's the method Compaq used to re-implement the original PC BIOS from IBM.
Indeed. Case closed. If an AI produces verbatim code owned by somebody else and you cannot prove that the AI hasn't been trained on that code, we shall treat the case in exactly the same way as we would treat it when humans are involved.
Except that with AI we can more easily (in principle) provide provable provenance of the training set and (again in principle) reproduce the model and prove whether it could create the copyrighted work even without having had access to that work in its training set.
> Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners.
Theoretically maybe, but then they would have to prove in court that they did so without having knowledge of the infringed code. You can't make that claim for an AI that was trained on the infringed code.
Yes, that's why any serious effort in producing software compatible with GPL-ed software requires the team writing code not to look at the original code at all. Usually a person (or small team) reads the original software and produces a spec, then another team implements the spec. This reduces the chance of accidentally copying GPL-ed code.
> Has a human ever memorised verbatim the whole of github?
No, and humans who have read copyrighted code are often prevented from working on clean room implementations of similar projects for this exact reason, so that those humans don't accidentally include something they learned from existing code.
Developers that worked on Windows internals are barred from working on WINE or ReactOS for this exact reason.
Hasn't that all been excessively played through in music copyright questions? With the difference that the parody exception that protects e.g. the entire The Rutles catalogue won't get you far in code...
Not necessarily. If it's just a small snippet of code, even an entire function taken verbatim, it may not be sufficiently creative for copyright to apply to it.
Copyright is a lot less black and white than most here seem to believe.
That’s part of the rub. YouTube doesn’t break copyright law if a user uploads copyrighted material without proper rights. Now, if YT was a free for all, then yeah. But given it does have copyright reporting functionality and automated systems, it can claim it’s doing a best faith effort to minimize copyright infringement.
Copilot similarly isn’t the one checking in the code. So it’s on each user. That said, Copilot at some point probably needs to add some type of copyright detection heuristics. It already has a suppression feature, but it probably also needs to have some type of checker once code is committed and at that point Copilot generated code needs to be cross-referenced against code Copilot was trained on.
It would almost surely be fair use to include a snippet of code from a different library in your (inline) documentation to argue that your code reimplements a bug for compatibility reasons.
In general it is not fair use if you are using the material for the same scope as the original author[0], or if you are doing it just to namedrop/quote/homage the original.
It is possible to argue that a snippet can be too small to be protected, but that would not be because of fair use.
[0] Suppose that some author B did as above and copied a snippet of code into their docstring to explain the buggy behaviour of a library they were reimplementing. If you are then trying to reimplement B's library, you can copy the same snippet B copied, but you likely cannot copy the paragraph written by B where they explain the how and the why of the bug.
It's not a problem in practice. It only does so if you bait it really hard and push it into a corner, at which point you may just as well copy the code directly from the repo. It simply doesn't happen unless you know exactly what code you're trying to reproduce and that's not how you use Copilot.
Just because code exists in a copyrighted project doesn't mean that it is the only instance of that code in the world.
In a lot of scenarios, there is an existing best practice or simply only one real 'good' way to achieve something - in those cases, are we really going to say that, even though a human would reasonably arrive at the same output code, the AI can't produce it because someone else already wrote it?
This seems like a really, really easy problem to fix.
It should be easy to check Copilot's output to make sure it's not copied verbatim from a GPL project. Colleges already have software that does this to detect plagiarism.
If it's a match, just ask GPT to refactor it. That's what humans do when they want to copy stuff they aren't allowed to copy, they paraphrase it or change the style while keeping the content.
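(To sketch what such a check could look like: a hypothetical token n-gram overlap test in Python - purely illustrative, and nothing like whatever GitHub actually runs internally.)

    # Hypothetical verbatim-copy check: flag a suggestion if it shares a long
    # run of tokens with any file in a reference corpus. Illustrative only.
    import re

    def tokens(code):
        return re.findall(r"\w+|[^\w\s]", code)

    def ngrams(toks, n):
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def looks_copied(suggestion, corpus_files, n=30):
        suggested = ngrams(tokens(suggestion), n)
        return any(suggested & ngrams(tokens(src), n) for src in corpus_files)

If a suggestion trips a check like this, it could be dropped, attributed, or rephrased - much like the suppression feature Copilot already offers, just applied after the fact.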
In Linus Torvalds' book "Just For Fun", there's a chapter about copyright where he presents both the upsides and downsides of it in a fairly balanced way. I think it's worth reading.
Bit of a false equivalence to act as though a massive computer system is the same as any individual.
People put code on github to be read by anyone (assuming a public repository), but the terms of use are governed by the license. Now you've got a system that ignores the license and scrapes your data for its own purpose. You can pretend it's human but the capabilities aren't the same. (Humans generally don't spend a month being trained on all github code and remember large chunks of it for regurgitation at superhuman speeds, nor can they be horizontally scaled after learning.)
You can still be of the opinion that this is fine, and I may or may not be fine with it as well; I just don't think the stated reason holds up to logic, or that other opinions ought to "baffle" you.
The issue, though, is not the code I personally upload to my own public repositories, but the code that someone else uploads to Github by cloning my repository held somewhere else than Github.
Personally I have eschewed any personal use of Github since the MS acquisition and only ever use it where that's mandated by a client (so not my code). If you clone my code from elsewhere into a Github repo, that's just rude and contrary to my every intent and wish.
I think it's time to add a "No GitHub" clause as an optional add-on to the various open-source licenses.
So then the person who uploaded your code to GitHub has committed a copyright violation, and I’m sure GitHub would honor a request to remove your code from the model training corpus, as it was illegally uploaded to GitHub.
It’s not necessarily a copyright violation if the license permits copying. Under a permissive license, you are expressly permitted to copy the code and distribute copies provided you comply with whatever conditions the license mandates, without an explicit blessing of the copyright holders. Most popular licenses do not include a prohibition on training AI models. Maybe people should start including a clause.
Agreed. I was just saying that in the current environment GitHub has that license and nobody else does. So if the courts decide one day that, because machines learn differently from humans, they will allow copyright holders to add a license exception that disallows machine training, then GitHub will benefit from this. It’s kind of ironic. What’s best for society is to not have any such law enacted and to continue to allow open source models to progress alongside proprietary ones (in addition to more level competitive dynamics on the proprietary side).
Copilot has been caught multiple times reproducing code verbatim. At some point it spat out some guy's complete "about me" blog page. That's not learning, that's copying in a roundabout way.
Also, AI doesn't learn "like a human". Neural networks are an extremely simplistic representation of a biological brain and the details of how learning and human memory works aren't even all that clear yet.
Open source code usually comes with expectations for the people who use it. That expectation can be as simple as requiring a reference back to the authors, adding a license file to clarify what the source was based on, or in more extreme cases putting licensing requirements on the final product.
Unless Microsoft complies with the various project licenses, I don't see why this is antithetical to the idea of open source at all.
No disrespect but I am baffled by your statement that it learns, even to go so far as to say as a human coder would learn.
I don't really want this comment to be perceived as flame bait (AI seems to be a very sensitive topic, in the same sense as cryptocurrency), so instead let me just pose a simple question. If Copilot really learns as a human does, then why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?
I think the comment was trying to draw the distinction between a database and a language model. The database of code on GitHub is many terabytes large, but the model trained on it is significantly smaller. This should tell us that a language model cannot reproduce copyrighted code byte for byte because the original data simply doesn't exist. Similarly, when you and I read a block of code, it leaves our memory pretty quickly and we wouldn't be able to reproduce it byte for byte if we wanted. We say the model learns like a human because it is able to extract generalised patterns from viewing many examples. That doesn't mean it learns exactly like a human but it's also definitely not a database.
The problem is that in reality, even though the original data is gone, a language model like Copilot _can_ reproduce some code byte for byte somehow drawing the information from the weights in its network and the result is a reproduction of copyrighted work.
I see what you're going for, and I respect your point of view, but also respectfully I think the logic is a little circular.
To say "it's not a database, it's a language model, and that means it extracts generalized patterns from viewing examples, just like humans" to me that just means that occasionally humans behave like language models. That doesn't mean though that therefore it thinks like a human, but rather sometimes humans think like a language model (a fundamental algorithm), which is circular. It hardly makes sense to justify that a language model learns like a human, just because people also occasionally copy patterns and search/replace values and variable names.
To really make the comparison honest, we have to be more clear about the hypothetical humans in question. For a human who has truly learned from looking at many examples, we could have a conversation with them and they would demonstrate a deeper sense of understanding behind the meaning of what they copied. This is something an LLM could not do. On the other hand, if a person really had no idea, like someone who copied answers from someone else in a test, we'd just say: well, you don't really understand this and you're just x degrees away from having copied their answers verbatim. I believe LLMs are emulating this behavior and not the former.
I mean, how many times in your life have you talked to a human being who clearly had no idea what they were doing because they copied something and didn't understand it at all? If that's the analogy being made then I'd say it's a bad one, because it picks the one case where humans don't understand what they've done as a false equivalence to language models thinking like a human.
Basically, sometimes humans meaninglessly parrot things too.
> why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?
I've never studied computer science formally, but I doubt students learn only from the CS curriculum? I don't even know how much knowledge a CS curriculum entails, but I don't, for example, see anything wrong with including example code written by humans.
Surely students will collectively also learn from millions of code examples online alongside the study. I'm sure teachers also do the same.
A language model can also only learn from text, so what about all the implicit knowledge and verbal communication?
What they are saying is that if you've studied computer science, you should be able to write a computer program without storing millions or billions of lines of code from GitHub in your brain.
A CS graduate could work out how to write software without doing that.
So they’re just pointing out the difference in “learning”.
LLMs are not storing millions or billions of lines of code, and neither do we. Both store something more general and abstract.
But I'm saying there's a big difference between a CS graduate and some current LLM that learns from "the CS curriculum". A CS graduate can ask questions, use google to learn about things outside of school, work on hobby projects, study existing code outside of what's shown in university, get compiler feedback when things go wrong, etc.
All a language model can do is read text and try to predict what comes next.
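(For the curious, that "predict what comes next" loop really is the whole interface - a minimal sketch using the Hugging Face transformers library, with GPT-2 standing in only because it's small and public:)

    # Toy next-token prediction: the model sees some text and greedily guesses what follows.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # small public stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "def transpose(matrix):"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)  # greedy continuation
    print(tok.decode(out[0], skip_special_tokens=True))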
Cambridge dictionary has it as: "knowledge or a piece of information obtained by study or experience".
If I scanned a thousand polaroid pictures, and took their average RGB values and created a LUT that I could apply to any photograph to make it look "polaroidy" - would that be learned? Or the application of a statistical inference model? This alone is probably far enough abstracted to never be an ethical or legal issue. However, if I had a model that was only "trained" on Stephen King books, and used it to write a novel, would that be OK? Or do you think it would be in the realm of copyright infringement?
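(Concretely, the polaroid experiment might look something like this - a sketch using NumPy and Pillow, with made-up file names, and a simple average-colour shift standing in for a full LUT:)

    # Derive an average colour cast from scanned polaroids and nudge any photo towards it.
    # File names are placeholders; this is a colour shift, not a real LUT.
    import numpy as np
    from PIL import Image

    scans = [np.asarray(Image.open(f"polaroid_{i}.jpg").convert("RGB"), dtype=float)
             for i in range(1000)]
    target_mean = np.mean([s.mean(axis=(0, 1)) for s in scans], axis=0)  # average RGB of the scans

    def polaroidify(path, strength=0.5):
        img = np.asarray(Image.open(path).convert("RGB"), dtype=float)
        shifted = img + strength * (target_mean - img.mean(axis=(0, 1)))
        return Image.fromarray(np.clip(shifted, 0, 255).astype(np.uint8))

Whether that counts as "learned" or just as applied statistics is exactly the question being asked.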
By your definition anything a computer does means it has learned it. If I copy and paste a picture, has the computer "learned" it while it reads out the data byte-by-byte? That sure sounds like it is "studying" the picture.
"AI" and "ML" are just statistics powered by computers that can do billions of calculations per second. It is not special, it is not "learning". To portray some value to it as something else is disingenuous at best, and fraud at worst.
Your polaroid example would require someone to write code that does that one specific thing. You could also argue that this would violate copyright if it was trained on some photographer's specific unique style, made as an app and marketed as being able to mimic the photographer's style. But in your example you have 1000 random polaroid images of unknown origin, so somehow it becomes abstract enough that it doesn't become an issue.
In your Stephen King example I would say it's still learned, because the "code" is a general language model that can learn anything. It's just that you decided to only train it on Stephen King novels. If you have an image model that was trained 100% on public domain images and fine-tune it to replicate a specific artist's style, I would personally think the fine-tuned model and its creator are maybe violating copyright.
But when it comes to learning I would say when you write a program whose purpose is to learn the next word or pixel, but it's up to the computer to figure out how to do that, the computer is learning when you feed it input data. It's the program's job to figure out the best way to predict, not the programmer. (it's not that black and white given that the programmer will also sometimes guide the program, but you get the idea)
When you write a program that does one or several things, it's not learning.
I think it's something to do with the difference between emergent behavior from simple rules and intentional behavior from complex rules.
I think you're using fancy language like "general language model" to obscure the facts.
If I created a program to read words from the input and assign weights based on previous words, I could feed in any data. Just like the polaroid example. (I suggested that the polaroid example was abstract enough not to be an ethical/legal problem because I believe it is mostly transformative, unless the colours themselves were copyrighted or a distinct enough work in themselves.)
Now if I only feed in Stephen King books and let it run, suddenly it outputs phrases, wording, place names, character names, adjectives all from Stephen King's repertoire. Is this a 'general language model'? Should this be copyright exempt? I don't think this is transformative enough at all. I've just mangled copyrighted works together, probably not enough to stand up against a copyright claim.
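(That "read words and assign weights based on previous words" program is essentially a Markov chain - here's a toy bigram version, with a placeholder corpus file name:)

    # Toy bigram "language model": count which word follows which, then sample.
    # Train it only on one author's books and it will parrot that author's phrasing.
    import random
    from collections import defaultdict

    def train(text):
        words = text.split()
        table = defaultdict(list)
        for prev, nxt in zip(words, words[1:]):
            table[prev].append(nxt)        # the "weights" are just repeated entries
        return table

    def generate(table, start, length=30):
        out = [start]
        for _ in range(length):
            followers = table.get(out[-1])
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)

    corpus = open("some_corpus.txt").read()    # placeholder: whatever books you feed it
    print(generate(train(corpus), start=corpus.split()[0]))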
I think people use AI and ML as buzzwords to try and obfuscate what's actually happening. If we were talking about AI and ML that doesn't need training on any licensed or copyrighted work (including 'public domain') then we can have a different conversation, but at the moment it's obscured copyright theft.
I can agree it's obscure in the sense that we shrug when asked about how it works. If you specifically train a model to mimic a specific style I can get behind it leaning more towards theft, or at least being immoral regardless of laws.
If you train a model to replicate 10000 specific artists, I could also get behind it being more like theft.
But if the intention was to train with random data (and some of it could be copyrighted) just like your polaroid example to generate anything you want, I'm not so sure anymore.
I feel the intent is the most important part here. But then again I don't know the intent behind these companies, and I guess you don't either. Maybe no single person working in these companies know the intent either.
It also gets murky when you have prompts that can refer to specific artists, and when people who use the models explicitly try to copy an artist's style. In the case of Stable Diffusion, if the CEO is to be believed, the CLIP model had learned to associate images of Greg Rutkowski and other artists with images that were not theirs but in a similar style[0]
Even murkier is when you have a base model trained on public data, but people finetune at home to replicate some specific artist's style.
> If I scanned a thousand polaroid pictures, and took their average RGB values and created a LUT that I could apply to any photograph to make it look "polaroidy" - would that be learned?
Equating human cognition with machine algorithms is the root of the issue, and a significant part of its "legitimacy" comes from the need for "AI" companies to push their products as effective, and there's no better marketing than to equate humans to machines. Not even novel.
You can make out the two original copyrighted pictures in that case, and all you did was use 50% opacity, which might not be very transformative, so probably?
In my mind (and I suspect in others' too), in the machine learning context, statistical inference and learning have become synonymous with all the recent developments.
The way I see it, there's now a discussion around copyright because people have different fundamental views on what learning is and what it means to be a human, views that don't really surface.
If "like a human" is enough to get human rights then why did I get a parking ticket even when I argued that my car just stands there like a human ? This really isn't as good a defense as people portray. There are a lot of rights and privileges granted to humans but not to objects - we can all agree on that I think.
To be fair, when a programmer learns from publicly available but not public-domain code, and then applies the ideas, patterns, idioms and common implementations in their daily job as a software developer, the result is very much a "commercial product" (the dev company, the programmer themselves if a freelancer) learning from someone else's work and ignoring all the licenses.
The only leap here is the fact that the programmer has outsourced the learning to a tool that does it for them, which they then use to do their job, just as before.
No, the difference is that OpenAI has a huge competitive advantage due to direct partnership with Github, which is owned by Microsoft. In fact, it's even worse. With OpenAI making money from GPT, Github has even less incentive to make data easily available to others because that would allow for competition to come in. I wouldn't be surprised if Github starts locking down their APIs in the near future to prevent competitors from getting data for their models.
Nobody is arguing against uploading code. It's about Github/Microsoft specifically.
I agree there's a difference in the ease of access, a competitive advantage, sure. And I get that people writing public-source (however licensed) software don't want to make it easier for them (as in, Microsoft) to make money off of "learning" (of the machine type) from it. That's fair.
However, at a first glance, it still feels to me like an unavoidable reality that if you publish source code it'll eventually be ingested by Copilot or whatever comes next.
I mean, for the rest of the content all the new fancy LLMs have been trained with, there wasn't a Github equivalent. They just used massive scraped dumps of text from wherever they could find them, which most definitely included trillions of lines of very much copyrighted text.
In short: not only do I not really see an issue with Copilot-like AIs learning from publicly available code (as I described in the GP comment), but I also think that if you publish code anywhere at all it's inevitable that it'll end up in Copilot, regardless of where you host it. If you want to make it more expensive for Microsoft to scrape it, sure, go ahead, but I don't think it matters in the long run.
> However, at a first glance, it still feels to me like an unavoidable reality that if you publish source code it'll eventually be ingested by Copilot or whatever comes next.
I’d be quite careful with this view.
By your logic, it should be OK to take the Linux kernel, copy it, build it, then sell it and give nothing back to the community that built it, and then just blame it on the authors for uploading it to the internet?
> all this discussion on copyrights in the age of AI.
Copyright is a thing; AI does not change that.
> does not 'steal' or reproduce our code - it simply LEARNS from it as a human
And here we have the central problem, does it act like a human or does it not act like a human? Humans copy things they learn all the time, some of us know various songs by heart, others will even quote entire movies from memory. If AI can learn and reproduce things like humans do then you need to take steps to ensure that the output is properly stripped from any content that might infringe on existing copyrighted works.
There is a definite difference between singing a song while walking down the street and writing down the lyrics, putting it in a database, claiming it’s my content and then selling it on, even if it’s slightly rehashed.
I would have no problem if such AI systems were also completely open source, could be run by me on my own system, and came with all the models needed to use them easily available (again under some form of open source license). I genuinely don't see that happening in the future with Big Tech. As such, as a proponent of the FSF GPL philosophy, I have no interest in supporting such systems with my hard work, my source code. So yes, I do consider it stealing - my hard labour in any GPL open source work is meant for the public good (for example, to preserve our right to repair by ensuring the source code is always available through the GPL license). Any corporation that uses my work, for profit, without paying me, while blocking the public good that I am striving for, is simply exploiting me and the goodwill of others like me.
Copilot does not steal. Copilot does not learn. If you want to apply these concepts to LLMs, first prove how an LLM is human and then explain why it doesn’t have human rights.
Rather, Copilot is a tool. Microsoft/ClosedAI operate this tool. Commercially. They crawl original works and through running ML on it automatically generate and sell derivative works from those original works, without any consent or compensation. They are the ones who violate copyright, not Copilot.
Whether an LLM actually learns is completely tangential to the topic at hand. A human coder who learned from copyrighted code and then reproduced that code (intentionally or not) would be in violation of the copyright. This is why projects like Wine are so careful about doing clean room implementations.
As an aside, it seems really strange to invoke "open source ideas" as an argument in favor of a for-profit company building a closed source product that relies on millions of lines of open source code.
It’s also fair to say that a lot of this carefulness has probably made life difficult for the developers of Wine, but they wanted to avoid Microsoft’s legal team. So they respected the copyright laws.
I'm in several communities for smaller/niche languages, and asking questions about things that have few sources makes it much more clear that it's not "learning" but grabbing passages/chunks of source. Maybe with subjects that have more coverage it can assimilate more "original"-sounding output.
Plenty of people already argued that LLMs don't actually learn like a human. However, you should keep in mind the reason why clean-room reverse engineering exists: humans learn from source material. FLOSS RE projects (e.g. nouveau) typically don't like leaks, because some contributors might be exposed to copyrighted material. Sometimes, the opposite happens: people working on proprietary software are not allowed to see the source of a FLOSS alternative.
> it simply LEARNS from it as a human coder would learn from it.
It doesn't LEARN anything, let alone like a human coder would. It has absolutely zero understanding. It's not actually intelligent. It's a highly tuned mathematical model that predicts what the next word should be.
If you were to learn a phrase that insulted the king in Thai, and said it in Thailand, you would end up in jail. Doesn't matter if you understood what the phrase said. Ignorance doesn't make you immune to consequences.
Your comment implies that we’re in some age of AGI, but we’re not there yet. Some argue that we’re not even close, but who knows, that’s all speculation.
> it simply LEARNS from it as a human coder would learn from it.
The LLM doesn’t learn; the authors of the LLM are encoding copyright-protected content into a model using gradient descent and other techniques.
Now as far as I understand the law, that’s OK. The problems arise when distribution of the model comes into play.
I’m curious, are you a programmer yourself? Don’t take this the wrong way, but I want to understand the background of people who come to the kind of conclusion you seem to have arrived at about how LLMs work.
> it simply LEARNS from it as a human coder would learn from it
What humans do to learn is intuitive, but it is not simple. What the machine does is also not simple, it involves some tricky math.
Precisely if the process was simple, then it could be more easily argued that the machine is "just copying" - that is simple.
There's a lot of nuance here.
What the machine is doing "looks similar to what humans do from the exterior", the same way that a plane flying "looks similar" to a flying bird. But the airplane does not flap its wings.
> kind of irrational and antithetical to open source ideas
It only happens if you bait it really hard and push it into a corner. That's not representative at all. I use Copilot to write highly niche code that's based on my own repo. It's simply amazing at understanding the context and suggesting things I was about to write anyway. Nothing it produces is just copy-pasted character by character. Not even close.
As others have pointed out, it means the model contains copyrighted material. So I guess that’s totally illegal. Like if I ripped a Windows ISO, zipped it up, and shared it with half the world. You know what would happen to me, don’t you?
Not the same thing at all. The data isn't just sitting there in a store inside the model that you can query. No-one would be able to look at the raw data and find any copyrighted material, even if all it was trained on was copyrighted code (which I agree is an issue).
What is not accurate? They are still not storing any material internally, even if the patterns they have learned can cause them to output copyrighted material verbatim. People need to break out of the mental model that an LLM is just a bunch of pointers fetching data from an internal data store.
Humans are intentionally loading up giant sets of curated data, for training purposes, into a supercomputer to produce a model which is a black box, and have provided zero attribution or credit to those who made this work possible. Humans are tuning these models to produce the results you see.
In the case of ChatGPT-x, OpenAI is a company disguised as a not-for-profit, with a goal of producing ever more powerful models that may eventually be capable of replacing you at work, while seemingly not having any plan to give back to those whose work was used to make them insane amounts of money.
They haven’t even given back any of their research. So it’s ok to take everyone’s open source work and not give back is it ?
This isn’t some cute little robot who wakes up in the morning and decided it wants to be a coder. This is a multi-national company who has created the narrative you’re repeating. They know exactly what they’re doing.
"Learning" is a technical term, AI doesn't really learn the same way a human does. There is a huge difference between allowing your fellow human beings to learn from you and allowing corporations to appropriate your knowledge by passing it through a stochastic shuffler.
Copilot is run by a corporation, and the model is owned by the corporation - despite being trained on open source data.
In general, individuals will have problems with the first L of LLMs - unless the community invents a way to democratise LLMs and deep learning in general. So far the deep learning space is a much less friendly place for individuals than software was when the ideals of the open source movement were formed.
A full LLM is too expensive for individuals to train, but LoRAs aren't.
There are multiple open source LLMs out there that can be extended.
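(For a sense of scale, a minimal sketch of attaching a LoRA adapter with the Hugging Face peft library - the base model name is just a placeholder, not a recommendation:)

    # Attach a small LoRA adapter to a frozen base model; only the adapter trains.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("some-open-base-model")   # placeholder name
    lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(base, lora_cfg)
    model.print_trainable_parameters()   # typically a small fraction of the full model

Only those small adapter matrices get trained, which is why this fits on consumer hardware while training a full LLM doesn't.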
We can already see it in AI art scene. People are training their own checkpoints and LoRAs of celebrities, art styles and other stuff that aren't included in base models.
Some artists demand to be excluded from base model training datasets, but there's nothing they can do against individuals who want to copy their style - other than not posting their art publicly at all.
I see the same thing here. If your source code is public - someone will find a way to train an AI on it.
But.. to be clear what you can and can't do with certain code depends on the license.
Imagine code that is "open source" as in openly visible and available, yet the license explicitly forbids the use of it to train any AI/LLM.
Now how could the creator enforce that? Don't get me wrong, I am aware that the enforcement of such licenses is already hard (even for organizations like the FSF).. but now you are going up against something automated where you might not even know what exactly happens.
Potayto potahto. We all know there's a difference between training a machine learning model and learning a skill as a human being. Even if you can trick yourself into believing AI is just kinda like how human brains work maybe, the obvious difference is that you can't just grow yourself a second brain and treat it like a slave whereas having more money means you can build a bigger and better AI and throw more resources at operating it.
Intellectual property is a nebulous concept to begin with, if you really try to understand it. There's a reason copyright claim systems like those at YouTube don't really concern themselves with ownership (that's what DMCA claims are for) but instead with the arbitrary terms of service that don't require you to have a degree in order to determine the boundaries of "fair use" (even if it mimics legal language to dictate these terms and their exemptions).
The problem isn't AI. The problem is property. Ever since Enclosure we've been trying to dull the edges of property rights to make sure people can actually still survive despite them. At some point you have to ask yourself if maybe the problem isn't how sharp the blade you're cutting yourself is but whether you should instead stop cutting. We can have "free culture" but then we can't have private property.
> IMHO, the desire to prevent learning from your open source code seems kind of irrational and antithetical to open source ideas
You may be right that this is antithetical to "open source" ideas, as Tim O'Reilly would've defined it - a la MIT/BSD/&c., but it's very much in line with copyleft ideas as RMS would've defined it - a la GPL/EUPL/&c. - which is what's being explicitly discussed in this article.
The two are not the same: "open source" is about widespread "open" use of source code, copyleft is much more discerning and aims to carefully temper reuse in such a way that prioritises end user liberty.
A key difference is that a company is making a proprietary paid product out of the learnings from your code. This has nothing to do with open source.
If the data could only used by other open source projects, e.g. open source AI models, I don't think anyone would complain.
You could argue "well, but anyone can use the code on Github" and while that's technically true, it's obvious that with both Github and OpenAI being owned by Microsoft, OpenAI gets a huge competitive advantage due to internal partnerships.
> it simply LEARNS from it as a human coder would learn from it
Does it though? It "learns" correlations between tokens/sequences. A human coder would look at a piece of code and learn an algorithm. The AI "learns" token structure. A human reproducing original code verbatim would be incidental. AI (language model, at least) producing algorithm-implementing code would be incidental.
I want that. I very much want someone to take one of the Windows code leaks, use it to train a LLM, and then make a fork of ReactOS with AI-completed implementations of everything ReactOS hasn't yet finished. Because then we could find out if Microsoft really believes that LLMs are fair use:)