I understand the sentiment, but I think it is misguided and therefore counterproductive.
First, LLMs learn patterns; they don't just copy and paste. If they generate verbatim copies of any non-trivial part that would be subject to copyright, then yes, that would be infringement. Yet, can anyone give practical examples of this? And if so, how do they differ from a software engineer who copies and pastes code?
Second, if the code is hosted anywhere else, there is no guarantee that Copilot (or another model) won't learn from it. The only way to make sure no one and nothing will learn from open-source code is to make it as closed as possible.
Third, for me, the crucial part of open-source code is maintenance. GitHub is there and works well both as a platform for creation (I consider GitHub the most productive social network) and as an archive. "No GitHub" (even as a mirror) means the code is likely to end up in places less likely to engage collaborators and less likely to last long.
> If they generate verbatim copies of any non-trivial part that would be subject to copyright, then yes, that would be infringement. Yet, can anyone give practical examples of this?
It's always the same two examples, and I would not classify that as "many", especially since that fast inverse square root function has appeared on GitHub and other sites countless times under all sorts of different licences (which is wrong, but Copilot doesn't seem to do better or worse than humans in this regard).
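For context, the example usually cited is the Quake III Arena fast inverse square root, reproduced below roughly as it circulates online (id Software released the original under the GPL-2.0, which is exactly why relicensed copies of it are everywhere):

```c
/* The famous Quake III fast inverse square root, approximately as it
 * circulates online. The bit trick produces a good initial guess for
 * 1/sqrt(x) via the magic constant, then refines it with one Newton
 * iteration. */
float Q_rsqrt(float number)
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = *(long *)&y;                       /* reinterpret float bits as an integer */
    i  = 0x5f3759df - (i >> 1);             /* the magic constant */
    y  = *(float *)&i;
    y  = y * (threehalfs - (x2 * y * y));   /* one iteration of Newton's method */
    return y;
}
```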
That codeium.com piece is just asking leading questions, or the AI equivalent of that.
It is mutatis mutandis the same, but is that a problem? I'm sure many would say so; I'm not convinced.
Ultimately, if his code is out there, a Google search could bring up a snippet without the license visible, and I might copy-paste that. The crux is that the same code might be presented without context.
Copilot is just a tool, and the person responsible for its safe usage is the human behind it.
In my worldview, if I copy a picture off Google image search, ultimately I am morally the one who infringed copyright, not Google.
> In my worldview, if I copy a picture off Google image search, ultimately I am morally the one who infringed copyright, not Google.
I have an idea why, but... why exactly? What about a web scraper (that I made, similar to Google's) that downloads images? What if it downloads images at random rather than intentionally fetching a specific one?
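To make that hypothetical concrete, here is a minimal sketch of such a scraper in C using libcurl. Everything specific to it (the URL list, the output filename) is a made-up placeholder for illustration, not a real endpoint:

```c
/* Hypothetical scraper: fetches one image from a randomly chosen URL.
 * The point of the thought experiment: the program, not its operator,
 * "chooses" what gets downloaded. URLs below are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <curl/curl.h>

static size_t write_file(char *ptr, size_t size, size_t nmemb, void *stream)
{
    return fwrite(ptr, size, nmemb, (FILE *)stream);
}

int main(void)
{
    const char *urls[] = {
        "https://example.com/a.jpg",
        "https://example.com/b.jpg",
        "https://example.com/c.jpg",
    };
    srand((unsigned)time(NULL));
    const char *url = urls[rand() % (sizeof urls / sizeof urls[0])];

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    FILE *out = fopen("download.jpg", "wb");
    if (!out) { curl_easy_cleanup(curl); return 1; }

    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_file);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);

    CURLcode res = curl_easy_perform(curl);  /* the download itself */
    fclose(out);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```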
Valid points. But I don't want my code to be used by big corporations and monopolies to train closed-source LLMs that they're going to sell. Shouldn't I get a say in that?
For example, the GPL controls what kinds of projects can use my source code. Maybe there could be an addendum to the GPL that requires all LLMs trained on the source code to be open source as well. Sure, that won't guarantee that Copilot-like bots are never trained on my code. But it would give me a legal framework to stop big corporations from profiting off such bots without open-sourcing them too.
A genuine question: if the LLM came from a purely non-profit company that gave its AI away for free, would you mind your code being used? Would you, in fact, be proud that it had made a useful contribution? Assuming the outcome does not affect your income.
You look at the problem from principles, while I look at the outcomes.
As to the third point: well, it is up to the author, and I respect that (regardless of whether I would do the same). People have the right not to share their code at all, to share it as a copyrighted piece of software, or to share it with any other limitations. Though all limitations (and copyleft is a limitation) affect its usage.