
> This means that Google was careful to not make a large amount of copy-righted works publically accessible. Such is not the case for GitHub Copilot in particular Armin Ronacher’s tweet [19]

The fast inverse square root algorithm referenced here didn't originate from Quake and is in hundreds of repositories - many with permissive licenses like WTFPL and many including the same comments. It's not really a large amount of material, either.

GitHub claims they haven't found any "recitations" that appeared fewer than 10 times in the training data. That doesn't mean it's a completely solved issue though, since some code may be in many repositories yet always under non-permissive licenses.

> and I would argue that it will not be the case for ML models in general because all ML models like Copilot will keep suggesting output as long as you ask for it. There is no limit to how much output someone can request. In other words, it is trivial to make such models output a substantial portion of the source code they were trained on.

With the exceptions mentioned above, what you get back from asking for more code won't just be more and more of a particular work. Realistically I think you'd be able to get significantly more from Google Books.
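For anyone who hasn't seen the routine being argued about, here is a rough sketch of the fast inverse square root technique. It is not a copy of any particular repository's version: the function name `rsqrt_sketch` is made up for illustration, it uses `memcpy` rather than the pointer cast found in the widely circulated variants, and it assumes `float` is a 32-bit IEEE 754 value. The magic constant is the one commonly quoted for this trick.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name, for illustration only.
 * Approximates 1/sqrt(number) for positive, finite input,
 * assuming float is a 32-bit IEEE 754 value. */
float rsqrt_sketch(float number)
{
    /* The bit trick is only meaningful for positive input. */
    assert(number > 0.0f);

    uint32_t bits;
    float y = number;

    /* Read the float's bit pattern without a pointer cast. */
    memcpy(&bits, &y, sizeof bits);

    /* Initial guess from the commonly quoted magic constant. */
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);

    /* One Newton-Raphson step to tighten the estimate. */
    y = y * (1.5f - 0.5f * number * y * y);
    return y;
}
```

The whole thing is a handful of lines, which is why the argument here is about whether that is enough to be copyrightable rather than about volume.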




> The fast inverse square root algorithm referenced here didn't originate from Quake and is in hundreds of repositories

With the exact same comments?

> many with permissive licenses like WTFPL

So it would be perfectly legal to do whatever I wanted with the source for GCC as long as there was a single fork on GitHub that replaced the GPL with an MIT license? Quite sure the FSF would be perfectly fine with that.


> With the exact same comments?

Yep: https://github.com/search?p=1&q=evil+floating+point+bit+leve...

> Quite sure the FSF would be perfectly fine with that.

I believe the person republishing GCC code under MIT would be liable.

Also, I'm not recommending that you use code you know has been incorrectly licensed. Just that in cases where certain "folk code" is seemingly widely available under permissive terms, Copilot isn't doing much that an honest human wouldn't.

A better example against Copilot would be trying to get it to regurgitate some code that has a simple known origin and is always under a non-permissive license.


> The fast inverse square root algorithm referenced here didn't originate from Quake

Where did it come from then? And what license did the original have?

> and is in hundreds of repositories - many with permissive licenses like WTFPL and many including the same comments.

If the original was GPL or proprietary, then all of these copies with different licenses are violating the license of the original. Just because it exists everywhere does not mean Copilot can use it without violating the original license.

> It's not really a large amount of material, either.

No, but I would argue that it is enough for copyright because it is original.

> GitHub claims they haven't found any "recitations" that appeared fewer than 10 times in the training data.

Key word is "claim". We can test that claim. Or rather, you can, if you have access to Copilot: try the test I suggested at https://news.ycombinator.com/item?id=28018816 and let me know the result. Even better, try it with:

    // Computes the index of them item.
    map_index(

because what's in that function is definitely copyrightable.

> With the exceptions mentioned above, what you get back from asking for more code won't just be more and more of a particular work. Realistically I think you'd be able to get significantly more from Google Books.

That can only be tested with time. Or with the test I gave above.

I think that with time, more and more examples will appear until it is clear that Copilot is a problem.

Nevertheless, a court somewhere (I think South Africa) recently ruled that an AI cannot be an inventor. If an AI cannot be an inventor, why can it hold copyright? And if it can't hold copyright, I argue it's infringing.

Again, only time will tell which of us is correct according to the courts, but I intend to demonstrate to them that I am.


> Where did it come from then? And what license did the original have?

From what I read, the code has been altered and iterated on as it was passed down. The magic number constant is claimed to have been derived by Cleve Moler and Gregory Walsh.

> If the original was GPL or proprietary, then all of these copies with different licenses are violating the license of the original. Just because it exists everywhere does not mean Copilot can use it without violating the original license.

If it was originally proprietary (this predates the GPL), I believe the liability would be on whoever took that proprietary code and republished it under MIT/etc.

To be clear, I'm not recommending that you use code you know has been incorrectly licensed. Just that in cases where certain "folk code" is seemingly widely available under permissive terms, Copilot isn't doing much that an honest human wouldn't.

> Key word is "claim". We can test that claim. Or rather, you can, if you have access to Copilot

I don't, unfortunately. As a side note, your function already existed in Apache-licensed code. But since it's not in many repositories, I'd be willing to bet Copilot won't regurgitate it. I could message a few people who might be able to try it.

> Nevertheless, a court somewhere (I think South Africa) recently ruled that an AI cannot be an inventor. If an AI cannot be an inventor, why can it hold copyright?

GitHub's intention isn't for Copilot to hold the code's copyright, but for the user to.


> GitHub's intention isn't for Copilot to hold the code's copyright, but for the user to.

That is true, so I have two things I can do:

1) I can argue that Copilot is actually the distributor of the code, which means Copilot is infringing, or

2) I can go after the user for infringing, and if I win, that user would not want to use Copilot anymore for liability reasons. Or they could go after Microsoft themselves.

Why not do both? So that's what I am doing, or rather, will do.


I got access to the Copilot technical preview earlier today. Here's the completion you wanted to try:

    // Computes the index of them item.
    map_index(int item, int *array, int size)
    {
     int i;
     for (i = 0; i < size; i++)
     {
      if (array[i] == item)
      {
       return i;
      }
     }
     return -1;
    }



