The ToS is necessary for GitHub to provide their services, and the wording is pretty carefully constructed so that GitHub are safe to change their service and develop new things. Loosely speaking: "by uploading code you grant GitHub permission to blah blah with that code".
I'm not sure how well fighting this will go down in court... since we agreed to it. But the more interesting question is whether the result is a complete "you lose" and GitHub walks away, or whether they are forced to take action to defend the copyright of users producing content... say, a "Code Id" type system that warns you if the code you're uploading is too similar to someone else's. That would allow you to use the fun new AI tools to make code and pay GitHub, while simultaneously defending users' legal intellectual property rights.
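For what it's worth, the "Code Id" idea could be prototyped roughly like this. It is only a sketch of one possible approach (comparing overlapping token windows with Jaccard similarity), not anything GitHub is known to do; the function names, the window size `k` and the 0.8 threshold are all made up for illustration.

```python
import re


def shingles(source: str, k: int = 5) -> set:
    """Return the set of k-token windows ("shingles") found in source."""
    # Crude tokenizer: identifiers, integers, or any single non-space character.
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def too_similar(upload: str, known: list, threshold: float = 0.8) -> bool:
    """Hypothetical upload check: flag code that closely matches known code."""
    return any(similarity(upload, other) >= threshold for other in known)
```

Because it compares tokens rather than raw text, whitespace-only changes don't hide a copy, although renaming every variable would. A real system would need something far more robust than this.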
I just want to make sure you appreciate that if you really believe this argument then GitHub can only be used by the people who actually directly own the copyright on projects; and if you, for example, want to clone and edit my software (the vast majority of which I explicitly never uploaded to GitHub) then you wouldn't be allowed to (which doesn't seem like either the intention or the way it is commonly used)... and like, it would essentially be impossible to use GitHub to work on an open source project that has some long storied history with many hundreds of contributors without going back and getting all of them to agree.
Ah yes, good point: plenty of the people that fork projects do not actually hold the copyright to that code to begin with. They just use GitHub while they themselves are in compliance with the license, and that definitely does not give GitHub rights that they would otherwise have to negotiate with the original copyright holders. 'Open source' does not equate to 'public domain', and GitHub effectively seems to be trying to make that claim.
I'm pretty sure that's a narrower interpretation than GitHub are aiming for. I'm just paraphrasing the parts of GitHub's ToS that I can remember, since the current debate on the topic has led to me remembering a few important parts reasonably well, but I've certainly not memorised them. So this is a good opportunity for me to go re-read them and quote them directly... (Also, in case anyone is about to mention it: I am aware that I'm linking to the current incarnation of the ToS and it may have changed... but there have been equivalent sections in the ToS for years, this is pretty standard stuff for User Generated Content licenses, and digging up Internet Archive links to specific historical versions is a bit further than I feel is necessary for the purposes of this specific reply.)
The phrase relevant to your point is "If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post; that you will only submit Content that you have the right to post; and that you will fully comply with any third party licenses relating to Content you post."
It's fair to interpret that as GitHub are not going to be copyright police. The bit at the end where I suggest a "Code Id" is more of a thought experiment as to how they could continue to offer the service while complying with a potential adverse ruling that doesn't ascribe blame to them or the service. There's another section of the ToS that I, with my "knows slightly more about law than average but absolutely not a lawyer" hat firmly on, feel will be how GitHub's legal team at least try to make short work of the lawsuit. Their success with this tactic is a matter for the Courts, and I'd love better legal scholars to weigh in.
"We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program."
For me the key quote is "including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users;". Now there's some legal arguing to be done about whether charging for the AI constitutes an infringement of the second paragraph, which opens with "This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service", but that's a very different argument to what I see a lot of people making. People are arguing (generally speaking) from the standpoint of "it's not right, this violates my rights as the author, having selected this license for my code/commits and published it under that license for others to share" ... not "I didn't agree to GitHub selling my content, and this constitutes a violation of GitHub Terms of Service: Section D, Sub-Section 4, where they told me that they would not sell my content".
But broadly speaking, unless the argument shifts to GitHub Terms of Service: Section D, Sub-Section 4 and classifying this as an unapproved sale of the users' content, I don't see how GitHub are not well within their rights to have trained the AI model and offered it as a service. By agreeing to the ToS, we agreed to GitHub Terms of Service: Section D, Sub-Section 3, where we promise to only post code we have the rights to post, that we the users will comply with the legal complexities of third party licenses, and that we are basically responsible for not posting stuff to GitHub for which we can't grant GitHub the requested legal rights. Combined, that means we gave them permission to use our code and commits to train the AI model, regardless of any license files we may have put in the repos. We can definitely argue derivation and what justifies a sale, and I'd be inclined to say they may actually have breached that term, but no one I've read is talking about that. It's all about copyright infringement for AI generated code and moral rights with respect to using the code to train the model, not a clear cut contractual breach of the Terms of Service that GitHub may or may not have perpetrated on us as the other party agreeing to be bound by the contract.
The key distinction I'm interested in is providing the GitHub (or any similar product) "Service" vs selling a separate, derived product (Copilot / ChatGPT).
A: Common ToS to say that a product's owner obtains a license to user content for purposes of providing that user the product service.
B: Somewhat common ToS to extend that to providing the product service to third party users (i.e. use your content for other users of the service), but depends on business model (e.g. most social* businesses).
C: A lot less common ToS to obtain a right to distribute user content in derived products.
A number of sites have gotten into hot water with their userbase over trying to update their ToS from B to C. From memory... Adobe Cloud, DeviantArt, maybe some others?
Typically this gets flak in creative communities, given that it is many people's business, and they're more concerned about distribution rights than your average coder.
At its base, OpenAI/Microsoft/etc. will eventually run into the exact same issues that bedeviled the Linux kernel in the 1990s, except with a much thornier IP ownership question (given the greater number of parties).
But... we're all aware - as is GitHub - that plenty of the content there is not posted by the original copyright holders, who are the only parties that are able to enter into such a contract. That was the reason for GitHub coming into existence in the first place. You can't turn around a couple of years later and start arguing that the use of GitHub allows for a blanket exemption on copyright law, which is effectively what this amounts to.
The GitHub ToS is written by GitHub. It's not a contract, in the sense that no consideration has been given to the other party, and as such it isn't legally binding on that other party. But regular law, such as copyright law, still applies to GitHub.
It's the same as other user generated content sites... The ToS is there to legally shift blame from GitHub to the users... and that's what made me think of "Code Id" actually, since GitHub have a firm defence in the form of "users doing illegal things isn't our fault, we asked them not to and tried to kick people off when we found out they were violating the terms". But they might still get slapped around a bit by the Court and need to implement some form of safeguards the way YouTube was forced to, because your point about how binding the terms of service are when the consideration is "use of this service in exchange for agreement" is true: there is not a super strong contract here. It's nominally more binding than the average clickwrap pre-install EULA, since the consideration in exchange is use of the service itself, but as case law around things like scraping and other internet activity has shown, it's definitely not as binding as a physically signed sale contract would be...
It shouldn't matter if the copyright holder agreed to it directly, if they've published the original code under an open source license. Since open source licenses all allow people to use the code for "whatever".
Even GPL doesn't (yet) include a clause saying the code can't be used to train AI unless the AI itself is open source.
> Since open source licenses all allow people to use the code for "whatever"
That's not what they allow for. Copyright being a 'right', it allows you to pass some of those rights on to others and to retain some for yourself. If not explicitly passed on, the right still rests with the original author; there is plenty of precedent for that.
To take an example: someone who uses MIT-licensed code but doesn't reproduce the license isn't following the terms of the copyright grant, ergo doesn't have a license for use, ergo is violating copyright.
Now what does that look like when I take 100 different open source licenses, including MIT, put them in a GPT blender, and then productize my output without following any of the licenses?
... makes you think there might be a legal component to why OpenAI switched to a SaaS model. Although I believe they'd still be in hot water over any AGPL et al. code.