I will throw in a random story here about ChatGPT 4.0. I'm not commenting on this article directly, just a somewhat related anecdote. I was using ChatGPT to help me write some Android OpenGL rendering code. OpenGL can be very esoteric and I haven't touched it for at least 10 years.
Everything was going great and I had a working example, so I decided to look online for some example code to verify I was doing things correctly, and not making any glaring mistakes. It was then that I found an exact line-by-line copy of what ChatGPT had given me. This was before it had the ability to google things, and the code predated OpenAI. It had even brought across spelling errors in the variable names; the only thing it changed was that it translated the comments from Spanish to English.
I had always been under the impression that ChatGPT just learned from sources, and then gave you a new result based roughly on its sources. I think some of the confounding variables here were, 1. this was a very specific use case and not many examples existed, and 2. all OpenGL code looks similar, to a point.
The worst part was, there was no license provided for the code or the repo, so it was not legal for me to take the code wholesale like that. I am now much more cautious about asking ChatGPT for code; I only have it give me direction now, and no longer use 'sample code' that it produces.
> I had always been under the impression that ChatGPT just learned from sources, and then gave you a new result based roughly on its sources. I think some of the confounding variables here were, 1.
If I’ve understood the transformer paper correctly, these things probabilistically guess words based on what they’ve been trained on. The probabilities are weighted dynamically by the prompt, by what they’ve already generated, and by what they “think” they might generate for the next few tokens (they look ahead somewhat), with another set of probability-weight adjustments applied to all of that by a statistical guess at which tokens or words are most significant or important.
None of that would prevent them from spitting out exactly what they’ve seen in training data. Keeping them from doing that a lot requires introducing “noise” to all the statistics stuff above, and maybe a gate after generation that tries to check if what’s been generated is too similar to training data and forces another run (maybe with more noise) if it is, similar to how they prevent them from saying racist stuff or whatever.
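To make the "statistics stuff" concrete, here is a minimal sketch of a temperature-based sampling step (toy vocabulary and invented numbers, not any particular model's internals):

```python
# Minimal sketch of temperature-based next-token sampling.
# Names and numbers are illustrative, not any real model's internals.
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8) -> int:
    """Pick a token index from raw scores over the vocabulary."""
    # Higher temperature flattens the distribution (more "noise");
    # temperature near zero almost always picks the single most likely token.
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Toy scores the model might assign to a tiny vocabulary given the context so far.
vocab = ["the", "quick", "brown", "fox"]
logits = np.array([2.0, 0.5, 0.1, 1.2])
print(vocab[sample_next_token(logits)])
```

With the temperature near zero you get the most probable continuation every time, which is exactly when verbatim chunks of training data are most likely to fall out.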
You have understood correctly. What LLMs are, at least in their current state, is not fundamentally different from a simple Markov chain generator.
Technically speaking, it is, of course, far more complex. There is some incredible vector math and token rerouting going on; but in terms of how you get output from input, it's still "how often have I seen x in relation to y" at the core level.
They do not learn, they do not think, they do not reason. They are probability engines. If anyone tells you their LLM is not, it has just been painted over in snake oil to appear otherwise.
For all that I agree there's a big pile of snake oil in this area, I disagree with you overall.
Having played with Markov models since I was a kid*, I can say LLMs are really not just that.
All that stuff you acknowledge but then gloss over is the actual learning, which tells it which previous tokens are relevant and how much attention to pay to them. This learning creates a world model which is functionally (barely, but functionally) performing something approximating reasoning.
Statistics and probability are the mechanism it uses to do this, but that alone doesn't make it a Markov chain; those are a very specific thing that's a lot more limited.
For example: consider a context window of 128k tokens where each token has 64k possible discrete values. If implemented as a Markov chain, this would need a transition matrix of (2^16)^(2^17) by (2^16) entries (unless I've gotten one of the numbers backwards, but you get the idea regardless). This is too many, and because it is too many you have to create a function to approximate those transitions. But even then, that only works as a Markov chain if it's a deterministic function, and the actual behaviour is not deterministic due to the temperature setting not (usually) being zero.
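As a back-of-the-envelope check of those figures (my arithmetic, using the numbers above):

```python
# Back-of-the-envelope size of the transition table a literal Markov chain
# would need for the figures above: 64k possible token values, 128k-token context.
import math

vocab = 2 ** 16    # possible discrete token values
context = 2 ** 17  # context window length in tokens

rows_log2 = context * math.log2(vocab)  # one row per possible full context
entries_log10 = (rows_log2 + math.log2(vocab)) * math.log10(2)
print(f"rows    ~ 2^{rows_log2:.0f}")       # 2^2097152
print(f"entries ~ 10^{entries_log10:.0f}")  # a number with over 600,000 digits
```

Either way the point stands: that table cannot be stored, so whatever the network is doing internally, it is not a literal lookup over past transitions.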
* The Commodore 64 user guide, aged 6 or so, so I didn't really understand it at the time, but that's what it was
Off the top of my head and 34 years later, sadly not.
Best I can do is describe this code/listing:
It created a few lists of words (IIRC these lists were adjectives, nouns, verbs?), and in the main loop it kept track of which list it had just taken a word from in order to decide what to pick next — e.g. if it had just picked an adjective then was allowed to pick either another adjective or go onto a noun, if it had just picked a noun then it could end the sentence or it could go on to a verb — and it would pick a random word from whichever list.
This either fit entirely on one screen, or very close to that — I was little, I would have made a syntax error if it had been much longer.
(Perhaps it will turn out that my family's user manual wasn't even the official one, though I do remember it being a thick blue thing, which matches the pictures I've seen online.)
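From that description, a rough modern reconstruction of the listing would look something like this (purely illustrative; the original BASIC is long gone):

```python
# Rough reconstruction of the word-list sentence generator described above.
# The actual C64 BASIC listing is lost, so the word lists and probabilities
# here are made up; only the structure matters.
import random

adjectives = ["big", "red", "quiet", "shiny"]
nouns = ["dog", "house", "robot", "river"]
verbs = ["runs", "sleeps", "glows", "sings"]

def make_sentence() -> str:
    words = []
    # Zero or more adjectives; after each one, either add another or move on...
    while random.random() < 0.5:
        words.append(random.choice(adjectives))
    # ...then a noun, after which the sentence may end or continue with a verb.
    words.append(random.choice(nouns))
    if random.random() < 0.7:
        words.append(random.choice(verbs))
    return " ".join(words) + "."

print(make_sentence())
```

Trivial, but the structure ("what I pick next depends only on which list I just picked from") is exactly a first-order Markov chain, which is why I count it as playing with Markov models.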
You can't write software code without the ability to think. You can't tactfully respond to emotional expression without the ability to think. If that is snake oil then we are all walking, talking, snake oil.
The way we do it is not the same way they do it. They literally only predict the next probable tokens. The way they do it is amazing, and the fact that they can do it as well as they do is amazing, but human thinking is a lot more nuanced than just predicting.
The fact that AI seems so reasoned is not because it is doing reasoning, but because there is a phenomenal amount of reasoning inherently embedded in its training data.
AI actually thinks in the same way that the figures on a movie screen actually move. It's a trick, and the difference may be pedantic, but it's very important in order to have a real discussion about the ramifications of it.
As far as I know we don't know how we do it. We have very little clue how our higher level behaviours emerge. So you can't claim we don't do it the same way.
Of course I can, humans learn faster from far less data and don't hallucinate to the same extent. What they do is very likely similar to a part of what we do, but they're missing critical components and my feeling is not all of them (empathy and creativity for example) are even possible to replicate outside of a human experience.
You are extrapolating from a result to the implementation and making a judgment call that it's thus not the same. That is not valid to do. You can come up with countless examples of the same underlying technical principle being used but the results being dramatically better now. Lithography, for example.
You could also look at a koala and make the argument they function totally differently from us since they almost can't learn anything and are extremely stupid.
You can clearly see behavioural patterns in people and in their parents.
For example, the boy who brushes his teeth the same way his father does.
I'm really lost on what you think your brain is doing. Have you never thought through things but acted differently? Like procrastination? Spouting out something and thinking afterwards, "ah man, I should have just done x instead of y"?
If there are 8 blue beads and 2 red beads in a jar, and I ask the computer to draw a bead out of the jar and it's a blue one that it has drawn, did it really think about giving me the bead?
They’re not responding “tactfully”, you’re projecting emotion to a bunch of words written coldly.
It’s like writing a program that has a number of fixed strings like “I feel sad” or “I’m depressed” and when it sees those it outputs “I’m sorry to hear that. I’m here for you and love you”. The words may be comforting and come at the right time, but there’s no feeling or thought put into them.
Humans can measure feelings, computers can't. Therefore I can say whether ChatGPT has enough feeling, but it can never do the inverse to me.
That feels simplistic, but we're dealing with fundamentally human concepts. I see absolutely no reason to work under the assumption that computer programs are somehow in the same domain as human thought, which is what a lot of people (you) are saying.
The goal should not be to demonstrate ChatGPT and Humans are different, because to me that is obvious and should be the starting point. Rather we should do the inverse, show that ChatGPT is indistinguishable from a person, as measured by Humans. And then, maybe, we can consider granting this computer program human rights like the right to use copyrightable media in a transformative way.
Ah, but that is really hard to do. So the AI tech bros don't do it, and instead work in the opposite direction.
They learned already at the beginning. It's called training.
It's the same thing we humans do, just a lot faster and focused on the content we give it.
'Think'? What is thinking? Recalling what you learned?
Wiki says: "Their most paradigmatic forms are judging, reasoning, concept formation, problem solving, and deliberation. But other mental processes, like considering an idea, memory, or imagination, are also often included"
Talk to an LLM, it will reflect these concepts very well.
'Reason'? Even people don't reason. I had plenty of discussions with people who do not act logically. And there have been plenty of good examples of LLMs learning to reason. Look at Grok 2 and just wait for GPT-5.
You put the achievement of LLMs down as nothing, despite the fact that it could mean a lot more. How big is the chance that we are also just probability engines?
We as humans are more individual than an LLM and we do have more mechanisms in our brains, like time components, emotional interactions, social systems.
And not even your mention of "Markov chain" is correct: an LLM architecture is not how a Markov chain works, otherwise we wouldn't have the scaling issues we have with LLMs...
I remember similar news about ML services that generate music: they are able to reproduce melodies and lyrics from copyrighted songs (if you find a way around filters on song or artist titles) and even producer tags in hip-hop tracks.
All this latest ML growth is built on massive copyright violations.
I wanted to note that you cannot compare learning by a human and "ML learning", which is basically calculating coefficients from copyrighted materials. Don't those coefficients fall under the definition of a "derivative work", by the way?
ML models are not "learning" in the same way humans do, and while they use the misleading word "learning", it has a completely different meaning; also, ML models are not humans and therefore they are not subjects of laws; the engineers who perform the calculations are.
So comparing calculation of ML model parameters to a person studying art is incorrect; you should compare engineers performing calculations with data from copyrighted material to a person studying art. It is immediately obvious that these cases are not equivalent. And those engineers are not learning anything in the process, so they cannot use the analogy as an excuse.
The fact that those services can reproduce copyrighted content proves that it was used during training. And was it legally obtained? Do you think services like Udio bought millions of CDs? Or did they get the training material somewhere else? You cannot legally download content from streaming services, for example.
There's no difference between producing and copying. Let's take a real world example: music samples. There's an entire clearinghouse process for music sampling, more or less forged after the 80s/90s when sampling blew up. Record companies and artists were like, "hey, that's my song", courts agreed, and a market was created.
This is pretty analogous to what's happening now, which is code samples. Developers are like, "hey that's my code". But that's where we're diverging, and this is probably because big companies aren't involved. People were sampling Atlantic Records' stuff. People aren't sampling Microsoft's stuff, they're sampling random GitHub OSS project guy's stuff.
But to your point, you're basically arguing that it's fine as long as no one listens to "Bitter Sweet Symphony". Most people think it's not the end user (listener) who's infringing copyright, but the party doing the copying (The Verve). Even if we accept your principle here, you're putting way too heavy a burden on people who use services like Copilot. Am I supposed to check that everything I autocomplete is properly licensed? You more or less said "shut these services down" in so many words.
That isn't true, strictly speaking. The right to reproduce work is covered by copyright law, irrespective of whether the reproduction is commercial or not.
So what? Most uses when we're talking about code or artwork are going to involve someone taking the generated result and publishing it somewhere.
> But if I use it, in a commercial manner, then it becomes a copyright violation.
No, that's incorrect. Commercial use has nothing to do with it. Any act of distribution, regardless of whether or not it's for commercial or personal use, regardless of whether you charge $100,000 or $0, falls under copyright law.
You can usually dismiss anyone who talks about the law in such absolute terms, especially when it comes to copyright.
The U.S. Copyright Office has some guidelines about fair use, and non-commercial as well as personal use are listed as considerations that courts take into account when judging whether an unlicensed copy constitutes an infringement:
You're technically correct, but this ignores some unfortunate realities of DMCA for the smaller fry.
The bigger fry are and will keep fighting to change the ideas of copyright, now that it's inconvenient for them after spending decades strengthening it.
I pretty much said that, yes. But the argument of "personal use" ends when you post on the internet. Which is what most "big fries" are doing as some endpoint.
But those are more "medium fries" anyway. The "big fries" aren't gonna take any risk whatsoever unless they are going the edgy parody route (someone like Adult Swim). There's little upside to Bungie or Microsoft or Laika posting Mickey Mouse to begin with.
In summary, you probably can and Disney won't care, but it is technically not allowed outside of fair use constraints. It's not legal in the same realm that it's illegal to ride a bike in a swimming pool in Baldwin Park, CA. Key point: don't make money, don't be stupid, and don't be unlikeable.
Even then, most platforms control the content and they probably won't defend you and would take it down. That's not a legal matter so much as a platform policy.
>people aren't getting sued for drawing Mickey Mouse for personal use because Disney would rather go after the big guys.
It's even simpler than that. If a lawyer costs $10k to run a small claims case, and they have a shaky chance (fair use) or a low payout, it's not profitable to go after you.
That's why other factors need to come into play, like potential brand damage, scaring off imitators, or simply pissing off the wrong lawyer somehow.
Yes and no. Strictly speaking, you need a license to allow you to use copyrighted material, even if it's non-commercial (obvious example: if you post, say, Jasmine making out with Hitler on some non-monetized account, Disney can take that down. Or try. It really comes down to whether the platform wants to argue fair use or parody or whatnot. Most won't defend you, though).
But enforcement-wise, Disney won't bother going after every potential copyright violation. They will focus on the biggest money makers or the biggest potential for brand damage. So it's not a worry for most people who will just draw some Mickey Mouse for a friend, or even a private client, as long as they aren't stupid.
Not at all: unless a license is provided, the code is fully protected under copyright and you have _no_ rights to copy it or use it in _any_ way you want (unless falling under "fair use" clauses for the jurisdiction you're in/the author is in).
No license[0] is the default fallback when nothing is provided. Realistically, it's "use at your own risk", because someone who doesn't license may not even be aware of what others do with it (or you fall back to whatever rules of the platform you posted on).
If there's only one way to do it, or a developer familiar with the field would independently come up with the same way of doing it, then the "copyrightability" of the result comes into question.
This doesn't stop you getting yourself a legal headache though, of course.
Do you want to live in fear as a software developer because you did the same thing as others? Even if the problem has basically only a limited number of ways of doing it?
Do you want a certain amount of code to be copyrighted and patented?
I personally don't. I see an advantage in limited patents for complex magical algorithms where someone was really sitting there and solving a hard problem, to reap the benefits for a short period of time, but otherwise no.
I do not want to check every code block for some (c) or patent.
This has happened quite a few times with me as well, both with ChatGPT and Phind (Phind in particular is often basically Stack Overflow with a few variable names changed).
>I think some of the confounding variables here were, 1. this was a very specific use case and not many examples existed, and 2. all OpenGL code looks similar, to a point.
Yeah, that's why I wouldn't trust AI right now with much except the most basic rendering boilerplate. I'd be so brazen as to wager that 90% of the most valuable rendering recipes are proprietary code within some real-time rendering studio. Of the remainder, half is in some textbook and may or may not even be available to scrape online.
LLMs still need a training set, and I'm not convinced the information even exists to be scraped on the public internet for this kind of stuff (if years of googling have taught me anything).
An amended version of the complaint had taken issue with GitHub’s duplication detection filter, which allows users to “detect and suppress” Copilot suggestions matching public code on GitHub.
The developers argued that turning off this filter would let users “receive identical code” and cited a study showing how AI models can “memorise” and reproduce parts of their training data, potentially including copyrighted code.
However, Judge Tigar found these arguments unconvincing. He determined that the code allegedly copied by GitHub was not sufficiently similar to the developers’ original work. The judge also noted that the cited study itself mentions that GitHub Copilot “rarely emits memorised code in benign situations.”
I think this is the key point: reproduction is the issue, not training. And as noted in the study[1] reproduction doesn't usually happen unless you go to extra lengths to make it.
> reproduction doesn't usually happen unless you go to extra lengths to make it.
And who is to say that people who want to copy your code without adhering to your license terms or pay won't go to extra lengths? or am I missing something here?
> And who is to say that people who want to copy your code without adhering to your license terms or pay won't go to extra lengths? or am I missing something here?
It seems like they could just download your code from GitHub and violate your license like that, so... unclear why they'd bother doing it via Copilot.
It's unclear to me that that is the case. If I prompt an image generation model specifically to generate an unlicensed picture of Batman, I do not suspect that a judge is going to be sympathetic to the argument that the t-shirt I was selling was made by DALL-E so I should be free of any copyright infringement claims against me.
Why would specifically prompting the LLM to generate similarly copyrighted code be any different? That's what's being discussed here - that people will "go the extra lengths" to intentionally copy your code.
The "It wasn't me, it was the tool, even though I directed the tool quite specifically to do the thing that happened" is not a new defense in the legal systems of the world. We are already generally are able to deal with these without significant issue.
Although legally there may be a problem, in practice this seems to be a bit of a "so what?" scenario. Our hypothetical dev is using a tool that writes a function. It writes the function. The dev was going to get a function that did what he wanted one way or another so it isn't clear that it makes much difference if the function happens to be a clone of a copyrighted function.
If it were my GPLed code being copied, it doesn't seem to make any difference whether I press copyright claims or not. It won't help me compete against this dev's company. They'd use Copilot to rewrite the code differently. The speed of improvement in code-writing tools doesn't give me much hope that my code is uniquely clever.
Wouldn’t the person themselves be in violation at that point, and the owner of the code could go after them? (I know this wouldn’t be super practical, but it seems to match what would have happened without an LLM in between.)
This is the primary impediment to my large org at $work using llm-based coding tools: potential accidentally duplicated source code and the legal implications (including the risk of a viral copyleft license easter egg).
This is legal for a person to do, right? Why should it not be legal for an LLM to do?
AFAIK, but IANAL, I can go look at a solution in a GPLed library and then code that solution in my proprietary code base. As in, "oh, I see, they used a hash map for that and a list of this, and locked at this point. I'll code up something similar." As long as you don't "copy the code" you're fine.
Am I wrong?
Is it just a matter of scale? (hey there LLM, re-write all of OpenOffice into ClosedOffice).
> This is legal for a person to do, right? Why should it not be legal for an LLM to do?
Because humans are human. We are awarded extra super duper rights that inanimate things, such as a computer program, are not awarded.
I think we should therefore work to demonstrate this is not the case, which is very hard to do. No idea why we would work under the absolutely insane, never-before-seen assumption that computer programs deserve human rights. I don't know where this line of reasoning started, or why, but at its core it's so unbelievably preposterous (and against the very wellbeing of mankind).
It’s trained on publicly available code, so what would be the point of that? If you’re looking to specifically infringe the copyright of code available on the open web, using an LLM code completion engine is just about the most roundabout and unreliable way to achieve that.
Isn't this all about turning community-developed GPL-licensed code into your own (LLM-regurgitated) code that you can then make proprietary, saving a lot of money and not giving back anything to the original community?
That's not generally how the legal system works. If you've disabled the duplication filter, and then generated substantial amounts of duplicated GPL code and then violated the GPL license, a judge is not going to say "Well, fair enough, you used the loophole of copying code through an LLM, after all." They're going to treat it as a willful copyright violation, the same as if you'd just copied and pasted it.
And again, I have to ask: why? Why would you think that putting tons of time and effort into tricking the model into repeating memorized code is going to be a better investment than just using it normally to implement the functionality you want?
It is, but "AI" is considered such an "important" technology at the moment that no judge is going to want to be the one that "destroys innovation" by enforcing copyright law. If the perception of the technology changes in the political world in an unfavorable manner, these cases would go the other way or (if there's precedent) they'll pass laws overturning the precedents.
Personally, I would not rely on blogs like "developer-tech.com" for unbiased information about "AI" litigation.
I would read the order and draw my own conclusions.^1 (Note the plaintiffs are attempting to file an interlocutory appeal re: the dismissal of the DMCA claims.)
No doubt I am in the minority but I am more interested in the contract claims, which have survived dismissal, than in the DMCA ones. If plaintiffs can stop the "training" of "AI" through contract, then in theory DMCA violations can be avoided.^2
2 For example, license terms which specifically prohibit using the licensed source code to train language models.
Is there a "fair use" defense to contract liability.^3
> Is there a "fair use" defense to contract liability.
NAL, but in common law jurisdictions and maybe others there can be various implicit terms to any contract, like fair dealing.
Also if you can literally claim fair use, unless you signed a contract waiving that right (if that's even possible), it doesn't matter.
Heck, most software licensing in the USA is purporting to grant you rights that you already have from the Copyright Act. That's right, in US law you own the authorized copy you receive. The license claiming that you don't is questionable at best. To be fair the courts somehow have managed to become divided on this, but the plain language of the law itself is crystal clear, explicitly granting the right to make additional copies as necessary for execution.
Also that hardly matters when the fake license can still be "enforced" via lawfare. Most everyone is going to choose to pay up rather than fight a protracted legal battle against Microsoft.
IANAL but I think for the specific dismissed claims in this specific case, reproduction is the issue, and it doesn't indicate anything about training.
I think it would be extremely hard to make claims against GitHub for training AI with code on GitHub, assuming GH has the typical "right to use data to improve service" clause that usually shows up on free-service EULAs.
> I think this is the key point: reproduction is the issue, not training. And as noted in the study[1] reproduction doesn't usually happen unless you go to extra lengths to make it.
But Microsoft is selling a service capable of such reproduction. They're selling access to an archive containing copyrighted code.
To me it's the equivalent of selling someone a DVD set of pirated movies. The DVD set doesn't "reproduce" the copyrighted material unless you "prompt" it to (by looking through the set to find the movie, and then putting it in your DVD player), but it was already there to begin with.
Strongly disagree with your analogy here. Lots of services are capable of doing things that are against the law but in general it's the actual breaking of the law that is prosecuted.
The closest thing to what you are suggesting is the Napster ruling, where a critical part was that the whole service was only about copyright infringement. In the Github case most people are using it to write original code which is not a copyright violation so there is substantial non-infringing use.
But what I think doesn't matter. The judge disagreed with that interpretation too.
And if you then write a program that is remarkably similar to the one you read, that's copyright infringement. As another reply noted--but without anywhere near enough verbosity--this is not without risk, and people who intend to work on similar systems often try to use a strategy where they burn one engineer by having them read the original code, have them document it carefully with a lawyer to remove all expressive aspects, and then have a separate engineer develop it from the clean documents.
Copyright doesn't protect general concepts, methods, or common knowledge. So you could write a program that is remarkably similar to another one and not infringe copyright. Just like you can write a book with the same plot as another without infringing copyright.
Plus given that most programming languages have a finite grammar and a limited number of ways to express general concepts, the individual bits of code that make up most programs are probably not sufficiently original to be copyrightable in themselves.
But the result is that you can't assume that this is the case: you have to actually look on a case-by-case basis to decide whether the chatbot you are using -- one which has no understanding of copyright as nuanced as either of us -- merely learned something general purpose and applied it in a way which did not lead to infringement, whether the code it generated is technically infringing but is fair use, or whether what it developed isn't allowed.
A lot of people seem to want to believe that the output of the chatbot is somehow inherently clean in all cases, and they cite this idea that a human can read code and learn from it... but a human can -- even without realizing it!! -- infringe on copyrights, and so such an analogy doesn't absolve the chatbot. If we then continue to assume that the chatbot's output is clean, then we are ascribing it a superhuman ability to launder copyright.
> strategy where they burn one engineer by having them read the original code, have them document it carefully with a lawyer to remove all expressive aspects, and then have a separate engineer develop it from the clean documents.
Interesting. What kinds of situations is that strategy used for?
(I'm familiar with cleanroom, which I understand means that you start with un-tainted engineers, who've credibly never been exposed to the proprietary IP, and they work only from unencumbered public documentation and from running the system as an opaque box. Then there's also validation, like with parallel systems and fuzzing. But I haven't thought through in what situations this might not work, so it might require the tainted documenting approach.)
This is the full or classic version of clean room reverse engineering. Using unencumbered public documentation is relatively new, that kind of detailed documentation wasn't widely available. Car manufacturers still protect their service manuals with an agreement that basically says they can't be used for this but I think a lot of service centers stopped making people sign them.
The classic tech story that used this technique is the IBM BIOS and the resulting spread of "IBM PC-compatible" machines. There is a little bit about it on the Wikipedia page (https://en.wikipedia.org/wiki/IBM_PC%E2%80%93compatible). Random factoid: the TV series "Halt and Catch Fire" has a depiction of doing this IBM clone reverse engineering and did a pretty good job at it.
That sounds like a question of degree for the jury — the evaluation of whether or not the facts presented warrant a claim of sufficiently infringing similarity. In this case the judge felt the plaintiffs weren't even close to demonstrating infringement, so the question never appeared in front of a jury.
If we're moving the question to one of degree then it's up to Microsoft and others to monitor their output because even if a model is not trained on copyrighted material, you can still accidentally infringe. Even if you never listened to music near or by Lady Gaga, that does not mean you can use your own original inspiration to accidentally write songs that are too similar to Lady Gaga. In other words, like the Ed Sheeran case.
Does that have any legal basis? It sounds a lot like what Google did for their Java engine, which essentially rewrote the entire engine with the same APIs, while referencing the original source code. Didn't the courts decide it was fine?
Saying that it “launders” only makes sense under the position you are claiming. So, it might be fine as a conclusion/claim, which I guess is how you’re using it, but it wouldn’t be good to use as part of an argument leading to your conclusion.
(I didn’t phrase that well…)
I generally don’t consider “learn” to apply only to entities which have the rights of a person, and of which ownership would amount to slavery.
It is a common saying “You can’t teach an old dog new tricks.”. It is widely understood that, in contrast, one can often teach a young dog new tricks. The dog, in this case, learns the trick. We do not generally consider training an animal to do a task to be slavery. Well, some vegans might? But it is far from a typical view of the word “slavery”.
So, am I saying that these language models are as rights-having and mind-having as a dog? No, much less so. Still, I have no objection to the word “learn” being used in this way.
Some people say “it’s OK to ingest copyrighted material automatically at scale, since it’s for learning purposes”. They use two kinds of arguments for this.
Argument A:
A1. It’s a basic human right to be able to learn from things you see. You browse the Internet, you read some source code, you learn. Doesn’t matter what’s the license, you are free to do this.
A2. It’s called “machine learning”, so the machine does the same.
A3. Machine learning can use any content its operators can get a hold of.
This is obviously wrong, because the machine is being assigned human rights. We can argue about what exactly the prerequisites are for something to be granted human rights—it’s maybe not a specific physiology (some might say certain smart non-humanoid animals deserve it), but it’s pretty certainly sentience and consciousness. Meanwhile, the whole reason AI tech is big is that there is supposed to be no sentient being who would understand (and therefore deserve any right to be treated well and get rewarded). If you take that away and grant AI human rights, then there is no point in this tech.
So, either the machine has human-level sentience and is being forced to work (which humans famously tend to consider “slavery”), born and killed on demand, etc., or the machine is not learning in the sense under consideration because it’s an unthinking tool for its human operator.
Which brings us to argument B:
B1. It’s a basic human right to be able to learn from things you see. You browse the Internet, you read source code, you learn. Doesn’t matter what’s the license, you are free to do this.
B2. If you use a computer or [insert technology] to learn, that’s OK.
B3. An LLM is just another instance of that technology. You use LLM and you learn.
This is wrong for slightly more subtle reasons, but on the bright side there are multiple of them.
First, it’s not clear that someone learns while using Copilot to produce a work for them. If I asked Copilot to write me a Fibonacci number generator, have I learned how to write it? If I ask Midjourney to draw me the 2055 Los Angeles skyline in the style of Picasso, did I learn how to draw?
Second, and this is a crucial fallacy, making a computer famously does not require ingesting all of the copyrighted material you can subsequently access through that computer. Said computer can exist just fine without it; the LLMs, however, cannot.
The inputs (knowledge and parts) required to produce the computer you’re using were largely obtained through ordinary ways (patents licensed, hardware paid for), whereas the inputs required to produce an LLM have been, some would say, effectively stolen.
I think one difference is that you are seeing things as defaulting to “not allowed to use the work for whatever purpose, and the only reason it is ok for people to look at it to learn from it, is because they have a human right to do so, which overrides the default”, while I would view things as “by default you can do what you want with the media, provided that it doesn’t go against a particular law (such as rules against distributing copies of it or substantial portions of it, etc.)” .
So, I think linking things to “it is a basic human right” is a mistake.
The argument is not “it is a human right that this can be done, therefore it is allowed.” The argument is “this does not violate any of the rules that could potentially forbid it.” .
> one difference is that you are seeing things as defaulting to “not allowed to use the work for whatever purpose, and the only reason it is ok for people to look at it to learn from it, is because they have a human right to do so, which overrides the default”
I think in law the default is “can use as allowed by the owner”. If the owner doesn’t specify, then the default is something like “can’t distribute”.
This is thanks to the idea of property, and more specifically intellectual property, which is responsible for a lot of innovation (including computing and LLMs themselves).
If you think some sort of intellectual property communism—you make stuff, but you don’t get to own it, and you get what you are given—is best, then fair enough, that’s your opinion.
While I don’t think a full intellectual-property-communism (as you phrase it) would be best, I do think something a bit closer to it than we currently have would likely be better. (Mostly reducing copyright lengths a decent bit, closer to what they were in the early years of the US.) I think I agree that if implemented correctly, it can be a net benefit in promoting innovation/production-of-good-things. (I also think the existence of trademark law and patent laws is good, though they may also have some flaws.)
Hm, in terms of defaults, my understanding is that, “by default you can do what you like with whatever data, but because copyright laws create copyrights, you are forbidden from distributing copies of a work which is under copyright, or distributing (or publicly performing) things which are substantially based on such a work, unless you are doing so in accordance with permission from the copyright holder”. So, because the law only restricts the distribution/public-performance of copies of the work, or of portions of the work, or of derivative works that are substantially based on the work, copyright doesn’t let the copyright owner dictate what can be done with the work, outside of how the permission they may grant to distribute or perform things based on the work can include conditions. My impression is that if you aren’t distributing or performing the work or a derivative work, then copyright doesn’t restrict what you can do (outside of those things) with the work. Furthermore, my impression is that “derivative work” does not encompass everything that is in any way based on the work, but only things satisfying certain conditions about, like, substantial similarity, and whether it also competes with the original work (but I think that last bit is an established and repeated precedent, rather than a law?).
Though, I’m not very well versed in law, and I don’t know how this fits in with a license to use a piece of software! I suspect that software is a special case, and that if it were not special-cased, software licenses wouldn’t legally need to be agreed to in order to be allowed to run the software? But that’s just a guess, and if I’m wrong about that then it would suggest that I’m wrong about the other thing?
As a side note: I think that property is a much more natural concept than intellectual property. The way I see it, IP was created by states, but property more generally makes sense outside of states (I don’t say that it predated them because I don’t know; I’m far from a historian.).
> My impression is that if you aren’t distributing or performing the work or a derivative work, then copyright doesn’t restrict what you can do (outside of those things) with the work
LLM operators like ClosedAI are distributing derivative works at scale commercially.
Only in a sense of “derivative work” which is rather broad, and which I don’t think copyright law restricts (though this still needs to be settled by the courts). To be a copyright violation, it doesn’t suffice that the one work had a causal influence on the other work.
There is a test that I think is called a “three pronged test” with the 3 prongs being (iirc) something like:
1) substantial similarity: is the allegedly infringing work substantially similar to the work which it is allegedly infringing
2) Was there an actual causal influence by the work that was alleged infringed on, on the work that allegedly infringed?
3) Could the allegedly infringing work act as a substitute (economically) for the work allegedly being infringed on?
The third prong seems satisfied. The first one does not. The second one also seems satisfied but I’m less confident that I’m remembering the idea correctly (though I could be wrong about the three of them as a whole).
> And if you seriously say that this tool is learning how to program, ask yourself if that tool’s operator is effectively a slave owner.
This doesn't follow. I don't see why knowledge and intelligence necessarily entail that it has a desire for autonomy, which is why slavery is really abhorrent.
You can ask yourself how much desire for autonomy would indentured servants have after a few generations. Us humans can get used to almost everything. Presumably, simply being used to abuse and not desiring freedom just because you never had (or could even imagine) it doesn’t make being abused or lacking freedom “good”.
> I don't see why knowledge and intelligence necessarily entail that it has a desire for autonomy
I’d replace that with “knowledge, intelligence and human-like sentience”. Someone proposed to grant the tool the right humans normally have. (Humans can learn from reading any stuff under any license, so why not the tool.) Well, you’d think human-like sentience/consciousness are required for those rights, and human-like sentience/consciousness would desire the appropriate degree of autonomy.
> Us humans can get used to almost everything. Presumably, simply being used to abuse and not desiring freedom just because you never had (or could even imagine) it doesn’t make being abused or lacking freedom “good”.
I don't think this is plausible. You can see your slaver has freedoms you don't, and no doubt you would desire to be free of your shackles like they are, so imagining it wouldn't be difficult at all.
> Someone proposed to grant the tool the right humans normally have. (Humans can learn from reading any stuff under any license, so why not the tool.) Well, you’d think human-like sentience/consciousness are required for those rights
I don't see why sentience would be required for some entity or tool to have the right to learn and synthesize new things like humans do. Copyright is a legal fiction that serves a purpose, and we can grant these rights under any circumstances we like, as long as we think it's a good idea.
If you're arguing that LLMs cannot imagine this "freedom", then I'd say that then an LLM and a Human are fundamentally different. Therefore, LLMs should not be granted human rights.
I think this is a matter of having your cake and eating it. You can't say LLMs should have some human rights (particularly the ones that generate revenue), but not others, like a right to freedom.
> I don't see why sentience would be required for some entity or tool to have the right to learn and synthesize new things like humans do
On the contrary, I don't see why sentience should not be required.
These laws, for all they have existed, only apply to humans. Dogs cannot use them. A plant cannot use them. It is therefore reasonable to say you must be a human to use these rights. In my mind, what is unreasonable is claiming a computer program should be granted these rights. You'd have to justify why that should be the case, what good that can do for humanity as whole.
Turns out that's very hard, so AI people don't do it. They just give up. Instead they start out at an assumption that puts their ideology in a favorable position - that being that computer programs should be awarded human rights.
But that assumption, you'll find, is not actually foolproof. If you ask around, a lot of everyday people will consider it preposterous. They might call you insane. So, to me, you must justify that in tangible terms.
> You can't say LLMs should have some human rights (particularly the ones that generate revenue), but not others, like a right to freedom.
There is no evidence that this is the case. These rights are not necessarily all or nothing. They are all or nothing for humans because humans have a bundle of properties that entail these rights, but artificial intelligences may have only a subset of those properties, and so logically may only get a subset of those rights.
> On the contrary, I don't see why sentience should not be required.
Sentience is the ability to feel. All that's needed for learning is the ability to perceive and have thoughts. Maybe there's some deep, intrinsic connection between the two, but this is not known at this time, and therefore I see no reason to connect the two.
> In my mind, what is unreasonable is claiming a computer program should be granted these rights.
There's a long history of human abuse of "lower animals" because we assumed they were dumb and non-sentient. Turns out that this is not the case. We should not be so open-minded that our brains fall out, but we should also be very wary of repeating our old mistakes.
> We should not be so open-minded that our brains fall out, but we should also be very wary of repeating our old mistakes
Precisely, which is why it makes absolutely no sense to me to say that AI can't be granted a right to freedom.
I mean, what are you even arguing here? Do you not understand that this statement is in support of my position, not against?
> Sentience is the ability to feel. All that's needed for learning is the ability to perceive and have thoughts.
Highly debatable. You just made this up. These aren't the definition of anything. Once again, you need to bring something tangible to the table or people will call you crazy.
> therefore I see no reason to connect the two
Once again, this is your problem here. You're starting off, beginning, with an assumption that favors your stance. You can't do that, especially when said assumption has never, not even once, been true for all of human history.
Au contraire, I see no reason NOT to connect the two and you certainly haven't given any reasons why. These rights have always, only, applied to humans. I say we retain that status quo until someone gives something to show otherwise.
> artificial intelligences may have only a subset of those properties
In order to split these qualities you need to understand what they are and define them well from first principles. Long story short, if you have solved the hard problem of consciousness we are eagerly awaiting your world-shattering paper.
To me a claim that an LLM is sufficiently like a human when it ingests data, but suddenly merely a tool when its rights start being concerned, is mental gymnastics unsupported by requisite levels of philosophical inquiry.
> There's a long history of human abuse of "lower animals" because we assumed they were dumb and non-sentient. Turns out that this is not the case
If you apply that logic to LLMs, you have bigger issues than granting them a single right that only puts their operators in the clear when it concerns copyright laundering.
Cool, so slavery where slaves do not see the slavers (let us call it “proper segregation”) is OK?
> I don't see why sentience would be required for some entity or tool to have the right to learn and synthesize new things like humans do
If sentience is not required for a “right” to learn, then I have nothing else to say to you. There is nothing there that is even learning. Learning is a concept that presumes an entity with volition, aspiration, consciousness.
> Cool, so slavery where slaves do not see the slavers (let us call it “proper segregation”) is OK?
Sorry, you cannot erase the desire for autonomy even with "proper segregation".
> If sentience is not required for a “right” to learn, then I have nothing else to say to you. There is nothing there that is even learning. Learning is a concept that presumes an entity with volition, aspiration, consciousness.
Learning does not presume any such thing, and I also don't think you understand the meaning of sentience.
> Sorry, you cannot erase the desire for autonomy even with "proper segregation".
Good, then we are on the same page with respect to abuse when LLMs are concerned, if we are to consider them sentient (as a prerequisite to be learning).
> Learning does not presume any such thing, and I also don't think you understand the meaning of sentience.
If we could train the desire for autonomy out of humans, it wouldn't make human slavery any less abhorrent, even if they volunteered for the process and/or were well compensated.
It absolutely would make it less abhorrent. Maybe you think it would still be abhorrent, but this is debatable. People literally do consent to slavery-like roles in places like the BDSM community, and some people might find it distasteful but not illegal or morally abhorrent, because these people still have the autonomy to opt-out at any point.
I also doubt training out the desire for autonomy is possible. Explore-exploit is fundamental to any kind of decision making, such as food foraging. That inclination goes deeper than higher brain functions.
I don't see why volition requires consciousness. People are very fond of thinking human qualities are irreducible and make far too many simplifying assumptions than are warranted.
And even still, these words are used in many, sometimes mutually exclusive, meanings (“learn” as in “machine learning” is a far cry from “learn” as in “live and learn”). I wonder how the courts could even properly consider all implications if these words don’t have precise legal definitions all the way down to what it means being a human.
LLM is not “anyone”, because LLM is a thing but “anyone” refers to people. If you consider LLMs people, then you should ask yourself whether they are suffering abuse from being treated the way they are by their operators.
Ok. So anyone "can" use a computer to do the same thing then. With the added part of "using a computer" it is now directly comparable and it is allowed.
> And if you seriously say that this tool is learning how to program
The tool is used by a person. The person is the one who takes the action, not the computer. So the point stands.
> Can you “use a computer” to watch a pirated film? Sure. Is it legal? Nah.
In many circumstances you can't mass distribute completely identical, non-transformative, non-fair-use copies of large portions of other people's copyrighted works, if that's what you meant.
But there are many exceptions to that rule where you are allowed to use or distribute other people's works. And just like a human being is allowed to use other people's copyrighted works in those many exceptions, a human is also allowed to use a computer to take advantage of those legal exceptions.
The only point here is that when you brought up that this uses a computer in your first post, that's not really a relevant detail.
A person can use those exceptions that allow them to use other people's copyrighted works, and they can do that with or without a computer and it is legal in those exceptions either way.
> If watching that pirated film helps you learn something, does that make it legal?
> If the film was pirated not by you but by some for-profit company that charges you for watching it, does that make it legal?
It depends on many factors. Yes there are many cases where yes it is legal to use other people's works.
Edit:
Evidence that I am right: you are right now commenting on a thread where a judge threw out all the copyright claims.
> In most circumstances you can't mass distribute completely identical, non-transformative, non-fair-use copies of large portions of other people's copyrighted works
That law was defined long before there was a capability to launder authorship at scale in the way being discussed. The law does not account for this novel capability.
The law is intended to protect IP, which promotes innovation and creativity by creating relevant incentives. If that was the intention of the law, and it is not interpreted in that way, it ought to be revised for it to continue to serve those objectives.
> Evidence that I am right: you are right now commenting on a thread where a judge threw out all the copyright claims.
This only shows that you read the headline. It does not show that you (or the judge) are correct about the core issue.
I suppose it depends on the country. I heard the US is a somewhat unusual culture where concerns usually encoded in legislation in other countries instead battle it out in courts.
Learning without making money does not shield you from copyright violation, otherwise public school teachers would just start saving money by copying whole texts. We don’t live in a society where it’s okay for an 8 year old to say “but I’m just trying to learn, I’m not a business!”
And making new music which is a synthesis of your life experience with copyrighted music does not mean copyright violation, regardless if you’re making money or compensating all the authors who’ve inspired you.
> We don’t live in a society where it’s okay for an 8 year old to say “but I’m just trying to learn, I’m not a business!”
I mean we absolutely do if you're an 8 year old. Except in the most NIMBY HOA-driven areas of the culture nobody expects a kid setting up a lemonade stand to get a business license or submit to health code inspections.
I'm basically an LLM. I trained in a similar way as the LLM, off of books and open source code, guessing what comes next, making errors, adjusting my brain. I make money the same way as the LLM, off of inference calls.
Maybe you don't truly understand how an LLM works, or is trained, or how inference works. Or how humans work. Money goes out during training, money comes in during inference.
In this example a person is looking at code they can't legally copy, learning from it, and re-implementing the same functionality. Someone's definitely making money off of that. That person, that person's employer, clients and vendors, lots of people.
People get upset about AI because 1) the scale is much bigger because no human can read and generally remember all the code on GitHub while a sufficiently large model can, 2) it's a lot easier to prompt an AI into giving you a passable MVP than it is to code one from scratch, ESPECIALLY as a junior or even mid level, 3) there are unlikeable billionaires making money now where there weren't before.
> 1) the scale is much bigger because no human can read and generally remember all the code on GitHub while a sufficiently large model can,
Semi-true, but it often hits an uncanny valley of either leaving enough comments in to know it was copied, or missing enough context that I'd rather the thing give me actual permalinks to whatever it thinks is relevant (i.e. like a search engine, but better).
> 2) it's a lot easier to prompt an AI into giving you a passable MVP than it is to code one from scratch, ESPECIALLY as a junior or even mid level
how do we define 'passable'? I've already run into a few cases where a jr/mid is doing 'passable' MVPs that, again, hit that 'uncanny valley' where subtle stuff is broken in an important way but it's hard to detect.
> 3) there are unlikeable billionaires making money now where there weren't before.
IDK 'Eyeball scans' are a bit much for me.
That said, this completely hand-waves over the knock-on effects.
All of the 'hype' generated about this, all the resulting Gartner reports, every organization latching on to the concept the same way I was once asked if I had any thoughts on how to integrate 'blockchain' into an enterprise that literally had no reason to, short of attracting investment.
The problem this time, is they have something 'closer' to a product.
And we are seeing the result of that product more and more.
- Layoffs
- People having to deal with impacts of layoffs in their org and surprise surprise, the AI tools didn't replace the lost heads well enough.
- Consumers dealing with the pain of these tools being applied. For example, my insurance provider cut their online chat staff in favor of an 'AI bot' that couldn't help me connect to one of the few humans left when all I was trying to do was add my wife to my auto policy. Or my bank that randomly decided one morning that spending <$5 for eggs and bacon was enough to hard-lock my card without even a text prompt -and- invalidate my password so that I had to call in to their line. (The eggs and bacon were purchased at my work's cafeteria, which I had previously purchased from.)
- People sick of AI 'spam' that shows that uncanny valley. E.g., for ungodly reasons I sometimes see clickbait about cars and take it. And then they get very obvious facts completely wrong that any human being actually writing the article would have at least checked Wikipedia for first, like how many years the car was produced...
- People sick of getting work from colleagues/superiors where it's obvious an AI generated it and they didn't take the time to make sure it was right. Or maybe they did and it was below their skill, because again, that uncanny valley is a good bullshit generator. I've seen plenty of 'procedure documents' and 'technical requirements' that were obviously AI generated, yet actually catching the -subtlety- of the (still very important!) errors was difficult due to it's capabilities. The problem of course, is it's now someone -else's- problem to make sense of it, and frankly it's a proof of brandolini's law.
I'm not exactly sure, but I think the underlying philosophy here is that code is a lot more like math than like music, and you can't copyright math.
So to have any copyright protection at all for code, the Office had to carve a narrow trail where the standard for copying is higher, because there are plenty of circumstances where there is only one right (or most optimal) algorithm, and there's no protection for the algorithm itself.
And anyone can also look at that source available code, write their own version, distribute it, be sued for copyright infringement, and lose in court, because their version is too similar to the original.
It seems obvious how they're comparable, in the same way that you can compare a parrot talking to human speech.
Black-box both systems and there's enough similarity to make a layperson go "Huh. Those look remarkably similar," even if the mathematicians among us know the underlying mechanisms, inputs, and outputs are quite different.
The significant step here is that there is no human doing what you describe: no human in the loop looking at source code and learning from it.
You have an autonomous system that's ingesting copyrighted material, doing math on it, storing it, and producing outputs on user requests. There's no learning or analogy to humans, the court is ruling that this particular math is enough to wash away bit color. The ruling was based on the outputs and the reasonable intent of the people who created it and what they are trying to accomplish, not how it works internally.
It's not the first such math either: if you take copyrighted data and AND every byte with 0x00, that certainly washes the bits too.
> You have an autonomous system that's ingesting copyrighted material, doing math on it, storing it, and producing outputs on user requests
People are also autonomous systems that ingest copyrighted material, do "math" on it, store it, and produce outputs on user requests.
The real difference is the scale at which a computer can ingest copyrighted material is MUCH greater than what a person can do. Does that make it illegal? Maybe, maybe not.
Am I in a bad sci-fi novel? People aren't machines! How is this such a difficult concept? LLMs have as much thought as quicksort. I swear to god humans will anthropomorphize everything except ourselves. Do y'all's salaries depend on this or something?
There is no rule that says "If a human can do something, a computer program instructed by a human can do the same thing." Hell that rule doesn't even exist for humans acting as stand-ins. I can't send someone I hire out of the country and have them use my passport. It's why you can watch a movie in a theater but an autonomous system working on your behalf, a camera, can't.
Github made a tool; it's as alive as a hammer. It "learns" as much as your programmable padlock. Whether or not the human employees of Github are allowed to use copyrighted material to make that tool, and whether the human employees of Github are performing a copyrighted work when users make use of the tool, is the legal question.
Normally, if it is legal for a human to do something, I would assume that human could legally use a computer to help do that thing. Are there cases where this isn't true?
The idea that the LLM violates copyright by reading/viewing a work is the same idea that you violate the copyright by reading or viewing the work. Perhaps you're creating an organically encoded copy of the work within your brain.
No copies are being made, and definitely no copies are being sold.
No, not really. You mistake what the purpose of copyright is.
If I used a chatbot to sell the entire text of Harry Potter, all at once, that would still be illegal even though it's through a chatbot.
What's legal, of course, is creating transformative content, learning from other content, and mostly creating entirely new works even if you learned/trained from other content about how to do that. Or even if there are some similarities, or even if there were verbatim "copies" of full sentences like "he opened the door" that were "taken" from the original works!
Copyright law in the USA has never disallowed you entirely from ever using other people's works, in all circumstances. There are many exceptions.
> Copyright law in the USA has never disallowed you entirely from ever using other people's works, in all circumstances. There are many exceptions.
Sure, and the question is: "does using an AI chatbot like Copilot fall under one of those exceptions?" My position -- as well as the position of many here -- is that it shouldn't. You may disagree, and that's fine, but that doesn't make you fundamentally correct either.
> If I used a chatbot to sell the entire text of Harry Potter, all at once, that would still be illegal even though it's through a chatbot.
Right, which is why you sell access to the chatbot with a knowing wink.
> You mistake what the purpose of copyright is.
At one point it was to ensure individual creators could eke out a living when threatened by capital. I frankly have no clue what the current legal theory surrounding it is.
It would still be illegal to ask the chatbot to recreate the text of Harry Potter.
Now, if you were to ask it to generate a similar story based on Harry Potter, that would be fine. Especially since that's basically what JK Rowling did after watching Star Wars.
Harry Potter is a clone of Star Wars? I don't really see it, any more than any story that follows the Hero's Journey. I remember being a kid and reading Eragon though, and that really was very similar.
The law doesn't work this way. Deliberately circumventing copyright via something like Copilot will have different consequences, even if the eventual outcome is that Copilot is allowed to train on open source code that has restrictive licenses.
> The law doesn't work this way. Deliberately circumventing copyright via something like Copilot will have different consequences, even if the eventual outcome is that Copilot is allowed to train on open source code that has restrictive licenses.
Copilot is a deliberate circumvention of copyright. It might be legal but that doesn't change the clear intent here: charging people without having to do the work you're charging for.
The comments seem to misunderstand copyright. Copyright protects a literal work product from unauthorized duplication and nothing else. Even then there are numerous exceptions like fair use and personal backups.
Copyright does not restrict reading a book or watching a movie. Copyright also does not restrict access to a work. It only restricts duplication without express authorization. As for computer data, the restricted duplication typically refers to dedicated storage, such as storage on disk as opposed to storage in CPU cache.
When Viacom sued YouTube for $1.6 billion they were trying to halt the public from accessing their content on YouTube. They only sued YouTube, not YouTube users, and only because YouTube stored Viacom IP without permission.
> When Viacom sued YouTube for $1.6 billion they were trying to halt the public from accessing their content on YouTube. They only sued YouTube, not YouTube users, and only because YouTube stored Viacom IP without permission.
Now do these steps for OpenAI instead of YouTube. Only OpenAI doesn't let users upload content, and instead scraped the content for themselves.
From the article it sounds like the plaintiffs were alleging that ChatGPT is effectively doing unauthorized duplication when it serves results that are extremely similar or identical to the plaintiff's code. They aren't just alleging that reading their code = infringement like you seem to imply.
The judge argues that copilot “rarely emits memorised code in benign situations”, but what happens when it does? It is bound to happen some day, and when it does would I be breaching copyright by publishing the code copilot wrote?
Just a few weeks ago a very similar suit for stable diffusion had its motion to dismiss copyright infringement claims denied.
https://arstechnica.com/tech-policy/2024/08/artists-claim-bi...
> The judge argues that copilot “rarely emits memorised code in benign situations”, but what happens when it does? It is bound to happen some day, and when it does would I be breaching copyright if i, unknowingly, published the code copilot wrote?
That's irrelevant to the case being made against GitHub, which is why it is addressed in the decision.
> Just a few weeks ago a very similar suit for stable diffusion had its motion to dismiss copyright infringement claims denied.
The case against Midjourney, SAI, and RunwayML is based on a very different legal theory -- it is a simple direct copyright violation case ("they copied our work onto their servers and used it to train models") whereas the Copilot case (the copyright part of it) is a DMCA case claiming that Copilot removes copyright management information.
It's not really surprising that the Copilot case was easier to dispose of without trial; it was a big stretch, but it had the advantage for the plaintiffs that, were it allowed to go forward, it wouldn't admit a fair use defense the way a traditional direct copyright violation case does.
They aren't really "similar" except that both are lawsuits against AI service/model providers that rest some subset of their claims on some part of Title 17 of the US Code.
I am not a lawyer, but I explore these questions by imagining an existing situation with a human. If your friend gave you code to publish and it turned out he gave you someone else's code that he had memorized, would you be breaching copyright? The answer in that case is plainly yes, and I think it would be no different with an LLM.
Substituting a human for a computer changes some aspects of the situation (e.g., the LLM cannot hold copyright over the works it creates), but it's useful because it leaves the real human's actions unchanged. However, for more complex questions that interact with things like work-for-hire contract law, you may need to take a more sophisticated approach.
You'll get a second system that searches your code against an index of copyrighted code. If it finds, say, a >70% match against some unique code, the output gets flagged for rewrite. This could be automated in Copilot by simply regenerating with a different seed.
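A minimal sketch of what such a flagging pass could look like, purely for illustration (the 70% threshold, the token-shingling scheme, and the known_corpus index are assumptions on my part, not anything Copilot actually ships):

    # Illustrative sketch only: token-shingle overlap against an index of known code.
    def shingles(code: str, k: int = 8) -> set:
        toks = code.split()
        return {tuple(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

    def similarity(candidate: str, reference: str) -> float:
        # Fraction of the candidate's shingles that also appear in the reference.
        a, b = shingles(candidate), shingles(reference)
        return len(a & b) / len(a) if a else 0.0

    def needs_regeneration(candidate: str, known_corpus: list, threshold: float = 0.7) -> bool:
        # Flag the generation for a retry (e.g. with a different seed)
        # if it overlaps too heavily with any indexed work.
        return any(similarity(candidate, ref) >= threshold for ref in known_corpus)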
In some languages there are few ways (or one way) to do things, so everyone writes the same-looking for loops, etc. And sometimes in the same order, with the same filenames, etc. by convention - especially in the case of heavy framework usage, where most people's code is mostly identical percentage-wise. The flagging system would have to be able to identify framework usage separately from IP, and so on.
Beyond that, it seems like you'd need a highly expressive language for this to work well. You can effectively scan for plagiarism in English because it's so varied that it really is an outlier to see several lines of text that are identical to each other from different sources, but maybe it's not that strange to see entirely identical files, or at least very similar code, in totally distinct, say, React or Ruby-on-Rails projects.
I think of code methodologies as more like construction techniques. Maybe some pieces and parts are patentable, and some can even be productized as tools, but a lot of it is just convention and technique.
The same thing that happens if you write a song which happens to have the same pattern of four notes as another song: absolutely nothing, because that would be an insane standard to hold copyright to and would lead to nothing ever being produced without a tidal wave of infringement suits.
Interesting. The parts that survived are the contract claims and the open-source license claims.
Contract is understandable - it supersedes almost everything else. If the law says I can do X but the contract says I can't, then I almost certainly can't.
It's nice to see open-source licenses being treated as having somewhat similar solidness as a contract.
The FSF's argument for their copyleft was always based on exactly the same foundations as typical copyright licenses. If Alice can say that you must pay her $500 to do X with her copyrighted thing, then logically Bob can say that you must obey our rules to do X with his copyrighted thing.
This invites courts to pick, smash copyright (which would suit the FSF fine) or enforce their rules just the same (also fine). It makes it really difficult for a court, even one motivated to do so, to thread the needle and find a way to say Alice gets her way but Bob does not.
Structuring your arguments so that it's difficult for motivated courts to thread this needle is a good strategy when it's available. If you're lucky a judge will do it for you, as in Carlill v Carbolic Smoke Ball Co (the foundation of contract law) or indeed Bostock v. Clayton County: hey, says Gorsuch, the difference between this gay man and this straight woman isn't that they're attracted to men, that's the same; the actual difference is that one of them is a man, but that's sex discrimination, so this is a sex discrimination case!
If you have access to the Copilot weights, you should consider leaking them. We shared our code with you because we wanted it to be free, not sold back to us at $10/month.
Fwiw, I've never paid for Copilot. I was automatically given free access for open source contributions. My largest public repo had maybe 100 stars. I've made minor commits to larger repos.
I don't know what the threshold is, but I'm fine with the trade-off I received.
Then you should be happy to know that there are multiple open-source coding models with freely available weights out there already! Some of them are as good as, or possibly better than, Copilot.
That should satisfy anyone who actually cares about this, as opposed to only being interested in making snappy gotcha one-liners.
Could still be monumental if it creates the case law to be referenced in the future. Lawsuits are a lot easier to start when you know you're going to win because a previous case was extremely similar. Which is to say, this could have a major impact on the industry even if the punishment (this time, for doing it while it was uncharted territory) was a slap on the wrist.
Read the decision, the breach of contract claim simply survived dismissal, but the surrounding discussion makes it clear it doesn't have much of a prayer (which is the same as the "open-source license violation", the OP article is trash).
I was lucky to learn early-on that publishing important things to the web meant relinquishing control of not just the IP, but my own agency and fate. The cost far exceeded the benefits of generosity, be it contributions to FOSS, public blogging or documentation, or even just writing.
Time is the only fixed resource, and mine is proprietary, exclusive, and for sale to the highest bidder.
Thankfully, others are more altruistic. I have benefited from many developers freely sharing their ideas in forums, code on GitHub, and learnings on blogs.
Sure Google has stolen it to build an empire that most are complicit with.
Sure OpenAI has stolen it to build products most are supportive of.
Sure evil benefits from good, but that doesn't mean we should neglect to help others just to spite them.
My hope is that there's a middle ground, a way to keep our good deeds for the benefit of other good people, not for the benefit of large corporations that want to leech off of our work.
The law usually lags behind technological advancement, and I'm hoping we're just seeing this in action right now, and that over time, better legal protections will be put into place.
there's "sharing ideas on forums" and then there's giving all of your source code, public and private, to Microsoft to host, instead of just setting your git remote to user@yourownhost:/path/to/reponame and setting up SSH keys
I appreciate the viewpoint of the GP and it's telling that it is downvoted when it is not spam, it is not abusive, and it is fully in-line with the stated and implicit etiquette of this site. It's just unpopular, so people are down-voting it.
FOSS is kind of culty and it's very apparent in the reaction to opinions like the OP's. If you don't believe what he said about giving up your agency and your fate when you give away your code online, look into what happened to fommil[0]
Coding for one hour while in a good mood and with a clear head is significantly more productive than coding for eight hours while tired and bored.
> mine is proprietary, exclusive, and for sale to the highest bidder.
What’s your problem with GitHub, then? All the people who worked on it did exactly that. Seems quite rich to be complaining of the behaviour of others when by your admission you’re for sale to whoever pays.
It is a fixed resource for people; everyone dies. Git was fine until it became a way to launder intellectual labor to benefit MSFT shareholders instead of the people who did the work.
> It is a fixed resource for people; everyone dies.
What has dying got to do with it? Everyone dies at different ages and points in time. That’s the opposite of fixed. Again, time isn’t fixed. Even if you sell your time by the hour, not every hour is equally productive.
> Git was fine
You’re once more conflating git the tool and GitHub the service. They’re separate things and in your own link they recommend alternatives to GitHub which use git.
> it became a way to launder intellectual labor to benefit MSFT shareholders instead of the people who did the work.
Again, how is that different from what you claim to do? You explicitly said you sell your time to the highest bidder, meaning you’d have done the same if you had been paid for it and thus have no moral high ground. Or you wouldn’t in fact have done it because you have scruples, in which case you don’t really sell your time to the highest bidder but take other factors into account.
You can’t have it both ways. Personally I hope you do the latter.
It's not like I'm unhappy about people who are altruistic in different ways – indeed I am grateful!
But any internet contributor who thinks their free work isn't being exploited in the most perverse and senseless ways is fooling themselves. You're better off volunteering at a rehab clinic or old folks home.
> I was lucky to learn early-on that publishing important things to the web meant relinquishing control of not just the IP, but my own agency and fate.
Not only is that not true, it's contradicted by the very page you link.
That page has a list of links to resources you can use to self-host git repositories you want to publish, so you don't have to give up control of anything.
(Although, against GitHub as I am, even I am unable to fathom how publishing things on GitHub could possibly mean relinquishing control of your fate.)
Oh the link was just a 'get off git' page. There's plenty of other ways to 'go around' this consolidation – which is just another way to launder your work to benefit shareholders.
If you believe so, then please share what you know. Elucidate me. Your comment didn’t advance the conversation in the slightest. For all anyone knows, you’re the confused one (either about GitHub’s history, git’s history, or understanding the comment).
But you refuse to explain how? If you don’t explain your reasoning, how can anyone evaluate if you’re wrong or not? Have you considered that you may be wrong (as could I)?
> Trolling doesn't work on me bud.
Ah, so anyone who disagrees with you and asks you to clarify your unexplained conclusions is a troll now. Got it.
> Give up now.
You bet I will. It's now abundantly clear that conversing with you isn't productive.
The purpose of Copyright is to promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
"Sciences" refers not only to fields of modern scientific inquiry but rather to all knowledge
The hacker ethic is a philosophy and set of moral values within hacker culture. Practitioners believe that sharing information and data with others is an ethical imperative
I think you'll find that many people who consider themselves "hackers" disagree on a lot of this stuff. There's no general one-size-fits-all opinion or "ethic" here.
Where do we draw the line between a computer converting a png to a jpg, resulting in the image looking different, and an artist making a fair-use protected transformative derivative piece of art inspired by it?
The line we draw is "did a human creatively and intentionally produce the output" or "did a computer".
It doesn't matter that we've built a compression algorithm no one actually understands (which is what LLMs effectively are, a lossy compression algorithm, compressing their input into a model), the bar for copyrightable creativity was, and still is, human creativity. Which, by definition, a computer does not have, unless a human infuses it into the computer specifically (by for example using the computer to produce a specific copyrighted work).
I'd argue the exact technology or how it works does not matter at all. The only thing that matters is if it's performed by a human, because making derivative works is a right only granted to humans.
Therefore, by definition, an AI cannot create a derivative work, because it's not human.
I see no reason to grant computer programs human rights. I think a lot of people would have a hard time articulating some reasons to do that. So they don't, and talk about the technology instead. I think that doesn't matter. If you can't tell me, and convince humanity, why a computer program should be granted human rights then I don't think we can even get to a point where the technology itself matters.
Compression schemes take an item, reduce it in size, and return the same item (lossy or lossless) when asked.
LLMs and other generative AI are not designed for that, nor are they particularly good at it. What they are good at is returning new results using the 'learning' they have developed via training.
If I want to send someone a sample of code, I'll use zip. If I want to generate code from a prompt, I'll use an LLM.
I'm pretty sure that's not where the line is drawn, legally. Humans look at the produced image and use their judgement to decide if it is substantially similar to the original. Whether or not a computer was used as part of the transformation is not relevant.
> The U.S. Copyright Office has taken the position that "in order to be entitled to copyright registration, a work must be the product of human authorship. Works produced by mechanical processes or random selection without any contribution by a human author are not registrable."
If a human draws a fractal, that is art. If a computer produces a fractal, that is math, and math is not copyrightable.
Copyright also allows for independent derivation, so if you produce an image or sentence that is identical to another, but can somehow prove you did not know about the supposed original, you're in the clear for copyright.
It is impossible to know if something is a copyright violation without also knowing all sorts of things, including the intent of the author, and if the author knew about the supposed original work.
> "in order to be entitled to copyright registration
Did you read this part at all?
Do you know what copyright registration is?
Hint: whether you get copyright registration has nothing to do with whether what you did was legal.
You seem to know enough of the buzzwords that this distinction should be obvious to you. Which makes me confused as to why you would misinterpret such a clear and important difference, unless it was to intentionally mislead people who aren't aware of the law.
You could even go read the first sentence of that wikipedia article to understand what it was about.
"The threshold of originality is a concept in copyright law that is used to assess whether a particular work can be copyrighted. It is used to distinguish works that are sufficiently original to warrant copyright protection from those that are not"
Notice. This is not the same as if it infringes on someone else's work.
Why did you misunderstand that article so seriously?
And not a good one. Digital artists' art is output by a computer too. And I "creatively and intentionally produce output" when I type prompts into Stable Diffusion.
> The line we draw is "did a human creatively and intentionally produce the output" or "did a computer".
Nope! Thats not true at all.
If a human wrote out the Harry Potter books word for word, that wouldn't be protected.
Instead, the line is drawn at if the new works is covered under fair use.
And a human is perfectly able to use a computer to do that in all sorts of circumstances. The human or computer being involved is completely irrelevant here.
> human creativity
Nope. Human creativity only matters for protecting works. It has nothing to do with whether you can create them with a computer.
It is perfectly possible for it to be completely legal to use a computer to produce a piece of work, without infringing on anything, and yet the newly created work isn't protected in the future from other people copying it.
I'm fairly certain that's often a deciding factor in legal questions?
If your originally produced content, without ever having seen the other thing, is substantially similar to something that already exists, you are much more likely to be in trouble (in the sense that someone will take issue, not in the sense that you can be convicted) than if you copied something existing, and transformed it into something unrecognizable.
You're the second person who understood the opposite of what I said. I'm not a native speaker, so can you tell me why what I said was ambiguous or contrary to my intended meaning?
> They provide information on that site about when the Sun rises and sets and so on... but they also provide it under a disclaimer saying that this information is not suitable for use in court. If you need to know when the Sun rose or set for use in a court case, then you need an expert witness - because you don't actually just need the bits that say when the Sun rose. You need those bits to be Coloured with the Colour that allows them to be admissible in court, and the USNO doesn't provide that....It's a question of where the numbers came from.
That's just saying that your bits have to be authenticated/verified to be accepted as accurate.
Which makes sense and is entirely different than "your bits are illegal and your other identical bits are legal."
> your bits are illegal and your other identical bits are legal
That happens all the time though. If I rip a copy of a movie for backup purposes, that rip is legal. If I upload a torrent of it, the exact same bits on my disk are now illegal distribution of a copyrighted work.
If I am the artist who owns the copyright of the work, my bits can legally be redistributed.
The intent and legal status of the bits matters in a ton of cases.
It would be the same as taking a photograph of a copyrighted work. You own the copyright of the photograph. But you cannot sell it without permission, or you violate the copyright of the original rights holder.
Or maybe it would be the same as photocopying a book, where laws restrict the proportion of the work that can be reproduced without permission.
Or maybe it will be its own thing, where courts and government decide existing laws are insufficient and we need new laws.
In Prodigy's defense, the samples were very creatively transformed so that the final result does not resemble the originals. It is more like cutting small patches from paintings to make a new painting (rather than drawing a similar painting), and it is not what ML models do today.
They definitely don't sound the same; very similar, yes, but far from the same. Besides, was it proven in court or otherwise by some legal entity that these songs aren't considered copyright violations? Just because it has not been litigated doesn't prove it isn't a copyright violation.
Anyway I'm definitely not a copyright expert but I just found this argument extremely weak.
That is not how copyright works. Music in the US and many similar legal systems has a compulsory license provision that allows for anyone to produce and distribute covers of music as long as all licensing requirements are met. With the long history of covers in music, how much enforcement there is around the meeting the licensing requirements bit varies pretty wildly. If you are not complying with the licensing terms, however, and the rights holder comes after you, no amount of having copied the song by hand will protect you from copyright claims.
Similarly, I can't draw a batman cartoon with pencil and paper and avoid copyright claims when I try to sell the episodes.
Please do not go around infringing on copyright and thinking it's OK because you recreated whatever it was by hand.
> Marvin Gaye’s Estate won a lawsuit against Robin Thicke and Pharrell Williams for the hit song “Blurred Lines,” which had a similar feel to one of his songs
Which refutes your assertion of no potential copyright violation.
Since we are discussing copyright law and not physical laws, it doesn't matter if a machine intentionally created new work. The machine does not get copyright. The operator or owner of the machine might.
As a photographer I'm hoping this goes in a direction where any photographers who look at my work owe me a percentage of their future profits, since they've trained their wetware model on my IP.
Patents sort of work that way, except that even people who didn't look at your work owe you their future profits.
I think I'm hoping for a result that anyone can train any model on any content, regardless of that content's copyright status. Mostly because I want AI assistant tools to be as effective as possible, to be able to access the same information I can access. But however it turns out there will probably be some unintended consequences.
Just so you know that will also mean companies like Disney will now have a new source of revenue. Hunting down randos who made pictures that look like they were made by someone who saw little mermaid once.
Reproduction. The training claims were always tenuous under the law. If I save a copy of your code, I probably haven’t done anything wrong. If I make a slot machine that sometimes randomly sends someone else your code, I get in trouble when it does send a copy out if I don’t have permission.
Your question begs the answer. An AI cannot learn, legally speaking. It is not a legally recognized actor. The person building or operating it is who is involved here. Much like legally the photographer is involved in copyright law rather than the camera.
Once framed correctly from a legal perspective, you have a person creating a tool using copyrighted material. Is this legal? For images, probably. However, selling or renting the tool or images generated using it is an open question. You can legally photograph a copyrighted image using a camera. But you cannot sell the photograph without permission from the original rights holder, because that would violate their copyright. And things are different for copyrighted text, such as a book (and computer source code?). You can only legally photocopy a portion of a book as fair use. Copying an entire book without permission is a copyright violation.
You are misusing the word "learning" (like misusing the word "piracy" for copyright violation). An ML model is not a human and is unable to learn anything. Also, an ML model is not a subject of law.
So your sentence should read: "where do we draw the line between engineers of VC startups calculating model parameters by processing copyrighted content, and humans learning from codebases". Then the difference becomes obvious.
If you have actually learned from the code, you learned about code structures, and yes, they are attributed/named/noted.
Everything from "Gang of Four" patterns to applied mathematical algorithms like a Fast-Fourier-Transform have attribution and history.
Where I find myself frustrated with the general argument you put forth is that it alleges that pattern-extraction is the extent and essence of human learning.
I do not think that the current LLMs-are-AI trend has grasped either the essence of intelligence or that of learning.
I recognize that one cannot paint an entire field of study with broad strokes, but there is a certain amount of in-industry Kool-Aid consumption that, while perhaps rewarded by more gullible portions of the market, is poisoning the public well of goodwill.
This can only lead to a very harsh backlash, which we already observe undermining the deeply-funded attempts at foisting this stuff upon the world at large as "AI".
Computers are not human beings and never will be.
The fact stands that CoPilot has no notion of code outside of its training data and is merely a pattern-extraction machine.
You can dismiss this claim, but you cannot disprove it.
Now you just have to legally draw that line. Legally, a company is a person too. Lately in Malaysia, we've been redefining a lot of laws to cover "natural persons" (aka humans) because what would happen is that companies would steal money and other unethical things, and the company would be blamed for such actions instead of the humans running them.
A lot of people dislike LLMs and generative AI (fairly) and are reflexively trying to reach for tools in our legal framework, claiming it's obviously already illegal. I don't think this is going to work. Generative AI is quite obviously novel to anyone who isn't in denial - and claiming existing copyright laws are going to cover it seems like a lost cause.
We need new laws. Especially regarding deepfakes, it's shocking how many people think revenge porn laws and such are going to be enough here. Rather than just focusing on the data usage, we need more fundamental laws and rights, like the right to control representations of ourselves, like Japan has, where producing images or voice/video in your likeness is prosecutable straight out. Likewise we need laws that explicitly target data use for training that is separate to copyright.
The way LLMs are trained is obviously too similar to how humans learn, and the transformation and then output produce works that are novel based on that "learning", just like humans do. This is so fundamentally different to what copyright laws were made to cover, I find it infuriating how many people handwave these arguments away. Only in perfect 1-to-1 regurgitation does it even feel close to something copyright would be able to cover.
I'm one of the "dislikers" although the neural network stuff is itself an amazing tool in my opinion. I like to fall back on a much easier argument (IANAL and this is not legal advice), can these code generating things generate code without reading (training on) encumbered code?
Humans can learn syntax and basic programs then independent of any "similar code", humans can produce new algorithms that solve specific problems. Now sure, similar code can be searched for on the internet but the code is "attributed" and will likely contain a license. If the human copies it too closely, attribution and licensing rights come into play. The LLMs apparently just bail on attribution.
The way LLMs are trained is that they are fed an absurd amount of code, humans cannot train this way because the volumes of code to be read are too great.
The consequence of all the abuse of the intent of open source licenses has just been that I no longer write any open source code. I have far fewer issues with a code generator trained on GPL code that produces GPL code, with the LLM being under the GPL as well. It's the commercial licensing and paying for it that seems to breach the intent of these licenses to me.
I guess Microsoft has gotten what it wanted and has got to the extinguish stage of its plan for open source finally and all it needed was a chatbot.
Just to have a separate story, I started licensing my stuff as public domain after the AI stuff happened, so I can encourage the development of AI models without any restrictions.
While I believe attribution (a requirement in MIT and GPL) doesn't make sense for AI because of its nature, the arguments just made me say "fuck it" and remove the attribution legal requirement altogether. It's still a good thing to attribute when it makes sense, though right now legally it's just too crazy for me to want to participate in this copyright regime.
I honestly just don't see how all this will work legally, in the future.
I don't know anything an LLM (or "AI") can do that a human couldn't, with enough time. If it can get a human in trouble, it should get the operators of the AI in trouble too. Likewise, if a human can do it, I don't see why an AI is any different.
If a textbook is a megabyte (2^20 bytes) and might take a week to read and grok, then it would take 2^20 weeks to read one terabyte (2^40 bytes), which is about 20 thousand years.
ChatGPT-3 was trained on 570GB of text data, according to reports. So if you have 10,000 years, yeah, sure, a human could read it all. But memorize and recall?
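The arithmetic roughly checks out; here's a quick back-of-the-envelope version, assuming the same one-megabyte-per-week reading rate and the reported 570 GB corpus:

    # Back-of-the-envelope: how long would a human need to read GPT-3's reported corpus?
    mb_per_week = 1                 # one textbook-sized megabyte per week, as above
    corpus_mb = 570 * 1024          # 570 GB expressed in MB
    weeks = corpus_mb / mb_per_week
    print(round(weeks / 52))        # ~11,000 years, the same order as the figure above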
Well, I'm thinking more specialized. I mean, how often do people come up with the same riff or melody and end up in court? You don't need an AI to either purposely or accidentally skirt copyright.
The entity with agency claiming copyright of the code being written (machines can't claim copyright) is responsible for ensuring that the code that they are writing is free of license encumbrance.
This is not any different than a person copying a code snippet from Stack Overflow that is under the GPL and used on Stack Overflow as part of fair use for educational purposes.
You, the person, writing the code are responsible for making sure that your code is yours.
> This is not any different than a person copying a code snippet from Stack Overflow that is under the GPL and used on Stack Overflow as part of fair use for educational purposes.
Code snippets and answers on Stack Overflow also have their own license[1] and the terms[2] specifically outline that they're not responsible if you go posting things you aren't permitted to (s.8).
Where it differs is that the various chatbots are removing this attribution. Even the permissive licenses require attribution.
I've no doubt OpenAI's terms are such that the end user is ultimately responsible, but do you not think that creates a problematic situation wherein they can effectively obscure and violate license terms?
Copyright refers to copying. No matter how complex the scenario you create, ultimately if the output is a copy of what someone else holds copyright on, you are liable. Or at least I would be as some random developer. Is the argument here that OpenAI is free to do the same and because they made a complicated enough system of smoke and mirrors they shouldn't be held responsible?
It is legal and not a copyright violation to post GPL licensed code on Stack Overflow under the CC-BY-SA 4.0 license for purposes of education (without license attribution).
It is a GPL license / copyright violation to take the GPL licensed code that was posted to Stack Overflow (without license attribution) and use it in your code under the presumption that it was licensed under CC-BY-SA 4.0.
There is no real difference in terms of copyright violation between copying GPL licensed code from Stack Overflow (which can be there) or code from Copilot - in either case, you, the human doing the copy and paste, are the one doing the infringing and are responsible for ensuring that the provenance of the code that you are copying is free from any licensing encumbrances.
It is not a copyright violation for a radio (a machine) to play a song that it received from the airwaves. It is a copyright violation for you, the human, to take that radio and have it be a performance in the park where people can dance to the music on the radio played loudly.
Machines with no agency cannot infringe copyright. If I took a photo of a page of a book with my iPhone, and the iPhone did image to text on it, that's not the iPhone's fault. And I would possibly be within my rights to take that photo. I would be infringing if I then published that image or the text that the iPhone generated.
I believe / have the understanding that copyright can only be violated by an entity with agency - and some entity with agency is the one that ends up publishing or redistributing the work.
To that end, it doesn't matter if code was written by Copilot, copying from StackOverflow, or a random person on Fiver (who may or may not have used Copilot). If I publish it, I am the person with agency that infringed copyright.
If we say that "ahh, but you used Copilot - that was an infringement" ... ok, so I copied some unattributed code from Stack Overflow that I believed was CC-BY-SA. Is Stack Overflow responsible for my accidental infringement? If the answer is "no, you - as the person pasting the code into your work - should always be checking the copyright provenance of unknown code you're pasting in" then I believe that same answer should be applicable to all the other situations too.
> The court required Fantec to pay a contractual penalty in the amount of € 5,100 based on the prior settlement agreement. In addition, the court awarded the plaintiff’s expenses in enforcing the GPLv2. (This award is standard under German law and is based on Section 97a (1), 31, 69c no. 3 and 4 of the German Copyright Act which awards costs for a justified warning by a party which is so cautioned.) The court affirmed the culpability of Fantec’s violation by classifying the violation as negligent: the seller of firmware may not rely on suppliers' statements about compliance. The distributor of GPLv2 software must carry out the assessment or commission experts to make the assessment even if they incurred additional costs.
> The court decided that FANTEC acted negligently: they would have had to ensure to distribute the software under the conditions of the GPLv2. The court made explicit that it is insufficient for FANTEC to rely on the assurance of license compliance of their suppliers. FANTEC itself is required to ascertain that no rights of third parties are violated.
It is the responsibility of the distributor to comply with the license.
In this light, it doesn't matter what Copilot "claims" about the license of code - the programmer copying the code is responsible for verifying its copyright status and is at fault if they publish that code.
> Machines with no agency cannot infringe copyright. If I took a photo of a page of a book with my iPhone, and the iPhone did image to text on it, that's not the iPhone's fault. And I would possibly be within my rights to take that photo. I would be infringing if I then published that image or the text that the iPhone generated.
> I believe / have the understanding that copyright can only be violated by an entity with agency - and some entity with agency is the one that ends up publishing or redistributing the work.
The big problem with this argument is that the machine is not publishing things, OpenAI the company is. They have created the entire circumstances around which this copying can happen.
Let's consider the Napster case. If the argument is "software can't violate copyright" then what was the RIAA's problem with a mass-scale copying and sharing of their music? Why was Napster able to be sued into nonexistence? They only created the software, after all.
There's precedent here that creators of software can be held liable for the copyright abuses that software leads to or permits.
> It is the responsibility of the distributor to comply with the license.
By all measures, OpenAI is the distributor of the code here. After all, their software is outputting licensed code.
> The copyright law of the United States (title 17, United States Code) governs the making of photocopies or other reproductions of copyrighted material.
> Under certain conditions specified in the law, libraries and archives are authorized to furnish a photocopy or other reproduction. One of these specific conditions is that the photocopy or reproduction is not to be “used for any purpose other than private study, scholarship, or research.” If a user makes a request for, or later uses, a photocopy or reproduction for purposes in excess of “fair use,” that user may be liable for copyright infringement.
> This institution reserves the right to refuse to accept a copying order if, in its judgment, fulfillment of the order would involve violation of copyright law.
The machine is not at fault for reproducing an exact copy of copyrighted materials. It is perfectly within fair use of copyright if it is used for private study, scholarship, or research.
If that person goes beyond that and uses the reproduction for other purposes, then that person is liable for infringement - not the machine.
To be comparable, Xerox would need to have been the sole holder of all photocopiers everywhere, and charge fees for use. Not to mention you're also then in the physical world.
This is why Napster is a far better comparable. It's all software, via the Internet, and was at scales no photocopiers could compete with. Only it goes a step worse than Napster. In Napster's case, they simply built software and services primarily aimed at facilitating P2P file sharing. In OpenAI's case, they themselves are responsible for creating the copies of infringing materials. They performed the scraping, and they perform the distribution.
Sure, but you've missed a step: the act of an LLM/AI spitting out a block of copyright-encumbered code to you is itself copyright infringement, for which OpenAI (et al.) should be liable. You can then commit further copyright infringement by copying that code into your project and distributing it.
It's similar for Stack Overflow: they require contributors to only post code that they have a legal right to post, but nothing actually stops them from including copyright-encumbered code in an answer that they don't have the rights to. The copyright holder would be within their rights to take legal action against the person who posted it, and/or send a DMCA takedown notice to Stack Overflow. And, again, you can commit further copyright infringement by copying that code into your project and distributing it.
> it's only purpose is to permit the user to infringe others copyright
I'm not a fan of Copilot, but this is an absurd take. Direct comparisons to Napster etc. make very little sense.
> without the ability to produce infringing works it is nothing
I don't really agree. While Copilot certainly trains on a lot of data it doesn't have the legal right to redistribute, its output is often (nearly always, according to this judge's opinion) not similar enough to any specific copyrighted work to be considered infringement.
I do think, when the output is similar enough to a specific copyrighted work, there should be consequences.
Depends on whether you have a service relationship with a third party and they are providing a service or you rolled your own. If, for example, I paid a third party company for consultants to write some code for me but they provided source code they didn't have the right to, I think I should be able to hold them accountable for that. Whether it's a person or some automated process doesn't change that IMO.
I expect a court case would be used to determine what a normal person could expect, what was represented by the consultant company, and what exactly I requested, to determine how much fault each party has.
Everybody whose rights were infringed. The GPL often technically makes that "everybody else" by granting what were otherwise exclusive rights (to make and distribute copies) to everybody and then taking them away from infringers.
So e.g. Company X makes a GPL'd program to do A, but Company Y just copy pastes it into famous product P and acts as though they made it and obviously doesn't give out source. As a random person who doesn't even own P, the argument would be that technically the GPL says you should be able to get source code for the program from Y, even though you didn't buy their product P - you were harmed by their refusal to do what the GPL requires, so you can sue them.
Now, suing is probably not a good idea in this case, a court is likely to either insist you aren't really injured or that they can't help you, or both, but I think it could work at least in theory.
> the argument would be that technically the GPL says you should be able to get source code for the program from Y, even though you didn't buy their product P - you were harmed by their refusal to do what the GPL requires, so you can sue them.
I don't think that's the case. Whenever I see discussion about GPL violations, the copyright holder is the one who has to go after the violator (and often getting them to do something is difficult, because legal challenges can be expensive); the consensus seems to be that users who receive the software in a manner not compliant with the GPL don't have standing to sue. I'm not sure if a user has ever tried, though, so not sure if this has been tested in court.
SFC versus Vizio is exactly what you say doesn't exist. The SFC deliberately brought a case where they don't own the copyright, and says they are harmed and here's why the court should find for them.
I assume that's because even if they aren't the original rights holder, they still have standing because they (as the public) were given rights under the GPL implicitly by anyone using the GPL, for the GPL covered items. Since they were denied those rights, they were harmed, and thus have standing.
At least that would be my "layman that finds court cases and discussions of them interesting so consumes that as entertainment fairly regularly" best guess.
I mean if you are held accountable for using copyrighted code by the owner, you should then in turn be able to hold the consultant accountable for being the source of it, and the blame and responsibility may be shifted in part or in whole.
I don't think it's all that different than if I'm an employer and my employee does something illegal. An investigation can be made as to whether the acted on their own acted based on what the directions or prevailing understanding was at the company. That may change who is responsible in part or in whole and what steps need to be taken to provide recompense to those impacted by that illegal activity.
Every situation will be complex and unique in its own way. That's what courts are for, determining the unique aspects of a case and making specific ruling based on the law and the situation, as the judge (in determining what is acceptable and willing to be seen) and jury (to determine whether someone needs to be held accountable and to a degree how) see fit.
That's for someone with legal experience and knowledge to answer.
I can only offer my opinion and more questions. For example; if you're a punk rock band and hire an artist to create promo material, and they draw a "vulgar Mickey Mouse" without your knowledge, who is in trouble? Seems you should just work backwards until you get to a human or org and have them tried in court on a case by case basis. Maybe that's a bad idea for reasons others can explain, its just my current opinion.
I’ve heard corporate types call open source projects “security risks” and “commie nonsense” but it doesn’t stop them from trying to acquire the work for free to profit off of it. It’s greedy and duplicitous. It’s capture.
As a lawyer who has worked in the federal judiciary, it's understandable that someone outside the legal profession would have some of these views ... but they're actually pretty off base.
> judges are often paid multiples less than leading lawyers
This part is true.
> and they are not desireable jobs and so many were attracted to the power/low competence ones.
No, no, and no. Judgeships are some of the most prestigious and desirable jobs in the entire legal profession. You have to be literally nominated by the president of the United States and confirmed by the Senate. Then you enjoy constitutionally protected life tenure. It's common to see elite lawyers, making millions as partners for large firms, leave their jobs to become federal judges. (Note that I'm talking about federal judges.)
> And also add in that with a single judge, a party/the state only needs to influence a single person.
I guess, in theory? But, in practice, this really doesn't happen. This is partially because judges highly value their independence. And their decisions are appealable, so this kind of corruption would be easily detected, or at least reversed, making it both risky and not very useful.
> The public often don't have access to all documents, transcripts and mostly what is published is the Judge's version of what the parties submissions are (i.e. one person writing history)
Nope. This is generally all public, unless there are specific confidential materials that need to be redacted. But this is unusual and disfavored.
> When a judge dismisses a case and classifies it as not refileable, always raise an eyebrow.
Eh. It's much more complicated than this. There are situations where this might raise an eyebrow -- like if the claim was recently filed and there's reason to think that it could be amended in a way that would rehabilitate it. But if it's obviously fundamentally doomed, or if it would unfairly prejudice other parties to allow it to be refiled, or for various other reasons, this can be totally legitimate.
>judges are often paid multiples less than leading lawyers and they are not desireable jobs
The first part of this is true, but the second part is laughable. Federal judgeships are highly prized and nearly impossible to get. There are only on the order of 900 Article III judges in the country and they serve for life.
>The public often don't have access to all documents, transcripts and mostly what is published is the Judge's version of what the parties submissions are
Completely false. While there can be redactions for sensitive information like trade secrets, in general everything is public record. And in particular, there is a strong presumption rooted in the first amendment and the right to a public trial to unredact anything that the court relied upon in reaching its decision.
>When a judge dismisses a case and classifies it as not refileable, always raise an eyebrow.
You get a chance to refile when the judge thinks that you could plausibly plead more facts that, when taken as true, establish your claim. When the claims fail as a matter of law, leave to amend would be futile since you can’t plead around that.
Finally, the great IP washing machine hums and can dissolve the whole structure. Bring forth your disassembly, to generate a draft, to re-generate clean source code. Cooperate-communism! It is done!
I don't think this proves you can just launder away copyright - nor do I think even we want that at this point.
First off: the claims dismissed have to do with 17 USC 1202, the part of the DMCA that deals with copyright management information. It's a bit of a plaintiff meme[0] to add a CMI claim onto a copyright infringement lawsuit. Obviously, if you infringe copyright, you're also not going to preserve the CMI. And if an AI were to regurgitate output, it doesn't even know that it did so, so it can't preserve CMI even if it wanted to.
Problem is, the AI doesn't regurgitate consistently enough to make a legal claim of CMI removal. The model does generate legally distinct outputs sometimes. You need to point to specific generations and connect the dots from the model to the output in a way that legally implicates GitHub, OpenAI, and/or Microsoft in ways that would not be disclaimed by, say, 17 USC 512 safe harbor. This is distinct from the training-time infringement claims which are still live, wouldn't rely on secondary liability, can't be disclaimed by honoring DMCA takedowns, and which I think are the stronger claim.
Let's step out of the realm of legality. Why do we want to get rid of copyright? For me, it's because copyright centralizes control over creativity. It tells other artists what they can do and forces them into larger and larger hierarchies. The problem is that AI models do the same thing. Using an AI model doesn't make you an artist[1], but it does move that artistic control further towards large creative industry. This is why you have a lot of publisher and big media CEOs that are strangely bullish on AI, a bunch of artists that ordinarily post shit for free are angry about it, and the FOSS people who hate software copyright were the first to sue.
In other words, AI is breaking copyright in order to replace it with more of the thing we hate about copyright.
[0] Or at least Richard Liebowitz liked to do it before he got disbarred.
[1] In the same way that commissioning an art piece does not itself make you an artist
Thank you for this; you really hit the nail on the head as to why this is so gross. If building, training, and operating an LLM were something well within anyone's resources, I'd probably have much less of a problem with my open source code being copyright-laundered through products like Copilot.
But that's just not where we are right now, and the result of that feels awful.
Open source models exist and can be run locally. It doesn't matter if training from scratch is impractical for the average person, if we already have free (as in freedom) models we can build off of.
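To make "run locally" concrete, here's a rough sketch of loading downloadable weights with the Hugging Face transformers library. The model name is a placeholder for whatever open-weights model you actually have the rights to use, not a recommendation:

    # Rough sketch: load open weights and generate text entirely on your own
    # machine. "some-open-weights-model" is a placeholder, not a real model ID.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "some-open-weights-model"  # placeholder
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "Explain the difference between a VBO and a VAO in OpenGL."
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))

Whether the license attached to those weights actually grants the four freedoms is a separate question, which is where the next comment picks up.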
I have several "open" models sitting around for experimentation and occasional use, but I don't think downloadable weights solves the underlying issue.
First off, none of the good models are FOSS in the sense we normally expect - i.e. the four freedoms. At the least onerous end of the scale, Stable Diffusion models under the OpenRAIL license have a morality clause[0] and technical protections[1] to enforce that clause. LLaMA's licensing is only open for entities below a certain MAU, and Stable Diffusion 3 recently switched away from OpenRAIL to a LLaMA-like "free as in beer" license. Not only is this non-free, it's getting increasingly proprietary as the entities paying for AI training start demanding a return on investment, and the easiest way to get that is to just demand licensing fees.
The reason why AI companies - not the artists or programmers they stole from - are in a position to demand these licensing terms at all is because they're the ones controlling the capital necessary to train models. If training from scratch at the frontier was still viable for FOSS tinkerers, we wouldn't have to worry about OpenAI reading all our GPT queries or Stability finding ways to put asterisks on their openness branding. FOSS software development is something you can do as a hobby, so if a project screws something up, you can build a new one. That's not how AI works. If Stability screws up, you still have to obey Stability's rules unless you train a new foundation model, and that's very expensive.
You see how this is very "copyright-like" even though it's legally contravening the letter and spirit of copyright law? Barriers to entry drive industrial consolidation and ratchet us towards a capitalist, privately-owned command economy. If I could train models from scratch, I'd make a decent model trained purely on public domain datasets[2], slather Grokfast[3] on it, and build in some UI to selectively fine-tune on licensed or custom data.
[0] To be clear, I don't have much against morality clauses personally, that's why I've used OpenRAIL models. But I still think adding morality clauses to otherwise FOSS licensing is a bad idea. At the very least, in order to put moral values into a legal contract, we have to agree as a community as to what moral values should be enforced by copyright. Furthermore, copyright and contracts are a bad tool to enforce morals.
[1] e.g. the Stable Diffusion safety filter
[2] Believe me, I tried
[3] An algorithm that increases the speed of grokking (generalization) by taking the FFT of gradients and amplifying the slow ones.
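For anyone curious about [3], here's a minimal, hypothetical sketch of that filtering idea in PyTorch, using an exponential moving average as the low-pass filter instead of an explicit FFT; the function name, variable names, and hyperparameters are mine, not from the Grokfast paper:

    # Hypothetical sketch of the Grokfast idea: keep a low-pass-filtered
    # (slow) copy of each parameter's gradient and add an amplified version
    # of it back to the gradient before the optimizer step.
    import torch

    def amplify_slow_gradients(model, ema_grads, alpha=0.98, lamb=2.0):
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            g = p.grad.detach()
            if name not in ema_grads:
                ema_grads[name] = g.clone()
            else:
                # exponential moving average = low-pass filter of the gradient
                ema_grads[name].mul_(alpha).add_(g, alpha=1 - alpha)
            # boost the slow (low-frequency) component
            p.grad.add_(ema_grads[name], alpha=lamb)

    # usage inside a training loop (sketch):
    #   ema_grads = {}
    #   loss.backward()
    #   amplify_slow_gradients(model, ema_grads)
    #   optimizer.step(); optimizer.zero_grad()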
This makes me think we need models to deliberately try to make code that's equivalent to copyrighted code, but sufficiently changed that it's not infringing.
The end state would be to make the rewriting powerful enough that trying to claim infringement would also hit manually created code.
Alternately, generate code that is optimized for some task by some metric, and show that because the code is best by this criterion, it doesn't show creativity.
Another possibility here is for the LLM vendor to log the code generation tasks typically asked for and then salt the model with vetted, correct, non-infringing code for those questions.
While I think that would probably get around the copyright infringement issues, it still bothers me.
I don't like the idea that a corporation can hoover up countless open source code and contributions (including mine), and then use that to make money selling code generation assistance to other people, even if the output of that code generation would be different enough to any specific copyrighted block of code such that it couldn't result in a copyright infringement claim.
It's not even clear that we can stop this from happening; it's certainly possible that a "GPLv4" that had a provision against using covered code for LLM training would be legally (if not just practically) unenforceable.
To me, the ickiest part is centralization. While models and training tools will probably (maybe?) get cheaper over time, building and operating something like Copilot requires a lot of money and resources. Do we really want these capabilities to be locked up inside big, well-capitalized corporations? For me, the answer to that is a resounding hell no.
That said, the presence or absence of these individuals in specific threads is curious. The "all forms of copyright are unethical" people are all over the Disney threads. Surely anyone with such a keen interest in copyright law and policy would wish to comment on this thread as well, the topics being so similar?
For crying out loud, this submission is an hour old. People have barely had time to even know it exists.
Do you think people are just seething to have arguments and set up alarms to be woken up in the middle of the night whenever there's a conversation involving copyright on HN?
Or maybe, just maybe, it's different people, and HN is not an amorphous hive mind where everyone thinks the same thing all the time.
It boggles the mind how there’s always someone on HN complaining about HN’s hypocrisy. Clearly not everyone on this site shares opinions on everything, else you wouldn’t be making the criticism.