Simon Willison had an analysis of Claude's system prompt back in May. One of the...

miltonlost · 2025-11-11T15:02:37 1762873357

All AI companies know they're breaking the law. They all have prompts effectively saying "Don't show that we broke the law!". That we continue to have tech companies consistently breaking the law and nothing happens is an indictment of our current economy.

admaiora · 2025-11-11T15:55:02 1762876502

And it's a question of do we accept breaking law for the possibility to have the greatest technological advancement of the 21st century. In my opinion, legal system has become a blocker for a lot of innovation, not only in AI but elsewhere as well.

rpdillon · 2025-11-11T16:33:04 1762878784

This is a point that I don't see discussed enough. I think anthropic decided to purchase books in bulk, tear them apart to scan them, and then destroy those copies. And that's the only source of copyrighted material I've ever heard of that is actually legal to use for training LLMs.

Most LLMs were trained on vast troves of pirated copyrighted material. Folks point this out, but they don't ever talk about what the alternative was. The content industries, like music, movies, and books, have done nothing to research or make their works available for analysis and innovation, and have in fact fought industries that seek to do so tooth and nail.

Further, they use the narrative that people that pirate works are stealing from the artists, where the vast majority of money that a customer pays for a piece of copyrighted content goes to the publishing industry. This is essentially the definition of rent seeking.

Those industries essentially tried to stop innovation entirely, and they tried to use the law to do that (and still do). So, other companies innovated over the copyright holder's objections, and now we have to sort it out in the courts.

visarga · 2025-11-11T16:47:28 1762879648

> So, other companies innovated over the copyright holder's objections, and now we have to sort it out in the courts.

I think they try to expand copyright from "protected expression" to "protected patterns and abstractions", or in other words "infringement without substantial similarity". Otherwise why would they sue AI companies? It makes no sense:

1. If I wanted a specific author, I would get the original works, it is easy. Even if I am cheap it is still much easier to pirate than use generative models. In fact AI is the worst infringement tool ever invented - it almost never reproduces faithfully, it is slow and expensive to use. Much more expensive than copying which is free, instant and makes perfect replicas.

2. If I wanted AI, it means I did not want the original, I wanted something Else. So why sue people who don't want the originals? The only reason to use AI is when you want to steer the process to generate something personalized. It is not to replace the original authors, if that is what I needed no amount of AI would be able to compare to the originals. If you look carefully almost all AI outputs get published in closed chat rooms, with a small fraction being shared online, and even then not in the same venues as the original authors. So the market substitution logic is flimsy.

sidewndr46 · 2025-11-11T16:56:07 1762880167

You're using the phrase "actually legal" when the ruling in fact meant it wasn't piracy after the change. Training on the shredded books was not piracy. Training on the books they downloaded was piracy. That is where the damages come from.

Nothing in the ruling says it is legal to start outputting and selling content based off the results of that training process.

rpdillon · 2025-11-11T17:10:44 1762881044

I think your first paragraph is entirely congruent with my first two paragraphs.

Your second paragraph is not what I'm discussing right now, and was not ruled on in the case you're referring to. I fully expect that, generally speaking, infringement will be on the users of the AI, rather than the models themselves, when it all gets sorted out.

sidewndr46 · 2025-11-11T17:30:47 1762882247

I'm in agreement that it will be targeted at the users of AI as well. Once that prevails legally someone will try litigating against the users and the AI corporations as a common group.

gruez · 2025-11-11T17:00:01 1762880401

>Nothing in the ruling says it is legal to start outputting and selling content based off the results of that training process.

Nothing says it's illegal, either. If anything the courts are leaning towards it being legal, assuming it's not trained on pirated materials.

>A federal judge dealt the case a mixed ruling in June, finding that training AI chatbots on copyrighted books wasn't illegal but that Anthropic wrongfully acquired millions of books through pirate websites.

https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...

Q6T46nT668w6i3m · 2025-11-11T16:35:10 1762878910

I don’t follow. You’re punishing the publishing industry by punishing authors?

rpdillon · 2025-11-11T16:46:26 1762879586

I'm saying that LLMs are worthwhile useful tools, and that I'm glad that we built them, and that the publishing industry, which holds the copyright on the material that we would use to train the LLMs, have had no hand in developing them, have done no research, and have actively tried to fight the process at every turn. I have no sympathy for them.

The authors have been abused by the publishing industry for many decades. I think they're just caught in the middle, because they were never going to get a payday, whether from AI or selling books. I think the percentage of authors that are commercially successful is sub 1%.

cycomanic · 2025-11-11T20:00:58 1762891258

So the argument is because LLMs are useful and the publishing industry was not involved in their creation we should disregard the property rights of the publishing industry and allow using their work without a license? By that same argument (if something useful is being build, we ignore existing rights) shouldn't not also just take the code/models from OpenAI etc. and just publish them somewhere? Why not also their datacenters?

rpdillon · 2025-11-11T20:51:57 1762894317

It's not really an argument. It's an observation that they sat on their hands while other industries out-innovated them. They were complacent and now they're paying the price.

We have laws and rules, but those are intended to work for society. When they fail to do so, society routes around them. Copyright in particular has been getting steadily weaker in practice since the advent of the Internet, because the mechanisms it uses to extract value are increasingly impractical since they are rooted in the idea of printed media.

Copyright is fundamentally broken for the modern world, and this is just a symptom of that.

1718627440 · 2025-11-11T21:14:25 1762895665

> Folks point this out, but they don't ever talk about what the alternative was.

That LLMs would be as expensively priced as they really are on society and energy costs? A lot of things are possible, whether they are economically feasible is determined by giving them a price. When that price doesn't reflect the real costs, society starts to wast work on weird things, like building large AI centers, because of a financial bubble. And yes putting people out of business does come with a cost.

"Innovation" is not an end goal.

rpdillon · 2025-11-12T02:14:19 1762913659

Innovation is absolutely an end goal, at least in terms of our legal framework. The primary impetus for copyright and patent law is is innovation: to credit those that innovate their due, and I do think this stems from our society seeing innovation as an end goal. But the intent of the system is always different than its actual effect, and I'm fairly passionate about examining the shear.

I run my AI models locally, paying for the hardware and electricity myself, precisely to ensure the unit economics of the majority of my usage are something I can personallly support. I do use hosted models regularly, though not often these days, which is why I say "the majority of my usage".

In terms of the concerns you express, I'm simply not worried. Time will sort it out naturally.

Q6T46nT668w6i3m · 2025-11-11T16:34:09 1762878849

You’re willing to eliminate the entire concept of intellectual property for a possibility something might be a technological advancement? If creators are the reason you believe this advancement can be achieved, are you willing to provide them the majority of the profits?

thedevilslawyer · 2025-11-11T16:42:43 1762879363

That's an absolutely good tradeoff. There's no longer any need for copyright. Patents should go next. Only trademarks can stay.

delaminator · 2025-11-11T17:33:35 1762882415

> There's no longer any need for copyright

So you assign zero value to the process of creation?

Zero value to the process of production?

So people who write and produce books, shows and films should all do what? Give up their craft?

thedevilslawyer · 2025-11-12T23:47:38 1762991258

Creation isn't special, or constrained in number.

Process of creation itself is gratifying and valuable to those who will pursue it. No reason to additionally reward it.

Lamp lighters had to give up their craft I suppose and made way to a better world.

delaminator · 2025-11-18T09:41:56 1763458916

> Creation isn't special, or constrained in number. > Process of creation itself is gratifying and valuable to those who will pursue it.

spoken like someone who has never made anything in the real world

Holding a boom mic in the air is not gratifying and valuable to anyone who has to do it.

The fruits of your labour are not your labour.

_DeadFred_ · 2025-11-11T17:54:56 1762883696

Bullshit. Read up and understand the history of these things and their benefits to society. There is a reason they were created in the first place. Over a very long time. With lots of thoughts into the tradeoff/benefits to society. That Disney fucked with it does not make the original tradeoff not a benefit to society.

thedevilslawyer · 2025-11-12T23:49:04 1762991344

The fact that you don't actually call out the specific benefit is telling. We're in a world of plenty and don't need copyright to have those benefits for our fellow humans.

saghm · 2025-11-11T16:27:28 1762878448

Without agreeing or disagreeing with your view, I feel like the the issue the issue with that paradigm is inconsistency. If an individual "pirates", they get fines and possible jail time, but if a large enough company does it, they get rewarded by stockholders and at most a slap on the wrist by regulators. If as a society we've decided that the restrictions aren't beneficial, they should be lifted for everyone, not just ignored when convenient for large corporations. As it stands right now, the punishments are scaled inversely to the amount of damage that the one breaking the law actually is capable of doing.

hulitu · 2025-11-14T12:43:27 1763124207

> And it's a question of do we accept breaking law for the possibility to have the greatest technological advancement of the 21st century

You mean like, murder ?

lokar · 2025-11-11T15:52:55 1762876375

The whole industry is based on breaking the law. You don’t get to be Microsoft, Google, Amazon, meta, etc without large amounts of illegality.

And the VC ecosystem and valuations are built around this assumption.

mock-possum · 2025-11-11T15:44:06 1762875846

I don’t read this as “don’t show we broke the law,” I read it as “don’t give the user the false impression that there’s any legal issue with this generated content.”

There’s nothing law breaking about quoting publicly available information. Google isn’t breaking the law when it displays previews of indexed content returned by the search algorithm, and that’s clearly the approach being taken here.

Q6T46nT668w6i3m · 2025-11-11T16:36:04 1762878964

Masked token prediction is reconstruction. It goes far beyond “quoting.”

terminalshort · 2025-11-11T18:35:49 1762886149

This is incorrect. Two judges have now ruled that training on copyrighted data is fair use. https://www.whitecase.com/insight-alert/two-california-distr...

Workaccount2 · 2025-11-11T16:56:37 1762880197

Training on copyright is not illegal. Even in the lawsuit against anthropic it was found to be fair use.

Pirating material is a violation of copyright, which some labs have done, but that has nothing to do with training AI and everything to do with piracy.

_DeadFred_ · 2025-11-11T18:01:03 1762884063

If my for profit/for sale product couldn't exist without inputting copyrighted works into it, then my product is derivative of those works. It's a pretty simple concept. No 'but human brains learn'. Humans aren't a corpo's for profit product.

'Would this product have the same value without the copyrighted works?'

If yes then it's not derivative. If no then it is.

dahart · 2025-11-11T17:20:44 1762881644

There is US precedent for training being deemed not fair use. https://www.dglaw.com/court-rules-ai-training-on-copyrighted...

Why wouldn’t training be illegal? It’s illegal for me to acquire and watch movies or listen to songs without paying for them*. If consuming copyrighted material isn’t fair use, then it doesn’t make sense that AI training would be fair use.

* I hope it’s obvious but I feel compelled to qualify that, of course, I’m talking about downloading (for example torrenting) media, and not about borrowing from the library or being gifted a DVD, CD, book or whatever, and not listening/watching one time with friends. People have been successfully prosecuted for consuming copyrighted material, and that’s what I’m referring to.

terminalshort · 2025-11-11T18:40:54 1762886454

That interpretation is not correct. The owner explicitly denied license to the data and then the company went to a third party to gain access to the data that they were denied license to.

> When building its tool, Ross sought to license Westlaw’s content as training data for its AI search engine. As the two are competitors, Thomson Reuters refused. Instead, Ross hired a third party, LegalEase, to provide training data in the form of “Bulk Memos,” which were created using Westlaw headnotes. Thomson Reuters’s suit followed, alleging that Ross had infringed upon its copyrighted Westlaw headnotes by using them to train the AI tool.

dahart · 2025-11-11T19:51:16 1762890676

You’re contradicting the conclusion / interpretation written on dglaw.com? What is incorrect, exactly? It doesn’t seem like your summary challenges either my comment or the article I linked to, it’s not clear what you’re arguing. The court did find in this case that the use of the unlicensed data used for AI training was not fair use.

Workaccount2 · 2025-11-12T15:28:34 1762961314

The case isn't on LLMs or transformers, it's on using some other form of non generative AI to create an index of case law. The details are light, but I would guess that the "AI" was just copying over the data from Thomson Reuters.

boredhedgehog · 2025-11-11T17:52:45 1762883565

> Training on copyright is not illegal.

The court decision this thread is about holds that it is, on the grounds that the training data was copied to the LLM's memory.

hulitu · 2025-11-14T12:41:50 1763124110

You can always vote, but there is always someone going through the back door paying politicians and judges.

blibble · 2025-11-11T16:27:08 1762878428

and training on mountains of open source code with no attribution is exactly the same

the code models should also be banned, and all output they've generated subject to copyright infringement lawsuits

the sloppers (OpenAI, etc) may get away with it in the US, but the developed world has far more stringent copyright laws

and the countries that have massive industries based on copyright aren't about to let them evaporate for the benefit of a handful of US tech-bros

terminalshort · 2025-11-11T18:42:45 1762886565

No thank you. I am perfectly fine with AI training on my open source code and it is perfectly legal because my open source code does not include a license that bans AI training.

blibble · 2025-11-11T18:47:50 1762886870

which license is that then?

because other than public domain they all require at least displaying the license, which "AI" ignores

qustrolabe · 2025-11-11T18:52:47 1762887167

post trained models strongly inclined to pass response similar to what got them high RL score, it's slightly wrong to keep thinking of LLMs as just next token predictions from dataset's probability distribution like it's some Markov Chain