
Radio stations get to use anyone’s music, but they still need to pay to play it. Requiring payment to use your product isn’t rent-seeking any more than requiring a hobo to leave your house is.

AI companies trying to leverage their power and lobby governments so they can avoid paying people, and thus increase profits, is rent-seeking behavior. They aren’t creating wealth through non-payment, just trying to enrich themselves.


Hmm? Creating new models is clearly adding wealth to the world, and it wouldn't terribly surprise me if a lot of source material (e.g. scanned books or recorded music) is older than the people working on models. The history of copyright is basically a perfect example of rent-seeking.

Creating new models doesn’t require taking content without any compensation.

That’s the basic flaw in any argument around necessity.


No, but society has no reason to grant monopolies on 50-year-old publications (e.g. textbooks or news articles written, or songs recorded, prior to 1975), and the changes made to copyright law to extend it across multiple generations were actual rent-seeking, i.e. manipulating public policy to transfer wealth from others to yourself rather than creating wealth. Going with the original 28-year compromise, from a time when publishing was much more expensive, anything prior to 1997 would be free for anyone to use for any purpose with no expectation of compensation. We'd all be far richer culturally if we had this base to build upon freely instead of having the last 100 years stolen.

Likewise, much of the most important information to train on (research literature) was just straight-up stolen from the public that already paid for its creation.

By contrast, the models being created from these works are obviously useful to people today. They are clearly a form of new wealth generation. The open-weights models are even an equitable way of doing so, and are competitive with the top proprietary models. Saying the model creators need to pay the people monopolizing generations-old work is the rent-seeking behavior.


It’s far more recent works that AI companies care about. They can’t copy 50-year-old Python, JavaScript, etc. code because it simply doesn’t exist. There’s some 50-year-old C code, but it’s no longer idiomatic, and so it goes.

The utility of older works drops off as science marches on and culture changes. The real secret of long copyright terms is that they just don’t matter much. Steamboat Willie entered the public domain and, for all practical purposes, nothing changed. Chip 20 years off current copyright terms and it starts to matter more, but still isn’t particularly important. Sure, drop it down to say 5 years and that’s meaningful, but now it’s much harder to be an author, which means fewer books worth reading.


I still don't really get how compensation is supposed to work just based on the math. Models are trained on billions of works and have a lifetime of around a year; AI companies (e.g. Anthropic) have revenue in the low billions of dollars a year.

Even if you took all of that -- leave nothing for salaries, hardware, utilities, to say nothing of profit -- and applied it to the works in the training data, it would be approximately $1 each.

What is that good for? It would have a massive administrative cost and the authors would still get effectively nothing.
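To make that back-of-envelope concrete, here's a tiny sketch in Python; the revenue and corpus-size figures are stand-in assumptions, not reported numbers:

    # Back-of-envelope: spread all annual revenue across the training corpus.
    # Both inputs are illustrative assumptions, not reported figures.
    annual_revenue_usd = 3e9   # assume ~$3B/year for one AI company
    works_in_corpus = 3e9      # assume billions of works in the training data

    per_work = annual_revenue_usd / works_in_corpus
    print(f"Payout per work if 100% of revenue were paid out: ${per_work:.2f}")
    # -> about $1, before salaries, hardware, or administrative overhead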


I think you’re overestimating the number of authors, and forgetting there are several AI companies. A revenue-sharing agreement with 10% going to creators isn’t unrealistic.

Google’s revenue was 300 billion with 100 billion in profit last year. The AI industry may never reach that size, but $1/person on the planet is only 8 billion dollars; drop that to the ~70% of people who are online and you’re down to 5.6 billion.

That’s assuming you’re counting books and individual Facebook posts in any language equally. More realistically, there are only ~12k professional journalists in the US, but they create a disproportionate amount of value for AI companies.
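As a rough sanity check on those numbers, here’s a sketch using the same assumed figures (not measured data):

    # Sizing the two figures above: $1 per online person vs. a 10% revenue share.
    # All inputs are the assumptions from this comment, not measured data.
    world_population = 8e9
    online_fraction = 0.70
    dollar_per_online_person = world_population * online_fraction * 1.0
    print(f"$1 per online person costs: ${dollar_per_online_person / 1e9:.1f}B")  # ~5.6B

    google_scale_revenue = 300e9   # a Google-sized industry, for comparison
    creator_share = 0.10           # hypothetical 10% revenue-sharing agreement
    print(f"10% revenue share at that scale: ${google_scale_revenue * creator_share / 1e9:.0f}B")  # 30B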


> Google’s revenue was 300 billion with 100 billion in profit last year. The AI industry may never reach that size, but $1/person on the planet is only 8 billion dollars; drop that to the ~70% of people who are online and you’re down to 5.6 billion.

Google is a huge conglomerate and a poor choice for making estimates, because the bulk of their revenue comes from "advertising" with no obvious way to distinguish what proportion of that ad revenue is attributable to AI. For example, what proportion of search ad revenue is attributable to being the same company that runs the ad network, or to being the default search in Android, iOS and Chrome? Nowhere near all of it, or even most of it, is from AI.

"Counting books and individual Facebook posts in any language equally" is kind of the issue. The links from the AI summary things are disproportionately not to the New York Times, they're more often to Reddit and YouTube and community forums on the site of the company whose product you're asking about and Stack Overflow and Wikipedia and random personal blogs and so on.

Whereas you might have written an entire book, and that book is very useful and valuable to human readers who want to know about its subject matter, but unless that subject matter is something the general population frequently wants to know about, its value in this context is less than some random Facebook post that provides the answer to a question a lot of people have.

And then the only way anybody is getting a significant amount of money is if it's plundering the little guy. Large incumbent media companies with lawyers get a disproportionate take because they're usurping the share of YouTube creators and Substack authors and forum posters who provided more in aggregate value but get squat. And I don't see any legitimacy in having it be Comcast and the Murdoch family who take the little guy's share at the cost of significant overhead and making it harder for smaller AI companies to compete with the bigger ones.


> Google is a huge conglomerate

The point of comparison was simply a large company here; the current size of, say, OpenAI, when the technology is still fairly shitty, is a poor benchmark for where the industry is going. LLMs may even get superseded by something else, but whatever form AI takes, training it is going to require work from other people outside the company in question.

Attribution is solvable at both a technical and a legal level. There’s a reasonable argument that a romance novelist isn’t contributing much value, but that’s not an argument that nobody should be getting anything. Presumably the best solution for finding value is to let the open market decide through rough negotiations.


> LLMs may even get superseded by something else, but whatever form AI takes, training it is going to require work from other people outside the company in question.

It's going to require training data, but no incremental work is actually being done; it's being trained on things that were written for an independent purpose and would still have been written whether they were used as training data or not.

If something was actually written for the sole purpose of being training data, it probably wouldn't even be very good for that.

> Attribution is solvable at both a technical and a legal level.

Based on how this stuff works, it's actually really hard. It's a statistical model, so the output generally isn't based on any single thing; it's based, a fraction of a percent each, on thousands of different things, and the models can't even tell you which ones.

When they cite sources, I suspect it's not even the model choosing the sources from training data; it's a search engine providing the sources as context. Run a local LLM and see what proportion of the time you can get it to generate a URL with a path you can actually load.
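A minimal way to spot-check that, assuming a local model served through Ollama on its default port (the model name and prompt below are placeholders):

    # Ask a locally served model for a source URL, then see whether the path resolves.
    # Assumes an Ollama server on localhost:11434; model name and prompt are placeholders.
    import re
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": "Give the exact URL of one article about the history of copyright.",
            "stream": False,
        },
        timeout=120,
    )
    text = resp.json()["response"]

    for url in re.findall(r"https?://\S+", text):
        try:
            status = requests.head(url, allow_redirects=True, timeout=10).status_code
        except requests.RequestException:
            status = None
        print(url, "->", status)  # hallucinated paths typically 404 or fail to resolve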

> Presumably the best solution for finding value is to let the open market decide through rough negotiations.

That's exactly the thing that doesn't work here because of the transaction costs. If you write a blog, are you supposed to negotiate with Google so they can pay you half a french fry for using it as training data? Neither party has any use for that; the cost of performing the negotiations is more than the value of the transaction. But the aggregate value being lost if it can't be used as a result of that is significant, because it's a tiny amount each but multiplied by a billion.

And then what would happen in practice? Google says that in exchange for providing you with video hosting, you agree to let them use anything you upload to YouTube as training data. And then only huge conglomerates can do AI stuff because nobody else is in a position to get millions of people to agree to that term, but still none of the little guys are getting paid.

Restricting everyone but massive conglomerates from doing AI training in order to get them to maybe transfer some money exclusively to some other massive conglomerates is a bad trade-off. It's even a bad trade-off for the media companies, who do not benefit from stamping out competitors to Google and the incumbent social media giants that already have them by the neck in terms of access to user traffic.


Ostensibly copyright is there to increase economic incentives to make things it protects, and like you said, we can massively cut it down without affecting much there. So focusing on economic viability, set it to something like 15 years for code and 20-30 for everything else. Require registration for everything and source escrow for code and digital art to be granted copyright. That would give a wealth of code to train on already even without people who would be fine freely giving it away. There's also government code as a relatively large public domain source for recent material.

Like I said, science has mostly been stolen, and it has no business being copyrighted at all. The output of publicly funded research should immediately be public domain.

Anyway this is beside the point that model creation is wealth creation, and so by definition not rent-seeking. Lobbying for a government granted monopoly (e.g. copyright) is rent-seeking.


Exclusively training on 15-year-old source code would make code generation significantly less useful, as APIs change.

Economic viability and utility for AI training are closely linked. Exclude all written works, including news articles etc., from the last 25 years and your model will know nothing about Facebook etc.

It’s not as bad if you can exclude stuff from copyright and then use that, but your proposal would have obvious gaps like excluding works in progress.


You wouldn't need to exclusively train on 15-year-old source code. What I said would simply grant you free access to all 15-year-old source code, but you can already train on public domain code and likely any FOSS code without issue. Or, if courts do start deciding that models inherit copyright, at most you might have to link a list of all the codebases you trained on, with license info. The nature of the thing is that any code it spits out is already in source form, so the only missing part is the notice.

I suppose we all exist in our own bubbles, but I don't know why anyone would need a model that knows about Facebook etc. In any case, it's not clear that you couldn't train on news articles? AFAIK currently the only legal gray area with training is when e.g. Facebook mass pirated a bunch of textbooks. If you legally acquire the material, fitting a statistical model to it seems unlikely to run afoul of copyright law.

Even without news articles, it would certainly learn something of the existence of Facebook, e.g. we are discussing it here, and as far as I know you're free to use the Hacker News BigQuery dump to your liking. Or in my proposed world, comments would naturally not be copyrighted since no one would bother to register them (and indeed a nominal fee could be charged to really make it pointless to do so). I suppose it is an important point that in addition to registration, we should again require notices, maybe including a registration ID.

Give a post-facto grace period of a couple weeks/months to register a thing for copyright. This would let you cover any work in progress that gets leaked by registering it immediately, causing the leak to become illegal.


>> It’s not as bad if you can exclude stuff from copyright and then use that

Making a copy of a news article etc. to train with is, on the face of it, copyright infringement even before you start training. Doing that for OSS is, on the other hand, fine, but there’s not that much OSS.

I think training itself could reasonably be considered fair use on a case-by-case basis, with training a neural network to just directly reproduce a work being obviously problematic, etc. There’s plenty of ambiguity here.


That’s fine, but if you don’t want content taken without compensation, don’t make it available for free on the Internet. You can’t have it both ways, where it’s free for individuals to read but not for machines to do so. That’s just practically impossible.

The music analogy doesn’t hold. Unlike websites that provide content for free to the public, commercial recording artists don’t make their content available for free on demand to the public. Spotify and radio/TV broadcasters, as well as individuals, don’t get a copy unless they buy one or make arrangements with the publisher or its licensees.

This is why we’re seeing paywalls go up: authors and publishers of textual content are seeing that they need to protect the value of their assets.


LLMs are being trained on published books, a direct equivalent of records. People were able to get one to reproduce over 40% of Harry Potter and the Sorcerer’s Stone word for word. https://arstechnica.com/features/2025/06/study-metas-llama-3...

There’s zero chance that happened without the book being in their training corpus. Worse, there’s significant effort put into obscuring this.


Yes, they are trained on books, but the courts so far are largely in agreement that AI models are neither copies nor derivative works of the source materials. If you’re searching for legal protections against your works being used for model training, copyright law as written today does not appear to give you cover.

Stuff can take a long time to wind its way through the court system. The worst cases have already failed, but many are going strong; here’s a 1.5-billion-dollar win for authors.

https://www.kron4.com/news/technology-ai/anthropic-copyright...


Anthropic and the authors settled over a portion of the case involving the unauthorized copying of works that were used to train the model. Obtaining the works is a step that happens before the training has begun.

“the authors alleged nearly half a million books had been illegally pirated to train AI chatbots...”

Finally, a settlement isn’t a “win” from a legal perspective. It’s money exchanged for dropping the case. In almost every settlement, there’s no admission of guilt or liability.


The entire case settled; the authors aren’t going to appeal when the company can’t hand out much more than the 1.5 billion in question, and the company isn’t allowed to use the works in question going forward.

Before the settlement was made, Judge Alsup found as a matter of law that the training stage constituted fair use.

https://storage.courtlistener.com/recap/gov.uscourts.cand.43...


A judge, yes, but that’s subject to appeal. The point is it never reached that stage and never will.

As an attorney, I'm trying to understand what you're getting at.

His opinion, while interlocutory and not binding precedent, will be cited in future cases. And his wasn't the only one. In Kadrey v. Meta Platforms, Inc., No. 23-cv-03417 (N.D. Cal. June 25, 2025) Judge Chhabria reached the same conclusion. https://storage.courtlistener.com/recap/gov.uscourts.cand.41...

In neither case has an appeal been sought.


I’m saying legally things aren’t clear cut at this point.

If you’ve read Kadrey, the judge says harm from competing with the output of authors would be problematic. That’s quite relevant for software developers suing over code generation, but much harder for novelists to prove. However, the judge came to the opposite conclusion about using pirate websites to download the books in question.

A new AI company that is expecting to face a large number of lawsuits and win some while losing others isn’t in a great position.


The judge didn’t come to an opposite legal conclusion about using pirated works. The judge concluded that the claim of using pirated works was unsubstantiated by the plaintiffs.


