I will again ask the obligatory question: are model weights even copyrightable? ...

parl_match · 2024-11-05T20:59:44 1730840384

I doubt there will be a satisfactory answer for a long time.

killjoywashere · 2024-11-05T22:37:17 1730846237

How's that NYTimes vs OpenAI lawsuit going? Last I can find is things are hung up in discovery: OpenAI has requested potentially a century of NYTimes reporters' notes.

https://news.bloomberglaw.com/ip-law/openais-aggressive-cour...

bdowling · 2024-11-05T22:55:29 1730847329

Half a century worth of reporters’ notes might be some valuable training data.

neilv · 2024-11-06T00:19:55 1730852395

> The AI company asked Judge Sidney H. Stein of the US District Court for the Southern District of New York to step in and compel the Times to produce reporters’ notes, interview memos, and other materials for each of the roughly 10 million contested articles the publication alleges were illegally plugged into the company’s AI models. OpenAI said it needs the material to suss out the copyrightability of the articles. The Times quickly fired back, calling the request absurd.

Can any lawyer on here defend OpenAI's request? Or is the article not characterizing it well in the quote?

warkdarrior · 2024-11-05T21:16:16 1730841376

(IANAL)

Model weights could be treated the same way phone books, encyclopedias, and other collections of data are treated. The copyright is over the collection itself, even if the individual items are not copyrightable.

TMWNN · 2024-11-05T21:18:12 1730841492

>phone books, encyclopedias, and other collections of data are treated

Encyclopedias are copyrightable. Phone books are not.

skissane · 2024-11-05T22:14:12 1730844852

> Encyclopedias are copyrightable. Phone books are not.

It depends on the jurisdiction. The US Supreme Court ruled that phone books are not copyrightable in the 1991 case Feist Publications, Inc., v. Rural Telephone Service Co. However, that is not the law in the UK, which generally follows the 1900 House of Lords decision Walter v Lane that found that mere "sweat of the brow" is enough to establish copyright – that case upheld a publisher's copyright on a book of speeches by politicians, purely on the grounds of the human effort involved in transcribing them.

Furthermore, under its 1996 Database Directive, the EU introduced the sui generis database right, which is a legally distinct form of intellectual property from copyright, but with many of the same features, protecting mere aggregations of information, including phone directories. The UK has retained this after Brexit. However, EU directives give member states discretion over the precise legal mechanism of their implementation, and the UK used that discretion to make database rights a subset of copyright – so, while in EU law they are a technically distinct type of IP from copyright, under UK law they are an application of copyright. EU law only requires database rights to have a term of 15 years.

Do not be surprised if in the next couple of years the EU comes out with an "AI Model Weights Directive" establishing a "sui generis AI model weights right". And I'm sure US Congress will be interested in following suit. I expect OpenAI / Meta / Google / Microsoft / etc will be lobbying for them to do so.

ronsor · 2024-11-05T21:24:45 1730841885

Encyclopedias may be collections of facts, but the writing is generally creative. Phone books are literally just facts. AI models are literally just facts.

margalabargala · 2024-11-05T22:37:42 1730846262

> AI models are literally just facts.

Are they, or are they collections of probabilities? If they are probabilities, and those probabilities change from model to model, it seems like they might be copyrightable.

If Google, OpenAI, Facebook, and Anthropic each train a model from scratch on an identical training corpus, they would wind up with four different models that had four differing sets of weights, because they digest and process the same input corpus differently.

That indicates to me that they are not a collection of facts.

ronsor · 2024-11-05T23:49:04 1730850544

The AI training algorithms are deterministic given the same dataset, same model architecture, and same set of hyperparameters. The main reasons the models would not be identical are differing random seeds and precision issues. The differences would not be due to any creative decisions.
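
To make the seed point concrete, here is a toy sketch, not how any of the companies named above actually train. The `train` function, its dataset, and its hyperparameters are all made up for illustration: it fits a line with plain SGD, and the seed controls both the initial weights and the order in which samples are drawn. Real training on GPUs adds further sources of nondeterminism (the precision issues mentioned above) that this does not model.

```python
import random


def train(seed, steps=200, lr=0.05):
    """Fit y = w*x + b by SGD. Hypothetical toy trainer for illustration only.

    The seed determines the initial weights and the sample order;
    the dataset, architecture, and hyperparameters are fixed.
    """
    rng = random.Random(seed)
    # Fixed, noiseless dataset generated from y = 3x + 1.
    data = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(steps):
        x, y = rng.choice(data)
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b


same_a = train(seed=0)
same_b = train(seed=0)
other = train(seed=1)

print(same_a == same_b)  # same seed, same inputs: identical weights
print(same_a == other)   # different seed: weights differ
```

Two runs with the same seed give bit-identical weights, and changing only the seed gives different ones, without anyone making a choice that looks creative.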

margalabargala · 2024-11-06T17:00:45 1730912445

Sure, but they don't all use the same algorithm, the same hyperparameters, etc.

At some point, with sufficiently many hyperparameters being chosen, that starts becoming a creative decision. If 5 parameters are available and all are left at the default, then no, that's not creative. If there are ten thousand, and all are individually tweaked to yield what the user wants, is that creative?

Not to mention all of these companies write their own algorithms to do the training which can introduce other small differences.

roywiggins · 2024-11-05T21:33:41 1730842421

What if I train an AI model on exactly one copyrighted work and all it does is spit that work back out?

e.g., if I upload Marvels_Avengers.mkv.onnx and it reliably reproduces the original (after all, it's just a fact that the first byte of the original file is 0xF0, etc.)

bdowling · 2024-11-05T23:00:02 1730847602

A work that is “substantially similar” to a copyrighted work infringes that work, under US law, no matter how it was produced. (Note: Some exceptions apply and you have to read a lot of cases to get an idea of what courts find “substantially similar.”)

HWR_14 · 2024-11-06T03:48:55 1730864935

> no matter how it was produced

IIRC, this is wrong. Independent creation is a valid (but almost impossible to prove) defense in US copyright law.

This example is not an independent creation, but your reasoning seems wrong.

bdowling · 2024-11-08T06:13:59 1731046439

I wrote "some exceptions apply" to try to avoid getting into the weeds, but yes, independent creation is an exception. Other exceptions include out-of-term works, public domain, scènes à faire (e.g., stock characters), fair use (a huge can of worms), etc.

ronsor · 2024-11-05T21:35:49 1730842549

If the sole purpose of your model is to copy a work, then that's copyright infringement.

PeterStuer · 2024-11-06T20:39:28 1730925568

If the sole purpose of your model is to copy a work, then there would be far easier, cheaper and more reliable techniques to achieve that.

Judge the output, not the system.

roywiggins · 2024-11-05T21:38:26 1730842706

Oh, in this case, the model can either reproduce the work exactly, or it can play tic-tac-toe depending on how you prompt it.

ronsor · 2024-11-05T21:41:30 1730842890

We can change "sole purpose" to "primary purpose", and I'd argue something that happens 50% of the time counts as a primary purpose.

PittleyDunkin · 2024-11-06T06:09:37 1730873377

Who gives a damn about copyright when this is clearly profiting off of someone else's work without compensation? Sometimes the law is inadequate and that's ok—the law just needs to change.