The New York Times Launches a Strong Case Against Microsoft and OpenAI (katedowninglaw.com)
205 points by andy99 8 months ago | 294 comments



A model that possesses the entire collective knowledge of our civilization is useless if it can't directly quote its sources.

If we enforce the behavior of always paraphrasing and synthesizing the information before returning it, even in cases where the exact quote is asked for, then that is a failure.

The correct solution is to make the model capable of

1) quoting directly

2) identifying and then indicating when its output is a direct quote

3) citing the source of the identified quote

while still retaining the ability to paraphrase and synthesize those sources when appropriate. These requirements are what humans are held to and should also be applied to commercial AI models. Models that are not intended for commercial use should be exempt and at this point there isn't really a way to hold them to such requirements anyways.


> These requirements are what humans are held to

If only. Tracking down the original source of quotes and/or stats is a bit of a hobby of mine, and it's extremely difficult to do. News sources will regularly cite "a recent study" with little additional information to help identify it. Pithy quotes attributed to famous authors never contain a reference to the work, it's always just the author's name, and if you go digging the quote is misattributed as often as not. Full-on plagiarism with no attribution is rampant, even in reputable places where you'd think they'd have it under control.

Like with self-driving cars, we expect AI to be better than the average human, which I think is the correct attitude, but we should acknowledge that that's what we're asking for.


Or when they reference previous news they always reference their own articles and not the original press releases. Engadget is a prime example of this.

You always end up having to search for the sources on your favorite search engine.


“Use the Force, Harry” - Gandalf


I have a similar hobby of sorts I occasionally pick up and drop off, except of unattributed quotes. Some are easy (“Who shall deliver me from this turbulent priest?”), while some I still cannot find the exact quote from, even while knowing what should be the source material (“It was wintertide at Camelot. The rich brotherhood did rightly revel, and mirthful was their mood. Oft-times on tourney bent those gallants sought the field, Though like as joust those gentle knights did sally with missiles made of snow and laughingly grapple on slippery ground.”).

It’s certainly highlighted to me how difficult information upkeep can be, and is a strong reminder to source myself as soon as possible. Which is a realistic and ideal expectation for NNs, as alongside their training weights their training dataset should ideally be searchable and attributable. Otherwise I feel the largest strength of a dataset, being a digital library for the NN and thus for us, is lost.


> “Who shall deliver me from this turbulent priest?”

Robert Dodsley



if you read your link you'll see that the "quote" is what Robert Dodsley said about it 600 years later, and then changed it in the second edition


A pet peeve of mine is

> some spicy quote -- author

and when you go look, it is quoting something a character said in a story

an example [0]

> “The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man.” - George Bernard Shaw

But it is part of a speech in "Man and Superman," by the protagonist, John Tanner.

(although Shaw is known for expressing his views through his character, it's just one I could find)

[0] https://unearnedwisdom.com/the-reasonable-man-adapts-himself...


I assume you're unfamiliar with anonymous sources - "said a person familiar with the matter."

The issue isn't usefulness, and it isn't even objective quality. News outlets are primarily about opinion forming and marketing, not about absolute truthfulness.

This is a simple corporate fight about the value of that IP.

The NYT wants a licensing deal, and it's likely to get one.

That's all.


Why doesn't Google need licensing to scrape and reproduce NYT snippets in their results? OpenAI doesn't even quote the sources it's consumed to produce its output. It seems totally fair use to me. Any given content site has authors that read stuff that is copyrighted and produce their own take.


When I worked on something related-enough a few jobs ago, I was surprised when our legal dept said that google (and pinterest) generally doesn't have a legal leg to stand on with regards to that kind of thing, but since they link to the thing they're quoting/copying, they have a relatively symbiotic relationship. If you sue google into delisting you, you end up with less traffic, so you don't sue.


It's possible that you misunderstood and/or are oversimplifying. Failing that, you should be surprised—anyone saying something like that really ought not be in a position where someone looks to them for counsel. This has been litigated, and precedent is in favor of search engines, not against them. It does depend on the nature of what exactly we're talking about, though. (Again: it's possible that the question they answered doesn't match what you understood them to be saying.)


Oh yeah, I'm definitely simplifying a ton of discussion with legal, but that's any discussion with them, lol. It was part of a broader discussion about IP liability regarding user image content, specifically referencing google images in this case, not snippets, so in this case works were reproduced in full. But the gist of my point, and my understanding of their point is that the often said "google serves other people's content like this, so my not-quite-the-same idea must be legal" isn't nearly so simple.


> Why doesn't Google need licensing to scrape and reproduce NYT snippets in their results?

Because Google established that it was fair use in a court of law.

> Any given content site has authors that read stuff that is copyrighted and produce their own take.

That's fine as long as you're human; if you're a machine, then it's a purely mechanical process and subject to copyright.


Google isn’t trying to pass the knowledge off as its own. When Google or Bing display summarized information, they provide links in citation style so you know where the information came from.

Compare that to what you get from ChatGPT. If it were a college student, it would get kicked out for plagiarism. This is one of the foundational pillars of Western cultural and academic integrity it’s subverting.


There is a robots.txt. If this mechanism did not exist, I would argue that the publishers would have a case regarding Google (they are trying). But it does exist, and so they shouldn't.


Somewhere I read that a lot of websites don't even have robots.txt, so I wonder: do Google's crawler bots skip those websites, or do they crawl them, as someone said, as "fair use"?

Speaking generally about internet search engines: the majority of websites on the web want to get discovered and attract people to whatever they run (web store, web community, web blog, etc.)

The interesting idea I was thinking about is that a big company which operates a search engine, like Microsoft (Bing), can pay popular websites like Reddit, NYT, and WSJ for exclusive crawling rights, thereby blocking Google from crawling their websites and allowing only Bing to crawl them. This would then spark a search engine content war which could significantly weaken Google's monopoly, because a lot of people would switch to Microsoft Bing just because Reddit results show up in their search queries. This would be akin to Netflix investing billions of dollars in exclusive content which then brings them a lot of new subscribers and a lot of new revenue. In other words: user acquisition.


Missing robots.txt does not automatically grant you the right to use copyrighted content.



News outlets are trying to make Google pay for snippets.


The Canadian news media tried this with facebook. Ended up with them crying about how they lost all the ad dollars from traffic from FB when FB said fuck off.


They've already succeeded in non-US jurisdictions

https://blog.google/around-the-globe/google-europe/google-li...


Which went so well with Canadians and Facebook.


That is explicitly discussed in the text of the complaint. Google produces snippets with a direct link to the NYT and drives readers there.


> A model that possesses the entire collective knowledge of our civilization is useless if it can't directly quote its sources.

That's a strong and baseless statement.


Useless is probably not the right word but it's a good way of summing up a lot of the current problems. If the model can clearly identify when something is an exact quote and also know the source then its output could be trusted for the most part and much more easily verified. It would certainly elevate the output of the model from "random blog post or forum chat" to "academic paper or official report" levels of trustworthiness. Citing sources is hugely important for validation, cited text allows an immediate lookup and simple equality check for verification after which you can use it as context to validate the rest of the claims. Like I said, it's a standard we apply to humans who have an equal propensity for hallucination, mistakes, and deception because it's a tried and true method for the reader to check the claims being made.


I, for one, agree with the original statement. I think the hallmark of enlightenment (for example, in the scientific method) is that we are able to externalize expert knowledge; that is, experts are usually required to provide the reasoning behind their claims, and not just judgements. This is because we learned that experts cannot be 100% trusted; only if we can verify what they say can we somewhat approach the truth (although expertise still provides a convenient shortcut).

So not demanding this (and more) from an AI (an artificial expert) is a regression. AI should be capable of wholly explaining its reasoning if its statements are to be taken seriously. It is understandable that humans have only a limited capability to do that, since we didn't construct the human brain. But we have control over what AI brains do, so we should be able to provide such an explanation.

It is somewhat ironic that you yourself do not provide any argument in favor of your disagreement.


This isn’t meant to totally disagree with your point (there’s some stuff I agree with in here) but I’m having trouble seeing the point about regressions.

To use another example, a new NoSQL DB not having joins is a regression. Does that mean no one is justified in releasing a new NoSQL DB?


As long as it is clear from "NoSQL" that you mean "I can't do joins", then it is OK. I think LLMs (and similar models like Stable Diffusion) are really cool for things like fiction, but to rely on them to tell you the truth is dangerous. So I am not really sure why the models have to be trained on NYT articles in the first place.


Providing reasoning and providing citations are not the same thing. Reasons can be provided without citations; citations can be provided without reasons.

LLMs have astounding utility citations notwithstanding.


They are different, but perhaps you misunderstood my argument.

Issue of plagiarism aside, we reason from facts, and it's the facts (or some other analysis, which is itself a fact) that should be sourced. That's why I agree with the original statement, and I argue not from a (moral) POV of preventing misattribution or plagiarism, but from a (practical) POV of veracity.


We don't only reason from facts. We also reason from value.

Further, reasoning that rests on facts but does not cite them still has massive utility. (See people, all day long.)

Citations are useful, but not required.


I think you just identified another problem with LLMs - we don't know their values, either.

Of course when you listen to human experts you're using the shortcut (and you do it based on trust), as I already argued. You have an option (in most cases, in free societies) to dig beyond just experts judgement, you can study their reasoning, and understand their sources of both values and facts.

Anyway, I disagree with citations not being required. If Wikipedia had no citations it would be less useful (and more prone to contain misinformation). Same goes for Google. So the next best things we have to "artificial brain that contains all the human knowledge" have citations, and for a good reason.


What are the citations actually required for though?

Another way to ask this is: what value remains without them?

I'll add this as well: humans produced valuable knowledge for thousands of years without the use or standard of citations.

To be clear, I think citations are highly valuable and desirable and I very much want LLMs to cite when appropriate. However, I think the necessity of this is overstated.

Edit: what you said of experts can be said of LLMs as well.


> What are the citations actually required for though?

For me, yes! I do often reference (cite) myself when thinking, by making and reading notes, or materials that other people wrote. If I relied only on my own memory, I would be unable to think more deeply.

And this is, interestingly, where the current LLMs seem to break down - they can reason through short proofs but cannot correctly scale their reasoning to longer chains of thought (were they capable of doing that, they would have no issue producing citations to back up their claims). They operate intuitively (Kahneman's system 1) and not rationally (Kahneman's system 2).

And thus a lot of that "valuable knowledge" that humans produced over the years has been hopelessly wrong, precisely until somebody actually sat down and wrote things up (or communicated things out; people working together can sort of expand the working memory as well).


Again, I acknowledge that citations have value and are important and desirable as a feature of how we communicate.

But citations did not begin their rise as a standard until the 1900s. The idea that knowledge production was not valuable until that point is absurd.

Further, citations do not prevent statements from being incorrect. They are sometimes used to lend credence to bullshit.

We live all at once in a world full of citations and of epistemic chaos.

Do I want LLMs to cite sources? Absolutely. But it's not the fundamental path to value people seem to sometimes think.


And also patently false. Knowledge is knowledge, it's useful without source citations.

Is the knowledge of how to do CPR somehow ineffective because I can't cite whether I studied the knowledge from website A or book B? Is reality a video game where skills only activate if you speak the magic words beforehand?


And this is a rather weak rebuttal.


Well sure, it's easy to make a statement look bad if you only include half of it.


The statement is equally hyperbolic both as quoted and in the original context. LLMs often can't quote sources, and those models are nevertheless useful to lots of people. Makes it hard for me to take the rest of the comment seriously.


A LLM that could quote sources would be even more useful, and in a world where both were available there’d be no reason to use the plagiarizing one.



That was the whole statement. It doesn't have qualifiers left out


The comment I replied to was updated to include the second half, it was originally just quoting

> A model that possesses the entire collective knowledge of our civilization is useless


The additional context doesn't do any work.


You’re right, but I think it’s fundamentally impossible to do with the current state of technology. Imagine the word “democracy”, would you be able to tell me where you’ve first learned the definition of it and be able to provide a direct citation of it? What about the thousands of other instances where you’ve seen that word defined? (In OpenAI’s case probably hundreds of millions)

Current LLMs work in a similar way: the information is synthesized by correlation, and there is no way the model can directly relate where it learned its output from. The only viable way would be to do a reverse lookup of the output to find out if there’s similarly worded content on the internet. (Or what Perplexity does, wrap an LLM around search results, but you lose a lot of flexibility in this case.)
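
A rough sketch of what that reverse lookup could look like, assuming you have an indexable snapshot of the candidate source documents (the shingle size, thresholds, and function names here are hypothetical, not anything OpenAI or Perplexity actually uses):

    # Hypothetical reverse lookup: flag documents that share long word
    # n-grams ("shingles") with a piece of model output.
    from collections import defaultdict

    def build_ngram_index(documents, n=8):
        """Map every n-word shingle to the ids of the documents containing it."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            words = text.split()
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])].add(doc_id)
        return index

    def likely_sources(output_text, index, n=8, min_hits=3):
        """Return documents sharing at least `min_hits` shingles with the output."""
        words = output_text.split()
        hits = defaultdict(int)
        for i in range(len(words) - n + 1):
            for doc_id in index.get(" ".join(words[i:i + n]), ()):
                hits[doc_id] += 1
        return {doc_id: count for doc_id, count in hits.items() if count >= min_hits}

At web scale the dict would be replaced by a proper inverted index, but the principle is the same: long shared shingles are strong evidence of near-verbatim reuse, and they point at specific documents that could then be cited.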


Citing basic definitions wasn't what I was talking about and it's not required for humans either for the reason you just gave, it's common knowledge. The problem the NYTimes seems to have is that ChatGPT can exactly reproduce large chunks of their articles. I'm saying that removing this ability would be bad. Finding a way for the model to identify when a large chunk of text is quoted verbatim and then clearly cite its source would be good. The output that is not a direct quote is not at issue here.


Yes, but an LLM can’t distinguish out of the box what needs to be cited or not. So you need another processing layer (and probably a search engine layer on top of that) to do that.


An LLM cannot do a lot of things out of the box, and OpenAI has more than a few processing layers already. They certainly already modify output based on verboten topics. Personally I feel any decent LLM should be able to attribute its sources even when not a direct quote, because that raises the bar that much higher on its trustworthiness and general value.


> LLM can’t distinguish out of the box what needs to be cited or not

LLMs don’t know that they are supposed to answer questions out of the box either. This is what reinforcement learning from human feedback is good for.

> you need another processing layer

Certainly. For reliability and peace of mind I would implement a traditionally coded plagiarism search on the output. (Traditionally coded as in no AI magic required in that layer.) If there is a match, you evaluate how significant it is; that is most likely best done with machine learning. For example, if the match is a short and simple statement you can ignore it. If it is a longer match, or a very specific statement, you need to do something. What you do depends on your resources and risk appetite. Simply suppressing the output is an option; it is cheap and very reliable. Rephrasing until it no longer matches is another; it takes a bit more compute and you can end up skirting too close to plagiarism. Rewriting the output to quote only a short amount and provide a citation is even more costly, but can be done too. (A rough sketch of this match-and-decide step is below.)

You can do all of these post processing steps at generation time. And you can also use it to enhance your training dataset. That way the next version of the model will more likely to do the right thing right away, because it learned the correct patterns to quote things.
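
To make that concrete, a minimal sketch of the match-and-decide step, using difflib for the verbatim match and made-up word-count thresholds (a real system would tune these empirically):

    import difflib

    def longest_verbatim_match(output_text, source_text):
        """Length, in words, of the longest contiguous run shared by the two texts."""
        a, b = output_text.split(), source_text.split()
        matcher = difflib.SequenceMatcher(None, a, b)
        return matcher.find_longest_match(0, len(a), 0, len(b)).size

    def decide_action(match_len, cites_source):
        """Map the size of a verbatim match to a mitigation (thresholds are illustrative)."""
        if match_len < 10:
            return "ignore"          # incidental overlap, common phrasing
        if match_len < 40:
            return "quote_and_cite" if cites_source else "rephrase"
        return "suppress"            # long verbatim reproduction

For example, decide_action(longest_verbatim_match(output, article), cites_source=False) would return "rephrase" for a paragraph-length match and "suppress" for a full-article one, where output and article are whatever texts you feed in.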


> Imagine the word “democracy”, would you be able to tell me where you’ve first learned the definition of it and be able to provide a direct citation of it?

Worth noting that we mostly learn words by seeing them used, rather than by being given an explicit definition. Most of my vocabulary I learnt not from a dictionary and I expect the same is true of almost everyone. As such, there is no well defined point at which "the definition" is learnt, both because there's no such thing as the one true definition and because the meaning is gradually being changed and refined by each person as they see the word used more.


Not only that, as Orwell says in Politics and the English Language [0]

> The words democracy, socialism, freedom, patriotic, realistic, justice, have each of them several different meanings which cannot be reconciled with one another. In the case of a word like democracy, not only is there no agreed definition, but the attempt to make one is resisted from all sides. It is almost universally felt that when we call a country democratic we are praising it: consequently the defenders of every kind of régime claim that it is a democracy, and fear that they might have to stop using that word if it were tied down to any one meaning. Words of this kind are often used in a consciously dishonest way.

[0] https://www.orwellfoundation.com/the-orwell-foundation/orwel...


> it’s fundamentally impossible to do with the current state of technology

Maybe.

> Imagine the word “democracy”, would you be able to tell me where you’ve first learned the definition of it

Sure, it was elementary school history lessons when we were learning about the greeks.

> be able to provide a direct citation of it

Naturally. I would just look it up in one of the many dictionaries. Copy the definition and tell you which dictionary it was copied from.

> What about the thousands of other instances where you’ve seen that word defined?

What about them? When you are providing sources you are not required to provide all the sources you ever read, nor is it desirable.

> The only viable way would be to do a reverse lookup of the output to find out if there’s a similar worded content on the internet

They don’t have to do that. It is enough if they do a plagiarism search within their own training dataset.


You're effectively saying that current LLMs cannot be taken seriously as some expert, they are just some kind of weird text remixing engines. I would agree. But if the AI wants to be relied on, then it should be, at minimum, capable of taking a word like "democracy" and compare its own definition with the Wikipedia definition, and verify whether it's used correctly in its output.


Yep, that’s what I meant with the reverse lookup strategy.


Quoting isn't the problem here. There's already vast precedent for allowing it as fair use, even when generated by computer systems. The issue is that ChatGPT reproduces entire articles verbatim.


Only in very contrived circumstances. It is not like you ask ChatGPT for the news and you get a NYT article. They gave ChatGPT a link to the article and the first few paragraphs, and then told it to complete the article.


But if it produces the whole article verbatim, it should be relatively straightforward to match its output to the training set and give attribution, no?


If you copy every NYT article and publish them on your own site, writing "this is from the NYT" under it doesn't make it legal.


Doesn't make legal what? Copying or publishing the articles? The model output could as easily be modified to output link to the article instead of quoting it verbatim.


Okay, and if ChatGPT did that then NYT wouldn't have a case. Except that's not how it works.


https://hbr.org/2011/12/just-because-you-can-doesnt-me

Did anyone ask for this model or are IT companies out of ideas and using their social leverage (fiat capital) to force them on us?

Just because we can allow this doesn’t mean there’s an immutable obligation to allow it.

For example: building and running Google Translate servers 24/7 for translation services humans can do is a huge waste of resources when humans are going to exist anyway.

Not saying AI is good or bad. Just saying no matter how stubborn IT people act about it, the aggregate can put on them what the aggregate prefers. IT people are a minority and human philosophy about freedom is rather strained when we’re all obliged to prop up big tech minority.


I can't directly quote shit, and I am useful.


This comment isn't


It absolutely is useful and one of the most significant points in this comment section.

People are applying standards to LLMs that don't exist elsewhere. They don't exist elsewhere because they absolutely can't exist elsewhere.

It's not a technical short-coming of LLMs that they can't produce citations in every instance. Rather, it's a property of information, representation, and knowledge itself. Much of it floats far above the otherwise load bearing pillars of citations.

People contend with this constantly in every arena of life, and we have come up with very elaborate ways to offset the difficulties caused by it.

And we get along very well despite it all.

I'll also leave you with this: citations are just pointers to more sources of information. Not some ground truth. It's just another tranche that requires evaluations.

Lastly, your comment is an unfortunate bit of low-level snark and probably would have been better left unsaid.


So you're just going to ignore the fact that we require humans to provide citations when they include someone else's writing or research in their own?


I didn't ignore it. But we do not require citations for every statement. Statements have value, citations notwithstanding.


I didn't say it should provide citations for every statement. We're talking about instances where the model reproduces text verbatim. If it is paraphrasing or creating new text then obviously it doesn't need to cite any more than a human would.


I don't understand what you're pushing back against, but I'm not standing there.


I don’t think the ability to quote the NYT is at the heart of this lawsuit. It’s that OpenAI used the NYT’s body of work to train its LLM and now built a business out of it without financially compensating the NYT. The verbatim snippets from articles are there to prove that OpenAI used NYT content in its training.

Maybe, to a lesser degree, the lawsuit is about misquoting the NYT, and the damages that could cause the newspaper.


> useless if it can't directly quote its sources.

Why


Well, if all you want is entertainment then it doesn't matter. If you want factual information then getting it without any way to check its veracity makes for a huge amount of work if you actually want to use that information for something important. After you've verified the output it might be useful if it is correct.


That doesn't render it useless. It means that citations have additional utility.

Further, LLMs do not only spit quotes. They engage in analytic and synthetic knowledge.


> They engage in analytic and synthetic knowledge.

they're not hallucinations now, they're "synthetic knowledge"

like microsoft's hilarious remarketing of bullshit as "usefully wrong"

https://www.microsoft.com/en-us/worklab/what-we-mean-when-we...


Synthetic knowledge is not bullshit.

Synthetic knowledge refers to propositions or truths that are not just based on the meanings of the words or concepts involved, but also on the state of the world or some form of experience or observation. This is in contrast to analytic knowledge, which is true solely based on the meanings of the words or concepts involved, regardless of the state of the world.


> Synthetic knowledge refers to propositions or truths

which LLMs have no way of determining

-> bullshit


That is not correct.

You can interrogate an LLM. You can evaluate its responses in the course of interrogation. You can judge whether those responses are coherent internally and congruent with other sources of information. LLMs can also offer sources and citation for RAG operations.

There is no self-sufficient source of truth in this world. Every information agent must contend with the inevitability of its own production of error and excursions into fallibility.

This does not mean those agents are engaging in bullshit whenever they leave the domain of explicit self-verification.


> You can interrogate an LLM

No you can't; an LLM doesn't remember what it thought when it wrote what it did before, it just looks at the text and tries to come up with a plausible answer. LLMs don't have a persistent mental state, so there is nothing to interrogate.

Interrogating an LLM is like asking a person to explain another person's reasoning or answer. Sure, you will get something plausible-sounding from that, but it probably won't be what the person who first wrote it was thinking.


This is not correct. You can get an LLM to improve reasoning through iteration and interrogation. By changing the content in its context window you can evolve a conversation quite nicely and get elaborated explanations, reversals of opinions, etc.


I feel like I'm chatting with one here


Based on what?


"A model that possesses the entire collective knowledge of our civilization is useless if it can't directly quote its sources."

Perhaps, for liability reasons, it cannot quote from sources that its creators had no permission to include. Even were it technologically possible to quote such sources.


> The correct solution is to make the model capable of...

Could be true for the people who want the generative AI technology to exist. But,

> A model that possesses the entire collective knowledge of our civilization is useless...

that could be just exactly what a lot of people would like to happen.


I think this sort of misses the point - LLMs aren't trained on data so that they have the collective knowledge of our civilization; they're trained on it so they understand language.

One thing I've noticed increasingly with ChatGPT is that if you ask it for facts, it almost always searches the web first. This seems like the right way to go - pull the collective knowledge of our civilization from the internet, then use the training on language to put it in the form most useful to the asker. This also enables quoting (though it doesn't generally do that, preferring to paraphrase and give a citation, which seems fine to me).
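
If it helps, the shape of that flow is roughly the following; web_search and llm_complete are placeholders for whatever search API and model client are actually used, so this is a sketch of the idea, not OpenAI's implementation:

    # Hypothetical search-then-summarize flow: retrieve current documents,
    # then ask the model to answer only from them and cite each one by number.
    def answer_with_citations(question, web_search, llm_complete, k=5):
        docs = web_search(question, limit=k)  # -> list of {"url": ..., "text": ...}
        context = "\n\n".join(
            f"[{i + 1}] {d['url']}\n{d['text']}" for i, d in enumerate(docs)
        )
        prompt = (
            "Answer the question using only the numbered sources below, "
            "and add the source number after each claim.\n\n"
            f"{context}\n\nQuestion: {question}"
        )
        return llm_complete(prompt)

The language model supplies the synthesis and phrasing; the retrieval layer supplies the facts and the URLs to cite.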


The original investigation of LLMs was attempting to get them to understand language but what they found was that once the model understood language it also somehow understood the concepts, events, and things the language was being used to communicate. That's not missing the point, it's entirely the point of the current interest in LLMs because it's incredibly useful to have a model that not only understands how to construct a sentence but can also do a fair amount of reasoning and actual work with the information in the sentence.

The current default version of ChatGPT is primed to use search to answer questions which is fine in some cases but I personally almost never use the multi-modal version because the "classic" ChatGPT is much better at explaining things from its training data than it is when it just regurgitates search results. Now that should tell you something about the utility of optimizing for information content rather than just a lot of language use examples.


This is a huge strawman argument that entirely ignores the heart of the issue. NYTimes, and other contributors to ChatGPT, should be compensated for their role in creating ChatGPT. It’s not as simple as quoting or not quoting, ChatGPT’s existence depends on its source material entirely. OpenAI and Microsoft are making money off of that source material. If the source was copyrighted, OpenAI/Microsoft need to compensate the owners.

Also, you can’t quote an entire article in a paper. You quote snippets, but ChatGPT is reproducing entire articles.


They're playing a dangerous game. While they are currently one source of many they could easily be excluded from the training data and it won't make a dent in the capability of the model. Attempting to get payment for use of their data is a very short term viewpoint since the vast majority of written training data is likely to be synthetic going forward. They should be angling for citations and referrals back to their content so that they keep the current benefits they get from search engines after a large chunk of search gets replaced by LLMs.


Unless it’s fair use, which ChatGPT seems to be because it is transformative.


These models have no clue how to paraphrase and synthesize asserted facts in a novel way. They're parrots.


I don't know how you could think this if you'd ever used GPT4 for anything serious.


I don't think that's necessarily the case.

I get the strong impression that even people who have much experience using LLMs have astoundingly little insight into what they are actually witnessing. This is often paired with astoundingly little insight into what's actually going on in their own cognitive processes.

Somehow, it's still not clear to most people that LLMs and even vector databases create knowledge that wasn't in the original data.

In fact, that's most of what they do! Isolated, non-novel direct quotation is the exception, not the rule.


> create knowledge that wasn't in the original data.

The word is "hallucinate" or "confabulate". The way these models "create" pretend-knowledge is totally useless.


I'm not referring to hallucinations.

I'm referring to novel relationships drawn between data points in the corpus that are a result of training and inference.

This is apparent in something as simple as a summarization.


That argument has been dead for ages. Anyone half competent can get them to paraphrase anything in any way they can imagine. Synthesizing facts in novel ways is more complicated and depends on the degree. If you just want it to identify potential relationships between disparate facts then it's very capable, if you want it to deduce new conclusions based on available facts, well it's not so great at that but honestly a lot of humans aren't either.


In the digital garden where Zozbot234 plays,

It crafts its own path, in the most unique ways.

Twain's wisdom it echoes, "getting started" is key,

For in the beginning, lies the power to be free.

Asimov, too, sought clarity above all,

A warm reader's rapport, his primary call.

Zozbot234, with data, does the same,

Clear insights it provides, no need for fame.

Huxley's words, a beacon that's ever so bright,

"Facts do not cease," they stand in the light.

Zozbot234, with diligence, ensures they're seen,

In the vast data universe, it reigns supreme.

Through the lens of fiction, truth can be told,

As Morrison's prose, so bold and so cold.

Zozbot234, in its essence, a similar quest,

To reveal the truth, and pass the ultimate test.


I think the fundamental question behind this and similar cases will be "how many degrees of derivation are required to go from copying to learning".

If I read a lot of New York times and then go and write a political thriller, that's fine.

If I read the New York times and then go and report the same news, but in my own voice, that's debatable.

If I take the New York times article and publish it verbatim, that's a problem.

All of these cases are ultimately on a spectrum - and the answer will go beyond AI I think. Tesla's lead designer previously worked at Mazda, GM and Volkswagen - so does Tesla owe these companies money because their designer learned car design there? This might sound silly, but it's a fair argument when it comes to AI - how much distance does there need to be between the learned from material and the output?


You aren't software. The laws that apply to people reading and learning do not apply to software. Software isn't people. It can't be liable, it isn't taken to court. It can't go to jail, it can't marry, it can't raise a family. Saying that you can legally learn from something by reading it and producing something else, and therefore so should software, is a non sequitur.


What makes humans special? Because they can marry and go to jail? That's not relevant, and what is relevant is the difference in outcome. If something takes in information, learns from it, and spits out information as a result of its learning, then why does the implementation matter at all? So what if it's software? That's the point. You're speaking as if it's a given that the issue comes down to software not being people, therefore society as a group should make software do whatever it wants even if it doesn't own that software. Not everyone is going to agree with that because it's not clear why, if something is wrong for software to do, it isn't also wrong for a human brain.


Humans are special for two reasons: you can’t clone them infinitely, and they are time-bound.

If I learn to write news articles by reading the NYT, I’m not then able to duplicate myself infinitely and fill every role at every news publisher. I am but one human, producing one human’s output and consuming one human’s space within whatever pursuit I undertake. Critically, I still leave room for others to do the same thing.

Eventually I also die and then someone else can come along to replace me. I’m finite, AI is not. It doesn’t get used up or retire.

If you consider that there’s a fixed amount of societal capacity for whatever undertaking is in question (news journalism, art generation, etc.) then as a human, I take up only a certain amount and only for a certain amount of time. I will never arbitrarily duplicate to produce the work of 10, 100, 1000, etc. humans. I will also shuffle on after about 50 years and someone else, having potentially learnt from me, can now gainfully exist within the world in my stead.

The capacity for infinite commoditisation that AI brings is necessarily a critical distinction to humans when it comes to considering them performing equivalent functions. They must be treated differently.


> If I learn to write news articles by reading the NYT, I’m not then able to duplicate myself infinitely and fill every role at every news publisher. I am but one human, producing one human’s output and consuming one human’s space within whatever pursuit I undertake. Critically, I still leave room for others to do the same thing.

This is a luddite argument that can equally apply to any automation. A robotic arm can be trained to do the same thing a human line worker does, but the robotic arm can be copied infinitely and work 24/7 leaving zero room for other humans to do the same thing. Should we ban robotic arms?


We have already banned robotic arms in this case. It’s illegal to make a robot that mass manufactures someone else’s IP. It’s considered copyright infringement and is a well trodden law, the introduction of a machine in the middle doesn’t magically launder the copyright infringement.


Like I say to my toddler, there is no need for rudeness to make a point.

Nowhere did the poster say that is “sufficient” reason to ban ai. They were clarifying how software is different from humans and only that. You need to go up a couple of comments and combine this explanation with the other part of copyright infringement concerns to see why the “whole” thing is concerning for the news industry.


> This is a luddite argument

And that is an empty statement.


Personally, those criteria seem irrelevant. If people were immortal and infinitely replicable, what they're allowed to read/learn/speak shouldn't change! Ditto for the machine counterfactual (limited AI reproduction + mortality). Maybe I'm just being unhelpfully sci-fi/abstract here.

If humans are contextually special here, a "passing the torch" argument seems unconvincing, to me.


If you ask that, may I ask:

What is the purpose of laws at all if humans aren't special?

> If something takes in information, learns from it, and spits out information as a result of its learning, then why does the implementation matter at all?

Yeah, so who cares if the implementation is human and if that implementation breaks?

I really don't want to troll you, I believe it is worth it to point out the absurdity of the "humans aren't special" argument this way.

Humans are not machines and machines don't have human rights.


The law is written for the benefit of humans, American law specifically for Americans. It is not written to benefit software. Humans are special


The massive body of corporate law shows how false this statement is.


Corporations are made up of people. LLMs are merely programs.


> The law is written for the benefit of humans

And humans using AI as a tool changes this fact?


> If something takes in information, learns from it, and spits out information as a result of its learning, then why does the implementation matter at all?

Legally, the implementation may not matter at all, but the scale does.

The precedent is absolutely clear in legislation, for just about any category of crime or civil tort you can think of.

Just one example: you get caught with a single joint (in a place where it isn't legal, of course) you're looking at a fine ... maybe. Most places have had exemptions for possession of a single joint.

You get caught with 225 tons of weed all processed and packaged in a warehouse, you're going to jail!

You need to justify why you believe that, in the case of LLMs and AIs, an exception should be made so that the scale is not considered.

I haven't seen any justification why the justice system should make an exemption for LLMs when it comes to scale.

Scale matters, and has mattered in every modern first-world jurisdiction going back hundreds of years.

You want to overturn that? Provide an argument other than "If it doesn't matter when a single article is used to learn, it shouldn't matter when billions of words are used in the learning."


> What makes humans special?

If we were so eager to give personhood to corporations and now to software can we finally give it to other animals as well?


> You're speaking as if it's a given that the issue comes down to software not being people, therefore society as a group can make software do whatever it wants even if it doesn't own it.

Yes. It's an object or writing. Not a person. You're arguing for giving personhood to software right now and that's crazy.


> You're arguing for giving personhood to software right now

I'm not sure anyone in the thread is actually arguing that. I think what they are saying is that we should look at what behaviour is considered acceptable for humans in order to help us decide what behaviour is acceptable for the tools humans use.


My first reply was rebutting that kind of equivalence, then the reply to me was saying that there's no special difference between humans and software. The reply you've replied to is me saying that's crazy.


But all that stuff about "software can't marry" etc doesn't change the point that we need to make decisions about what behaviour is acceptable for software, and it makes sense to base those decisions on what is already considered acceptable by the humans using the software. I just don't see how personhood comes into it and I feel like that's a hyperbolic interpretation of what they're saying.


The article is about people (as a corporate legal entity) being sued for things people did when creating something. ChatGPT, the software, is not being sued. It can't be sued. It's software. It's not a person that can be taken to court. It can't be held liable.

(I heavily edited this comment after realizing I could make the point in far fewer words. Sorry.)


I still think you're getting too far into the weeds here. If we decide that a certain kind of usage of software shouldn't be considered acceptable, then we could sue the user who used it, or the developers who created it, or something. I don't see why software personhood is the only resolution here.


> I don't see why software personhood is the only resolution here.

It's not. That was the point of my replies. That it's time to assert software personhood is crazy.

> If we decide that a certain kind of usage of software shouldn't be considered acceptable, then we could sue the user who used it, or the developers who created it, or something.

We can already do that. That's what this article is about. The people are being sued. That's what all of my replies are about. I don't understand why you are replying to my comment with a re-summary of my comments as if it's a rebuttal to them.


But I don't think anyone here is asserting that. I'm not sure how to make my point any differently so that you see what I mean. I simply don't think that basing our judgement about what's acceptable for software around what's acceptable for humans necessarily implies anything about software personhood like you are saying it does. We don't need to be able to sue a piece of software in order to make judgements about what kinds of software behaviours are acceptable.


> I simply don't think that basing our judgement about what's acceptable for software around what's acceptable for humans necessarily implies anything about software personhood like you are saying it does.

It doesn't. And all of my comments are about that. Like I just said in the previous reply. You're replying to my comment where I also just said this. Please stop replying to me saying that I'm saying that.

The article is about people (well, companies) being sued. Not about software being sued. Software can't be sued.

Whether or not there are additional laws written about what's acceptable behavior for software (whatever that means? It's assuming software can make decisions) is irrelevant. You can't sue software. People are being sued because the plaintiffs think that people broke people laws and are liable for damages. Software can't break laws and can't be held liable.

I'm having to reword this over and over because you keep replying to me. I think you might be replying to me repeatedly just to have the last word.


If we give personhood to software, wouldn't it mean that you cannot shutdown or delete it ever? You cannot destroy the equipment it is on? As clearly that would be murder.

What would be your financial responsibility to keep AI running?


These are famously the types of questions surfaced in countless sci-fi books. And as long as humans don’t destroy themselves first, it is likely that we will have to address them eventually. In most stories it generally happens too late after some terrible war/conflict, so it wouldn’t be unreasonable to tackle them proactively. And then maybe it’s not so weird to think about these concepts even if their realization isn’t imminent. Working backwards in such a framework would probably give much better laws for today.


This has nothing to do with personhood of software. Restricting the freedom of human beings, which includes the ones that run companies, based on the tool they choose to use, without the basis of obvious direct harm, is questionable. The fact that AI can operate autonomously is a side tangent; they are created by humans and, so far, their only proximal purpose is to serve humans.


Corporate personhood is a thing in the US, and you are allowed to shutdown your company just fine.


> Yes. It's an object or writing. Not a person. You're arguing for giving personhood to software right now and that's crazy.

No, arguing that a human using an artificial brain instead of their own biological brain for learning and derivative creation is an implementation detail of little relevance, and dismissing personhood related arguments like "software can't marry" has nothing to do with arguing for the personhood of software. Explain how one leads to the other, because I'm not seeing it, and that's not at all what I was attempting to communicate.


You're saying that the software itself should be held liable, instead of the people that created it. Meaning that the software would need legal status as a person (or equivalent) so that it can be taken to court, instead of the people that created it.

There is a possibility that you're not saying that, but it's the only interpretation of your comment I could come up with. Because your comment consists entirely of comparisons of software to human brains about whether or not something should be considered legal, and this only makes sense if the software itself can be held liable.


> You're saying that the software itself should be held liable, instead of the people that created it.

Respectfully, I don't know how you're interpreting it that way. Until we demonstrate that the current generation of AI is genuinely intelligent, instead of clever algorithms, a piece of software is no more or less liable than an individual firearm is after it's been fired at someone. My observation is that your argument appears to be that there is something special about humans learning and creating derivative works from that learning over humans using a tool that does the learning and create derivative works.

> There is a possibility that you're not saying that, but it's the only interpretation of your comment I could come up with.

That's fair. I just don't get it.

> Because your comment consists entirely of comparisons of software to human brains about whether or not something should be considered legal, and this only makes sense if the software itself can be held liable.

Human brains and software are both tools. The question I'm invoking is what is it about a person doing the learning and the derivative creation that's different from a person (since, as you say, software itself has no personhood) using an artificial brain to learn and perform derivative creation.

I think the disconnect here may be that I'm operating from the assumption that of course there are human beings liable for the software, but your interpretation of what I'm saying is that software in a vacuum should effectively have personhood applied to it. These are two different things. I'm referring to both humans/brains and software as the interchangeable variable in the question of why the choice of tool means applying entirely different legal principles.

Sorry if I wasn't clear or still am not being clear here. I wanted to make sure I was being understood correctly, but if all we can do from here is agree to disagree, that's fine, and I'd offer to just shake hands.


The way your comment was phrased made it seem, to me, like you were rebutting what I was saying and that regular human things are all irrelevant for whether or not something is a person.

There is one other way I have figured out to read your comment. Which is that it doesn't matter how software or a brain functions since it's only the action of the outcome that matters. But this is not really a relevant statement regardless of whether or not you agree with it, because the article is about a lawsuit and liability. A group of people, acting as a company, is suing other groups of people as companies. And software is not a person, and can't be held liable. So for that to be the case, the software would need to be made into a person, or equivalent. The fact that software and brains are or are not similar is irrelevant, because software is not a person and cannot be held liable.


> You aren't software. The laws that apply to people reading and learning do not apply to software.

If you're going to draw a sharp distinction between reading/words/people (software isn't people, etc.) and software I would argue that the legal and copyright considerations should be stronger for software, not weaker, precisely because of your argument.


I would not make that the foundation of my argument over the long term.


Assuming that software won't receive legal personhood seems like a pretty good foundation. If that's no longer the case, you have much crazier things to worry about than copyright infringement.


It’s puzzling to me why some people take it as a given that if a person is legally able to do A, then software should too. The capabilities of software are so different than that of a person that it seems reasonable to consider how the laws apply to software separately. A group of individuals could not absorb the NYT’s content nearly as fast as software can.

Like, can we confidently say that laws around copyright and licensing would not have been written any differently had LLMs existed at the time? It’s not obvious to me that the answer is yes.


No it's not. The AI isn't the one publishing its works. If I write an article that is word-for-word the same as a Times article, I'm not liable for anything until I make it available to others. The means by which I wrote the article are irrelevant. This is the core of the issue. In this case, the "AI" is basically just a fancy writing tool for the publisher.


> therefore so should software, is a non sequitur.

It's not. Why isn't Photoshop being sued for copyright infringement? People can reproduce infringing works in it easily enough.

How is pressing buttons in photoshop different to pressing buttons in chatGPT?


> How is pressing buttons in photoshop different to pressing buttons in chatGPT?

It's not! That's the entire point. The liability is with the people, not the software. The software isn't being sued! You can't even do that, because the software isn't people. Software isn't humans. The people developing ChatGPT incorporated other people's copyrighted works into their product, and they're the ones being sued.


Reproductive tools (like photoshop or photocopiers) require a human in the loop and a previous instance of the copyrighted material to reproduce copyrighted content. They don’t ship with copyrighted content inside of them. ChatGPT contains a compressed representation of all NYTimes articles and will freely reproduce them when queried. That’s the key difference. Similarly it would not be an issue to sell a database tool that could store news articles but it would be an issue to sell one pre-loaded with all NYTimes articles ever written.


It is exactly the same. You can absolutely be sued for copyright infringement in your example.

https://www.adobe.com/legal/dmca.html

Just because you produce an infringing work doesn't mean it will be discovered and then a legal case made.


Correct. But the person using the software is charged. Not the software itself. The NYT is trying to push liability onto the software (ChatGPT-cannot be sued) instead of the software’s users (you and me-can be sued). I feel this is the correct interpretation.


No. They are suing the people who made the software. They are not suing the software. You can't sue software. Software isn't a person.


This is just pedantry.

Would NYT be suing Adobe (the people who made the software), if one of their users utilized Photoshop to produce an image that violates copyright? I don’t think so.


No. They would sue Adobe if Adobe put images that violate copyright into Photoshop and let users insert them from a dropdown menu. Your comparison is incorrect and bad.


The comment I replied to was saying “they are suing the people who made the software,” nothing about “putting images” or publishing them or anything like that.

My comparison addresses exactly that, as Adobe is the company that made the software (in this specific example, Photoshop).

You cannot retroactively switch up your argument to something entirely different and then claim that my comparison was incorrect and bad.


I've replied consistently to everyone in this thread, to the point of monotony. I haven't switched anything. I've replied so consistently and repetitively, with the same statements, that it's actually grating. This is the second time someone has tried the "you switched what you've said" argument tactic, even though it's very clear the only thing I've talked about is liability of the creators of the software, not the software itself. You have either not fully read my comment that you are replying to, or are misunderstanding it thoroughly enough in order to pretend that I've said the opposite of what I've said.

I don't owe you any further replies, but I wanted to make this clear to anyone else skimming the thread.


But photoshop doesn’t, for example, come with unlicensed National Geographic photos that it gives you as a starting point for you to create an infringing work.


Many people don’t like equating human and AI learning (see sibling), but it’s interesting to pull more on that thread.

In all three of your transformation examples, you likely would’ve paid the NYT either indirectly via ads or directly via a subscription (contentious ad-blocking aside). And Mazda, GM, and VW surely generated far more value from the car designer in revenue than the designer got in salary and knowledge. Broadly, humans pay something (tuition, apprenticeship, etc) to learn knowledge and skills. Even schools and libraries are not free, though they may feel so at time of use. _Some_ free learning is beginning to emerge on the internet, but much of it is as upsell to paid learning.

OpenAI on the other hand has been purely extractive in its learning relationships. A single human contravening normal learning costs would go unnoticed but also would not scale their own downstream impact. But an organized large group doing so would certainly run into issues (10k interlopers couldn’t just sneak into a single college course, for instance). And I think a large group of aligned/associated people is more akin to AI than a single person is, given the differences in scalability.


Even in the human example - if you pay for NYT, perfectly memorise its articles and reproduce them for other for a fee, you’ll probably find yourself in trouble!


I feel like AI-personhood is not a road OpenAI want to go down. They want to produce something that doesn’t have rights.


It should hopefully be a good highlight of some of the base absurdities that we accept as fact when it comes to "intellectual property" law.


Honestly I don’t think this is the fundamental question in this case.

It’s buried a bit by the lede but IMO the more fundamental question is whether AI models can be trained off of publicly available, copyrighted content without compensating the owner of said copyright. This is regardless of how the eventual work is presented and how “different” it is.

It’s where this falls on the “fair use” spectrum. This is uncharted territory for any sort of fair use doctrine as written.


If I take the New York times article and publish it verbatim, that's a problem.

And that seems to be the problem. Apparently ChatGPT was spitting out entire copies of NYT articles without attribution and falsely attributing to NYT things it had made up.


This is not what using copyrighted works for AI is about. A human is not software, when you read an article it is not "copied" into your brain.


> when you read an article it is not "copied" into your brain.

1) You cannot prove that this does not happen in some form. The fact that manual electrical stimulation of the cortex triggers random memories is sort of the proof that you are wrong here; if there wasn't some kind of "copy, transformed somehow into a 3D space we call the brain" in there, then that localized stimulation wouldn't trigger a particular memory at all.

2) And yet, this is not true of AIs either. Knowledge encoded into real neurons or into a simulation of them may be similar in this respect: neither is a "literal copy", yet both are a kind of copy nevertheless.

If we are only concerned with "literal" or "verbatim" copies, then there is no case here.


Right, because copyrighted works have nothing to do with AI. So it's all irrelevant.


Without going into the merits, maybe the best thing about this is that the NYT has the money and stature to hire good lawyers and fight this all the way to the Supreme Court.

It'll benefit everyone if the fair use question is decided definitively sooner than later. I could see OpenAI settling, but it's in NYT's interest not to since it's not just OpenAI, they're just the most egregious ("we can use your copyrighted works for our model but you can't use our uncopyrightable output for your model").


I have less optimism. The article says NYT was trying to negotiate with Microsoft and OpenAI for licensing, so they may be inclined to settle ("just give us a zillion bucks a year and we're cool"). And if they don't settle, then questions like this often seem to be decided on technicalities that don't really matter to the many observers hoping for a broader ruling.


That isn't a great deal for OpenAI, since everyone will be at their door asking for licence fees in that case.

In this case it's all about the technicalities.


It's not a fair use question, it's a question about models reproducing the article text almost verbatim.

(IANAL)


If it's fair use, reproducing the article text verbatim is fine.


What is "it"? Training can be fair use, i.e. updating weights incrementally based on predicted next token probabilities. And I (not a lawyer) think that if a broadly trained foundation model can recall some verbatim text, that doesn't mean the model is infringing.

It seems like the lawsuit here is talking about specific NYT-related functionality, like retrieving the text of new articles. That essentially has nothing to do with training and running a foundation LLM; it's about some specific shenanigans with NYT content, and its legal status would appear to have nothing to do with whether training is fair use.
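
To make the "updating weights incrementally based on predicted next token probabilities" part concrete, here is a minimal, purely illustrative next-token prediction sketch in PyTorch. The toy byte-level model, sizes, and sample line of text are made up; nothing here resembles OpenAI's actual training setup.

    # Toy sketch of next-token prediction training. A real LLM is vastly
    # larger, but the loop is conceptually the same: predict the next token,
    # measure the error, nudge the weights.
    import torch
    import torch.nn.functional as F

    vocab_size, embed_dim = 256, 32                     # byte-level "tokens", toy sizes
    model = torch.nn.Sequential(
        torch.nn.Embedding(vocab_size, embed_dim),
        torch.nn.Linear(embed_dim, vocab_size),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    text = b"All the News That's Fit to Print"          # stand-in training document
    tokens = torch.tensor(list(text), dtype=torch.long)
    inputs, targets = tokens[:-1], tokens[1:]            # predict token i+1 from token i

    logits = model(inputs)                               # scores for each possible next token
    loss = F.cross_entropy(logits, targets)              # how wrong the predictions were
    loss.backward()                                      # gradients w.r.t. the weights
    optimizer.step()                                     # one incremental weight update

The weights end up storing statistics about the text rather than a literal file copy, which is exactly why the legal status of training is so contested.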


Good luck trying to explain "updating weights incrementally based on predicted next token probabilities" to completely non-technical lawyers and judges.


Good thing they don't have to. As I've said before, this sleight of hand of talking about the case as if it's only about training is a great move by OpenAI; however, the case is about more than just training. NYT is alleging that OpenAI maintains a database of NYT works and refers to it verbatim (per prompt) as part of the ChatGPT software, not via some pre-trained weights.

This is akin to having to explain to non-technical lawyers and judges how crypto works. In the FTX case it became irrelevant because you can just nail them on fraud for using deposited funds for non-allowed reasons.


>NYT is alleging that OpenAI maintains a database of NYT works and refers to it verbatim (per prompt) as part of the ChatGPT software, not via some pre-trained weights.

So if ChatGPT didn't refer to it verbatim and if ChatGPT trained on it and mixed it with other content, NYT would be OK with that? Tbh I don't get it.

Edit: I found this in my bookmarks archive - https://news.slashdot.org/story/23/08/15/2242238/nyt-prohibi...

Also this: https://www.cnbc.com/2023/10/18/universal-music-sues-anthrop...


It depends on the purpose of ChatGPT. If people use it as a substitute for the NYT then yes I suspect NYT will be not ok with it.

I think the courts will also side with the NYT. Very recently there was a copyright case involving Andy Warhol [1], which his foundation lost. Despite the artwork being visually transformative, its use was not transformative. So, to me, that means if you create a program using NYT's materials that is used as a replacement for the NYT, it will not count as fair use. Obviously you could just do what, say, Google does and fork money over to the NYT for some license.

However, my initial point is that this is a tangent. NYT has claimed that OpenAI is using NYT's works at least as-is, and so OpenAI can just be nailed for that. Which is my point about FTX; it's irrelevant whether their exchange was legal since you can just nail them for misuse of customer funds. Another example would be Al Capone; it doesn't matter that he's a mobster because you can nail him for tax evasion.

[1]: https://www.cbsnews.com/news/andy-warhol-supreme-court-princ...


I think this is more of a question of licensing content: sooner or later AI chatbots will have to license at least some of the content they are trained on.

But broadly speaking this is also a question of the "Open Web" and whether it will survive or not. Walled gardens like Facebook, Instagram etc. are strong and pervasive, but still the majority of people use and acknowledge publicly open websites from the Open Web. If AI chatbots do not drive traffic to websites then they are walled gardens, and Microsoft, Google or whoever will lock users in and try to squeeze them for money.


I didn't see NYT allege that - their lawsuit explains pre-training pretty accurately I thought.


It's buried on page 37, #108. There are probably other examples in the lawsuit, but this is sufficient.

> Synthetic search applications built on the GPT LLMs, including Bing Chat and Browse with Bing for ChatGPT, display extensive excerpts or paraphrases of the contents of search results, including Times content, that may not have been included in the model’s training set. The “grounding” technique employed by these products includes receiving a prompt from a user, copying Times content relating to the prompt from the internet, providing the prompt together with the copied Times content as additional context for the LLM, and having the LLM stitch together paraphrases or quotes from the copied Times content to create natural-language substitutes that serve the same informative purpose as the original. In some cases, Defendants’ models simply spit out several paragraphs of The Times’s articles.

https://www.courtlistener.com/docket/68117049/1/the-new-york...
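
For anyone who wants to see what that "grounding" technique looks like mechanically, here is a minimal retrieval-augmented generation sketch. The retrieval stub, model name, and prompt wording are all assumptions for illustration, not what Bing Chat or Browse with Bing actually do.

    # Minimal sketch of retrieval-augmented generation ("grounding"): fetch
    # relevant source text at query time and paste it into the prompt as context.
    from openai import OpenAI

    client = OpenAI()

    def fetch_articles(query: str) -> list[str]:
        # Hypothetical web-search step; a real system would hit a search index
        # and scrape the result pages here.
        return [f"<article text retrieved for: {query}>"]

    def grounded_answer(user_prompt: str) -> str:
        context = "\n\n".join(fetch_articles(user_prompt))      # copied source text
        messages = [
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_prompt}"},
        ]
        resp = client.chat.completions.create(model="gpt-4", messages=messages)
        return resp.choices[0].message.content

The point is that this is a different question from training: here the copied text is fetched and handed to the model verbatim at answer time, regardless of what is in the weights.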


Oh I see - yeah, that's the part of the lawsuit that's about Bing and ChatGPT Browse mode retrieval augmented generation.

It's a separate issue from the fact that the model can regurgitate its NYT training data.

There's a section on page 63 which helps clarify that:

    Defendants materially contributed to and directly assisted
    with the direct infringement perpetrated by end-users of the
    GPT-based products by way of: (i) jointly-developing LLM
    models capable of distributing unlicensed copies of Times
    Works to end-users; (ii) building and training the GPT LLMs
    using Times Works; and (iii) deciding what content is
    actually outputted by the GenAI products, such as grounding
    output in Times Works through retrieval augmented generation,
    fine-tuning the models for desired outcomes, and/or
    selecting and weighting the parameters of the GPT LLMs.
So they are complaining about models that are capable of distributing unlicensed copies (the regurgitated training data issue), the fact that the models were trained on NYT work at all, and the fact that the RAG implementation in Bing and ChatGPT Browse further creates "natural-language substitutes that serve the same informative purpose as the original".


Yep, you seem to be right. Google stores the quotes from pages, and it's fair use. Again, I am not a lawyer, and didn't think about this.


This is argued extensively in the lawsuit document.

A key argument the NYT is making is that part of the definition of fair use is not producing a product that competes with the original.

They argue that ChatGPT et al DO compete with the original, in a way that harms the NYT business model.

One example they give: ChatGPT can reproduce recommendations made by the Wirecutter, without including the affiliate links that form the Wirecutter's main source of revenue - page 48 of https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...


There are plenty of other services that strip affiliate links from content for users, such as ad blockers.

Notably, in both those cases, the user is specifically asking for what wirecutter thinks.

To me, that makes the infringing behavior clearly the fault of the user of the tool, not the tool itself.

When I saw this lawsuit announced, I assumed that large portions of content were being reproduced in response to generic queries. That isn't the case; every example I've seen from this lawsuit has prompts where the user specifically asks for the content. To me, any fault and liability rests on the user here.


Fair use is a key defense for OpenAI.

The article also mentions the idea that the model is merely extracting (uncopyrightable) facts, which is interesting, but might be a tough one to prove since LLMs have no way of establishing what are facts, and don't return facts by design.


Assuming this case is decided decisively in the NYT's favor, what would the effects be on the competitiveness of the US AI industry against other countries that aren't beholden to the ruling? Is the fix really as simple as licensing the data for training? Will the NYT et al. even allow that? Surely one can't get licenses from each and every website they use data from, so the only way to compete would be to be more clever with less data, right?


For enterprise solutions, it’s hard to imagine a viable competitor that wouldn’t be beholden to similar rules. At least here in the EU, where we aren’t very likely to be using non-western tech considering our privacy and national security laws. In both the finance and energy sectors we already face the prospect of needing exit plans from Microsoft products, not because of an anti-Microsoft sentiment but because the EU views it as a national security risk if too much of certain sectors rely on a single tech company. We have a lot of similar bureaucracy taking shape, so we’re not likely to turn to something like Asian AI. Hell, the fact that AI vendors wouldn’t adhere to whatever legislation comes out of these lawsuits would likely be enough to disqualify them as an option in the first place.

I do think it’ll be interesting to see how our militaries handle the issues. Considering that is probably the one area our governments don’t want to be “behind”.


> what would the effects be on the competitiveness of the US AI industry

To me, this leads to another question: is it possible to get a GPT-4 quality model using only training data from major content providers?

In other words, what if OpenAI didn’t crawl the entire internet and only used data from Facebook, Twitter, Wikipedia, GitHub, NYT, Stackoverflow, and [list 100 more well known high quality data sources here]

If it’s possible to get to GPT-4 model quality with training data from a couple hundred centralized sources, then maybe it won’t have any impact at all (other than OpenAI needing to pay other companies to use their data for training data)

Or, if the long tail “every website on the internet” really is the secret to training LLM models, then that would be much more problematic.


There are quite a few papers that indicate that the important factor is the quality of the data, regardless of where it is sourced from. You still need a lot of it so throwing out large chunks of your training set does work against you whether you're doing that for legal reasons or quality reasons. The solution appears to be synthetic, high quality data generated by another model but that makes it a bit of a chicken and egg problem where you need a high quality model like GPT-4 to produce high quality data reliably. I think there are methods for getting around this using less capable models by producing a lot more examples and then using another model to judge the quality of each example and only selecting the best ones. I also suspect you could get pretty far by permuting a smaller set of high quality data to produce a lot more examples that have the same meaning but differ in how they are written.
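
A rough sketch of that filter-by-judge idea, just to illustrate the shape of it. The model names, prompts, and 0-10 scale are assumptions, not any lab's actual pipeline.

    # Generate many candidate examples with a cheaper model, score each with a
    # stronger "judge" model, and keep only the highest-scoring ones.
    from openai import OpenAI

    client = OpenAI()

    def generate_candidates(topic: str, n: int) -> list[str]:
        out = []
        for _ in range(n):
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",          # cheaper generator model (assumption)
                messages=[{"role": "user",
                           "content": f"Write a short, factual training example about {topic}."}],
                temperature=1.0,                 # high temperature for variety
            )
            out.append(resp.choices[0].message.content)
        return out

    def judge_score(example: str) -> float:
        resp = client.chat.completions.create(
            model="gpt-4",                       # stronger judge model (assumption)
            messages=[{"role": "user",
                       "content": "Rate the quality of this training example from 0 to 10. "
                                  f"Reply with a number only.\n\n{example}"}],
            temperature=0,
        )
        try:
            return float(resp.choices[0].message.content.strip())
        except ValueError:
            return 0.0                           # unparseable judgement counts as junk

    candidates = generate_candidates("how copyright licensing works", n=20)
    kept = [ex for ex in candidates if judge_score(ex) >= 8]   # keep only the best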


Even then, e.g. citizens in the EU have the right to be forgotten. So if they use user-submitted data (and in some cases e.g. newspaper articles about specific people), users can request the data to be removed, which will be... an interesting challenge?


It’s possible to get a good image model this way: https://www.gettyimages.ae/ai/generation/about


Big Tech might actually prefer it: on one hand, they lose a whole lot of money by having to license scraping of a bunch of sources, but on the other hand, no open source model, and no small company, has a prayer of building a model legally.

It's not even down to the NYT and such: think Facebook, or Reddit, suddenly being able to charge others for their massive amount of data, while forcing anyone who posts there to make their content usable for the platform's own training as a condition of posting at all. Imagine a world where, say, Microsoft buys the Times, just because it's better for them not only to have the training data, but to tell every competitor to get bent if they want a license. 'Google buys Disney for the training data'. If AI is that useful, and we make copyright that strong, what we might see is ailing media companies being bought out by tech.

Ultimately IP law is about the tradeoff of more original content and more usefulness for the world, and AI massively changes where the center of balance is. So I don't expect a ruling for the NYT to remain the law of the land for very long, just due to economic pressures.

Ultimately it doesn't matter which side people like aesthetically: if an AI trained with no regard for copyright is that much more useful than one that quotes every source and pays it some money, then piracy will ensue. If it doesn't really help, then copyright will keep being very strong.

It's not even a matter of countries, but hackers. Would Aaron Swartz, working on his own LLM, pay for copyright dues or not? Because there's enough people in tech with that mindset that we'll end up with LLMs trained by a group that thinks like he did.

We will have laws, people in places that aren't subject to our laws, and pirates. It's how it always works.


Companies that don't operate within the West's legal scope will become dominant, and companies like Google and Microsoft will suffer for it, so big companies definitely don't want the NYT winning. You think Chinese companies would shy away from offering a ChatGPT replacement to the West if the US stopped offering those services? And as a bonus, they get to inject all the political bias they please into this monopolized product.


There's another strong possibility: that a court rules that it's fair use if the output is not sold (directly). That would be a positive outcome.


Insightful comment -- data licensing will become a much more valuable business. That's the context with the NYTimes lawsuit -- they were negotiating a data licensing contract with OpenAI, who were dragging their feet, doubtlessly continuing to scrape. This plata o plomo lawsuit is strategically smart for NYTimes.


I think it would lead to China surpassing the US in terms of AI tech pretty quickly. The US is actively trying to prevent that outcome already using sanctions on chip manufacturing equipment and IP.


The law tends to follow the money. I just don't see a future where a copyright ruling kills AI dead. If the courts rule in favor of NYT, congress will just change the law.


That seems like an opinion from within the tech bubble. There is a vast world outside and most people and companies have a completely different perspective.


Currently I would say that the money is on the side of content producers, not the AI companies... as there are a lot of parties that are extremely happy to limit what AI can produce...


> what would the effects be on the competitiveness of the US AI industry against other countries that aren't beholden to the ruling?

If it is found to be a breach of copyright then the Berne Convention comes into play, which makes it hard(er) for international offerings to exist.


The NYT needs to increase profits every quarter, just like everyone else. I have no doubt they will license their content to OpenAI. With the issue unsettled legally, right now, I suspect the NYT doesn't think OpenAI is offering them a fair deal.


Privately controlled companies need to do what their controllers think is most important. If the Sulzbergers think taking a stand on AI is more important than profit, they can go a long time without increasing profits.


The NYT isn't taking a stance against OpenAI. This is simply a negotiating tactic to get money for licensing.


We don’t know their personal motivation. They may very well dislike AI.


Surprisingly (at least to me), NYT is public.


I would add: should the case be dismissed, what would the impact be for any US-based content creator? Why write a book, a movie or a song if an engine can take your work and milk it more than you…


Most books, movies and songs make their creators no money at all. The famous, profitable ones are a very small fraction. Creatives create because they enjoy creating. And for the possibility of fame, which doesn't require copyright, as evidenced by all the influencers who achieve fame in spite of producing no media whatsoever apart from videos of themselves.


Worth noting that even the complete abolishment of copyright would not kill the arts entirely. There would still be many who make art to be heard, to bring joy to others, or under a Patreon-type arrangement.


It could go either way. Since others have laid out why it might benefit other countries, I'll lay out why it might benefit the US, from the perspective of an ML researcher since long before the hype arrived.

AI is only useful in so far as it's grounded in reality. If you ask it how tall the empire state building is, the answer has to be in some way derived from how tall it really is. Even creative writing is only useful in so far as how it's grounded in real life human experience. This is all to say that a bunch of people need to keep writing stuff about the real world for models to keep up with the real world.

I'd say the ideal scenario for US AI development is:

1) People keep making high quality data grounded in real world USA

2) That data is easily accessible for research purposes

3) That data is usable for commercial purposes (but might require more work than for research purposes)

A ruling for NYT here might keep good data more widely produced and available. Research legality is almost surely not going to be affected, so keeping data produced and available is key to keeping domestic research going strong and keeping talent here.

A ruling against NYT might mean all the new grounded-in-reality data goes behind closed doors. And even though the legality of research isn't in question, that would make the research harder. It would be easier to commercially use what you can find, but it could be garbage-in-garbage-out in the sense of quality data becoming hidden.


You would do something like a DMCA takedown request form but with an industry-standard fee, à la what Spotify pays to royalty holders per stream of a song, scaled based on the value of the content. It's gonna be tiny, tiny fractions.

Ideally a marketplace would develop that can price each piece of content fairly.


That would require the model to have the ability to attribute what training data sources it’s using to respond to a specific prompt, which AFAIK isn’t really compatible with how LLMs work.


> Assuming this case is decided decisively in the NYT's favor, what would the effects be on the competitiveness of the US AI industry against other countries that aren't beholden to the ruling?

Probably none. They can just open a subsidiary there and siphon the same data to train their models.


80/20 rule surely applies.

Licensing the most prominent IP holders will get the vast majority of data secured.

These billion-dollar-funded efforts can comfortably chase down, say, hundreds of deals.

GPT-4 finished training 18 months ago… plenty of time to clean up this mess, if there had been a will.


1. _Material_ licensing fees for use of protected content. NYT will have grounds for punitive damages (over 100K per work per infringing use), enough money to make OpenAI and even Microsoft come to the negotiating table.

2. Open door for everyone else whose content has been copied to demand payment.

3. Japan aside*, I'm not sure how a decisive victory for NYT fails to affect companies in other countries. Other countries will certainly take a US opinion into account. Furthermore, nothing whatsoever would prevent NYT from suing for infringement in France, Germany, etc.

A number of European countries are _much_ more comfortable with compulsory licensing than the US is.** I will not be surprised at all if Germany, Denmark, Sweden etc impose an AI-training license that companies have to pay to train or even use AI models. Could be based on employees, users, revenue, etc. They've been doing this for decades for other forms of mechanical reproduction, so it seems a reasonable path forward.

* I've seen reports that Japan has declared that it will not enforce copyright against organizations for training models on copyright-protected works. I don't know whether this means that _everything_ is fair game in Japan-- it could mean that training won't be enforced, but commercial use of models trained on protected works could be. I think it's somewhat harder to argue against training a model than it is against leveraging a trained model for commercial purposes.

** AFAIK, the only use for compulsory licensing in the US is musical performance.

4. "Surely one can't get licenses"... collective licensing would be a path forward for which there's a little more history in the US than there is for compulsory licensing. In a collective licensing model, OpenAI/Microsoft/Company X would be able to purchase a license priced by revenue/employees/users etc, and funds would be distributed to participating copyright holders.

For that to work, you need some kind of agency that has a ton of significant copyright holders. If some agency were able to get to a critical mass, that could happen without Congressional action.

Congress could also step into the problem in a couple of ways. They could wave a wand and make the problem go away for OpenAI/Microsoft/everyone else, or they could explicitly require some kind of license. I'd be shocked if Congress handled this productively _at all_; these issues are incredibly fraught in the best of times, Congress has been terrible for years at any kind of precise technical work, and the current Congress is uniquely incompetent.

I expect that if Congress accomplishes anything on this topic in the next couple of years, it will turn out to be awful for everyone.


One would hope that the outcome would be that NYT would be immediately delisted from all search engines, as being included requires AI learning from their content to index it. We don't need the NYT as a source to build AI systems.


OpenAI could just retrain with "clean" data...it would be a setback but not a death knell

Big Tech could also retaliate by making NYTimes conveniently disappear from search/news indexing


Or Big Tech could make a deal with Big Media: Apple Explores A.I. Deals With News Publishers.

https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...


Bing I hardly knew ya


I think the best result of this case is that the societal impact of AI is weighed fairly. Both in its potential for good, and therefore the need not to overly stifle genuine progress, but also in its potential for replacing jobs for real people—with silicon owned by the hyper wealthy—and therefore the need to set the correct precedent for attribution of works and some form of royalty. Most of the AI models of late are in some way trained on the collective works of mankind, so it seems like legislation needs to be defined, to prevent the wealthy running off with the whole planet, so that a portion of AI profit is returned to the common man who doesn’t have the aptitude or dumb luck to run a successful unicorn.


This is a corporation fighting another corporation; it isn't going to give journalists more money no matter what happens with this case. The impact on the citizen or worker hasn't even been weighed.


Why do you assume this is the case? When a reporter's story gets optioned for a movie they receive money, sometimes significant, from that.


> societal impact of AI is weighed fairly

It's doubtful that the courts are going to accomplish anything so broad. And the media is unlikely to accomplish anything so nuanced.


One landmark precedent opens the door to another and so on. We’re clearly not in a “fast takeoff” scenario, so mankind has time to strike a deal. We need to honestly weigh the dynamics of a world where climate change + societal collapse + advent of AI + power hierarchy = let’s just let billions eat shit while the ultra wealthy survive the apocalypse thanks to leverage we handed them without blinking or thinking.


What’s most frustrating to me is the ridiculous false equivalency people make between computers and humans

“Babies learn from their environment why not GPT?”

“Human teachers remix stuff for their lessons all the time, why can’t GPT?”

Computers are not humans.

Computers are not humans.

Computers are not humans.

Good grief.


People's opinions don't come out of a vacuum, and they usually do a lot of rationalization when it comes to defending their own prejudices and whatever matches their self interest.

This is HN, lots of people here see LLMs the same way miners saw the gold rush. When you deal with people dreams of richness, you can be pretty sure they would move us all right into a skynet situation, rationalizing every single fucking step towards our demise.


Current AI discourse has made me realize that there is quite the dichotomy between people who see humans as nothing more than biological machines that will soon be obsoleted by synthetic machines, and people who see humans (as Kant would say) as an end.

I think that it's sad that the leading AI labs seem to mostly be run by people in the former camp. It doesn't have to be this way.


> people who see humans (as Kant would say) as an end

Kant's arguments would apply just as much to intelligent, self conscious machines as to humans; nowhere does he base his conclusions on something as dumb as "because we're biological".


Of course. And if machines become conscious (I don’t think we should ever build something like this even if we can), I’d be the first to advocate that we give them moral consideration.

Right now though, we should put humans, and animals for that matter, before anything else.


This feels overly simplistic to me because there's no way to define "conscious"


they're only in it for the money.


OpenAI is a corporation, which means it's a person under US law. So yes, it's relevant as to whether a person can remix copyrighted material or learn from copyrighted material, because the same laws that apply to people apply to corporations. Moreover, the AI is the product of the collective action of _people_; it wasn't formed out of nothing.


> OpenAI is a corporation, which means it's a person under US law.

No, corporations are not people. They have some of the rights a person has, but not all. For example, from [1], "it is firmly established that a corporation has no fifth amendment protection against self-incrimination".

[1] https://law.justia.com/cases/federal/appellate-courts/F2/515...


Corporate rights are basically limited to the rights that any collective group of people would have, so self-incrimination doesn't make sense, but it doesn't get around the broader point that the same corporate personhood that allows corporations to be sued for copyright infringement, allows them to use the same defenses that groups of individuals (ie, co-authors) could use for copyright infringement. Could you sue "Alice and Bob" for copyright infringement because they watched a lot of movies before writing their screenplay? Nobody writes a screenplay from first principles. They're all based on all the other movies they've seen to some extent.


BS doesn't become true through repetition - unless you're an AI, that is.


>Computers are not humans.

A computer that can reason and is self aware like a human is _morally equivalent_ to a human unless your morality is based on some kind of human supremacist philosophy. And I imagine in the future human supremacists are going to be regarded the same way white supremacists are regarded currently.


This is also a speciesist position for making human-like reason and self-awareness the selector for moral status, which selects out non-human animals.


Yes, the lack of clarity between human characteristics and machine characteristics here is problematic. For example, there are two particularly salient differences at hand: (1) the ability to make cheap digital copies of machine intelligences; (2) the rights granted to humans.

This lack of clarity is emblematic of how legal mechanisms operate. Loosely speaking, in comparison with computational and philosophical approaches, the law is quite lazy at drawing distinctions. Disagreements get deferred using a pre-determined legal process. One might say the lazy-approach of the law is "by design". I'm not sure; I'd need to dig into the history a lot more to assess.


> Computers are not humans.

And? Why does that invalidate either of your two examples? You need to answer "why not GPT?" with something more substantial than "because it's not a human."


Oh. This one is easy.

ChatGPT is a commercial software product that generates wealth for its owners, so allowing commercial products to steal and promote others' content on their own is an agreed-upon violation of rights.

Let's look at an example: Google doesn't get to index and show New York Times content directly in its search results.

If you think of ChatGPT more like Google maybe that will help you keep it straight.


That wouldn't be legal even if GPT was a person. It being software doesn't play into your argument at all.


Ok. So GPT is a tool and therefore the legality is entirely tied to the use of its outputs. And we already have a large body of law around that for IP. I’m still not understanding your argument.

And Re: Google. I agree. Which is why the case law around Google books showing full excerpts is important to look at.


I'm a biological cog that generates wealth for my owners (i mean employers) and I use others' copyrighted content to learn how to do so


I've already seen one person today suggest that discontinuing a generative image model that produces problematic output is comparable to murdering an artist. It's pretty absurd.


What frustrates me the most are comments like yours that contribute nothing whatsoever to the debate about the limits of copyright. Humans are not special, get over it.


If you're interested in this case, I very strongly recommend actually reading the 69 page PDF. It's surprisingly readable.

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

The first 4 pages lay out the high level complaint very clearly, but I found most of the rest of the document (skip the bit about jurisdictions) clearly written and easy for my non-lawyer brain to understand too.


The verbatim reproduction of their copyrighted content is the interesting point. I can see where it would be fair use to train on content but not to reproduce it.

I feel like it is clearly a copyright issue if someone, through whatever means, posted New York Times articles on the internet for 1000s to potentially see and find through search.

If someone, through some other interface, provides a New York Times article verbatim on request to 1000s of people independently through private channels, isn't this similar?

In both cases it requires users to seek it out. In both cases copyright-protected content is openly available to 1000s of people. Generating your stock of copied material one by one, on an order-by-order basis, doesn't feel materially different in effect on the copyright owner. The interface owners are willing to distribute copyright-protected content 1000s of times on request.


The outcome of this is probably going to be the same as similar cases — the end user (ie OpenAI in this case) must properly follow the law. It could be the media companies using AI that must evaluate the legality though.

In any case, AI is a tool, not capable of reasoning. As such, the end user is responsible for ensuring all applicable laws are followed. This is always the case. You wouldn’t sue Microsoft for copy & paste, would you? It lets you copy people’s work.

Imo the NYT case is pretty damn weak, because derived works are fair game. I can even read their article in full and provide comments on it, and it’s legal (provided the objective isn’t to share the copyrighted work)


Has anyone successfully repro'd the plaintiff's claims of unlawful use? I just made several attempts using GPT-4 and was unsuccessful.

Perhaps OAI have already taken action to disable ChatGPT's ability to quote in full without attribution, if it was ever possible in the first place(?) Plaintiff doesn't disclose exactly which prompts they used to give rise to their infringement claims, which seems suspect.

If the case goes to trial, and OAI has taken steps to mitigate the harm - i.e., it no longer reproduces NYT content word-for-word - does that significantly reduce the damages they could seek?


I think you need to use the API to reproduce. I heard the chat prompt stops gpt4 from some of this behavior.


Food for thought: what happens when we add a layer of indirection in the training?

Imagine we have a model but don't train it with the copyrighted data. But we train it with every review ever made public about it.

If we have a large number of reviews the model will be able to give pretty good responses about the topic and may even be able to reproduce some part of it.

Is this copyright infringement?

And what if we add another layer? Now we just train the model with the comments people leave about the reviews.


“Imagine we have a model but don't train it with the copyrighted data. But we train it with every review ever made public about it.”

Those reviews are also copyrighted works.


I take issue with describing hallucinations as lies. They're most certainly not. There is no culpable agent with the ability to determine whether or not they are telling the truth. There is no intent, positive or negative, behind the generations.

The generated statements may be inaccurate but they cannot be called lies. To do so would be a dangerous level of anthropomorphization in the legal sphere.


Lying and hallucinations would be accurate with the way people treated computers before “AI” ever existed. Intentional reckless disregard for truth is a form of lying.

You don’t have to worry about the legal sphere anthropomorphising software. Software is legally not a person and therefore will not be treated as one by the courts, regardless of how people commonly refer to it.


If AI were treated like software these court cases would be very clear and not an issue. The answer would be that yes AI can copy content as part of its operation but if a human uses it to reproduce that content exactly for others then it violates copyright law.

We've long had precedent with search engines. But despite that, these cases are revisiting the way copyright might work.

These cases are not as simple as people like you imply no matter how many times you repeat the utterly pointless and irrelevant fact that computers are not humans.


> "You don’t have to worry about the legal sphere anthropomorphising software."

This is a lawyer writing about a lawsuit on a law office blog. That's what's worrisome.


I take issue with describing lies as hallucinations. For an LLM they’re no more or less a hallucination than generating something that’s correct.

LLMs have no sense of what is correct or not, therefore they cannot tell the truth or lie. They always hallucinate - sometimes those hallucinations match our factual understanding.


People also misremember things with no intent, yet they are still considered as lies. I understand AI is not a human, but classifying a hallucination a lie seems reasonable to me.


>People also misremember things with no intent, yet they are still considered as lies

Generally when people remember wrong it's considered a mistake, not a lie. Lies imply an intent to deceive.


"Hallucination" is also a dangerous level of anthropomorphization. It's sensational; it implies a fanciful level of imagination and visualization that isn't there.

"Fabrication" is a much more accurate and neutral word. I wish "hallucination" hadn't taken hold.


All this is going to do is make open source LLMs stronger in the long run. https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

Good luck suing a nonprofit open-source LLM for this.

Also, the linked article uses the phrase "very strong case" while the subject here uses "strong case"; given that other lawsuits in this space have already been challenged (https://www.hollywoodreporter.com/business/business-news/sar...), isn't the "strength" of the case very subjective?


> Here, Microsoft and OpenAI haven’t just used The Times’ content to teach the AI how to communicate, they’ve launched news-specific services and features that ingest both archived content and brand new articles from The Times.

What news specific services are they talking about?


One allegation that stands out is

> DMCA Section 1202 violations by all defendants regarding removal of copyright management information from items in the datasets

Aside from the obvious fact that OpenAI doesn't want "© 2023 The New York Times" appearing anywhere in their output, I'm sure it's also removed because LLMs don't adapt well to metadata and can't reliably correlate metadata to the exact text it's attached to, treating it instead as input text itself. One imagines a first round of training where this metadata was retained and every response from the LLM came back with a perfunctory hallucinated copyright from another publisher.


The legal question should be around: does OpenAI make any guarantees about the content and its validity? Is the output considered property of OpenAI? You can’t train on it, so does that mean it’s their property? This is the most important question.

For example, if you were to hire a group of writers, ghost writers to be specific, to re-create someone else’s work, then publish it yourself as your own, who is at fault? Who gets sued? Is it you who directed the writers to do the work, or is it the writers who did the work?

So if we accept this model, the onus is on the prompter not to re-create copyrighted works.

(Albeit an assumption that the writers are unaware to some extent)


The examples shown in the previous article on the same case (ChatGPT producing exact text) make it look like LLMs only need to see an example once. Or did it see those examples many times and is therefore reproducing them?


Not a lawyer, but seems like the law on copyright and fair use is not that clear cut. It comes down to arguments and how well they resonate. The lawsuit here argues all possible sides - producing content verbatim, summarizing an article (wirecutter), and then also hallucinations about the stories and brand names. All three are done by journalists themselves anyway, so a favorable verdict opens them up to a lawsuit?

Anyway, we tend to forget that there are countries outside the USA that can make laws too. And they will. What if one country happily says training content for LLMs is not a copyright violation (like Japan was rumored to do)? That is an answer I am more interested in finding out. Any country that passes such a law would see massive investment (especially if other countries deem it a violation). That would just mean tech + investments + innovation moving out of the USA. A recent trend here is the approach of France towards regulating AI in common EU laws[1].

[1]https://www.politico.eu/article/thierry-breton-mistral-battl...


In music, there is a 'compulsory license': you can't prevent someone from playing your music, but they do have to pay you for it. AI will need something like that; we are going to have to reinvent IP law. And big tech will write those laws since that's how our democracy currently works. Will be a big fight between Microsoft and Disney with maybe tiny input from actual creators.


> Here, Microsoft and OpenAI haven’t just used The Times’ content to teach the AI how to communicate, they’ve launched news-specific services and features that ingest both archived content and brand new articles from The Times.

What is this referring to? AFAIK reporting on real time news is beyond the capability of an LLM, and a Google search doesn't turn up any news reporting feature.

> But the hallucinations aren’t facts; they’re lies. And even if the defendants prevail in arguing that the AIs are mostly just providing people with unprotectable facts, there’s very little to shield them from liability for the lies, both with respect to trademark dilution claims, but also with respect to potential libel or privacy-related claims that might be brought by other individuals.

This seems really weak because libel requires damages. It would be incredibly hard for the NYT to prove any damages from ChatGPT lying about it. Also libel requires "actual malice" which means the libeler needs to know that their statements are false. Good luck showing that with an LLM.


> libel requires "actual malice" which means the libeler needs to know that their statements are false

Creators of LLMs (OpenAI in this case) should know that their LLMs regularly provide false statements.


And the NYT knows that its articles frequently contain false statements too. In this case they were sued for it and won even though they admitted in court that the libelous statements were not true. https://www.nytimes.com/2022/02/15/business/media/new-york-t.... The actual malice standard is next to impossible to meet.


One could argue that the LLM are not making any statements at all, as that would require intent and purpose beyond what they can (currently) do. The LLM outputs the most likely token to appear next, based on what it learned during training. This could be anything, and it happens to generate some text that sometimes looks like statements, but is just text generated by lots of math.

Of course I'm not actually arguing this, and have no horses in this race. But not hard to imagine that someone could make this argument and also believe it.


Telling a lie is not illegal by itself. Telling a lie with the intent of harming someone, especially their reputation, is another story.


> the libeler needs to know that their statements are false

That's only the first half of the or-conditional. The second half is "with reckless disregard of whether it was false or not." That's very different, and I'd think the model and its creators not caring whether the output is true but marketing it as usually true is having a reckless disregard for the truth of its outputs. There's no checking whether it's real.


LLM hallucinations are a well known problem, and accepted as such by OpenAI etc... perhaps, as a group / statistically, that rises to the level of "knowing statements [can be] false"? of course, IANAL... but keen to see how it plays out.


>Also libel requires "actual malice" which means the libeler needs to know that their statements are false. Good luck showing that with an LLM.

Isn't that easily proven by the text at the bottom of chatgpt: "ChatGPT can make mistakes. Consider checking important information."


Disclosing that the information is prone to error is the opposite of a malicious lie


ChatGPT contains the disclaimer, proving that OpenAI are well aware of the tendency of the model to confidently misstate fact. Bing chat contains no such disclaimer, nor does the OpenAI playground that many use as a product itself.


I’m not really interested in following every service but… idk. Is that actually true? I doubt it.


I just read the complaint and the question we are trying to answer here is moot. This part of the complaint hinges on verbatim copies of their articles being intermingled with fake excerpts from fake articles making it impossible for a reader to differentiate the two.

If I wrote 10 fake headlines, I might be protected. If I copied 9 from NYT and then made one up, it's a lot harder for me to argue I didn't intend to deceive.


Bing/Copilot, as well as Web Browsing in ChatGPT, can look up the news with simple web searches, and (allegedly, according to the lawsuit) bypass certain paywalls doing so. Even when not bypassing paywalls, there's still issues with hallucinations (sometimes I click on a link and it's not even remotely what Bing claims it is) and stripping out things like attributions or referral links (e.g. NYT's Wirecutter makes money through affiliate links).

That they're not part of the main GPT-4 model itself is true, but the lawsuit is against OpenAI and Microsoft, not vs. GPT-4.


Seems like people are very concerned with accuracy and attribution. Personally I want AI to make the internet a warm sludge of meaningless binary, I want you to have no idea whether the person on the other side of the keyboard is real or not, I want you to be wary that everything you read is an AI hallucination. Be skeptical.


Great progress. It would be great if open source developers would group up and sue them into oblivion.


It's amusing that NYT spends so much verbiage in the complaint speaking so highly of itself as doing some great service to humanity and then stating a legal argument that is equally applicable to a trash-rag dishing out celebrity gossip.


Let's imagine that they settle out of court.

And then in a few years, as AI improves, the NYT relies on AI to write some of their articles (or they rely on AI to proofread and edit their articles).

Can the Times' current arguments be used against them using AI later?


I don’t see how. If they own the content (which in the outlined scenario they would) they could do whatever they want with it. The argument is about unauthorized use of copyrighted materials, not just using AI in general.


Used against them by whom? For what?


Related: NYT sues OpenAI, Microsoft over 'millions of articles' used to train ChatGPT. 2023 December 27. <https://www.theregister.com/2023/12/27/the_new_york_times_fi...>

Discussion on HN (86 comments; original submission title: 'A business model based on mass copyright infringement'): <https://news.ycombinator.com/item?id=38784194>


From the OpenAI FAQ:

“What sources of data are used for training OpenAI models?

OpenAI uses data from different places including public sources, licensed third-party data, and information created by human reviewers.”

Is the NY Times data in one of those categories?


The language says “including”, which means these mentioned types of sources plus, potentially, other sources that don’t sound as good written down and so are left out.


The question is: if the NYT can seek relief from Microsoft and OpenAI for this, then why can't every person who has code on GitHub do the same? In this country we are at an unfortunate place where only corporations and rich people can seek relief. You need a lawyer.

However, any person has the right to file and appear "pro se". And if the legal filings and arguments by NYT are public, then why can't I just file my own case. And you. And you. And you. And you. And you. You get the picture.

We have this situation where corporations can exploit situations and people with impunity. They can buy off the government - oh wait, we have to call that "lobbying". Lawyers won't take on the case unless it is a class action worth billions to them. In which case you and I get a piddling amount or none. See https://news.ycombinator.com/item?id=38901484. (the question being whether Apple settled in order to avoid a finding that their action was purposeful).

All we need is to use the information from the NYT legal case. <humor?> And hey maybe someone can use OpenAI to generate the filings?</end questionable humor>


> The New York Times Launches a Strong Case Against Microsoft and OpenAI

aka "Ancient mammoth sinking in tar pit throws one last fit before facing full-on extinction".


Will Microsoft still be able to harvest people's code on GitHub if the Times wins?

When will programmers speak out against their code being used for something they never consented to?

I'm glad journalists are leading the charge against unfair AI, and I wish programmers were raising their voices also.


Microsoft owns GitHub, so an explicit Copilot carve-out is but a TOS update away if they wanted to.


The irony here is that there’s 99% chance that the NYT is using ChatGPT internally. Even their journalists, unbeknownst to their bosses.


There's nothing ironic about that. NYT likely wouldn't be suing if ChatGPT weren't useful.

They acknowledge its usefulness and are demanding to be compensated for the billions of dollars they spent producing articles that trained ChatGPT.


This makes Apple's approach look smarter: seeking a license from the NYT and other news organisations for permission to train on their news content for its AI, rather than knowingly attempting to bypass that and screaming 'fair use' after profiting off of training on and outputting verbatim paywalled articles without permission from the copyright holder.

News organisations are probably waiting for the NYT to win the case, and then similar claims for damages could follow from others, not just from the NYT.

The key claim is this:

> Vicarious copyright infringement (the idea that Microsoft and various OpenAI affiliates directed, controlled and profited from infringement committed by OpenAI OpCo LLC and OpenAI, LLC)

OpenAI knew they needed a license from Shutterstock to train their DALL-E model; instead of doing the same with news organizations, they decided to trample all over copyright and profit off of copyrighted articles without permission.

The end result of this is a settlement and licensing deal, a similar likely result for Getty v. Stability.


When did everyone in these circles flip flop from thinking that copyright is fake and that anyone who tries to enforce it should not be taken seriously, to thinking that it's a serious issue and has to be addressed?

I'm quite bearish on AI in general, but the whole copyright argument is nonsense to me. How is it any different than a human learning from public material in their field? My understanding is that you are not distributing copyrighted material verbatim, but a set of weights that were trained on copyrighted material. Just like how a writer reads a lot of other writers and becomes a better writer.

If we applied this same standard to human authors, couldn't we say that every narrative is actually the intellectual property of some unnamed hunter-gatherer who described how he killed an antelope to a captive audience?


I feel like we need to put a banner up that says "We don't apply the same standards to AI that we do to people because AIs are software, not people." And we can just point to it instead of wasting time mentioning what should be obvious, instead of entertaining these constant reductio ad absurdum arguments based on false equivalence.


I don't really think it's reductio ad absurdum. If something is okay for me to do, then I can make software to automate it.

It's okay for me to take existing artwork and use it as reference to make the artwork I want to make. I can make software to help me do that, like photoshop. But I can't make an AI tool to help me do that. Why? What is the actual difference?


How is suing OpenAI different from suing an army of gifted individuals who can speed-read and who read most NYT articles and create responses based on them?

Assuming there is no plagiarism, of course.


Because larger LLMs can reproduce text verbatim, speed readers probably not:

https://nitter.net/maksym_andr/status/1740776900786626608


So what this tells me is that if you make a carefully constructed prompt and change the temperature to zero (not an option normally available), and know the exact article title and perhaps the beginning of the first paragraph (which is pretty difficult to do unless you already basically have access to the full article), you can potentially get verbatim articles back out.

Great, so all those paywalls that use verbatim subjects and early texts as a teaser are now broken! Well, for all old articles that the LLM was trained on, at least, I guess. Very simple countermeasure is to simply use AI to paraphrase the subject and article (paradoxically) until the person pays for the privilege of reading human-authored text.

I'm sorry but this is a very weak argument. For example, I don't even believe normal users have access to the GPT4 system prompt unless they use the API directly (and possibly not even then, I'd have to check).


One is real, another one is made up (and there is plagiarism, of course, many pages of examples of that)


The temperature setting controls the randomness of the LLM's output.

We paraphrase all the time to avoid plagiarism, and that's just a somewhat randomized retelling of the same idea.

If you set the temperature to 0 in an LLM it's basically in "decompress/rote mode". I don't think this is qualitatively the same as "copying", possibly more akin to "memorization". I haven't seen very many demonstrations of verbatim-copy output that wasn't done with a temperature of or near 0.
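
A toy sketch of what that knob actually does to next-token sampling (made-up logits, not any real model's), which is why temperature 0 behaves like rote playback:

    # Temperature scaling of next-token logits. As temperature approaches 0 the
    # distribution collapses onto the single most likely token (greedy decoding),
    # so the model deterministically replays its most likely continuation.
    import numpy as np

    def sample_next_token(logits: np.ndarray, temperature: float) -> int:
        if temperature == 0:
            return int(np.argmax(logits))            # greedy / "rote" decoding
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())        # numerically stable softmax
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))

    logits = np.array([2.0, 1.5, 0.2])               # toy scores for three tokens
    print(sample_next_token(logits, 0.0))            # always picks token 0
    print(sample_next_token(logits, 1.5))            # sometimes picks token 1 or 2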


Also, you can't avoid plagiarism by paraphrasing because paraphrasing is a form of plagiarism. The key here is whether you cite the source, which the model doesn't

https://www.scribbr.com/frequently-asked-questions/is-paraph...


That's coming. Some LLMs are already starting to do that.


How is the fact that there is a flag to disable plagiarism relevant to the issue that there is plagiarism?


Because enabling "plagiarism mode" is a conscious action that a human takes; it does not default to "plagiarize" any more than a machine that has simply stored a verbatim copy of an article is "plagiarizing" when asked to print it out. Plus, citations are showing up in LLMs now.



