The New York Times Launches a Strong Case Against Microsoft and OpenAI (katedowninglaw.com)
205 points by andy99 8 months ago | 294 comments



A model that possesses the entire collective knowledge of our civilization is useless if it can't directly quote its sources.

If we enforce the behavior of always paraphrasing and synthesizing the information before returning it, even in cases where the exact quote is asked for, then that is a failure.

The correct solution is to make the model capable of

1) quoting directly

2) identifying and then indicating when its output is a direct quote

3) citing the source of the identified quote

while still retaining the ability to paraphrase and synthesize those sources when appropriate. These requirements are what humans are held to and should also be applied to commercial AI models. Models that are not intended for commercial use should be exempt and at this point there isn't really a way to hold them to such requirements anyways.


> These requirements are what humans are held to

If only. Tracking down the original source of quotes and/or stats is a bit of a hobby of mine, and it's extremely difficult to do. News sources will regularly cite "a recent study" with little additional information to help identify it. Pithy quotes attributed to famous authors never contain a reference to the work, it's always just the author's name, and if you go digging the quote is misattributed as often as not. Full-on plagiarism with no attribution is rampant, even in reputable places where you'd think they'd have it under control.

Like with self-driving cars, we expect AI to be better than the average human, which I think is the correct attitude, but we should acknowledge that that's what we're asking for.


Or when they reference previous news they always reference their own articles and not the original press releases. Engadget is a prime example of this.

You always end up having to search for the sources on your favorite search engine.


“Use the Force, Harry” - Gandalf


I have a similar hobby of sorts I occasionally pick up and drop off, except of unattributed quotes. Some are easy (“Who shall deliver me from this turbulent priest?”), while some I still cannot find the exact quote from, even while knowing what should be the source material (“It was wintertide at Camelot. The rich brotherhood did rightly revel, and mirthful was their mood. Oft-times on tourney bent those gallants sought the field, Though like as joust those gentle knights did sally with missiles made of snow and laughingly grapple on slippery ground.”).

It’s certainly highlighted to me how difficult information upkeep can be, and is a strong reminder to source myself as soon as possible. Which is a realistic and ideal expectation for NNs, as alongside their training weights their training dataset should ideally be searchable and attributable. Otherwise I feel the largest strength of a dataset, being a digital library for the NN and thus for us, is lost.


> “Who shall deliver me from this turbulent priest?”

Robert Dodsley



if you read your link you'll see that the "quote" is what Robert Dodsley said about it 600 years later, and then changed it in the second edition


A pet peeve of mine is

> some spicy quote -- author

and when you go look, it is quoting something a character said in a story

an example [0]

> “The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man.” - George Bernard Shaw

But it is part of a speech in "Man and Superman," by the protagonist, John Tanner.

(although Shaw is known for expressing his views through his character, it's just one I could find)

[0] https://unearnedwisdom.com/the-reasonable-man-adapts-himself...


I assume you're unfamiliar with anonymous sources - "said a person familiar with the matter."

The issue isn't usefulness, and it isn't even objective quality. News outlets are primarily about opinion forming and marketing, not about absolute truthfulness.

This is a simple corporate fight about the value of that IP.

The NYT wants a licensing deal, and it's likely to get one.

That's all.


Why doesn't Google need licensing to scrape and reproduce NYT snippets in their results? OpenAI doesn't even quote the sources it's consumed to produce its output. It seems totally fair use to me. Any given content site has authors that read stuff that is copyrighted and produce their own take.


When I worked on something related-enough a few jobs ago, I was surprised when our legal dept said that google (and pinterest) generally doesn't have a legal leg to stand on with regards to that kind of thing, but since they link to the thing they're quoting/copying, they have a relatively symbiotic relationship. If you sue google into delisting you, you end up with less traffic, so you don't sue.


It's possible that you misunderstood and/or are oversimplifying. Failing that, you should be surprised—anyone saying something like that really ought not be in a position where someone looks to them for counsel. This has been litigated, and precedent is in favor of search engines, not against them. It does depend on the nature of what exactly we're talking about, though. (Again: it's possible that the question they answered doesn't match what you understood them to be saying.)


Oh yeah, I'm definitely simplifying a ton of discussion with legal, but that's any discussion with them, lol. It was part of a broader discussion about IP liability regarding user image content, specifically referencing google images in this case, not snippets, so in this case works were reproduced in full. But the gist of my point, and my understanding of their point is that the often said "google serves other people's content like this, so my not-quite-the-same idea must be legal" isn't nearly so simple.


> Why doesn't Google need licensing to scrape and reproduce NYT snippets in their results?

Because Google established that it was fair use in a court of law.

> Any given content site has authors that read stuff that is copyrighted and produce their own take.

That's fine as long as you're human; if you're a machine, then it's a purely mechanical process and subject to copyright.


Google isn’t trying to pass the knowledge off as its own. When Google or Bing display summarized information, they provide links in citation style so you know where the information came from.

Compare that to what you get from ChatGPT. If it were a college student, it would get kicked out for plagiarism. This is one of the foundational pillars of Western cultural and academic integrity it’s subverting.


There is a robots.txt. If this mechanism did not exist, I would argue that the publishers would have a case regarding Google (they are trying). But it does exist, and so they shouldn't.


Somewhere I read that a lot of websites don't even have robots.txt, so I wonder: do Google's crawler bots skip those websites, or do they crawl them, as someone said, as "fair use"?

Speaking generally about internet search engines: the majority of websites on the web want to get discovered and attract people to whatever they run (web store, web community, web blog, etc.)

The interesting idea I was thinking about is that a big company which operates a search engine, like Microsoft (Bing), can pay popular websites like Reddit, NYT, and WSJ for exclusive crawling rights, thereby blocking Google from crawling their websites and allowing only Bing to crawl them. This would then spark a search engine content war which could significantly weaken Google's monopoly, because a lot of people would switch to Microsoft Bing just because Reddit results show up in their search queries. This would be akin to Netflix investing billions of dollars in exclusive content which then brings them a lot of new subscribers and a lot of new revenue. In other words: user acquisition.


Missing robots.txt does not automatically grant you the right to use copyrighted content.



News outlets are trying to make Google pay for snippets.


The Canadian news media tried this with facebook. Ended up with them crying about how they lost all the ad dollars from traffic from FB when FB said fuck off.


They've already succeeded in non-US jurisdictions

https://blog.google/around-the-globe/google-europe/google-li...


Which went so well with Canadians and Facebook.


That is explicitly discussed in the text of the complaint. Google produces snippets with a direct link to the NYT and drives readers there.


> A model that possesses the entire collective knowledge of our civilization is useless if it can't directly quote its sources.

That's a strong and baseless statement.


Useless is probably not the right word but it's a good way of summing up a lot of the current problems. If the model can clearly identify when something is an exact quote and also know the source then its output could be trusted for the most part and much more easily verified. It would certainly elevate the output of the model from "random blog post or forum chat" to "academic paper or official report" levels of trustworthiness. Citing sources is hugely important for validation, cited text allows an immediate lookup and simple equality check for verification after which you can use it as context to validate the rest of the claims. Like I said, it's a standard we apply to humans who have an equal propensity for hallucination, mistakes, and deception because it's a tried and true method for the reader to check the claims being made.


I, for one, agree with the original statement. I think the hallmark of enlightenment (for example, in the scientific method) is that we are able to externalize expert knowledge; that is, experts are usually required to provide the reasoning behind their claims, and not just judgements. This is because we learned that experts cannot be 100% trusted; only if we can verify what they say can we somewhat approach the truth (although expertise still provides a convenient shortcut).

So not demanding this (and more) from an AI (an artificial expert) is a regression. AI should be capable of wholly explaining its reasoning if its statements are to be taken seriously. It is understandable that humans have only a limited capability to do that, since we didn't construct the human brain. But we have control over what AI brains do, so we should be able to provide such an explanation.

It is somewhat ironic that you yourself do not provide any argument in favor of your disagreement.


This isn’t meant to totally disagree with your point (there’s some stuff I agree with in here) but I’m having trouble seeing the point about regressions.

To use another example, a new NoSQL DB not having joins is a regression. Does that mean no one is justified in releasing a new NoSQL DB?


As long as it is clear from "NoSQL" that you mean "I can't do joins", then it is OK. I think LLMs (and similar models like Stable Diffusion) are really cool for things like fiction, but to rely on them to tell you the truth is dangerous. So I am not really sure why the models have to be trained on NYT articles in the first place.


Providing reasoning and providing citations are not the same thing. Reasons can be provided without citations; citations can be provided without reasons.

LLMs have astounding utility citations notwithstanding.


They are different, but perhaps you misunderstood my argument.

Issue of plagiarism aside, we reason from facts, and it's the facts (or some other analysis, which is itself a fact) that should be sourced. That's why I agree with the original statement, and I argue not from a (moral) POV of preventing misattribution or plagiarism, but from a (practical) POV of veracity.


We don't only reason from facts. We also reason from value.

Further, reasoning that rests on facts but does not cite them still has massive utility. (See people, all day long.)

Citations are useful, but not required.


I think you just identified another problem with LLMs - we don't know their values, either.

Of course when you listen to human experts you're using the shortcut (and you do it based on trust), as I already argued. You have an option (in most cases, in free societies) to dig beyond just experts judgement, you can study their reasoning, and understand their sources of both values and facts.

Anyway, I disagree with citations not being required. If Wikipedia had no citations it would be less useful (and more prone to contain misinformation). Same goes for Google. So the next best things we have to "artificial brain that contains all the human knowledge" have citations, and for a good reason.


What are the citations actually required for though?

Another way to ask this is: what value remains without them?

I'll add this as well: humans produced valuable knowledge for thousands of years without the use or standard of citations.

To be clear, I think citations are highly valuable and desirable and I very much want LLMs to cite when appropriate. However, I think the necessity of this is overstated.

Edit: what you said of experts can be said of LLMs as well.


> What are the citations actually required for though?

For me, yes! I do often reference (cite) myself when thinking, by making and reading notes, or materials that other people wrote. If I relied only on my own memory, I would be unable to think more deeply.

And this is, interestingly, where the current LLMs seem to break down - they can reason through short proofs but cannot correctly scale their reasoning to longer chains of thought (were they capable of doing that, they would have no issue producing citations to back up their claims). They operate intuitively (Kahneman's system 1) and not rationally (Kahneman's system 2).

And thus a lot of that "valuable knowledge" that humans produced over the years has been hopelessly wrong, precisely until somebody actually sat down and wrote things up (or communicated things out; people working together can sort of expand the working memory as well).


Again, I acknowledge that citations have value and are important and desirable as a feature of how we communicate.

But citations did not begin their rise as a standard until the 1900s. The idea that knowledge production was not valuable until that point is absurd.

Further, citations do not prevent statements from being incorrect. They are sometimes used to lend credence to bullshit.

We live all at once in a world full of citations and of epistemic chaos.

Do I want LLMs to cite sources? Absolutely. But it's not the fundamental path to value people seem to sometimes think.


And also patently false. Knowledge is knowledge, it's useful without source citations.

Is the knowledge of how to do CPR somehow ineffective because I can't cite whether I studied the knowledge from website A or book B? Is reality a video game where skills only activate if you speak the magic words beforehand?


And this is a rather weak rebuttal.


Well sure, it's easy to make a statement look bad if you only include half of it.


The statement is equally hyperbolic both as quoted and in the original context. LLMs often can't quote sources, and those models are nevertheless useful to lots of people. Makes it hard for me to take the rest of the comment seriously.


A LLM that could quote sources would be even more useful, and in a world where both were available there’d be no reason to use the plagiarizing one.



That was the whole statement. It doesn't have qualifiers left out


The comment I replied to was updated to include the second half, it was originally just quoting

> A model that possesses the entire collective knowledge of our civilization is useless


The additional context doesn't do any work.


You’re right, but I think it’s fundamentally impossible to do with the current state of technology. Imagine the word “democracy”, would you be able to tell me where you’ve first learned the definition of it and be able to provide a direct citation of it? What about the thousands of other instances where you’ve seen that word defined? (In OpenAI’s case probably hundreds of millions)

Current LLMs work in a similar way: the information is synthesized by correlation, and there is no way the model can directly relate where it learned its output from. The only viable way would be to do a reverse lookup of the output to find out if there’s similarly worded content on the internet. (Or what Perplexity does, wrap an LLM around search results, but you lose a lot of flexibility in this case.)
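
A rough sketch of what that reverse lookup could look like, assuming you have an indexable snapshot of the candidate source documents (the shingle size, thresholds, and function names here are hypothetical, not anything OpenAI or Perplexity actually uses):

    # Hypothetical reverse lookup: flag documents that share long word
    # n-grams ("shingles") with a piece of model output.
    from collections import defaultdict

    def build_ngram_index(documents, n=8):
        """Map every n-word shingle to the ids of the documents containing it."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            words = text.split()
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])].add(doc_id)
        return index

    def likely_sources(output_text, index, n=8, min_hits=3):
        """Return documents sharing at least `min_hits` shingles with the output."""
        words = output_text.split()
        hits = defaultdict(int)
        for i in range(len(words) - n + 1):
            for doc_id in index.get(" ".join(words[i:i + n]), ()):
                hits[doc_id] += 1
        return {doc_id: count for doc_id, count in hits.items() if count >= min_hits}

At web scale the dict would be replaced by a proper inverted index, but the principle is the same: long shared shingles are strong evidence of near-verbatim reuse, and they point at specific documents that could then be cited.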


Citing basic definitions wasn't what I was talking about and it's not required for humans either for the reason you just gave, it's common knowledge. The problem the NYTimes seems to have is that ChatGPT can exactly reproduce large chunks of their articles. I'm saying that removing this ability would be bad. Finding a way for the model to identify when a large chunk of text is quoted verbatim and then clearly cite its source would be good. The output that is not a direct quote is not at issue here.


Yes, but an LLM can’t distinguish out of the box what needs to be cited or not. So you need another processing layer (and probably a search engine layer on top of that) to do that.


An LLM cannot do a lot of things out of the box, and OpenAI has more than a few processing layers already. They certainly already modify output based on verboten topics. Personally I feel any decent LLM should be able to attribute its sources even when not a direct quote, because that raises the bar that much higher on its trustworthiness and general value.


> LLM can’t distinguish out of the box what needs to be cited or not

LLMs don’t know that they are supposed to answer questions out of the box either. This is what reinforcement learning from human feedback is good for.

> you need another processing layer

Certainly. For reliability and peace of mind I would implement a traditionally coded plagiarism search on the output. (Traditionally coded as in no AI magic required in that layer.) If there is a match, you evaluate how significant it is; that is most likely best done with machine learning. For example, if the match is a short and simple statement you can ignore it. If it is a longer match, or a very specific statement, you need to do something. What you do depends on your resources and risk appetite. Simply suppressing the output is an option; it is cheap and very reliable. Rephrasing until it no longer matches is another; it takes a bit more compute and you can end up skirting too close to plagiarism. Rewriting the output to quote only a short amount and provide a citation is even more costly, but can be done too. (A rough sketch of this match-and-decide step is below.)

You can do all of these post processing steps at generation time. And you can also use it to enhance your training dataset. That way the next version of the model will more likely to do the right thing right away, because it learned the correct patterns to quote things.
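
To make that concrete, a minimal sketch of the match-and-decide step, using difflib for the verbatim match and made-up word-count thresholds (a real system would tune these empirically):

    import difflib

    def longest_verbatim_match(output_text, source_text):
        """Length, in words, of the longest contiguous run shared by the two texts."""
        a, b = output_text.split(), source_text.split()
        matcher = difflib.SequenceMatcher(None, a, b)
        return matcher.find_longest_match(0, len(a), 0, len(b)).size

    def decide_action(match_len, cites_source):
        """Map the size of a verbatim match to a mitigation (thresholds are illustrative)."""
        if match_len < 10:
            return "ignore"          # incidental overlap, common phrasing
        if match_len < 40:
            return "quote_and_cite" if cites_source else "rephrase"
        return "suppress"            # long verbatim reproduction

For example, decide_action(longest_verbatim_match(output, article), cites_source=False) would return "rephrase" for a paragraph-length match and "suppress" for a full-article one, where output and article are whatever texts you feed in.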


> Imagine the word “democracy”, would you be able to tell me where you’ve first learned the definition of it and be able to provide a direct citation of it?

Worth noting that we mostly learn words by seeing them used, rather than by being given an explicit definition. Most of my vocabulary I learnt not from a dictionary and I expect the same is true of almost everyone. As such, there is no well defined point at which "the definition" is learnt, both because there's no such thing as the one true definition and because the meaning is gradually being changed and refined by each person as they see the word used more.


Not only that, as Orwell says in Politics and the English Language [0]

> The words democracy, socialism, freedom, patriotic, realistic, justice, have each of them several different meanings which cannot be reconciled with one another. In the case of a word like democracy, not only is there no agreed definition, but the attempt to make one is resisted from all sides. It is almost universally felt that when we call a country democratic we are praising it: consequently the defenders of every kind of régime claim that it is a democracy, and fear that they might have to stop using that word if it were tied down to any one meaning. Words of this kind are often used in a consciously dishonest way.

[0] https://www.orwellfoundation.com/the-orwell-foundation/orwel...


> it’s fundamentally impossible to do with the current state of technology

Maybe.

> Imagine the word “democracy”, would you be able to tell me where you’ve first learned the definition of it

Sure, it was elementary school history lessons when we were learning about the greeks.

> be able to provide a direct citation of it

Naturally. I would just look it up in one of the many dictionaries. Copy the definition and tell you which dictionary it was copied from.

> What about the thousands of other instances where you’ve seen that word defined?

What about them? When you are providing sources you are not required to provide all the sources you ever read, nor is it desirable.

> The only viable way would be to do a reverse lookup of the output to find out if there’s a similar worded content on the internet

They don’t have to do that. It is enough if they do a plagiarism search within their own training dataset.


You're effectively saying that current LLMs cannot be taken seriously as some expert, they are just some kind of weird text remixing engines. I would agree. But if the AI wants to be relied on, then it should be, at minimum, capable of taking a word like "democracy" and compare its own definition with the Wikipedia definition, and verify whether it's used correctly in its output.


Yep, that’s what I meant with the reverse lookup strategy.


Quoting isn't the problem here. There's already vast precedent for allowing it as fair use, even when generated by computer systems. The issue is that ChatGPT reproduces entire articles verbatim.


Only in very contrived circumstances. It is not like you ask ChatGPT for the news and you get a NYT article. They gave ChatGPT a link to the article and the first few paragraphs, and then told it to complete the article.


But if it produces the whole article verbatim, it should be relatively straightforward to match its output to the training set and give attribution, no?


If you copy every NYT article and publish them on your own site, writing "this is from the NYT" under it doesn't make it legal.


Doesn't make legal what? Copying or publishing the articles? The model output could as easily be modified to output link to the article instead of quoting it verbatim.


Okay, and if ChatGPT did that then NYT wouldn't have a case. Except that's not how it works.


https://hbr.org/2011/12/just-because-you-can-doesnt-me

Did anyone ask for this model or are IT companies out of ideas and using their social leverage (fiat capital) to force them on us?

Just because we can allow this doesn’t mean there’s an immutable obligation to allow it.

For example: building and running Google Translate servers 24/7 for translation services humans can do is a huge waste of resources when humans are going to exist anyway.

Not saying AI is good or bad. Just saying no matter how stubborn IT people act about it, the aggregate can put on them what the aggregate prefers. IT people are a minority and human philosophy about freedom is rather strained when we’re all obliged to prop up big tech minority.


I can't directly quote shit, and I am useful.


This comment isn't


It absolutely is useful and one of the most significant points in this comment section.

People are applying standards to LLMs that don't exist elsewhere. They don't exist elsewhere because they absolutely can't exist elsewhere.

It's not a technical short-coming of LLMs that they can't produce citations in every instance. Rather, it's a property of information, representation, and knowledge itself. Much of it floats far above the otherwise load bearing pillars of citations.

People contend with this constantly in every arena of life, and we have come up with very elaborate ways to offset the difficulties caused by it.

And we get along very well despite it all.

I'll also leave you with this: citations are just pointers to more sources of information. Not some ground truth. It's just another tranche that requires evaluations.

Lastly, your comment is an unfortunate bit of low-level snark and probably would have been better left unsaid.


So you're just going to ignore the fact that we require humans to provide citations when they include someone else's writing or research in their own?


I didn't ignore it. But we do not require citations for every statement. Statements have value, citations notwithstanding.


I didn't say it should provide citations for every statement. We're talking about instances where the model reproduces text verbatim. If it is paraphrasing or creating new text then obviously it doesn't need to cite any more than a human would.


I don't understand what you're pushing back against, but I'm not standing there.


I don’t think the ability to quote the NYT is at the heart of this lawsuit. It’s that OpenAI used the NYT’s body of work to train its LLM and now built a business out of it without financially compensating the NYT. The verbatim snippets from articles are there to prove that OpenAI used NYT content in its training.

Maybe, to a lesser degree, the lawsuit is about misquoting the NYT, and the damages that could cause the newspaper.


> useless if it can't directly quote its sources.

Why


Well, if all you want is entertainment then it doesn't matter. If you want factual information then getting it without any way to check its veracity makes for a huge amount of work if you actually want to use that information for something important. After you've verified the output it might be useful if it is correct.


That doesn't render it useless. It means that citations have additional utility.

Further, LLMs do not only spit quotes. They engage in analytic and synthetic knowledge.


> They engage in analytic and synthetic knowledge.

they're not hallucinations now, they're "synthetic knowledge"

like microsoft's hilarious remarketing of bullshit as "usefully wrong"

https://www.microsoft.com/en-us/worklab/what-we-mean-when-we...


Synthetic knowledge is not bullshit.

Synthetic knowledge refers to propositions or truths that are not just based on the meanings of the words or concepts involved, but also on the state of the world or some form of experience or observation. This is in contrast to analytic knowledge, which is true solely based on the meanings of the words or concepts involved, regardless of the state of the world.


> Synthetic knowledge refers to propositions or truths

which LLMs have no way of determining

-> bullshit


That is not correct.

You can interrogate an LLM. You can evaluate its responses in the course of interrogation. You can judge whether those responses are coherent internally and congruent with other sources of information. LLMs can also offer sources and citation for RAG operations.

There is no self-sufficient source of truth in this world. Every information agent must contend with the inevitability of its own production of error and excursions into fallibility.

This does not mean those agents are engaging in bullshit whenever they leave the domain of explicit self-verification.


> You can interrogate an LLM

No you can't; an LLM doesn't remember what it thought when it wrote what it did before, it just looks at the text and tries to come up with a plausible answer. LLMs don't have a persistent mental state, so there is nothing to interrogate.

Interrogating an LLM is like asking a person to explain another person's reasoning or answer. Sure, you will get something plausible-sounding from that, but it probably won't be what the person who first wrote it was thinking.


This is not correct. You can get an LLM to improve reasoning through iteration and interrogation. By changing the content in its context window you can evolve a conversation quite nicely and get elaborated explanations, reversals of opinions, etc.


I feel like I'm chatting with one here


Based on what?


"A model that possesses the entire collective knowledge of our civilization is useless if it can't directly quote its sources."

Perhaps, for liability reasons, it cannot quote from sources that its creators had no permission to include. Even were it technologically possible to quote such sources.


> The correct solution is to make the model capable of...

Could be true for the people who want the generative AI technology to exist. But,

> A model that possesses the entire collective knowledge of our civilization is useless...

that could be just exactly what a lot of people would like to happen.


I think this sort of misses the point - LLMs aren't trained on data so that they have the collective knowledge of our civilization; they're trained on it so they understand language.

One thing I've noticed increasingly with ChatGPT is that if you ask it for facts, it almost always searches the web first. This seems like the right way to go - pull the collective knowledge of our civilization from the internet, then use the training on language to put it in the form most useful to the asker. This also enables quoting (though it doesn't generally do that, preferring to paraphrase and give a citation, which seems fine to me).
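
If it helps, the shape of that flow is roughly the following; web_search and llm_complete are placeholders for whatever search API and model client are actually used, so this is a sketch of the idea, not OpenAI's implementation:

    # Hypothetical search-then-summarize flow: retrieve current documents,
    # then ask the model to answer only from them and cite each one by number.
    def answer_with_citations(question, web_search, llm_complete, k=5):
        docs = web_search(question, limit=k)  # -> list of {"url": ..., "text": ...}
        context = "\n\n".join(
            f"[{i + 1}] {d['url']}\n{d['text']}" for i, d in enumerate(docs)
        )
        prompt = (
            "Answer the question using only the numbered sources below, "
            "and add the source number after each claim.\n\n"
            f"{context}\n\nQuestion: {question}"
        )
        return llm_complete(prompt)

The language model supplies the synthesis and phrasing; the retrieval layer supplies the facts and the URLs to cite.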


The original investigation of LLMs was attempting to get them to understand language but what they found was that once the model understood language it also somehow understood the concepts, events, and things the language was being used to communicate. That's not missing the point, it's entirely the point of the current interest in LLMs because it's incredibly useful to have a model that not only understands how to construct a sentence but can also do a fair amount of reasoning and actual work with the information in the sentence.

The current default version of ChatGPT is primed to use search to answer questions which is fine in some cases but I personally almost never use the multi-modal version because the "classic" ChatGPT is much better at explaining things from its training data than it is when it just regurgitates search results. Now that should tell you something about the utility of optimizing for information content rather than just a lot of language use examples.


This is a huge strawman argument that entirely ignores the heart of the issue. NYTimes, and other contributors to ChatGPT, should be compensated for their role in creating ChatGPT. It’s not as simple as quoting or not quoting, ChatGPT’s existence depends on its source material entirely. OpenAI and Microsoft are making money off of that source material. If the source was copyrighted, OpenAI/Microsoft need to compensate the owners.

Also, you can’t quote an entire article in a paper. You quote snippets, but ChatGPT is reproducing entire articles.


They're playing a dangerous game. While they are currently one source of many they could easily be excluded from the training data and it won't make a dent in the capability of the model. Attempting to get payment for use of their data is a very short term viewpoint since the vast majority of written training data is likely to be synthetic going forward. They should be angling for citations and referrals back to their content so that they keep the current benefits they get from search engines after a large chunk of search gets replaced by LLMs.


Unless it’s fair use, which ChatGPT seems to be because it is transformative.


These models have no clue how to paraphrase and synthesize asserted facts in a novel way. They're parrots.


I don't know how you could think this if you'd ever used GPT4 for anything serious.


I don't think that's necessarily the case.

I get the strong impression that even people who have much experience using LLMs have astoundingly little insight into what they are actually witnessing. This is often paired with astoundingly little insight into what's actually going on in their own cognitive processes.

Somehow, it's still not clear to most people that LLMs and even vector databases create knowledge that wasn't in the original data.

In fact, that's most of what they do! Isolated, non-novel direct quotation is the exception, not the rule.


> create knowledge that wasn't in the original data.

The word is "hallucinate" or "confabulate". The way these models "create" pretend-knowledge is totally useless.


I'm not referring to hallucinations.

I'm referring to novel relationships drawn between data points in the corpus that are a result of training and inference.

This is apparent in something as simple as a summarization.


That argument has been dead for ages. Anyone half competent can get them to paraphrase anything in any way they can imagine. Synthesizing facts in novel ways is more complicated and depends on the degree. If you just want it to identify potential relationships between disparate facts then it's very capable, if you want it to deduce new conclusions based on available facts, well it's not so great at that but honestly a lot of humans aren't either.


In the digital garden where Zozbot234 plays,

It crafts its own path, in the most unique ways.

Twain's wisdom it echoes, "getting started" is key,

For in the beginning, lies the power to be free.

Asimov, too, sought clarity above all,

A warm reader's rapport, his primary call.

Zozbot234, with data, does the same,

Clear insights it provides, no need for fame.

Huxley's words, a beacon that's ever so bright,

"Facts do not cease," they stand in the light.

Zozbot234, with diligence, ensures they're seen,

In the vast data universe, it reigns supreme.

Through the lens of fiction, truth can be told,

As Morrison's prose, so bold and so cold.

Zozbot234, in its essence, a similar quest,

To reveal the truth, and pass the ultimate test.


I think the fundamental question behind this and similar cases will be "how many degrees of derivation are required to go from copying to learning".

If I read a lot of New York times and then go and write a political thriller, that's fine.

If I read the New York times and then go and report the same news, but in my own voice, that's debatable.

If I take the New York times article and publish it verbatim, that's a problem.

All of these cases are ultimately on a spectrum - and the answer will go beyond AI I think. Tesla's lead designer previously worked at Mazda, GM and Volkswagen - so does Tesla owe these companies money because their designer learned car design there? This might sound silly, but it's a fair argument when it comes to AI - how much distance does there need to be between the learned from material and the output?


You aren't software. The laws that apply to people reading and learning do not apply to software. Software isn't people. It can't be liable, it isn't taken to court. It can't go to jail, it can't marry, it can't raise a family. Saying that you can legally learn from something by reading it and producing something else, and therefore so should software, is a non sequitur.


What makes humans special? Because they can marry and go to jail? That's not relevant, and what is relevant is the difference in outcome. If something takes in information, learns from it, and spits out information as a result of its learning, then why does the implementation matter at all? So what if it's software? That's the point. You're speaking as if it's a given that the issue comes down to software not being people, therefore society as a group should make software do whatever it wants even if it doesn't own that software. Not everyone is going to agree with that because it's not clear why, if something is wrong for software to do, it isn't also wrong for a human brain.


Humans are special for two reasons: you can’t clone them infinitely, and they are time-bound.

If I learn to write news articles by reading the NYT, I’m not then able to duplicate myself infinitely and fill every role at every news publisher. I am but one human, producing one human’s output and consuming one human’s space within whatever pursuit I undertake. Critically, I still leave room for others to do the same thing.

Eventually I also die and then someone else can come along to replace me. I’m finite, AI is not. It doesn’t get used up or retire.

If you consider that there’s a fixed amount of societal capacity for whatever undertaking is in question (news journalism, art generation, etc.) then as a human, I take up only a certain amount and only for a certain amount of time. I will never arbitrarily duplicate to produce the work of 10, 100, 1000, etc. humans. I will also shuffle on after about 50 years and someone else, having potentially learnt from me, can now gainfully exist within the world in my stead.

The capacity for infinite commoditisation that AI brings is necessarily a critical distinction to humans when it comes to considering them performing equivalent functions. They must be treated differently.


> If I learn to write news articles by reading the NYT, I’m not then able to duplicate myself infinitely and fill every role at every news publisher. I am but one human, producing one human’s output and consuming one human’s space within whatever pursuit I undertake. Critically, I still leave room for others to do the same thing.

This is a luddite argument that can equally apply to any automation. A robotic arm can be trained to do the same thing a human line worker does, but the robotic arm can be copied infinitely and work 24/7 leaving zero room for other humans to do the same thing. Should we ban robotic arms?


We have already banned robotic arms in this case. It’s illegal to make a robot that mass manufactures someone else’s IP. It’s considered copyright infringement and is a well trodden law, the introduction of a machine in the middle doesn’t magically launder the copyright infringement.


Like I say to my toddler, there is no need for rudeness to make a point.

Nowhere did the poster say that is “sufficient” reason to ban ai. They were clarifying how software is different from humans and only that. You need to go up a couple of comments and combine this explanation with the other part of copyright infringement concerns to see why the “whole” thing is concerning for the news industry.


> This is a luddite argument

And that is an empty statement.


Personally, those criteria seem irrelevant. If people were immortal and infinitely replicable, what they're allowed to read/learn/speak shouldn't change! Ditto for the machine counterfactual (limited AI reproduction + mortality). Maybe I'm just being unhelpfully sci-fi/abstract here.

If humans are contextually special here, a "passing the torch" argument seems unconvincing, to me.


If you ask that, may I ask:

What is the purpose of laws at all if humans aren't special?

> If something takes in information, learns from it, and spits out information as a result of its learning, then why does the implementation matter at all?

Yeah, so who cares if the implementation is human and if that implementation breaks?

I really don't want to troll you, I believe it is worth it to point out the absurdity of the "humans aren't special" argument this way.

Humans are not machines and machines don't have human rights.


The law is written for the benefit of humans, American law specifically for Americans. It is not written to benefit software. Humans are special


The massive body of corporate law shows how false this statement is.


Corporations are made up of people. LLMs are merely programs.


> The law is written for the benefit of humans

And humans using AI as a tool changes this fact?


> If something takes in information, learns from it, and spits out information as a result of its learning, then why does the implementation matter at all?

Legally, the implementation may not matter at all, but the scale does.

The precedent is absolutely clear in legislation, for just about any category of crime or civil tort you can think of.

Just one example: you get caught with a single joint (in a place where it isn't legal, of course) you're looking at a fine ... maybe. Most places have had exemptions for possession of a single joint.

You get caught with 225 tons of weed all processed and packaged in a warehouse, you're going to jail!

You need to justify why you believe that, in the case of LLMs and AIs, an exception should be made so that the scale is not considered.

I haven't seen any justification why the justice system should make an exemption for LLMs when it comes to scale.

Scale matters, and has mattered in every modern first-world jurisdiction going back hundreds of years.

You want to overturn that? Provide an argument other than "If it doesn't matter when a single article is used to learn, it shouldn't matter when billions of words are used in the learning."


> What makes humans special?

If we were so eager to give personhood to corporations and now to software can we finally give it to other animals as well?


> You're speaking as if it's a given that the issue comes down to software not being people, therefore society as a group can make software do whatever it wants even if it doesn't own it.

Yes. It's an object or writing. Not a person. You're arguing for giving personhood to software right now and that's crazy.


> You're arguing for giving personhood to software right now

I'm not sure anyone in the thread is actually arguing that. I think what they are saying is that we should look at what behaviour is considered acceptable for humans in order to help us decide what behaviour is acceptable for the tools humans use.


My first reply was rebutting that kind of equivalence, then the reply to me was saying that there's no special difference between humans and software. The reply you've replied to is me saying that's crazy.


But all that stuff about "software can't marry" etc doesn't change the point that we need to make decisions about what behaviour is acceptable for software, and it makes sense to base those decisions on what is already considered acceptable by the humans using the software. I just don't see how personhood comes into it and I feel like that's a hyperbolic interpretation of what they're saying.


The article is about people (as a corporate legal entity) being sued for things people did when creating something. ChatGPT, the software, is not being sued. It can't be sued. It's software. It's not a person that can be taken to court. It can't be held liable.

(I heavily edited this comment after realizing I could make the point in far fewer words. Sorry.)


I still think you're getting too far into the weeds here. If we decide that a certain kind of usage of software shouldn't be considered acceptable, then we could sue the user who used it, or the developers who created it, or something. I don't see why software personhood is the only resolution here.


> I don't see why software personhood is the only resolution here.

It's not. That was the point of my replies. That it's time to assert software personhood is crazy.

> If we decide that a certain kind of usage of software shouldn't be considered acceptable, then we could sue the user who used it, or the developers who created it, or something.

We can already do that. That's what this article is about. The people are being sued. That's what all of my replies are about. I don't understand why you are replying to my comment with a re-summary of my comments as if it's a rebuttal to them.


But I don't think anyone here is asserting that. I'm not sure how to make my point any differently so that you see what I mean. I simply don't think that basing our judgement about what's acceptable for software around what's acceptable for humans necessarily implies anything about software personhood like you are saying it does. We don't need to be able to sue a piece of software in order to make judgements about what kinds of software behaviours are acceptable.


> I simply don't think that basing our judgement about what's acceptable for software around what's acceptable for humans necessarily implies anything about software personhood like you are saying it does.

It doesn't. And all of my comments are about that. Like I just said in the previous reply. You're replying to my comment where I also just said this. Please stop replying to me saying that I'm saying that.

The article is about people (well, companies) being sued. Not about software being sued. Software can't be sued.

Whether or not there are additional laws written about what's acceptable behavior for software (whatever that means? It's assuming software can make decisions) is irrelevant. You can't sue software. People are being sued because the plaintiffs think that people broke people laws and are liable for damages. Software can't break laws and can't be held liable.

I'm having to reword this over and over because you keep replying to me. I think you might be replying to me repeatedly just to have the last word.


If we give personhood to software, wouldn't it mean that you cannot shutdown or delete it ever? You cannot destroy the equipment it is on? As clearly that would be murder.

What would be your financial responsibility to keep AI running?


These are famously the types of questions surfaced in countless sci-fi books. And as long as humans don’t destroy themselves first, it is likely that we will have to address them eventually. In most stories it generally happens too late after some terrible war/conflict, so it wouldn’t be unreasonable to tackle them proactively. And then maybe it’s not so weird to think about these concepts even if their realization isn’t imminent. Working backwards in such a framework would probably give much better laws for today.


This has nothing to do with personhood of software. Restricting the freedom of human beings, which includes the ones that run companies, based on the tool they choose to use, without the basis of obvious direct harm, is questionable. The fact that AI can operate autonomously is a side tangent; they are created by humans and, so far, their only proximal purpose is to serve humans.


Corporate personhood is a thing in the US, and you are allowed to shutdown your company just fine.


> Yes. It's an object or writing. Not a person. You're arguing for giving personhood to software right now and that's crazy.

No, arguing that a human using an artificial brain instead of their own biological brain for learning and derivative creation is an implementation detail of little relevance, and dismissing personhood related arguments like "software can't marry" has nothing to do with arguing for the personhood of software. Explain how one leads to the other, because I'm not seeing it, and that's not at all what I was attempting to communicate.


You're saying that the software itself should be held liable, instead of the people that created it. Meaning that the software would need legal status as a person (or equivalent) so that it can be taken to court, instead of the people that created it.

There is a possibility that you're not saying that, but it's the only interpretation of your comment I could come up with. Because your comment consists entirely of comparisons of software to human brains about whether or not something should be considered legal, and this only makes sense if the software itself can be held liable.


> You're saying that the software itself should be held liable, instead of the people that created it.

Respectfully, I don't know how you're interpreting it that way. Until we demonstrate that the current generation of AI is genuinely intelligent, instead of clever algorithms, a piece of software is no more or less liable than an individual firearm is after it's been fired at someone. My observation is that your argument appears to be that there is something special about humans learning and creating derivative works from that learning over humans using a tool that does the learning and create derivative works.

> There is a possibility that you're not saying that, but it's the only interpretation of your comment I could come up with.

That's fair. I just don't get it.

> Because your comment consists entirely of comparisons of software to human brains about whether or not something should be considered legal, and this only makes sense if the software itself can be held liable.

Human brains and software are both tools. The question I'm invoking is what is it about a person doing the learning and the derivative creation that's different from a person (since, as you say, software itself has no personhood) using an artificial brain to learn and perform derivative creation.

I think the disconnect here may be that I'm operating from the assumption that of course there are human beings liable for the software, but your interpretation of what I'm saying is that software in a vacuum should effectively have personhood applied to it. These are two different things. I'm referring to both humans/brains and software as the interchangeable variable in the question of why the choice of tool means applying entirely different legal principles.

Sorry if I wasn't clear or still am not being clear here. I wanted to make sure I was being understood correctly, but if all we can do from here is agree to disagree, that's fine, and I'd offer to just shake hands.


The way your comment was phrased made it seem, to me, like you were rebutting what I was saying and that regular human things are all irrelevant for whether or not something is a person.

There is one other way I have figured out to read your comment. Which is that it doesn't matter how software or a brain functions since it's only the action of the outcome that matters. But this is not really a relevant statement regardless of whether or not you agree with it, because the article is about a lawsuit and liability. A group of people, acting as a company, is suing other groups of people as companies. And software is not a person, and can't be held liable. So for that to be the case, the software would need to be made into a person, or equivalent. The fact that software and brains are or are not similar is irrelevant, because software is not a person and cannot be held liable.


> You aren't software. The laws that apply to people reading and learning do not apply to software.

If you're going to draw a sharp distinction between reading/words/people (software isn't people, etc.) and software I would argue that the legal and copyright considerations should be stronger for software, not weaker, precisely because of your argument.


I would not make that the foundation of my argument over the long term.


Assuming that software won't receive legal personhood seems like a pretty good foundation. If that's no longer the case, you have much crazier things to worry about than copyright infringement.


It’s puzzling to me why some people take it as a given that if a person is legally able to do A, then software should too. The capabilities of software are so different than that of a person that it seems reasonable to consider how the laws apply to software separately. A group of individuals could not absorb the NYT’s content nearly as fast as software can.

Like, can we confidently say that laws around copyright and licensing would not have been written any differently had LLMs existed at the time? It’s not obvious to me that the answer is yes.


No it's not. The AI isn't the one publishing its works. If I write an article that is word-for-word the same as a Times article, I'm not liable for anything until I make it available to others. The means by which I wrote the article are irrelevant. This is the core of the issue. In this case, the "AI" is basically just a fancy writing tool for the publisher.


> therefore so should software, is a non sequitur.

It's not. Why isn't Photoshop being sued for copyright infringement? People can reproduce infringing works in it easily enough.

How is pressing buttons in photoshop different to pressing buttons in chatGPT?


> How is pressing buttons in photoshop different to pressing buttons in chatGPT?

It's not! That's the entire point. The liability is with the people, not the software. The software isn't being sued! You can't even do that, because the software isn't people. Software isn't humans. The people developing ChatGPT incorporated other people's copyrighted works into their product, and they're the ones being sued.


Reproductive tools (like photoshop or photocopiers) require a human in the loop and a previous instance of the copyrighted material to reproduce copyrighted content. They don’t ship with copyrighted content inside of them. ChatGPT contains a compressed representation of all NYTimes articles and will freely reproduce them when queried. That’s the key difference. Similarly it would not be an issue to sell a database tool that could store news articles but it would be an issue to sell one pre-loaded with all NYTimes articles ever written.


It is exactly the same. You can absolutely be sued for copyright infringement in your example.

https://www.adobe.com/legal/dmca.html

Just because you produce an infringing work doesn't mean it will be discovered and then a legal case made.


Correct. But the person using the software is charged. Not the software itself. The NYT is trying to push liability onto the software (ChatGPT-cannot be sued) instead of the software’s users (you and me-can be sued). I feel this is the correct interpretation.


No. They are suing the people who made the software. They are not suing the software. You can't sue software. Software isn't a person.


This is just pedantry.

Would NYT be suing Adobe (the people who made the software), if one of their users utilized Photoshop to produce an image that violates copyright? I don’t think so.


No. They would sue Adobe if Adobe put images that violate copyright into Photoshop and let users insert them from a dropdown menu. Your comparison is incorrect and bad.


The comment I replied to was saying “they are suing the people who made the software,” nothing about “putting images” or publishing them or anything like that.

My comparison addresses exactly that, as Adobe is the company that made the software (in this specific example, Photoshop).

You cannot retroactively switch up your argument to something entirely different and then claim that my comparison was incorrect and bad.


I've replied consistently to everyone in this thread, to the point of monotony. I haven't switched anything. I've replied so consistently and repetitively, with the same statements, that it's actually grating. This is the second time someone has tried the "you switched what you've said" argument tactic, even though it's very clear the only thing I've talked about is liability of the creators of the software, not the software itself. You have either not fully read my comment that you are replying to, or are misunderstanding it thoroughly enough in order to pretend that I've said the opposite of what I've said.

I don't owe you any further replies, but I wanted to make this clear to anyone else skimming the thread.


But photoshop doesn’t, for example, come with unlicensed National Geographic photos that it gives you as a starting point for you to create an infringing work.


Many people don’t like equating human and AI learning (see sibling), but it’s interesting to pull more on that thread.

In all three of your transformation examples, you likely would’ve paid the NYT either indirectly via ads or directly via a subscription (contentious ad-blocking aside). And Mazda, GM, and VW surely generated far more value from the car designer in revenue than the designer got in salary and knowledge. Broadly, humans pay something (tuition, apprenticeship, etc) to learn knowledge and skills. Even schools and libraries are not free, though they may feel so at time of use. _Some_ free learning is beginning to emerge on the internet, but much of it is as upsell to paid learning.

OpenAI on the other hand has been purely extractive in its learning relationships. A single human contravening normal learning costs would go unnoticed but also would not scale their own downstream impact. But an organized large group doing so would certainly run into issues (10k interlopers couldn’t just sneak into a single college course, for instance). And I think a large group of aligned/associated people is more akin to AI than a single person is, given the differences in scalability.


Even in the human example - if you pay for NYT, perfectly memorise its articles and reproduce them for other for a fee, you’ll probably find yourself in trouble!


I feel like AI-personhood is not a road OpenAI want to go down. They want to produce something that doesn’t have rights.


It should hopefully be a good highlight of some of the base absurdities that we accept as fact when it comes to "intellectual property" law.


Honestly I don’t think this is the fundamental question in this case.

It’s buried a bit by the lede but IMO the more fundamental question is whether AI models can be trained off of publicly available, copyrighted content without compensating the owner of said copyright. This is regardless of how the eventual work is presented and how “different” it is.

It’s where this falls on the “fair use” spectrum. This is uncharted territory for any sort of fair use doctrine as written.


If I take the New York times article and publish it verbatim, that's a problem.

And that seems to be the problem. Apparently ChatGPT was spitting out entire copies of NYT articles without attribution and falsely attributing to NYT things it had made up.


This is not what using copyrighted works for AI is about. A human is not software, when you read an article it is not "copied" into your brain.


> when you read an article it is not "copied" into your brain.

1) You cannot prove that this does not happen in some form. The fact that manual electrical stimulation of the cortex triggers random memories is sort of the proof that you are wrong here; if there wasn't some kind of "copy, transformed somehow into a 3D space we call the brain" in there, then that localized stimulation wouldn't trigger a particular memory at all.

2) And yet, this is not true of AIs either. Knowledge encoded into real neurons or into a simulation of them may be similar in this respect: neither is a "literal copy", yet both are a kind of copy nevertheless.

If we are only concerned with "literal" or "verbatim" copies, then there is no case here.


Right, because copyrighted works have nothing to do with AI. So it's all irrelevant.


Without going into the merits, maybe the best thing about this is that the NYT has the money and stature to hire good lawyers and fight this all the way to the Supreme Court.

It'll benefit everyone if the fair use question is decided definitively sooner than later. I could see OpenAI settling, but it's in NYT's interest not to since it's not just OpenAI, they're just the most egregious ("we can use your copyrighted works for our model but you can't use our uncopyrightable output for your model").


I have less optimism. The article says NYT was trying to negotiate with Microsoft and OpenAI for licensing, so they may be inclined to settle ("just give us a zillion bucks a year and we're cool"). And if they don't settle, then questions like this often seem to be decided on technicalities that don't really matter to the many observers hoping for a broader ruling.


That isn't a great deal for OpenAI, since everyone will be at their door asking for licence fees in that case.

In this case it's all about the technicalities.


It's not a fair use question, it's a question about models reproducing the article text almost verbatim.

(IANAL)


If it's fair use, reproducing the article text verbatim is fine.


What is "it"? Training can be fair use, i.e. updating weights incrementally based on predicted next token probabilities. And I (not a lawyer) think that if a broadly trained foundation model can recall some verbatim text, that doesn't mean the model is infringing.

It seems like the lawsuit here is talking about specific NYT-related functionality, like retrieving the text of new articles. That essentially has nothing to do with training and running a foundation LLM; it's about some specific shenanigans with NYT content, and its legal status would appear to have nothing to do with whether training is fair use.
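
To make the "updating weights incrementally based on predicted next token probabilities" part concrete, here is a minimal, purely illustrative next-token prediction sketch in PyTorch. The toy byte-level model, sizes, and sample line of text are made up; nothing here resembles OpenAI's actual training setup.

    # Toy sketch of next-token prediction training. A real LLM is vastly
    # larger, but the loop is conceptually the same: predict the next token,
    # measure the error, nudge the weights.
    import torch
    import torch.nn.functional as F

    vocab_size, embed_dim = 256, 32                     # byte-level "tokens", toy sizes
    model = torch.nn.Sequential(
        torch.nn.Embedding(vocab_size, embed_dim),
        torch.nn.Linear(embed_dim, vocab_size),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    text = b"All the News That's Fit to Print"          # stand-in training document
    tokens = torch.tensor(list(text), dtype=torch.long)
    inputs, targets = tokens[:-1], tokens[1:]            # predict token i+1 from token i

    logits = model(inputs)                               # scores for each possible next token
    loss = F.cross_entropy(logits, targets)              # how wrong the predictions were
    loss.backward()                                      # gradients w.r.t. the weights
    optimizer.step()                                     # one incremental weight update

The weights end up storing statistics about the text rather than a literal file copy, which is exactly why the legal status of training is so contested.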


Good luck trying to explain "updating weights incrementally based on predicted next token probabilities" to completely non-technical lawyers and judges.


Good thing they don't have to. As I've said before, this sleight of hand of talking about the case as if it's only about training is a great move by OpenAI; however, the case is about more than just training. NYT is alleging that OpenAI maintains a database of NYT works and refers to it verbatim (per prompt) as part of the ChatGPT software, not via some pre-trained weights.

This is akin to having to explain to non-technical lawyers and judges how crypto works. In the FTX case it became irrelevant because you can just nail them on fraud for using deposited funds for non-allowed reasons.


>NYT is alleging that OpenAI maintains a database of NYT works and refers to it verbatim (per prompt) as part of the ChatGPT software, not via some pre-trained weights.

So if ChatGPT didn't refer to it verbatim and if ChatGPT trained on it and mixed it with other content, NYT would be OK with that? Tbh I don't get it.

Edit: I found this in my bookmarks archive - https://news.slashdot.org/story/23/08/15/2242238/nyt-prohibi...

Also this: https://www.cnbc.com/2023/10/18/universal-music-sues-anthrop...


It depends on the purpose of ChatGPT. If people use it as a substitute for the NYT then yes I suspect NYT will be not ok with it.

I think the courts will also side with the NYT. Very recently there was a copyright case involving Andy Warhol [1], which his foundation lost. Despite the artwork being visually transformative, its use was not transformative. So, to me, that means if you create a program using NYT's materials that is used as a replacement for the NYT, it will not count as fair use. Obviously you could just do what, say, Google does and fork money over to the NYT for some license.

However, my initial point is that this is a tangent. NYT has claimed that OpenAI is using NYT's works at least as-is, and so OpenAI can just be nailed for that. Which is my point about FTX; it's irrelevant whether their exchange was legal since you can just nail them for misuse of customer funds. Another example would be Al Capone; it doesn't matter that he's a mobster because you can nail him for tax evasion.

[1]: https://www.cbsnews.com/news/andy-warhol-supreme-court-princ...


I think this is more of a question of licensing content: sooner or later AI chatbots will have to license at least some of the content they are trained on.

But broadly speaking this is also a question of the "Open Web" and whether it will survive or not. Walled gardens like Facebook, Instagram etc. are strong and pervasive, but still the majority of people use and acknowledge publicly open websites from the Open Web. If AI chatbots do not drive traffic to websites then they are walled gardens, and Microsoft, Google or whoever will lock users in and try to squeeze them for money.


I didn't see NYT allege that - their lawsuit explains pre-training pretty accurately I thought.


It's buried on page 37, #108. There are probably other examples in the lawsuit, but this is sufficient.

> Synthetic search applications built on the GPT LLMs, including Bing Chat and Browse with Bing for ChatGPT, display extensive excerpts or paraphrases of the contents of search results, including Times content, that may not have been included in the model’s training set. The “grounding” technique employed by these products includes receiving a prompt from a user, copying Times content relating to the prompt from the internet, providing the prompt together with the copied Times content as additional context for the LLM, and having the LLM stitch together paraphrases or quotes from the copied Times content to create natural-language substitutes that serve the same informative purpose as the original. In some cases, Defendants’ models simply spit out several paragraphs of The Times’s articles.

https://www.courtlistener.com/docket/68117049/1/the-new-york...
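
For anyone who wants to see what that "grounding" technique looks like mechanically, here is a minimal retrieval-augmented generation sketch. The retrieval stub, model name, and prompt wording are all assumptions for illustration, not what Bing Chat or Browse with Bing actually do.

    # Minimal sketch of retrieval-augmented generation ("grounding"): fetch
    # relevant source text at query time and paste it into the prompt as context.
    from openai import OpenAI

    client = OpenAI()

    def fetch_articles(query: str) -> list[str]:
        # Hypothetical web-search step; a real system would hit a search index
        # and scrape the result pages here.
        return [f"<article text retrieved for: {query}>"]

    def grounded_answer(user_prompt: str) -> str:
        context = "\n\n".join(fetch_articles(user_prompt))      # copied source text
        messages = [
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_prompt}"},
        ]
        resp = client.chat.completions.create(model="gpt-4", messages=messages)
        return resp.choices[0].message.content

The point is that this is a different question from training: here the copied text is fetched and handed to the model verbatim at answer time, regardless of what is in the weights.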


Oh I see - yeah, that's the part of the lawsuit that's about Bing and ChatGPT Browse mode retrieval augmented generation.

It's a separate issue from the fact that the model can regurgitate its NYT training data.

There's a section on page 63 which helps clarify that:

    Defendants materially contributed to and directly assisted
    with the direct infringement perpetrated by end-users of the
    GPT-based products by way of: (i) jointly-developing LLM
    models capable of distributing unlicensed copies of Times
    Works to end-users; (ii) building and training the GPT LLMs
    using Times Works; and (iii) deciding what content is
    actually outputted by the GenAI products, such as grounding
    output in Times Works through retrieval augmented generation,
    fine-tuning the models for desired outcomes, and/or
    selecting and weighting the parameters of the GPT LLMs.
So they are complaining about models that are capable of distributing unlicensed copies (the regurgitated training data issue), the fact that the models were trained on NYT work at all, and the fact that the RAG implementation in Bing and ChatGPT Browse further creates "natural-language substitutes that serve the same informative purpose as the original".


Yep, you seem to be right. Google stores the quotes from pages, and it's fair use. Again, I am not a lawyer, and didn't think about this.


This is argued extensively in the lawsuit document.

A key argument the NYT is making is that part of the definition of fair use is not producing a product that competes with the original.

They argue that ChatGPT et al DO compete with the original, in a way that harms the NYT business model.

One example they give: ChatGPT can reproduce recommendations made by the Wirecutter, without including the affiliate links that form the Wirecutter's main source of revenue - page 48 of https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...


There are plenty of other services that strip affiliate links from content for users, such as ad blockers.

Notably, in both those cases, the user is specifically asking for what wirecutter thinks.

To me, that makes the infringing behavior clearly the fault of the user of the tool, not the tool itself.

When I saw this lawsuit announced, I assumed that large portions of content were being reproduced in response to generic queries. That isn't the case; every example I've seen from this lawsuit has prompts where the user specifically asks for the content. To me, any fault and liability rests on the user here.


Fair use is a key defense for OpenAI.

The article also mentions the idea that the model is merely extracting (uncopyrightable) facts, which is interesting, but might be a tough one to prove since LLMs have no way of establishing what are facts, and don't return facts by design.


Assuming this case is decided decisively in the NYT's favor, what would the effects be on the competitiveness of the US AI industry against other countries that aren't beholden to the ruling? Is the fix really as simple as licensing the data for training? Will the NYT et al. even allow that? Surely one can't get licenses from each and every website they use data from, so the only way to compete would be to be more clever with less data, right?


For enterprise solutions, it’s hard to imagine a viable competitor that wouldn’t be beholden to similar rules. At least here in the EU, where we aren’t very likely to be using non-western tech considering our privacy and national security laws. In both the finance and energy sectors we already face the prospect of needing exit plans from Microsoft products, not because of an anti-Microsoft sentiment but because the EU views it as a national security risk if too much of certain sectors rely on a single tech company. We have a lot of similar bureaucracy taking shape, so we’re not likely to turn to something like Asian AI. Hell, the fact that AI vendors wouldn’t adhere to whatever legislation comes out of these lawsuits would likely be enough to disqualify them as an option in the first place.

I do think it’ll be interesting to see how our militaries handle the issues. Considering that is probably the one area our governments don’t want to be “behind”.


> what would the effects be on the competitiveness of the US AI industry

To me, this leads to another question: is it possible to get a GPT-4 quality model using only training data from major content providers?

In other words, what if OpenAI didn’t crawl the entire internet and only used data from Facebook, Twitter, Wikipedia, GitHub, NYT, Stackoverflow, and [list 100 more well known high quality data sources here]

If it’s possible to get to GPT-4 model quality with training data from a couple hundred centralized sources, then maybe it won’t have any impact at all (other than OpenAI needing to pay other companies to use their data for training data)

Or, if the long tail “every website on the internet” really is the secret to training LLM models, then that would be much more problematic.


There are quite a few papers that indicate that the important factor is the quality of the data, regardless of where it is sourced from. You still need a lot of it so throwing out large chunks of your training set does work against you whether you're doing that for legal reasons or quality reasons. The solution appears to be synthetic, high quality data generated by another model but that makes it a bit of a chicken and egg problem where you need a high quality model like GPT-4 to produce high quality data reliably. I think there are methods for getting around this using less capable models by producing a lot more examples and then using another model to judge the quality of each example and only selecting the best ones. I also suspect you could get pretty far by permuting a smaller set of high quality data to produce a lot more examples that have the same meaning but differ in how they are written.
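
A rough sketch of that filter-by-judge idea, just to illustrate the shape of it. The model names, prompts, and 0-10 scale are assumptions, not any lab's actual pipeline.

    # Generate many candidate examples with a cheaper model, score each with a
    # stronger "judge" model, and keep only the highest-scoring ones.
    from openai import OpenAI

    client = OpenAI()

    def generate_candidates(topic: str, n: int) -> list[str]:
        out = []
        for _ in range(n):
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",          # cheaper generator model (assumption)
                messages=[{"role": "user",
                           "content": f"Write a short, factual training example about {topic}."}],
                temperature=1.0,                 # high temperature for variety
            )
            out.append(resp.choices[0].message.content)
        return out

    def judge_score(example: str) -> float:
        resp = client.chat.completions.create(
            model="gpt-4",                       # stronger judge model (assumption)
            messages=[{"role": "user",
                       "content": "Rate the quality of this training example from 0 to 10. "
                                  f"Reply with a number only.\n\n{example}"}],
            temperature=0,
        )
        try:
            return float(resp.choices[0].message.content.strip())
        except ValueError:
            return 0.0                           # unparseable judgement counts as junk

    candidates = generate_candidates("how copyright licensing works", n=20)
    kept = [ex for ex in candidates if judge_score(ex) >= 8]   # keep only the best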


Even then, e.g. citizens in the EU have the right to be forgotten. So if they use user-submitted data (and in some cases e.g. newspaper articles about specific people), users can request the data to be removed, which will be... an interesting challenge?


It’s possible to get a good image model this way: https://www.gettyimages.ae/ai/generation/about


Big Tech might actually prefer it: on one hand, they lose a whole lot of money by having to license scraping of a bunch of sources, but on the other hand, no open source model, and no small company, has a prayer of building a model legally.

It's not even down to the NYT and such: think Facebook, or Reddit, suddenly being able to charge others for their massive amount of data, while forcing anyone who posts there to make their content usable for the platform's own training as a condition of posting at all. Imagine a world where, say, Microsoft buys the Times, just because it's better for them not only to have the training data, but to tell every competitor to get bent if they want a license. 'Google buys Disney for the training data'. If AI is that useful, and we make copyright that strong, what we might see is ailing media companies being bought out by tech.

Ultimately IP law is about the tradeoff of more original content and more usefulness for the world, and AI massively changes where the center of balance is. So I don't expect a ruling for the NYT to remain the law of the land for very long, just due to economic pressures.

Ultimately it doesn't matter which side people like aesthetically: if an AI trained with no regard for copyright is that much more useful than one that quotes every source and pays it some money, then piracy will ensue. If it doesn't really help, then copyright will keep being very strong.

It's not even a matter of countries, but hackers. Would Aaron Swartz, working on his own LLM, pay for copyright dues or not? Because there's enough people in tech with that mindset that we'll end up with LLMs trained by a group that thinks like he did.

We will have laws, people in places that aren't subject to our laws, and pirates. It's how it always works.


Companies that don't operate within the West's legal scope will become dominant, and companies like Google and Microsoft will suffer for it, so big companies definitely don't want the NYT winning. You think Chinese companies would shy away from offering a ChatGPT replacement to the West if the US stopped offering those services? And as a bonus, they get to inject all the political bias they please into this monopolized product.


There's another strong possibility: that a court rules that it's fair use if the output is not sold (directly). That would be a positive outcome.


Insightful comment -- data licensing will become a much more valuable business. That's the context with the NYTimes lawsuit -- they were negotiating a data licensing contract with OpenAI, who were dragging their feet, doubtlessly continuing to scrape. This plata o plomo lawsuit is strategically smart for NYTimes.


I think it would lead to China surpassing the US in terms of AI tech pretty quickly. The US is actively trying to prevent that outcome already using sanctions on chip manufacturing equipment and IP.


The law tends to follow the money. I just don't see a future where a copyright ruling kills AI dead. If the courts rule in favor of NYT, congress will just change the law.


That seems like an opinion from within the tech bubble. There is a vast world outside and most people and companies have a completely different perspective.


Currently I would say that the money is on the side of content producers, not the AI companies... as there are a lot of parties that are extremely happy to limit what AI can produce...


> what would the effects be on the competitiveness of the US AI industry against other countries that aren't beholden to the ruling?

If it is found to be a breach of copyright then the Berne Convention comes into play, which makes it hard(er) for international offerings to exist.


The NYT needs to increase profits every quarter, just like everyone else. I have no doubt they will license their content to OpenAI. With the issue unsettled legally, right now, I suspect the NYT doesn't think OpenAI is offering them a fair deal.


Privately controlled companies need to do what their controllers think is most important. If the Sulzbergers think taking a stand on AI is more important than profit, they can go a long time without increasing profits.


The NYT isn't taking a stance against OpenAI. This is simply a negotiating tactic to get money for licensing.


We don’t know their personal motivation. They may very well dislike AI.


Surprisingly (at least to me), NYT is public.


I would add: should the case be dismissed, what would the impact be for any US-based content creator? Why write a book, a movie or a song if an engine can take your work and milk it more than you…


Most books, movies and songs make their creators no money at all. The famous, profitable ones are a very small fraction. Creatives create because they enjoy creating. And for the possibility of fame, which doesn't require copyright, as evidenced by all the influencers who achieve fame in spite of producing no media whatsoever apart from videos of themselves.


Worth noting that even the complete abolishment of copyright would not kill the arts entirely. There would still be many who make art to be heard, to bring joy to others, or under a Patreon-type arrangement.


It could go either way. Since others have laid out why it might benefit other countries, I'll lay out why it might benefit the US, from the perspective of an ML researcher since long before the hype arrived.

AI is only useful in so far as it's grounded in reality. If you ask it how tall the empire state building is, the answer has to be in some way derived from how tall it really is. Even creative writing is only useful in so far as how it's grounded in real life human experience. This is all to say that a bunch of people need to keep writing stuff about the real world for models to keep up with the real world.

I'd say the ideal scenario for US AI development is:

1) People keep making high quality data grounded in real world USA

2) That data is easily accessible for research purposes

3) That data is usable for commercial purposes (but might require more work than for research purposes)

A ruling for NYT here might keep good data more widely produced and available. Research legality is almost surely not going to be affected, so keeping data produced and available is key to keeping domestic research going strong and keeping talent here.

A ruling against NYT might mean all the new grounded-in-reality data goes behind closed doors. And even though the legality of research isn't in question, that would make the research harder. It would be easier to commercially use what you can find, but it could be garbage-in-garbage-out in the sense of quality data becoming hidden.


You would do something like a DMCA takedown request form but with an industry-standard fee, à la what Spotify pays to royalty holders per stream of a song, scaled based on the value of the content. It's gonna be tiny, tiny fractions.

Ideally a marketplace would develop that can price each piece of content fairly.


That would require the model to have the ability to attribute what training data sources it’s using to respond to a specific prompt, which AFAIK isn’t really compatible with how LLMs work.


> Assuming this case is decided decisively in the NYT's favor, what would the effects be on the competitiveness of the US AI industry against other countries that aren't beholden to the ruling?

Probably none. They can just open a subsidiary there and siphon the same data to train their models.


80/20 rule surely applies.

Licensing the most prominent IP holders will get the vast majority of data secured.

These billion-dollar-funded efforts can comfortably chase down, say, hundreds of deals.

GPT-4 finished training 18 months ago… plenty of time to clean up this mess, if there had been a will.


1. _Material_ licensing fees for use of protected content. NYT will have grounds for punitive damages (over 100K per work per infringing use), enough money to make OpenAI and even Microsoft come to the negotiating table.

2. Open door for everyone else whose content has been copied to demand payment.

3. Japan aside*, I'm not sure how a decisive victory for NYT fails to affect companies in other countries. Other countries will certainly take a US opinion into account. Furthermore, nothing whatsoever would prevent NYT from suing for infringement in France, Germany, etc.

A number of European countries are _much_ more comfortable with compulsory licensing than the US is.** I will not be surprised at all if Germany, Denmark, Sweden etc impose an AI-training license that companies have to pay to train or even use AI models. Could be based on employees, users, revenue, etc. They've been doing this for decades for other forms of mechanical reproduction, so it seems a reasonable path forward.

* I've seen reports that Japan has declared that it will not enforce copyright against organizations for training models on copyright-protected works. I don't know whether this means that _everything_ is fair game in Japan-- it could mean that training won't be enforced, but commercial use of models trained on protected works could be. I think it's somewhat harder to argue against training a model than it is against leveraging a trained model for commercial purposes.

** AFAIK, the only use for compulsory licensing in the US is musical performance.

4. "Surely one can't get licenses"... collective licensing would be a path forward for which there's a little more history in the US than there is for compulsory licensing. In a collective licensing model, OpenAI/Microsoft/Company X would be able to purchase a license priced by revenue/employees/users etc, and funds would be distributed to participating copyright holders.

For that to work, you need some kind of agency that has a ton of significant copyright holders. If some agency were able to get to a critical mass, that could happen without Congressional action.

Congress could also step into the problem in a couple of ways. They could wave a wand and make the problem go away for OpenAI/Microsoft/everyone else, or they could explicitly require some kind of license. I'd be shocked if Congress handled this productively _at all_; these issues are incredibly fraught in the best of times, Congress has been terrible for years at any kind of precise technical work, and the current Congress is uniquely incompetent.

I expect that if Congress accomplishes anything on this topic in the next couple of years, it will turn out to be awful for everyone.


One would hope that the outcome would be that NYT would be immediately delisted from all search engines, as being included requires AI learning from their content to index it. We don't need the NYT as a source to build AI systems.


OpenAI could just retrain with "clean" data...it would be a setback but not a death knell

Big Tech could also retaliate by making NYTimes conveniently disappear from search/news indexing


Or Big Tech could make a deal with Big Media: Apple Explores A.I. Deals With News Publishers.

https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...


Bing I hardly knew ya


I think the best result of this case is that the societal impact of AI is weighed fairly. Both in its potential for good, and therefore the need not to overly stifle genuine progress, but also in its potential for replacing jobs for real people—with silicon owned by the hyper wealthy—and therefore the need to set the correct precedent for attribution of works and some form of royalty. Most of the AI models of late are in some way trained on the collective works of mankind, so it seems like legislation needs to be defined, to prevent the wealthy running off with the whole planet, so that a portion of AI profit is returned to the common man who doesn’t have the aptitude or dumb luck to run a successful unicorn.


This is a corporation fighting another corporation; it isn't going to give journalists more money no matter what happens with this case. The impact on the citizen or worker hasn't even been weighed.


Why do you assume this is the case? When a reporter's story gets optioned for a movie they receive money, sometimes significant, from that.


> societal impact of AI is weighed fairly

It's doubtful that the courts are going to accomplish anything so broad. And the media is unlikely to accomplish anything so nuanced.


One landmark precedent opens the door to another and so on. We’re clearly not in a “fast takeoff” scenario, so mankind has time to strike a deal. We need to honestly weigh the dynamics of a world where climate change + societal collapse + advent of AI + power hierarchy = let’s just let billions eat shit while the ultra wealthy survive the apocalypse thanks to leverage we handed them without blinking or thinking.


What’s most frustrating to me is the ridiculous false equivalency people make between computers and humans

“Babies learn from their environment why not GPT?”

“Human teachers remix stuff for their lessons all the time, why can’t GPT?”

Computers are not humans.

Computers are not humans.

Computers are not humans.

Good grief.


People's opinions don't come out of a vacuum, and they usually do a lot of rationalization when it comes to defending their own prejudices and whatever matches their self interest.

This is HN, lots of people here see LLMs the same way miners saw the gold rush. When you deal with people dreams of richness, you can be pretty sure they would move us all right into a skynet situation, rationalizing every single fucking step towards our demise.


Current AI discourse has made me realize that there is quite the dichotomy between people who see humans as nothing more than biological machines that will soon be obsoleted by synthetic machines, and people who see humans (as Kant would say) as an end.

I think that it's sad that the leading AI labs seem to mostly be run by people in the former camp. It doesn't have to be this way.


> people who see humans (as Kant would say) as an end

Kant's arguments would apply just as much to intelligent, self conscious machines as to humans; nowhere does he base his conclusions on something as dumb as "because we're biological".


Of course. And if machines become conscious (I don’t think we should ever build something like this even if we can), I’d be the first to advocate that we give them moral consideration.

Right now though, we should put humans, and animals for that matter, before anything else.


This feels overly simplistic to me because there's no way to define "conscious"


they're only in it for the money.


OpenAI is a corporation, which means it's a person under US law. So yes, it's relevant as to whether a person can remix copyrighted material or learn from copyrighted material, because the same laws that apply to people apply to corporations. Moreover, the AI is the product of the collective action of _people_; it wasn't formed out of nothing.


> OpenAI is a corporation, which means it's a person under US law.

No, corporations are not people. They have some of the rights a person has, but not all. For example, from [1], "it is firmly established that a corporation has no fifth amendment protection against self-incrimination".

[1] https://law.justia.com/cases/federal/appellate-courts/F2/515...


Corporate rights are basically limited to the rights that any collective group of people would have, so self-incrimination doesn't make sense, but it doesn't get around the broader point that the same corporate personhood that allows corporations to be sued for copyright infringement, allows them to use the same defenses that groups of individuals (ie, co-authors) could use for copyright infringement. Could you sue "Alice and Bob" for copyright infringement because they watched a lot of movies before writing their screenplay? Nobody writes a screenplay from first principles. They're all based on all the other movies they've seen to some extent.


BS doesn't become true through repetition - unless you're an AI, that is.


>Computers are not humans.

A computer that can reason and is self aware like a human is _morally equivalent_ to a human unless your morality is based on some kind of human supremacist philosophy. And I imagine in the future human supremacists are going to be regarded the same way white supremacists are regarded currently.


This is also a speciesist position for making human-like reason and self-awareness the selector for moral status, which selects out non-human animals.


Yes, the lack of clarity between human characteristics and machine characteristics here is problematic. For example, there are two particularly salient differences at hand: (1) the ability to make cheap digital copies of machine intelligences; (2) the rights granted to humans.

This lack of clarity is emblematic of how legal mechanisms operate. Loosely speaking, in comparison with computational and philosophical approaches, the law is quite lazy at drawing distinctions. Disagreements get deferred using a pre-determined legal process. One might say the lazy-approach of the law is "by design". I'm not sure; I'd need to dig into the history a lot more to assess.


> Computers are not humans.

And? Why does that invalidate either of your two examples? You need to answer "why not GPT?" with something more substantial than "because it's not a human."


Oh. This one is easy.

ChatGPT is a commercial software product that generates wealth for its owners, so allowing commercial products to steal and promote others' content on their own is an agreed-upon violation of rights.

Let's look at an example: Google doesn't get to index and show New York Times content directly in its search results.

If you think of ChatGPT more like Google maybe that will help you keep it straight.


That wouldn't be legal even if GPT was a person. It being software doesn't play into your argument at all.


Ok. So GPT is a tool and therefore the legality is entirely tied to the use of its outputs. And we already have a large body of law around that for IP. I’m still not understanding your argument.

And Re: Google. I agree. Which is why the case law around Google books showing full excerpts is important to look at.


I'm a biological cog that generates wealth for my owners (i mean employers) and I use others' copyrighted content to learn how to do so


I've already seen one person today suggest that discontinuing a generative image model that produces problematic output is comparable to murdering an artist. It's pretty absurd.


What frustrates me the most are comments like yours that contribute nothing whatsoever to the debate about the limits of copyright. Humans are not special, get over it.


If you're interested in this case, I very strongly recommend actually reading the 69 page PDF. It's surprisingly readable.

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

The first 4 pages lay out the high level complaint very clearly, but I found most of the rest of the document (skip the bit about jurisdictions) clearly written and easy for my non-lawyer brain to understand too.


The verbatim reproduction of their copyrighted content is the interesting point. I can see where it would be fair use to train on content but not to reproduce it.

I feel like it is clearly a copyright issue if someone, through whatever means, posted New York Times articles on the internet for 1000s to potentially see and find through search.

If someone, through some other interface, provides a New York Times article verbatim on request to 1000s of people independently through private channels, isn't this similar?

In both cases it requires users to seek it out. In both cases copyright-protected content is openly available to 1000s of people. Generating your stock of copied material one by one, on an order-by-order basis, doesn't feel materially different in effect on the copyright owner. The interface owners are willing to distribute copyright-protected content 1000s of times on request.


The outcome of this is probably going to be the same as similar cases — the end user (ie OpenAI in this case) must properly follow the law. It could be the media companies using AI that must evaluate the legality though.

In any case, AI is a tool, not capable of reasoning. As such, the end user is responsible for ensuring all applicable laws are followed. This is always the case. You wouldn’t sue Microsoft for copy & paste, would you? It lets you copy people’s work.

Imo the NYT case is pretty damn weak, because derived works are fair game. I can even read their article in full and provide comments on it, and it’s legal (provided the objective isn’t to share the copyrighted work)


Has anyone successfully repro'd the plaintiff's claims of unlawful use? I just made several attempts using GPT-4 and was unsuccessful.

Perhaps OAI have already taken action to disable ChatGPT's ability to quote in full without attribution, if it was ever possible in the first place(?) Plaintiff doesn't disclose exactly which prompts they used to give rise to their infringement claims, which seems suspect.

If the case goes to trial, and OAI has taken steps to mitigate the harm - i.e., it no longer reproduces NYT content word-for-word - does that significantly reduce the damages they could seek?


I think you need to use the API to reproduce. I heard the chat prompt stops gpt4 from some of this behavior.


Food for thought: what happens when we add a layer of indirection in the training?

Imagine we have a model but don't train it with the copyrighted data. But we train it with every review ever made public about it.

If we have a large number of reviews the model will be able to give pretty good responses about the topic and may even be able to reproduce some part of it.

Is this copyright infringement?

And what if we add another layer? Now we just train the model with the comments people leave about the reviews.


“Imagine we have a model but don't train it with the copyrighted data. But we train it with every review ever made public about it.”

Those reviews are also copyrighted works.


I take issue with describing hallucinations as lies. They're most certainly not. There is no culpable agent with the ability to determine whether or not they are telling the truth. There is no intent, positive or negative, behind the generations.

The generated statements may be inaccurate but they cannot be called lies. To do so would be a dangerous level of anthropomorphization in the legal sphere.


Lying and hallucinations would be accurate with the way people treated computers before “AI” ever existed. Intentional reckless disregard for truth is a form of lying.

You don’t have to worry about the legal sphere anthropomorphising software. Software is legally not a person and therefore will not be treated as one by the courts, regardless of how people commonly refer to it.


If AI were treated like software these court cases would be very clear and not an issue. The answer would be that yes AI can copy content as part of its operation but if a human uses it to reproduce that content exactly for others then it violates copyright law.

We've long had precedent with search engines. But despite that, these cases are revisiting the way copyright might work.

These cases are not as simple as people like you imply no matter how many times you repeat the utterly pointless and irrelevant fact that computers are not humans.


> "You don’t have to worry about the legal sphere anthropomorphising software."

This is a lawyer writing about a lawsuit on a law office blog. That's what's worrisome.


I take issue with describing lies as hallucinations. For an LLM they’re no more or less a hallucination than generating something that’s correct.

LLMs have no sense of what is correct or not, therefore they cannot tell the truth or lie. They always hallucinate - sometimes those hallucinations match our factual understanding.


People also misremember things with no intent, yet they are still considered as lies. I understand AI is not a human, but classifying a hallucination a lie seems reasonable to me.


>People also misremember things with no intent, yet they are still considered as lies

Generally when people remember wrong it's considered a mistake, not a lie. Lies imply an intent to deceive.


"Hallucination" is also a dangerous level of anthropomorphization. It's sensational; it implies a fanciful level of imagination and visualization that isn't there.

"Fabrication" is a much more accurate and neutral word. I wish "hallucination" hadn't taken hold.


All this is going to do is make open source LLMs stronger in the long run. https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

Good luck suing a nonprofit open-source LLM for this.

Also, the linked article uses the phrase "very strong case" while the subject here uses "strong case"; given that other lawsuits in this space have already been challenged (https://www.hollywoodreporter.com/business/business-news/sar...), isn't the "strength" of the case very subjective?


> Here, Microsoft and OpenAI haven’t just used The Times’ content to teach the AI how to communicate, they’ve launched news-specific services and features that ingest both archived content and brand new articles from The Times.

What news specific services are they talking about?


One allegation that stands out is

> DMCA Section 1202 violations by all defendants regarding removal of copyright management information from items in the datasets

Aside from the obvious fact that OpenAI doesn't want "© 2023 The New York Times" appearing anywhere in their output, I'm sure it's also removed because LLMs don't adapt well to metadata and can't reliably correlate metadata to the exact text it's attached to, treating it instead as input text itself. One imagines a first round of training where this metadata was retained and every response from the LLM came back with a perfunctory hallucinated copyright from another publisher.


The legal question should be around: does OpenAI make any guarantees about the content and its validity? Is the output considered property of OpenAI? You can’t train on it, so does that mean it’s their property? This is the most important question.

For example, if you were to hire a group of writers, ghost writers to be specific, to re-create someone else’s work, then publish it yourself as your own, who is at fault? Who gets sued? Is it you who directed the writers to do the work, or is it the writers who did the work?

So if we accept this model, the onus is on the prompter not to re-create copyrighted works.

(Albeit an assumption that the writers are unaware to some extent)


The examples shown in the previous article on the same case (ChatGPT producing exact text) make it look like LLMs only need to see an example once. Or did it see those examples many times and is therefore reproducing them?


Not a lawyer, but seems like the law on copyright and fair use is not that clear cut. It comes down to arguments and how well they resonate. The lawsuit here argues all possible sides - producing content verbatim, summarizing an article (wirecutter), and then also hallucinations about the stories and brand names. All three are done by journalists themselves anyway, so a favorable verdict opens them up to a lawsuit?

Anyway, we tend to forget that there are countries outside the USA that can make laws too. And they will. What if one country happily says training content for LLMs is not a copyright violation (like Japan was rumored to do)? That is an answer I am more interested in finding out. Any country that passes such a law would see massive investment (especially if other countries deem it a violation). That would just mean tech + investments + innovation moving out of the USA. A recent trend here is the approach of France towards regulating AI in common EU laws[1].

[1]https://www.politico.eu/article/thierry-breton-mistral-battl...


In music, there is a 'compulsory license': you can't prevent someone from playing your music, but they do have to pay you for it. AI will need something like that; we are going to have to reinvent IP law. And big tech will write those laws since that's how our democracy currently works. Will be a big fight between Microsoft and Disney with maybe tiny input from actual creators.


> Here, Microsoft and OpenAI haven’t just used The Times’ content to teach the AI how to communicate, they’ve launched news-specific services and features that ingest both archived content and brand new articles from The Times.

What is this referring to? AFAIK reporting on real time news is beyond the capability of an LLM, and a Google search doesn't turn up any news reporting feature.

> But the hallucinations aren’t facts; they’re lies. And even if the defendants prevail in arguing that the AIs are mostly just providing people with unprotectable facts, there’s very little to shield them from liability for the lies, both with respect to trademark dilution claims, but also with respect to potential libel or privacy-related claims that might be brought by other individuals.

This seems really weak because libel requires damages. It would be incredibly hard for the NYT to prove any damages from ChatGPT lying about it. Also libel requires "actual malice" which means the libeler needs to know that their statements are false. Good luck showing that with an LLM.


> libel requires "actual malice" which means the libeler needs to know that their statements are false

Creators of LLMs (OpenAI in this case) should know that their LLMs regularly provide false statements.


And the NYT knows that its articles frequently contain false statements too. In this case they were sued for it and won even though they admitted in court that the libelous statements were not true. https://www.nytimes.com/2022/02/15/business/media/new-york-t.... The actual malice standard is next to impossible to meet.


One could argue that the LLM are not making any statements at all, as that would require intent and purpose beyond what they can (currently) do. The LLM outputs the most likely token to appear next, based on what it learned during training. This could be anything, and it happens to generate some text that sometimes looks like statements, but is just text generated by lots of math.

Of course I'm not actually arguing this, and have no horses in this race. But not hard to imagine that someone could make this argument and also believe it.


Telling a lie is not illegal by itself. Telling a lie with the intent of harming someone, especially their reputation, is another story.


> the libeler needs to know that their statements are false

That's only the first half of the or-conditional. The second half is "with reckless disregard of whether it was false or not." That's very different, and I'd think the model and its creators not caring whether the output is true but marketing it as usually true is having a reckless disregard for the truth of its outputs. There's no checking whether it's real.


LLM hallucinations are a well known problem, and accepted as such by OpenAI etc... perhaps, as a group / statistically, that rises to the level of "knowing statements [can be] false"? of course, IANAL... but keen to see how it plays out.


>Also libel requires "actual malice" which means the libeler needs to know that their statements are false. Good luck showing that with an LLM.

Isn't that easily proven by the text at the bottom of chatgpt: "ChatGPT can make mistakes. Consider checking important information."


Disclosing that the information is prone to error is the opposite of a malicious lie


ChatGPT contains the disclaimer, proving that OpenAI are well aware of the tendency of the model to confidently misstate fact. Bing chat contains no such disclaimer, nor does the OpenAI playground that many use as a product itself.


I’m not really interested in following every service but… idk. Is that actually true? I doubt it.


I just read the complaint and the question we are trying to answer here is moot. This part of the complaint hinges on verbatim copies of their articles being intermingled with fake excerpts from fake articles making it impossible for a reader to differentiate the two.

If I wrote 10 fake headlines, I might be protected. If I copied 9 from NYT and then made one up, it's a lot harder for me to argue I didn't intend to deceive.


Bing/Copilot, as well as Web Browsing in ChatGPT, can look up the news with simple web searches, and (allegedly, according to the lawsuit) bypass certain paywalls doing so. Even when not bypassing paywalls, there's still issues with hallucinations (sometimes I click on a link and it's not even remotely what Bing claims it is) and stripping out things like attributions or referral links (e.g. NYT's Wirecutter makes money through affiliate links).

That they're not part of the main GPT-4 model itself is true, but the lawsuit is against OpenAI and Microsoft, not vs. GPT-4.


Seems like people are very concerned with accuracy and attribution. Personally I want AI to make the internet a warm sludge of meaningless binary, I want you to have no idea whether the person on the other side of the keyboard is real or not, I want you to be wary that everything you read is an AI hallucination. Be skeptical.


Great progress. It would be great if open source developers would group up and sue them into oblivion.


It's amusing that NYT spends so much verbiage in the complaint speaking so highly of itself as doing some great service to humanity and then stating a legal argument that is equally applicable to a trash-rag dishing out celebrity gossip.


Let's imagine that they settle out of court.

And then in a few years, as AI improves, the NYT relies on AI to write some of their articles (or they rely on AI to proofread and edit their articles).

Can the Times' current arguments be used against them using AI later?


I don’t see how. If they own the content (which in the outlined scenario they would) they could do whatever they want with it. The argument is about unauthorized use of copyrighted materials, not just using AI in general.


Used against them by whom? For what?


Related: NYT sues OpenAI, Microsoft over 'millions of articles' used to train ChatGPT. 2023 December 27. <https://www.theregister.com/2023/12/27/the_new_york_times_fi...>

Discussion on HN (86 comments; original submission title: 'A business model based on mass copyright infringement'): <https://news.ycombinator.com/item?id=38784194>


From the OpenAI FAQ:

“What sources of data are used for training OpenAI models?

OpenAI uses data from different places including public sources, licensed third-party data, and information created by human reviewers.”

Is the NY Times data in one of those categories?


The language says “including”, which means these mentioned types of sources plus, potentially, other sources that don’t sound as good written down and so are left out.


The question is: if the NYT can seek relief from Microsoft and OpenAI for this, then why can't every person who has code on GitHub do the same? In this country we are at an unfortunate place where only corporations and rich people can seek relief. You need a lawyer.

However, any person has the right to file and appear "pro se". And if the legal filings and arguments by NYT are public, then why can't I just file my own case. And you. And you. And you. And you. And you. You get the picture.

We have this situation where corporations can exploit situations and people with impunity. They can buy off the government - oh wait, we have to call that "lobbying". Lawyers won't take on the case unless it is a class action worth billions to them. In which case you and I get a piddling amount or none. See https://news.ycombinator.com/item?id=38901484. (the question being whether Apple settled in order to avoid a finding that their action was purposeful).

All we need is to use the information from the NYT legal case. <humor?> And hey maybe someone can use OpenAI to generate the filings?</end questionable humor>


> The New York Times Launches a Strong Case Against Microsoft and OpenAI

aka "Ancient mammoth sinking in tar pit throws one last fit before facing full-on extinction".


Will Microsoft still be able to harvest people's code on GitHub if the Times wins?

When will programmers speak out against their code being used for something they never consented to?

I'm glad journalists are leading the charge against unfair AI, and I wish programmers were raising their voices also.


Microsoft owns GitHub, so an explicit Copilot carve-out is but a TOS update away if they wanted to.


The irony here is that there’s 99% chance that the NYT is using ChatGPT internally. Even their journalists, unbeknownst to their bosses.


There's nothing ironic about that. NYT likely wouldn't be suing if ChatGPT weren't useful.

They acknowledge its usefulness and are demanding to be compensated for the billions of dollars they spent producing articles that trained ChatGPT.


This makes Apple's approach look smarter: seeking a license from the NYT and other news organisations for permission to train on their news content for its AI, rather than knowingly attempting to bypass that and screaming 'fair use' after profiting off of training on and outputting verbatim paywalled articles without permission from the copyright holder.

News organisations are probably waiting for the NYT to win the case, and then similar claims for damages could follow from others, not just from the NYT.

The key claim is this:

> Vicarious copyright infringement (the idea that Microsoft and various OpenAI affiliates directed, controlled and profited from infringement committed by OpenAI OpCo LLC and OpenAI, LLC)

OpenAI knew they needed a license from Shutterstock to train their DALL-E model; instead of doing the same with news organizations, they decided to trample all over copyright and profit off of copyrighted articles without permission.

The end result of this is a settlement and licensing deal, a similar likely result for Getty v. Stability.


When did everyone in these circles flip flop from thinking that copyright is fake and that anyone who tries to enforce it should not be taken seriously, to thinking that it's a serious issue and has to be addressed?

I'm quite bearish on AI in general, but the whole copyright argument is nonsense to me. How is it any different than a human learning from public material in their field? My understanding is that you are not distributing copyrighted material verbatim, but a set of weights that were trained on copyrighted material. Just like how a writer reads a lot of other writers and becomes a better writer.

If we applied this same standard to human authors, couldn't we say that every narrative is actually the intellectual property of some unnamed hunter-gatherer who described how he killed an antelope to a captive audience?


I feel like we need to put a banner up that says "We don't apply the same standards to AI that we do to people because AIs are software, not people." And we can just point to it instead of wasting time mentioning what should be obvious, instead of entertaining these constant reductio ad absurdum arguments based on false equivalence.


I don't really think it's reductio ad absurdum. If something is okay for me to do, then I can make software to automate it.

It's okay for me to take existing artwork and use it as reference to make the artwork I want to make. I can make software to help me do that, like photoshop. But I can't make an AI tool to help me do that. Why? What is the actual difference?


How is suing OpenAI different from suing an army of gifted individuals who can speed-read and who read most NYT articles and create responses based on them?

Assuming there is no plagiarism, of course.


Because larger LLMs can reproduce text verbatim, speed readers probably not:

https://nitter.net/maksym_andr/status/1740776900786626608


So what this tells me is that if you make a carefully constructed prompt and change the temperature to zero (not an option normally available), and know the exact article title and perhaps the beginning of the first paragraph (which is pretty difficult to do unless you already basically have access to the full article), you can potentially get verbatim articles back out.

Great, so all those paywalls that use verbatim subjects and early texts as a teaser are now broken! Well, for all old articles that the LLM was trained on, at least, I guess. Very simple countermeasure is to simply use AI to paraphrase the subject and article (paradoxically) until the person pays for the privilege of reading human-authored text.

I'm sorry but this is a very weak argument. For example, I don't even believe normal users have access to the GPT4 system prompt unless they use the API directly (and possibly not even then, I'd have to check).


One is real, another one is made up (and there is plagiarism, of course, many pages of examples of that)


The temperature setting controls the randomness of the LLM's output.

We paraphrase all the time to avoid plagiarism, and that's just a somewhat randomized retelling of the same idea.

If you set the temperature to 0 in an LLM it's basically in "decompress/rote mode". I don't think this is qualitatively the same as "copying", possibly more akin to "memorization". I haven't seen very many demonstrations of verbatim-copy output that wasn't done with a temperature of or near 0.
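
A toy sketch of what that knob actually does to next-token sampling (made-up logits, not any real model's), which is why temperature 0 behaves like rote playback:

    # Temperature scaling of next-token logits. As temperature approaches 0 the
    # distribution collapses onto the single most likely token (greedy decoding),
    # so the model deterministically replays its most likely continuation.
    import numpy as np

    def sample_next_token(logits: np.ndarray, temperature: float) -> int:
        if temperature == 0:
            return int(np.argmax(logits))            # greedy / "rote" decoding
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())        # numerically stable softmax
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))

    logits = np.array([2.0, 1.5, 0.2])               # toy scores for three tokens
    print(sample_next_token(logits, 0.0))            # always picks token 0
    print(sample_next_token(logits, 1.5))            # sometimes picks token 1 or 2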


Also, you can't avoid plagiarism by paraphrasing because paraphrasing is a form of plagiarism. The key here is whether you cite the source, which the model doesn't

https://www.scribbr.com/frequently-asked-questions/is-paraph...


That's coming. Some LLMs are already starting to do that.


How is the fact that there is a flag to disable plagiarism relevant to the issue that there is plagiarism?


Because enabling "plagiarism mode" is a conscious action that a human takes; it does not default to "plagiarize" any more than a machine that has simply stored a verbatim copy of an article is "plagiarizing" when asked to print it out. Plus, citations are showing up in LLMs now.



