Ask HN: Isn't ChatGPT unfair to the sources it scraped data from?
65 points by wxce on Feb 5, 2023 | 183 comments
ChatGPT scraped data from various sources on the internet.

> The model was trained using text databases from the internet. This included a whopping 570GB of data obtained from books, webtexts, Wikipedia, articles and other pieces of writing on the internet. To be even more exact, 300 billion words were fed into the system.

I believe it's unfair to these sources that ChatGPT drives away their clicks, and in turn the ad income that would come with them.

Scraping data seems fine in contexts where clicks aren't driven away from the very site the data was scraped from. But in ChatGPT's case, it seems really unfair to these sources and the work the authors put in, as people would no longer even attempt to go to them.

Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?




Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

People get internet hostile at me for this question, but it really is that simple. They've automated you, and it's definitely going to be a problem, but if it's acceptable for your brain to do the same thing, you're going to have to find a different angle to attack it than "fairness".


> Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

Because it's a false equivalence? ChatGPT isn't a human being. It's a product that is built upon data from other sources.

The question is if this data is legal to scrape, which it is: Web scraping is legal, US appeals court reaffirms [https://news.ycombinator.com/item?id=31075396].

As long as the content is not copyrighted and it's not regurgitating the exact same content, then it should be okay.


Being allowed to scrape something does not absolve you of all intellectual property, copyright, moral, etc. issues arising from subsequent use of the scraped data.


Exactly. Besides, the question isn’t about legality; it’s about what the law should be. The question isn’t whether it’s legal, but whether we need to change the law in response to technology.


ChatGPT isn't doing the scraping, humans are. And humans are using computers to both read the article and create content or to scrape it.

So no, it's not a false equivalence.


There’s a reason scraping is a legally grey area.

> Web scraping is legal, US appeals court reaffirms

First, the case is not closed. [0]

Second, to draw an analogy, you can use scraping in the same way you can use a computer: for legal purposes. That is, you cannot use scraping to violate copyright, just as you cannot use a computer to violate copyright.

The following being my conjecture (IANAL), there is fair use and there is copyright violation, and scraping can be used for either—it does not automatically make you a criminal, but neither is it automatically OK. If what you do is demonstrably fair use presumably you’d be fine; but OpenAI with its products cannot prove fair use in principle (and arguably the use stops being fair already at the point where it compiles works with intent to profit).

[0] https://news.ycombinator.com/item?id=31079231


It seems the issue with scraping, as it pertains to copyright, isn't the scraping itself, any more than buying a book to sell off cheap photocopies of it indicates a problem with buying books. The issue is the copying, and more importantly, the distribution of those copies.

Fair use of course being the exception.

Now, accessing things like credentials that get left in unsecured AWS buckets is the bigger area where courts are less likely to recognize the legality of scraping. Never mind the fact that these people literally published their private data on a globally accessible platform in a public fashion. I'm not a lawyer, but I've seen reports of this leaning both directions in court, and yes, I've seen wget listed as a "hacker tool."

This is what happens when feelings matter more to the legal system than principles.

And before it's brought up, I may as well point out that no, I don't condone the actual USE of obviously private credentials found in an AWS bucket any more than I condone the use of a credit card that one may find on the sidewalk. Both are clearly in the public sphere, unprotected, but for both there is a pretty good expectation that someone put it there by accident, and that it's not YOUR credential to use.

Basically, getting back to the OP, ChatGPT hasn't done anything I've seen that'd constitute copyright infringement -- fair use seems to apply fairly well. As for the ad-supported model, adblockers did this all first. If you wanted to stop anything accessing your site that didn't view ads, there are solutions out there to achieve this. Don't be surprised when it chases away a good amount of traffic though -- you're likely serving up ad-supported content because it's not content you expected your users to pay for to begin with.


Yes but that's a technical issue. I took the parent as making a philosophical point and responded in that spirit.


Wouldn’t it be nice if the people on these forums were not ignorant of both philosophy and the legal system before diving into incoherent conversations about both at the same time, where the main thrust is the emotions they have about these tools?


One can dream.


yup


How is it not scraping? There's no other way to get all that data for training a model without scraping.


It's scraping both when humans do it and when the ChatGPT team do it, but that wasn't the point the parent made. He made a moral/philosophical point, which is what I responded to.


Check me on this because I'm not a software person:

When a person "scrapes" a website by clicking through the link, it registers as a hit on the website and, without filters turned on, triggers the various ad impressions and other cookies. Also, if the person needs that information again, odds are they'll click on a bookmark or a search link and repeat the impression process all over again.

When an AI scrapes the web, it does so once, and possibly in a manner designed not to trigger any ads or cookies (unless that's the purpose of the scrape). It's more equivalent to a person hitting up the website through an archive link.
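To make that concrete, here's a minimal sketch of why a scrape never registers ad impressions (Python with the requests and Beautiful Soup libraries; the URL is a placeholder). The fetch downloads the raw HTML once, but none of the page's JavaScript ever runs, so ad tags and analytics beacons never fire:

    import requests
    from bs4 import BeautifulSoup

    # One plain HTTP GET: the server logs a hit, but the ad/analytics
    # scripts embedded in the page are never executed, because nothing
    # here runs JavaScript -- we only parse the returned HTML.
    html = requests.get("https://example.com/article").text
    text = BeautifulSoup(html, "html.parser").get_text()

    # The extracted text can be stored and reused indefinitely,
    # with no repeat visits and no further impressions.
    print(text[:500])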


> The question is if this data is legal to scrape

...it is? I didn't see that question raised in OP's text at all. What do legacy human legalities have to do with how AI will behave?

> Because it's a false equivalence? ChatGPT isn't a human being.

Is this important? What is so special about human learning that it puts it in a morally distinct category from the learning that our successors will do?

It sounds like OP is concerned with the ad-driven model of income on the internet, and whether it requires breaking in order for AI to both thrive and be fair.


>Is this important?

Well yes, it's the whole crux of the matter. Laws govern human behaviour. As of 2023, only living beings have agency. If I shoot someone with a gun, the criminal is me and not the gun. Being a deterministic piece of silicon, a computer is perfectly equivalent to the gun. Sure, it is important to start a discussion of potential nonhuman sentience in the future, but these AI models are not unlike any previous software in legal issues. It's bizarre to me how many people are missing this.


> these AI models are not unlike any previous software in legal issues

Agreed. However, the previous 'legal issues' related to software and the emergence of the internet are also difficult to take seriously when considered on anything but extremely short time scales.

Every time we swirl around this topic, we arrive at the same stumbles which the legacy legal system refuses to address:

* If something happening on the internet is illegal, _where_ is it illegal? Different jurisdictions recognize different jurisdictional notions - they can't even agree on whose laws apply where. If you declare something to be illegal in your house, does that give it the force of law on the internet? Of course not. Yet, the internet doesn't recognize the US state any more than it does your household. It seamlessly routes around the "laws" of both.

* The "laws" that the internet is bound to follow are the fundamental forces of physics. There is no - and can be no - formal in-band way for software to be bound to the laws of men, because signals do not obey borders. The only way to enforce these "laws" are out-of-band violence.

* States continuously, and without exception, find themselves at a disadvantage when they make the futile effort to stem the evolution of the internet. For example, only 30 years ago (a tiny speck in evolutionary time scales), the US state gave non-trivial consideration to banning HTTPS.

I understand that people sometimes follow laws. But they also often don't. The internet has already formed robust immunity against human laws.

Whatever human laws are, they are not the crux of anything related to evolution of software. They are already routinely cast aside when necessary, and are very clearly headed for total irrelevance.


> It's bizarre to me how many people are missing this.

Very much this. I am too tired right now to engage with other responders, but thank you for articulating precisely the point I want to make.


> It's a product that is built upon data from other sources.

To be fair, so are you.


Does anyone actually find these arguments persuasive?

There is really no reason to believe that what chatGPT or stable diffusion does is anything like what "your brain" does--except in the most superficial, inconsequential way.

Second, try applying this logic to literally anything else and you'll see why it's absurd:

"You can't ban cars from driving on sidewalks! If it's acceptable for people to walk on sidewalks, then it has to be acceptable for cars to drive on sidewalks, since it's just automated walking"

"You can't ban airplanes from landing in ponds. They fly 'just like' ducks fly! So if it's acceptable for them, it must be acceptable for airplanes too"


Yes, and: why shouldn’t it matter that in one case it is a person and in another it is a computer program?

Why would it be incoherent to say “I’m okay with a person reading, synthesizing, and then utilizing this synthesis—but I’m not okay with a company profiting off of a computer doing the same thing.” What’s wrong with that?

But again, like you and others have said, it’s really not the same thing at all! All ChatGPT (or any other deep learning model) is capable of doing is synthesizing “in the most superficial way.” What a person does is completely different, much more interesting.


I find the argument pretty persuasive.

I also agree it's not the only argument and ultimate proof.

I don't, at this point, have an answer. I'm sure this miraculous new technology will survive the luddite attacks, but there will probably be some tense moments, and some jurisdictions will choose to be left behind.


You usually expect people to cite sources. Granted, that very often doesn't happen, and the amount of citing expected depends on the context. But ChatGPT just doesn't cite sources at all. I think there's a case to be made that they should.


People don’t remember the sources that formed their opinions; it’s just baked into the structure of their brain after reading. Same for the model.


With search engines, it does feel like there is a clearer trade of scraping access in exchange for web traffic.

With ChatGPT the traffic benefit isn’t there, so it feels like it isn’t a fair trade.

Google adding context and data to their search results page also started blurring this trade, making it unnecessary to click through to the site the info was gleaned from.


How does someone cite a source when they are using GPT to convert a box score into an entertaining paragraph about a baseball game? Or to convert a natural language command into a JSON format ready for downstream processing?


Right, it’s a gestalt from a huge set of sources. It’s not copying single text sources verbatim into your output.


Humans have a pretty good sense of when you need to cite sources, and when you don't. For example, long ago I learned from some website how to write a for-loop in python, and now I write them all the time without giving credit. I'm okay with ChatGPT writing a for-loop without citing its source.

I would say most knowledge about words/grammar/laws of nature can be taken for granted without a citation, but there are some important exceptions where things must be cited. I don't know how you'd reliably teach the difference to a computer though.


And yet, exactly in this example, I HATE that people don't put sources. Perhaps not for for-loops, but search anything simple in Python. "Python JSON output", for example, and you will find a billion articles that describe a simple Python library... but DON'T link to python.org or the "javadoc". They're always discussing the most blatantly obvious simple thing, never remotely complete, never linking to where you can actually find more info (but jobs, courses, ads... those will be linked).

It's getting me to the point of refusing to use Google, or only using Google with "site:...". I mean, the site varies, but without site limits Google is becoming useless.


ChatGPT doesn't have a concept of sources. It has weights that together define a function that allows it to guess the most likely next word from the context. As a neat side effect of this contextual next-word guessing, it can often share accurate information.

If ChatGPT were required to share its sources, it would need a completely different approach. I'm not commenting on whether or not that would be a bad thing, but it would render the current iteration completely useless. You can't strap a source-crediting mechanism on top of a transformers-based model after the fact.
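A deliberately crude illustration of that next-word guessing (a toy bigram model in Python, nothing like a real transformer, but it shows why the sources vanish into the weights):

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ate the fish".split()

    # "Training" collapses every source into aggregate counts. Once built,
    # nothing records which document contributed which count.
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def next_word(context):
        # Return the most likely next word given the previous word.
        return counts[context].most_common(1)[0][0]

    print(next_word("the"))  # -> "cat"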


> You can't strap a source-crediting mechanism on top of a transformers-based model after the fact.

I've read that ChatGPT is not connected to the net, but if it was: Couldn't you have it do a google search (or better yet corpus search) for the string it generated and then return the most significant matches (significance by string matching, not google rank)? It would be really crude, but wouldn't this just be a handful of lines of code that don't interfere with the "transformers-based model" code at all?
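For what it's worth, the crude version over a local corpus really is only a handful of lines; here's a sketch using Python's difflib (the documents dict is a stand-in for a real search index):

    import difflib

    # Hypothetical corpus: URL -> text. A real system would query
    # a search engine or index instead of an in-memory dict.
    documents = {
        "https://example.com/a": "Ronald McDonald was first introduced in 1963.",
        "https://example.com/b": "Burger King introduced the Whopper in 1957.",
    }

    def best_matches(generated, k=1):
        # Rank documents by raw string similarity to the generated text.
        scored = [(difflib.SequenceMatcher(None, generated, t).ratio(), url)
                  for url, t in documents.items()]
        return sorted(scored, reverse=True)[:k]

    print(best_matches("Your clown's been around since '63"))

Note this runs entirely outside the model: it finds texts that resemble the output, not the texts the model actually drew on.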


Why couldn't you, as a human, do that to verify it?

The other day I had GPT write a rap battle between Burger King and Ronald McDonald. One of the stanzas came back:

    Burger King:
    Your burgers are plain, your buns a bore.
    Your clown's been around since '63,
    I'm sure my flame-grilled taste will leave you impressed
    My burgers are fresh, my fries are the best
It turns out that yes, Ronald McDonald was first introduced in 1963. https://en.wikipedia.org/wiki/File:McDonald%27s_commercial_(... (from https://en.wikipedia.org/wiki/Willard_Scott#Created_Ronald_M... )

So here's the challenge for you - who do you compensate for that line?

The complaint that people have isn't that GPT isn't citing its sources but rather that it isn't compensating the people who created the data that has that information.

... and now, if you're ever asked about historical clown trivia and pull out the "Ronald has been around since 1963", who should you give a royalty to? Me (for writing this), GPT (for making me aware of it), Wikipedia (for the source of my links in this post), the estate of Willard Scott for the Joy of Living (which Wikipedia cites), some random blog author that had some clown trivia on it that happened to have been part of the training set for GPT?


Because I want to credit not verify? Because I want to trace the flow of information?

It isn't just monetary compensation that's important here.

I come at this from the point of view of a scientist who is expected to reference ideas. Not necessarily back to their original source, but at least back to a source that can theoretically point back to another link in the chain.

Sure, I can manually search for a reference based on what ChatGPT gave me. Or someone could spend a few minutes adding a few lines of code to ChatGPT to save millions of people some minutes of time.

-----

What would be awesome is an LLM that you can feed data to, and it can then write a paper based solely on the data you feed it.


I've still got the question - who should I credit with the bit that Ronald has been around since 1963?

I had it write a poem the other day, in the style of Roses are Red, about coffee and bacon.

   Roses are red
   Bacon is greasy
   My coffee is hot
   Together they please me
If this is something that someone considers to be a derivative work of other things... who do I credit?

    Identify a word that have different meanings to two different professions at the same time and the professions that use them.  Give the definition of the word for each profession. Write a joke using this word.
to which I got back:

    The word is "band." 

    Definition for a Musician: A group of musicians who play music together.
    Definition for an Astronomer: A dark region in the sky with less stars.

    Joke: What did the astronomer say when the musician asked him to join his band? "I'm sorry, I don't do solos in the dark!"
How do you credit that?

---

> What would be awesome is an LLM that you can feed data to, and it can then write a paper based solely on the data you feed it.

https://platform.openai.com/docs/guides/fine-tuning
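(At the time of writing, that guide has you upload prompt/completion pairs as JSONL. A minimal sketch of preparing such a file; the records here are made up:)

    import json

    # Hypothetical training records in the prompt/completion JSONL
    # format the fine-tuning guide describes (examples invented here).
    records = [
        {"prompt": "Raw data: 42.1, 39.8, 40.3 ->",
         "completion": " The mean was 40.7, trending downward."},
        {"prompt": "Raw data: 7, 9, 8 ->",
         "completion": " The mean was 8.0, roughly stable."},
    ]

    with open("training.jsonl", "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")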


> If this is something that someone considers to be a derivative work of other things... who do I credit?

Based on a quick search the best credits would be ChatGPT as the arranger, and "Roud Folk Song Index number 19798" as the inspiration.

> "Joke: What did the astronomer say when the musician asked him to join his band? "I'm sorry, I don't do solos in the dark!""

> "How do you credit that?"

That you credit to ChatGPT. It's not referencing facts or discoveries, so credit isn't as important as it is for articles. If you want to credit an inspiration then I'm sure there's an index of joke forms out there that has an appropriate number to cite.

I can't actually find a definition for band in astronomy that is "a dark region in the sky with less stars." So it seems to be a pretty poor joke.

> https://platform.openai.com/docs/guides/fine-tuning

This does it solely based on the data you feed into it? And by data I mean scientific data that you discovered, and want formatted into a particular research article style.

Edit to add: Possible sources for the line "together they please me":

1) https://www.google.com/books/edition/Poetical_Works_of_Louis...

2) https://www.google.com/books/edition/Florio_s_First_fruites/...


Why did you pick that index rather than some other source material? Roses are red dates back to 1784 (year not index number) as a nursery rhyme. Does it need to be credited or is it in the public consciousness to the point where one can create a poem based on it without knowing its original source?

    Write a haiku about bacon and coffee.  Identify the syllable count for each word and line used in the haiku.
    Example:
    Bacon (2) sizzles (2)
    Aroma (3) of (1) coffee (2) too (1)
    Mouthwatering (4) bliss (1)

    Smoky (2) bacon (2)
    Brewing (3) coffee (2) aroma (3)
    Makes (1) mornings (2) bright (2)
The second poem is from GPT. Do we need to credit the dictionary where it got the syllable count for each word? Or where it got that coffee (rather than bacon) is brewed? Or that bacon and coffee are things more often consumed in the morning?

    Identify four foods or beverages that are frequently consumed in the morning and how each is prepared for breakfast.

    1. Coffee: prepared by brewing hot water over ground coffee beans.
    2. Cereal: prepared by pouring cereal into a bowl and adding milk.
    3. Toast: prepared by toasting bread and adding butter and/or jelly.
    4. Eggs: prepared by scrambling, frying, poaching, or boiling them.
There is a difference between "identifying a source where this information can be found" and "this is the (copyrighted) source of the data that GPT used to draw upon to come up with the statement."

The first is an exercise for the reader (and much better done and evaluated by the reader). The second is what people are concerned about.


I'm concerned about both. I'm a "people".

> Why did you pick that index rather than some other source material?

I told you why references were important in scientific documents already.


Scientific documents - certainly. If you are writing a research paper or encyclopedia, I expect it to be well cited.

If you are writing something that is synthesizing knowledge (not just reporting the facts), then "where are all the places that knowledge came from" is an impossible task for human or machine.

If I ask GPT to create a poem in the style of Roses are Red about coffee and bacon - why should that request be held to the same standard of citation as an encyclopedia or research paper?

If, on the other hand, you're trying to use GPT to write such a paper... I would hold that you're doing it wrong. It doesn't do that well. The model is "about" transforming language. To do so, it contains a fair bit of 'knowledge' so that it can do this accurately. OpenAI makes no claims about the accuracy of the content that GPT produces (it's improved, and it can answer more accurately, but if you want to know the answer it is no better than your next door neighbor who has read a lot).

If you are claiming that the example of Bacon is Greasy poem that GPT wrote is infringing any more than a child's "roses are red, my cat is orange, his eyes are green, nothing rhymes with orange" then I believe you will face an uphill battle.

To say that there is plagiarism and infringement going on - it needs examples rather than a "I think it works this way and is just regurgitating material it was fed from elsewhere."


The Bing leak seemed to mention sources.


Hello [Oxford Dictionary: 1827]

Oh, wait, I'm not going to cite sources in a non-scientific work, as this leads to madness. The following is a previous post of mine on HN:

"Your mind exists in a state where it is constantly 'scraping' copyrighted work. Now, in general limitations of the human mind keep you from accurately reproducing that work, but if I were able to look at your output as an omniscient being it is likely I could slam you with violation after violation where you took stylization ideas off of copyrighted work.

RMS covers this rather well in 'The right to read'. Pretty much any model that puts hard ownership rules on ideas and styles leads to total ownership by a few large monied entities. It's much easier for Google to pay some artist for their data that goes into an AI model. Because the 'google ai' model is now more culturally complete than other models that cannot see this data Google entrenches a stronger monopoly in the market, hence generating more money in which to outright buy ideas to further monopolize the market."


No, it isn't that simple. The scale and totality of the scraping is out of reach for a human.

If you previously interacted with people on this issue, you must know that.

It is fair for a single human to breathe, but not for a machine to use all oxygen on this planet at once, killing everyone else in the process.


If I woke up tomorrow and breathed all the oxygen, nobody else could breathe. But if I woke up tomorrow and read all the websites on the internet, it wouldn't stop other people from reading them too.

Air is zero-sum. Knowledge is not.


> But if I woke up tomorrow and read all the websites on the internet, it wouldn't stop other people from reading them too.

If you became the first-line, go-to source for the information on those websites, those websites would stop getting click-throughs. Eventually it would become less and less worthwhile (economically or emotionally) for the people keeping those sites running to keep them running. It would become more and more difficult for people to find those sites even if they are running, or even the archives of those sites.

So yes, eventually you'd stop people from reading them too.


It is in fact that simple. There are dozens, hundreds, perhaps thousands of legitimate, genuine, serious, real reasons to be concerned and "want something to be done". This isn't it.

"Learning is unfair" is not an argument you want to win.


Love how you’ve conveniently ignored

> The scale and totality of the scraping is out of reach for a human.


Because it's conveniently irrelevant.


Why?


> It is in fact that simple.

Why?


>Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

The difference is in scale.

A human video game designer can consume other people's art, then sell their labor to a video game developer. The amount of value captured by the video game designer rounds down to zero as a percentage of the economic value created by 'video game art'.

OpenAI can consume the output of all video game artists, ever, create an art design product, and capture a significant percentage of the economic productivity of video game art.


At a human level it falls below the noise floor. It's a fact of life that humans will learn and build from experience.

The difference is scale. At scale it becomes a problem.

Edit: I don't know how to satisfy all parties. This shakes the foundation of copyright. Perhaps we are all finding out how valuable good information truly is, especially in aggregate. We have created proto-gods.


At scale, it becomes a wonderful tool. Are the people in this thread so threatened or so invested in the current business models of the internet that you can’t see how amazing this sort of thing could be for our abilities as a species? Not just in its current iteration, but it will get better and better.

This could be an excellent brain augmentation, trying to hamper it because we want to force people to drag themselves through underlying sources so those sources can try to steal their attention with ads for revenue is asinine.


It is a wonderful tool but I still feel that the creators of the training data are getting shafted. I'm both amazed and horrified at our creation and what it portends.


Yeah, there will probably have to be some adjustment. In the future, maybe an ML agent will hire people to go find answers for it about questions it has, using us as researchers/mechanical Turks :-) Quality matters more than quantity for something that’s trying to understand the world well and not just building a statistical language model; I imagine that it will be worth it to pay for quality when training heavily used models, to avoid using garbage info. You don’t need 30 different superficial product reviews with a bunch of SEO text if you have one that’s very thoroughly researched.

And in the meantime, with ads no longer working, maybe crypto is actually useful for something here - Lightning makes very small transactions possible with basically no fees, and makes it easy to programmatically pay for things. People hate being nickel-and-dimed, but a professional trying to construct an ML model could reasonably budget for use fees for fast, unhindered access to quality training data. An agent could even evaluate its likelihood of learning something new/accurate versus the cost proposed by the server, and choose the subsets to pull.

Just a random idea, but I hope we don’t fight tooth and nail to preserve the trash heap of the internet’s current state.


The internet has always been a trash heap. We've just been creating new heaps with parts of the old heaps every few years or so. Sure, it's nice to imagine a future in which this isn't the case, but your imagination is not going to be the future reality.


People are already being paid to curate data for models, I’m mostly suggesting that that might become a major revenue source, and that ads might be less relevant in a world where people don’t need to sift through the trash heap to get info (and that’s a good thing overall!)


Wouldn't an AI-driven search engine be even better than a language model for that purpose though? It could even snippet highlight the most relevant parts of various web pages to save on the sifting.


Maybe, and arguably, that's what Google has been doing. But one thing I really like about the idea of using a model directly is that there's one interface to learn, whereas with web search, I'm constantly adapting to a grab bag of page types and UX conventions.


Do you think just maybe there is a difference here because humans need money to survive, and maybe we should have compassion for humans who could hypothetically starve or freeze or suicide or whatever because they have no money? Or is it just silly to care about people like that?


That's got nothing to do with whether or not it is "fair" for a learning system to produce content after it has learned.

That is, instead, one of the larger and vastly more important sociocultural issues that actually warrants attention, but never receives it in sufficient degree to address the problem, because, for example, we're arguing whether automated learning is "fair".


If "fairness" isn't worth figuring out for a society, why is our entire economic order nominally built ontop of such a virtue? How is this not the very thing we are talking about? People starve on the streets right now because any other arrangement of resources has been deemed "unfair." Do we not sign a contract for our labor or for our homes because of shared idea of fairness? Fairness is the ultimate thing we appeal to in our world, it is the only thing that can sustain the intense individuality of the modern world. Dont ambiguate it as a Nietzchean moral fairness here, we are talking about the pseudo-algorithmic fairness of a market which guarantees certain things if you trade enough of your resources.


Isn't that like saying automated looms should be banned because it meant humans would lose jobs to it? Or buggy whip drivers wanting to ban cars?

https://en.wikipedia.org/wiki/Luddite

Might as well ban computers since they automated and eliminated a lot of manual jobs.

The problem of humans with no money should be solved by a social safety net and things like UBI.


Well, the luddites were right in that they were fighting a good and honorable fight.

Until things are, in fact, solved by whatever idea you might have, why should we just accept each new thing that makes our human lives more intolerable? How could you expect any rational person to have that kind of blind trust in a technology, much less in "progress" itself, when every single aspect of our world shows that it is whoever owns the technology that actually benefits from it? I think it is much crazier to totally roll over for each new thing that takes your job than it is to fight for your food and shelter.

I think we can do better than UBI, but either way, fighting against this unfairness is fighting for the things we need to continue with some shred of humanity, insofar as this technology is and will be an agent for the consolidation of labor and profit. It's all the same fight, and the historical luddites understood this, consciously or not.

Who knows, maybe the internet would have been better off if some people were brave enough to smash some of Google's servers in like 2006..


This is a response to an argument the GP didn't make. One can still have grave concerns about generative AI's potential impact on human society while accepting there is nothing fundamentally unfair about how it scrapes publicly accessible data.


Ok, so, what is that argument, then? Maybe you just want to say: "well, this is such a big deal it's going to change everything anyway, so we can't judge it on how it will affect today's society, but rather the society it will create." If so, all one can really say to such longtermism is: "well, good luck with that I guess, I will keep trying to survive over here."


Teach the suicidal ones the noble art of suicide bombing and force societal change that way.


> Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

I don’t do it on an industrial scale.


It’s fair if you do this; neither fair nor legal [0] when a commercial for-profit entity, backed by a large corporation, does it at scale and capitalizes on that.

Imagine if you were a webmaster and Google unilaterally decided to stop sending users to content you have worked to research and write, and instead aggregated it and showed the answer to a user’s query entirely on its own pages, without any attribution or payment to you. Unimaginable, yet that is very much the scenario unfolding now. [1]

Scraping at this kind of scale is out of your (or any given individual’s) reach. It is, however, within reach of the likes of Microsoft (on whose billions OpenAI basically exists) and Google (who, to be fair, have not abused it in such a blatant way so far).

[0] It is clearly using someone else’s works for commercial purposes, including to create derivative works. (Again, it’s different from you creating a work derivative from someone else’s work you read previously, because in this case a corporation does it at scale for profit.)

[1] And the cynic in me says the only reason we are not yet out with pitchforks is simply because OpenAI is new and shiny and has “open” in its name (never mind the shadow of Microsoft looming all over it), while Google is an entrenched behemoth that we all had some degree of dissatisfaction with in the past and thus are constantly watching out for.


Your argument contains the implicit assumption that we have the same rules for machines that we do for humans.

That is trivially disproved, as is the rest of your argument that follows from it as a premise.


If you treat ChatGPT like a human, how high is the salary you are going to pay it?


I think this is a real concern, but imagine a couple other scenarios:

1. You have a widely read spouse named Joe who reads constantly. He's got a good memory, and typically if you have a question you just ask him instead of searching for it yourself. Are you depriving Joe's sources of your eyeballs?

2. Many books summarize and restate other books. If I read Cliff's Notes on a book, for example, I can learn a lot about the original book without buying it. Is this depriving the author?

3. I have a website that proxies requests to other websites and summarizes them while stripping out ads.

So which of these examples is the best metaphor for what an LLM does?

I don't know. The fact is, LLMs are a new thing in our tech and culture and they don't quite fit into any of our existing cultural intuitions or norms. Of course it's ambiguous! But it's also exciting.


It is not breaking the ad-based model—it’s breaking open information sharing culture as we know it.

Yesterday: 1) You do research, you publish a book, you write some posts. 2) People discover your work and you personally, they visit your posts and subscribe to you. 3) You have an opportunity to upsell your book and make money on ads to sustain your future work; more importantly, you get to see traffic stats and see what is in demand, you get thank-you emails and feel valued.

Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers, and have no reason or even opportunity to buy your book or know you exist.

What exactly are the incentives to publish information openly in that world?

(Will they even believe you if you say you’re the one who did the niche research powering some specific ChatGPT answer, in a world where everyone knows you can just ask an LLM?)


Why would someone only ask an LLM questions when they were in the market to buy a book? Most people I know don't buy books in order to look up the answer to a question, sure some people buy reference books and use them but that's not really what we think of when talking about authors and books. If I'm in the market for a book, I'm looking to read a book, not query something or someone for answers. I think your example should go like this:

Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers to some related question or interest 4) They ask the LLM for a list of recent books that go in depth on the topic or are in the genre etc. 5) Your name comes up in the list 6) Goto step 2 from Yesterday


> 4) They ask the LLM for a list of recent books that go in depth on the topic or are in the genre etc. 5) Your name comes up in the list

My belief is that ChatGPT is actually not quite capable of that, after seeing examples of how it manufactures non-existent references. Besides, if it were capable of that, why would it not show your name as part of the answer already now?

The cynic in me thinks it’s not capable of that primarily because it is not a priority for OpenAI and training data strips attribution, with an explicit purpose: if the public knows that ChatGPT can trace back the source, OpenAI would be on the hook for paying all the countless non-consensual content providers on which work it makes money.

We should treat OpenAI as we treat Google and Microsoft. It has great talent and charismatic people working for it, but ultimately it’s a for-profit tech company and the name they chose ought to make us all the more suspicious (akin to Google’s “don’t be evil”).

> Why would someone only ask an LLM questions when they were in the market to buy a book?

Why would you be in a market for a book when you can learn the same and more by asking an LLM that already consumed said book? And therefore why would the author spend effort writing and publishing a book knowing it’d sell exactly one copy (to LLM operator)?


It's very much in their interest: if the information their models provide is impossible to verify, then it severely limits its uses. You essentially can't use it as a source for anything that requires any type of citation or reliability. That's a huge handicap for selling it to businesses and researchers. The general problem of determining what training data was used to produce an output is an open problem in ML, and one that is being very actively worked on, since solving it would greatly further the field.

You believe correctly that ChatGPT is not capable of showing sources; it's currently impossible to do, but we were discussing Tomorrow, so I included it as a possibility. You could potentially hack it in now using traditional search or nearest neighbours, but it wouldn't be 100% accurate, probably not even 50%; it would just show a bag of similar texts, so it's not really worth doing.
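For the curious, that nearest-neighbours hack would look something like this (cosine similarity over embedding vectors with NumPy; the vectors here are random stand-ins for real document embeddings):

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-ins for embeddings of training documents and of one model output.
    doc_vecs = rng.normal(size=(1000, 384))  # 1000 docs, 384-dim embeddings
    query = rng.normal(size=384)             # embedding of the generated text

    # Cosine similarity: normalize rows, then take dot products.
    doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = doc_norm @ (query / np.linalg.norm(query))

    # The five most similar documents -- a bag of similar texts, with no
    # guarantee any of them actually produced the output.
    print(np.argsort(scores)[-5:][::-1])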

I'd still be in the market for a book even if we had a perfect LLM that could answer every question I had with impeccable accuracy. I read books because I want to find out about things I don't know that I don't know. It's pretty hard to find those things if you just do question and response. It's like a graph: if you start at one node it may take you a very long time to traverse the graph to another node, but if some outside source gives you the address of a new node, you can just jump straight to it.


Exactly. As a professional artist, I am expected to have a public online portfolio and publicly available imagery of shows and exhibits. Saying that I'm forfeiting my stake in my art because I'm showing it publicly is a really great way to kill art and culture.

AI is not learning to make, draw, or use mediums in a skilled manner. AI is scraping my public images and plotlining them with the input of humans to label them, tag them, and apply stylistic qualities to them. Just because there are massive amounts of data to dilute influence doesn't change that the computer is still simply doing what a human is telling it to do with imagery created by humans. If you took away the human input, labeling, and tagging, you would find that the computer has not learned anything. I can look at 'AI' art and pick out artists from the collated imagery. Unlike 'AI', I can't spit out the imagery by photocopying/plotlining/tracing it. I have to learn the skills of each artist involved to recreate what I see.

Motor skills require practice and effort. 'AI' is not learning motor skills, which are the basis of the creation of art. It is mapping and applying statistical algorithms to amalgamate data from preexisting sources for those who want 'Art' without the time or skill it takes to produce it. At this very moment 'AI' art is being used to sell merchandise with zero credit or monies going to the people who used their human motor skills to create the backbone of this art. Sadly, this only aggravates the ways copyright already restricts human art.

Imagine if we lived in a world where people valued artists with respect for their craft. I once had someone ask me how long it took me to draw a charcoal drawing. The short answer is half an hour. The long answer is that I was doing daily sketching practice and investing many hours a week doing charcoal exercises. I am currently out of practice with charcoal, and as it is a medium with no erasing or margin of error, I doubt I could recreate my drawing without 'getting my hand back in'.

It is obvious to me that this 'AI' tool is being used by humans, with the industry of humans, to exploit humans for the gratification of end-user humans. I suppose humans could stop making art to feed the monster...


That’s my main fear. Not the fairness / unfairness but that people might be less willing to share info and a lot becomes inaccessible / secret.


I am also anxious about the web becoming fragmented and secretive. If one must gain access to the right circles to start learning, it hinders learning in general; for myself and many people I know, it would basically mean we wouldn’t be doing what we’re doing if that had been the case when we were younger.


Exactly. We’re in the ChatGPT honeymoon, but the incentives to share info moving forward are unclear. I could see big model owners paying for exclusive access to content/data, hindering the free distribution of information and becoming like the old publishers and gatekeepers.


Yes it absolutely is, but imo less so than what GitHub Copilot and various image generation companies are doing. My theory is that if AI turns out to be as disruptive as the current hype suggests, the conflict between those who feed the AI vs. those who profit from it might be the next big social rift.

Artists are already in full rebellion against this, as they should be, being nearly eclipsed by AI, except when it comes to inventing new styles and hand-crafting samples for the models to train on. These, I assume, are either scraped off the web, or signed away in unfair ToS of various online publishing platforms.

Since the damage individually is small (they took some code from me without attribution, ok) but collectively enormous, in my opinion it is the role of government to step in and soften the blow if necessary.


> Artists are already in full rebellion against this,

Huh? No. Some artists are maybe?

> as they should be, being nearly eclipsed by AI

Not even close. It's like looking at the newest brand of clip art.

Non-artists don't (maybe can't) know that particular feeling, at least not with regard to being told you're angry about "what's supposed to look like art".

(Heck, artists have been told that with regard to other humans' art for centuries, for one)

Going even further, a lot of artists already know how to build on this new tech without ripping people off.

I used to teach college art classes and would have loved to integrate this topic into the curriculum. It'd be a great ongoing discussion, no matter the legal outcomes.


Absolutely, yes. It's incredibly unfair. But techbros here and elsewhere don't care about you or me or people in general and they'll think up an infinite amount of ridiculous false equivalencies before admitting the risks and real harms.


1. get to enjoy an open network of networks

2. people share, get creative and get some sort of credit for it

3. scrape it all, feed it to a large deep neural network, and get a worse version of all this content but easily accessible

4. creative people don't see a reason to keep sharing what they have (no new public books, no new open source projects, ...)

5. get stuck in an AI world of recycled content

People blindly following OpenAI products have a very shortsighted vision. What they did is neither innovative nor extraordinary: they got the data, convinced some victims into a kickstart, and made sure the hardware supports the bigger deep neural network that can do the job. Check out the OpenAI alternative solutions; it's not hard.


> but techbros here and elsewhere don't care about you or me or people in general and they'll think up an infinite amount of ridiculous false equivalencies before admitting the risks and real harms.

I came to this realisation arguing with someone in a mutual Discord server about these very topics (the negative impacts of AI). They just couldn't see it, and refused to believe it. I was constantly met with things like "Sure, we'll have to adjust but it'll come" and "Things are no worse now than when the TV and books were invented" (completely ignoring the many billions companies are spending to make things more addictive to our monkey minds, which don't change). Also lots of noble "everyone can use it and it'll benefit everyone"... when really, it only benefits those who can control it. No mention of biases in training data or anything else either. They were completely blind to the idea that it might not be good and that we should seriously admit there are huge issues looming.

I also found it telling that the multiple people like that also weren't fans of in-person interaction outside their friend group. They saw Discord interactions as just as good as going out and having serendipitous moments in person, with other real people, and just actually living. That's something else I feel technology has stolen from us, with everyone always glued to their screen. It's funny how I've become something of a Luddite, proudly, and think we need less internet and more real world, cause, well, life is the real world; being human happens through real-world interactions. And not ones mediated by your phone.


100% agree - the likes of ChatGPT are straight up generating revenue based on adding value to stolen work.


Let's turn this around the other way.

I create an omniscient copyright detection bot and point it at everything you create, 24 hours a day, 7 days a week.

You go home and sing happy birthday to your kid. The bot gives you a non-monetary warning for using a copyrighted work without permission. No big deal, but it is on your permanent record.

It had been a stressful day, so you take up your evening hobby of painting. You like nature scenes and trees, and 30 minutes in you receive a violation: evidently Bob Ross has already done this, and his surviving estate is now asking you to destroy the picture.

The next day you go into your job at the corporate bureaucracy slinging lines of javascript. It's been a productive day so far and you have a few hundred new lines of code written, and then the bot goes off and HR and legal are ringing the phone within seconds. Turns out some comment you saw on Stack Overflow years ago was imprinted in your memory well enough that you committed a copyright violation. Looks like you'll be losing your job.


The first two situations you mention almost certainly aren’t copyright violations. The third is at least a solid “maybe.”


They aren't, or they shouldn't be, but that's the point of the parent's comment.

Look at the videos flagged by youtube or copyright trolls, a lot of them are not actual copyright violations, but they are flagged anyway by the algorithm and removed or demonetized. And it takes a lot of work to fight those claims.


No, that doesn’t seem to be the point of the parent’s comment. That comment is treating those activities like copyright violations when they’re not. Why would the parent imagine an omniscient copyright bot that is probably wrong?

Me singing happy birthday at home or painting a picture for myself to relax is already demonetized. There doesn’t need to be an omniscient (and wrong) copyright bot to do that.


The first situation isn't copyright violation because some monied entity went out and litigated against Warner/Chappell music. That's the problem with copyright - until you've litigated, which is expensive, you just can't tell what's in and what's out of copyright. You wrote "almost certainly" because of that.


> You wrote "almost certainly" because of that.

No, I didn’t.

I was hedging against the too-clever HN commenter coming back and saying, “The robot knew you were going to sell the painting” or “You sang the song on the Jumbotron at Yankee Stadium.”


Because of the litigation, even singing "Happy birthday to you" on the Jumbotron is not infringement. But only because of the litigation.


The third exists in a form right now. I've worked for a couple of companies that require all code to be run through a scanner intended to detect if that code has been lifted from open source codebases. You won't get fired if it has, but you will be required to remove it before it is accepted into the codebase.


All of a sudden everyone has issues with free information when Google has been doing this for years.


All I have to say is, as technologists, anyone who is criticizing ChatGPT and has not been criticizing Google is a hypocrite. It's well known Google tries to keep you on Google by parsing more and more information from websites and summarizing it: e.g., Wikipedia summaries, IMDb scores, review stars, etc.

If you have a problem with ChatGPT's "scraped data", then you have more fundamental issues with how the internet is as it is today.


Google makes money when you click links and visit webpages. Instant info features are useful but do not directly bring Google money.


That's my point?

If the product is scraping the data and presenting it on their website like ChatGPT and Google, then that's effectively the same as taking away the ad revenue from those websites because they aren't getting the impressions.


You're confused. Where there is an ad impression (a user clicking to go to a webpage from a Google search result), that webpage pays Google for bringing them traffic. If the user never clicks the ad because Google directly presented the info natively, Google doesn't make any money.

I can only make the guess that Google offers this to remain competitive against Bing; it both reduces their income and increases their tech stack.


It seems you are the one confused. If you present any ads on your website (Google or not), then you have ad revenue. The less traffic that comes to your website, the fewer impressions you serve, which directly reduces your ad revenue.

The original topic is about taking away money from the sources. Google taking your data and presenting it on its own page takes away traffic, and less traffic to the data source's website means less ad revenue.


Google makes money when you look at and click ads. Visiting websites has a risk that it takes you to a site that does not show you Google ads.


Ah, our daily dose of a bunch of people with basically no understanding of copyright law, or even the basic concepts of tort or common law jurisprudence, making all sorts of silly anthropomorphic arguments about “how computers think”.

Please, people, learn how to focus your thoughts. Go read up on copyright law in the United States. If you go into learning about copyright law trying to justify your own preconceived notions you will gain nothing.


Absolutely this, well put. I guess enough people misunderstand AI models to the point of treating them like they are not software. I guess this validates Clarke's third law (Any sufficiently advanced technology is indistinguishable from magic).


The dirty secret of how so many social media giants got their initial traction is that in the early growth stage they scraped content. LinkedIn is one I have personal knowledge of; Facebook another. How do you think they got a critical mass of users? Scraping and fake engagement. Back in the 00's, when they were startups operating in little offices in the SF Bay, they had teams of people running Beautiful Soup and building bots to create profiles and stuff.

I'm actually not really sure I have an opinion on the ethics of it. Same argument as Adblock. You don't get to control how people consume your content if you put it out in the world for free. That goes for profiles, or articles, reddit posts, StackOverflow, etc. The only thing that's ironic is that large tech companies throw a fit whenever you want to turn the tables and scrape them.


Didn’t LinkedIn use people’s phone books?


Well, you could say the same thing about the answers that Google displays on its pages instead of search results! If you don't want these crawlers to index your content, I am pretty sure you can disable that via robots.txt, just like with Google.


The training data sets show inconsistent respect for robots.txt. Also, I believe most of these models are not continuously crawling websites to update their data like a search engine does. That means if you're crawled once, you may not be crawled again, and you'll still be in the datasets.

I'd also argue that Google directing traffic to your website is a good alignment of incentives. ChatGPT spitting out answers derived from your work with nothing given back to you in return is not.


I bet that fully half the time, I read the Google answer, click on nothing, and go on my way.


That's still better than 0%


The idea that a robots.txt will save you is laughable.


Agreed. At best, you can disallow: / and hope they're polite enough to listen.

I can't seem to find anything on OpenAI's crawler agent, so I'm skeptical they're considering robots.txt at all.
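For reference, the blanket opt-out is just:

    User-agent: *
    Disallow: /

Whether any given AI crawler honors it is exactly the open question.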


Even if they abide, this is capitalism. Somebody who wants an edge won't. Or OpenAI or Google will get desperate and stop abiding.


True. Robots.txt is already a very weak thing. I disallow all access using robots.txt, but there are many crawlers who ignore it and I have to maintain an overt blocklist for them.
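The blocklist itself is nothing fancy; a hypothetical sketch of the server-side check (the bot names are illustrative, and real deployments usually do this in the web server config):

    # Hypothetical user-agent blocklist; names are illustrative only.
    BLOCKED_AGENTS = ("CCBot", "SomeScraperBot")

    def is_blocked(user_agent: str) -> bool:
        # Crawlers that ignore robots.txt get refused at the server instead.
        ua = user_agent.lower()
        return any(bot.lower() in ua for bot in BLOCKED_AGENTS)

    assert is_blocked("CCBot/2.0 (https://commoncrawl.org/faq/)")
    assert not is_blocked("Mozilla/5.0")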


It's the lack of attribution that really hurts, though I think it's fairly shady of Google to steal the ad revenue from smaller sites.


You can ask ChatGPT to cite its sources.


You can ask it to, but it will just make up sources. The connection between its knowledge and the original sources is not represented in the model.

(This is an active area of research, though, and a version of GPT that could cite its sources is something people widely agree would be valuable.)


You can ask for the source of information.


They still link to the source though. Even when they show a snippet.


All I know is that while this isn't a new issue, the likes of ChatGPT have brought it to a head and made it more urgent. I am seriously reconsidering whether or not I want my writings to be available on the internet at all. I object to many of the uses they can be put to, including this one, and not publishing them online appears to be the only control available.

For now, I have removed my existing works, both technical and creative, from the internet and won't be adding more while I try to work out what to do.


Finally, a good answer from a content creator. It is wrong to try to hobble and control all of us with ever more detailed laws. It is right to choose to participate carefully.


The importance of source citation in ChatGPT's responses is a topic of debate, particularly as the platform shifts towards a paid model. While ChatGPT is designed to deliver information in a conversational and user-friendly way, it is important to consider the potential legal implications of using unverified or uncited information. In sensitive or controversial cases, it is advisable to properly cite sources to ensure accuracy and avoid any potential issues of intellectual property infringement.

On the other hand, the focus on the potential of ChatGPT's natural language processing capabilities highlights the significance of learning and using LLMs (Large Language Models) in data handling. The utilization of LLMs can potentially lead to a future where traditional databases become obsolete and are replaced by advanced language models. As such, the development and integration of LLMs in our daily lives and processes can bring about many benefits and possibilities.


No, it's not. Also, it's not unfair if I study someone's work and then learn from it. Also, it's not unfair if you see my internet presence and are inspired to do similar things.

At some point participating in the internet means your stuff is going to be seen. I wear glasses to read web content. I don't think the glasses company should pay royalties for what I read. chatGPT is a tool that allows me to understand and use the information people put onto the internet better.

Far from a matter of fairness, this is simply another way that selfish people are trying to monetize the future, to make it more and more difficult and expensive for others to participate.

"I've always wished I could charge everyone on earth. chatGPT looks like the future. If I can tap the money flow there, I will get mo' money."

I'm against it.


You’re conflating human work and machine work. I think that’s the real argument here - it’s fine to study someone’s work and learn from it as a human, because that takes vastly more time and produces vastly fewer total works than ChatGPT doing the same thing.

I’m not arguing for either side, just pointing out that we need to carefully consider what rights AI should share with humans.


But they are going to monetize it regardless of what we want. If that's the case, they need to compensate for any copyrighted content, etc.

You're being selfish too. How do you think we got phones, etc.? Capitalism applies to knowledge too.

ChatGPT took advantage of that and wants to monetize it while cutting out the people who spent the time, money, resources, etc. Just like Copilot, plain and simple.


You can also invert this and say that without a system like ChatGPT it is physically impossible for most people to find or use those 570GB of data. A search engine can only get you so far, and over time they are becoming less useful as the net floods with junk content. If you don't even know what terms to search for, ChatGPT wins out, since you can start with a very simple question and then interrogate it further on details it produces. The best way to think about it is as a better search engine: a fully interactive one that also has some degree of its own agency when it comes to synthesizing data. It could be better; it would be nice to have the option to show sources for the output so that you can verify the facts or do your own research.


Google piggybacks on the same sorts of data to rank results and display ads without compensating site owners. They track billions of people without paying them. I don't think it's any more unfair that OpenAI built a better product.


I think ChatGPT just exacerbates a problem that was already pervading the free-internet business model: the ad revenue model is outdated and exhausted, with no clear alternative.

It maybe was unfair to telephone operators when connection automation was implemented, as it made operators obsolete, but the older model couldn't scale, in the same way that reading text from the source doesn't scale for human productivity.


Telephone operators. Definitely a nice update to the buggy whip metaphor. Thanks.

Also, I agree exactly. Advertising is increasingly useless. It's a tax on knowledge and it's gross. I can't wait for it to die.

I want to only pay for the stuff I use.


This argument came up a bunch a while back. I settled on the opinion that since it's already possible to buy summaries of books, I don't give a fart in a breeze where ChatGPT got its data.

E.g. Summary of How to Win Friends and Influence People: Effective Steps to Better Interpersonal Relationships by Book Lyte

ChatGPT does more of a mashup with the learned data than humans need to; that'll do me.


The problem is that this perspective is from copyright owners, not ChatGPT users. It's fine if you don't care, but what matters is: do courts and lawmakers care? Today is probably the right time for those types to get started on it.


> Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?

We can only hope. It’s unfair to someone that my browser can ask your server for a page, I see an ad for random bullshit nobody would ever care about, and money changes hands behind the scenes and that counts as an economic transaction which boosts GDP. It’s unfair (in my favour) that I can piggy back off this to get things for free.

And when I say “someone” I suspect “everyone”. Sadly, spending money advertising “Yorkshire woman finds guaranteed way to win on the horses” doesn’t seem to have caused anyone to run out of money and have the whole thing collapse yet. And it’s unfair on real small businesses with products, paying for adverts that people don’t see, or that are clicked by bots, or that are misreported, while all they can do is throw money at Google and Facebook and hope.


I kind of agree with you, but I think that's only because we've all been saturated with the idea of everlasting ownership of ideas.

Clearly, ownership of ideas runs out, because we all use linked lists or binary trees, or paper, or turbines or the list goes on. We don't pay money to the inventors of linked lists, or the heirs or successors-in-interest to the inventor of paper. Why not? When does ownership of an idea expire? Why do we unconsciously accept copyright or patent limits of today?

There's also an issue with simultaneous invention, but that's out of scope here. Clearly ChatGPT is just regurgitating or otherwise emitting previously-ingested material.


When this becomes properly entrenched I fear that it may create a disincentive to create original content. If that happens we will all be poorer for it in return for amazing access to what we already have. I don't think it is a good deal.


It's my opinion that royalties are the reason we have so much horrifying junk in our culture. They have created a world where we are inundated with cultural garbage that people produced only to squeeze money out of copyrights.

I dream of royalties going away so that we only get original content that was made for the love of expression, from a feeling that it's important. I would be happy to have a LOT less stuff to look at if I didn't have to sift through so much garbage.

Of course, I am also in favor of UBI so that those creators can eat while they are doing it.


At a minimum, systems like ChatGPT should be forced to link to their sources, so they give something back and their assertions can be verified - right now, ChatGPT is just good at bullshitting through questions.


It's akin to saying that everyone who writes a book today must give credit to everyone who contributed to the creation of modern written language and printing tools.


Yea, we've been seeing these posts over and over again on this same topic, and most of it for me boils down to

"If you applied the same set of rules to a human, how exactly would that look"

Simply put culture is the copying of each others ideas. When one of us started banging rocks together to make them sharp they didn't sell this idea to others, at best they traded sharpened arrows for something else.

The big issue with humans is that we are commonly very conservative in our ideas. "Yesterday I did X, today I did X, and tomorrow I'll do X" - fine and dandy until tomorrow a machine does X for nearly free. Instead of figuring out how to adapt our economic systems to deal with new systems of cheap and plenty, the fearful and the greedy are looking for ways to maximize their profit or to hold it back to maintain the status quo.


I think the copyright ethicality of the current class of AIs is about like religion or guns.

Discussion is pointless because everyone already has an opinion and it's very firm.


Big Tech companies have been scraping massive amounts of data for about two decades. Many smaller companies have tried to imitate them (remember when Big Data was the hottest thing out there? How do you think most of those startups obtained their data?) but pretty much all of them failed, mainly by running out of cash. OpenAI just happened to win the scraping lottery.


"your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should."


> I believe it's unfair to these sources that ChatGPT drives away their clicks, and in turn the ad income that would come with them

Google has been doing this in search results for years, and so has Bing. Apple also does it in their built-in dictionary.

Why rant about ChatGPT, which, at least currently, comes from a small company in comparison?


ChatGPT actually has some ideas about this.

Question: How could the people who generate content used in an AI language model be paid for their work?

Answer: There are several ways in which the people who generate content for an AI language model could be paid for their work:

    Royalty-based payment: Content creators could receive a percentage of the revenue generated from the use of their content in the AI language model.

    Token-based payment: If the AI language model is built on a blockchain, content creators could be paid in tokens that could be traded for cryptocurrency or fiat currency.

    Partnership with content publishers: The developers of the AI language model could partner with content publishers to compensate the creators of the scraped content.


Seems like this AI model doesn't understand very well how it works[0], which I suppose can be explained (maybe there aren't that many explanations in the corpus), but it's also quite ironic.

[0] The answers focus on the technicalities of how the payments could be arranged, but the much bigger problem is that it's not clear who the payments should be going to (there's no immediately obvious or unique way of attributing a given output to specific training inputs, that would require a separate model with a lot of room for judgement/modelling decisions; or a new type of LLM that has that feature baked in).


The scraped data was put into vector representations and used to create a model, which then generates, from scratch, unique sentences that summarize what the model relates together. The text results coming out are neither copyright infringement nor plagiarism.


You're saying it isn't plagiarism if one merely swaps a couple of things in the expression of the content?


The definition of plagiarism is "the practice of taking someone else's work or ideas and passing them off as one's own"... ChatGPT might infer from thousands or millions of different works or ideas before creating its own sentences, so I don't think that meets the definition of plagiarism.


In general no, but there is a problem in that ChatGPT may end up "regurgitating" large chunks of source material regardless, even if mechanistically that's not what it's trying to do. Similarly, it's recently been reported that Stable Diffusion has effectively memorized some entire images it was trained on, and is capable of generating those as output.

I don't think the word-by-word statistical mechanism of ChatGPT would stand up as a copyright defense in court. It's the output that counts, not the means of getting there. It'd be like me copying some copyright work word-for-word then trying to claim "well, your honor, I was only using that for inspiration, I was using my full creative abilities to write what I did, so you can't blame me if it's a word-for-word copy".

I think OpenAI (or any company with the resources to train such a model in the first place) could fairly easily self-police and check that what they generate isn't an exact (or almost exact) copy of something it was trained on. It's a bit like Shazam recognizing a song from a short snippet: you just need to generate some type of "hash code" for each generated sentence (or whatever level of granularity makes sense) and compare it against a database of hash codes from the source material it was trained on.
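A minimal sketch of what such a check could look like in Python - my toy scheme with made-up strings, not anything OpenAI actually does - hashing overlapping word n-grams ("shingles") of the output and looking them up in an index built over the training text:

    import hashlib

    def shingle_hashes(text, n=8):
        """Hash every overlapping n-word window of a text."""
        words = text.lower().split()
        return {
            hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest()
            for i in range(max(0, len(words) - n + 1))
        }

    # Index the training corpus once (toy example)...
    corpus_index = shingle_hashes(
        "the quick brown fox jumps over the lazy dog again and again"
    )

    # ...then flag generated text that overlaps it near-verbatim.
    generated = "fox jumps over the lazy dog again and again said the model"
    matches = shingle_hashes(generated) & corpus_index
    print(f"{len(matches)} matching 8-word windows")  # nonzero => near-verbatim copying

Shazam's audio fingerprints are fancier, but for text, shingle overlap is the standard cheap trick - it's roughly how plagiarism checkers work.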


There needs to be a “raw source” option that puts links to everything the model spits out that the user can enable or disable. This can give credit to whatever the model cites, and also help us understand a little of how it’s linking things together.


No more than I need to keep paying my teacher once I've learned what she knows. ChatGPT's value isn't in what it knows: it's in what it understands from your prompt in terms of that sea of information.


1 - It's on the internet. If Google can index it, it's treated as free. If you don't want that, use robots.txt (not that that ever stopped Google's spider from indexing your pages).

2 - The code model was trained on GitHub. GitHub is Microsoft. OpenAI is Microsoft money. So Microsoft trained its AI on code hosted with Microsoft. You disagree? Then GTFO of GitHub and don't feed Microsoft your code anymore.

3 (the most important point) - Q: "Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?"

Fuck YEAH! Please do. I hope the shit show that is the ad model crashes and burns to the ground. You can't use the internet without wearing solid armor: uBlock Origin and/or NoScript (or a Pi-hole if you want the same readable experience on the rest of the devices in your house).


Never realized how little data it was fed with. 570GB can fit on my laptop.


The original dataset was 45 TB.

The neural net model is condensed to 800 GB.

https://www.springboard.com/blog/data-science/machine-learni...

Note that the "compression" there also includes the "intelligence" that it presents - you might be able to get some powerful compression of English text... but you can't ask a gzip file to come up with a joke about cats and dinosaurs.


A 1 GB file would contain roughly 166,000,000 words. That includes the space between words, so the average word is 5 characters plus the space.

A typical single-spaced page is 500 words long.

That's roughly 190,000,000 full pages of text across the 570 GB.

I wonder if they excluded any duplicated text.


But it's not just words…


I thought LLMs were fed only text in their training data sets?

I've only done image classifiers and object detectors, so I was assuming they must be trained on similarly pure datasets.


> Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?

Hopefully. This would be the best outcome I can think of for the Internet.


Anybody have an idea of what kind of hardware (and cost) you would need to train the model and to run it?

Obviously storage is not a major factor here.


I seem to recall that the training cost for ChatGPT was in the tens of millions of dollars. Execution cost is on the order of ~$1 per interaction.


The cost per day isn't something that there are any reliable sources for.

The closest to an authoritative source on it is https://twitter.com/sama/status/1599671496636780546

> average is probably single-digits cents per chat; trying to figure out more precisely and also how we can optimize it

An attempt to work through it from related resources is https://twitter.com/tomgoldsteincs/status/160019698195510069...

In particular https://twitter.com/tomgoldsteincs/status/160019699090561433...

> So what would this cost to host? On Azure cloud, each A100 card costs about $3 an hour. That's $0.0003 per word generated.

> But it generates a lot of words! The model usually responds to my queries with ~30 words, which adds up to about 1 cent per query.

---

It is much less than $1/interaction.
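Multiplying the tweet's estimates out (assuming the ~30 words per reply he mentions):

    # Sanity check of the quoted figures.
    cost_per_word = 0.0003    # from the A100-at-$3/hr throughput estimate
    words_per_reply = 30      # his typical response length
    print(f"${cost_per_word * words_per_reply:.4f} per query")  # $0.0090

That lands right in sama's "single-digit cents" range, two orders of magnitude under $1.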


On that argument, I could see publishers trying to sue. If you ask GPT:

> What's the New York Times scrambled egg recipe?

GPT returns the exact recipe. If I were the NYT, I'd be frustrated: their content is now shown without the ad views or paywall.


This reminds me a bit of the criticism of “black box logic” for ML models.

Is there something analogous to saliency maps for LLMs?


My feeling is that one of the four happen:

1) AI is open sourced and we adapt stably. Either everybody has the opportunity to be their own business, or there is UBI.

2) AI is open sourced but it is unfairly distributed. Only some people are in a position to be their own business, and/or the UBI is shit.

3) AI is not open sourced, the wealthy edge out mankind and a planet scale genocide occurs.

4) None of it matters, because the looming war between the US and China explodes, or climate change wipes out any meaningful capacity we'd have to pursue AI.

Given the track record of our species, #1 feels like wishful thinking


The ad-based model of the internet is bad anyway. I don’t think chatGPT will break it, but we can hope!


I think it should reference the sources of the information, similar to any research paper or essay.


Can you justify the need for citations in casual conversation, or when asking it to write jokes?

If you are using GPT as a research tool, as opposed to asking a friend who is an expert in the subject, are you citing your friend when you write the paper, or are you going back and finding sources that back your friend's point up?


Why should it be less fair than what a search engine does?

It’s really just building a better model.


A search engine directs people to the original work. This doesn't.


Have you used Google in the last 5-10 years? It's been slowly parsing more and more information from websites so you never leave Google.


I'm aware of that, but I don't really consider that a part of search.


Maybe unfair is the wrong word. I think most agree that scraping, even at massive scale, is in itself fair. But is it sustainable?

Will LLMs drive interest/activity away from wikipedia.org? Will it put its own sources of high-quality ad-supported content -- wikihow.com, for example (though I can't be totally sure it scraped from there) -- out of business? Or is there an earth-shattering copyright suit against OpenAI in the works as we speak?

> Can this start breaking the ad-based model of the internet

Is the alternative that everything is behind some kind of paywall by default, to block scraping? Is that where we're heading?


Do you want companies to do this in private for private gain and not share it with you? Because making it illegal will just make it happen in greater secrecy.


This is essentially a defeatist argument flirting with supporting extortion, it seems to me. If you think that chatgpt is doing something wrong, this is arguing that you should allow the wrong to exist because there's nothing you can do about it.

In other areas of society where a bad thing cannot be stopped, we still use legislation to reduce the amount of it and mitigate some of the harm.


I don't think they're doing anything wrong. Indeed, I think they're performing a public service that none of the others seemed positioned to do. The default is to keep advances private.


I agree. ChatGPT should cite its sources.


I'm pretty sure ChatGPT doesn't know its sources. If it generated "the cat sat on the mat", then what (even from a theoretical POV) is the source of the word "mat"? Note that it's not pulling the whole "cat sat on the mat" sentence from anyplace - that's not how it works - it's generating one word at a time based on the statistics (collected over all the text it was fed) of which word is most likely to follow what came before.

So, who gets credit for the word "mat" being generated in that context? I guess any texts talking about cats and mats in close proximity may deserve some of the "credit", but it goes way deeper than that, since why did ChatGPT choose to output such a trite sentence (albeit while only selecting one word at a time), rather than something else about cats, or perhaps a more interesting thing that cats often sit in/on...

People seem to assume that ChatGPT is pulling entire "facts" from various sources, but that's just not how it works - it's just feeding all the texts into a giant meat grinder of word statistics. It knows about words, not facts.
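To make that concrete, here's a toy version of "one word at a time from statistics": a bigram sampler over a twelve-word corpus. ChatGPT uses a large neural network over tokens rather than raw bigram counts, but the word-at-a-time shape of the generation process is the same:

    # Toy illustration: pick each next word purely from bigram counts.
    import random
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat sat on the sofa".split()

    # Count which words follow which in the "training data".
    following = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        following[prev][nxt] += 1

    def generate(start, length=6):
        out = [start]
        for _ in range(length):
            options = following[out[-1]]
            if not options:
                break  # dead end: nothing ever followed this word
            words, counts = zip(*options.items())
            out.append(random.choices(words, weights=counts)[0])
        return " ".join(out)

    print(generate("the"))  # e.g. "the cat sat on the mat the"

Ask which source the word "mat" in that output came from, and the only honest answer is "the whole table of counts".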


> People seem to assume that ChatGPT is pulling entire "facts" from various sources, but that's just not how it works - it's just feeding all the texts into a giant meat grinder of word statistics. It knows about words, not facts.

...yes?

OpenAI: "challenge incorrect assumptions"


Yes, it’s 100% unfair but the net gain to society will be worth it in the long run. Got to break some eggs to make an omelet.


yes... and sometimes it's straight out copyright theft


Without asking it to reproduce copyrighted information, do you have any examples of this? Please remember to cite your sources.


It's your fault for making your IP free and public. Instead of posting for free on your web property, do it in a book that you charge for.


Is it unfair to present to coworkers thoughts you summarized or derived after reading ad-supported content?


Is that something you think is a good analogy for ChatGPT's use of its data sources?

To me it looks more like memorizing enough of other employees' project contributions to try passing it all off as your own achievements in performance review.


if it is ad-supported then you "bought and paid for" that content, no?

so in that case it wouldn't be unfair ... :-)

did ChatGPT pay for the content it is using? that was the original question...


Only if your learning the same things is cheating too.

"Copyright", "ingenuity of thought", etc. are concepts that need to be overhauled now that a lot more people have access to higher education.


How could training an AI on the works of Shakespeare possibly be unfair to him? Or to any other long dead person? - I don't see any issues

How could training an AI on the works of someone who has already been paid for them be unfair? - Possibly because it affects their future marketability and income.

Current authors, artists, and internet commenters clearly have a stake when the results of their creative endeavors are used for gain they won't benefit from. This is very similar to the extractive monopolies of YouTube and the rest of social media: their profit at our expense.



