Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?
People get internet hostile at me for this question, but it really is that simple. They've automated you, and it's definitely going to be a problem, but if it's acceptable for your brain to do the same thing, you're going to have to find a different angle to attack it than "fairness".
> Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?
Because it's false equivalence? ChatGPT isn't a human being. It's a product that is built upon data from other sources.
Being allowed to scrape something does not absolve you of all intellectual property, copyright, moral, etc. issues arising from subsequent use of the scraped data.
Exactly. Besides, the question isn't really about legality; it's about what the law should be. The question isn't whether it's legal, it's whether we need to change the law in response to the technology.
> Web scraping is legal, US appeals court reaffirms
First, the case is not closed. [0]
Second, to draw an analogy, you can use scraping in the same way you can use a computer: for legal purposes. That is, you cannot use scraping to violate copyright, just as you cannot use a computer to violate copyright.
The following is my conjecture (IANAL): there is fair use and there is copyright violation, and scraping can be used for either—it does not automatically make you a criminal, but neither is it automatically OK. If what you do is demonstrably fair use, presumably you'd be fine; but OpenAI with its products cannot prove fair use in principle (and arguably the use stops being fair at the point where it compiles works with intent to profit).
It seems the issue with scraping, as it pertains to copyright, isn't the scraping itself, any more than buying a book in order to sell cheap photocopies of it indicates a problem with buying books. The issue is the copying, and more importantly, the distribution of those copies.
Fair use of course being the exception.
Now, accessing things like credentials that get left in unsecured AWS buckets is the bigger area where courts are less likely to recognize the legality of scraping. Never mind the fact that these people literally published their private data on a globally accessible platform in a public fashion. I'm not a lawyer, but I've seen reports of this leaning both directions in court, and yes, I've seen wget listed as a "hacker tool."
This is what happens when feelings matter more to the legal system than principles.
And before it's brought up, I may as well point out that no, I don't condone the actual USE of obviously private credentials found in an AWS bucket any more than I condone the use of a credit card that one may find on the sidewalk. Both are clearly in the public sphere, unprotected, but for both there is a pretty good expectation that someone put it there by accident, and that it's not YOUR credential to use.
Basically, getting back to the OP, ChatGPT hasn't done anything I've seen that'd constitute copyright infringement -- fair use seems to apply fairly well. As for the ad-supported model, adblockers did this all first. If you wanted to stop anything accessing your site that didn't view ads, there are solutions out there to achieve this. Don't be surprised when it chases away a good amount of traffic though -- you're likely serving up ad-supported content because it's not content you expected your users to pay for to begin with.
Wouldn’t it be nice if the people on these forums were not ignorant of both philosophy and the legal system before diving into incoherent conversations about both at once, where the main thrust is the emotions they have about these tools?
It's scraping both when humans do it and when the ChatGPT team does it, but that wasn't the point the parent made. He made a moral/philosophical point, which is what I responded to.
Check me on this because I'm not a software person:
When a person "scrapes" a website by clicking through the link it registers as a hit on the website and, without filters being turned on, triggers the various ad impressions and other cookies. Also if the person needs that information again odds are they'll click on a bookmark or a search link and repeat the impression process all over again.
When an AI scrapes the web it does so once, and possibly in a manner designed to not trigger any ads or cookies (unless that's the purpose of the scrape). It's more equivalent to a person hitting up the website through an archive link.
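To make the distinction concrete (the sample page and ad URL below are invented for illustration): a scraper receives the raw HTML once and never executes the scripts that fire ad impressions or set cookies, whereas a browser runs them on every visit. A minimal Python sketch using only the standard library:

```python
from html.parser import HTMLParser

# A sample page exactly as a scraper would receive it: raw HTML, fetched once.
# The <script> ad beacon below is just text to a scraper; nothing executes it.
PAGE = """
<html><body>
  <h1>Article title</h1>
  <p>Article body text.</p>
  <script src="https://ads.example/beacon.js"></script>
</body></html>
"""

class TextOnly(HTMLParser):
    """Collects visible text and counts scripts without running them."""
    def __init__(self):
        super().__init__()
        self.text = []
        self.scripts_seen = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts_seen += 1  # observed in the markup, never executed
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if not self._in_script and data.strip():
            self.text.append(data.strip())

parser = TextOnly()
parser.feed(PAGE)
print(parser.text)          # ['Article title', 'Article body text.']
print(parser.scripts_seen)  # 1, the ad beacon was seen but never ran
```

A browser rendering the same page would fetch and execute `beacon.js` every time, which is where the impression gets counted.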
...it is? I didn't see that question raised in OP's text at all. What do legacy human legalities have to do with how AI will behave?
> Because it's false equivalence? ChatGPT isn't a human being.
Is this important? What is so special about human learning that it puts it in a morally distinct category from the learning that our successors will do?
It sounds like OP is concerned with the ad-driven model of income on the internet, and whether it requires breaking in order for AI to both thrive and be fair.
Well yes, it's the whole crux of the matter. Laws govern human behaviour. As of 2023, only living beings have agency. If I shoot someone with a gun, the criminal is me and not the gun. Being a deterministic piece of silicon, a computer is perfectly equivalent. Sure, it is important to start a discussion of potential nonhuman sentience in the future, but these AI models are not unlike any previous software in legal issues. It's bizarre to me how many people are missing this.
> these AI models are not unlike any previous software in legal issues
Agreed. However, the previous 'legal issues' related to software and the emergence of the internet are also difficult to take seriously when considered on anything but extremely short time scales.
Every time we swirl around this topic, we arrive at the same stumbles which the legacy legal system refuses to address:
* If something happening on the internet is illegal, _where_ is it illegal? Different jurisdictions recognize different jurisdictional notions - they can't even agree on whose laws apply where. If you declare something to be illegal in your house, does that give it the force of law on the internet? Of course not. Yet, the internet doesn't recognize the US state any more than it does your household. It seamlessly routes around the "laws" of both.
* The "laws" that the internet is bound to follow are the fundamental forces of physics. There is no - and can be no - formal in-band way for software to be bound to the laws of men, because signals do not obey borders. The only way to enforce these "laws" are out-of-band violence.
* States continuously, and without exception, find themselves at a disadvantage when they make the futile effort to stem the evolution of the internet. For example, only 30 years ago (a tiny speck in evolutionary time scales), the US state gave non-trivial consideration to banning HTTPS.
I understand that people sometimes follow laws. But they also often don't. The internet has already formed robust immunity against human laws.
Whatever human laws are, they are not the crux of anything related to evolution of software. They are already routinely cast aside when necessary, and are very clearly headed for total irrelevance.
Does anyone actually find these arguments persuasive?
There is really no reason to believe that what chatGPT or stable diffusion does is anything like what "your brain" does--except in the most superficial, inconsequential way.
Second, try applying this logic to literally anything else and you'll see why it's absurd:
"You can't ban cars from driving on sidewalks! If it's acceptable for people to walk on sidewalks, then it has to be acceptable for cars to drive on sidewalks, since it's just automated walking"
"You can't ban airplanes from landing in ponds. They fly 'just like' ducks fly! So if it's acceptable for them, it must be acceptable for airplanes too"
Yes, and: why shouldn’t it matter that in one case it is a person and in another it is a computer program?
Why would it be incoherent to say “I’m okay with a person reading, synthesizing, and then utilizing this synthesis—but I’m not okay with a company profiting off of a computer doing the same thing.” What’s wrong with that?
But again, like you and others have said, it’s really not the same thing at all! All ChatGPT (or any other deep learning model) is capable of doing is synthesizing “in the most superficial way.” What a person does is completely different, much more interesting.
I also agree it's not the only argument and ultimate proof.
I don't, at this point, have an answer. I'm sure this miraculous new technology will survive the luddite attacks, but there will probably be some tense moments, and some jurisdictions will choose to be left behind.
You usually expect people to cite sources. Granted, that very often doesn't happen, and the amount of citing expected depends on the context. But ChatGPT just doesn't cite sources at all. I think there's a case to be made that they should.
With search engines, it does feel like there is a clearer trade: scraping access in exchange for web traffic.
With ChatGPT the traffic benefit isn’t there, so it feels like it isn’t a fair trade.
Google adding the context and data to their search results page also started blurring this trade, making it unnecessary to click through to the site the info was gleaned from.
How does someone cite a source when they are using GPT to convert a box score into an entertaining paragraph about a baseball game? Or to convert a natural language command into a JSON format ready for downstream processing?
Humans have a pretty good sense of when you need to cite sources, and when you don't. For example, long ago I learned from some website how to write a for-loop in python, and now I write them all the time without giving credit. I'm okay with ChatGPT writing a for-loop without citing its source.
I would say most knowledge about words/grammar/laws of nature can be taken for granted without a citation, but there are some important exceptions where things must be cited. I don't know how you'd reliably teach the difference to a computer though.
And yet, exactly in this example, I HATE that people don't put sources. Perhaps not for "for loops", but search anything simple in python. "Python JSON output", for example, and you will find a billion articles that describe a simple python library... but DON'T link to python.org or the "javadoc". They're always discussing the most blatantly obvious simple thing, never remotely complete, never linking to where you can actually find more info (but jobs, courses, ads... those will be linked).
It's getting me to the point of refusing to use Google, or only use Google with "site:...". I mean, the site varies, but without site limits Google's becoming useless.
ChatGPT doesn't have a concept of sources. It has weights that together define a function that allow it to guess the most likely next word from the context. As a neat side effect of this contextual next-word guessing, it often can share accurate information.
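A toy sketch of that next-word guessing loop (the tiny score table below stands in for billions of learned weights, and is entirely made up; real models score candidates with a transformer over the whole context, but the decoding principle is the same):

```python
import math

# Toy stand-in for learned weights: a raw score for each candidate next
# word given the previous word. There are no "sources" anywhere in here,
# only numbers distilled from training.
SCORES = {
    "share": {"accurate": 2.0, "information": 0.5, "banana": -1.0},
    "accurate": {"information": 3.0, "banana": -2.0, "share": 0.0},
}

def next_word_distribution(last_word):
    """Softmax the raw scores into a probability distribution."""
    scores = SCORES[last_word]
    total = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / total for w, s in scores.items()}

def generate(start, n):
    """Repeatedly append the most likely next word (greedy decoding)."""
    out = [start]
    for _ in range(n):
        dist = next_word_distribution(out[-1])
        out.append(max(dist, key=dist.get))
    return out

print(generate("share", 2))  # ['share', 'accurate', 'information']
```

The point of the sketch: the output is produced by scoring candidates, not by looking anything up, which is why attaching a citation after the fact is not a natural operation for the model.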
If ChatGPT were required to share its sources, it would need a completely different approach. I'm not commenting on whether or not that would be a bad thing, but it would render the current iteration completely useless. You can't strap a source-crediting mechanism on top of a transformers-based model after the fact.
> You can't strap a source-crediting mechanism on top of a transformers-based model after the fact.
I've read that ChatGPT is not connected to the net, but if it was: Couldn't you have it do a google search (or better yet corpus search) for the string it generated and then return the most significant matches (significance by string matching, not google rank)? It would be really crude, but wouldn't this just be a handful of lines of code that don't interfere with the "transformers-based model" code at all?
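Something like that could indeed be sketched as a crude post-hoc lookup over a corpus (the corpus, site names, and scoring here are invented for illustration; this is plain word-overlap matching, not anything OpenAI ships):

```python
def ngrams(text, n=2):
    """The set of word n-grams in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def rank_sources(generated, corpus, n=2):
    """Rank candidate documents by how many word n-grams they share
    with the generated text. Higher overlap = more likely source."""
    gen = ngrams(generated, n)
    scored = [(len(gen & ngrams(doc, n)), name) for name, doc in corpus.items()]
    return sorted(scored, reverse=True)

# Hypothetical mini-corpus standing in for a web index.
corpus = {
    "site-a": "ronald mcdonald has been the mascot clown since 1963",
    "site-b": "the whopper is flame grilled and fresh",
}
print(rank_sources("the clown has been around since 1963", corpus))
# [(2, 'site-a'), (0, 'site-b')]
```

It would be exactly as crude as described: string matching finds texts that *resemble* the output, not the texts the model actually drew on, but it wouldn't touch the model code at all.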
Why couldn't you, as a human, do that to verify it?
The other day I had GPT write a rap battle between Burger King and Ronald McDonald. One of the stanzas came back:
Burger King:
Your burgers are plain, your buns a bore.
Your clown's been around since '63,
I'm sure my flame-grilled taste will leave you impressed
My burgers are fresh, my fries are the best
So here's the challenge for you - who do you compensate for that line?
The complaint that people have isn't that GPT isn't citing its sources but rather that it isn't compensating the people who created the data that has that information.
... and now, if you're ever asked about historical clown trivia and pull out the "Ronald has been around since 1963", who should you give a royalty to? Me (for writing this), GPT (for making me aware of it), Wikipedia (for the source of my links in this post), the estate of Willard Scott for the Joy of Living (which Wikipedia cites), some random blog author that had some clown trivia on it that happened to have been part of the training set for GPT?
Because I want to credit not verify? Because I want to trace the flow of information?
It isn't just monetary compensation that's important here.
I come at this from the point of view of a scientist who is expected to reference ideas. Not necessarily back to their original source, but at least back to a source that can theoretically point back to another link in the chain.
Sure, I can manually search for a reference based on what ChatGPT gave me. Or someone could spend a few minutes adding a few lines of code to ChatGPT to save millions of people some minutes of time.
-----
What would be awesome is an LLM that you can feed data to, and it can then write a paper based solely on the data you feed it.
I've still got the question - who should I credit with the bit that Ronald has been around since 1963?
I had it write a poem the other day in the style of Roses are Red, about coffee and bacon.
Roses are red
Bacon is greasy
My coffee is hot
Together they please me
If this is something that someone considers to be a derivative work of other things... who do I credit?
Identify a word that has different meanings to two different professions at the same time, and the professions that use it. Give the definition of the word for each profession. Write a joke using this word.
to which I got back:
The word is "band."
Definition for a Musician: A group of musicians who play music together.
Definition for an Astronomer: A dark region in the sky with less stars.
Joke: What did the astronomer say when the musician asked him to join his band? "I'm sorry, I don't do solos in the dark!"
How do you credit that?
---
> What would be awesome is an LLM that you can feed data to, and it can then write a paper based solely on the data you feed it.
> If this is something that someone considers to be a derivative work of other things... who do I credit?
Based on a quick search the best credits would be ChatGPT as the arranger, and "Roud Folk Song Index number 19798" as the inspiration.
> "Joke: What did the astronomer say when the musician asked him to join his band? "I'm sorry, I don't do solos in the dark!""
> "How do you credit that?"
That you credit to ChatGPT. It's not referencing facts or discoveries, so credit isn't as important as it is for articles. If you want to credit an inspiration then I'm sure there's an index of joke forms out there that has an appropriate number to cite.
I can't actually find a definition for band in astronomy that is "a dark region in the sky with less stars." So it seems to be a pretty poor joke.
This does it solely based on the data you feed into it? And by data I mean scientific data that you discovered, and want formatted into a particular research article style.
Edit to add: Possible sources for the line "together they please me":
Why did you pick that index rather than some other source material? Roses are red dates back to 1784 (year not index number) as a nursery rhyme. Does it need to be credited or is it in the public consciousness to the point where one can create a poem based on it without knowing its original source?
Write a haiku about bacon and coffee. Identify the syllable count for each word and line used in the haiku.
Example:
Bacon (2) sizzles (2)
Aroma (3) of (1) coffee (2) too (1)
Mouthwatering (4) bliss (1)
Smoky (2) bacon (2)
Brewing (3) coffee (2) aroma (3)
Makes (1) mornings (2) bright (2)
The second poem is from GPT. Do we need to credit the dictionary where it got the syllable count for each word? Or where it got that coffee (rather than bacon) is brewed? Or that bacon and coffee are things more often consumed in the morning?
Identify four foods or beverages that are frequently consumed in the morning and how each is prepared for breakfast.
1. Coffee: prepared by brewing hot water over ground coffee beans.
2. Cereal: prepared by pouring cereal into a bowl and adding milk.
3. Toast: prepared by toasting bread and adding butter and/or jelly.
4. Eggs: prepared by scrambling, frying, poaching, or boiling them.
There is a difference between "identifying a source where this information can be found" and "this is the (copyrighted) source of the data that GPT used to draw upon to come up with the statement."
The first is an exercise for the reader (and much better done and evaluated by the reader). The second is what people are concerned about.
Scientific documents - certainly. If you are writing a research paper or encyclopedia, I expect it to be well cited.
If you are writing something that is synthesizing knowledge (not just reporting the facts), the "where are all the places that knowledge came from" is an impossible task for human or machine.
If I ask GPT to create a poem in the style of Roses are Red about coffee and bacon, why should that request need to be cited with the same degree of scrutiny as an encyclopedia or research paper?
If, on the other hand, you're trying to use GPT to write such a paper... I would hold that you're doing it wrong. It doesn't do that well. The model is "about" transforming language. To do so, it contains a fair bit of 'knowledge' needed to do that accurately. OpenAI makes no claims about the accuracy of the content that GPT produces (it has improved and can answer more accurately, but if you want to know the answer it is no better than your next-door neighbor who has read a lot).
If you are claiming that the example of Bacon is Greasy poem that GPT wrote is infringing any more than a child's "roses are red, my cat is orange, his eyes are green, nothing rhymes with orange" then I believe you will face an uphill battle.
To say that there is plagiarism and infringement going on - it needs examples rather than a "I think it works this way and is just regurgitating material it was fed from elsewhere."
Oh, wait, I'm not going to cite sources in a non-scientific work as this leads to madness. The following is a previous post of mine on HN
"Your mind exists in a state where it is constantly 'scraping' copyrighted work. Now, in general limitations of the human mind keep you from accurately reproducing that work, but if I were able to look at your output as an omniscient being it is likely I could slam you with violation after violation where you took stylization ideas off of copyrighted work.
RMS covers this rather well in 'The right to read'. Pretty much any model that puts hard ownership rules on ideas and styles leads to total ownership by a few large monied entities. It's much easier for Google to pay some artist for their data that goes into an AI model. Because the 'google ai' model is now more culturally complete than other models that cannot see this data Google entrenches a stronger monopoly in the market, hence generating more money in which to outright buy ideas to further monopolize the market."
If I woke up tomorrow and breathed all the oxygen, nobody else could breathe. But if I woke up tomorrow and read all the websites on the internet, it wouldn't stop other people from reading them too.
> But If I woke up tomorrow and read all the websites on the internet, it wouldn't stop other people from reading them too.
If you became the first-line, go-to source for the information on those websites, those websites would stop getting click-throughs. Eventually it would become less and less worthwhile (economically or emotionally) for the people keeping those sites running to keep them running. It would become more and more difficult for people to find those sites even if they are running, or even the archives of those sites.
So yes, eventually you'd stop people from reading them too.
It is in fact that simple. There are dozens, hundreds, perhaps thousands of legitimate, genuine, serious, real reasons to be concerned and "want something to be done". This isn't it.
"Learning is unfair" is not an argument you want to win.
>Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?
The difference is in scale.
A human video game designer can consume other people's art, then sell their labor to a video game developer. The amount of value captured by the video game designer rounds down to zero as a percentage of the economic value created by 'video game art'.
OpenAI can consume all of the video game artists, ever, create an art design product and capture a significant percentage of the economic productivity of video game art.
At a human level it falls below the noise floor. It's a fact of life that humans will learn and build from experience.
The difference is scale. At scale it becomes a problem.
Edit: I don't know how to satisfy all parties. This shakes the foundation of copyright. Perhaps we are all finding out how valuable good information truly is and especially in aggregate. We have created proto-gods.
At scale, it becomes a wonderful tool. Are the people in this thread so threatened or so invested in the current business models of the internet that they can’t see how amazing this sort of thing could be for our abilities as a species? Not just in its current iteration, but it will get better and better.
This could be an excellent brain augmentation, trying to hamper it because we want to force people to drag themselves through underlying sources so those sources can try to steal their attention with ads for revenue is asinine.
It is a wonderful tool but I still feel that the creators of the training data are getting shafted. I'm both amazed and horrified at our creation and what it portends.
Yeah, there will probably have to be some adjustment. In the future, maybe an ML agent will hire people to go find answers for it about questions it has, using us as researchers/mechanical Turks :-) Quality matters more than quantity for something that’s trying to understand the world well and not just building a statistical language model, I imagine that it will be worth it to pay for quality when training heavily used models, to avoid using garbage info. You don’t need 30 different superficial product reviews with a bunch of SEO text if you have one that’s very thoroughly researched.
And in the meantime, with ads no longer working, maybe crypto is actually useful for something here: Lightning makes very small transactions possible with basically no fees, and makes it easy to pay for things programmatically. People hate being nickel-and-dimed, but a professional trying to construct an ML model could reasonably budget for usage fees for fast, unhindered access to quality training data. An agent could even weigh its likelihood of learning something new/accurate against the cost proposed by the server, and choose which subsets to pull.
Just a random idea, but I hope we don’t fight tooth and nail to preserve the trash heap of the internet’s current state.
The internet has always been a trash heap. We've just been creating new heaps with parts of the old heaps every few years or so. Sure, it's nice to imagine a future in which this isn't the case, but your imagination is not going to be the future reality.
People are already being paid to curate data for models, I’m mostly suggesting that that might become a major revenue source, and that ads might be less relevant in a world where people don’t need to sift through the trash heap to get info (and that’s a good thing overall!)
Wouldn't an AI-driven search engine be even better than a language model for that purpose though? It could even snippet highlight the most relevant parts of various web pages to save on the sifting.
Maybe, and arguably, that's what Google has been doing. But one thing I really like about the idea of using a model directly is that there's one interface to learn, whereas with web search, I'm constantly adapting to a grab bag of page types and UX conventions.
Do you think just maybe there is a difference here because humans need money to survive, and maybe we should have compassion for humans who could hypothetically starve or freeze or suicide or whatever because they have no money? Or is it just silly to care about people like that?
That's got nothing to do with whether or not it is "fair" for a learning system to produce content after it has learned.
That is, instead, one of the larger and vastly more important sociocultural issues that actually warrants attention, but never receives it in sufficient degree to address the problem, because, for example, we're arguing whether automated learning is "fair".
If "fairness" isn't worth figuring out for a society, why is our entire economic order nominally built ontop of such a virtue? How is this not the very thing we are talking about? People starve on the streets right now because any other arrangement of resources has been deemed "unfair." Do we not sign a contract for our labor or for our homes because of shared idea of fairness? Fairness is the ultimate thing we appeal to in our world, it is the only thing that can sustain the intense individuality of the modern world. Dont ambiguate it as a Nietzchean moral fairness here, we are talking about the pseudo-algorithmic fairness of a market which guarantees certain things if you trade enough of your resources.
Well, the luddites were right in that they were fighting a good and honorable fight.
Until things are, in fact, solved by whatever idea you might have, why should we just accept each new thing that makes our human lives more intolerable? How could you expect any rational person to have that kind of blind trust in a technology, much less "progress" itself, when every single aspect of our world shows that it is whoever owns the technology that actually benefits from it? I think it is much crazier to just totally roll over for each new thing that takes your job than to maybe fight for your food and shelter.
I think we can do better than UBI, but either way, fighting against this unfairness is fighting for the things we need to continue with some shred of humanity, insofar as this technology is and will be an agent for the consolidation of labor and profit. Its all the same fight, and the historical luddites understood this consciously or not.
Who knows, maybe the internet would have been better off if some people were brave enough to smash some of Google's servers in like 2006..
This is a response to an argument the GP didn't make. One can still have grave concerns about generative AI's potential impact on human society while accepting there is nothing fundamentally unfair about how it scrapes publicly accessible data.
Ok, so, what is that argument then? Maybe you just want to say: "well this is such a big deal its going to change everything anyway, so we can't judge it on how it will affect today's society, but rather the society it will create." If so, all one can really say to such longtermism is: "well, good luck with that I guess, I will keep trying to survive over here."
> Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?
It’s fair if you do this; neither fair nor legal [0] when a commercial for-profit entity, backed by a large corporation, does it at scale and capitalizes on that.
Imagine if you were a webmaster and Google unilaterally decided to stop sending users to content you have worked to research and write, and instead aggregated it and showed the answer to user’s query entirely on its own pages, without any attribution or payment to you. Unimaginable, yet that is very much the scenario unfolding now. [1]
Scraping at this kind of scale is out of your (or any given individual’s) reach. It is, however, within reach of the likes of Microsoft (on whose billions OpenAI basically exists) and Google (who, to be fair, have not abused it in such a blatant way so far).
[0] It is clearly using someone else’s works for commercial purposes, including to create derivative works. (Again, it’s different from you creating a work derivative from someone else’s work you read previously, because in this case a corporation does it at scale for profit.)
[1] And the cynic in me says the only reason we are not yet out with pitchforks is simply because OpenAI is new and shiny and has “open” in its name (never mind the shadow of Microsoft looming all over it), while Google is an entrenched behemoth that we all had some degree of dissatisfaction with in the past and thus are constantly watching out for.