> Is it unfair for you to create content/products/etc after you have read and le...

Retr0id · on Feb 5, 2023

Being allowed to scrape something does not absolve you of all intellectual property, copyright, moral, etc. issues arising from subsequent use of the scraped data.

dawsoneliasen · on Feb 5, 2023

Exactly. Besides, the question isn’t about legality; it’s about what the law should be, I think. The question isn’t whether it’s legal but whether we need to change the law in response to technology.

faktory · on Feb 5, 2023

ChatGPT isn't doing the scraping, humans are. And humans are using computers to both read the article and create content or to scrape it.

So no, it's not a false equivalence.

anileated · on Feb 5, 2023

There’s a reason scraping is a legally grey area.

> Web scraping is legal, US appeals court reaffirms

First, the case is not closed. [0]

Second, to draw an analogy, you can use scraping in the same way you can use a computer: for legal purposes. That is, you cannot use scraping to violate copyright, just as you cannot use a computer to violate copyright.

The following being my conjecture (IANAL), there is fair use and there is copyright violation, and scraping can be used for either—it does not automatically make you a criminal, but neither is it automatically OK. If what you do is demonstrably fair use presumably you’d be fine; but OpenAI with its products cannot prove fair use in principle (and arguably the use stops being fair already at the point where it compiles works with intent to profit).

[0] https://news.ycombinator.com/item?id=31079231

cykros · on Feb 5, 2023

It seems the issue with scraping as it pertains to copyright isn't the scraping itself, any more than someone buying a book to sell cheap photocopies of it indicates a problem with buying books. The issue is the copying and, more importantly, the distribution of those copies.

Fair use of course being the exception.

Now, accessing things like credentials that get left in unsecured AWS buckets is the bigger area where courts are less likely to recognize the legality of scraping. Never mind the fact that these people literally published their private data on a globally accessible platform in a public fashion. I'm not a lawyer, but I've seen reports of this leaning both directions in court, and yes, I've seen wget listed as a "hacker tool."

This is what happens when feelings matter more to the legal system than principles.

And before it's brought up, I may as well point out that no, I don't condone the actual USE of obviously private credentials found in an AWS bucket any more than I condone the use of a credit card that one may find on the sidewalk. Both are clearly in the public sphere, unprotected, but for both there is a pretty good expectation that someone put it there by accident, and that it's not YOUR credential to use.

Basically, getting back to the OP, ChatGPT hasn't done anything I've seen that'd constitute copyright infringement -- fair use seems to apply fairly well. As for the ad-supported model, adblockers did this all first. If you wanted to stop anything accessing your site that didn't view ads, there are solutions out there to achieve this. Don't be surprised when it chases away a good amount of traffic though -- you're likely serving up ad-supported content because it's not content you expected your users to pay for to begin with.

faktory · on Feb 5, 2023

Yes but that's a technical issue. I took the parent as making a philosophical point and responded in that spirit.

williamcotton · on Feb 5, 2023

Wouldn’t it be nice if the people on these forums were not ignorant of both philosophy and the legal system before diving into incoherent conversations about both at the same time, where the main thrust is the emotions they have about these tools?

anileated · on Feb 5, 2023

One can dream.

faktory · on Feb 5, 2023

dmak · on Feb 5, 2023

How is it not scraping? There's no other way to get all that data for training a model without scraping.

faktory · on Feb 5, 2023

It's scraping both when humans do it and when the ChatGPT team do it, but that wasn't the point the parent made. He made a moral/philosophical point, which is what I responded to.

anonymouskimmer · on Feb 5, 2023

Check me on this because I'm not a software person:

When a person "scrapes" a website by clicking through the link, it registers as a hit on the website and, without filters being turned on, triggers the various ad impressions and other cookies. Also, if the person needs that information again, odds are they'll click on a bookmark or a search link and repeat the impression process all over again.

When an AI scrapes the web it does so once, and possibly in a manner designed to not trigger any ads or cookies (unless that's the purpose of the scrape). It's more equivalent to a person hitting up the website through an archive link.

jMyles · on Feb 5, 2023

> The question is if this data is legal to scrape

...it is? I didn't see that question raised in OP's text at all. What do legacy human legalities have to do with how AI will behave?

> Because it's false equivalence? ChatGPT isn't a human being.

Is this important? What is so special about human learning that it puts it in a morally distinct category from the learning that our successors will do?

It sounds like OP is concerned with the ad-driven model of income on the internet, and whether it requires breaking in order for AI to both thrive and be fair.

venv · on Feb 5, 2023

>Is this important?

Well yes, it's the whole crux of the matter. Laws govern human behaviour. As of 2023, only living beings have agency. If I shoot someone with a gun, the criminal is me and not the gun. Being a deterministic piece of silicon, a computer is perfectly equivalent. Sure, it is important to start a discussion of potential nonhuman sentience in the future, but these AI models are not unlike any previous software in legal issues. It's bizarre to me how many people are missing this.

jMyles · on Feb 6, 2023

> these AI models are not unlike any previous software in legal issues

Agreed. However, the previous 'legal issues' related to software and the emergence of the internet are also difficult to take seriously when considered on anything but extremely short time scales.

Every time we swirl around this topic, we arrive at the same stumbles which the legacy legal system refuses to address:

* If something happening on the internet is illegal, _where_ is it illegal? Different jurisdictions recognize different jurisdictional notions - they can't even agree on whose laws apply where. If you declare something to be illegal in your house, does that give it the force of law on the internet? Of course not. Yet, the internet doesn't recognize the US state any more than it does your household. It seamlessly routes around the "laws" of both.

* The "laws" that the internet is bound to follow are the fundamental forces of physics. There is no - and can be no - formal in-band way for software to be bound to the laws of men, because signals do not obey borders. The only way to enforce these "laws" is out-of-band violence.

* States continuously, and without exception, find themselves at a disadvantage when they make the futile effort to stem the evolution of the internet. For example, only 30 years ago (a tiny speck on evolutionary time scales), the US state gave non-trivial consideration to banning HTTPS.

I understand that people sometimes follow laws. But they also often don't. The internet has already formed robust immunity against human laws.

Whatever human laws are, they are not the crux of anything related to evolution of software. They are already routinely cast aside when necessary, and are very clearly headed for total irrelevance.

_qzu4 · on Feb 5, 2023

> It's bizarre to me how many people are missing this.

Very much this. I am too tired right now to engage with other responders, but thank you for articulating precisely the point I want to make.

bilsbie · on Feb 5, 2023

> It's a product that is built upon data from other sources.

To be fair, so are you.