Scraping isn’t illegal, and to be honest, I’m not even sure it’s unethical. I’m ...

wiseowise · on July 1, 2023

It’s ethical when your average Joe does it on a small scale to scrape their favorite YouTuber or to buy something when it becomes available.

When you have a financial incentive to build your business on someone’s data and you scrape literally millions if not billions of pages - it’s unethical.

specproc · on July 1, 2023

The thing with social media platforms is that this data is user-generated, so you've got the company "owning" user content.

This data is often of great public value. I track conversations around a social issue as part of my work for a non-profit.

I'd counter it's unethical to prevent people from accessing this data.

jnsaff2 · on July 1, 2023

I’m not disagreeing with your comment but

> great public value

Having been to twitter mostly through the most recent prominent war, man the signal to noise ratio is really low even when being careful about who to follow and who to block. There is so much disinformation, bad takes, uninformed opinions presented as facts, pure evil, etc.

So I guess it could be used for training very specific things or cataloging the underbelly of humanity but for general human knowledge it’s a frigging cesspool.

specproc · on July 1, 2023

OK, not gonna argue with that. There is, I guess, a perception that it matters because policy-makers, and the wonks and hacks that influence them are hooked. The value for me (and ergo the public, some classic NGO thinking there for you) lies in understanding those dynamics.

I do not use the Twitters myself, and actively discourage others from doing so. Sends people bonkers.

lmeyerov · on July 1, 2023

I mean, we have found election manipulations like large-scale inauthentic activity of out-of-staters explicitly targeting African Americans, and projects here even to the extent of the perpetrators getting indicted. Other projects were tracking vaccine side-effect self-reports faster than the CDC and other disaster intelligence.

We were actually gearing up to switch to paid accounts as we found use cases that could subsidize these efforts... And then the starting price for reasonably small volumes shot up to like $500k/yr.

bojo · on July 1, 2023

So, are we saying it's unethical for Google and other search engines who make money off of ad revenue to scrape sites like Twitter? Or are they paying a large sum to Twitter to do this?

raincole · on July 1, 2023

If Google doesn't provide a way to say "please don't scrape my site", then it's 100% unethical.

We have robots.txt. If Google doesn't respect that, it's unethical. Don't you think so?

RobotToaster · on July 1, 2023

Does twitter's robots.txt forbid scraping? Judging by the fact it shows up in Google I'd assume not.

pcthrowaway · on July 1, 2023

Maybe it's time for an llm.txt

Not that the people you want to respect that would

raincole · on July 1, 2023

The tricky part is it's much harder to prove that they didn't respect that.

spullara · on July 1, 2023

When there is a value exchange between the two entities that are relatively similar then I think it is ethical. People trade Google making money on ads for their site being found when people search. It is also possible to opt-out.

wiseowise · on July 1, 2023

They benefit mutually from their symbiosis. Financially, AI bro model #1321 doesn’t bring anyone value except their owners.

cygx · on July 1, 2023

If done against the wishes of the owner of the site, yes, I would consider that unethical. Thankfully, Google respects robots.txt and noindex.

zladuric · on July 1, 2023

But is it ethical for the site owner to block access to random people and companies on the internet to _my_ data? I posted that tweet with the expectation that it's gonna be publicly available. Now the owner of the site is breaking that expectation. I would say that this part is also unethical.

Especially since they're not moderating things or anything.

cygx · on July 1, 2023

> I would say that this part is also unethical.

Agreed. However, it's probably covered by their terms of service.

Same thing with the recent reddit kerfuffle. I'd have much preferred a Usenet 2.0 instead of centralizing global communications in the hands of a handful of private companies with associated user-hostile incentive structures.

max51 · on July 1, 2023

Being indexed by Google is optional. Twitter could stop it at any time if they thought it was a bad deal for them. That's not comparable to a startup company trying to scrape the entire site to train their AI and using sophisticated techniques to bypass protections Twitter has put in place.

kergonath · on July 1, 2023

Translation: it’s ethical when I do it.

wiseowise · on July 1, 2023

You wouldn’t download a car, would you?

pdntspa · on July 1, 2023

Except with modern software, some wannabe genius programmer will think they can get a bunch of money or cred or whatever by infantilizing the process down to something your grandma could use. Then, suddenly, everyone is scraping. The net effect is largely the same -- server operators see an overwhelming proportion of requests from bots. Still ethical?

pmlnr · on July 1, 2023

> I’m not even sure it’s unethical.

If it doesn't respect robots.txt, it is unethical.
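
For what it's worth, checking is cheap. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the `example.com` URLs are all made up for illustration; this is not any real site's file, and a real crawler would fetch the file from the site instead of hard-coding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example (not any real site's file).
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 1
Disallow: /search
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# "example-bot" is an assumed user agent name for the sketch.
allowed = parser.can_fetch("example-bot", "https://example.com/some/page")
blocked = parser.can_fetch("example-bot", "https://example.com/search?q=x")

# Seconds to wait between requests, or None if the file sets no Crawl-delay.
delay = parser.crawl_delay("example-bot")
```

A polite scraper would skip any URL where `can_fetch` returns `False` and sleep for `delay` seconds between requests when a delay is set.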

glimmung · on July 1, 2023

Is it ethical for the "public square" to have a robots.txt?

Musk is trying to have his cake and eat it...

(Clearly it's not a public square, but his position is incoherent).

rnbaxter · on July 1, 2023

Yes, it is ethical. In many countries it is legal for humans to walk around the public square and overhear all conversations.

It is NOT legal to install cameras that record everyone's conversations, much less sell the laundered results.

Pre-2023 people went on Twitter with the expectation that their output would be read by humans.

A traditional search engine is different: It redirects to the original. A bastardized search engine that shows snippets is more questionable, but still miles away from the AI steal.

RobotToaster · on July 1, 2023

Many countries have freedom of panorama, which means it is legal to video record the public square. I'm not aware if anywhere has specific laws on mounting the camera on a robot.

johnnyanmac · on July 1, 2023

> Pre-2023 people went on Twitter with the expectation that their output would be read by humans

Expectations =/= reality. And the reality is that bots have been reading comments for over a decade.

threeseed · on July 1, 2023

a) It looks to be permitted according to Twitter's robots.txt

b) Given Twitter is public, user-generated content which they don't own but simply have a license to, I wouldn't call it unethical in the slightest.

AmericanChopper · on July 1, 2023

If the background of the issue is as Musk described, then it certainly is not allowed by twitter’s robots.txt, which allows a maximum of one request per second.

I do a lot of data scraping, so I’m sympathetic to the people who want to do it, but violating the robots.txt (or other published policies) is absolutely unethical, regardless of the license of the content the service is hosting. Another way of describing an unauthorised use case taking a service offline is a denial of service attack, which (again, if Musk’s description of the problem is accurate) seems to be the issue Twitter was facing, with a choice between restricting services or scaling forever to meet the scrapers’ requirements.

Personally I would have probably tried to start with a captcha, but all this dogpiling just looks like low effort Musk hate. The prevailing sentiment on HN has become so passionately anti-Musk that it’s hard to view any criticism of him or Twitter here with any credibility.

echelon · on July 1, 2023

"You wouldn't download a car."

The only reason these websites and platforms aggregate any content at all is because they're effectively giant public squares.

dizhn · on July 1, 2023

No means no! :)

develop7 · on July 1, 2023

Moreover, isn't making scraping impossible illegal per a couple-of-years-old bill?