Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Scraping isn’t illegal, and to be honest, I’m not even sure it’s unethical. I’m assuming you think it so — if so, why? I’m not disagreeing, but haven’t given it much thought.


It’s ethical when your average Joe does it on a small scale to scrap their favorite favorite YouTuber or to buy something when it becomes available.

When you have financial incentive to build your business on someone’s data and you scrap literally millions if not billions of pages - it’s unethical.


The thing with social media platforms is that this data is user-generated, so you've got the company "owning" user content.

This data is often of great public value. I track conversations around a social issue as part of my work for a non-profit.

I'd counter it's unethical to prevent people from accessing this data.


I’m not disagreeing with your comment but

> great public value

Having been to twitter mostly through the most recent prominent war, man the signal to noise ratio is really low even when being careful about who to follow and who to block. There is so much disinformation, bad takes, uninformed opinions presented as facts, pure evil, etc.

So I guess it could be used for training very specific things or cataloging the underbelly of humanity but for general human knowledge it’s a frigging cesspool.


OK, not gonna argue with that. There is, I guess, a perception that it matters because policy-makers, and the wonks and hacks that influence them are hooked. The value for me (and ergo the public, some classic NGO thinking there for you) lies in understanding those dynamics.

I do not use the Twitters myself, and actively discourage others from doing so. Sends people bonkers.


I mean, we have found election manipulations like large-scale inauthentic activity of out-of-staters explicitly targeting African Americans, and projects here even to the extent of the perpetrators getting indicted. Other projects were tracking vaccine side-effect self-reports faster than the CDC and other disaster intelligence.

We were actually gearing up to switch to paid accounts as we found use cases that could subsidize these efforts... And then the starting price for reasonably small volumes shot up to like $500k/yr.


So, are we saying it's unethical for Google and other search engines who make money off of ad revenue to scrape sites like Twitter? Or are they paying a large sum to Twitter to do this?


If Google doesn't provide a way to say "please don't scrape my site", then it 100% unethical.

We have robots.txt. If Google doesn't respect that, it's unethical. Don't you think so?


Does twitter's robots.txt forbid scraping? Judging by the fact it shows up in Google I'd assume not.


Maybe it's time for an llm.txt

Not that the people you want to respect that would


The tricky part is it's much more harder to prove that they didn't respect that.


When there is a value exchange between the two entities that are relatively similar then I think it is ethical. People trade Google making money on ads for their site being found when people search. It is also possible to opt-out.


They benefit mutually from their symbiosis. Financially, AI bro model #1321 doesn’t bring anyone value except their owners.


If done against the wishes of the owner of the site, yes, I would consider that unethical. Thankfully, Google respects robots.txt and noindex.


But it's it ethical for the site owner to block access to random people and companies in the internet to _my_ data? I posted that tweet with the expectation that it's gonna be publicly available. Now the owner of the site is breaking that expectation. I would say that this part is also unethical.

Especially since they're not moderating things or anything.


I would say that this part is also unethical.

Agreed. However, it's probably covered by their terms of service.

Same thing with the recent reddit kerfuffle. I'd have much preferred a Usenet 2.0 instead of centralizing global communications in the hands of a handful of private companies with associated user-hostile incentive structures.


Being indexed by google is optional. Twitter could stop it a any time if they thought it was a bad deal for them. That not comparable to a startup company trying to scrape the entire site to train their AI and using sophisticated techniques to bypass protections Twitter has put in place


Translation: it’s ethical when I do it.


You wouldn’t download a car, would you?


Except with modern software, some wannabe genius programmer will think they can get a bunch of money or cred or whatever by infantilizing the process down to something your grandma could use. Then, suddenly, everyone is scraping. The net effect is largely the same -- server operators see an overwhelming proportion of requests from bots. Still ethical?


> I’m not even sure it’s unethical.

If it doesn't respect robots.txt, it is unethical.


Is it ethical for the "public square" to have a robots.txt?

Musk is trying to have his cake and eat it...

(Clearly it's not a public square, but his position is incoherent).


Yes, it is ethical. In many countries it is legal for humans to walk around the public square and overhear all conversations.

It is NOT legal to install cameras that record everyone's conversations, much less sell the laundered results.

Pre-2023 people went on Twitter with the expectation that their output would be read by humans.

A traditional search engine is different: It redirects to the original. A bastardized search engine that shows snippets is more questionable, but still miles away from the AI steal.


Many countries have freedom of panorama, which means it is legal to video record the public square. I'm not aware if anywhere has specific laws on mounting the camera on a robot.


>Pre-2023 people went on Twitter with the expectation that their output would be read by huma ns

Expectations =/= reality. And the reality is that bits have been reading comments for over a decade.


a) It looks to be permitted according to Twitter's robots.txt

b) Given Twitter is public, user generated content which they don't own but simply have a license I wouldn't call it unethical in the slightest.


If the background of the issue is as Musk described, then it certainly is not allowed by twitter’s robots.txt, which allows a maximum of one request per second.

I do a lot of data scraping, so I’m sympathetic to the people who want to do it, but violating the robots.txt (or other published policies) is absolutely unethical, regardless of the license of the content the service is hosting. Another way of describing an unauthorised usecase taking a service offline is a denial of service attack, which (again, if Musk’s description of the problem is accurate) seems to be the issue Twitter was facing, with a choice between restricting services or scaling forever to meet the scrapers requirements.

Personally I would have probably tried to start with a captcha, but all this dogpiling just looks like low effort Musk hate. The prevailing sentiment on HN has become so passionately anti-Musk that it’s hard to view any criticism of him or Twitter here with any credibility.


"You wouldn't download a car."

The only reason these websites and platforms aggregate any content at all is because they're effectively giant public squares.


No means no ! :)


Moreover, isn't making scraping impossible illegal per a couple-of-years-old bill?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: