Scraping isn’t illegal, and to be honest, I’m not even sure it’s unethical. I’m assuming you think it so — if so, why? I’m not disagreeing, but haven’t given it much thought.
Having been to twitter mostly through the most recent prominent war, man the signal to noise ratio is really low even when being careful about who to follow and who to block. There is so much disinformation, bad takes, uninformed opinions presented as facts, pure evil, etc.
So I guess it could be used for training very specific things or cataloging the underbelly of humanity but for general human knowledge it’s a frigging cesspool.
OK, not gonna argue with that. There is, I guess, a perception that it matters because policy-makers, and the wonks and hacks that influence them are hooked. The value for me (and ergo the public, some classic NGO thinking there for you) lies in understanding those dynamics.
I do not use the Twitters myself, and actively discourage others from doing so. Sends people bonkers.
I mean, we have found election manipulations like large-scale inauthentic activity of out-of-staters explicitly targeting African Americans, and projects here even to the extent of the perpetrators getting indicted. Other projects were tracking vaccine side-effect self-reports faster than the CDC and other disaster intelligence.
We were actually gearing up to switch to paid accounts as we found use cases that could subsidize these efforts... And then the starting price for reasonably small volumes shot up to like $500k/yr.
So, are we saying it's unethical for Google and other search engines who make money off of ad revenue to scrape sites like Twitter? Or are they paying a large sum to Twitter to do this?
When there is a value exchange between the two entities that are relatively similar then I think it is ethical. People trade Google making money on ads for their site being found when people search. It is also possible to opt-out.
But it's it ethical for the site owner to block access to random people and companies in the internet to _my_ data? I posted that tweet with the expectation that it's gonna be publicly available. Now the owner of the site is breaking that expectation. I would say that this part is also unethical.
Especially since they're not moderating things or anything.
Agreed. However, it's probably covered by their terms of service.
Same thing with the recent reddit kerfuffle. I'd have much preferred a Usenet 2.0 instead of centralizing global communications in the hands of a handful of private companies with associated user-hostile incentive structures.
Being indexed by google is optional. Twitter could stop it a any time if they thought it was a bad deal for them. That not comparable to a startup company trying to scrape the entire site to train their AI and using sophisticated techniques to bypass protections Twitter has put in place
Except with modern software, some wannabe genius programmer will think they can get a bunch of money or cred or whatever by infantilizing the process down to something your grandma could use. Then, suddenly, everyone is scraping. The net effect is largely the same -- server operators see an overwhelming proportion of requests from bots. Still ethical?
Yes, it is ethical. In many countries it is legal for humans to walk around the public square and overhear all conversations.
It is NOT legal to install cameras that record everyone's conversations, much less sell the laundered results.
Pre-2023 people went on Twitter with the expectation that their output would be read by humans.
A traditional search engine is different: It redirects to the original. A bastardized search engine that shows snippets is more questionable, but still miles away from the AI steal.
Many countries have freedom of panorama, which means it is legal to video record the public square. I'm not aware if anywhere has specific laws on mounting the camera on a robot.
If the background of the issue is as Musk described, then it certainly is not allowed by twitter’s robots.txt, which allows a maximum of one request per second.
I do a lot of data scraping, so I’m sympathetic to the people who want to do it, but violating the robots.txt (or other published policies) is absolutely unethical, regardless of the license of the content the service is hosting. Another way of describing an unauthorised usecase taking a service offline is a denial of service attack, which (again, if Musk’s description of the problem is accurate) seems to be the issue Twitter was facing, with a choice between restricting services or scaling forever to meet the scrapers requirements.
Personally I would have probably tried to start with a captcha, but all this dogpiling just looks like low effort Musk hate. The prevailing sentiment on HN has become so passionately anti-Musk that it’s hard to view any criticism of him or Twitter here with any credibility.