
This might be controversial but everything is fair game everywhere. If you can crawl it, tough luck. It's there and everyone can get to it anyways, why not a crawler?



Because the rules a well-functioning society runs by are more nuanced than "Is it technically possible to do this?"

If you'd like a specific example of why people might seek this courtesy, someone might have a page or group of pages on their site that works fine when used by the humans who would normally use it, but which would keel over if bots started crawling it, because bot usage patterns don't look like normal human patterns.


A society is composed of humans. But there are (very stupid) AIs loose on the Internet that aren't going to respect human etiquette.

By analogy: humans drive cars and cars can respond to human problems at human time-scales, and so humans (e.g. pedestrians) expect cars to react to them the way humans would. But there are other things on, and crossing, the road, besides cars. Everyone knows that a train won't stop for you. It's your job to get out of the way of the train, because the train is a dumb machine with a lot of momentum behind it, no matter whether its operator pulls the emergency brake or not.

There are dumb machines on the Internet with a lot of momentum behind them, but, unlike trains, they don't follow known paths. They just go wherever. There's no way to predict where they'll go; no rule to follow to avoid them. So, essentially, you have to build websites so that they can survive being hit by a train at any time. And, for some websites, you have to build them to survive being hit by trains once per day or more.

Sure, on a political level, it's the fault of whoever built these machines to be so stupid, and you can and should go after them. But on a technical, operational level—they're there. You can't pre-emptively catch every one of them. The Internet is not a civilized place where "a bolt from the blue" is a freak accident no one could have predicted, and everyone will forgive your web service if it has to go to the hospital from one; instead, the Internet is a (cyber-)war-zone where stray bullets are just flying constantly through the air in every direction. Customers of a web service are about the same as shareholders in a private security contractor—they'd just think you irresponsible if you deployed to this war-zone without properly equipping yourself with layers and layers of armor.
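One common layer of that "armor" is a per-client rate limiter. The following is only a minimal sketch of a token bucket, not a hardened implementation; the class name `TokenBucket` and its parameters `rate` and `capacity` are hypothetical names chosen for illustration.

```python
import time


class TokenBucket:
    """Hypothetical per-client limiter: refills `rate` tokens per second,
    and allows bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so normal visitors are never delayed
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True if the request may proceed, False if it should be rejected."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Example: 1 request/second sustained, bursts of 2 (timestamps passed explicitly).
bucket = TokenBucket(rate=1, capacity=2, now=0.0)
results = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
```

A human clicking through pages rarely exhausts the burst allowance, while a crawler hammering the same endpoint gets rejected until tokens refill. In practice you would keep one bucket per client key (IP, API token, or similar), which is an assumption this sketch leaves out.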


Honestly, that is the site owner's problem. If it can be found by a person, it's fair game. I genuinely respect the concept of courtesy, but I don't expect it. People can seek courtesy, but they should have realistic expectations about whether they will get it.


So in your view, a DoS attack is not actually an attack, and site owners should just have to handle the traffic?


Techies forget the rule of law. A DoS has intent. A bot that crawls a poorly designed website and accidentally causes problems for the site owner does not have malicious intent. The owner can choose to block the offender, just like a restaurant can refuse service. But intent still matters.


This thread is about what behavior we should design crawlers to have. One person said crawlers should disregard noindex directives on government sites, and you replied that they should ignore all robots.txt directives and just crawl whatever they can. If you intentionally ignore robots.txt, that has intent, by definition.


Not intentionally ignore it by going out of their way to override it, just not be required to implement that feature in their crawler. Apparently parsing those files is tricky, with edge cases. Ignoring the file is absolutely on the table. People can of course adhere to it, but it's not required, and in my opinion it shouldn't even be paid attention to.

In my younger years the only time I ever dealt with robots.txt was to find stuff I wasn't supposed to crawl.
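For reference on the parsing question: a crawler written in Python does not need its own robots.txt parser, since the standard library ships `urllib.robotparser`. A minimal sketch, assuming a made-up robots.txt body and a hypothetical user agent name `ExampleBot` (neither comes from this thread):

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_parser(robots_text):
    """Build a parser from robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical user agent; the URLs are placeholders.
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")
allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
delay = parser.crawl_delay("ExampleBot")  # seconds to wait between requests, or None
```

Whether a crawler should honor the answer is the argument above; the sketch only shows that checking is a few lines. The standard-library parser does not cover every extension different search engines support, so edge cases do remain.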



