
I think this is kind of misguided - it ignores the main reason sites use robots.txt, which is to exclude irrelevant/old/non-human-readable pages that nevertheless need to remain online from being indexed by search engines - but it's an interesting look at Archive Team's rationale.


Yes, and I'd add to that dynamically generated URLs of effectively infinite variability, which give automated traffic two separate but equally important reasons to stay away (a concrete robots.txt sketch follows the list):

1. You (the bot) are wasting your own bandwidth, CPU, and storage on a literally unbounded set of pages

2. You may also be causing real resource problems for the owner of the site (e.g., suppose they use Algolia to power their search, Algolia bills them by search volume, and you fire off 10,000,000 different queries)
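
To make that concrete, a site in that position might publish rules along these lines (the paths here are purely illustrative):

    # Keep well-behaved crawlers out of unbounded, machine-generated URL spaces
    User-agent: *
    Disallow: /search
    Disallow: /calendar/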

The author of this angry rant seems specifically ticked off at some perceived 'bad actor' using robots.txt as an attempt to "block people from getting at stuff," but the rant is misguided in that it ignores an entire purpose of robots.txt that isn't even necessarily adversarial to the "robot."

This whole thing could have been a single sentence: "Robots.txt has a few competing vague interpretations and is voluntary; not all bots obey it, so if you're fully relying on it to prevent a site from being archived, that won't work."
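
And on the "voluntary" part: the rules only have any effect if the crawler checks them itself before fetching. A polite bot might do something like the sketch below, using Python's standard urllib.robotparser (the bot name and URLs are placeholders); a bot that simply skips this step is never stopped by the file at all.

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt (placeholder URL)
    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # The crawler has to ask itself whether fetching is allowed;
    # nothing on the server side enforces the answer.
    url = "https://example.com/search?q=anything"
    if rp.can_fetch("ExampleBot", url):
        print("allowed to fetch", url)
    else:
        print("robots.txt asks us to skip", url)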


Correct.

That has been one of the biggest uses -- improving SEO by preventing web crawlers from getting lost or confused in a maze of irrelevant content.



