Hacker News

I think that is not quite right. Robots.txt can be useful for excluding certain kinds of dynamic files, for excluding URLs with redundant pieces (e.g. a query string on files that do not need one; if you could reasonably do it, which in many cases you unfortunately can't, you might forbid crawling any URL with a query string), for setting crawl delays, etc. (You might also want to offer better ways of mirroring some files; for a version control repository, for instance, there is no need to crawl the web pages at all, since you can just clone the repository instead. That way only the new and changed files are transferred, not all of them.)
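As a sketch, a robots.txt expressing the kinds of rules described above might look like this (the paths are hypothetical examples, and note that the wildcard and Crawl-delay lines are extensions that not every crawler supports):

```
User-agent: *
# Keep crawlers out of dynamically generated pages
Disallow: /cgi-bin/
# Wildcard patterns are supported by many (but not all) crawlers;
# this blocks any URL containing a query string
Disallow: /*?
# Crawl-delay is a non-standard extension: honored by some crawlers
# and ignored by others (Googlebot, for example, ignores it)
Crawl-delay: 10
```

For the repository case, pointing mirrors at `git clone` instead of the web pages means a subsequent `git fetch` transfers only the new and changed objects, rather than re-crawling every generated page.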

Robots.txt should not be used to prevent automated access in general, or to forbid mirroring.

Someone else wrote "I actually don't care if someone ignores my robots.txt, as long as their crawler is well run." I mostly agree with this, although whoever wrote the crawler does not know everything (but neither does the server operator).

In writing the specification for the crawling policy file for the Scorpion protocol, I tried to make some of these things clearer and to avoid some of these problems, although it is not perfect.



