
I suppose it's a pretty good approach to help avoid upsetting website owners and potential lawsuits. While I'm not entirely sure I agree with it, I can appreciate that they have provided a very easy way for websites to opt out.
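For reference, the opt-out is just the standard robots.txt exclusion mechanism, targeted at the user agent the Archive's crawler has historically identified itself as:

    User-agent: ia_archiver
    Disallow: /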



A possible pragmatic solution would be to track each site, spot ownership changes, and freeze the robots.txt at that point.

Reliable ownership-change detection can be tricky, but it's doable IMHO.
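A minimal sketch of that heuristic, assuming the standard whois CLI is installed and treating any change in registrant/registrar fields as a possible ownership change (field names vary wildly across registries, so a real implementation would need far more normalization):

    import hashlib
    import re
    import subprocess

    # Fields that tend to change when a domain changes hands.
    OWNERSHIP_FIELDS = re.compile(r"^(registrant|registrar|admin)", re.IGNORECASE)

    def ownership_fingerprint(domain):
        """Hash the ownership-related lines of a domain's WHOIS record."""
        record = subprocess.run(
            ["whois", domain], capture_output=True, text=True).stdout
        lines = sorted(
            line.strip() for line in record.splitlines()
            if OWNERSHIP_FIELDS.match(line.strip()))
        return hashlib.sha256("\n".join(lines).encode()).hexdigest()

    def robots_txt_frozen(domain, stored_fingerprint):
        """Compare against the fingerprint stored when robots.txt was last
        honored; on mismatch, keep serving the frozen copy rather than
        re-fetching the new owner's robots.txt."""
        return ownership_fingerprint(domain) != stored_fingerprint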


It's pretty shocking how many web designers, even experienced professionals, assume a site isn't being crawled because it hasn't "gone live" yet in their minds (i.e., no press release). If a site is active on an IP without any access controls, you can be almost certain someone is indexing it. If it's not the default virtual host, expect one of your users to leak the host name. If it's served over SSL, the host name may even be revealed by the certificate itself. I respect the work the Internet Archive is doing, but I'm also grateful that they will immediately and retroactively apply robots.txt if you discover you foolishly exposed a site prematurely.
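The certificate leak is easy to demonstrate: fetch the cert presented on the bare IP and read its Subject Alternative Names. A quick sketch using the stdlib plus the third-party cryptography package for parsing (the IP below is a placeholder):

    import ssl
    from cryptography import x509

    def leaked_hostnames(ip, port=443):
        """Return the DNS names a server's default certificate reveals."""
        # Grab the PEM cert without validating it, then parse it.
        pem = ssl.get_server_certificate((ip, port))
        cert = x509.load_pem_x509_certificate(pem.encode())
        # Raises x509.ExtensionNotFound if the cert has no SAN extension.
        san = cert.extensions.get_extension_for_class(
            x509.SubjectAlternativeName)
        return san.value.get_values_for_type(x509.DNSName)

    # e.g. leaked_hostnames("203.0.113.10") may return host names you
    # assumed were private, because the server sends its default cert
    # when no SNI name is offered.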




