
I suppose it's a pretty good approach to help avoid upsetting website owners and potential lawsuits. While I'm not entirely sure I agree with it, I can appreciate that they have provided a very easy way for websites to opt out.
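For reference, the opt-out is just the standard robots.txt exclusion mechanism, targeted at the user agent the Archive's crawler has historically identified itself as:

    User-agent: ia_archiver
    Disallow: /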



A possible pragmatic solution would be to track each site, spot ownership changes, and freeze the robots.txt at that point.

Reliable ownership-change detection can be tricky, but it's doable IMHO.
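A minimal sketch of that heuristic, assuming the standard whois CLI is installed and treating any change in registrant/registrar fields as a possible ownership change (field names vary wildly across registries, so a real implementation would need far more normalization):

    import hashlib
    import re
    import subprocess

    # Fields that tend to change when a domain changes hands.
    OWNERSHIP_FIELDS = re.compile(r"^(registrant|registrar|admin)", re.IGNORECASE)

    def ownership_fingerprint(domain):
        """Hash the ownership-related lines of a domain's WHOIS record."""
        record = subprocess.run(
            ["whois", domain], capture_output=True, text=True).stdout
        lines = sorted(
            line.strip() for line in record.splitlines()
            if OWNERSHIP_FIELDS.match(line.strip()))
        return hashlib.sha256("\n".join(lines).encode()).hexdigest()

    def robots_txt_frozen(domain, stored_fingerprint):
        """Compare against the fingerprint stored when robots.txt was last
        honored; on mismatch, keep serving the frozen copy rather than
        re-fetching the new owner's robots.txt."""
        return ownership_fingerprint(domain) != stored_fingerprint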


It's pretty shocking how many web designers, even experienced professionals, assume a site isn't being crawled because it hasn't "gone live" yet in their minds (i.e., no press release). If a site is active on an IP without any access controls, you can be almost certain someone is indexing it. If it's not the default virtual host, expect one of your users to leak the host name. If it's served over SSL, the host name may even be revealed by the certificate itself. I respect the work the Internet Archive is doing, but I'm also grateful that they will immediately and retroactively apply robots.txt if you discover you foolishly exposed a site prematurely.
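The certificate leak is easy to demonstrate: fetch the cert presented on the bare IP and read its Subject Alternative Names. A quick sketch using the stdlib plus the third-party cryptography package for parsing (the IP below is a placeholder):

    import ssl
    from cryptography import x509

    def leaked_hostnames(ip, port=443):
        """Return the DNS names a server's default certificate reveals."""
        # Grab the PEM cert without validating it, then parse it.
        pem = ssl.get_server_certificate((ip, port))
        cert = x509.load_pem_x509_certificate(pem.encode())
        # Raises x509.ExtensionNotFound if the cert has no SAN extension.
        san = cert.extensions.get_extension_for_class(
            x509.SubjectAlternativeName)
        return san.value.get_values_for_type(x509.DNSName)

    # e.g. leaked_hostnames("203.0.113.10") may return host names you
    # assumed were private, because the server sends its default cert
    # when no SNI name is offered.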




