
The Internet Archive stopped honoring robots.txt back in 2017 because "Robots.txt meant for search engines don’t work well for web archives".

Not sure without a bit more digging whether they honor explicit rules for their crawler.

Edit: a quick look at the comments suggests that they don't, and someone notes that ia_archiver is Alexa's crawler, not the Internet Archive's.




Hmmm, and their robots.txt doesn't seem to be at fault. I wonder if they excluded it to avoid overlapping with AltaVista. Still, it's tragic that digital.com isn't browsable in the Wayback Machine.


This may be a question of age. DEC was acquired by Compaq in 1998, when the IA was only a couple of years old.



