The Internet Archive stopped honoring robots.txt back in 2017 because "Robots.tx... | Hacker News


fencepost on Feb 15, 2019 | on: Altavista: The rise and fall of the biggest pre-Go...

The Internet Archive stopped honoring robots.txt back in 2017 because "Robots.txt meant for search engines don’t work well for web archives"

Without a bit more digging, I'm not sure whether they honor rules that explicitly name their crawler.

Edit: a quick look at the comments indicates that they don't, and notes that ia_archiver is Alexa's crawler, not the Internet Archive's.

Thoreandan on Feb 15, 2019

Hmmm, and their robots.txt doesn't seem to be at fault. I wonder if they excluded it to avoid overlapping with Alta Vista. Still, it's tragic that digital.com isn't browsable in the Wayback Machine.

fencepost on Feb 17, 2019

This may be a question of age. DEC was acquired by Compaq in 1998, when the IA was only a couple of years old.
