> - The Archive's policy is to honor robots.txt and other no-archive directives.
What bothers me is that they do so retroactively based on the current robots.txt, not the one contemporary with the archived content. So if a domain parker takes over a domain, and their robots.txt excludes everyone from every page (or everyone but Google), then archive.org no longer provides its archive of the old content.
I'm not sure there's a good way to work around this - the alternative is to only respect the robots.txt as it was when the snapshot was taken, but then, once a confidential page is in the archive, you can't (easily) remove it again.
Perhaps access to archived pages should only be blocked when the ia_archiver user agent specifically is denied in the robots.txt. That way they aren't inadvertently blocked by a generic robots.txt that denies everything (which sometimes occurs with parked domains), but there's still a way to deny the Wayback Machine if you really need to.
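To make the distinction concrete, here is a small sketch using Python's standard-library robots.txt parser (the URLs and rule sets are placeholders): a generic parked-domain robots.txt blocks ia_archiver only as a side effect of blocking everyone, while a rule that names ia_archiver blocks the Wayback Machine and nothing else.

```python
# Illustration: a blanket robots.txt vs. one that singles out the Wayback
# Machine's crawler, checked with Python's stdlib robots.txt parser.
from urllib.robotparser import RobotFileParser

parked_domain = [          # typical parked-domain robots.txt: blocks everyone
    "User-agent: *",
    "Disallow: /",
]
targeted_block = [         # explicitly denies only the Internet Archive's bot
    "User-agent: ia_archiver",
    "Disallow: /",
]

def allowed(rules, agent, url="https://example.com/old-page"):
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(agent, url)

print(allowed(parked_domain, "ia_archiver"))   # False: caught by the generic block
print(allowed(targeted_block, "ia_archiver"))  # False: denied by name
print(allowed(targeted_block, "Googlebot"))    # True: other agents unaffected
```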
Well, then everybody would claim the same right and you'd have to maintain a full list of bots and keep it up to date. That doesn't sound scalable; * should still mean "everybody".
robots.txt only tells bots not to crawl the website. It doesn't say anything about indexing or archival of pages.
It would be perfectly possible to have the Wayback Machine respect the robots.txt and not have it crawl or archive any new pages, whilst making pages that have already been archived accessible unless a specific user agent has been denied.
I would say that the default behavior should be to respect the robots.txt of the time of the snapshot and only revert archival of accidentally cached pages that were never intended to be public.
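As a rough illustration of that default, here is a minimal sketch (the names and flags are hypothetical, not the Archive's actual logic): new crawls follow the live robots.txt, while playback of an existing snapshot depends only on the robots.txt captured with it, plus an explicit ia_archiver opt-out.

```python
# Hypothetical policy sketch: crawling obeys today's robots.txt, playback of
# an existing snapshot obeys the robots.txt in force at capture time, unless
# ia_archiver has since been denied explicitly.
from dataclasses import dataclass

@dataclass
class Snapshot:
    url: str
    denied_at_capture: bool   # did robots.txt deny this page when it was captured?

def may_crawl(current_robots_denies: bool) -> bool:
    # New captures always respect the live robots.txt.
    return not current_robots_denies

def may_serve(snap: Snapshot, ia_archiver_denied_now: bool) -> bool:
    # Existing captures stay visible unless they were denied at capture time
    # or the current owner has denied ia_archiver by name.
    return not snap.denied_at_capture and not ia_archiver_denied_now

snap = Snapshot("https://example.com/article", denied_at_capture=False)
print(may_crawl(current_robots_denies=True))          # False: stop new crawls
print(may_serve(snap, ia_archiver_denied_now=False))  # True: keep the old snapshot up
print(may_serve(snap, ia_archiver_denied_now=True))   # False: owner opted out explicitly
```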
Sadly, it requires a bit of human intervention there.
The internet archive wouldn't work if its crawler wasn't fully automated. You can't handle the whole internet in a way that requires human intervention.
The problem with manual deletion via human intervention is that if you handle it outside of robots.txt you then need to verify the identity of the owner, which makes it much more complicated and costly.
Maybe robots.txt could have clauses that say whether to apply entries retroactively. True, domain parkers could enable it, but I don't think most would, since it's extra work for no benefit - the point is usually not to erase history but to protect the current site.
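Purely as a thought experiment, such a clause could be a non-standard robots.txt field (the name "Archive-retroactive" below is invented for illustration) that an archive could parse alongside the usual rules:

```python
# Hypothetical sketch: read an invented "Archive-retroactive" field from a
# robots.txt to decide whether new Disallow rules should also hide snapshots
# that were captured before the rules existed.
def retroactive_requested(robots_txt: str) -> bool:
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "archive-retroactive":
            return value.strip().lower() in ("yes", "true", "1")
    return False  # default: new rules affect future crawls only

example = """\
User-agent: *
Disallow: /
Archive-retroactive: no
"""
print(retroactive_requested(example))  # False: old snapshots stay visible
```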
Since the number of pages referenced from Wikipedia is limited, the archiving could also be done by other parties. Maybe just set up a Raspberry Pi with an x TB hard disk, download everything, and then distribute it over IPFS (or something else), as a backup in case archive.org removes the pages.
To take JavaScript etc. into account, one could also capture a PNG snapshot of each page.
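A rough sketch of that setup, assuming a placeholder URL list and output directory: fetch each cited URL and store the raw HTML locally. PNG snapshots would additionally need a headless browser, and the resulting folder could then be published over IPFS with the ipfs add -r command.

```python
# Rough sketch of the self-hosted mirror idea: download each referenced URL
# and store the raw HTML on disk. The URL list, output directory, and timeout
# are assumptions for illustration only.
import hashlib
import pathlib
import urllib.request

urls = [
    "https://example.com/cited-article",   # placeholder for URLs cited on Wikipedia
]
outdir = pathlib.Path("mirror")
outdir.mkdir(exist_ok=True)

for url in urls:
    # Derive a stable filename from the URL.
    name = hashlib.sha256(url.encode()).hexdigest()[:16] + ".html"
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            (outdir / name).write_bytes(resp.read())
        print("saved", url, "->", name)
    except OSError as err:
        print("failed", url, err)
```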
Yes, it's by design. It's to stave off the inevitable flood of legal threats and takedown demands.
If someone requests that content be taken off, the Archive instructs them to update their robots.txt. The content is not removed, but it will not be shown through archive.org as long as the robots.txt exclusion is in place.
There was a court case where the plaintiff wanted to subpoena the Internet Archive for evidence (since the defendant had since blocked the content with robots.txt). The Archive sent an expert to testify that complying with that kind of request would be too much of a burden for them, and suggested that the court instead force the defendant to change their robots.txt. The court agreed.
By design from the last time I saw it discussed. The idea is that a change there could indicate that there was a mistake and the data shouldn't have been crawled for one reason or another. There's just no way to know in an automated fashion.
I'm inclined to say that robots.txt should be ignored altogether and removal requests should be handled per individual case. But I guess that goes too far...
Still, if the domain owner changes, they should not be able to remove content from old archives. That's like being able to remove stuff from encyclopedias about a palace somewhere, just because you live in the place where the palace once stood.
They are The Internet Archive, after all; it would be logical to archive records like the domain owner and robots.txt for a given point in time. A change of owner can be detected easily.
> removal requests should be handled per individual case
Are you going to fund the internet archive to handle that workload?
> if the domain owner changes, they should not be able to remove content from old archives.
Why not? If that work belongs to anyone, it is the current domain owner. Why does the fact that the internet archive happened to crawl it mean that suddenly they lose control of their information?
This is elevating the Internet Archive from 'hey, it's cool someone made a copy of that while it was up and no one cared' to 'because the Internet Archive crawled it, the world has absolute rights to that information from now on, wishes of the owner be damned.'
> If that work belongs to anyone, it is the current domain owner.
Not at all. A domain parker taking over a domain does not imply they have any rights over all the content that the previous owner of the domain posted.
> Are you going to fund the internet archive to handle that workload?
I figured someone would ask that. I don't have an immediate answer but it's a good question (upvote for that). My hope is just that removal requests are not too frequent. But without current numbers (of pages hidden after the fact and of removal requests) this is guesswork.
They could charge for it perhaps? A dollar per request. Doesn't seem too unreasonable for something you mistakenly made available to the planet. It doesn't have to be per page, so if you made a million documents available all under example.com/hidden/ then hiding that folder is a simple action and costs just one dollar. You're paying them for their time.
In the Netherlands, if you want your personal information (e.g. phone number; email address) removed from a company's systems, you can request that and they must grant it if they have no reason to keep the data any longer. And you can make requests to see your data, etc. But the law allows for companies to charge for this and I've seen example amounts (I think around 3 euros) somewhere. It's a somewhat similar situation.
So I don't have a single good answer, but I think by-case is a better way to go (and worth thinking about, at least) than just using the current approach.