Nothing in their rambling even remotely supports their argument.
Yes, robots.txt is no magic bullet against ill-behaved crawlers (as proven by ArchiveTeam), but it was never supposed to be one.
You choose to ignore my specific wish not to be crawled by you? Fair enough, I'll return the favour and simply block your user agent,
"ArchiveTeam ArchiveBot/[DATECODE] (wpull [VERSION])" and not "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/[VERSION] Safari/537.36".
In Apache, for example (and, if possible, your IP ranges too):
RewriteEngine On
# No ^ anchor, in case text gets prepended to the user agent
RewriteCond %{HTTP_USER_AGENT} ArchiveTeam [NC]
RewriteRule .* - [F,L]
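For the IP-range side, a minimal sketch using Apache 2.4's mod_authz_core; the CIDR blocks below are documentation placeholders, since I don't know which ranges ArchiveBot actually uses:

<Location "/">
  <RequireAll>
    # Allow everyone except the listed ranges (placeholders only)
    Require all granted
    Require not ip 192.0.2.0/24
    Require not ip 198.51.100.0/24
  </RequireAll>
</Location>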
That's okay. You and the three people on this planet who actually use a robots.txt because they don't want to be crawled are not the ones they care about. The 99% that just blindly copied some tutorial which said something bright like "only allow Google's bot" without thinking about it are.
Resistance is futile. Eventually you'll get a malicious crawler ignoring robots.txt, changing the UA for each request and querying you from a distributed network of different data centers.
On the page: "While the onslaught of some social media hoo-hah will demolish some servers in the modern era, normal single or multi-thread use of a site will not cause trouble, unless there's a server misconfiguration"
Absolutely agreed. I work on Wikipedia articles on a specific subject where we rely on a select few web resources. Many of them have long since closed down and we use the archived versions instead. It's downright tragic when we've lost complete websites, perfectly usable sources for hundreds of articles with quality content, all because some domain parker bought up the URL and added a robots.txt that retroactively disabled the existing archive for the site. We need to archive everything, regardless of what the site owner thinks crawlers shouldn't see. Years down the road we might actually wish we had archives of things that many found uninteresting or not meant to be archived (for example, to see how sitemaps or RSS feeds were set up).
I think it's quite a clever way of dealing with the problem. Archive.org takes the long view: they don't delete data merely because someone changed a robots.txt, they just hide it. So the data is only inaccessible as long as someone cares enough to specifically hide it. At the same time this covers their ass, copyright-wise.
They need some way for people to easily hide archived content, since they have no clear right to publish their copy. You can send them an e-mail instead, but e-mails take comparatively more resources to process, whereas robots.txt can be handled automatically and at least sort of indicates that you are actually in a position to demand the removal.
Wouldn't the best approach then be to make this system a bit more fine-grained?
So blocking the archive only blocks new content added after the robots.txt is made available, and adding a certain few lines to the robots.txt file explicitly hides older content on the domain from the archive.
That way, everyone wins. Sites that really want to remove content for some insane reason can do so, older sites usually aren't lost when the domain expires, and domain-holding-page owners/cybersquatters don't accidentally cause older content to be hidden (since, hey, they don't really want to block the archive, just stop their holding page from being archived).
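A rough sketch of what that could look like, purely hypothetically; ia_archiver is the token archive.org has traditionally honoured, but the Hide-archived-before directive is invented here and not part of any actual robots.txt spec:

User-agent: ia_archiver
# Standard meaning: stop new snapshots from this point on
Disallow: /
# Invented directive for the proposal above: only when a line like this
# is present would snapshots made before the given date also be hidden
Hide-archived-before: 2011-01-01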
This sounds like a childish rant about why Archive Team don't want to follow robots.txt, which, incidentally, many many crawlers also don't follow.
I think the crux of the matter is found here:
> If you don't want people to have your data, don't put it online.
As much as I agree in principle with this, because of the way web requests work, I don't want to be associated with this group.
You cannot ignore copyright, and robots.txt is exactly what I would use if I didn't want something archived by an organisation I have nothing to do with.
>You cannot ignore copyright, and robots.txt is exactly what I would use if I didn't want something archived by an organisation I have nothing to do with.
AFAICT this page is a reaction to an archive.org policy of respecting robots.txt retroactively - e.g. oldwebsite.com runs from 1999-2009, the domain expires in 2010, gets bought in 2011, and the new owners add a robots.txt disallowing IA. The archive.org copies from those 10 years are now inaccessible.
Wonder if you could argue that the Archive.org version is ignoring or misrepresenting copyright as well. After all, if they block content from a previous site on the same domain, maybe that could be read as them saying the current domain owner owns content from previous iterations. Or something. Which would be interesting to see the legal reaction to.
Also makes me wonder if a solution could be implemented where domain owners can explicitly give permission for their work to be archived, with the assumption that all content from the current 'iteration' of the site remains accessible, even if the domain changes ownership. So I could say, "I give the Internet Archive full permission to archive DomainA.com", so the archive keeps said info accessible even if the domain is sold to someone else or expires.
Of course they would think it's dumb, because some robots.txt rules are counter to their objective, which is to save the internet in all its glory. They shouldn't use robots.txt, I agree. At the same time, robots.txt is not worthless.
SEO is where robots.txt shines right now. It's not that people are trying to hide something; it's that we don't want it to conflict with the content we actually want to promote.
Bingo. Everything I remember reading about robots.txt strongly emphasised that it's not a way to hide content – in case one's reasoning skills are lacking – and that its main use is to prevent irrelevant or infinite content showing up in any indices.
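As a sketch of that use (the paths are made up), a robots.txt along these lines keeps effectively infinite or duplicate URL spaces out of search indices without pretending to hide anything; the wildcard line relies on the extension most major crawlers support:

User-agent: *
# Example only: faceted search and per-day calendar views generate
# endless URL combinations that add nothing to an index
Disallow: /search
Disallow: /calendar/
# Session parameters just produce duplicates of existing pages
Disallow: /*?sessionid=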
This is something I simply cannot agree with. People are using ROBOTS.TXT for all kinds of reasons, such as blocking unwanted, careless webcrawlers or managing the indexing of a single-page application. I mean, come on. How can you say something like that just because eliminating ROBOTS.TXT would potentially benefit your business?
> ...such as blocking unwanted, careless webcrawlers...
This file doesn't "block" anything, it simply asks the robot to do something, which implies that it is probably being abnormally careful: truly annoying, unwanted, careless robots might follow these guidelines, but that seems like a stretch. In reality, this file exists so that extra careful robots are able to get feedback from websites that have extremely narrow bandwidth availability or extremely high generation cost... concepts which this article makes a pretty compelling argument for "that doesn't make sense". In practice, this file then makes the owners of websites sometimes think "I can build something weirdly broken (such as a procedurally generated content tarpit, or mapping anonymous GET requests to database insertions) and just rely on this file to explain what I did along with enforcing rate limits and boundaries"... and then an "unwanted careless webcrawler" comes along and causes them serious issues. It is akin to having your entire webserver crash if someone sends you a non-ASCII character in a form field, but thinking "this will work out: I have a little flag in my HTML file that makes it clear I only accept ASCII". If you absolutely feel like you need to block something, then actually block it: any robot gracious enough to pay attention to this file is also going to send a useful user agent, and you can use that to return a legitimate 403.
Strange advice. Not sure if I understood what they mean.
One important use case for excluding sections of your website is to avoid polluting the sitemap which Google crawls, or to be more precise, the daily crawl volume Google allocates to your site. If you let every page be crawled, more important pages get crawled less. Example: in the past, you created a content category which didn't turn out successful. Rather than removing this category, with plenty of links that would then result in crawl errors, it would be smarter to exclude it in the ROBOTS file and focus on your core categories.
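A minimal sketch of that scenario (the path is made up): the retired category stays reachable for existing links, but the crawler is asked to skip it so the crawl budget goes to the core categories:

User-agent: Googlebot
# Hypothetical retired section: pages still resolve for old links,
# but the crawler is asked not to spend its budget here
Disallow: /old-category/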
I am sorry, but I can't agree with "if you don't want your content to be archived, don't put it online".
This is very similar to taking photos/videos of people on the street without their consent, then archiving and publishing them. Even more, it's like taking a photo of someone who is wearing a t-shirt saying "please don't take photos of me" and publishing it.
Sorry, but if you are going to use my server resources, you will be bound by my rules.