Nothing in their rambling even remotely supports their argument.
Yes, robots.txt is no magic bullet against ill-behaved crawlers (as proven by ArchiveTeam), but it was never supposed to be one.
You choose to ignore my specific wish not to be crawled by you? Fair enough, I'll return the favour and simply block your user agent,
"ArchiveTeam ArchiveBot/[DATECODE] (wpull [VERSION])" and not "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/[VERSION] Safari/537.36".
In Apache, for example (and, if possible, your IP ranges too):
RewriteEngine On
# No ^ anchor, in case text gets prepended to the user agent
RewriteCond %{HTTP_USER_AGENT} ArchiveTeam [NC]
RewriteRule .* - [F,L]
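For the IP-range side, a minimal sketch using Apache 2.4's mod_authz_core; the CIDR blocks below are documentation placeholders, since I don't know which ranges ArchiveBot actually uses:

<Location "/">
  <RequireAll>
    # Allow everyone except the listed ranges (placeholders only)
    Require all granted
    Require not ip 192.0.2.0/24
    Require not ip 198.51.100.0/24
  </RequireAll>
</Location>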
That's okay. You and the three people on this planet who actually use a robots.txt because they don't want to be crawled are not the ones they care about. The 99% that just blindly copied some tutorial which said something bright like "only allow Google's bot" without thinking about it are.
Resistance is futile. Eventually you'll get a malicious crawler ignoring robots.txt, changing the UA for each request and querying you from a distributed network of different data centers.
On the page: "While the onslaught of some social media hoo-hah will demolish some servers in the modern era, normal single or multi-thread use of a site will not cause trouble, unless there's a server misconfiguration"
Absolutely agreed. I work on Wikipedia articles on a specific subject where we rely on a select few web resources. Many of them have long since closed down and we use the archived versions instead. It's downright tragic when we've lost complete websites, perfectly usable sources for hundreds of articles with quality content, all because some domain parker bought up the URL and added a robots.txt that retroactively disabled the existing archive for the site. We need to archive everything, regardless of what the site owner thinks crawlers shouldn't see. Years down the road we might actually wish we had archives of things that many found uninteresting or not meant to be archived (for example, to see how sitemaps or RSS feeds were set up).
I think it's quite a clever way of dealing with the problem. Archive.org takes the long view: they don't delete data merely because someone changed a robots.txt, they just hide it. So the data is only inaccessible as long as someone cares enough to specifically hide it. At the same time this covers their ass, copyright-wise.
They need some way for people to easily hide archived content, since they have no clear right to publish their copy. You can send them an e-mail instead, but e-mails take comparatively more resources to process, whereas robots.txt can be handled automatically and at least sort of indicates that you are actually in a position to demand the removal.
Wouldn't the best approach then be to make this system a bit more fine-grained?
So blocking the archive only blocks new content added after the robots.txt is made available, and adding a certain few lines to the robots.txt file explicitly hides older content on the domain from the archive.
That way, everyone wins. Sites that really want to remove content for some insane reason can do so, older sites usually aren't lost when the domain expires, and domain-holding-page owners/cybersquatters don't accidentally cause older content to be hidden (since, hey, they don't really want to block the archive, just stop their holding page from being archived).
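A rough sketch of what that could look like, purely hypothetically; ia_archiver is the token archive.org has traditionally honoured, but the Hide-archived-before directive is invented here and not part of any actual robots.txt spec:

User-agent: ia_archiver
# Standard meaning: stop new snapshots from this point on
Disallow: /
# Invented directive for the proposal above: only when a line like this
# is present would snapshots made before the given date also be hidden
Hide-archived-before: 2011-01-01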
This sounds like a childish rant about why Archive Team don't want to follow robots.txt, which, incidentally, many many crawlers also don't follow.
I think the crux of the matter is found here:
> If you don't want people to have your data, don't put it online.
As much as I agree in principle with this, because of the way web requests work, I don't want to be associated with this group.
You cannot ignore copyright, and robots.txt is exactly what I would use if I didn't want something archived by an organisation I have nothing to do with.
>You cannot ignore copyright, and robots.txt is exactly what I would use if I didn't want something archived by an organisation I have nothing to do with.
AFAICT this page is a reaction to an archive.org policy of respecting robots.txt retroactively - e.g. oldwebsite.com runs from 1999-2009, the domain expires in 2010, gets bought in 2011, and the new owners add a robots.txt disallowing IA. The archive.org copies from those 10 years are now inaccessible.
Wonder if you could argue that the Archive.org version is ignoring or misrepresenting copyright as well. After all, if they block content from a previous site on the same domain, maybe that could be read as them saying the current domain owner owns content from previous iterations. Or something. Which would be interesting to see the legal reaction to.
Also makes me wonder if a solution could be implemented where domain owners can explicitly give permission for their work to be archived, with the assumption that all content from the current 'iteration' of the site remains accessible, even if the domain changes ownership. So I could say, "I give the Internet Archive full permission to archive DomainA.com", so the archive keeps said info accessible even if the domain is sold to someone else or expires.
Of course they would think it's dumb, because some robots.txt rules are counter to their objective, which is to save the internet in all its glory. They shouldn't use robots.txt, I agree. At the same time, robots.txt is not worthless.
SEO is where robots.txt shines right now. It's not that people are trying to hide something; it's that we don't want it to conflict with the content we actually want to promote.
Bingo. Everything I remember reading about robots.txt strongly emphasised that it's not a way to hide content – in case one's reasoning skills are lacking – and that its main use is to prevent irrelevant or infinite content showing up in any indices.
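As a sketch of that use (the paths are made up), a robots.txt along these lines keeps effectively infinite or duplicate URL spaces out of search indices without pretending to hide anything; the wildcard line relies on the extension most major crawlers support:

User-agent: *
# Example only: faceted search and per-day calendar views generate
# endless URL combinations that add nothing to an index
Disallow: /search
Disallow: /calendar/
# Session parameters just produce duplicates of existing pages
Disallow: /*?sessionid=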
This is something I simply cannot agree with. People are using ROBOTS.TXT for all kinds of reasons, such as blocking unwanted, careless webcrawlers or managing the indexing of a single-page application. I mean, come on. How can you say something like that just because eliminating ROBOTS.TXT would potentially benefit your business?
> ...such as blocking unwanted, careless webcrawlers...
This file doesn't "block" anything, it simply asks the robot to do something, which implies that it is probably being abnormally careful: truly annoying, unwanted, careless robots might follow these guidelines, but that seems like a stretch. In reality, this file exists so that extra careful robots are able to get feedback from websites that have extremely narrow bandwidth availability or extremely high generation cost... concepts which this article makes a pretty compelling argument for "that doesn't make sense". In practice, this file then makes the owners of websites sometimes think "I can build something weirdly broken (such as a procedurally generated content tarpit, or mapping anonymous GET requests to database insertions) and just rely on this file to explain what I did along with enforcing rate limits and boundaries"... and then an "unwanted careless webcrawler" comes along and causes them serious issues. It is akin to having your entire webserver crash if someone sends you a non-ASCII character in a form field, but thinking "this will work out: I have a little flag in my HTML file that makes it clear I only accept ASCII". If you absolutely feel like you need to block something, then actually block it: any robot gracious enough to pay attention to this file is also going to send a useful user agent, and you can use that to return a legitimate 403.
Strange advice. Not sure if I understood what they mean.
One important use case for excluding sections of your website is to avoid polluting the sitemap which Google crawls, or to be more precise, the daily crawl volume Google allocates to your site. If you let every page be crawled, more important pages get crawled less. Example: in the past, you created a content category which didn't turn out successful. Rather than removing this category, with plenty of links that would then result in crawl errors, it would be smarter to exclude it in the ROBOTS file and focus on your core categories.
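A minimal sketch of that scenario (the path is made up): the retired category stays reachable for existing links, but the crawler is asked to skip it so the crawl budget goes to the core categories:

User-agent: Googlebot
# Hypothetical retired section: pages still resolve for old links,
# but the crawler is asked not to spend its budget here
Disallow: /old-category/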
I am sorry, but I can't agree with "if you don't want your content to be archived, don't put it online".
This is very similar to taking photos/videos of people on the street without their consent, then archiving and publishing them. Even more, it's like taking a photo of someone who is wearing a t-shirt saying "please don't take photos of me" and publishing it.
Sorry, but if you are going to use my server resources, you will be bound by my rules.