
What do you mean by "barely" respecting robots.txt? Wouldn't that be more binary? Are they respecting some directives and ignoring others?


I believe that a number of AI bots only respect robots.txt entries that explicitly name their static user agent. They ignore wildcard user-agent entries.

That counts as barely imho.

I found this out after OpenAI was decimating my site and ignoring the wildcard deny-all. I had to add entries specifically for their three bots to get them to stop.
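For illustration, here's roughly what that ends up looking like — a wildcard deny-all (which they ignored) plus explicit entries for the user agents OpenAI documents (GPTBot, ChatGPT-User, and OAI-SearchBot; assuming those names are still current):

    # Wildcard deny-all -- ignored by some AI crawlers
    User-agent: *
    Disallow: /

    # Explicit entries for OpenAI's documented user agents
    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /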


Even some non-profits ignore it now; the Internet Archive stopped respecting it years ago: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


IA actually has technical and moral reasons to ignore robots.txt. Namely, they want to circumvent this stuff because their goal is to archive EVERYTHING.


Isn’t this a weak argument? OpenAI could also say their goal is to learn everything, feed it to AI, advance humanity etc etc.


OAI is using others' work to resell it in models. IA uses it to preserve the history of the web.

There is a case to be made about the value of the traffic you'll get from OAI search, though...


It does depend a lot on how you feel about IA's integrity :P


I also don't think they hit servers with repeated requests nearly as much.


As I recall, this is outdated information. The Internet Archive does respect robots.txt and will remove a site from its archive based on it. I did this a few years after your linked blog post to get an inconsequential site removed from archive.org.


The most recent robots.txt notice on IA's blog is from 2017, and there's no indication that the service has reversed course since.

<https://blog.archive.org/?s=robots.txt>


This is highly annoying and rude. Is there a complete list of all known bots and crawlers?



Amazonbot doesn't respect the `Crawl-Delay` directive. To be fair, Crawl-Delay is non-standard, but it is claimed to be respected by the other 3 most aggressive crawlers I see.

And how often does it check robots.txt? ClaudeBot will make hundreds of thousands of requests before it re-checks robots.txt to see that you asked it to please stop DDoSing you.
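For reference, Crawl-Delay is a non-standard extension that asks a crawler to wait a given number of seconds between requests; crawlers such as Bing's and Yandex's document support for it, but nothing obliges anyone to honor it:

    User-agent: *
    # Non-standard: please wait at least 10 seconds between requests
    Crawl-Delay: 10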


One would think they'd at least respect the cache-control directives. Those have been in the web standards since forever.
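For example (value illustrative), a response header like

    Cache-Control: public, max-age=86400

tells any well-behaved client it may reuse its cached copy for a day before re-fetching — which would cover the robots.txt re-check problem too.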


Here's Google, complaining of problems with pages they want to index but I blocked with robots.txt.

    New reason preventing your pages from being indexed

    Search Console has identified that some pages on your site are not being indexed 
    due to the following new reason:

        Indexed, though blocked by robots.txt

    If this reason is not intentional, we recommend that you fix it in order to get
    affected pages indexed and appearing on Google.
    Open indexing report
    Message type: [WNC-20237597]


They are not complaining. You configured Google Search Console to notify you about problems that affect the search ranking of your site, and that's what they do. If you don't want to receive these messages, turn them off in Google Search Console.



