Hacker News

Certainly, countermeasures against crawler blocking will be a necessary component of effective search-corpus aggregation going forward. Otherwise, search will balkanize around who will pay the most for access to public content. Common Crawl is ~10PB; that is not insurmountable.

Edit: I understand there is a free-rider/economic issue here; I'm unsure how to solve it as the balance between search engine/gen-AI systems and content stores/providers becomes more adversarial.



AFAIK OpenAI currently respects robots.txt, so we'll have to see if they change that policy out of desperation at some point.


> AFAIK OpenAI currently respects robots.txt

I wonder to what degree -- do they respect the Crawl-delay directive, for example? HN itself has a 30-second crawl delay (https://news.ycombinator.com/robots.txt), meaning that crawlers are supposed to wait 30 seconds before requesting the next page. I doubt ChatGPT will delay a user's search of HN by up to 30 seconds, even though that's what robots.txt instructs them to do.
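For reference, the Crawl-delay directive can be read with Python's standard-library robots.txt parser. This is a minimal sketch against an inline robots.txt mirroring what the comment describes (a 30-second delay for all agents), not HN's live file:

```python
from urllib.robotparser import RobotFileParser

# Inline robots.txt mirroring the 30-second delay described above
# (not fetched from the live site).
robots_txt = """\
User-agent: *
Crawl-delay: 30
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler reads the delay and sleeps this long between requests.
delay = rp.crawl_delay("*")  # -> 30 (seconds)

# With no Disallow rules, fetching any page is permitted.
allowed = rp.can_fetch("*", "https://news.ycombinator.com/item?id=1")

print(delay, allowed)
```

Whether a crawler actually honors the returned delay is entirely up to the crawler; the parser only reports what the file asks for.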


Would ChatGPT even have to respect robots.txt when interacting live with a user? I would think robots.txt applies only to automated crawling. When directed by a user, one could argue that ChatGPT is basically the user agent the person is using to view the web. If you wrote a browser extension that shows the reading time for all search results on Google, would you respect robots.txt when prefetching the result pages? I probably wouldn't, because that's not really automated crawling to me.


They do respect robots.txt (supposedly), but they also introduced a new user agent, which nobody would yet have in their robots.txt, as part of this feature[1], and looking at my server logs it has already crawled a bunch of sites.

[1] https://platform.openai.com/docs/bots/overview-of-openai-cra...
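For sites that want to opt out, the block would look something like the following. The user-agent tokens are taken from OpenAI's published crawler list (GPTBot for training, ChatGPT-User for user-directed fetches, OAI-SearchBot for search) and should be verified against the linked docs, since a new agent is exactly what catches existing robots.txt files off guard:

```
# Hypothetical robots.txt fragment; verify the exact tokens against [1].
User-agent: OAI-SearchBot
Disallow: /

User-agent: GPTBot
Disallow: /
```

Note that this only works prospectively: a crawler that visited before the rule was added has already fetched the content.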



