Hacker News

Certainly, countermeasures against crawler blocking will be a necessary component of effective search-corpus aggregation going forward. Otherwise, search will balkanize around who will pay the most for access to public content. Common Crawl is ~10PB; that is not insurmountable.

Edit: I understand there is a free-rider/economic issue here; I'm unsure how to solve it as the balance between search engine/gen-AI systems and content stores/providers becomes more adversarial.



AFAIK OpenAI currently respects robots.txt, so we'll have to see if they change that policy out of desperation at some point.


> AFAIK OpenAI currently respects robots.txt

I wonder to what degree -- do they respect the Crawl-delay directive, for example? HN itself has a 30-second crawl delay (https://news.ycombinator.com/robots.txt), meaning that crawlers are supposed to wait 30 seconds before requesting the next page. I doubt ChatGPT will delay a user's search of HN by up to 30 seconds, even though that's what robots.txt instructs them to do.
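For reference, the Crawl-delay directive can be read with Python's standard-library robots.txt parser. This is a minimal sketch against an inline robots.txt mirroring what the comment describes (a 30-second delay for all agents), not HN's live file:

```python
from urllib.robotparser import RobotFileParser

# Inline robots.txt mirroring the 30-second delay described above
# (not fetched from the live site).
robots_txt = """\
User-agent: *
Crawl-delay: 30
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler reads the delay and sleeps this long between requests.
delay = rp.crawl_delay("*")  # -> 30 (seconds)

# With no Disallow rules, fetching any page is permitted.
allowed = rp.can_fetch("*", "https://news.ycombinator.com/item?id=1")

print(delay, allowed)
```

Whether a crawler actually honors the returned delay is entirely up to the crawler; the parser only reports what the file asks for.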


Would ChatGPT even have to respect robots.txt when interacting live with a user? I would think robots.txt applies only to automated crawling. When directed by a user, one could argue that ChatGPT is basically the user agent the person is using to view the web. If you wrote a browser extension that shows the reading time for all search results on Google, would you respect robots.txt when prefetching the result pages? I probably wouldn't, because that's not really automated crawling to me.


They do respect robots.txt (supposedly), but they also introduced a new user agent, which nobody would yet have in their robots.txt, as part of this feature[1], and looking at my server logs it has already crawled a bunch of sites.

[1] https://platform.openai.com/docs/bots/overview-of-openai-cra...
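For sites that want to opt out, the block would look something like the following. The user-agent tokens are taken from OpenAI's published crawler list (GPTBot for training, ChatGPT-User for user-directed fetches, OAI-SearchBot for search) and should be verified against the linked docs, since a new agent is exactly what catches existing robots.txt files off guard:

```
# Hypothetical robots.txt fragment; verify the exact tokens against [1].
User-agent: OAI-SearchBot
Disallow: /

User-agent: GPTBot
Disallow: /
```

Note that this only works prospectively: a crawler that visited before the rule was added has already fetched the content.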



