Certainly, countermeasures against crawler blocking will be a necessary component of effective search corpus aggregation going forward. Otherwise, search will balkanize around who will pay the most for access to public content. Common Crawl is ~10 PB; that's not insurmountable.
Edit: I understand there is a freerider/economic issue here, unsure how to solve that as the balance between search engine/gen AI systems and content stores/providers becomes more adversarial.
I wonder to what degree they comply -- for example, do they respect the Crawl-delay directive? HN itself has a 30-second crawl delay (https://news.ycombinator.com/robots.txt), meaning crawlers are supposed to wait 30 seconds between requests. I doubt ChatGPT will delay a user's search of HN by up to 30 seconds, even though that's what robots.txt instructs them to do.
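For reference, a compliant crawler can read that directive with Python's stdlib robots.txt parser. A minimal sketch, using an inline stanza that mirrors the 30-second Crawl-delay described above ("ExampleBot" is a placeholder agent name, not a real crawler):

```python
import urllib.robotparser

# Stanza mirroring the Crawl-delay HN serves at
# https://news.ycombinator.com/robots.txt (30 seconds, all agents).
robots_txt = """\
User-agent: *
Crawl-delay: 30
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler would sleep this many seconds between requests.
delay = rp.crawl_delay("ExampleBot")  # placeholder user agent
print(delay)  # 30
```

Nothing in the stdlib enforces the delay, of course; the crawler has to `time.sleep(delay)` between fetches itself, which is exactly the honor-system gap the comment points at.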
Would ChatGPT, when interacting live with a user, even have to respect robots.txt? I would think robots.txt only applies to automated crawling. When directed by a user, one could argue that ChatGPT is basically the user agent the user is using to view the web. If you wrote a browser extension that shows the reading time for every search result on Google, would you respect robots.txt when prefetching the result pages? I probably wouldn't, because that's not really automated crawling to me.
They do respect robots.txt (supposedly), but they also introduced a new user agent that nobody would yet have in their robots.txt as part of this feature[1], and looking at my server logs it's already crawled a bunch of sites.
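Until that new agent shows up in published robots.txt files, site owners who want to opt out have to add a stanza for it themselves, something like the following (the token here is a placeholder; the real one is whatever [1] documents):

```
User-agent: ExampleNewSearchBot
Disallow: /
```

Which is the structural problem: opt-out by user-agent string only works after you learn the string exists, typically from your own server logs.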