OP mentions using robots.txt to avoid crawling, but even Google ignores this now ...

lwhsiao · on Nov 29, 2023

OP here. I'm not sure about the details in your link, but basically my understanding lines up with [1]; robots.txt isn't guaranteed to be respected, but generally is.

FWIW, what I specifically have in robots.txt is

    User-agent: *
    Disallow: /

which seems to work well for me so far (i.e., I do not find my house documentation site on any search engine).
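
For anyone who wants to sanity-check how a well-behaved crawler would read those two lines, here is a rough sketch using Python's standard `urllib.robotparser`. The `is_allowed` helper, the `example.com` URLs, and the bot names are made up for illustration. It only models a crawler that chooses to honor robots.txt, and it says nothing about whether a page ends up indexed.

```python
from urllib.robotparser import RobotFileParser

# The same rules quoted above: every user agent, nothing allowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler may fetch `url` under `robots_txt`.

    Hypothetical helper for illustration; parses the rules from a string
    instead of fetching them over the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com stands in for the real site (an assumption).
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/docs/"))
```

A crawler that ignores robots.txt never runs a check like this at all, which is why the file is a request rather than an access control.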

[1]: https://developers.google.com/search/docs/crawling-indexing/...

saalweachter · on Nov 29, 2023

If I understand the link correctly, what Google dropped support for was one particular robots.txt feature that was considered undocumented/unsupported, not robots.txt as a whole.

I think the point of it was that you could tell Google to crawl some pages (for links) but not index them?