> It depends on people knowing about it, and adding the complexity of checking it to their crawler.
I can't believe any bot writer doesn't know about robots.txt. They're just so self-obsessed and can't comprehend why the rules should apply to them, because obviously their project is special and it's just everyone else's bot that causes trouble.
(Malicious) bot writers have exactly zero concern for robots.txt. Most bots are malicious, and most don't bother setting the usual TCP options; their only concern is speed. I block about 99% of port-scanning bots by simply dropping any TCP SYN packet that is missing the MSS option or uses a strange value. The most popular port-scanning tool is masscan, which does not set MSS, and some of the malicious user agents set odd MSS values if they set one at all.
-A PREROUTING -i eth0 -p tcp -m tcp -d $INTERNET_IP --syn -m tcpmss ! --mss 1280:1460 -j DROP
Example rule from the netfilter raw table (iptables-save format). This will not help against headless Chrome.
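In command form, applied to the raw table so conntrack never sees the packet, it would look roughly like this (eth0 and $INTERNET_IP are placeholders; widen or narrow the MSS range to match your own legitimate clients):

  # Drop inbound SYNs whose MSS option is missing or outside 1280-1460.
  # The negated tcpmss match also catches SYNs that carry no MSS option at all.
  iptables -t raw -A PREROUTING -i eth0 -p tcp -m tcp -d "$INTERNET_IP" --syn \
    -m tcpmss ! --mss 1280:1460 -j DROP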
The reason this is useful is that many bots first scan for port 443 and then try to enumerate it. The bots that look up domain names to scan will still try, and many of those find new targets from certs being created in Let's Encrypt, via the public Certificate Transparency logs. That is one of the reasons I use the DNS method: get a wildcard cert and sit on it for a while, so individual hostnames never show up in the logs.
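For what it's worth, a rough sketch of that with certbot and a manual DNS-01 challenge (example.com is just a placeholder; a DNS plugin would do the same job):

  # Wildcard certs require the DNS-01 challenge; only the wildcard entry
  # ends up in the CT logs, not each individual hostname.
  certbot certonly --manual --preferred-challenges dns \
    -d 'example.com' -d '*.example.com'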
Another thing that helps is setting a default host in one's load balancer or web server: a simple static page, served from a RAM disk, that says something like "It Worked!", with logging disabled for that default site. In HAProxy, look up the option "strict-sni". Very old API clients can get blocked if they do not support SNI, but along that line, most bots are really old unsupported code that the botter could not update if their life depended on it.
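A minimal HAProxy sketch of both ideas, assuming HAProxy 2.2+ with a defaults section in "mode http", and with made-up names and paths (strict-sni on its own already rejects handshakes that present no SNI, or an SNI with no matching cert):

  frontend https-in
      # strict-sni: refuse the TLS handshake unless the SNI matches a loaded cert
      bind :443 ssl crt /etc/haproxy/certs/ strict-sni
      # drop the log line for anything that isn't the real site
      http-request set-log-level silent unless { ssl_fc_sni -i www.example.com }
      use_backend real_site if { ssl_fc_sni -i www.example.com }
      default_backend default_static

  backend default_static
      # tiny static page; HAProxy loads the file into memory at startup
      http-request return status 200 content-type "text/html" file /dev/shm/index.html

  backend real_site
      server app1 127.0.0.1:8080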
You do realize VPNs and older connectivity exist that need MSS values lower than 1280, right?
Of course. The nifty thing about open source is that I can configure a system to allow or disallow anything. Each server operator can monitor their legit users' traffic, find what they need to allow, and dump the rest. Corporate VPNs will be using known values. "Free" VPNs can vary wildly, but one need not support them if they choose not to. On some systems I only allow an MSS of 1460, and I also block TCP SYN packets with a TTL greater than 64, but that matches my user base.
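As a sketch, that stricter variant is the same style of raw-table rule (again, eth0 and $INTERNET_IP are placeholders, and the values only make sense if they match your own user base):

  # Only accept SYNs advertising exactly MSS 1460.
  iptables -t raw -A PREROUTING -i eth0 -p tcp -d "$INTERNET_IP" --syn \
    -m tcpmss ! --mss 1460 -j DROP
  # Drop SYNs arriving with TTL > 64, i.e. senders whose initial TTL was 128 or 255.
  iptables -t raw -A PREROUTING -i eth0 -p tcp -d "$INTERNET_IP" --syn \
    -m ttl --ttl-gt 64 -j DROP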
I know crawlies are for sure reading robots.txt, because they keep getting themselves banned by my disallowed /honeytrap page, which is only advertised there.
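The robots.txt side of it is just this (the actual banning is done by whatever watches the access log for hits on that path, which isn't shown here):

  User-agent: *
  Disallow: /honeytrap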