Protect Your Site with a Blackhole for Bad Bots (perishablepress.com)
75 points by ScotterC on Nov 1, 2010 | 20 comments



I've worked at places that have tried this trick. It doesn't work: it has always ended up being removed because real users complained that they'd lost access.

Several scenarios can trigger it, and there are probably more. The internet is a weird place. Consider:

1. Some clients, browser plugins, and proxy servers implement link prefetching. These agents will not care that the link is attached to a 1px gif that the user won't see. This isn't really breaking the rules, either; it is quite permissible and within the scope of HTTP implementations - unless you've put your black hole behind a form POST (which bots won't fall for anyway; see the sketch after this list).

2. Internet Explorer, among other tools, allows users to download content for offline viewing. The client does not respect robots.txt when such fetching is initiated by a user.

3. Not all users browse the web visually, and your 1px gif is discriminating against the visually impaired. When browsed with a screen reader, a linked image is a linked image is a linked image.
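
A minimal sketch of the form-POST gate from point 1, in PHP since the article's blackhole script is PHP (the file names and blacklist file are my own invention): a plain GET just renders a pointless form, so prefetchers never trip the trap, at the cost of catching only bots that actually submit forms.

  <?php
  // blackhole.php -- hypothetical sketch of the "behind a form POST" idea:
  // prefetchers only issue GETs, so a GET just renders a form and does no harm.
  if ($_SERVER['REQUEST_METHOD'] !== 'POST') {
      echo '<form method="post" action="/blackhole/">'
         . '<input type="submit" value="Do not submit this form"></form>';
      exit;
  }
  // Only something that actually submits the pointless form lands here.
  file_put_contents('blackhole.dat', $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND | LOCK_EX);
  header('HTTP/1.1 403 Forbidden');
  echo 'Goodbye.';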

Additionally, outright blacklisting by IP address, as noted by others in this thread, is highly problematic, especially when the behavior that triggers it could accidentally come from real users behind a NAT firewall (at a typical office, library, etc). A single user performing any of the above behaviors would block the entire group from the service.

There are better ways to fight misbehaving robots that do not so easily trigger false positives...


I would suggest burying the "bad link" several pages deep under the blackhole (i.e. a page with a link, which leads to another page with another link, and so on).

The spider will crawl all the way down, but real users and prefetchers won't. (I'm not sure about offline downloaders, but I can't imagine they go very deep.)
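
A rough PHP sketch of that chain (the script name, depth parameter, and threshold are all made up for illustration):

  <?php
  // trap.php -- hypothetical sketch: each page links one level deeper,
  // and only something that crawls the whole chain gets recorded.
  $depth = isset($_GET['d']) ? (int) $_GET['d'] : 0;

  if ($depth >= 5) {
      // Nothing but an exhaustive crawler should ever get this deep.
      file_put_contents('blackhole.dat', $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND | LOCK_EX);
      header('HTTP/1.1 403 Forbidden');
      exit('Goodbye.');
  }

  // Otherwise serve a nearly empty page with a single link one level down.
  printf('<a href="/trap.php?d=%d">more</a>', $depth + 1);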


And in case the user does end up triggering it, perhaps also replace the "You have been banned" page with a captcha challenge to get yourself removed from the blacklist.
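
Sketched in PHP, with a trivial arithmetic question standing in for a real CAPTCHA and an imaginary unban_ip() helper doing the actual blacklist removal:

  <?php
  // banned.php -- hypothetical sketch: let a blocked human prove themselves and get unbanned.
  session_start();

  if (isset($_POST['answer'], $_SESSION['sum'])
          && (int) $_POST['answer'] === $_SESSION['sum']) {
      unban_ip($_SERVER['REMOTE_ADDR']);   // imaginary helper: drop the IP from the blacklist
      exit('You have been removed from the blacklist.');
  }

  $a = rand(1, 9);
  $b = rand(1, 9);
  $_SESSION['sum'] = $a + $b;
  echo "<p>You have been banned. If you are a human, answer to be unbanned:</p>
        <form method='post'>$a + $b = <input name='answer'> <input type='submit'></form>";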


I've worked at two places that did this.

One place was run with an iron fist by the meanest physicist on the planet, a guy who eats nails for breakfast and always asks the toughest question at the seminar. His attitude was "delete 'em all and let God sort 'em out."

The other place was my own personal network of websites, where I was trying to keep spam-harvesting robots from getting my email address. I had a process with several steps protecting the page that had my address on it, one of which was a 'tripwire' that would instantly block any stupid robot that followed one of the hidden links.

I can say that I've gone for years without getting any email complaints about difficulties reading my email address ;-)


At least not by email?


You're missing the point - there is no 1px gif, or hidden link. The only place that you can find the link is in the robots.txt file, and it's set to Disallow:

http://perishablepress.com/robots.txt

So if you're a "bad bot", you'll go to /blackhole/ looking for juicy stuff, and get banned. If you're Google you won't follow it (or he has a whitelist) and if you're an end user you won't even see it.

Edit: Ah, my bad. He does put a link to it in his footer.


If you do this, create the robots.txt first, then wait a week or two!

Only then activate the actual blackhole.

The reason is that robots do not download robots.txt on every visit; they can, and do, cache it for quite a while, especially for sites that don't change much.


So this is just to ban non-targeted crawlers? Any particular reason you'd want to ban crawlers from your site? Surely your server is up to the task of serving a few extra requests, enough so that it's not worth your time adding code (and slowing down good requests) to restrict them.

The kind of bot that I care about are the ones that spam up my content site. They only go to pages that real users visit (the "Post Stuff" page), so this trick wouldn't help against them. And they never post from the same IP twice, preferring to hop between infected machines on a botnet every time they make a post.

I'm curious what sort of traffic pattern this author is seeing that would motivate him to build this.


Whitelisting these user agents ensures that anything claiming to be a major search engine is allowed open access. The downside is that user-agent strings are easily spoofed, so a bad bot could crawl along and say, “hey look, I’m teh Googlebot!” and the whitelist would grant access.

How many of these so-called "bad bots" already do this sort of spoofing? Would usage of these techniques only encourage such behavior?


A few bad ideas here:

1) Blocking by IP address. (AOL and Universities come to mind.)

2) nofollow links are followed by search engines and users alike. (display:none is ignored by some text-based browsers that ignore CSS.)


Also: whitelisting anyone with "googlebot" in their user-agent. That's like letting anyone into the White House just because they turn up wearing an Obama mask.
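
For what it's worth, the standard way around the mask problem is a reverse-then-forward DNS check rather than trusting the user-agent string; a rough PHP sketch (not part of the article's script):

  <?php
  // Hypothetical sketch: verify a "Googlebot" claim with reverse + forward DNS.
  function looks_like_real_googlebot($ip) {
      $host = gethostbyaddr($ip);              // e.g. crawl-66-249-66-1.googlebot.com
      if ($host === false || $host === $ip) {
          return false;                        // no usable reverse record
      }
      if (!preg_match('/\.(googlebot|google)\.com$/', $host)) {
          return false;                        // reverse name isn't Google's
      }
      return gethostbyname($host) === $ip;     // forward lookup must point back at the IP
  }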


If your site becomes popular, this could become a target for trolls. E.g. in some forums trolls will post a fake link to the logout page, which is why sites should use POST or private keys for logging out. If you implement this blackhole, it will become a much more serious target.


This is pretty neat.

Something we did at $company[-2] was to set aside a block of IPs that weren't used for customer traffic. If something hit them (SSH login attempts, HTTP GET requests looking for RFI vulnerabilities, etc.), the IP would be firewalled from the entire network for a period of time (generally 2-3 days).


This is okay, but what would happen if someone wrote a popular Flash client that pulls data from a site that uses that blackhole.php?

The clients could access the data once and then be blocked forever?


Against what, exactly, does this protect? And why?

"Bad bots" are the least of my worries, and if I were to protect against anything, I'd protect agains excessive requests per second.


There are some interesting ideas in the comments for working around the problems with this method, such as hashing the IP with a secret string in the link, to stop others from tricking you into banning all your users (see the sketch at the end of this comment). And all sorts of other stuff - even putting a CAPTCHA on the ban page as an escape method. But in the end I think the method is flawed:

1) A single infected computer on a network could take out a large number of users.

2) Anything doing prefetching will get users banned.

3) There is a risk of taking out valid bots, and verifying them correctly is just too expensive for a large site.

My main issue with bots is their spam, so I just use tools like Akismet to keep that under control.
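
The hashed-link idea mentioned above could look something like this in PHP (the secret, parameter name, and function names are made up):

  <?php
  // Hypothetical sketch of the signed trap link: the URL embeds an HMAC of the
  // visitor's own IP, so a troll can't paste a link that gets someone else banned.
  define('TRAP_SECRET', 'change-me');   // made-up secret

  // When rendering the hidden link for the current visitor:
  function trap_url($ip) {
      return '/blackhole/?sig=' . hash_hmac('sha256', $ip, TRAP_SECRET);
  }

  // Inside the trap: only ban when the signature matches the requesting IP.
  function trap_signature_is_valid($ip, $sig) {
      return hash_hmac('sha256', $ip, TRAP_SECRET) === $sig;
  }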


robots.txt is malformed in this example:

  Disallow: /*/blackhole/*
This line won't work for many good robots either (the original robots.txt spec doesn't define wildcard characters; some crawlers support * only as an extension).
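
A version that sticks to the original robots.txt rules (plain prefix matching, no wildcards) would be something along these lines, assuming the trap lives at /blackhole/:

  User-agent: *
  Disallow: /blackhole/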


Wouldn't using a hidden link also subject you to a possible penalty from Google and other search engines?


The implementation is very weak. It reads the whole blacklist line by line (it could have used SQLite at least), and it uses extract() to emulate the register_globals misfeature on hosts that have disabled it (and it doesn't even check for disabled register_globals properly).
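
For comparison, a banned-IP lookup against SQLite is only a few lines of PHP (a sketch with a made-up database and table name):

  <?php
  // Hypothetical sketch: check the visitor's IP against an indexed SQLite table
  // instead of scanning a flat file on every request.
  $db = new PDO('sqlite:' . __DIR__ . '/blackhole.db');
  $db->exec('CREATE TABLE IF NOT EXISTS banned (ip TEXT PRIMARY KEY, banned_at INTEGER)');

  $stmt = $db->prepare('SELECT 1 FROM banned WHERE ip = ?');
  $stmt->execute(array($_SERVER['REMOTE_ADDR']));

  if ($stmt->fetchColumn()) {
      header('HTTP/1.1 403 Forbidden');
      exit('Access denied.');
  }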


I'd like to see how someone implements this type of blocking in a Rails app.



