
how are you tracking visitors and differentiating them from bots?


crudely. apache2 logs are parsed every 5 minutes. if an IP address already exists in the post-processed database, the entry is ignored; if it doesn't, a script parses the user agent string and checks it against a whitelist of known "consumer" browsers. if it matches, we assume the visitor is human. we then delete the detailed apache2 logs and store just the IP address, when we first saw it (date, not datetime), and whether it was deemed human or bot. faking a user agent string or using something like playwright would fool the script, but the whitelist also inherently won't cover every "consumer" browser out there.
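
a minimal sketch of what that 5-minute pass could look like, assuming a combined-format access log and a sqlite table; the file paths, schema, and browser whitelist here are illustrative, not the real ones:

    # sketch of the 5-minute pass: classify new IPs by user agent,
    # keep only ip / first-seen date / human-or-bot verdict
    import re, sqlite3
    from datetime import date

    CONSUMER_BROWSERS = ("Firefox", "Chrome", "Safari", "Edg")  # assumed whitelist
    LOG_LINE = re.compile(
        r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
    )

    def process(log_path, db_path):
        db = sqlite3.connect(db_path)
        db.execute(
            "CREATE TABLE IF NOT EXISTS visitors "
            "(ip TEXT PRIMARY KEY, first_seen TEXT, is_human INTEGER)"
        )
        for line in open(log_path):
            m = LOG_LINE.match(line)
            if not m:
                continue
            ip, user_agent = m.group(1), m.group(2)
            # skip IPs we've already classified
            if db.execute("SELECT 1 FROM visitors WHERE ip = ?", (ip,)).fetchone():
                continue
            is_human = any(b in user_agent for b in CONSUMER_BROWSERS)
            # store only the IP, the date first seen, and the verdict
            db.execute(
                "INSERT INTO visitors VALUES (?, ?, ?)",
                (ip, date.today().isoformat(), int(is_human)),
            )
        db.commit()

the detailed log lines are dropped after this runs, so only the aggregate table sticks around.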

every day, a script checks all IP addresses in the post-processed database for "clusters" on the same subnet. I think the threshold is 3: if we see 3 visitors on the same subnet, we consider it a likely bot and retroactively switch those entries to bot in the database. since we're not handling millions of visitors, I think this is reasonable, but it can introduce errors, too.
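
and the daily subnet pass is roughly this (again just a sketch, assuming IPv4 and treating the /24 prefix as the subnet; the real grouping rule may differ):

    # sketch of the daily pass: if 3+ distinct visitors share a /24,
    # retroactively reclassify the whole cluster as bots
    import sqlite3
    from collections import defaultdict

    def flag_subnet_clusters(db_path, threshold=3):
        db = sqlite3.connect(db_path)
        by_subnet = defaultdict(list)
        for (ip,) in db.execute("SELECT ip FROM visitors"):
            subnet = ip.rsplit(".", 1)[0]  # e.g. "203.0.113" for a /24
            by_subnet[subnet].append(ip)
        for subnet, ips in by_subnet.items():
            if len(ips) >= threshold:
                db.executemany(
                    "UPDATE visitors SET is_human = 0 WHERE ip = ?",
                    [(ip,) for ip in ips],
                )
        db.commit()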



