
We do, and we also use our own User-Agent string: "SiteTruth.com site rating system". A growing number of sites reject connections based on the User-Agent string. Try "redfin.com", for example. (We list those as "blocked".) Some sites won't let us read the "robots.txt" file at all. In some cases, the site's User-Agent check forbids things that "robots.txt" allows.
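SiteTruth's actual crawler isn't shown here, but the robots.txt side of the check can be sketched with the standard library's `urllib.robotparser`. The robots.txt content below is hypothetical; the point is that a parser can say "allowed" while the server still refuses the TCP-level connection based on the User-Agent header.

```python
import urllib.robotparser

# A hypothetical robots.txt (not from any real site): everyone may crawl,
# except that /private/ is off limits.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

UA = "SiteTruth.com site rating system"
print(parser.can_fetch(UA, "https://example.com/"))           # True  - robots.txt allows it
print(parser.can_fetch(UA, "https://example.com/private/x"))  # False - robots.txt forbids it
```

Note that `can_fetch` only reflects what robots.txt says; a site that drops connections for unrecognized User-Agent strings will block the crawl regardless of what this returns.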

Another issue is finding the site's preferred home page. We look at "example.com" and "www.example.com", both with HTTP and HTTPS, trying to find the entry point. This just looks for redirects; it doesn't even read the content. Some sites have redirects from one of those four options to another one. In some cases, the less favored entry point has a "disallow all" robots.txt file. In some cases, the robots.txt file itself is redirected. This is like having doors with various combinations of "Keep Out" and "Please use other door" signs. In that phase, we ignore "robots.txt" but don't read any content beyond the HTTP header.
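The probing described above can be sketched as follows. This is a simplified stand-in, not SiteTruth's code: a dict plays the role of the `Location` headers that HEAD requests would return, so no page content is ever read.

```python
def candidate_entry_points(domain):
    """The four plausible entry points for a bare domain name."""
    return [f"{scheme}://{host}/"
            for scheme in ("https", "http")
            for host in (domain, f"www.{domain}")]

def resolve(url, redirects, max_hops=5):
    """Follow redirects only (the dict stands in for the Location headers
    of HEAD responses); page content is never fetched."""
    hops = 0
    while url in redirects and hops < max_hops:
        url = redirects[url]
        hops += 1
    return url

# Hypothetical redirect graph for a site whose preferred entry point
# is https://www.example.com/
redirects = {
    "http://example.com/": "https://example.com/",
    "https://example.com/": "https://www.example.com/",
    "http://www.example.com/": "https://www.example.com/",
}
canonical = {resolve(u, redirects) for u in candidate_entry_points("example.com")}
print(canonical)  # all four candidates collapse to the same preferred entry point
```

A real implementation would also have to handle the pathologies mentioned above: a redirected robots.txt, or a "disallow all" robots.txt served only on the less favored entry point.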

Some sites treat the four reads needed to find the home page as a denial-of-service attack and refuse further connections for about a minute.
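One way to avoid tripping that kind of defense is to enforce a minimum interval between requests to the same host. A minimal sketch (the class name and interval are illustrative, not from SiteTruth):

```python
import time
from urllib.parse import urlparse

class HostThrottle:
    """Space out successive requests to the same host so a burst of
    entry-point probes doesn't look like a flood."""

    def __init__(self, min_interval=60.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable for testing
        self.last = {}      # host -> time of the last request

    def delay_before(self, url):
        """Seconds to wait before requesting url; records this request.
        (A fuller version would record the post-delay time instead.)"""
        host = urlparse(url).hostname
        now = self.clock()
        prev = self.last.get(host)
        self.last[host] = now
        if prev is None:
            return 0.0
        return max(0.0, self.min_interval - (now - prev))

# Demo with a fake clock so the example runs instantly.
ticks = iter([0.0, 1.0])
throttle = HostThrottle(min_interval=60.0, clock=lambda: next(ticks))
print(throttle.delay_before("https://example.com/"))           # 0.0  - first contact
print(throttle.delay_before("http://example.com/robots.txt"))  # 59.0 - too soon, back off
```

The caller would `time.sleep()` for the returned delay before issuing the next probe to that host.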

Then there's Wix. Wix sometimes serves a completely different page if it thinks you're a bot.
