
No proxy yet, but I am considering one, as many sites are redirecting my crawler based on its IP, which is causing indexing issues.

The hardest part BY FAR is the crawler: initially I used Apache Nutch, but it got slower and slower as the index grew, so I replaced it with my own crawler written in PHP (comfortable for me) and made it multi-threaded using Supervisor.
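Running multiple worker processes under Supervisor looks roughly like this; the program name, script path, and process count here are hypothetical, just to show the shape of the config:

```ini
; a minimal sketch: Supervisor runs N copies of a PHP worker script
; and restarts any that crash (names and paths are made up)
[program:crawler]
command=php /var/www/crawler/worker.php
process_name=%(program_name)s_%(process_num)02d
numprocs=8
autostart=true
autorestart=true
stderr_logfile=/var/log/crawler/worker_%(process_num)02d.err.log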

The second hardest part was the amount of security I had to build in to prevent bots from running spam searches and hogging my infra.

I'll try to write a blog soon and post it here.



Do you have multiple IPs? I am trying to build something which needs just the published at and updated at date fields for thousands of links and I am afraid my IP will get blocked quickly.


Just one IP for now. You are right to worry about being blocked from crawling, though; it has already happened to me on a few sites. The key things that help mitigate it are:

1. Always identify your crawler with a consistent User-Agent string that explains it's a web search crawler, not a generic web browser.

2. Always obey the directives in robots.txt.

3. Make sure your crawler is not too aggressive (keep the request frequency low).
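The three points above can be sketched in a few lines; this is a minimal example using Python's standard library, and the crawler name, contact URL, and delay are hypothetical:

```python
# A minimal polite-fetch sketch covering the three mitigations above.
# USER_AGENT and CRAWL_DELAY are illustrative, not real values.
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import Request, urlopen

USER_AGENT = "MySearchCrawler/1.0 (+https://example.com/crawler-info)"  # hypothetical
CRAWL_DELAY = 2.0  # seconds between requests (mitigation 3)

def allowed(url: str) -> bool:
    """Check robots.txt before fetching (mitigation 2)."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def polite_fetch(url: str) -> bytes:
    if not allowed(url):
        raise PermissionError(f"robots.txt disallows {url}")
    # Identify the crawler honestly via the User-Agent header (mitigation 1).
    req = Request(url, headers={"User-Agent": USER_AGENT})
    body = urlopen(req, timeout=10).read()
    time.sleep(CRAWL_DELAY)  # throttle so the target host isn't hammered
    return body
```

In a real crawler you would cache the parsed robots.txt per host and track the last-request time per host instead of sleeping after every fetch, but the structure is the same.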

(updated for formatting)



