No proxy yet, but I am considering one, as many sites are redirecting my crawler based on its IP, which is causing indexing issues.
The hardest part BY FAR is the crawler: initially I was using Apache Nutch, but it got slower and slower as the index grew, so I replaced it with my own crawler written in PHP (a language I'm comfortable with) and parallelized it using Supervisor.
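For anyone curious, Supervisor gets you the parallelism by running several copies of the same worker process rather than actual threads. A minimal `[program:x]` section looks roughly like this (paths, worker count, and log locations are illustrative, not my real config):

```ini
[program:crawler]
command=php /var/www/crawler/worker.php
; run 4 identical workers, named crawler_00 .. crawler_03
process_name=%(program_name)s_%(process_num)02d
numprocs=4
autostart=true
autorestart=true
stdout_logfile=/var/log/crawler/%(program_name)s_%(process_num)02d.log
```

Each worker just pulls the next URL from a shared queue, so adding capacity is mostly a matter of bumping `numprocs`.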
The second hardest part was the amount of security I had to build in to prevent bots from running spam searches and hogging my infrastructure.
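The core of that protection is just per-IP rate limiting on the search endpoint. A minimal sliding-window version (sketched in Python here rather than my PHP, with made-up limits) looks like:

```python
import time
from collections import defaultdict, deque

WINDOW = 60.0  # seconds (illustrative)
LIMIT = 30     # max searches per IP per window (illustrative)

# per-IP deque of recent request timestamps
hits = defaultdict(deque)

def allow_search(ip, now=None):
    """Return True if this IP is under LIMIT requests in the last WINDOW seconds."""
    now = time.monotonic() if now is None else now
    q = hits[ip]
    # drop timestamps that have aged out of the window
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= LIMIT:
        return False  # over the limit: reject the search
    q.append(now)
    return True
```

In production you would back this with something shared across workers (Redis, memcached, etc.), since each Supervisor worker has its own memory.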
Do you have multiple IPs? I am trying to build something that needs just the published-at and updated-at date fields for thousands of links, and I am afraid my IP will get blocked quickly.
Just one IP for now. You are right to worry about being blocked from crawling, though: it has already happened to me on a few sites. The key things that help mitigate this are:
1. Always identify your crawler via a consistent user-agent string that explains it's a web search crawler and not a generic web browser.
2. Always obey the directives in robots.txt.
3. Make sure your crawler is not too aggressive (low frequency of requests).
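All three points fit in a few lines. Here is a sketch in Python (my crawler is PHP, but the idea is identical); the user-agent string is a made-up example, and the robots.txt is inlined so the snippet is self-contained:

```python
import time
from urllib.robotparser import RobotFileParser

# 1. Consistent, honest user-agent (hypothetical name and info URL)
USER_AGENT = "MySearchCrawler/1.0 (+https://example.com/bot-info)"

# 2. Parse the site's robots.txt (inlined here; normally fetched per host)
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines())

def allowed(url):
    """Check robots.txt before every request (rule 2)."""
    return rp.can_fetch(USER_AGENT, url)

def throttle():
    """3. Honor Crawl-delay, falling back to a conservative default."""
    time.sleep(rp.crawl_delay(USER_AGENT) or 5)

allowed("https://example.com/page")       # True
allowed("https://example.com/private/x")  # False
```

Send `USER_AGENT` as the `User-Agent` header on every fetch, and call `throttle()` between requests to the same host.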
I'll try to write a blog post soon and share it here.