
Because I already have a powerful distributed architecture. I was curious about the architecture of pyspider.

For example, how is the queue handled? Is it centralized? Is there a server managing it?




the architecture of pyspider: http://blog.binux.me/assets/image/pyspider-arch.png

Yes, the queue is centralized; it lives in the scheduler. It's designed to handle roughly 10-100 million URLs per project.

The scheduler, fetchers, and processors are connected through a message queue (RabbitMQ is one of the supported options). Only one scheduler is allowed, but you can run multiple fetchers or processors as needed.
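For illustration, a minimal sketch of that fan-out using pika; the broker address, queue name, and task fields here are assumptions for the example, not pyspider's actual internals:

    import json
    import pika

    # Connect to a local RabbitMQ broker (assumed address).
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="fetcher_in", durable=True)  # hypothetical queue name

    # Scheduler side: the single scheduler publishes fetch tasks.
    task = {"taskid": "example", "url": "http://example.com/"}
    ch.basic_publish(exchange="", routing_key="fetcher_in",
                     body=json.dumps(task))

    # Fetcher side: any number of fetcher processes consume from the same
    # queue; RabbitMQ round-robins tasks among them, which is why fetchers
    # scale out while the scheduler stays singular.
    def on_task(channel, method, properties, body):
        print("fetching", json.loads(body)["url"])
        channel.basic_ack(delivery_tag=method.delivery_tag)

    ch.basic_consume(queue="fetcher_in", on_message_callback=on_task)
    ch.start_consuming()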


Will it be a good fit if I, running on a hundred servers, need to scrape just the home pages of a million sites? No analysis of the pages; that is done later.


The fetcher alone already fits that use case...
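For a fetch-only deployment like that, each server would essentially run a loop like this. A minimal sketch in the style of pyspider's Tornado-based fetcher (referred to below as tornado_fetcher); `save_body` and the concurrency figure are placeholders, not pyspider API:

    import asyncio
    from tornado.httpclient import AsyncHTTPClient

    CONCURRENCY = 100  # arbitrary per-process limit

    async def save_body(url, body):
        pass  # hypothetical hook: store raw HTML somewhere for later analysis

    async def fetch_one(sem, client, url):
        async with sem:
            # raise_error=False returns error responses instead of raising
            resp = await client.fetch(url, raise_error=False, request_timeout=30)
            if resp.code == 200:
                await save_body(url, resp.body)

    async def main(urls):
        AsyncHTTPClient.configure(None, max_clients=CONCURRENCY)
        client = AsyncHTTPClient()
        sem = asyncio.Semaphore(CONCURRENCY)
        await asyncio.gather(*(fetch_one(sem, client, u) for u in urls))

    if __name__ == "__main__":
        asyncio.run(main(["http://example.com/"]))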


You are running

   phantomjs phantomjs_fetcher.js
and using it as a proxy? The setup instructions are a bit unclear on this.


I wanted to make it an HTTP proxy at first, but I found that hard to do. So instead I POST every request to it; I just haven't changed the name.

But it works like a proxy: any request with `fetch_type == 'js'` is fetched through phantomjs and the response is passed back to tornado_fetcher.
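Roughly, the hand-off looks like this. A sketch only; the port and the exact task fields are assumptions, so check phantomjs_fetcher.js for the real contract:

    import json
    import requests

    PHANTOMJS_URL = "http://localhost:25555"  # assumed port for the phantomjs service

    # The tornado fetcher POSTs the whole fetch task as JSON rather than
    # tunneling it as a real HTTP proxy would.
    task = {
        "url": "http://example.com/",
        "fetch_type": "js",  # the flag that routes a task through phantomjs
        "timeout": 60,
    }

    resp = requests.post(PHANTOMJS_URL, data=json.dumps(task))
    result = resp.json()
    print(result.get("status_code"), len(result.get("content", "")))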



