I really like the flow/UX. Congratulations! Nice job! What is the roadmap? I am ...

_bitliner · on Nov 17, 2014

Furthermore, what do you mean by `Javascript pages supported`? Can I just specify where it has to click, or do I need to reverse-engineer the AJAX calls?

binux · on Nov 17, 2014

http://demo.pyspider.org/debug/js_test_sciencedirect is a sample for this.

There is a phantomjs fetcher that renders the page the way a WebKit browser would. You can also run some JavaScript before or after the page has loaded to simulate a mouse click.
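A minimal sketch of what "simulate a mouse click" could look like from the crawler side. `fetch_type='js'` is the value mentioned in this thread; `js_script` and `js_run_at` are assumed here to be options accepted by pyspider's `self.crawl` and should be checked against the docs. The helper `js_click_options` and the `.load-more` selector are hypothetical.

```python
import json


def js_click_options(selector, run_at="document-end"):
    """Build crawl options for a JS-rendered fetch plus a simulated click.

    Hypothetical helper: the keys mirror options that pyspider's
    self.crawl is assumed to accept.
    """
    # json.dumps quotes the selector safely for embedding in JavaScript source.
    js_script = (
        "function() {\n"
        "    var el = document.querySelector(%s);\n"
        "    if (el) { el.click(); }\n"
        "}" % json.dumps(selector)
    )
    return {
        "fetch_type": "js",      # route the request through the phantomjs fetcher
        "js_script": js_script,  # JavaScript to run inside the rendered page
        "js_run_at": run_at,     # assumed values: "document-start" / "document-end"
    }


options = js_click_options(".load-more")
print(options["js_script"])
```

In a handler these options would presumably be passed along as `self.crawl(url, callback=..., **options)`, so no reverse engineering of the AJAX calls is needed as long as the click can be expressed as a selector.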

pknerd · on Nov 17, 2014

But won't it be slow, assuming it downloads CSS, images, etc.?

binux · on Nov 17, 2014

Images are not downloaded by default. Both the fetcher and the phantomjs proxy are fully async.

binux · on Nov 17, 2014

To make it more flexible and easier to reuse? I have implemented most of the features I need for now.

_bitliner · on Nov 17, 2014

Because I already have a powerful distributed architecture. I was curious about the architecture of pyspider.

For example, how is the queue handled? Is it centralized? Is there a server managing it?

binux · on Nov 17, 2014

The architecture of pyspider: http://blog.binux.me/assets/image/pyspider-arch.png

And yes, the queue is centralized; it lives in the scheduler. It's designed to handle about 10-100 million URLs per project.

The scheduler, fetchers, and processors are connected with rabbitmq (as one option). Only one scheduler is allowed, but you can run multiple fetchers or processors as needed.
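The shape described above (one scheduler, many fetchers, many processors, joined by queues) can be sketched with threads and in-process queues. This is only an illustration of the topology, not pyspider's code: `queue.Queue` stands in for rabbitmq, the "fetch" is faked so it runs offline, and every name below is hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to shut down


def scheduler(urls, to_fetch):
    """The single scheduler: owns the URL list and hands out tasks."""
    for url in urls:
        to_fetch.put(url)


def fetcher(to_fetch, to_process):
    """A fetcher worker; the download is faked so the sketch runs offline."""
    while True:
        url = to_fetch.get()
        if url is STOP:
            break
        to_process.put((url, "<html>%s</html>" % url))


def processor(to_process, results, lock):
    """A processor worker: turns a fetched page into a result record."""
    while True:
        item = to_process.get()
        if item is STOP:
            break
        url, body = item
        with lock:
            results[url] = len(body)


def run_pipeline(urls, n_fetchers=3, n_processors=2):
    to_fetch, to_process = queue.Queue(), queue.Queue()
    results, lock = {}, threading.Lock()
    fetchers = [threading.Thread(target=fetcher, args=(to_fetch, to_process))
                for _ in range(n_fetchers)]
    processors = [threading.Thread(target=processor,
                                   args=(to_process, results, lock))
                  for _ in range(n_processors)]
    for t in fetchers + processors:
        t.start()
    scheduler(urls, to_fetch)   # exactly one scheduler feeds the queue
    for _ in fetchers:          # one stop signal per fetcher, after all tasks
        to_fetch.put(STOP)
    for t in fetchers:
        t.join()
    for _ in processors:        # processors stop only once fetchers are done
        to_process.put(STOP)
    for t in processors:
        t.join()
    return results


results = run_pipeline(["http://example.com/%d" % i for i in range(10)])
print(len(results))
```

Scaling out in this model means raising `n_fetchers` or `n_processors`; the scheduler stays a single instance because it owns the queue.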

maratc · on Nov 17, 2014

Will it be a good fit if I, running on a hundred servers, need to scrape just the home page of a million sites? No analysis of the pages; that is done later.

binux · on Nov 17, 2014

The fetcher alone already fits that...

maratc · on Nov 17, 2014

You are running

    phantomjs phantomjs_fetcher.js

and using it as a proxy? The setup instructions are a bit unclear on this.

binux · on Nov 17, 2014

I wanted to make it an HTTP proxy in the beginning, but I found that hard to do. So now I POST every request to it, but I haven't changed the name.

But it works like a proxy: any request with `fetch_type == 'js'` is fetched through phantomjs, and the response is sent back to tornado_fetcher.
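A rough sketch of that routing, under stated assumptions: the endpoint address is a placeholder for wherever `phantomjs phantomjs_fetcher.js` is listening, the JSON body layout is a guess, and `build_request` is a hypothetical helper rather than anything from tornado_fetcher. It only builds the request objects and sends nothing.

```python
import json
import urllib.request

# Assumption: placeholder address of the running phantomjs fetcher process.
PHANTOMJS_ENDPOINT = "http://localhost:25555"


def build_request(task):
    """Return the HTTP request a fetcher would issue for a task (nothing is sent)."""
    if task.get("fetch_type") == "js":
        # JS tasks: POST the whole task as JSON to the phantomjs process,
        # which renders the page and replies with the result.
        body = json.dumps(task).encode("utf-8")
        return urllib.request.Request(
            PHANTOMJS_ENDPOINT,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
    # Everything else: fetch the target URL directly.
    return urllib.request.Request(task["url"], method="GET")


js_req = build_request({"url": "http://example.com/", "fetch_type": "js"})
plain_req = build_request({"url": "http://example.com/"})
print(js_req.get_method(), js_req.full_url)
print(plain_req.get_method(), plain_req.full_url)
```

The point of the sketch is that the caller sees the same interface either way; only the destination of the request changes with `fetch_type`.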