Hacker News
Show HN: A Python Spider System with Web UI (github.com/binux)
159 points by binux on Nov 17, 2014 | hide | past | favorite | 40 comments



This looks really nice. The API seems more user-friendly than scrapy's.


How does this compare to scrapy? Why would I use one over the other, or is either a fine choice?


I'm working on a benchmarking suite https://gist.github.com/binux/67b276c51e988f8e2c31 and have run into some problems...

pyspider comes from a vertical search engine project. We had two issues:

- 100+ websites: they may change their templates or go down at any time. We needed a dashboard to monitor the changes and the failures.

- updates within 5 minutes: when a website is updated, we need to follow it within 5 minutes. We use an update time from the index (list) page to detect changed pages, and pages should be re-crawled after about 30 days in case we missed something. A powerful scheduler is needed.

Obviously, I couldn't find the right way to do this with scrapy. I'm not very familiar with scrapy, so I can't say what pyspider can do that scrapy can't.
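The scheduling rules above (follow an update quickly when the index page reports a newer timestamp, and re-crawl everything after roughly 30 days regardless) can be sketched as a plain decision function. The names here are illustrative, not pyspider's actual API:

```python
import time

# Re-crawl a page unconditionally after ~30 days, even if the index
# page never reported a change, in case an update was missed.
MAX_AGE = 30 * 24 * 60 * 60  # seconds


def should_recrawl(last_crawl_time, index_update_time, now=None):
    """Decide whether a page needs to be fetched again.

    last_crawl_time   -- unix timestamp of our last successful fetch
    index_update_time -- update time reported by the index (list) page
    """
    now = time.time() if now is None else now
    if index_update_time > last_crawl_time:
        return True  # the index page says the content changed
    if now - last_crawl_time > MAX_AGE:
        return True  # too old: re-crawl in case we missed something
    return False
```

A real scheduler layers priorities and rate limits on top of this, but the per-page decision is essentially the two checks shown.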


+1 Scrapy comparison please

Can you compare it to scrapy, as requested by other posters? Why could you not build on top of scrapy and leverage celery for scheduling, etc. (http://www.celeryproject.org/)?

What is the immediate value add of using pyspider?


Can someone explain what this is?


Nice project! I do wish it supported a PostgreSQL backend rather than (or as well as I guess) MySQL.


What is a "spider system"? Never heard that term before.


sorry :(


Thanks for making me feel bad about my python-based aggregation solution :)

https://github.com/AZdv/agricatch


I really like the flow/UX. Congratulations! Nice job!

What is the roadmap?

I am really into scraping; it is part of my daily job. I could consider integrating it into one of my architectures.


Furthermore, what do you mean by `Javascript pages supported`? Could I just specify where it has to click, or do I need to reverse engineer the AJAX calls?


http://demo.pyspider.org/debug/js_test_sciencedirect is a sample for this.

There is a phantomjs fetcher that can render the page as WebKit does. Furthermore, you can have some JavaScript run before/after the page loads, e.g. to simulate a mouse click.
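As a rough illustration of what such a JS-enabled fetch request might carry, here is a task dict with the page URL plus a script to run after load. The field names are modeled on pyspider's fetch options (`fetch_type`, `js_script`), but the exact shape shown is an assumption for illustration, not the documented API:

```python
def js_fetch_task(url, click_selector):
    # Build a fetch task routed to the phantomjs fetcher. Field names
    # mimic pyspider's fetch options; treat the exact dict layout as
    # illustrative rather than authoritative.
    return {
        "url": url,
        "fetch": {
            "fetch_type": "js",
            # JavaScript run after the page has loaded, e.g. to
            # simulate a mouse click instead of reverse engineering
            # the site's AJAX calls.
            "js_script": (
                "function() { document.querySelector(%r).click(); }"
                % click_selector
            ),
            "js_run_at": "document-end",
        },
    }


task = js_fetch_task("http://example.com/list", "a.load-more")
```

The point is that the caller only describes *what* to click; the phantomjs fetcher takes care of loading the page and running the script.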


But won't it be slow, assuming it downloads css/images etc.?


Images are not downloaded by default, and both the fetcher and the phantomjs proxy are fully async.


To make it more flexible and easy to reuse. I have implemented most of the features I need now.


Because I already have a powerful distributed architecture. I was curious about the architecture of pyspider.

For example, how is the queue handled? Is it centralized? Is there a server managing it?


the architecture of pyspider: http://blog.binux.me/assets/image/pyspider-arch.png

And yes, the queue is centralized, and it lives in the scheduler. It's designed to handle about 10-100 million URLs per project.

The scheduler, fetchers, and processors are connected via rabbitmq (other queues can be used as well). Only one scheduler is allowed, but you can run multiple fetchers or processors as needed.
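A toy version of that layout, with stdlib queues standing in for rabbitmq, shows why only the scheduler is a singleton while fetchers scale out freely:

```python
import queue
import threading

# Minimal sketch of the scheduler -> fetcher -> processor layout:
# one scheduler feeding a central task queue, several fetcher
# workers draining it. In pyspider the queues are rabbitmq (or an
# alternative); here they are in-process stand-ins.
task_queue = queue.Queue()    # scheduler -> fetchers
result_queue = queue.Queue()  # fetchers -> processors


def scheduler(urls):
    # Only one scheduler: it owns the central task queue.
    for url in urls:
        task_queue.put(url)


def fetcher():
    # Fetchers can be scaled out; each one just drains the shared
    # queue. The actual HTTP fetch is replaced by a tagged tuple.
    while True:
        try:
            url = task_queue.get_nowait()
        except queue.Empty:
            return
        result_queue.put(("fetched", url))


scheduler(["http://example.com/%d" % i for i in range(10)])
workers = [threading.Thread(target=fetcher) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

results = []
while not result_queue.empty():
    results.append(result_queue.get())
```

Adding a fourth fetcher is just another thread (or another process/machine with rabbitmq); nothing else in the pipeline changes.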


Will it be a good fit if I, running on a hundred servers, need to scrape just the home page of a million sites? No analysis of the pages; that is done later.


The fetcher alone already fits that use case...


You are running

   phantomjs phantomjs_fetcher.js
and using it as proxy? The setup instructions are a bit unclear on this.


I wanted to make it an HTTP proxy in the beginning, but I found that hard to do. So now I POST everything to it instead, but haven't changed the name.

It still works like a proxy, though: any request with `fetch_type == 'js'` is fetched through phantomjs and the response is sent back to tornado_fetcher.
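A sketch of that routing decision on the fetcher side; the endpoint address and task shape here are assumptions for illustration, not pyspider's documented values:

```python
# Tasks marked fetch_type == 'js' are POSTed to the phantomjs
# fetcher process; everything else is fetched directly. The
# endpoint below is a made-up example address.
PHANTOMJS_PROXY = "http://localhost:25555"


def route(task):
    """Return (method, target) for a fetch task."""
    fetch = task.get("fetch", {})
    if fetch.get("fetch_type") == "js":
        # Not a proxy in the HTTP CONNECT sense: the whole task is
        # POSTed to phantomjs, which fetches and renders the page,
        # then returns the response to the tornado fetcher.
        return ("POST", PHANTOMJS_PROXY)
    return ("GET", task["url"])
```

Everything downstream of this branch point looks the same to the processor; it just receives a response body either way.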


That is a nice tool....nice work!


Any plans for a GUI-based web scraper interface similar to portia?


Currently, yes and no.

pyspider runs original Python code, while something like portia is a code generator (apologies if I'm wrong, I have not used it). So it could be added as another WebUI module.

But I have no idea how to do that without losing flexibility. So: we have a CSS selector helper, but no plan for a complete tool.


I am not trying to offend you, but I really don't understand when someone says "yes and no". I hear it more and more these days. Is this becoming a cliché? It can be "yes" or "no", not both together. "Yes and no" is "no" for me.


Don't know about other languages, but in German this phrase is pretty common when there is no clear yes-or-no answer. Like "yes to some extent, but not completely".


How's the performance compared to scrapy?



Take a look at the source code. The package hierarchy is not pythonic (using "libs" as a top-level package is not a good idea).


Ah, damn. The package hierarchy is not pythonic. That renders all the functionality of this package completely unusable.

Come on people, don't be like this. It takes 5 seconds to rephrase a comment like this into something friendlier.


Yes, the scheduler, fetcher, and processor are standalone here; they run in different processes. But they share some common libs. I haven't decided how to put them into a single package and run them together.

Any advice, or a project I can refer to?


agree


What is the recommended way? (Serious question, I have larger projects that I would someday like to refactor into proper packages)


There are also these guides, which provide plenty of information on how packages work and best practices:

https://packaging.python.org/en/latest/distributing.html

https://github.com/pypa/sampleproject


I have now "organized the code using a single top-level package".


Because the name "libs" gets installed into the global module namespace. It's better to use a less generic name.


Why isn't it a good idea? I have plenty of projects set up this way and it works well.

It looks pretty well organized to me.


The issue is that if you have two packages installed, and both use "libs" as their top-level package, they'll collide. Use "projectname.common" instead.


This is not true; you can specify package directories in setup.py.

See https://docs.python.org/2/distutils/setupscript.html#listing...
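For instance, a setup.py along these lines (with illustrative names) lists everything under a single top-level package, so nothing as generic as `libs` lands at the top of the installed namespace:

```python
from setuptools import setup

setup(
    name="pyspider",
    version="0.1",
    # One top-level package: installs pyspider/, pyspider/libs/, etc.,
    # instead of a generic top-level "libs" that could collide with
    # another project's "libs". A package_dir mapping can additionally
    # decouple installed package names from source-tree directories.
    packages=["pyspider", "pyspider.libs", "pyspider.scheduler",
              "pyspider.fetcher", "pyspider.processor"],
)
```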


A package name != the actual name of the directory in the source tree. My point stands.



