Hacker News
Show HN: A Python Spider System with Web UI (github.com/binux)
159 points by binux on Nov 17, 2014 | hide | past | favorite | 40 comments



This looks really nice. The API seems more user-friendly than scrapy's.


How does this compare to scrapy? Why would I use one over the other, or is either a fine choice?


I'm working on a benchmarking suite https://gist.github.com/binux/67b276c51e988f8e2c31 and have run into some problems...

pyspider comes from a vertical search engine project. We had two issues:

- 100+ websites: they may change their templates or go down at any time. We needed a dashboard to monitor the changes and the failures.

- updates within 5 minutes: when a website is updated, we need to follow it within 5 minutes. We use an update time from the index (list) page to detect changed pages, and pages should be re-crawled after about 30 days in case we missed something. A powerful scheduler is needed.

Obviously, I couldn't find the right way to do this with scrapy. I'm not very familiar with scrapy, so I can't say what pyspider can do that scrapy can't.
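The scheduling rules above (follow an update quickly when the index page reports a newer timestamp, and re-crawl everything after roughly 30 days regardless) can be sketched as a plain decision function. The names here are illustrative, not pyspider's actual API:

```python
import time

# Re-crawl a page unconditionally after ~30 days, even if the index
# page never reported a change, in case an update was missed.
MAX_AGE = 30 * 24 * 60 * 60  # seconds


def should_recrawl(last_crawl_time, index_update_time, now=None):
    """Decide whether a page needs to be fetched again.

    last_crawl_time   -- unix timestamp of our last successful fetch
    index_update_time -- update time reported by the index (list) page
    """
    now = time.time() if now is None else now
    if index_update_time > last_crawl_time:
        return True  # the index page says the content changed
    if now - last_crawl_time > MAX_AGE:
        return True  # too old: re-crawl in case we missed something
    return False
```

A real scheduler layers priorities and rate limits on top of this, but the per-page decision is essentially the two checks shown.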


+1 Scrapy comparison please

Can you compare it to scrapy, as requested by other posters? Why could you not build on top of scrapy and leverage celery for scheduling, etc. (http://www.celeryproject.org/)?

What is the immediate value add of using pyspider?


Can someone explain what this is?


Nice project! I do wish it supported a PostgreSQL backend rather than (or as well as I guess) MySQL.


What is a "spider system"? Never heard that term before.


sorry :(


Thanks for making me feel bad about my python-based aggregation solution :)

https://github.com/AZdv/agricatch


I really like the flow/UX. Congratulations! Nice job!

What is the roadmap?

I am really into scraping; it is part of my daily job. I could consider integrating it into one of my architectures.


Furthermore, what do you mean by `Javascript pages supported`? Could I just specify where it has to click, or do I need to reverse engineer the AJAX calls?


http://demo.pyspider.org/debug/js_test_sciencedirect is a sample for this.

There is a phantomjs fetcher that can render the page as WebKit does. Furthermore, you can have some JavaScript run before/after the page loads, e.g. to simulate a mouse click.
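As a rough illustration of what such a JS-enabled fetch request might carry, here is a task dict with the page URL plus a script to run after load. The field names are modeled on pyspider's fetch options (`fetch_type`, `js_script`), but the exact shape shown is an assumption for illustration, not the documented API:

```python
def js_fetch_task(url, click_selector):
    # Build a fetch task routed to the phantomjs fetcher. Field names
    # mimic pyspider's fetch options; treat the exact dict layout as
    # illustrative rather than authoritative.
    return {
        "url": url,
        "fetch": {
            "fetch_type": "js",
            # JavaScript run after the page has loaded, e.g. to
            # simulate a mouse click instead of reverse engineering
            # the site's AJAX calls.
            "js_script": (
                "function() { document.querySelector(%r).click(); }"
                % click_selector
            ),
            "js_run_at": "document-end",
        },
    }


task = js_fetch_task("http://example.com/list", "a.load-more")
```

The point is that the caller only describes *what* to click; the phantomjs fetcher takes care of loading the page and running the script.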


But won't it be slow, assuming it downloads css/images etc.?


Images are not downloaded by default, and both the fetcher and the phantomjs proxy are fully async.


To make it more flexible and easy to reuse. I have implemented most of the features I need now.


Because I already have a powerful distributed architecture. I was curious about the architecture of pyspider.

For example, how is the queue handled? Is it centralized? Is there a server managing it?


the architecture of pyspider: http://blog.binux.me/assets/image/pyspider-arch.png

And yes, the queue is centralized, and it lives in the scheduler. It's designed to handle about 10-100 million URLs per project.

The scheduler, fetchers, and processors are connected via rabbitmq (other queues can be used as well). Only one scheduler is allowed, but you can run multiple fetchers or processors as needed.
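A toy version of that layout, with stdlib queues standing in for rabbitmq, shows why only the scheduler is a singleton while fetchers scale out freely:

```python
import queue
import threading

# Minimal sketch of the scheduler -> fetcher -> processor layout:
# one scheduler feeding a central task queue, several fetcher
# workers draining it. In pyspider the queues are rabbitmq (or an
# alternative); here they are in-process stand-ins.
task_queue = queue.Queue()    # scheduler -> fetchers
result_queue = queue.Queue()  # fetchers -> processors


def scheduler(urls):
    # Only one scheduler: it owns the central task queue.
    for url in urls:
        task_queue.put(url)


def fetcher():
    # Fetchers can be scaled out; each one just drains the shared
    # queue. The actual HTTP fetch is replaced by a tagged tuple.
    while True:
        try:
            url = task_queue.get_nowait()
        except queue.Empty:
            return
        result_queue.put(("fetched", url))


scheduler(["http://example.com/%d" % i for i in range(10)])
workers = [threading.Thread(target=fetcher) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

results = []
while not result_queue.empty():
    results.append(result_queue.get())
```

Adding a fourth fetcher is just another thread (or another process/machine with rabbitmq); nothing else in the pipeline changes.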


Will it be a good fit if I, running on a hundred servers, need to scrape just the home page of a million sites? No analysis of the pages; that is done later.


The fetcher alone already fits that use case...


You are running

   phantomjs phantomjs_fetcher.js
and using it as proxy? The setup instructions are a bit unclear on this.


I wanted to make it an HTTP proxy in the beginning, but I found that hard to do. So now I POST everything to it instead, but haven't changed the name.

It still works like a proxy, though: any request with `fetch_type == 'js'` is fetched through phantomjs and the response is sent back to tornado_fetcher.
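A sketch of that routing decision on the fetcher side; the endpoint address and task shape here are assumptions for illustration, not pyspider's documented values:

```python
# Tasks marked fetch_type == 'js' are POSTed to the phantomjs
# fetcher process; everything else is fetched directly. The
# endpoint below is a made-up example address.
PHANTOMJS_PROXY = "http://localhost:25555"


def route(task):
    """Return (method, target) for a fetch task."""
    fetch = task.get("fetch", {})
    if fetch.get("fetch_type") == "js":
        # Not a proxy in the HTTP CONNECT sense: the whole task is
        # POSTed to phantomjs, which fetches and renders the page,
        # then returns the response to the tornado fetcher.
        return ("POST", PHANTOMJS_PROXY)
    return ("GET", task["url"])
```

Everything downstream of this branch point looks the same to the processor; it just receives a response body either way.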


That is a nice tool....nice work!


Any plans for a GUI-based web scraper interface similar to portia?


Currently, yes and no.

pyspider runs original Python code, while something like portia is a code generator (apologies if I'm wrong, I have not used it). So it could be added as another WebUI module.

But I have no idea how to do that without losing flexibility. So: we have a CSS selector helper, but no plan for a complete tool.


I am not trying to offend you, but I really don't understand when someone says "yes and no". I hear it more and more these days. Is this becoming a cliché? It can be "yes" or "no", not both together. "Yes and no" is "no" for me.


Don't know about other languages, but in German this phrase is pretty common when there is no clear yes-or-no answer. Like "yes to some extent, but not completely".


How's the performance compared to scrapy?



Take a look at the source code. The package hierarchy is not pythonic (using "libs" as a top-level package is not a good idea).


Ah, damn. The package hierarchy is not pythonic. That renders all the functionality of this package completely unusable.

Come on people, don't be like this. It takes 5 seconds to rephrase a comment like this into something friendlier.


Yes, the scheduler, fetcher, and processor are standalone here; they run in different processes. But they share some common libs. I haven't decided how to put them into a single package and run them together.

Any advice, or a project I can refer to?


agree


What is the recommended way? (Serious question, I have larger projects that I would someday like to refactor into proper packages)


There are also these guides, which provide plenty of information on how packages work and best practices:

https://packaging.python.org/en/latest/distributing.html

https://github.com/pypa/sampleproject


I have now "organized the code using a single top-level package".


Because the name "libs" gets installed into the global module namespace. It's better to use a less generic name.


Why isn't it a good idea? I have plenty of projects set up this way and it works well.

It looks pretty well organized to me.


The issue is that if you have two packages installed, and both use "libs" as their top-level package, they'll collide. Use "projectname.common" instead.


This is not true; you can specify package directories in setup.py.

See https://docs.python.org/2/distutils/setupscript.html#listing...
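For instance, a setup.py along these lines (with illustrative names) lists everything under a single top-level package, so nothing as generic as `libs` lands at the top of the installed namespace:

```python
from setuptools import setup

setup(
    name="pyspider",
    version="0.1",
    # One top-level package: installs pyspider/, pyspider/libs/, etc.,
    # instead of a generic top-level "libs" that could collide with
    # another project's "libs". A package_dir mapping can additionally
    # decouple installed package names from source-tree directories.
    packages=["pyspider", "pyspider.libs", "pyspider.scheduler",
              "pyspider.fetcher", "pyspider.processor"],
)
```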


A package name != the actual name of the directory in the source tree. My point stands.



