Scrapy: New Python web crawling & scraping framework (built on Twisted) (scrapy.org)
92 points by lunchbox on Dec 28, 2008 | 19 comments



Awesome, now somebody go back in time 3 months and release this so I could have not spent that time writing the same thing (okay, not exactly the same, mine isn't nearly as pluggable).


This looks quite promising and I look forward to trying it out. I wonder how it will work compared with my current approach of using Mechanize and BeautifulSoup, along with the threading module.


For long-running, large jobs I can tell you from experience it would work about a gazillion times faster. Especially if you drop BeautifulSoup for something like lxml.


Agreed. BeautifulSoup can be quite slow (unless I'm doing it wrong, which I probably am).

It accounts for 98% of the time of a current job I'm running. If anyone can provide some tips it'd be much appreciated.


It's rather unlikely you're doing anything wrong. IIRC, BeautifulSoup docs acknowledge that it is rather slow.

Also, if you don't make use of the methods that extract/unravel object trees, they may not be properly GC'd, leading to further slowdowns. I can't remember the method names exactly (might be destroy() and extract()), but they're in the docs.


I can't speak about Mechanize, but I did a project with BeautifulSoup a few months back.

For scraping specific elements in a page, the xpath/Firebug integration is a huge win. Being able to highlight an item and grab the xpath selector in Firebug saves so much time, it's not even funny.
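
To illustrate the XPath approach mentioned above, here is a minimal sketch using only Python's standard library. The markup, the `extract_links` helper, and the `ITEM_LINKS` expression are made up for the example. `xml.etree.ElementTree` only understands a subset of XPath and needs well-formed markup, so for real pages and for the full expressions Firebug copies out you would more likely use something like lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <div id="listing">
      <ul>
        <li class="item"><a href="/a">First</a></li>
        <li class="item"><a href="/b">Second</a></li>
        <li class="ad"><a href="/x">Sponsored</a></li>
      </ul>
    </div>
  </body>
</html>
"""

# XPath-style selector: links inside "item" rows of the listing div.
ITEM_LINKS = ".//div[@id='listing']/ul/li[@class='item']/a"


def extract_links(markup, xpath):
    """Return (text, href) pairs for every element matched by xpath."""
    root = ET.fromstring(markup.strip())
    return [(a.text, a.get("href")) for a in root.findall(xpath)]


if __name__ == "__main__":
    print(extract_links(PAGE, ITEM_LINKS))
    # prints [('First', '/a'), ('Second', '/b')]
```

The point is that one selector string replaces a hand-written walk over the tree, which is why being able to copy the selector from the browser saves so much time.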


If anyone is interested in doing a small project for us using Scrapy (and lxml and whatever else), please drop me an e-mail.


Please tell me why Scrapy is better than wget. I can easily call wget from my Python scripts...


wget is synchronous while Twisted is an asynchronous networking engine. This means that you don't need to wait for a request to finish before making another one (or making pancakes, or doing whatever you want).

I essentially wrote a parallelized version of scrapy which has the ability to make hundreds of requests per second, depending on available CPUs. You could never achieve that level of performance using wget.
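
A rough sketch of the synchronous vs. asynchronous point, for anyone who wants to see it in code. This uses the standard library's `asyncio` rather than Twisted, and `fake_fetch` is a made-up stand-in that sleeps instead of touching the network, so it only shows the scheduling idea, not how Scrapy itself is implemented.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    """Hypothetical request: waits `delay` seconds instead of doing real I/O."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl(urls):
    # All "requests" are started together; while one is waiting,
    # the event loop is free to progress the others.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed_crawl(urls):
    """Run the crawl and return (pages, elapsed_seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"http://example.com/page/{i}" for i in range(10)]
    pages, elapsed = timed_crawl(urls)
    print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

Ten simulated requests of 0.2 seconds each would take about 2 seconds one after another; run concurrently they should finish in roughly the time of a single request.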


This is great. I was running threads on a current crawl job but the real bottleneck is BeautifulSoup and not the network. So splitting the project into threads (while it helped about 10%) wasn't really necessary and Twisted probably would have done the trick.
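
One way to check where the time actually goes before reaching for threads or Twisted is to time each stage separately. The sketch below is only an illustration: `StageTimer`, `fake_download`, and `fake_parse` are hypothetical names, and the sleeps are placeholders for a real fetch and a real parser.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if not total:
            return {}
        return {name: t / total for name, t in self.totals.items()}


def fake_download(url):
    time.sleep(0.002)  # placeholder for network latency
    return "<p>x</p>" * 200


def fake_parse(html):
    time.sleep(0.03)  # placeholder for a slow parser
    return html.count("<p>")


def run_job(urls, timer):
    results = []
    for url in urls:
        with timer.stage("download"):
            html = fake_download(url)
        with timer.stage("parse"):
            results.append(fake_parse(html))
    return results


if __name__ == "__main__":
    timer = StageTimer()
    run_job(["u1", "u2", "u3"], timer)
    for name, share in timer.shares().items():
        print(f"{name}: {share:.0%}")
```

If the parse share dominates, as described above, concurrency on the network side will not buy much and a faster parser is the better fix.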


Does anyone know what causes the memory leak? I use Scrapy to fetch some data; the total network in/out is about 400M, but Scrapy's memory usage grew by about 1.5G.


Can it simulate a browser as well as HtmlUnit?


I'd like to ask people here for their thoughts on frameworks. This looks well suited to a project I have, but it is built on Twisted, and the preferred framework seems to be Django. I'm a PHP coder who is just about to step up to the Python challenge, so I'm thinking it would be better to start with a more established framework. Thoughts would be most welcome.


Django is a framework for creating web applications. Twisted is a framework for network programming. Scrapy is a framework for scraping web pages.

If you're thinking about learning web development with Python, I'd suggest Django. Other Python web frameworks are TurboGears, Pylons, web.py and CherryPy. Django tends to have the best documentation and probably the largest community right now, however.


Twisted is not a web framework, so I don't think there would be any problem with using it in a Django project.


Ah just what I needed and on the day I needed it! Thanks HN for posting it and the creators for making it.


Is anyone from Scrapy here? I have some technical questions. Is there an IRC group?


I am looking for some community action for Scrapy. It looks useful for a project I'm working on that currently uses BeautifulSoup, but I'm not digging the sluggish performance.

I am having trouble resolving the docs to the code. Is there an IRC, mailing list or forum?


The IRC group on freenode has some groupies there who help. I have it installed. Don't bother with Debian; use a clean install of Ubuntu.

Here are some notes to get started from a clean install (replace with your own vars):

  install Ubuntu 8.10
  apt-get update
  apt-get install subversion
  adduser --home /home/bleu bleu
  su bleu
  svn co http://svn.scrapy.org/scrapy
  mv scrapy-trunk scrapy
  ls scrapy
    branches tags trunk
  sudo root
  as root:
  apt-get install python-twisted
  apt-get install nano
  su bleu
  source ~/.bashrc
  pico ~/.bashrc
  #add this to end of file:
  export PYTHONPATH=/home/bleu/scrapy/trunk

  python
  >>> import scrapy
  >>> quit()
  scrapy/trunk/scrapy/bin/scrapy-admin.py startproject myproject



