Scrapy: New Python web crawling & scraping framework (built on Twisted)

tdavis · on Dec 28, 2008

Awesome, now somebody go back in time 3 months and release this so I could have not spent that time writing the same thing (okay, not exactly the same, mine isn't nearly as pluggable).

lunchbox · on Dec 28, 2008

This looks quite promising and I look forward to trying it out. I wonder how it will work compared with my current approach of using Mechanize and BeautifulSoup, along with the threading module.

tdavis · on Dec 28, 2008

For long-running, large jobs I can tell you from experience it would work about a gazillion times faster. Especially if you drop BeautifulSoup for something like lxml.

breck · on Dec 29, 2008

Agreed. BeautifulSoup can be quite slow (unless I'm doing it wrong, which I probably am).

It accounts for 98% of the time of a current job I'm running. If anyone can provide some tips it'd be much appreciated.

tdavis · on Dec 29, 2008

It's rather unlikely you're doing anything wrong. IIRC, BeautifulSoup docs acknowledge that it is rather slow.

Also, if you don't make use of the methods that extract/unravel object trees, they may not be properly GC'd, leading to further slowdowns. I can't remember the method names exactly (probably decompose() and extract()), but they're in the docs.

iamelgringo · on Dec 28, 2008

I can't speak about Mechanize, but I did a project with BeautifulSoup a few months back.

For scraping specific elements in a page, the XPath/Firebug integration is a huge win. Being able to highlight an item and grab the XPath selector in Firebug saves so much time, it's not even funny.
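To make the XPath idea concrete, here is a minimal sketch using only the Python standard library. It assumes a small, well-formed sample page (the `PAGE` string and its URLs are made up for illustration) and uses `xml.etree.ElementTree`, which supports only a subset of XPath. Real-world HTML is often not well-formed, so a parser such as lxml, mentioned elsewhere in this thread, would likely be the practical choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; real pages are rarely this tidy.
PAGE = """<html>
  <body>
    <div id="nav"><a href="/home">Home</a></div>
    <div id="content">
      <h1>Listing</h1>
      <a href="/item/1">First item</a>
      <a href="/item/2">Second item</a>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# A browser tool might report an absolute path such as /html/body/div[2]/a.
# ElementTree paths are relative to the root element, so the leading
# /html becomes "." here.
links = root.findall("./body/div[2]/a")
hrefs = [a.get("href") for a in links]

# The same links, selected by attribute instead of position.
by_id = [a.get("href") for a in root.findall(".//div[@id='content']/a")]

print(hrefs)
```

Selecting by attribute tends to survive page layout changes better than a positional path, which is worth keeping in mind when copying a selector straight out of a browser tool.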

sachinag · on Dec 28, 2008

If anyone is interested in doing a small project for us using Scrapy (and lxml and whatever else), please drop me an e-mail.

glazz · on Dec 28, 2008

Please tell me why Scrapy is better than wget. I can easily call wget from my Python scripts...

tdavis · on Dec 28, 2008

wget is synchronous while Twisted is an asynchronous networking engine. This means that you don't need to wait for a request to finish before making another one (or making pancakes, or doing whatever you want).
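As a rough illustration of the difference, the sketch below uses Python's standard-library `asyncio` rather than Twisted (a deliberate substitution, so it runs without extra packages). `fake_fetch` is a hypothetical stand-in that sleeps instead of touching the network, and the URLs are placeholders. The point is only that ten simulated requests overlap instead of running one after another.

```python
import asyncio
import time

async def fake_fetch(url, delay=0.1):
    # Hypothetical stand-in for a network request: while this coroutine
    # sleeps, the event loop is free to run the other requests.
    await asyncio.sleep(delay)
    return f"body of {url}"

async def crawl(urls):
    # Start every request at once; gather returns results in input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

def timed_crawl(urls):
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    return pages, time.perf_counter() - start

urls = [f"http://example.com/page{i}" for i in range(10)]
pages, elapsed = timed_crawl(urls)
print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

Run one at a time, ten 0.1-second waits would take about a second; overlapped, the total should stay close to a single wait.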

I essentially wrote a parallelized version of scrapy which has the ability to make hundreds of requests per second, depending on available CPUs. You could never achieve that level of performance using wget.

breck · on Dec 29, 2008

This is great. I was running threads on a current crawl job, but the real bottleneck is BeautifulSoup and not the network. So splitting the project into threads (while it helped about 10%) wasn't really necessary, and Twisted probably would have done the trick.
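One simple way to check where a crawl job spends its time is to time each stage separately. The sketch below is only a pattern, not a measurement of any real job: `fetch` is a hypothetical stand-in that returns a canned page, and the standard-library `html.parser` stands in for BeautifulSoup. In a real job, `fetch` and `parse` would be replaced by the actual download and parsing calls.

```python
import time
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    """Counts <a> start tags; a stand-in for heavier parsing work."""
    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1

def fetch(url):
    # Hypothetical stand-in for a download: returns a canned page.
    return "<html><body>" + '<a href="/x">x</a>' * 20000 + "</body></html>"

def parse(html):
    parser = LinkCounter()
    parser.feed(html)
    return parser.links

# Accumulate wall-clock time per stage across all pages.
timings = {"fetch": 0.0, "parse": 0.0}
for url in ["http://example.com/a", "http://example.com/b"]:
    t0 = time.perf_counter()
    html = fetch(url)
    t1 = time.perf_counter()
    count = parse(html)
    t2 = time.perf_counter()
    timings["fetch"] += t1 - t0
    timings["parse"] += t2 - t1

for stage, seconds in timings.items():
    print(stage, round(seconds, 4))
```

If the parse total dominates, adding threads or an asynchronous downloader will not help much; a faster parser will.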

liuliu · on Jan 10, 2009

Does anyone know what causes the memory leak? I used Scrapy to fetch some data; the total network in/out was about 400 MB, but Scrapy's memory usage grew by about 1.5 GB.

msie · on Dec 29, 2008

Can it simulate a browser as well as HtmlUnit?

agentbleu · on Dec 28, 2008

I'd like to ask people here for their thoughts on frameworks. This looks well suited to a project I have, but it is built on Twisted, and the preferred framework seems to be Django. I'm a PHP coder who is just about to step up to the Python challenge, so I'm thinking it would be better to start with a more established framework. Thoughts would be most welcome.

iamelgringo · on Dec 28, 2008

Django is a framework for creating web applications. Twisted is a framework for network programming. Scrapy is a framework for scraping web pages.

If you're thinking about learning web development with Python, I'd suggest Django. Other Python web frameworks are TurboGears, Pylons, web.py and CherryPy. Django tends to have the best documentation and probably the largest community right now, however.

arockwell · on Dec 28, 2008

Twisted is not a web framework, so I don't think there would be any problem with using it in a Django project.

agentbleu · on Dec 28, 2008

Ah just what I needed and on the day I needed it! Thanks HN for posting it and the creators for making it.

agentbleu · on Dec 28, 2008

Is anyone from Scrapy here? I have some technical questions. Is there an IRC channel?

lowkey · on Dec 29, 2008

I am looking for some community action for Scrapy. It looks useful for a project I'm working on, which currently uses BeautifulSoup, but I'm not digging the sluggish performance.

I am having trouble reconciling the docs with the code. Is there an IRC channel, mailing list or forum?

agentbleu · on Dec 29, 2008

The IRC channel on Freenode has some regulars who help. I have it installed; don't bother with Debian, use a clean install of Ubuntu.

Here are some notes to get started from a clean install (replace with your own values):

    install Ubuntu 8.10
    apt-get update
    apt-get install subversion
    adduser --home /home/bleu bleu
    su bleu
    svn co http://svn.scrapy.org/scrapy
    mv scrapy-trunk scrapy
    ls scrapy
    branches tags trunk
    sudo root
    as root:
    apt-get install python-twisted
    apt-get install nano
    su bleu
    source ~/.bashrc
    pico ~/.bashrc
    #add this to end of file:
    export PYTHONPATH=/home/bleu/scrapy/trunk

    python
    >>> import scrapy
    quit()
    scrapy/trunk/scrapy/bin/scrapy-admin.py startproject myproject