Awesome, now somebody go back in time 3 months and release this so I wouldn't have spent that time writing the same thing (okay, not exactly the same; mine isn't nearly as pluggable).
This looks quite promising and I look forward to trying it out. I wonder how it will work compared with my current approach of using Mechanize and BeautifulSoup, along with the threading module.
For long-running, large jobs I can tell you from experience it would work about a gazillion times faster. Especially if you drop BeautifulSoup for something like lxml.
It's rather unlikely you're doing anything wrong. IIRC, BeautifulSoup docs acknowledge that it is rather slow.
Also, if you don't call the methods that tear down or detach object trees, they may not be properly GC'd, leading to further slowdowns. I can't remember the method names exactly (I believe they are decompose() and extract()), but they're in the docs.
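To illustrate why explicitly tearing down a parse tree can matter, here is a minimal sketch using a toy `Node` class (hypothetical, not BeautifulSoup's own classes). It assumes CPython, where parent/child back-references form a cycle that reference counting alone cannot free; the cycle only goes away when the cyclic collector runs or when the links are broken by hand. The names `Node`, `break_links`, and `freed_without_gc` are made up for this example.

```python
import gc
import weakref


class Node:
    """Toy tree node with a parent back-reference (creates a reference cycle)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def break_links(self):
        """Recursively drop parent/child links so refcounting can free the tree."""
        for child in self.children:
            child.break_links()
        self.children = []
        self.parent = None


def build_tree():
    root = Node("html")
    Node("body", parent=root)  # child keeps a reference back to root
    return root


def freed_without_gc(break_links):
    """Return True if the tree is freed immediately, without the cyclic GC."""
    gc_was_enabled = gc.isenabled()
    gc.disable()  # rely on reference counting only
    try:
        root = build_tree()
        probe = weakref.ref(root)  # lets us observe whether root is still alive
        if break_links:
            root.break_links()
        del root
        return probe() is None
    finally:
        if gc_was_enabled:
            gc.enable()
```

In a long-running crawl that parses many pages, trees that wait for the cyclic collector can pile up between collections, which is one plausible reason explicit teardown helps.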
I can't speak about Mechanize, but I did a project with BeautifulSoup a few months back.
For scraping specific elements in a page, the xpath/Firebug integration is a huge win. Being able to highlight an item and grab the xpath selector in Firebug saves so much time, it's not even funny.
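As a rough sketch of what selecting elements by path looks like in code, the example below uses the standard library's `xml.etree.ElementTree`, which only supports a subset of XPath and needs well-formed markup. lxml accepts the same style of expression with fuller XPath support and more forgiving HTML parsing. The page fragment, the `extract_items` helper, and the selector are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; ElementTree requires well-formed markup.
PAGE = """
<html>
  <body>
    <div id="listing">
      <ul>
        <li class="item"><a href="/a">First</a><span class="price">10</span></li>
        <li class="item"><a href="/b">Second</a><span class="price">25</span></li>
      </ul>
    </div>
  </body>
</html>
"""


def extract_items(markup):
    """Return (title, href, price) tuples for every listed item."""
    root = ET.fromstring(markup.strip())
    items = []
    # Path expression of the kind a browser inspector would hand you.
    for li in root.findall(".//div[@id='listing']/ul/li[@class='item']"):
        link = li.find("a")
        price = li.find("span[@class='price']")
        items.append((link.text, link.get("href"), int(price.text)))
    return items


ITEMS = extract_items(PAGE)
```

The whole extraction lives in one path string, so when the page layout changes you usually only need to paste in a new selector.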
wget is synchronous while Twisted is an asynchronous networking engine. This means that you don't need to wait for a request to finish before making another one (or making pancakes, or doing whatever you want).
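To make the difference concrete, here is a small sketch that uses the standard library's `asyncio` instead of Twisted, and simulated delays instead of real HTTP requests, so it is only an analogy for what Twisted does. The function names and URLs are hypothetical. Ten "requests" that each take 0.2 seconds finish in roughly the time of one, because none of them blocks the others while waiting.

```python
import asyncio
import time


async def fake_fetch(url, delay):
    """Stand-in for a network request; sleeping yields control to the event loop."""
    await asyncio.sleep(delay)
    return url, delay


async def crawl(urls, delay=0.2):
    # Start every request at once and wait for all of them together.
    return await asyncio.gather(*(fake_fetch(u, delay) for u in urls))


def timed_crawl(urls, delay=0.2):
    """Run the crawl and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(crawl(urls, delay))
    return results, time.perf_counter() - start


URLS = [f"http://example.com/page/{i}" for i in range(10)]
RESULTS, ELAPSED = timed_crawl(URLS)
```

A synchronous loop over the same ten URLs would take about ten times the per-request delay, since each request has to finish before the next one starts.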
I essentially wrote a parallelized version of Scrapy that can make hundreds of requests per second, depending on available CPUs. You could never achieve that level of performance using wget.
This is great. I was running threads on a current crawl job, but the real bottleneck is BeautifulSoup and not the network. So splitting the project into threads (while it helped about 10%) wasn't really necessary, and Twisted probably would have done the trick.
Does anyone know what causes the memory leak? I use Scrapy to fetch some data; the total network in/out is about 400 MB, but Scrapy's memory usage grew by about 1.5 GB.
I'd like to ask people here for their thoughts on frameworks. This looks well suited to a project I have, but it is built on Twisted, and the preferred framework seems to be Django. I'm a PHP coder who is just about to step up to the Python challenge, so I'm thinking it would be better to start with a more established framework. Thoughts would be most welcome.
Django is a framework for creating web applications. Twisted is a framework for network programming. Scrapy is a framework for scraping web pages.
If you're thinking about learning web development with Python, I'd suggest Django. Other Python web frameworks are TurboGears, Pylons, web.py and CherryPy. Django tends to have the best documentation and probably the largest community right now, however.
I am looking for some community action for Scrapy. It looks useful for a project I'm working on, which currently uses BeautifulSoup, but I'm not digging the sluggish performance.
I am having trouble reconciling the docs with the code. Is there an IRC channel, mailing list or forum?