Crawl a website with scrapy and store extracted results with MongoDB

JackC · on April 23, 2012

For really quick one-off scraping, httplib2+lxml+PyQuery is a pretty neat combination:

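  import httplib2, pyquery
  import lxml.etree  # import the submodule explicitly instead of relying on pyquery to load it

  h = httplib2.Http(".cache")  # responses are cached on disk in ./.cache

  def get(url):
      # Accept a cached copy up to an hour old, then wrap the parsed HTML in PyQuery
      resp, content = h.request(url, headers={'cache-control': 'max-age=3600'})
      return pyquery.PyQuery(lxml.etree.HTML(content))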

This gives you a little function that fetches any URL as a jQuery-like object:

  pq = get("http://foo.com/bar")
  checkboxes = pq('form input[type=checkbox]')
  nextpage = pq('a.next').attr('href')

And of course all of the requests are cached using whatever cache headers you want, so repeated requests will load instantly as you iterate.
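To make the "iterate" part concrete, here is a rough sketch of following the a.next links page by page. It assumes the get() function above is passed in as the fetcher; crawl_pages and max_pages are names made up for this illustration, not part of any library.

```python
from urllib.parse import urljoin


def crawl_pages(get, start_url, max_pages=10):
    """Yield (url, page) pairs, following 'a.next' links from start_url.

    `get` is any callable that returns a PyQuery-like object for a URL
    (for example the get() helper above). `max_pages` is a hypothetical
    safety limit so a misbehaving site cannot keep the loop running.
    """
    url = start_url
    seen = set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        pq = get(url)
        yield url, pq
        # .attr('href') gives None when no element matches the selector
        href = pq('a.next').attr('href')
        # Resolve relative links against the page they were found on
        url = urljoin(url, href) if href else None
```

Usage would look something like `for url, pq in crawl_pages(get, "http://foo.com/bar"): ...`. Because the fetcher is passed in, you can swap in a stub while testing your selectors.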

Just something else to throw in the toolbelt ...

the_cat_kittles · on April 23, 2012

Have you checked out Kenneth Reitz's requests? It's fantastic, you might like it.

codehenge · on April 23, 2012

Link, for the interested:

https://github.com/kennethreitz/requests

the_cat_kittles · on April 23, 2012

Thanks, I should have included a code sample too:

    import requests
    from lxml import etree

    jquery_like_page = etree.HTML(requests.get('url').text)
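One caveat worth sketching: etree.HTML returns a plain lxml element, so without pyquery on top you query it with XPath rather than jQuery-style selectors. The snippet below is only an illustration; SAMPLE_HTML and extract are made-up names, and the inline HTML stands in for the text of a real response so it runs without a network connection.

```python
from lxml import etree

# Hypothetical markup standing in for requests.get(url).text
SAMPLE_HTML = """
<html><body>
  <form>
    <input type="checkbox" name="a">
    <input type="text" name="b">
    <input type="checkbox" name="c">
  </form>
  <a class="next" href="/page/2">next</a>
</body></html>
"""


def extract(html_text):
    """Return (checkbox names, next-page href or None) from an HTML string."""
    root = etree.HTML(html_text)
    # XPath equivalent of the CSS selector 'form input[type=checkbox]'
    checkbox_names = root.xpath('//form//input[@type="checkbox"]/@name')
    # Matches only when the class attribute is exactly "next"
    next_links = root.xpath('//a[@class="next"]/@href')
    return checkbox_names, (next_links[0] if next_links else None)
```

If you prefer the CSS selector syntax, wrapping the same element in pyquery.PyQuery, as in the first comment, gets you back to pq('a.next').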

jat1 · on April 23, 2012

Also check this out for a pretty good discussion on scraping: http://pyvideo.org/video/609/web-scraping-reliably-and-effic...

BaltoRouberol · on April 23, 2012

Yeah, I actually learnt scraping from Asheesh :) He's awesome.

jat1 · on April 23, 2012

I have been playing with scraping for quite some time now and have my own scripts and stuff, but I found that video informative, and there were a few useful snippets I had missed.

Keep meaning to check out more of the PyCon vids.

danneu · on April 24, 2012

Here's the same functionality written in Ruby using Chris Kite's crawler called Anemone[1]. Gist: https://gist.github.com/2475824. Screenshot: http://i.imgur.com/cbv9A.png

[1]: http://anemone.rubyforge.org/doc/index.html

ananthrk · on April 23, 2012

Cool. BTW, is there a reason for naming the file "isullshit_spiders.py" and not as "isbullshit_spiders.py"? :)

BaltoRouberol · on April 23, 2012

Oh, that's just a typo. My bad. Edit: there, corrected.

hack_edu · on April 23, 2012

I really want to read this. The topic is right up my alley.

Unfortunately, the page is literally broken and unreadable on Android ICS with Chrome :(

martius · on April 23, 2012

Hi all, thank you for reporting!

This is a known issue and I'm working on it! I'll try to push a "responsive" version today or tomorrow.

noinput · on April 23, 2012

http://www.readability.com/mobile/articles/yfhqwo0t cleans it right up.

joshu · on April 24, 2012

Completely unreadable on iPhone too.

mumphster · on April 23, 2012

Same using Mobile Safari.

jordanmessina · on April 23, 2012

Same using Chrome on Windows, unless I resize the browser and make the width about 1000px.

martius · on April 23, 2012

Yes, it's true for any viewport with a width of less than 940px. I'll do my best to fix this too.