
The other elephant in the room: with rich JavaScript interfaces, crawlers are only feasible to build if you have enough resources - i.e. only companies like Google, Yahoo, and Microsoft can do it. Web pages are becoming less and less accessible to crawlers, and Google has a lot to gain from this trend.

The writing on the wall is pretty clear - Google has a monopoly in search and the only way they can be disrupted is through alternative means of finding content (like social networks).

Which is why, as much as I like them and their products, I find the integration with Google+ downright scary and dangerous for the health of our ecosystem.




That's not really true - John Resig hooked Rhino up to a good-enough DOM model in a weekend:

http://ejohn.org/blog/bringing-the-browser-to-the-server/
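For what it's worth, a minimal sketch of that approach, assuming Rhino is on the classpath and a DOM shim like env.js has been saved locally (the env.rhino.js path and example.com URL are placeholders), looks something like this:

    import org.mozilla.javascript.Context;
    import org.mozilla.javascript.Scriptable;
    import java.io.FileReader;

    public class RhinoDomSketch {
        public static void main(String[] args) throws Exception {
            Context cx = Context.enter();
            try {
                // Run interpreted; large shims can blow past JVM bytecode
                // limits if compiled.
                cx.setOptimizationLevel(-1);
                Scriptable scope = cx.initStandardObjects();
                // Load a DOM shim (env.js) so page scripts see window/document.
                // "env.rhino.js" is a placeholder path to the downloaded shim.
                cx.evaluateReader(scope, new FileReader("env.rhino.js"),
                                  "env.rhino.js", 1, null);
                // Point the emulated browser at a page; the shim is expected to
                // fetch it and run the page's scripts against its fake DOM.
                cx.evaluateString(scope,
                    "window.location = 'http://example.com/';", "<load>", 1, null);
                // Inspect the resulting DOM, e.g. the page title.
                Object title = cx.evaluateString(scope, "document.title",
                                                 "<query>", 1, null);
                System.out.println(Context.toString(title));
            } finally {
                Context.exit();
            }
        }
    }

Whether env.js is faithful enough for arbitrary real-world pages is a separate question, but it shows the per-page JavaScript problem is approachable.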

No, you're not going to match Google's crawling infrastructure or data extraction libraries. But if you just want to grab pages off the web, parse them, and handle the JavaScript on those pages correctly, you can easily rig something up out of Mechanize/html5lib/V8 or Nutch/Tika/Rhino.
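On the Tika side, the fetch-and-extract step is essentially a one-liner with the Tika facade; this is only a sketch with a placeholder URL, and JavaScript execution would still have to be wired in via Rhino as above:

    import org.apache.tika.Tika;
    import java.net.URL;

    public class TikaFetchSketch {
        public static void main(String[] args) throws Exception {
            // Let Tika detect the content type and extract plain text from a page.
            Tika tika = new Tika();
            String text = tika.parseToString(new URL("http://example.com/"));
            System.out.println(text);
            // Inline scripts aren't executed here; pages that build their content
            // with JavaScript would first go through Rhino plus a DOM shim
            // (see the sketch above) before extraction.
        }
    }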


Crawling a couple of pages is different from crawling the entire web on a recurring basis. That was hard enough even before JavaScript-heavy pages emerged.



