
The other elephant in the room: with rich JavaScript interfaces, crawlers are only feasible to build if you have enough resources - i.e. only companies like Google, Yahoo, and Microsoft can do it. Web pages are becoming less and less accessible to crawlers, and Google has a lot to gain from this trend.

The writing on the wall is pretty clear - Google has a monopoly in search and the only way they can be disrupted is through alternative means of finding content (like social networks).

Which is why, as much as I like them and their products, I find the integration with Google+ downright scary and dangerous for the health of our ecosystem.




That's not really true - John Resig hooked Rhino up to a good-enough DOM model in a weekend:

http://ejohn.org/blog/bringing-the-browser-to-the-server/
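For what it's worth, a minimal sketch of that approach, assuming Rhino is on the classpath and a DOM shim like env.js has been saved locally (the env.rhino.js path and example.com URL are placeholders), looks something like this:

    import org.mozilla.javascript.Context;
    import org.mozilla.javascript.Scriptable;
    import java.io.FileReader;

    public class RhinoDomSketch {
        public static void main(String[] args) throws Exception {
            Context cx = Context.enter();
            try {
                // Run interpreted; large shims can blow past JVM bytecode
                // limits if compiled.
                cx.setOptimizationLevel(-1);
                Scriptable scope = cx.initStandardObjects();
                // Load a DOM shim (env.js) so page scripts see window/document.
                // "env.rhino.js" is a placeholder path to the downloaded shim.
                cx.evaluateReader(scope, new FileReader("env.rhino.js"),
                                  "env.rhino.js", 1, null);
                // Point the emulated browser at a page; the shim is expected to
                // fetch it and run the page's scripts against its fake DOM.
                cx.evaluateString(scope,
                    "window.location = 'http://example.com/';", "<load>", 1, null);
                // Inspect the resulting DOM, e.g. the page title.
                Object title = cx.evaluateString(scope, "document.title",
                                                 "<query>", 1, null);
                System.out.println(Context.toString(title));
            } finally {
                Context.exit();
            }
        }
    }

Whether env.js is faithful enough for arbitrary real-world pages is a separate question, but it shows the per-page JavaScript problem is approachable.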

No, you're not going to match Google's crawling infrastructure or data extraction libraries. But if you just want to grab pages off the web, parse them, and handle the JavaScript on those pages correctly, you can easily rig something up out of Mechanize/html5lib/V8 or Nutch/Tika/Rhino.
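On the Tika side, the fetch-and-extract step is essentially a one-liner with the Tika facade; this is only a sketch with a placeholder URL, and JavaScript execution would still have to be wired in via Rhino as above:

    import org.apache.tika.Tika;
    import java.net.URL;

    public class TikaFetchSketch {
        public static void main(String[] args) throws Exception {
            // Let Tika detect the content type and extract plain text from a page.
            Tika tika = new Tika();
            String text = tika.parseToString(new URL("http://example.com/"));
            System.out.println(text);
            // Inline scripts aren't executed here; pages that build their content
            // with JavaScript would first go through Rhino plus a DOM shim
            // (see the sketch above) before extraction.
        }
    }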


Crawling a couple of pages is different from crawling the entire web on a recurring basis. That was hard enough even before JavaScript-heavy pages emerged.



