Osmosis: Web scraper for Node.js (github.com/rc0x03)
61 points by tombenner on April 5, 2015 | 20 comments



I'm puzzled why the author highlights "Lightweight: no dependencies" as a strength - in Node.js land that isn't necessarily one, in my opinion. I'm happy to hear people's views on this.


Well, you are right about Node.js land and what "lightweight" means there. Frankly, it looks like hell when you have to deal not only with packages but also with their semantic versions. But as the author of another Node.js crawler library (Arachnod: Web Crawler for Node.js https://www.npmjs.com/package/arachnod), I can assure you the developer has a point about his package being lightweight and dependency-free. When I started writing a crawler in Node.js I had to deal with many problems (I believe the number of problems may be smaller in other common languages).

Also, I haven't tried it at scale, for example on millions of web pages, but "memory leak free" is a really strong claim that needs to be tested first.


I'm interested to hear what you think is problematic about building a web crawler with regard to dependencies. Is it specific to the DOM parsing?


I've toyed with web scraping in Node, and the answer is definitely parsing.


The main source is 174 lines with lots of empty lines and a fair amount of what looks like debug code.

Not sure what you would want taken out?

Also, it does have dependencies (check package.json); you just took a quote out of its context.


I didn't look at what could be taken out - I was just reflecting on the "no dependencies" thing. But you might be right... maybe it's as bare-bones as it needs to be.

Also, I might be coloured by some folks avoiding dependencies like the plague ;) I usually prefer small modules coupled with other small modules, since the result usually means more tested code (when it comes to Node.js, that is).


The readme says "Lightweight: no dependencies like jQuery, cheerio, or jsdom", so I think the context is relevant - presumably for performance and/or size reasons.


He already uses libxml, so cheerio or jsdom would be the alternatives. It's true that cheerio itself has 5 direct dependencies, and I'm not sure if that is what he refers to when using the word "lightweight".

I can't speak to the performance difference between libxml and cheerio, but jsdom seems to be way slower than cheerio. So maybe that's what he means by lightweight.


If you need request delays, executing JS on the page, pagination or deep object schemas, you may also consider x-ray: https://github.com/lapwinglabs/x-ray
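A basic x-ray scrape looks roughly like this (a sketch from memory of its readme; the URL and selectors are just placeholders):

    var Xray = require('x-ray');
    var x = Xray();

    // Collect a title and link from each post, following pagination
    x('http://example.com/blog', '.post', [{
      title: 'h2 a',
      link: 'h2 a@href'
    }])
      .paginate('.next@href')
      .limit(3)(function (err, posts) {
        console.log(posts);
      });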


Can this run js live on the page?


There are other ways to run live JS on a page. It just comes down to how you load the pages: if you only make an HTTP request for the body it won't work, but using a headless browser will do the trick just fine, and without too much async headache.
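For example, with PhantomJS you load the page, let its scripts run, and then hand the rendered HTML to whatever parser you like (URL is a placeholder; run with `phantomjs fetch.js`):

    // fetch.js - a PhantomJS script, not a Node module
    var page = require('webpage').create();
    page.open('http://example.com/', function (status) {
      if (status !== 'success') { phantom.exit(1); }
      // Runs inside the loaded page, after its JS has executed
      var html = page.evaluate(function () {
        return document.documentElement.outerHTML;
      });
      console.log(html);
      phantom.exit();
    });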


Maybe in the future by implementing a basic DOM structure.

https://github.com/rc0x03/node-osmosis/issues/4


nope, you need something like PhantomJS for that.


I wrote something similar https://github.com/dijs/parsz


Scraping in Node.js is just not worth it. IMHO the asynchronicity really gets in the way of building a scraper.


I don't know about libxml, but normally you wouldn't need to do any I/O once you've gotten hold of the raw HTML, so there should be no need for callbacks. E.g. with cheerio you can parse HTML synchronously.
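Roughly like this - the only async part is fetching the HTML; everything after that is plain synchronous calls:

    var cheerio = require('cheerio');

    // cheerio.load() parses the string right there, no callbacks
    var $ = cheerio.load('<ul><li>one</li><li>two</li></ul>');
    $('li').each(function () {
      console.log($(this).text());
    });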


"Fast: uses libxml C bindings" Where exactly are these located?


He depends on the module named libxmljs, which contains the C code: https://github.com/polotek/libxmljs
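Using libxmljs directly looks something like this (just a sketch of the library, not of Osmosis's internals):

    var libxmljs = require('libxmljs');

    // Parse an HTML string and query it with XPath, all synchronously
    var doc = libxmljs.parseHtml('<div><a href="/foo">foo</a></div>');
    doc.find('//a').forEach(function (a) {
      console.log(a.attr('href').value(), a.text());
    });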


Why you would write a web scraper using asynchronous JavaScript beats me. What is the gain?


The ability to scrape/crawl pages with interactive elements without writing tons of threading code, and without limiting your crawl rate to the number of threads you can handle.
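In plain Node that's just kicking off the requests and letting the event loop juggle them - a toy illustration with placeholder URLs:

    var http = require('http');

    var urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c'];

    // All three requests are in flight at once, from a single thread
    urls.forEach(function (url) {
      http.get(url, function (res) {
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () {
          console.log(url, res.statusCode, body.length + ' bytes');
        });
      }).on('error', function (err) {
        console.error(url, err.message);
      });
    });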



