Osmosis: Web scraper for Node.js (github.com/rc0x03)
61 points by tombenner on April 5, 2015 | 20 comments



I'm puzzled why the author highlights "Lightweight: no dependencies" as a strength - in Node.js land that isn't necessarily one, in my opinion. I'm happy to hear people's views on this.


Well, you are right about Node.js land and what "lightweight" means there. Frankly, it looks like hell when you have to deal not only with packages but also with their semantic versions. But as the author of another Node.js crawler library (Arachnod: Web Crawler for Node.js https://www.npmjs.com/package/arachnod), I can assure you the developer has a point about his package being lightweight and dependency-free. When I started writing a crawler in Node.js I had to deal with many problems (I believe the number of problems may be smaller in other common languages).

Also, I haven't tried it at scale, for example on millions of web pages, but "memory leak free" is a really strong claim that needs to be tested first.


I'm interested to hear what you think is problematic about building a web crawler with regard to dependencies. Is it specific to the DOM parsing?


I've toyed with web scraping in Node, and the answer is definitely parsing.


The main source is 174 lines with lots of empty lines and a fair amount of what looks like debug code.

Not sure what you would want taken out?

Also, it does have dependencies (check package.json); you just took a quote out of its context.


I didn't look at what could be taken out - I was just reflecting on the "no dependencies" thing. But you might be right... maybe it's as bare-bones as it needs to be.

Also, I might be coloured by some folks avoiding dependencies like the plague ;) I usually prefer small modules coupled with other small modules, since the result usually means more tested code (when it comes to Node.js, that is).


The readme says "Lightweight: no dependencies like jQuery, cheerio, or jsdom", so I think the context is relevant - presumably for performance and/or size reasons.


He already uses libxml, so cheerio or jsdom would be the alternatives. It's true that cheerio itself has 5 direct dependencies, and I'm not sure if that is what he refers to when using the word "lightweight".

I can't speak to the performance difference between libxml and cheerio, but jsdom seems to be way slower than cheerio. So maybe that's what he means by lightweight.


If you need request delays, executing JS on the page, pagination or deep object schemas, you may also consider x-ray: https://github.com/lapwinglabs/x-ray
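A basic x-ray scrape looks roughly like this (a sketch from memory of its readme; the URL and selectors are just placeholders):

    var Xray = require('x-ray');
    var x = Xray();

    // Collect a title and link from each post, following pagination
    x('http://example.com/blog', '.post', [{
      title: 'h2 a',
      link: 'h2 a@href'
    }])
      .paginate('.next@href')
      .limit(3)(function (err, posts) {
        console.log(posts);
      });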


Can this run js live on the page?


There are other ways to run live JS on a page. It just comes down to how you load the pages: if you only make an HTTP request for the body it won't work, but using a headless browser will do the trick just fine, and without too much async headache.
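For example, with PhantomJS you load the page, let its scripts run, and then hand the rendered HTML to whatever parser you like (URL is a placeholder; run with `phantomjs fetch.js`):

    // fetch.js - a PhantomJS script, not a Node module
    var page = require('webpage').create();
    page.open('http://example.com/', function (status) {
      if (status !== 'success') { phantom.exit(1); }
      // Runs inside the loaded page, after its JS has executed
      var html = page.evaluate(function () {
        return document.documentElement.outerHTML;
      });
      console.log(html);
      phantom.exit();
    });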


Maybe in the future by implementing a basic DOM structure.

https://github.com/rc0x03/node-osmosis/issues/4


nope, you need something like PhantomJS for that.


I wrote something similar https://github.com/dijs/parsz


Scraping in Node.js is just not worth it. IMHO the asynchronicity really gets in the way of building a scraper.


I don't know about libxml, but normally you wouldn't need to do any I/O once you've gotten hold of the raw HTML, so there should be no need for callbacks. E.g. with cheerio you can parse HTML synchronously.
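Roughly like this - the only async part is fetching the HTML; everything after that is plain synchronous calls:

    var cheerio = require('cheerio');

    // cheerio.load() parses the string right there, no callbacks
    var $ = cheerio.load('<ul><li>one</li><li>two</li></ul>');
    $('li').each(function () {
      console.log($(this).text());
    });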


"Fast: uses libxml C bindings" Where exactly are these located?


He depends on the module named libxmljs, which contains the C code: https://github.com/polotek/libxmljs
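Using libxmljs directly looks something like this (just a sketch of the library, not of Osmosis's internals):

    var libxmljs = require('libxmljs');

    // Parse an HTML string and query it with XPath, all synchronously
    var doc = libxmljs.parseHtml('<div><a href="/foo">foo</a></div>');
    doc.find('//a').forEach(function (a) {
      console.log(a.attr('href').value(), a.text());
    });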


Why you would write a web scraper using asynchronous JavaScript beats me. What is the gain?


The ability to scrape/crawl pages with interactive elements without writing tons of threading code, and without limiting your crawl rate to the number of threads you can handle.
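In plain Node that's just kicking off the requests and letting the event loop juggle them - a toy illustration with placeholder URLs:

    var http = require('http');

    var urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c'];

    // All three requests are in flight at once, from a single thread
    urls.forEach(function (url) {
      http.get(url, function (res) {
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () {
          console.log(url, res.statusCode, body.length + ' bytes');
        });
      }).on('error', function (err) {
        console.error(url, err.message);
      });
    });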



