I'm puzzled why the author highlights "Lightweight: no dependencies" as a strength - in node.js land this is not the case in my opinion. I'm happy to hear people's view on this
Well, you are right about Node.js land and what "lightweight" means there. Frankly, it looks like hell when you have to deal not only with packages but also with their semantic versions. But as the owner of another Node.js crawler library (Arachnod: Web Crawler for Node.js, https://www.npmjs.com/package/arachnod),
I can assure you the developer has a point in calling his package lightweight and dependency-free. When I started writing a crawler with Node.js, I had to deal with many problems (I believe the number of problems would be smaller in other common languages).
Also, I haven't tried it over a long run, for example crawling more than a million webpages, but "memory leak free" is a really strong claim that has to be tested first.
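For what it's worth, a rough way to sanity-check that claim on a long run is just to watch the heap while the crawl is going. The onPageCrawled hook below is made up; you'd wire it into whatever per-page callback the library actually exposes:

    // run with: node --expose-gc crawl.js
    let pagesDone = 0;

    function onPageCrawled() {          // hypothetical per-page hook
      pagesDone++;
      if (pagesDone % 1000 === 0) {
        if (global.gc) global.gc();     // force GC so heap numbers are comparable
        const mb = process.memoryUsage().heapUsed / 1024 / 1024;
        console.log(pagesDone + ' pages, heap ~' + mb.toFixed(1) + ' MB');
      }
    }

If the heap keeps climbing after GC over hundreds of thousands of pages, something is being retained.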
I didn't look at what could be taken out - I was just reflecting on the "no dependencies" thing. But you might be right... maybe it's as bare-bones as it needs to be.
Also, I might be coloured by some folks who avoid dependencies like the plague ;) I usually prefer small modules composed with other small modules, since the result usually means more thoroughly tested code (when it comes to node.js, that is).
He already uses libxml, so cheerio or jsdom would be the alternatives. It's true that cheerio itself has 5 direct dependencies, and I'm not sure if that's what he's referring to when he uses the word "lightweight".
I can't speak to the performance difference between libxml and cheerio, but jsdom seems to be way slower than cheerio. So maybe that's what he means by lightweight.
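I haven't measured it properly either, but the gap is easy to eyeball by timing just the parse step. This assumes the current jsdom constructor API (older versions used jsdom.env), and page.html is whatever document you have lying around:

    const fs = require('fs');
    const cheerio = require('cheerio');
    const { JSDOM } = require('jsdom');

    const html = fs.readFileSync('page.html', 'utf8');

    console.time('cheerio');
    const $ = cheerio.load(html);   // lightweight htmlparser2-based tree
    console.timeEnd('cheerio');

    console.time('jsdom');
    const dom = new JSDOM(html);    // full DOM implementation, much more work
    console.timeEnd('jsdom');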
If you need request delays, JS execution on the page, pagination, or deep object schemas, you may also consider x-ray: https://github.com/lapwinglabs/x-ray
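Roughly what that looks like, going from memory of the README (the URL and selectors are made up, so treat it as a sketch rather than copy-paste):

    const Xray = require('x-ray');
    const x = Xray();

    x('https://example.com/posts', '.post', [{
      title: 'h2',
      link: 'a@href'
    }])
      .paginate('.next-page@href')   // follow the "next" link
      .limit(3)                      // stop after three pages
      .write('results.json');

If I remember right, request delays and throttling are configured on the instance (something like x.delay(500, 2000)) rather than per query.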
There are other ways to run live JS on a page; it just has to do with how you load the pages. If it's just an HTTP request to get the body, it won't work, but using a headless browser will do the trick just fine, and without too much async headache.
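With Puppeteer, for example, the flow is: load the page in headless Chrome, let its JS settle, then hand the rendered HTML to whatever parser you already use (the URL is a placeholder):

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      // wait until the page's own JS has finished making requests
      await page.goto('https://example.com', { waitUntil: 'networkidle0' });
      const html = await page.content();   // the fully rendered markup
      await browser.close();
      // ...feed `html` into cheerio/libxml as usual
    })();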
I don't know about libxml, but normally you'd not need to do any I/O once you've gotten hold of the raw HTML - so there should be no need for callbacks. E.g. with cheerio you can parse HTML synchronously.
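i.e. something like this, with no callbacks or promises once you have the HTML string:

    const cheerio = require('cheerio');

    const html = '<ul><li>a</li><li>b</li></ul>';
    const $ = cheerio.load(html);                      // synchronous parse
    const items = $('li').map((i, el) => $(el).text()).get();
    console.log(items);                                // [ 'a', 'b' ]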
What you gain is the ability to scrape/crawl pages with interactive elements without writing tons of threading code, and without limiting your crawl rate to the number of threads you can handle.
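A small sketch of what that buys you: fetching a batch of pages with bounded concurrency is a few lines of async code, no thread pool anywhere (assumes Node 18+ for the built-in fetch; the URLs are placeholders):

    const urls = ['https://example.com/1', 'https://example.com/2' /* ... */];
    const CONCURRENCY = 10;

    async function worker(queue) {
      while (queue.length) {
        const url = queue.shift();
        const res = await fetch(url);
        const html = await res.text();
        console.log(url, html.length);   // hand the HTML to your parser here
      }
    }

    (async () => {
      const queue = urls.slice();
      // ten "workers" pulling from one shared queue: bounded concurrency, zero threads
      await Promise.all(Array.from({ length: CONCURRENCY }, () => worker(queue)));
    })();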