Reminds me of an article I wrote a while back on the "meat algorithm" I had developed and used as part of LeadNuke to create more printable versions of pages. What had surprised me at the time was how simple it was to get just the main content of a page for like 95% of the pages I tried.
A successful extraction was one where you could read the page with the title and body intact, and where my application could call `content.text` on the result and get the plain text of the page without the header, footer, navigation, etc.
The complexity of the readability plugin seems to come from the fact that it does a lot more than just make something readable. For example, the point of my algorithm was to strip all style information from a page and show only the content, leaving it to be styled by the global stylesheet. Notice that my script removes not only linked stylesheets and style tags, but also the style attributes of all elements. The readability plugin actually does things like counting reference links and styling them a certain way [1].
It has 53 lines dedicated to getting and normalizing the article title, when the vast majority of the time it's just the first h1 or the HTML title tag (which could be a one-liner, and which is also outside the scope of the "meat algorithm", since that only tries to get the body) [2].
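For reference, the style-stripping part really was about this simple (a sketch of the general shape, assuming jQuery is loaded on the page — not the actual script from the article):

    // Drop linked stylesheets and <style> blocks entirely.
    $('link[rel="stylesheet"], style').remove();
    // Drop inline style attributes from every element, so the content
    // falls back to whatever the global stylesheet says.
    $('[style]').removeAttr('style');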
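To be concrete, the kind of one-liner I mean is something like this (a sketch, not code from either script):

    // First <h1> if the page has one, otherwise fall back to the <title> tag.
    var title = ($('h1').first().text() || document.title).trim();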
It has 22 lines dedicated to injecting a custom readability footer into the page [3].
It has 74 lines dedicated to converting all inline links to footnotes [4].
It has 55 lines dedicated to injecting Typekit fonts [5]. And on and on.
It also does things like figuring out when to float an image in the article and when to make it full-width [6] (as opposed to just leaving the image inline with no styling, as my script does).
And it even dedicates 333 lines of code to finding pagination links to build content from multi-page articles [7], which my script simply doesn't do, since it only cares about the content of the current page.
It also computes a content-weight score for parts of the page, I'm guessing as a relative heuristic for which parts are most likely to be the main content [8]. This is actually the path I had started to go down, before I realized that my much simpler version solved 95% of the use cases I had, and that, for my purposes, I didn't really care if it failed 5% of the time.
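For what it's worth, the general shape of that kind of scoring looks something like this (my guess at the idea, not readability's actual code):

    // Score each candidate block by how text-heavy it is relative to its markup.
    function contentWeight($el) {
      var textLength = $el.text().trim().length;
      var linkLength = $el.find('a').text().trim().length;
      var paragraphs = $el.find('p').length;
      // More text and more paragraphs score higher; link-heavy regions score lower.
      return textLength - linkLength * 2 + paragraphs * 25;
    }

    // Pick the highest-scoring block as the likely main content.
    var best = null, bestScore = -Infinity;
    $('div, article, section').each(function () {
      var score = contentWeight($(this));
      if (score > bestScore) { bestScore = score; best = this; }
    });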
I think the discrepancy in complexity can be explained really easily:
a) The readability plugin does a lot of things not directly related to simply grabbing the content of the page.
b) There's a lot of complexity involved in trying to get it to work for that last 5% (those are the really weirdly-structured sites, for which you'd need to develop some sort of heuristic and/or learning algorithm).
In other words, no, I didn't exaggerate the success rate. The readability plugin is just very functionally different. The algorithm in my article is also not the complete script; it was meant to be a base starting point to build from (and you would certainly need to build on it if you need a greater-than-95% success rate, which most applications would).
EDIT: I should also point out that my article was meant to document my surprise at how easy it was to get a 95% solution. Most of the complexity I've seen in other scripts is in trying to figure out all the HTML nodes on a page which could possibly be related to the main content, so that you can scan for them and reassemble only those nodes. The breakthrough for me was that if you can find one paragraph tag, you can just go up a level or two to the containing node and blindly grab all the nodes, whatever they may be, within that containing node. The main pages this doesn't work for are pages that don't use paragraph tags in their article body (e.g. plain text with break tags all over the place, which were surprisingly few and far between in my sample set).
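In code, that trick looks roughly like this (a simplification, not the full script from the article):

    // Find one paragraph, walk up a level or two to the node that wraps the
    // article body, and blindly keep everything inside it.
    var firstParagraph = $('p').first();
    if (firstParagraph.length) {
      var container = firstParagraph.parent().parent();
      $('body').html(container.html());
    }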
As an aid to extracting the right body content, have you considered comparing multiple pages from the same site, and giving greater weight to content that differs (the article) rather than content that stays the same (the navigation)?
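Something like this, I imagine (a rough sketch of the idea, with made-up helper names):

    // Text blocks that appear verbatim on two pages from the same site are
    // probably boilerplate (nav, footer); blocks unique to one page are
    // probably that page's article.
    function textBlocks(html) {
      return $('<div>').html(html).find('p, li, div').map(function () {
        return $(this).text().trim();
      }).get().filter(function (t) { return t.length > 0; });
    }

    function likelyArticleBlocks(htmlA, htmlB) {
      var other = textBlocks(htmlB);
      return textBlocks(htmlA).filter(function (t) {
        return other.indexOf(t) === -1;
      });
    }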
I'm sure Apache Tika is much more capable. For example, it supports HTML, CSV, PPT, etc., instead of just HTML. But it also requires Java/Maven, and the installation process is far from simple.
Unfluff is a small, simple .js library that can be installed and used in seconds. It doesn't have any external dependencies on data files or other language runtimes. So it just depends on which tool is right for your job. If you are writing a quick script, this might be a lot easier to use.
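For example, a quick script with it looks roughly like this (a simplified sketch; see the README for the exact fields):

    var extractor = require('unfluff');
    var fs = require('fs');

    // 'article.html' is just a placeholder for some saved page.
    var html = fs.readFileSync('article.html', 'utf8');
    var data = extractor(html);

    console.log(data.title); // page title
    console.log(data.text);  // extracted article text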
Can I use it to monitor source code changes of a Google Code project, e.g. (https://code.google.com/p/dcef3/source/list), or a GitHub project? If not, can anybody recommend a good tool for this kind of task? Thanks.
Great! Thanks for the info, and sorry for the dumb question without doing any research myself first - that idea occurred to me but I've never had the time to look into it :P
I love it! Very simple and straightforward interface. Makes it very easy to incorporate into a bigger workflow. You've followed the "do one thing well" command line tool philosophy to a T.
Most basic scraping libraries require you to input a bunch of regexes or CSS selectors to manually specify what you want to extract from a page. They require custom coding for each page you want to scrape. This library is totally automatic - you just pass in an HTML page and it returns the most 'texty' text on the page, with no custom coding.
There are of course other libraries like this (boilerpipe, Goose, etc.), but they tend to be written in Java or Python. The very few existing Node solutions didn't fit my needs, so I hacked this together. So for people looking for a quick and simple Node solution, this might be useful.
http://www.alfajango.com/blog/create-a-printable-format-for-...