Reminds me of an article I wrote a while back on the "meat algorithm" I had developed and used as part of LeadNuke to create more printable versions of pages. What had surprised me at the time was how simple it was to get just the main content of a page for like 95% of the pages I tried.
A successful extraction was one where you could read the page with the title and body intact, and where my application could call `content.text` on the result and get the plain text of the page without the header, footer, navigation, etc.
The complexity of the readability plugin seems to come from the fact that it does a lot more than just make something readable. For example, the point of my algorithm was to strip all style information from a page and show only the content, leaving it to be styled by the global stylesheet. Notice that my script removes not only linked stylesheets and style tags, but also the style attributes of all elements. The readability plugin actually does things like counting reference links and styling them a certain way [1].
It has 53 lines dedicated to getting and normalizing the article title, when the vast majority of the time it's just the first h1 or the HTML title tag (which could be a one-liner, and which is also outside the scope of the "meat algorithm", since that only tries to get the body) [2].
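For reference, the style-stripping part really was about this simple (a sketch of the general shape, assuming jQuery is loaded on the page — not the actual script from the article):

    // Drop linked stylesheets and <style> blocks entirely.
    $('link[rel="stylesheet"], style').remove();
    // Drop inline style attributes from every element, so the content
    // falls back to whatever the global stylesheet says.
    $('[style]').removeAttr('style');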
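To be concrete, the kind of one-liner I mean is something like this (a sketch, not code from either script):

    // First <h1> if the page has one, otherwise fall back to the <title> tag.
    var title = ($('h1').first().text() || document.title).trim();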
It has 22 lines dedicated to injecting a custom readability footer into the page [3].
It has 74 lines dedicated to converting all inline links to footnotes [4].
It has 55 lines dedicated to injecting Typekit fonts [5]. And on and on.
It also does things like figuring out when to float an image in the article and when to make it full-width [6] (as opposed to just leaving the image inline with no styling, as my script does).
And it even dedicates 333 lines of code to finding pagination links to build content from multi-page articles [7], which my script simply doesn't do, since it only cares about the content of the current page.
It also computes a content-weight score for parts of the page, I'm guessing as a relative heuristic for which parts are most likely to be the main content [8]. This is actually the path I had started to go down, before I realized that my much simpler version solved 95% of the use cases I had, and that, for my purposes, I didn't really care if it failed 5% of the time.
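For what it's worth, the general shape of that kind of scoring looks something like this (my guess at the idea, not readability's actual code):

    // Score each candidate block by how text-heavy it is relative to its markup.
    function contentWeight($el) {
      var textLength = $el.text().trim().length;
      var linkLength = $el.find('a').text().trim().length;
      var paragraphs = $el.find('p').length;
      // More text and more paragraphs score higher; link-heavy regions score lower.
      return textLength - linkLength * 2 + paragraphs * 25;
    }

    // Pick the highest-scoring block as the likely main content.
    var best = null, bestScore = -Infinity;
    $('div, article, section').each(function () {
      var score = contentWeight($(this));
      if (score > bestScore) { bestScore = score; best = this; }
    });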
I think the discrepancy in complexity can be explained really easily:
a) The readability plugin does a lot of things not directly related to simply grabbing the content of the page.
b) There's a lot of complexity involved in trying to get it to work for that last 5% (those are the really weirdly-structured sites, for which you'd need to develop some sort of heuristic and/or learning algorithm).
In other words, no, I didn't exaggerate the success rate. The readability plugin is just very functionally different. The algorithm in my article is also not the complete script; it was meant to be a base starting point to build from (and you would certainly need to build on it if you need a greater-than-95% success rate, which most applications would).
EDIT: I should also point out that my article was meant to document my surprise at how easy it was to get a 95% solution. Most of the complexity I've seen in other scripts is in trying to figure out all the HTML nodes on a page which could possibly be related to the main content, so that you can scan for them and reassemble only those nodes. The breakthrough for me was that if you can find one paragraph tag, you can just go up a level or two to the containing node and blindly grab all the nodes, whatever they may be, within that containing node. The main pages this doesn't work for are pages that don't use paragraph tags in their article body (e.g. plain text with break tags all over the place, which were surprisingly few and far between in my sample set).
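In code, that trick looks roughly like this (a simplification, not the full script from the article):

    // Find one paragraph, walk up a level or two to the node that wraps the
    // article body, and blindly keep everything inside it.
    var firstParagraph = $('p').first();
    if (firstParagraph.length) {
      var container = firstParagraph.parent().parent();
      $('body').html(container.html());
    }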
As an aid to extracting the right body content, have you considered comparing multiple pages from the same site, and giving greater weight to content that differs (the article) rather than content that stays the same (the navigation)?
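Something like this, I imagine (a rough sketch of the idea, with made-up helper names):

    // Text blocks that appear verbatim on two pages from the same site are
    // probably boilerplate (nav, footer); blocks unique to one page are
    // probably that page's article.
    function textBlocks(html) {
      return $('<div>').html(html).find('p, li, div').map(function () {
        return $(this).text().trim();
      }).get().filter(function (t) { return t.length > 0; });
    }

    function likelyArticleBlocks(htmlA, htmlB) {
      var other = textBlocks(htmlB);
      return textBlocks(htmlA).filter(function (t) {
        return other.indexOf(t) === -1;
      });
    }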
I'm sure Apache Tika is much more capable. For example, it supports HTML, CSV, PPT, etc., instead of just HTML. But it also requires Java/Maven, and the installation process is far from simple.
Unfluff is a small, simple .js library that can be installed and used in seconds. It doesn't have any external dependencies on data files or other language runtimes. So it just depends on which tool is right for your job. If you are writing a quick script, this might be a lot easier to use.
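For example, a quick script with it looks roughly like this (a simplified sketch; see the README for the exact fields):

    var extractor = require('unfluff');
    var fs = require('fs');

    // 'article.html' is just a placeholder for some saved page.
    var html = fs.readFileSync('article.html', 'utf8');
    var data = extractor(html);

    console.log(data.title); // page title
    console.log(data.text);  // extracted article text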
Can I use it to monitor source code changes of a Google Code project, e.g. (https://code.google.com/p/dcef3/source/list), or a GitHub project? If not, can anybody recommend a good tool for this kind of task? Thanks.
Great! Thanks for the info, and sorry for the dumb question without doing any research myself first - that idea occurred to me but I've never had the time to look into it :P
I love it! Very simple and straightforward interface. Makes it very easy to incorporate into a bigger workflow. You've followed the "do one thing well" command line tool philosophy to a T.
Most basic scraping libraries require you to input a bunch of regexes or CSS selectors to manually specify what you want to extract from a page. They require custom coding for each page you want to scrape. This library is totally automatic - you just pass in an HTML page and it returns the most 'texty' text on the page, with no custom coding.
There are of course other libraries like this (boilerpipe, Goose, etc.), but they tend to be written in Java or Python. The very few existing Node solutions didn't fit my needs, so I hacked this together. So for people looking for a quick and simple Node solution, this might be useful.
http://www.alfajango.com/blog/create-a-printable-format-for-...