Do you want to talk about the summarisation algorithm at all? I wrote a little b...

nl · on Oct 18, 2010

(I wrote the classifier4j summariser, as outlined here: http://news.ycombinator.com/item?id=1803020)

In your version you said you weren't happy with the HTML extractor. It's pretty hard to generalize that part, but one technique I found useful was having a flag that told the program to ignore all text until it found the first <p> tag.

In my testing, that removed ~90% of navigation text. (I note you are only looking in <p> tags. I had a flag for that too, but found it was unnecessary most of the time.)
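
For anyone who wants to try the "ignore everything before the first <p>" idea, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not the classifier4j code; the class name, the `extract_text` helper, and the `skip_until_first_p` flag are all made up for illustration, and it assumes the page's body copy really does start at the first <p>.

```python
from html.parser import HTMLParser


class AfterFirstParagraphExtractor(HTMLParser):
    """Collect page text, optionally ignoring everything before the first <p>."""

    def __init__(self, skip_until_first_p=True):
        super().__init__()
        # Hypothetical flag: when True, text is dropped until a <p> is seen.
        self.collecting = not skip_until_first_p
        self.ignored_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.ignored_depth += 1
        elif tag == "p":
            self.collecting = True  # the first <p> switches collection on for good

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.ignored_depth > 0:
            self.ignored_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.collecting and self.ignored_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html, skip_until_first_p=True):
    """Return the collected text chunks joined by single spaces."""
    parser = AfterFirstParagraphExtractor(skip_until_first_p)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Note that this only drops text that appears before the first <p>; footers and sidebars that come later in the markup still get through, so it is a cheap first pass rather than a full extractor.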

Also, I found regular expressions weren't terrible for sentence boundary detection. OTOH, there was nothing like NLTK for Java when I wrote it anyway.
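
As a rough illustration of the regex approach (a sketch in Python, not the pattern I actually used in Java), something like the following splits after `.`, `!` or `?` when the next token starts with a capital letter, digit, or quote. The pattern and the `split_sentences` name are assumptions for the example.

```python
import re

# Split on whitespace that follows ., ! or ? and precedes something that
# looks like the start of a sentence (capital letter, digit, or quote).
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"\'])')


def split_sentences(text):
    """Return a list of sentences found by the boundary heuristic above."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]
```

It copes with lowercase continuations such as "3 p.m. when it left", but it will wrongly split after abbreviations that are followed by a capital ("Dr. Smith"), which is the kind of case a trained tokenizer handles better.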

jeremymcanally · on Oct 18, 2010

Well, unfortunately, I'm not doing much on my end at this point. I do a few small things and then let libots do most of the work. One of my last iteration items was to put some of my own summarization work into it and let libots be less of a player, but obviously, with only 48 hours, I had to prioritize.

I invested most of my cleverness in actually getting the content out of the page since that's really where the money is for an MVP for this; no content == no summary. :)

syllogism · on Oct 18, 2010

Yeah, that problem is a real pain. As I mentioned in my post, it's the bit I'm not happy with. I wonder how the readability tool does it; that seems to do a very good job.

It seems that OTS uses a word frequency strategy, so the algorithm is similar or identical to the one I demoed. Interesting.
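
For readers who haven't seen the word frequency idea, here is a minimal sketch in Python. It is not the OTS source or the exact code I demoed; the stopword list, the `summarize` name, and its `max_sentences` parameter are assumptions for illustration. It counts non-stopword frequencies over the whole text, scores each sentence by the summed frequency of its words, and keeps the top sentences in their original order.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was with".split()
)
WORD = re.compile(r"[a-z']+")


def content_words(text):
    """Lowercase the text and return its words minus stopwords."""
    return [w for w in WORD.findall(text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        # Sum of document-wide frequencies of the sentence's content words.
        return sum(freq[w] for w in content_words(sentence))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in chosen)
```

Summing raw counts favours long sentences, so dividing by sentence length or restricting the score to the top few keywords are the usual next tweaks.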

riffer · on Oct 18, 2010

Their JS is out there if you grab it from the bookmarklet. As in, it is not minified.

I have gone through it carefully, and it is clever.

OTS is definitely word-frequency based.

jeremymcanally · on Oct 18, 2010

I'm using an algorithm very similar to what they do, with a few clever additions of my own. I started out with something almost identical, but they had a few twists that made it even better, which I then in turn improved on (and HTML5-ified :)).

dotBen · on Oct 18, 2010

Why don't you open source your algorithm so more folks can work on it with you? I've been futzing with Readability JS converted to PHP (but could port to Ruby or Python), and it would be great to collab and share test files, etc.

jeremymcanally · on Oct 18, 2010

Sure, I might consider that at some point! Yet another OSS project for me to maintain though... :P

syllogism · on Oct 19, 2010

I'd be interested in working on this project; it's a problem I've come across quite a bit. There's even an academic contest for it, called CLEANEVAL, although the way they set up the problem was arguably not quite right.

dotBen · on Oct 19, 2010

Let me know if you want some help; this is an area I'm interested in.