Do you want to talk about the summarisation algorithm at all? I wrote a little blog post about a trivial extractive summarisation system a while ago ( http://honnibal.wordpress.com/ ), and there's a long literature on summarisation in NLP. A lot of the techniques are too computationally costly and complicated to be practical, though. Meanwhile, abstractive summarisation still hasn't properly gotten off the ground.


(I wrote the classifier4j summariser, as outlined here: http://news.ycombinator.com/item?id=1803020)

In your version you said you weren't happy with the HTML extractor. It's pretty hard to generalize that part, but one technique I found useful was having a flag that told the program to ignore all text until it found the first <p> tag.

In my testing, that removed ~90% of navigation text (although I note you are only looking in <p> tags. I had a flag for that too, but found it was unnecessary most of the time).
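
A minimal sketch of that flag in Python, using only the standard library. The names `AfterFirstParagraph`, `extract_text` and `only_paragraphs` are made up for illustration and are not taken from classifier4j; this is one way the idea could look, not the original implementation:

```python
from html.parser import HTMLParser


class AfterFirstParagraph(HTMLParser):
    """Collect page text, ignoring everything before the first <p> tag."""

    def __init__(self, only_paragraphs=False):
        super().__init__()
        # Hypothetical second flag: keep only text that sits inside <p>...</p>.
        self.only_paragraphs = only_paragraphs
        self.seen_p = False   # becomes True at the first <p>
        self.in_p = False     # True while inside a <p> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.seen_p = True
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if not self.seen_p:
            return  # still in the navigation/header area
        if self.only_paragraphs and not self.in_p:
            return
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html, only_paragraphs=False):
    parser = AfterFirstParagraph(only_paragraphs)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With `only_paragraphs=False` any text after the first `<p>` is kept, including footers; a real extractor would also want to skip `<script>` and `<style>` contents, which this sketch does not do.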

Also, I found regular expressions weren't terrible for sentence boundary detection. OTOH, there was nothing like NLTK for Java when I wrote it anyway.
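
For illustration, a rough regex splitter along those lines. The pattern and the name `split_sentences` are my own assumptions, not the expression classifier4j used:

```python
import re

# Split on whitespace that follows ., ! or ? and precedes an uppercase
# letter or digit. Lowercase continuations such as "e.g. this" are left alone.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(text):
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]
```

It will still split wrongly after abbreviations followed by a capital ("Dr. Smith"), which is the usual weakness of this approach compared with a trained tokenizer.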


Well, unfortunately I'm not doing much on my end at this point. I do a few small things and then let libots do most of the work. One of my last iteration items was to put some of my own summarization work into it and let libots be less of a player, but obviously with only 48 hours, I had to prioritize.

I invested most of my cleverness in actually getting the content out of the page since that's really where the money is for an MVP for this; no content == no summary. :)


Yeah, that problem is a real pain. As I mentioned in my post, it's the bit I'm not happy with. I wonder how the Readability tool does it; that seems to do a very good job.

It seems that OTS uses a word frequency strategy, so the algorithm is similar or identical to the one I demoed. Interesting.
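
As a rough idea of what a word-frequency summariser looks like in general, here is a small Python sketch. It is not OTS's code or the code from my post; the stopword list, the scoring rule and the name `summarise` are all assumptions made for the example:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "with", "as", "are", "was", "be"}

_WORD = re.compile(r"[a-z']+")


def summarise(text, max_sentences=2):
    # Naive sentence split on whitespace after terminal punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Count non-stopword frequencies over the whole document.
    freq = Counter(w for w in _WORD.findall(text.lower()) if w not in STOPWORDS)

    def score(sentence):
        # Sum the document-wide frequency of each distinct word in the sentence.
        return sum(freq[w] for w in set(_WORD.findall(sentence.lower())))

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)
```

Summing raw frequencies favours long sentences; normalising by sentence length is a common tweak.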


Their JS is out there if you grab it from the Bookmarklet. As in, it is not minified.

I have gone through it carefully, and it is clever.

OTS is definitely word-frequency based.


I'm using an algorithm very similar to what they do with a few clever additions of my own. I started out with something almost identical, but they had a few twists that made it even better, which I then in turn improved on (and HTML5-ified :)).


Why don't you open source your algorithm so more folks can work on it with you? I've been futzing with Readability JS converted to PHP (but could port to Ruby or Python), and it would be great to collaborate and share test files, etc.


Sure, I might consider that at some point! Yet another OSS project for me to maintain though... :P


I'd be interested in working on this project --- it's a problem I've come across quite a bit. There's even an academic contest for it, called CLEANEVAL, although the way they set up the problem was arguably not quite right.


Let me know if you want some help; this is an area I'm interested in.



