Do you want to talk about the summarisation algorithm at all? I wrote a little blog post about a trivial extractive summarisation system a while ago ( http://honnibal.wordpress.com/ ), and there's a long literature on summarisation in NLP. A lot of the techniques are too computationally costly and complicated to be practical, though. Meanwhile, abstractive summarisation still hasn't properly gotten off the ground.
In your version you said you weren't happy with the HTML extractor. It's pretty hard to generalize that part, but one technique I found useful was having a flag that told the program to ignore all text until it found the first <p> tag.
In my testing, that removed ~90% of navigation text. (I note you are only looking in <p> tags, though. I had a flag for that too, but found it was unnecessary most of the time.)
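For anyone who wants to try the "skip until the first <p>" idea, here's a rough sketch in Python using the standard library's html.parser. It is not the code I actually wrote; the class name, the `extract_text` helper and the `only_in_p` flag are made up for illustration, and it assumes reasonably well-formed markup where <p> tags are closed.

```python
from html.parser import HTMLParser


class AfterFirstParagraph(HTMLParser):
    """Collect page text, ignoring everything before the first <p> tag."""

    def __init__(self, only_in_p=False):
        super().__init__()
        self.only_in_p = only_in_p  # hypothetical flag: keep only text inside <p>
        self.seen_p = False         # becomes True at the first <p>
        self.in_p = False           # True while inside a <p> element
        self.skip_depth = 0         # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "p":
            self.seen_p = True
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Drop script/style bodies and anything before the first <p>.
        if self.skip_depth or not self.seen_p:
            return
        if self.only_in_p and not self.in_p:
            return
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html, only_in_p=False):
    parser = AfterFirstParagraph(only_in_p=only_in_p)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `extract_text(page)` drops the leading navigation but keeps everything after the first paragraph, while `extract_text(page, only_in_p=True)` also drops text that sits between paragraphs, such as footers.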
Also, I found regular expressions weren't terrible for sentence boundary detection. OTOH, there was nothing like NLTK for Java when I wrote it anyway.
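To give a flavour of the regex approach, here's a minimal sketch in Python. The pattern is my own guess at a reasonable rule, not the one from my Java code: split on whitespace that follows `.`, `!` or `?` and precedes a capital letter or an opening quote. The name `split_sentences` is just for illustration.

```python
import re

# Boundary: whitespace preceded by sentence-final punctuation and
# followed by an uppercase letter or an opening quote.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')


def split_sentences(text):
    """Split text into sentences with a single regular expression."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]
```

It leaves things like "3.5" and "e.g. the" alone, but it will wrongly split after abbreviations followed by a capital ("Dr. Smith"), which is the sort of case a trained tokenizer handles better.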
Well, unfortunately I'm not doing much on my end at this point. I do a few small things and then let libots do most of the work. One of my last iteration items was to put some of my own summarization work into it and let libots be less of a player, but obviously, with only 48 hours, I had to prioritize.
I invested most of my cleverness in actually getting the content out of the page since that's really where the money is for an MVP for this; no content == no summary. :)
Yeah, that problem is a real pain. As I mentioned in my post, it's the bit I'm not happy with. I wonder how the readability tool does it; that seems to do a very good job.
It seems that OTS uses a word frequency strategy, so the algorithm is similar or identical to the one I demoed. Interesting.
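For readers who haven't seen a word-frequency summariser, here's roughly what the idea looks like in Python. This is a generic sketch, not OTS's actual code and not exactly what I posted; the stopword list, the scoring by average word frequency and the `summarise` name are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was with".split()
)


def summarise(text, n=2):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Lowercased content words for each sentence.
    tokenised = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    # Document-wide frequency of each content word.
    freq = Counter(w for words in tokenised for w in words)

    def score(words):
        # Average frequency, so long sentences don't win on length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(tokenised[i]), reverse=True)
    keep = sorted(ranked[:n])  # restore document order
    return [sentences[i] for i in keep]
```

Sentences built from the words the document uses most float to the top, which is about all there is to the basic version.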
I'm using an algorithm very similar to what they do with a few clever additions of my own. I started out with something almost identical, but they had a few twists that made it even better, which I then in turn improved on (and HTML5-ified :)).
Why don't you open source your algorithm so more folks can work on it with you? I've been futzing with Readability JS converted to PHP (but could port to Ruby, Python) and it would be great to collab and share test files, etc.
I'd be interested in working on this project; it's a problem I've come across quite a bit. There's even an academic contest for it, called CLEANEVAL, although the way they set up the problem was arguably not quite right.