
I extract it using a pretty clever algorithm, then run it through a few things to summarize it. The extraction is nowhere near perfect; so far it performs best on the major news sites. I didn't really have time to polish it as much as I'd like, but it seems to work well, especially on FOXNews and NYTimes (and Blogspot articles).


Do you mind summarizing a bit how you generate snippets?

We were discussing this on MetaOptimize recently (http://metaoptimize.com/qa/questions/2815/how-are-search-eng...), but I'm curious to hear about alternate approaches.


I built a summariser for classifier4j (http://classifier4j.cvs.sourceforge.net/viewvc/classifier4j/...).

It's 5 years old now, but the summaries it generates are competitive quality-wise with most things out there (e.g., the MS Word summarizer). Unfortunately I don't have an online demo working atm (like I said - 5 years old).

My algorithm is something I made up, and from memory it works like this:

1) Remove HTML, stem, remove stopwords etc

2) Sort unique words by popularity in the text

3) Split the original text on sentence boundaries.

4) Include each sentence that first mentions the next most popular word, until the summary is the maximum length requested.
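
For anyone who wants to play with the idea, here is a rough Python sketch of the four steps above. It is not the classifier4j code: the function name `summarise`, the tiny stopword list, the regex-based HTML stripping and sentence splitting, measuring the maximum length in sentences, and returning the chosen sentences in their original order are all assumptions made for illustration, and stemming is left out entirely.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list is much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
             "that", "on", "for", "with", "as", "was", "are"}


def summarise(text, max_sentences=3):
    # Step 1: crude HTML removal and tokenising. Stemming is omitted here.
    plain = re.sub(r"<[^>]+>", " ", text)
    words = [w for w in re.findall(r"[a-z0-9]+", plain.lower())
             if len(w) > 1 and w not in STOPWORDS]

    # Step 2: unique words, most frequent first.
    ranked = [word for word, _ in Counter(words).most_common()]

    # Step 3: naive split after sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", plain)
                 if s.strip()]

    # Step 4: walk the words by popularity and keep the first sentence
    # that mentions each one, until the summary is long enough.
    chosen = []
    for word in ranked:
        if len(chosen) >= max_sentences:
            break
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for index, sentence in enumerate(sentences):
            if pattern.search(sentence):
                if index not in chosen:
                    chosen.append(index)
                break

    # Assumption: emit the chosen sentences in document order.
    return " ".join(sentences[i] for i in sorted(chosen))


if __name__ == "__main__":
    sample = "Intro text here. Python is great. Python is fast. Python rocks."
    print(summarise(sample, max_sentences=1))
```

A real version would want a proper stemmer and sentence tokenizer, since the regex split trips over abbreviations like "e.g." and "Dr.".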

Like most things, it's surprising how well a simple algorithm like that works.

There are ports for C#, and from a quick Google just now, apparently someone has done a Python port too.


The first couple of articles I tried:

http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2010/10/18/...

http://bstj.bell-labs.com/

suffered from complete content extraction failure.

The third one seemed to work pretty well, but there were lots of spurious newlines in the output which made it really hard to read.

Nice idea but needs another 48 hours of polish :-)


PS: I tried both the sfgate and Bell Systems pages through viewtext.org, and it got the content for both just fine.

Maybe you could pipe requests through their API:

http://viewtext.org/help/api


Cool, thanks for posting, will definitely keep tabs on it. Good luck with it. Will post mine when I've got a working site.



