I extract it using a pretty clever algorithm then run it through a few things to...

bravura · on Oct 18, 2010

Do you mind summarizing a bit how you generate snippets?

We were discussing this on MetaOptimize recently (http://metaoptimize.com/qa/questions/2815/how-are-search-eng...), but I'm curious to hear about alternate approaches.

nl · on Oct 18, 2010

I built a summariser for classifier4j (http://classifier4j.cvs.sourceforge.net/viewvc/classifier4j/...).

It's 5 years old now, but the summaries it generates are competitive quality-wise with most things out there (eg, the MS Word summarizer). Unfortunately I don't have an online demo working atm (like I said - 5 years old)

My algorithm is something I made up, and from memory it works like this:

1) Remove HTML, stem, remove stopwords etc

2) Sort unique words by popularity in the text

3) Split the original text on sentence boundaries.

4) Include each sentence that first mentions the next most popular word, until the summary is the maximum length requested.

Like most things, it's surprising how well a simple algorithm like that works.
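For anyone who wants to try it, here is a minimal Python sketch of the four steps above. It is not the classifier4j code, just an illustration written from the description. Assumptions: stemming is left out, the stopword list is a tiny placeholder, "maximum length" is counted in sentences, the chosen sentences are returned in their original order, and the names `summarise`, `strip_html`, `words_of` and `STOPWORDS` are made up for this example.

```python
import re
from collections import Counter

# Tiny placeholder stopword list (an assumption; a real one would be much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "in", "on",
             "for", "that", "this", "are", "as", "with"}


def strip_html(text):
    # Crude tag removal; a real HTML parser would be more robust.
    return re.sub(r"<[^>]+>", " ", text)


def words_of(text):
    # Lowercased word tokens; no stemming in this sketch.
    return re.findall(r"[a-z0-9']+", text.lower())


def summarise(text, max_sentences=2):
    plain = strip_html(text)

    # Steps 1 and 2: drop stopwords, then rank unique words by frequency.
    counts = Counter(w for w in words_of(plain) if w not in STOPWORDS)
    ranked = [word for word, _ in counts.most_common()]

    # Step 3: split the text on sentence boundaries (naive punctuation rule).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", plain) if s.strip()]
    sentence_words = [set(words_of(s)) for s in sentences]

    # Step 4: for each word, most popular first, take the first sentence
    # that mentions it, until enough sentences have been collected.
    chosen = set()
    for word in ranked:
        if len(chosen) >= max_sentences:
            break
        for i, present in enumerate(sentence_words):
            if word in present:
                chosen.add(i)
                break

    # Assumption: emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Calling `summarise(article_text, max_sentences=3)` returns up to three sentences. Because a popular word's first mention is often in a sentence that has already been picked, the loop simply moves on to the next word in that case.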

There are ports for C#, and from Googling just now, apparently someone has done a Python port too.

nervechannel · on Oct 18, 2010

The first couple of articles I tried:

http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2010/10/18/...

http://bstj.bell-labs.com/

suffered from complete content extraction failure.

The third one seemed to work pretty well, but there were lots of spurious newlines in the output which made it really hard to read.

Nice idea but needs another 48 hours of polish :-)

nervechannel · on Oct 18, 2010

PS I tried both those sfgate and Bell Systems pages through viewtext.org, and that got the content for both just fine.

Maybe you could pipe requests through their API:

http://viewtext.org/help/api

SkyMarshal · on Oct 18, 2010

Cool, thanks for posting, will definitely keep tabs on it. Good luck with it. Will post mine when I've got a working site.