>To query the index, we are going to apply the same tokenizer and filters we used for indexing
From developing a Lucene-based search engine years ago, I learned a very useful trick for things like stemming and synonyms: don't use the same tokenizer/filters that you used for indexing. It's much more flexible to instead expand the query terms, replacing each term with a list of OR'ed clauses covering all of its synonyms.
So during indexing, if the document has "fishing", then you just add that to the index.
Later, during querying, the user can search for "fish" with optional stemming enabled, and their query will get rewritten into
"fish" OR "fishing" OR "fishy"
This way, the user has control at query-time, and you can manage your synonym/stemming database without having to rebuild your index.
Of course, I had to write a custom parser for this but it was well worth it imo.
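A rough sketch of that rewrite step against a recent Lucene API (the "body" field and the expand() lookup are placeholders for whatever synonym/stemming source you maintain; the original setup used a custom parser):

    import java.util.List;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class QueryExpander {

        // Placeholder lookup: returns the term plus its synonyms/stemmed variants,
        // e.g. "fish" -> ["fish", "fishing", "fishy"], backed by a synonym/stemming
        // database that lives outside the index.
        interface VariantLookup {
            List<String> expand(String term);
        }

        // Rewrite one user term into an OR (SHOULD) clause over all of its variants.
        static Query expandTerm(String field, String term, VariantLookup lookup) {
            BooleanQuery.Builder builder = new BooleanQuery.Builder();
            for (String variant : lookup.expand(term)) {
                builder.add(new TermQuery(new Term(field, variant)), BooleanClause.Occur.SHOULD);
            }
            return builder.build();
        }
    }

With expand("fish") returning the three variants above, expandTerm("body", "fish", lookup) produces exactly the "fish" OR "fishing" OR "fishy" query, and changing the lookup's backing data never touches the index.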
how do you populate/maintain the synonym/stemming database? is this something that “standard” datasets exist for? asking because this problem will land on my plate at work sometime in the next few months :-)
There was a standard set of synonyms that everyone was using at the time (around 12 years ago) that came in the form of a Lucene index. I'm not sure if it's still around or where you'd find it.
IIRC the stemming was done by an algorithm and not from a database...but I don't remember the exact details.
When I was building this circa 2006, I used the Porter stemmer. I do not know if that is still state of the art, but it worked really well for my purpose.
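The Porter stemmer still ships with Lucene. A small sketch against a recent Lucene API (package layout differs from the 2006-era versions), showing an analyzer chain that lowercases and then Porter-stems each token:

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PorterStemDemo {

        // Chain: tokenize -> lowercase -> Porter stem.
        static Analyzer porterAnalyzer() {
            return new Analyzer() {
                @Override
                protected TokenStreamComponents createComponents(String fieldName) {
                    Tokenizer source = new StandardTokenizer();
                    TokenStream stream = new PorterStemFilter(new LowerCaseFilter(source));
                    return new TokenStreamComponents(source, stream);
                }
            };
        }

        public static void main(String[] args) throws IOException {
            try (Analyzer analyzer = porterAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "fishing fished fishes")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString()); // all three print as "fish"
                }
                ts.end();
            }
        }
    }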
Not a universal trick. Depending on how smart the expansion gets and how big your queries are, you can end up with an extremely large number of terms in your queries.
Yes, and it can make certain use cases easier. There are pros and cons to both index-time and query-time stemming, synonyms, etc. I usually use index-time stemming and query-time synonyms.
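A rough sketch of that split on a recent Lucene API (names here are illustrative, not from the parent comments): the index-side analyzer bakes in stemming, while the query-side analyzer layers synonym expansion on top of the same chain, so the synonym list can change without reindexing:

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;

    public class IndexStemQuerySynonyms {

        // Index-time chain: tokenize -> lowercase -> stem. No synonyms baked in.
        static Analyzer indexAnalyzer() {
            return new Analyzer() {
                @Override
                protected TokenStreamComponents createComponents(String fieldName) {
                    Tokenizer source = new StandardTokenizer();
                    TokenStream stream = new PorterStemFilter(new LowerCaseFilter(source));
                    return new TokenStreamComponents(source, stream);
                }
            };
        }

        // Query-time chain: same as above plus synonym expansion before stemming,
        // so expansions land on the same stemmed terms the index contains.
        static Analyzer queryAnalyzer() throws IOException {
            SynonymMap.Builder builder = new SynonymMap.Builder(true);
            // Toy entry; in practice these come from whatever synonym source you maintain.
            builder.add(new CharsRef("fish"), new CharsRef("trout"), true);
            final SynonymMap synonyms = builder.build();
            return new Analyzer() {
                @Override
                protected TokenStreamComponents createComponents(String fieldName) {
                    Tokenizer source = new StandardTokenizer();
                    TokenStream stream = new LowerCaseFilter(source);
                    stream = new SynonymGraphFilter(stream, synonyms, true);
                    stream = new PorterStemFilter(stream);
                    return new TokenStreamComponents(source, stream);
                }
            };
        }
    }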