
>To query the index, we are going to apply the same tokenizer and filters we used for indexing

From developing a Lucene-based search engine years ago, I learned a very useful trick for things like stemming and synonyms: don't use the same tokenizer/filters that you used for indexing. It's much more flexible if you instead expand the query terms, replacing a given term with a list of OR clauses with all the synonyms.

So during indexing, if the document has "fishing", then you just add that to the index.

Later, during querying, the user can search for "fish" with optional stemming enabled, and their query will get rewritten into:

"fish" OR "fishing" OR "fishy"

This way, the user has control at query-time, and you can manage your synonym/stemming database without having to rebuild your index.

Of course, I had to write a custom parser for this but it was well worth it imo.
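A minimal sketch of that kind of query-time expansion, to make the idea concrete. This is not the custom parser mentioned above: the `EXPANSIONS` table, the function names, and the choice to join multiple terms with AND are all assumptions for illustration, and the output is just a query string in a Lucene-like syntax.

```python
from typing import Dict, List

# Hypothetical expansion table. In a real system this would be fed by a
# stemmer or a synonym database and could change without reindexing.
EXPANSIONS: Dict[str, List[str]] = {
    "fish": ["fishing", "fishy"],
}


def expand_term(term: str, table: Dict[str, List[str]]) -> str:
    """Turn one term into a quoted term or a parenthesized OR group."""
    variants = [term] + [v for v in table.get(term, []) if v != term]
    quoted = [f'"{v}"' for v in variants]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + " OR ".join(quoted) + ")"


def expand_query(query: str, table: Dict[str, List[str]], enabled: bool = True) -> str:
    """Rewrite a whitespace-separated query; expansion is optional per query."""
    terms = query.lower().split()
    if not enabled:
        # The user opted out, so only the literal terms are searched.
        return " AND ".join(f'"{t}"' for t in terms)
    # Assumption: terms are combined with AND; a real parser may differ.
    return " AND ".join(expand_term(t, table) for t in terms)


print(expand_query("fish", EXPANSIONS))
print(expand_query("fish", EXPANSIONS, enabled=False))
```

Because the table is only consulted at query time, editing `EXPANSIONS` changes search behavior immediately, while the index itself keeps holding the original tokens such as "fishing".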




How do you populate/maintain the synonym/stemming database? Is this something that “standard” datasets exist for? Asking because this problem will fall on my plate at work sometime in the next few months :-)


Stemming uses algorithms or Hunspell dictionaries.[1]

[1] https://cwiki.apache.org/confluence/display/solr/LanguageAna...


There was a standard set of synonyms that everyone was using at the time (around 12 years ago) that came in the form of a Lucene index. I'm not sure if it's still around or where you'd find it.

IIRC the stemming was done by an algorithm and not from a database...but I don't remember the exact details.


When I was building this circa 2006, I used the Porter stemmer. I do not know if that is still state of the art, but it worked really well for my purpose.


These days we use the Snowball stemmer. OP also used it.


Not a universal trick. Depending on how smart you end up getting and how big your queries are you might end up with an extremely large number of terms in your queries.


Not by more than a small constant factor.


The clause count grows as the number of terms times the number of expansions per term times the number of expansion types times the number of target fields.

In practice, that is a problem in many applications.
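A rough back-of-the-envelope sketch of that multiplication. The function name and the sample numbers are made up for illustration; the 1024 figure is Lucene's default maximum clause count, used here only as a familiar reference point.

```python
# Lucene's default maximum number of clauses in a query; other engines
# and configurations will have different limits.
MAX_CLAUSES = 1024


def expanded_clause_count(num_terms: int,
                          expansions_per_term: int,
                          expansion_types: int,
                          target_fields: int) -> int:
    """Worst-case clause count if every factor multiplies through."""
    return num_terms * expansions_per_term * expansion_types * target_fields


# A short query stays small: 5 terms, 4 expansions each,
# 2 expansion types (say stemming and synonyms), 3 fields.
small = expanded_clause_count(5, 4, 2, 3)

# A longer query with richer expansion crosses the limit.
large = expanded_clause_count(10, 10, 2, 6)

print(small, small > MAX_CLAUSES)
print(large, large > MAX_CLAUSES)
```

So whether the blow-up is "a small constant factor" depends on how many of these factors are in play at once.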


Good old days with Lucene powered indices! So many memories. That query expansion trick would have saved me many hours of tweaking.


Wouldn't having a stemmed index and a non-stemmed one also allow the same usability? The user could then just select which index to use.


Yes, and it can make certain use cases easier. There are pros and cons to both index-time and query-time stemming, synonyms, etc. I usually use index-time stemming and query-time synonyms.
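A toy sketch of that split, stemming when indexing and expanding synonyms when querying. Everything here is an assumption for illustration: `toy_stem` is a crude suffix stripper standing in for a real Porter or Snowball stemmer, and the `SYNONYMS` table and sample documents are invented.

```python
from collections import defaultdict
from typing import Dict, Set


def toy_stem(word: str) -> str:
    """Crude stand-in for a real stemmer: strip a couple of suffixes."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical synonym table, consulted only at query time.
SYNONYMS = {"angling": ["fishing"]}


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Index-time stemming: only stems are stored in the inverted index."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], term: str) -> Set[int]:
    """Query-time synonyms: expand the term, then stem each variant."""
    variants = [term] + SYNONYMS.get(term, [])
    hits: Set[int] = set()
    for variant in variants:
        hits |= index.get(toy_stem(variant.lower()), set())
    return hits


docs = {1: "fishing trips", 2: "fish market", 3: "boat rentals"}
index = build_index(docs)
print(search(index, "angling"))
```

With this split, changing the synonym table needs no reindexing, but swapping the stemmer does, since the stems are baked into the index.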


That would work, but I am not sure it's worth maintaining two indices.


I like that approach, especially since the index stays truer to the source text.



