>To query the index, we are going to apply the same tokenizer and filters we used for indexing
From developing a Lucene-based search engine years ago, I learned a very useful trick for things like stemming and synonyms: don't use the same tokenizer/filters that you used for indexing. It's much more flexible to instead expand the query terms, replacing each term with a list of OR'ed clauses covering all of its synonyms.
So during indexing, if the document has "fishing", then you just add that to the index.
Later, during querying, the user can search for "fish" with optional stemming enabled, and their query will get rewritten into
"fish" OR "fishing" OR "fishy"
This way, the user has control at query-time, and you can manage your synonym/stemming database without having to rebuild your index.
Of course, I had to write a custom parser for this but it was well worth it imo.
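A rough sketch of that rewrite step against a recent Lucene API (the "body" field and the expand() lookup are placeholders for whatever synonym/stemming source you maintain; the original setup used a custom parser):

    import java.util.List;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class QueryExpander {

        // Placeholder lookup: returns the term plus its synonyms/stemmed variants,
        // e.g. "fish" -> ["fish", "fishing", "fishy"], backed by a synonym/stemming
        // database that lives outside the index.
        interface VariantLookup {
            List<String> expand(String term);
        }

        // Rewrite one user term into an OR (SHOULD) clause over all of its variants.
        static Query expandTerm(String field, String term, VariantLookup lookup) {
            BooleanQuery.Builder builder = new BooleanQuery.Builder();
            for (String variant : lookup.expand(term)) {
                builder.add(new TermQuery(new Term(field, variant)), BooleanClause.Occur.SHOULD);
            }
            return builder.build();
        }
    }

With expand("fish") returning the three variants above, expandTerm("body", "fish", lookup) produces exactly the "fish" OR "fishing" OR "fishy" query, and changing the lookup's backing data never touches the index.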
how do you populate/maintain the synonym/stemming database? is this something that “standard” datasets exist for? asking because this problem will land on my plate at work sometime in the next few months :-)
There was a standard set of synonyms that everyone was using at the time (around 12 years ago) that came in the form of a Lucene index. I'm not sure if it's still around or where you'd find it.
IIRC the stemming was done by an algorithm and not from a database...but I don't remember the exact details.
When I was building this circa 2006, I used the Porter stemmer. I do not know if that is still state of the art, but it worked really well for my purpose.
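The Porter stemmer still ships with Lucene. A small sketch against a recent Lucene API (package layout differs from the 2006-era versions), showing an analyzer chain that lowercases and then Porter-stems each token:

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PorterStemDemo {

        // Chain: tokenize -> lowercase -> Porter stem.
        static Analyzer porterAnalyzer() {
            return new Analyzer() {
                @Override
                protected TokenStreamComponents createComponents(String fieldName) {
                    Tokenizer source = new StandardTokenizer();
                    TokenStream stream = new PorterStemFilter(new LowerCaseFilter(source));
                    return new TokenStreamComponents(source, stream);
                }
            };
        }

        public static void main(String[] args) throws IOException {
            try (Analyzer analyzer = porterAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "fishing fished fishes")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString()); // all three print as "fish"
                }
                ts.end();
            }
        }
    }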
Not a universal trick. Depending on how smart the expansion gets and how big your queries are, you can end up with an extremely large number of terms in your queries.
Yes, and it can make certain use cases easier. There are pros and cons to both index-time and query-time stemming, synonyms, etc. I usually use index-time stemming and query-time synonyms.
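A rough sketch of that split on a recent Lucene API (names here are illustrative, not from the parent comments): the index-side analyzer bakes in stemming, while the query-side analyzer layers synonym expansion on top of the same chain, so the synonym list can change without reindexing:

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;

    public class IndexStemQuerySynonyms {

        // Index-time chain: tokenize -> lowercase -> stem. No synonyms baked in.
        static Analyzer indexAnalyzer() {
            return new Analyzer() {
                @Override
                protected TokenStreamComponents createComponents(String fieldName) {
                    Tokenizer source = new StandardTokenizer();
                    TokenStream stream = new PorterStemFilter(new LowerCaseFilter(source));
                    return new TokenStreamComponents(source, stream);
                }
            };
        }

        // Query-time chain: same as above plus synonym expansion before stemming,
        // so expansions land on the same stemmed terms the index contains.
        static Analyzer queryAnalyzer() throws IOException {
            SynonymMap.Builder builder = new SynonymMap.Builder(true);
            // Toy entry; in practice these come from whatever synonym source you maintain.
            builder.add(new CharsRef("fish"), new CharsRef("trout"), true);
            final SynonymMap synonyms = builder.build();
            return new Analyzer() {
                @Override
                protected TokenStreamComponents createComponents(String fieldName) {
                    Tokenizer source = new StandardTokenizer();
                    TokenStream stream = new LowerCaseFilter(source);
                    stream = new SynonymGraphFilter(stream, synonyms, true);
                    stream = new PorterStemFilter(stream);
                    return new TokenStreamComponents(source, stream);
                }
            };
        }
    }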