Hacker News new | past | comments | ask | show | jobs | submit login
What domain of knowledge does my problem fall under?
6 points by voidfiles on May 31, 2008 | hide | past | favorite | 10 comments
My business partner, and I have created a site; http://www.qwertykitchen.com . Its a recipe site. Our main form of search is entering ingredients that you have on hand and you get a list of recipes that contain those ingredients.

We are now looking at doing something more then just full text searching, and keywords. The first problem we want to attack is making words that are basically the same, show up in the search when the other is called.

For example, we have egg and eggs, or milk, 2% milk, whole milk, and skim milk. We want someone to only have to enter milk. If they do enter milk, then the entire cluster should be searched.

My idea of an approach to this problem is this. We search for longest common substring. Then create a directed graph representation. Then when a person searchs for a term, we use a proximity measure and then search for all the terms with in a certain proximity.

So, I was wondering one, does anyone know if their is a name for what I am talking about, and two, does anyone maybe have a better idea of how to do something like this.




The domain of knowledge you're interested in is "information retrieval" (IR).

Here's a textbook on the subject from some of the top minds in the field: http://www-csli.stanford.edu/~hinrich/information-retrieval-...

I am not an IR expert, but one particular algorithm that you might find useful is latent semantic analysis, a.k.a. latent semantic indexing: http://en.wikipedia.org/wiki/Latent_semantic_analysis


> We search for longest common substring.

That idea can be improved upon: "length of largest common substring" is not the measure to use. At the least, you want to use Levenshtein distance <http://en.wikipedia.org/wiki/Levenshtein_distance>, but even that can be tweaked to weigh common typing 'errors' lower (for example, colour is closer to color than to clour)

> Then when a person searchs for a term, we use a proximity measure and then search > for all the terms with in a certain proximity

This is a form of <http://en.wikipedia.org/wiki/Query_expansion>;


What you are asking about is fuzzy searching, but the fact that you are asking at all indicates that you ought to read up on searching theory. There's a large body of work out there that you ought to familiarize yourself with.


do you have specific suggestions on what to read?


Recent issues of the ACM Transactions on Information Systems will bring you the latest research; alternately, if you want to start at the beginning, Knuth volume 3.


Try reading up on latent semantic analysis. It also uses vector calculus, but works a bit different IIRC. It can be used to identify words that are used in the same context and therefore should be the same. I'm not sure if it is directly applicable for recipes because of the concise language, but you might find interesting related articles.


Eggs and egg is an example of stemming, see http://en.wikipedia.org/wiki/Stemming

The milk example as stated could be solved using fulltext indexing and searching.

I think your approach might work, but you could get results much more quickly using existing database fulltext search.


I am not convinced that just fulltext search would work. Wouldn't fulltext search fail in the case of goat. You have goats milk, goats milk cheese, goat meat. So if someone searched goat all three types would be returned?


Well, if someone claims to have a goat, they probably have all three of those, or can at least make them. ;-)


not if it's a boy goat




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: