
It's also that after bootstrapping with text search and PageRank, they can incorporate a lot more useful signals into their ranking algorithm: the clickstream on the search results page, and the time spent on a page after the click. If the majority of their user base actually wanted precise search, this clickstream would not reorder the results, but it does. So most users are happier with imprecise ranking.

That wealth of user traffic is also something no other search engine can replicate, given Google's market share in web search.
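
For illustration, a minimal sketch of how behavioral signals could be blended into a baseline ranking; the signal names, weights, and dwell-time cap are invented for the example and are not Google's actual values.

    # Hedged sketch (not Google's algorithm): blend a text/PageRank baseline
    # score with clickstream signals (CTR, dwell time) to rerank results.
    from dataclasses import dataclass

    @dataclass
    class Result:
        url: str
        base_score: float      # e.g. text relevance combined with PageRank
        ctr: float             # observed click-through rate for this query/result pair
        avg_dwell_secs: float  # average time on page after the click

    def rerank(results, w_base=0.6, w_ctr=0.3, w_dwell=0.1):
        # Reorder by a weighted blend of baseline and behavioral signals.
        # The weights and the 3-minute dwell cap are purely illustrative.
        def score(r):
            dwell_norm = min(r.avg_dwell_secs / 180.0, 1.0)
            return w_base * r.base_score + w_ctr * r.ctr + w_dwell * dwell_norm
        return sorted(results, key=score, reverse=True)

    results = [
        Result("https://example.com/a", base_score=0.9, ctr=0.05, avg_dwell_secs=20),
        Result("https://example.com/b", base_score=0.7, ctr=0.40, avg_dwell_secs=240),
    ]
    print([r.url for r in rerank(results)])  # behavioral signals lift /b above /a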




Idk, from what I've seen, PageRank is still the dominant factor, responsible for 90%+ of the outcome. I work for a few large affiliate projects. Renting subdirectories on sites with huge link counts = instant top 3 for anything, even the most competitive keywords, even when the content is totally unrelated to the rest of the site. The whole "we have more than 200 ranking factors" line seems like mostly hot air to me personally.
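
For anyone who hasn't seen it spelled out, here is a toy power-iteration version of PageRank on a made-up link graph; the damping factor 0.85 comes from the original paper, everything else is illustrative.

    # Toy power-iteration PageRank on an invented link graph.
    def pagerank(links: dict, damping=0.85, iters=50):
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            new_rank = {p: (1 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if not outlinks:  # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / n
                else:
                    for target in outlinks:
                        new_rank[target] += damping * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(links))  # "c" ends up with the highest rank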


It's not necessarily about ranking sites which actually contain the keyword, or which Google has already decided should be relevant to the search (where PageRank seems to be the dominant factor; thanks for your interesting data point!).

When Google receives a search query, it first broadens the search phrase (see [0]). The users' clickstreams and search refinements help both in training the model that does the broadening and in weighting the resulting search contexts, to narrow down what should actually be displayed on the front page.

[0] https://www.link-assistant.com/news/keyword-refinements.html
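
A toy sketch of that broadening step, assuming a hypothetical synonym table and click-derived weights (neither reflects Google's real data):

    # Expand the user's phrase with related interpretations, then weight each
    # variant by how often users' clicks favored it in the past. All values invented.
    SYNONYMS = {
        "apache": ["apache httpd", "apache helicopter"],
        "jaguar": ["jaguar car", "jaguar animal"],
    }

    CLICK_WEIGHTS = {
        "apache httpd": 0.8,
        "apache helicopter": 0.2,
    }

    def broaden(query: str):
        # Return the original phrase plus weighted variants; the interpretation
        # users most often clicked on gets the highest weight.
        variants = [(query, 1.0)]
        for term, expansions in SYNONYMS.items():
            if term in query.lower():
                for exp in expansions:
                    variants.append((exp, CLICK_WEIGHTS.get(exp, 0.1)))
        return sorted(variants, key=lambda v: v[1], reverse=True)

    print(broaden("apache config"))
    # [('apache config', 1.0), ('apache httpd', 0.8), ('apache helicopter', 0.2)]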


Ah, that's interesting and does explain a bit, thank you. Might the perceived quality decrease be based on a misclassification of the user entering the search, and therefore a problematic refinement? I'm thinking of something like Amazon's recommendation engine, which for some reason (I'd wager my terrible fashion sense) has decided I'm likely a woman and now gets most recommendations completely wrong.


> Might the perceived quality decrease be based on a misclassification of the user entering the search

Exactly! Search engine performance can be assessed by measuring precision and recall [0]. Full text search engines have really high precision. Additionally, when users have been socialized with full text search, they have built a mental model of how the search engine works ("it will find documents which contain my search phrase"), so even false positives are perceived as less severe, because they can be readily explained by the model: "Ah, this document about helicopters contains 'Apache', no wonder it's in the results. I'll add 'webserver' to narrow it down." (And experienced users will already start off with all the necessary key terms.)

While full text search engines have high precision, they also have poor recall. That can be improved, but there is a tradeoff when tuning the algorithm: to increase recall, the search context is broadened, which necessarily decreases precision, because the search engine cannot always be correct when adding context. Also, whereas before every document on the front page at least contained the search term, now there is often not even a good explanation for why some documents were retrieved. And the more precise the query itself (something we learned by using full text search), the higher the probability of misclassification, and the worse the effects of broadening. The relevant results are somewhere in the list, but now every second result on the front page is from the wrong bucket. And with no explanation, those false positives weigh heavily on us users from the old days.

[0] Precision is the probability that a random document in the result set is relevant. Recall is the probability that a random relevant document is in the result set.
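
Those definitions, applied to a toy result set (document IDs are made up):

    # Precision and recall as defined in [0], on invented document IDs.
    def precision_recall(retrieved: set, relevant: set):
        hits = retrieved & relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    retrieved = {"doc1", "doc2", "doc3", "doc4"}   # what the engine returned
    relevant  = {"doc2", "doc4", "doc7"}           # what the user actually needed
    print(precision_recall(retrieved, relevant))   # (0.5, 0.666...): doc7 was missed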



