Well, you are certainly correct about how cosine sim would apply to the text embeddings, but I disagree about how useful that application is to our understanding of the model.

> In this case, a cosine similarity of one would only occur when it repeats itself word-for-word. That is not even a "similar thought" but some sort of LLM OCD.

Observing that would be helpful in our understanding of the model!

> For anything else... cosine similarity says little. Sometimes two steps reach opposite conclusions yet have very high cosine similarity. In other cases, a step just expands on the same solution with different vocabulary, or looks at it from another angle.

Yes, that would be good to observe also! But here I think you undervalue the specificity of the OAI embeddings model, which has 3072 dimensions. That's quite a lot of information being captured.
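To make that concrete, here's a minimal sketch of the kind of comparison I mean: embed each reasoning step and look at the cosine similarity of consecutive steps. I'm assuming text-embedding-3-large as the 3072-dimension model; the step texts are placeholders.

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts):
        # One call per batch of step texts; each vector has 3072 dimensions.
        resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
        return np.array([d.embedding for d in resp.data])

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    steps = ["reasoning step 1 ...", "reasoning step 2 ...", "reasoning step 3 ..."]
    vecs = embed(steps)
    for i in range(len(vecs) - 1):
        # A similarity near 1.0 flags the "repeats itself word-for-word" case above.
        print(f"step {i} -> step {i + 1}: {cosine(vecs[i], vecs[i + 1]):.3f}")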

> A more robust approach would be to give the whole reasoning to an LLM and ask to grade according to a given criterion (e.g. "grade insight in each step, from 1 to 5").

Totally disagree here: using embeddings is much more reliable/robust. I wouldn't put much stock in LLM output; there's too much going on.




A simple example of a problem my team ran across:

The distance between "dairy creamer" and "non-dairy creamer" is too small, so an embedding for one ranks highly for the other as well, even though they mean precisely opposite things. For example, the embedding for "dairy free creamer" ends up a short distance from both concepts, so you cannot really apply a reasonable threshold.
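A rough sketch of how to reproduce the effect (the model choice here is an assumption - we hit this with our own embedding setup, but any general-purpose text embedding model should show similar clustering):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()
    phrases = ["dairy creamer", "non-dairy creamer", "dairy free creamer"]
    resp = client.embeddings.create(model="text-embedding-3-large", input=phrases)
    vecs = [np.array(d.embedding) for d in resp.data]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # All three pairs come out highly similar, so no threshold cleanly
    # separates "same preference" from "opposite preference".
    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            print(phrases[i], "vs", phrases[j], round(cosine(vecs[i], vecs[j]), 3))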


But in a larger frame of "things tightly associated with coffee", they mean something extremely close. Whether these things are opposites of each other, or virtually identical, is a function of your point of view; or, in this context, of the generally meaningful level of discourse.

At scale, I expect a very small dairy vs. non-dairy distance to be the more accurate representation of intent.


Of course, I also expect them to be very close, and that's the problem with relying purely on embeddings and distance: in this case, the two things express entirely opposite preferences on the same topic.

(I think this may be why we sometimes see AI-generated search overviews give certain types of really bad answers: the underlying embedding search is returning "semantically similar" results.)


> Totally disagree here: using embeddings is much more reliable/robust. I wouldn't put much stock in LLM output; there's too much going on.

I think either way can be the preferable option, depending on how well the embedding space represents the text - and that is mostly dependent on the specific combination of use case and model.

So if the embedding space does not capture the required nuance, it's often a viable option to take the top_n results by embedding similarity and do the rest with LLM + validation calls.
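Something like this, as a rough sketch (the models, prompt, and helper names are placeholders, not a tested recipe):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
        return np.array([d.embedding for d in resp.data])

    def top_n(query, candidates, n=10):
        # Stage 1: cheap recall via embedding similarity.
        qv = embed([query])[0]
        cv = embed(candidates)
        sims = cv @ qv / (np.linalg.norm(cv, axis=1) * np.linalg.norm(qv))
        return [candidates[i] for i in np.argsort(-sims)[:n]]

    def llm_validate(query, candidate):
        # Stage 2: a chat model checks the nuance the embedding may have
        # missed (e.g. dairy vs. non-dairy).
        prompt = (f"Query: {query}\nCandidate: {candidate}\n"
                  "Does the candidate actually satisfy the query? Answer yes or no.")
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")

    candidates = ["dairy creamer", "oat milk creamer", "coffee mug"]
    hits = [c for c in top_n("non-dairy creamer", candidates)
            if llm_validate("non-dairy creamer", c)]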

But I do agree with you: I would always rather work with embeddings than with some LLM output. It would be such a great thing to have a rock-solid embedding space where one would not even consider looking at token-predictor models.



