Well, you are certainly correct about how cosine sim would apply to the text embeddings, but I disagree about how useful that application is to our understanding of the model.

> In this case, a cosine similarity of one would only occur when it repeats itself word-for-word. That is not even a "similar thought" but some sort of LLM OCD.

Observing that would be helpful in our understanding of the model!

> For anything else... cosine similarity says little. Sometimes two steps reach opposite conclusions yet have very high cosine similarity. In other cases, a step just expands on the same solution with different vocabulary, or looks at it from another angle.

Yes, that would be good to observe also! But here I think you undervalue the specificity of the OAI embeddings model, which has 3072 dimensions. That's quite a lot of information being captured.
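To make that concrete, here's a minimal sketch of the kind of comparison I mean: embed each reasoning step and look at the cosine similarity of consecutive steps. I'm assuming text-embedding-3-large as the 3072-dimension model; the step texts are placeholders.

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts):
        # One call per batch of step texts; each vector has 3072 dimensions.
        resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
        return np.array([d.embedding for d in resp.data])

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    steps = ["reasoning step 1 ...", "reasoning step 2 ...", "reasoning step 3 ..."]
    vecs = embed(steps)
    for i in range(len(vecs) - 1):
        # A similarity near 1.0 flags the "repeats itself word-for-word" case above.
        print(f"step {i} -> step {i + 1}: {cosine(vecs[i], vecs[i + 1]):.3f}")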

> A more robust approach would be to give the whole reasoning to an LLM and ask to grade according to a given criterion (e.g. "grade insight in each step, from 1 to 5").

Totally disagree here: using embeddings is much more reliable/robust. I wouldn't put much stock in LLM output; there's too much going on.




A simple example of a problem my team ran across:

The distance between "dairy creamer" and "non-dairy creamer" is too small, so an embedding for one ranks highly for the other as well, even though they mean precisely opposite things. For example, the embedding for "dairy free creamer" ends up a short distance from both concepts, so you cannot really apply a reasonable threshold.
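A rough sketch of how to reproduce the effect (the model choice here is an assumption - we hit this with our own embedding setup, but any general-purpose text embedding model should show similar clustering):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()
    phrases = ["dairy creamer", "non-dairy creamer", "dairy free creamer"]
    resp = client.embeddings.create(model="text-embedding-3-large", input=phrases)
    vecs = [np.array(d.embedding) for d in resp.data]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # All three pairs come out highly similar, so no threshold cleanly
    # separates "same preference" from "opposite preference".
    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            print(phrases[i], "vs", phrases[j], round(cosine(vecs[i], vecs[j]), 3))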


But in a larger frame of "things tightly associated with coffee", they mean something extremely close. Whether these things are opposites of each other, or virtually identical, is a function of your point of view; or, in this context, of the generally meaningful level of discourse.

At scale, I expect a very small dairy vs. non-dairy distance to be the more accurate representation of intent.


Of course, I also expect them to be very close, and that's the problem with relying purely on embeddings and distance: in this case, the two things express entirely opposite preferences on the same topic.

(I think this may be why we sometimes see AI-generated search overviews give certain types of really bad answers: the underlying embedding search is returning "semantically similar" results.)


> Totally disagree here: using embeddings is much more reliable/robust. I wouldn't put much stock in LLM output; there's too much going on.

I think either way can be the preferable option, depending on how well the embedding space represents the text - and that is mostly dependent on the specific combination of use case and model.

So if the embedding space does not capture the required nuance, it's often a viable option to take the top_n results by embedding similarity and do the rest with LLM + validation calls.
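Something like this, as a rough sketch (the models, prompt, and helper names are placeholders, not a tested recipe):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
        return np.array([d.embedding for d in resp.data])

    def top_n(query, candidates, n=10):
        # Stage 1: cheap recall via embedding similarity.
        qv = embed([query])[0]
        cv = embed(candidates)
        sims = cv @ qv / (np.linalg.norm(cv, axis=1) * np.linalg.norm(qv))
        return [candidates[i] for i in np.argsort(-sims)[:n]]

    def llm_validate(query, candidate):
        # Stage 2: a chat model checks the nuance the embedding may have
        # missed (e.g. dairy vs. non-dairy).
        prompt = (f"Query: {query}\nCandidate: {candidate}\n"
                  "Does the candidate actually satisfy the query? Answer yes or no.")
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")

    candidates = ["dairy creamer", "oat milk creamer", "coffee mug"]
    hits = [c for c in top_n("non-dairy creamer", candidates)
            if llm_validate("non-dairy creamer", c)]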

But I do agree with you: I would always rather work with embeddings than with some LLM output. It would be such a great thing to have a rock-solid embedding space where one would not even consider looking at token-predictor models.



