This comment is similar to the comment I wanted to make because I also thought it was pretty nifty.
Joking aside this is pretty cool. One thing about the whole embeddings/cosine similarity thing for people who are struggling with understanding it.
Computers are good at doing lots of sums. Embeddings turn a problem that seems to be about something else[1] into a problem involving lots of sums by turning that something else into numbers.
So when we turn some text into embeddings (numbers), what do those numbers mean? You could imagine a space with a lot of dimensions - the author is using OpenAI embeddings, so it's on the order of a thousand dimensions - and every point in that space is some embedding, which is a numerical representation of the meaning of some text.[2] Things with similar meaning have embeddings that are close to one another in this space. How do you decide what "close" is?
Well one easy way is cosine similarity. Since these are vectors, imagine two arrows coming from the origin. To make things simpler, imagine it in the two-dimensional plane rather than 1000 dimensions, which would make your brain leak out of your ears. So you have two arrows going from the origin to two points. One way to think about closeness is the length of the line from the tip of one arrow to the tip of the other. For people who struggle to remember their trig, that length is given by the law of cosines: c^2 = a^2 + b^2 - 2ab cos theta. It just so happens that if you take the dot product of two vectors and divide by the product of their norms you get the cosine of the angle between them (cos theta). Cosine similarity uses that cos theta directly as the measure of closeness: 1 means the arrows point the same way, 0 means they're at right angles. That's why it's called cosine similarity even though you don't see an explicit cosine in the formula.[3] (There's a quick code sketch of this after the footnotes.)
[1] usually language in the case of LLMs, but embeddings aren't only about text.
[2] this is why searching embeddings is called semantic search.
[3] The term cosine distance is often used loosely for this, although I believe it's technically not a true distance (metric) because it doesn't satisfy the triangle inequality.
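Here's a minimal sketch of that formula in Python with numpy. The function name and the toy 3-dimensional vectors are just made up for illustration; real OpenAI embeddings would have on the order of a thousand dimensions:

    import numpy as np

    def cosine_similarity(a, b):
        # dot product divided by the product of the norms gives cos(theta)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # toy 3-dimensional "embeddings", purely illustrative
    cat = np.array([0.9, 0.1, 0.2])
    kitten = np.array([0.85, 0.15, 0.25])
    invoice = np.array([0.1, 0.8, 0.6])

    print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
    print(cosine_similarity(cat, invoice))  # smaller: less similar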
Even that simple explanation makes my brain itch a little :D I never did master trig
I'm curious if there ARE alternative methods to cosine similarity. A lot of the things I've read mention that cosine similarity is "one of the ways to compute distance..." or "a simple way...". But I've not seen any real suggestions for alternatives. Guess everyone's thinking "if it ain't broke, don't fix it" as cosine similarity works pretty darn well
Yeah there are a few other ways. The most common is Euclidean distance (the “L2 norm” of the difference), which would be the hypotenuse of a right triangle: if your points are (x1,y1), (x2,y2) then it is sqrt((x1-x2)^2 + (y1-y2)^2), which you might recognise from Pythagoras’ theorem (c^2 = a^2 + b^2). If you have 1000 dimensions then instead of just the x and y terms you are summing a thousand squared differences, but the principle is the same.
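A quick Python sketch of that (toy 2-d points, not real embeddings):

    import numpy as np

    def euclidean_distance(a, b):
        # square root of the sum of squared differences, one term per dimension
        return np.sqrt(np.sum((a - b) ** 2))  # equivalent to np.linalg.norm(a - b)

    a = np.array([1.0, 2.0])
    b = np.array([4.0, 6.0])
    print(euclidean_distance(a, b))  # sqrt(3^2 + 4^2) = 5.0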
Another one is “Manhattan distance” (known as the L1 norm, or sometimes “taxicab distance”), which is just abs(x1-x2) + abs(y1-y2) in that example. If you imagine a set of city blocks and you want to go from one place to another, the cab has to go north/south and east/west and can’t go diagonally; that’s the distance it travels. You’re adding up all the north/south parts and the east/west parts.
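And the same toy points with Manhattan distance:

    import numpy as np

    def manhattan_distance(a, b):
        # sum of absolute differences per dimension ("taxicab" blocks)
        return np.sum(np.abs(a - b))

    a = np.array([1.0, 2.0])
    b = np.array([4.0, 6.0])
    print(manhattan_distance(a, b))  # |1-4| + |2-6| = 7.0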
There are a bunch of other distance measures too, e.g. on one project I worked on we used Mahalanobis distance, a more complex measure that adjusts for covariance between the dimensions of your space. That wouldn’t be useful for this particular problem though.
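For completeness, a rough sketch of Mahalanobis distance. The data here is just randomly generated to have two correlated dimensions; in practice you’d estimate the covariance from your own dataset:

    import numpy as np

    # toy data with two correlated dimensions
    rng = np.random.default_rng(0)
    data = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)

    # inverse of the covariance matrix estimated from the data
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

    def mahalanobis_distance(a, b, cov_inv):
        # like Euclidean distance, but rescaled by the inverse covariance
        d = a - b
        return np.sqrt(d @ cov_inv @ d)

    print(mahalanobis_distance(data[0], data[1], cov_inv))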