I think the use of the term "cosine" here is needlessly confusing. It is the dot product of normalized vectors. Sure, when you do the maths, it gives out a cosine, but since we are not doing geometry here, it isn't really helpful for a beginner to know that. Especially considering that these vectors have many dimensions, and anything above 3D is super confusing when you think about it geometrically.
Instead just try to think about what it is: the sum of term-by-term products of normalized vectors. A product is the soft version of a logic AND, and it makes intuitive sense that vectors A and B are similar if there are a lot of traits that are present in both A AND B (represented by the sum) relative to the total number of traits that A and B have (that's the normalization process).
Forget about angles and geometry unless you are comfortable with N-dimensional space with N>>3. Most people aren't.
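If it helps, here is that "sum of term-by-term products, then normalize" view spelled out as a minimal Python sketch (the toy trait vectors are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Sum of term-by-term products of the two vectors...
    dot = sum(x * y for x, y in zip(a, b))
    # ...divided by the product of their lengths (the normalization step).
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two "trait" vectors: similarity is driven by traits present in both A AND B.
a = [1, 1, 0, 1]
b = [1, 1, 1, 0]
print(cosine_similarity(a, b))  # 2 / (sqrt(3) * sqrt(3)) ≈ 0.667
```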
we absolutely are doing geometry here, given we're talking about metrics in a vector space – and this is trigonometry you learned by the first year of high school.
Where I live, where many people live, we enter high school aged 11. We haven’t been introduced at school to geometry yet.
I suspect you’re using American terminology. When talking about school years it’s often useful to talk about the year or grade of school, like “9th grade” or “year 9” as it’s more universal.
I’m not and, unfortunately, those aren’t universal either, even within a country. The normal terminology where I grew up would be S1, which follows P7.
I would expect most people to know about trigonometric functions by age 12, yes. (I entered high school at 11 and the first topic tackled in maths classes was elementary trigonometry.)
You might like to think of vectors in their geometric interpretation but vectors are not inherently geometric - vectors are just lists of numbers, which we sometimes interpret geometrically because it helps us comprehend them. High dimensional vectors grow increasingly ungeometric as we have to wrestle with increasingly implausible numbers of orthogonal spatial dimensions in order to render them ‘geometric’.
In the end, vectors (long lists of numbers a1, a2, a3, … an) start looking more like discrete functions f(i) = ai. And you can extend the same concept all the way to continuous functions - they’re like infinite dimensional vectors. For continuous functions over a finite interval the dot product (usually called the inner product in this domain) is just the integral of the product of two functions, and the ‘magnitude’ of a function is its RMS, and that means functions have a ‘cosine similarity’ which is not remotely geometric. There isn’t any geometric sense in which there is an ‘angle between’ cos(x) and sin(x) except it turns out that they have a cosine similarity of 0 so it implies the ‘angle between’ them is 90°, which actually makes a lot of sense. But in this same sense there’s an ‘angle between’ any two functions (over an interval).
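To make the function case concrete, here is a rough numerical sketch (assumes numpy; sampling the functions on a grid turns them back into ordinary long vectors):

```python
import numpy as np

# Sample cos and sin on a fine grid: a function over an interval becomes
# a long vector of samples, and the integral of a product becomes a dot product.
x = np.linspace(0, 2 * np.pi, 10_000, endpoint=False)
f, g = np.cos(x), np.sin(x)

cos_sim = f @ g / (np.linalg.norm(f) * np.linalg.norm(g))
print(cos_sim)  # ≈ 0: the "angle between" cos and sin on [0, 2π] is 90°
```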
> You might like to think of vectors in their geometric interpretation but vectors are not inherently geometric - vectors are just lists of numbers
No. They can be expressed as lists of numbers in a basis if the vector space is equipped with a scalar product but the vector itself is an object that transcends the specific numbers it is expressed in.
What you're saying here is totally wrong and I recommend you check out the Wikipedia page on vector spaces. The geometrical object "a vector" is more fundamental than the list of numbers.
Tuples of numbers are a special case of a vector space, which even comes with a canonical basis and inner product for free. And since the article is about word embeddings, which map words to tuples of numbers, there’s no need to mention other vector spaces in this context.
You think this comment could have been written by someone who doesn’t understand what a vector space is?
Vectors are not purely geometric objects. Geometry is a lens through which we can interpret vectors. So is linear algebra. The objects behave the same and both perspectives give us insights about them.
Insisting vectors are only geometric is like saying complex numbers are geometric because they can be thought of as points on the complex plane.
> increasingly implausible numbers of orthogonal spatial dimensions in order to render them ‘geometric’.
Implausible how? “geometric” doesn’t mean “embeds nicely in 3D space”.
What’s wrong with talking about the angle between two L^2 functions defined on an interval?
Geometric reasoning still works? If you take a span of two functions, you have a plane. What’s the issue?
In this case can people just prepend "hyper-" as in hyperplane etc? Hyper-line, hyper-angle. (Speaking as someone who has heard 'hyperplane' a few times but not others)
> that means functions have a ‘cosine similarity’ which is not remotely geometric.
It obeys the normal rules you learned in geometry. For example, pick three functions a,b,c. The functions form a triangle. The triangle obeys the triangle inequality—the distances satisfy d(a,b) ≤ d(a,c) + d(c,b). The angles of the triangle sum to 180°.
Interesting, I think it’s actually far more intuitive to think of it geometrically. I’m not sure what my brain is doing in order for this mental projection to help, but this is exactly how I made dot products “click” for me. I started to think of them in multidimensional space, almost physically (though in a very limited sense since my brain came from a monkey and generally fires on a couple cylinders).
I expect it’s like how learning to play by ear is more intuitive than sheet music. That’s great if you’re an amateur. If you’re dealing with tensors or somesuch trying to design a fusion reactor that’s probably a crutch.
This is a very odd statement and shows the different ways the human brain works. As a musician, I find playing (or thinking about music in terms of) sheet music so much more intuitive than playing by ear. It feels like the very reason people notate and write music down is that anything written down is easier to think about and play than anything learned purely by listening.
I can intuit a lot of things about music and even visualize some of it, but eventually I hit limitations. What I learn through these intuitions still applies as my ability to mentally visualize or model the music begins to fail, though.
It's similar with vectors. Once you have the orchestral equivalent of vectors, there's no way I'm visualizing it and doing mental geometry. However, what I learned and the modelling I developed from the "casio keyboard playing jingles" equivalent of vectors is still useful and applicable.
I guess this is the point where playing by ear or mentally modelling things fails, and notation is far more helpful. Yet if a lot of us approach these complex works from the notation angle first, we might feel pretty lost and uncertain about what we're doing with it and why.
I can tell I'm not articulating this well, but I like the musical analogy and wanted to get that out.
One cannot intuitively think about more than 3 dimensions. Even in 3D, most people's intuition is often wrong; it's only really accurate for 1D and 2D.
Richard Hamming devotes a whole lecture section to making everyone realize precisely this [1]. It was an eye-opener for me.
Ehh… you can intuitively think about it. Intuition is something you develop with time as you gain familiarity with a subject. You just can’t bring all of your intuitions about 3D space into higher-dimensional spaces.
I took many classes in school where we worked with higher dimensional spaces. You wouldn’t send a physics major a lecture on physics, say it was “eye opening”, and expect them to feel the same way about it. It is stuff they have already seen before. Maybe their eyes are already open.
> Forget about angles and geometry unless you are comfortable with N-dimensional space with N>>3. Most people aren't.
The whole point of measuring similarity this way is that any two vectors exist in a two-dimensional space, which is where you measure the angle between them. Why would you need to be comfortable with high-dimensional spaces?
No, I'm talking about the fact that the space spanned by two vectors is sufficient to contain those vectors. All of the analysis you could ever theoretically want to do on them can be done within that space. If you only have two vectors, you never need to consider a space with higher dimensionality than 2. Each vector is a dimension of the space, and that's it.
(A) Look at this space. Every point within it can be reached by combining these two vectors.
(B) Look at this space. No point outside it can be reached by combining these two vectors.
Saying that two vectors span a space is claim (A). Saying that the space they span contains them is... much weaker than claim (B), but it's related to claim (B) and not to claim (A).
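To make the "everything you need lives in the plane they span" point concrete, here is a quick check (a sketch assuming numpy; Gram-Schmidt gives coordinates in the plane spanned by the two vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=50), rng.normal(size=50)   # two vectors in 50 dimensions

# Angle measured directly in the 50-dimensional ambient space.
cos_ambient = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Gram-Schmidt: an orthonormal basis (e1, e2) for the plane spanned by a and b.
e1 = a / np.linalg.norm(a)
r = b - (b @ e1) * e1
e2 = r / np.linalg.norm(r)

# Express both vectors as ordinary 2-D coordinates in that plane and measure there.
a2 = np.array([a @ e1, a @ e2])
b2 = np.array([b @ e1, b @ e2])
cos_plane = a2 @ b2 / (np.linalg.norm(a2) * np.linalg.norm(b2))

print(np.isclose(cos_ambient, cos_plane))  # True: the angle lives in the 2-D span
```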
For one reason: if you're just thinking about it as fancy 2D, you will miss a lot of phenomena that occur in higher-dimensional spaces. For example, almost all vectors are almost completely orthogonal, which isn't true at all in low-dimensional spaces.
Phrased like that it sounds like a qualitative difference between "low" and "high" dimensional spaces. But isn't it simply a consequence of the fact that the more dimensions you have, the less likely that randomly distributed, sparse non-zeros will end up in the same positions?
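For what it's worth, here is a small simulation of that near-orthogonality claim (a sketch assuming numpy, using dense Gaussian vectors rather than sparse ones):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, pairs=2_000):
    # Draw random pairs of dense Gaussian vectors and average |cosine similarity|.
    a = rng.normal(size=(pairs, dim))
    b = rng.normal(size=(pairs, dim))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.abs(cos).mean()

for dim in (2, 3, 30, 300, 3000):
    print(dim, round(mean_abs_cosine(dim), 3))
# The mean |cosine| shrinks roughly like 1/sqrt(dim): random high-dimensional
# vectors are nearly orthogonal, unlike in 2-D or 3-D.
```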
I bet it's whether your primary background is programming or mathematics. From the latter, the cosine is very natural (scalar projection etc.) and it's lots of steps to get to your thing. I'd say this was intuitive for us post high-school because of that pedagogical background.
Hmmm... I heard at a conference that most well-understood engineering principles or theories have a neat geometric interpretation. Personally I find a theory with a geometric interpretation far easier to grasp. On the other hand, higher-dimensional geometry confuses me a lot: most random sparse vectors are orthogonal to each other, and most of the volume of a high-dimensional sphere is concentrated near its surface.
It's quite interesting that we end up using cosine similarity. Most networks are trained with a softmax layer at the end (e.g. next word prediction). Given the close relation between softmax and logistic regression, it might make more sense to use σ(u·v) as the similarity function.
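For concreteness, a one-line version of that suggestion (numpy sketch; σ here is the logistic sigmoid, and the example vectors are made up):

```python
import numpy as np

def sigmoid_similarity(u, v):
    # σ(u·v): squashes the raw (un-normalized) dot product into (0, 1).
    return 1.0 / (1.0 + np.exp(-(u @ v)))

u = np.array([0.3, -0.7, 1.2])
v = np.array([0.1,  0.4, 0.9])
print(sigmoid_similarity(u, v))  # ≈ 0.70; unlike cosine, magnitude still matters here
```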
Imagine 2 points in 3-dimensional space, with a vector being the line from the origin to each point. So you have 2 vectors going from the origin to the 2 points.
If those points are really close together, then the angle between the two vector lines is very small. Loosely speaking, cosine is a way to quantify how close two lines with a shared origin are. If both lines are the same, the angle between them is 0, and the cosine of 0 is 1. If two lines are 90 degrees apart, their cosine is 0. If two lines are 180 degrees apart, their cosine is -1.
So cosine is a way to quantify the closeness of two lines which share the same origin.
To go back to the 2 points in space that we started with, we can measure how close those 2 points are by taking the cosine of the lines going from the origin to the two points. If they are close, the angle between them is small. If they are the exact same point, the angle between the lines is 0. Each of those lines from the origin is what we call a vector.
Cosine similarity measures how close two vectors are in Euclidean space. That's why we end up using it a lot. It's not the only way to measure closeness; there are many others.
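To make the three cases concrete, a tiny numpy sketch:

```python
import numpy as np

def cos_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
print(cos_sim(u, np.array([2.0, 0.0])))   #  1.0: same direction (0° apart)
print(cos_sim(u, np.array([0.0, 3.0])))   #  0.0: perpendicular (90° apart)
print(cos_sim(u, np.array([-1.0, 0.0])))  # -1.0: opposite direction (180° apart)
```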
I vaguely remember some paper where they didn’t even bother normalizing the vectors, because they expected zeros to be very close to zero, and anything else was considered a one.
I have no idea if this a common optimization or if it was something very niche. It was for a heuristic matrix reordering strategy, so I think they were willing to accept some mistakes.
In cosine similarity, yes. IIRC there is a recent paper arguing that this causes cosine similarity to perform poorly in high-dimensional (aka ML) vector spaces.
I think he is talking about the 2024 arxiv paper by some Netflix (?) researchers who say it's best not to normalize the vectors (so instead of cosine similarity you just have a dot product).
For most commercial embeddings (openai etc) this is not a problem as the embeddings are already normalized
They don't have to be. The computation of cosine distance normalizes the vectors anyway.
In the intuitive explanation I gave, the distance from the origin doesn't matter at all, unless you are trying to force "cosine distance" to mean "metric distance".
There are many, many ways to quantify closeness. Metric distance, i.e. taking a ruler and measuring the distance between the 2 points, is one way.
Measuring the angles between the 2 lines that join the points to the origin is another way to measure closeness
The square of the measured metric distance is another way.
The absolute value of the ruler distance is another way
The question you asked is only relevant if you’re trying to force cosine distance into ruler distance. You don’t need to. Cosine distance is a sufficient way by itself to measure closeness.
But yeah, generally speaking, cosine distance correlates better with metric distance when the points are roughly equidistant from the origin.
Good explanation. Can you also explain how a sentence ends up as a point next to another point when the sentences have similar meaning? What does it mean for two sentences to be similar?
A good way to understand why cosine similarity is so common in NLP is to think in terms of a keyword search. A bag-of-words vector represents a document as a sparse vector of its word counts; counting the number of occurrences of some set of query words is the dot product of the query vector with the document vector; normalizing for length gives you cosine similarity. If you have word embedding vectors instead of discrete words, you can think of the same game, just now the “count” of a word with another word is the similarity of the word embeddings instead of a 0/1. Finally, LLMs give sentence embeddings as weighted sums of contextual word vectors, so it’s all just fuzzy word counting again.
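A toy sketch of that bag-of-words view (numpy; the vocabulary and counts are made up):

```python
import numpy as np

vocab = ["cosine", "vector", "angle", "pizza"]

# Bag-of-words counts over the vocabulary (hypothetical toy documents).
doc   = np.array([3, 2, 1, 0], dtype=float)   # a short note about cosine similarity
query = np.array([1, 1, 0, 0], dtype=float)   # query: "cosine vector"

raw_score = query @ doc                        # keyword-match score: 3*1 + 2*1 = 5
cos_sim = raw_score / (np.linalg.norm(query) * np.linalg.norm(doc))
print(raw_score, cos_sim)                      # length-normalized version ≈ 0.945
```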
One thing I've wondered for a while: Is there a principled reason (e.g. explainable in terms of embedding training) why a vector's magnitude can be ignored within a pretrained embedding, such that cosine similarity is a good measure of semantic distance? Or is it just a computationally-inexpensive trick that works well in practice?
For example, if I have a set of words and I want to consider their relative location on an axis between two anchor words (e.g. "good" and "evil"), it makes sense to me to project all the words onto the vector from "good" to "evil." Would comparing each word's "good" and "evil" cosine similarity be equivalent, or even preferable? (I know there are questions about the interpretability of this kind of geometry.)
Some embedding models are explicitly trained on cosine similarity. Otherwise, if you have a 512D vector, discarding magnitude is like discarding just a single dimension (i.e. you get 511 independent dimensions).
This is not quite right; you are actually losing information about each of the dimensions and your mental model of reducing the dimensionality by one is misleading.
Consider [1,0] and [x,x]
Normalised (assuming x > 0) we get [1,0] and [sqrt(.5),sqrt(.5)]: clearly something has changed, because the second vector's first component is now sqrt(.5) no matter what x was, even though x could have been larger or smaller than 1. As such we have lost information about x's magnitude, which we cannot recover from the normalized vector alone.
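In code (a small sketch assuming numpy):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

for x in (0.5, 1.0, 7.0):
    print(x, normalize(np.array([x, x])))
# Every [x, x] with x > 0 normalizes to [0.7071, 0.7071]: the magnitude x is gone
# and cannot be recovered from the normalized vector alone.
```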
Well, depends. For some models (especially two tower style models that use a dot product), you're definitely right and it makes a huge difference. In my very limited experience with LLM embeddings, it doesn't seem to make a difference.
Magnitude is not a dimension, it’s information about each value that is lost when you normalize it. To prove this normalize any vector and then try to de-normalize it again.
Magnitude is a dimension. Any 2-dimensional vector can be explicitly transformed into the polar (r, theta) coordinate system where one of the dimensions is magnitude. Any 3-dimensional vector can be transformed into the spherical (r, theta, phi) coordinate where one of the dimensions is magnitude. This is high school mathematics. (Okay I concede that maybe the spherical coordinate system isn't exactly high school material, then just think about longitude, latitude, and distance from the center.)
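A concrete version of the 2-D case (plain Python, standard library only):

```python
import math

def to_polar(x, y):
    # Cartesian (x, y) -> polar (r, theta): magnitude is one of the two coordinates.
    return math.hypot(x, y), math.atan2(y, x)

def to_cartesian(r, theta):
    return r * math.cos(theta), r * math.sin(theta)

r, theta = to_polar(3.0, 4.0)
print(r, theta)                 # 5.0, 0.927...
print(to_cartesian(r, theta))   # ≈ (3.0, 4.0): fully recovered, nothing lost
```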
There's something wrong with the picture here, but I can't put my finger on it because my mathematical background is too old. The space of all normalized k-dimensional vectors isn't a vector space itself. It's well-behaved in many ways, but you lose the 0 vector (may not be relevant). Addition isn't closed anymore, and if you try to keep it inside by normalizing after addition, distributivity becomes weird. I have no idea what this transformation means for word2vec and friends.
But the intuitive notion is that if you take all 3D and flatten it / expand it to be just the surface of the 3D sphere, then paste yourself onto it Flatland style, it's not the same as if you were to Flatland yourself into the 2D plane. The obvious thing is that triangles won't sum to 180, but also parallel lines will intersect, and all sorts of differing strange things will happen.
I mean, it might still work in practice, but it's obviously different from some method of dimensionality reduction because you're changing the curvature of the space.
The space of all normalized k-dimensional vectors is just the unit sphere in k-dimensional space (a (k-1)-sphere). You can deal with it directly, or you can use the standard stereographic projection to map every point (except for one) onto a plane.
> triangles won't sum to 180
Exactly. Spherical triangles have interior angles that sum to more than 180 degrees.
> parallel lines will intersect
Yes because parallel "lines" are really great circles on the sphere.
So is it actually the case that normalizing down and then mapping to the k-1 plane yields a useful (for this purpose) k-1 space? Something feels wrong about the whole thing but I must just have broken intuition.
So I first learned about cosine similarity in the context of traditional information retrieval, and the simplified models used in that field before the development of LLMs, TensorFlow, and large-scale machine learning might prove instructive.
Imagine you have a simple bag-of-words model of a document, where you just count the number of occurrences of each word in the document. Numerically, this is represented as a vector where each dimension is one token (so, you might have one number for the word "number", another for "cosine", another for "the", and so on), and the magnitude of that component is the count of the number of times it occurs. Intuitively, cosine similarity is a measure of how frequently the same word appears in both documents. Words that appear in both documents get multiplied together, but words that are only in one get multiplied by zero and drop out of the cosine sum. So because "cosine", "number", and "vector" appear frequently in my post, it will appear similar to other documents about math. Because "words" and "documents" appear frequently, it will appear similar to other documents about metalanguage or information retrieval.
And intuitively, the reason the magnitude doesn't matter is that those counts will be much higher in longer documents, but the length of the document doesn't say much about what the document is about. The reason you take the cosine (which has a denominator of magnitude-squared) is a form of length normalization, so that you can get sensible results without biasing toward shorter or longer documents.
Most machine-learned embeddings are similar. The components of the vector are features that your ML model has determined are important. If the product of the same dimension of two items is large, it indicates that they are similar in that dimension. If it's zero, it indicates that that feature is not particularly representative of the item. Embeddings are often normalized, and for normalized vectors the fact that magnitude drops out doesn't really matter. But it doesn't hurt either: the magnitude will be one, so magnitude^2 is also 1 and you just take the pair-wise product of the vectors.
> the reason the magnitude doesn't matter is that those counts will be much higher in longer documents ...
To make my intuition a bit more explicit: the vector is encoding a ratio, isn't it? You want to treat 3:2, 6:4, 12:8, ... as equivalent in this case; normalization does exactly that.
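Concretely (numpy sketch with made-up word counts):

```python
import numpy as np

def cos_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

short_doc = np.array([3.0, 2.0])    # word counts in a short document
long_doc  = np.array([12.0, 8.0])   # same proportions, four times longer
print(cos_sim(short_doc, long_doc)) # ≈ 1.0: identical ratios; length is normalized away
```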
Dunno if I have the full answer, but it seems in high dimensional spaces, you can typically throw away a lot of information and still preserve distance.
The Johnson-Lindenstrauss (J-L) lemma is at least somewhat related, even though, to my understanding, it doesn't quite describe the same transformation.
When I dabbled with latent semantic indexing[1], using cosine similarity made sense as the dimensions of the input vectors were words, for example a 1 if a word was present or 0 if not. So one would expect vectors that point in a similar direction to be related.
I haven't studied LLM embedding layers in depth, so yeah, I've been wondering about using certain norms [2] instead to determine if two embeddings are similar. Does it depend on the embedding layer, for example?
Should be noted it's been many years since I learned linear algebra, so getting somewhat rusty.
This doesn't properly explain what it says it explains. To explain it correctly, you have to explain why the dot product of two vectors computed as the sum of the products of the coefficients of an orthonormal basis is a scalar equal to the product of the Euclidean magnitudes of the vectors and the cosine of the angle between them. The Wikipedia article on dot product explains this reasonably well, so just read that.
Maybe I missed this, but I was surprised they didn't mention the connection to correlation. Cosine similarity can be thought of as a correlation, and some bivariate distributions (normal, I think?) can be re-expressed in terms of cosine similarity.
There's also some generalizations to higher dimensional notions of cosines that are kind of interesting.
In essence then, not as confusing to the beginner who might even know what a dot product 'is' operationally but not what it 'does'.
So level 1 'it's just a normalized dot product', level 2 more immediately intuitive: 'is arrow 1 pointing in the same direction as arrow 2?' or 'how close is arrow 1's direction to arrow 2's direction?'
Now what's left after that is 'Why is it so? Why did we decide on this in embeddings?'
Cosine similarity is the epitome of status quo bias. How many DS or ML people actually think through which similarity metrics might be appropriate before choosing cosine? Gods forbid they have to justify using a different measure to their colleagues.
I agree with you, I like Chebyshev distance too, but as far as I can tell, the dot product is not that much more complex than Chebyshev; I'd classify them as about the same. But if there is a good argument that Chebyshev is much simpler, and it's about as good, then I'd prefer Chebyshev too.
Having said this, nowadays the metric I like the most is SAD (sum of absolute differences), which is Sum(abs(x2i - x1i)), i.e. the L1 norm of the difference. I find this oddly easy/simple to reason about and implement, so I use it in any model where it works.
Cosine similarity considers only the angle between vectors, not their magnitude. This is problematic when the magnitude carries important information. For example, if vectors represent term frequencies in documents, cosine similarity treats two documents with vastly different lengths but the same proportion of words as identical.
Sensitive to High-dimensional Sparsity:
In high-dimensional spaces (e.g., text data), vectors are often sparse (many zeros). Cosine similarity might not provide meaningful results if most dimensions are zero since the similarity could be dominated by a few non-zero entries.
No Sense of Absolute Position:
Cosine similarity measures the angle between vectors but ignores their absolute position. For example, if vectors represent geographical coordinates, cosine similarity won't capture differences in distances properly.
Poor Performance with Highly Noisy Data:
If the data has significant noise, cosine similarity can be unreliable. The angle between noisy vectors might not reflect true similarity, especially in high-dimensional spaces.
Does Not Handle Negative Values Well:
If vectors contain negative values (e.g., sentiment scores, certain word embeddings), cosine similarity may yield unintuitive results since negative values can affect the angle differently compared to positive-only data.
Assumes Non-Negative Values:
Often, cosine similarity assumes non-negative values. In contexts where vectors have both positive and negative values (e.g., sentiment analysis with positive and negative sentiment words), this assumption can lead to misleading results.
Not Ideal for Measuring Dissimilarity:
Cosine similarity can be unintuitive when measuring dissimilarity. Two vectors that are orthogonal (90 degrees apart) will have a similarity score of 0, but vectors pointing in opposite directions (-1 cosine similarity) might need a different interpretation depending on the context.
Inappropriate Use Cases
Data with Magnitude Importance:
When the magnitude of vectors is crucial (e.g., comparing sales data, where larger magnitudes indicate higher sales), using cosine similarity would ignore valuable information.
Time Series Analysis:
For time-series data, the order and distance of data points matter. Cosine similarity does not account for these aspects and may not provide meaningful comparisons for temporal data.
Geospatial Data:
When working with geospatial coordinates (latitude, longitude), cosine similarity does not account for Earth’s curvature or distance metrics like the Haversine formula.
Data Representing Complex Structures:
For data representing graphs, trees, or other complex structures where connectivity or sequence matters, cosine similarity may not capture the intricate relationships between nodes or elements.
Vectors with Negative Components:
In cases where vectors have meaningful negative components (like certain word embeddings or feature vectors in machine learning models), cosine similarity can yield misleading similarity scores.
Suggestions for Alternatives
Euclidean Distance: When absolute magnitude is important, or when interpreting actual distances between points.
Jaccard Similarity: For binary or set-based data, where overlap or presence/absence matters.
Pearson Correlation: For datasets where linear relationships are of interest, especially with normally distributed values.
Hamming Distance: For comparing binary data, especially for bit strings or categorical attributes.
Manhattan Distance (L1 Norm): For high-dimensional data where you want to measure the absolute difference across dimensions.
Cosine similarity is effective for certain applications, such as text similarity, but its limitations make it unsuitable for other contexts where magnitude, distance, or data distribution play a critical role.
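For anyone who wants to see a couple of these alternatives side by side, here is a minimal sketch (assumes scipy; the toy count vectors are made up):

```python
import numpy as np
from scipy.spatial import distance

short_doc = np.array([3.0, 2.0, 0.0])    # toy term counts
long_doc  = np.array([30.0, 20.0, 0.0])  # same proportions, 10x the length
other_doc = np.array([0.0, 2.0, 3.0])

for name, d in [("cosine", distance.cosine),
                ("euclidean", distance.euclidean),
                ("cityblock (L1)", distance.cityblock)]:
    print(name, d(short_doc, long_doc), d(short_doc, other_doc))
# Cosine distance calls short_doc and long_doc identical (0.0) because only the
# proportions match; Euclidean and Manhattan see the 10x magnitude gap instead.
```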
I might have missed this, but I think the post might bury the lede that in a high dimensional space, two randomly chosen vectors are very unlikely to have high cosine similarity. Or maybe another way to put it is that the expected value of the cosine of two random vectors approaches zero as the dimensionality increases.
Most similarity metrics will be very low if vectors don't even point in the same direction, so cosine similarity is a cheap way to filter out the vast majority of the data set.
It's been a while since I've studied this stuff, so I might be off target.
Even if two random vectors don't have high cosine similarity, and I have not had this issue in 3000 dimensions, the cosine similarity is still usable in relative terms, i.e. relative to other items in the dataset. This keeps it useful.
Nitpick: The expected value of the cosine is 0 even in low-dimensional spaces. It’s the expected square of that (i.e. the variance) which gets smaller with the dimension.
Three separate passes over JavaScript arrays are quite costly, especially for high-dimensional vectors. I'd recommend using `TypedArray` with vanilla `for` loops. It will make things faster, and will allow using C extensions, if you want to benefit from modern hardware features, while still implementing the logic in JavaScript: https://ashvardanian.com/posts/javascript-ai-vector-search/
Rather superficial and obfuscating. The article keeps raising the question "why ignore the magnitude" and never answers it.
"The important part of an embedding is its direction, not its length. If two embeddings are pointing in the same direction, then according to the model they represent the same "meaning"."
This can't be quite right. Any LLM transformer model looks at the embedding of the token sequence (without normalizing, i.e. including its magnitude) when deciding on the next token. Why would you throw away that information, equivalent to throwing away one embedding dimension?
If I had to guess why cosine similarity is the standard for comparing embeddings I suspect it's simply because the score is bounded in [-1, 1], which you may find more interpretable than the unbounded score obtained by the unnormalized dot product or Euclidean distance.
In my experience, choice of similarity metric doesn't affect embedding performance much, simply use the one the embedding model was trained with.
Something worth mentioning is that if your vectors all have the same length, then cosine similarity and Euclidean distance will rank most (all?) neighbors in the same order. Think of your query vector as a point on a unit sphere. The Euclidean distance to a neighbor will be a chord from the query point to the neighbor. Just as with the angle between the query-to-origin and the neighbor-to-origin vectors, the farther you move the neighbor from the query point on the surface of the sphere, the longer the chord between those points gets too.
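A quick sanity check of that claim (numpy sketch with random unit vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=16)
query /= np.linalg.norm(query)
neighbors = rng.normal(size=(100, 16))
neighbors /= np.linalg.norm(neighbors, axis=1, keepdims=True)

cos_sims = neighbors @ query                          # all vectors are unit length
euclid   = np.linalg.norm(neighbors - query, axis=1)  # chord lengths on the sphere

# Ranking by decreasing cosine similarity matches ranking by increasing Euclidean distance.
print(np.array_equal(np.argsort(-cos_sims), np.argsort(euclid)))  # True
```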
For those who don't want to use a full-blown RAG database, scipy.spatial.distance has a convenient cosine distance function. And for those who don't even want to use SciPy, the formula in the linked post.
For anyone new to the topic, note that the monotonic interpretation of cosine distance is opposite to that of cosine similarity.
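For reference, a minimal usage sketch (note that scipy's `cosine` returns the distance, i.e. 1 minus the similarity):

```python
from scipy.spatial.distance import cosine

u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 0.0]

dist = cosine(u, v)   # cosine *distance* = 1 - cosine similarity
sim = 1.0 - dist
print(dist, sim)      # 0.5, 0.5: lower distance means higher similarity
```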
SciPy distances module has its own problems. It's pretty slow, and constantly overflows in mixed precision scenarios. It also raises the wrong type of errors when it overflows, and uses general purpose `math` package instead of `numpy` for square roots. So use it with caution.
Noted, and thanks for your great work. My experience with it is limited to working with LLM embeddings, which I believe have been cleanly between 0 and 1. As such, I am yet to encounter these issues.
Regarding the speed, yes, I wouldn't use it with big data. Up to a few thousand items has been fine for me, or perhaps a few hundred if pairwise.
Sorta related -- whenever I'm doing something with embeddings, i just normalize them to length one, at which point cosine similarity becomes a simple dot product. Is there ever a reason to not normalize embedding length? An application where that length matters?
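Something like this, in case it helps (a sketch with numpy and made-up random "embeddings"):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384))                        # pretend stored embeddings
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # normalize once, up front

query = rng.normal(size=384)
query /= np.linalg.norm(query)

scores = embeddings @ query            # plain dot products == cosine similarities now
top5 = np.argsort(-scores)[:5]
print(top5, scores[top5])
```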
For the LLM itself, length matters. For example, the final logits are computed as the un-normalized dot product, making them a function of both direction and magnitude. This means that if you embed then immediately un-embed (using the same embeddings for both), a different token might be obtained. In models such as GPT2, the embedding vector magnitude is loosely correlated with token frequency.
On the practical side, dot products are great, but break in mixed precision and integer representations, where accurately normalizing to unit length isn't feasible.
In other cases people prefer L2 distances for embeddings, where the magnitude can have a serious impact on the distance between a pair of points.
In high-dimensional spaces, the distances from a query point to its nearest and farthest points become almost equal with respect to the usual metrics.
Cosine similarity still works though, since it only looks at how aligned vectors are.
The thing that people tend to overlook is that there is no need for embeddings to be a vector space endowed with an inner product.
Words don't have this structure; we define it on the image of the mapping from words to n-tuples, and the embeddings we use co-evolved in such a way that we assume cosine similarity to be meaningful.
Maybe I missed it but I don’t see the author doing that in the article. They use a dot to denote dot product. They use X for multiplying the magnitude of two vectors, which isn’t my favorite thing but isn’t offensive.
I think it would have been helpful to mention the Pythagorean theorem, as most people are familiar with it, but otherwise the post did a great job explaining and introducing the topic.
The notation here is bad; the denominator of the fraction looks like a cross product.
As a games and graphics programmer I find it amazing that this would be a mystery... understanding the dot product is utterly foundational, and is high-school-level basics.
Cosine similarity for unit vectors looks like the angle between hands on a high dimensional clock. 12 and 6 have -1 cosine sim. 5 and 6 are pretty close.
Cosine similarity works if the model has been deliberately trained with cosine similarity as the distance metric. If they were trained with Euclidean distance the results aren’t reliable.
Example: (0,1) and (0,2) have a cosine similarity of 1 but nonzero Euclidean distance.
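A quick check of that example in numpy:

```python
import numpy as np

u, v = np.array([0.0, 1.0]), np.array([0.0, 2.0])
cos_sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_sim)                # 1.0: same direction, so "identical" to cosine similarity
print(np.linalg.norm(u - v))  # 1.0: yet the points are a full unit apart
```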