That cosine distance works at all as a concept isn't terribly shocking, especially given our habit of norming everything. Cosine similarity on unit vectors is monotonic in Euclidean distance, and we're using this stuff in "select the K most relevant vectors" sorts of queries, so it ranks results exactly the way Euclidean distance would. Tack on the fact that every finite set of vectors, with your favorite metric, can be embedded in Euclidean space with at most ~41% relative error (errors that high require somewhat special circumstances, so you'd expect most real-world data to come in lower -- and the bound is a worst case over pairs of points anyway, so many pairs will do much better), and you can use normed cosine similarity somewhat reasonably on any finite set of stuff you care about, so long as you choose an appropriate embedding. All sets of things you care about in ML are finite, and the sub-metric induced by whichever infinite set you're considering works just fine for everything we've discussed, so cosine similarity is reasonable for all practical ML purposes.
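To make the unit-sphere point concrete, here's a minimal numpy sketch (random vectors standing in for real embeddings; all the names are made up) checking both the identity ||q - d||^2 = 2 - 2*cos(q, d) and that top-K by cosine similarity and top-K by Euclidean distance pick out the same vectors in the same order:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))   # stand-ins for real embedding vectors
query = rng.normal(size=64)

# Norm everything onto the unit sphere.
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

cos_sim = docs @ query
eucl = np.linalg.norm(docs - query, axis=1)

# On the unit sphere, ||q - d||^2 = 2 - 2*cos(q, d) ...
assert np.allclose(eucl**2, 2 - 2 * cos_sim)

# ... so "K most similar by cosine" and "K nearest by Euclidean distance"
# select the same vectors in the same order.
k = 10
assert list(np.argsort(-cos_sim)[:k]) == list(np.argsort(eucl)[:k])
```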
It's much more interesting that almost any set of ML-adjacent vectors can be somewhat reasonably compared via cosine distance _even without_ explicitly constructing an optimal embedding. It's not at all intuitive to me that an autoencoder's interior layer should behave well with respect to cosine similarity and not have any knots or anything warping that (associated) metric's usefulness.
Tbh, I would argue that's also pretty surprising, as Euclidean distance is notoriously unintuitive[1] (and noisy) in higher dimensions. (I guess norming does help, so that's likely a good point.)
I don't have any references off the top of my head, but I can state a result (which hopefully is easier to search) and then build to the thing I claimed:
- There's a well-known result that says you can embed any finite, undirected, unweighted graph (with shortest-path as the metric of choice) in Euclidean space with at most sqrt(2) multiplicative error (~41% relative error). The proof of that, as I recall, basically shows that the worst case is a graph containing a 4-vertex cycle, from which the result immediately follows (the first sketch after this list numerically checks the 4-cycle case).
- Construct the complete graph where each vertex is one of your (finitely many) vectors.
- Let G(0) denote that graph.
- Greedily construct a sequence of graphs G(1), G(2), ... by adding one extra vertex that splits some edge. The edge you split is the one that minimizes the skew between your metric's distances on the original vectors and the new shortest-path distances. When more than one split is equally good, pick one arbitrarily. (The second sketch after this list shows the subdivision idea, just not the greedy version.)
- Note that the G(N) give better and better rational approximations of the desired distances, with skew approaching 0 as N approaches infinity.
- Using that well-known result, we can embed each G(N) with at most ~41% relative error, and as N grows the extra relative error from the approximation goes to 0, so multiplying the two factors together you still get ~41% in the limit.
- Restrict the embedding of G(N) to the original vertices (delete the extra ones) to obtain the embedding for your original vectors.
- This isn't strictly necessary, but you can (I'm 90% sure -- I'd have to work through these details very carefully to be positive) obtain sqrt(2) exactly (instead of just less than sqrt(2)+epsilon for every positive epsilon) in a few ways. The one I'm most familiar with falls back on logical compactness, basically taking all the math that exists for the hyperreals and using it to create a nonstandard space holding all of these graphs. The limit approaching 0 can be phrased as a first-order statement whose consequences let you conclude that there exists an embedding which actually attains the limit. Topological arguments of some sort would probably be a more common way to go about it.
- That last point (attaining sqrt(2)) also follows from Bolzano-Weierstrass. If you have n points, unroll each embedding into a single point of an n^2-dimensional space and take a convergent subsequence of that sequence of embeddings (as a technical detail, center them on the origin so they all sit in a bounded set). The skew of the limiting embedding equals the limit of the skews in this case, so some embedding actually attains the bound (this is written out a bit more carefully in the last sketch below).
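Here's the 4-cycle check referenced in the first bullet -- just the arithmetic, nothing pulled from a reference. It draws C4 as a unit square and compares hop distances to Euclidean ones; the spread of ratios comes out to sqrt(2), i.e. the ~41% figure. It only verifies that this one embedding of this one graph sits at sqrt(2), not that sqrt(2) suffices in general:

```python
import itertools
import math

# The 4-cycle 0-1-2-3-0, drawn as the corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

def hop_dist(i, j):
    # Shortest-path distance on the cycle: go whichever way around is shorter.
    return min(abs(i - j), 4 - abs(i - j))

ratios = [math.dist(square[i], square[j]) / hop_dist(i, j)
          for i, j in itertools.combinations(range(4), 2)]

# Worst expansion over worst contraction = the multiplicative error.
print(max(ratios) / min(ratios))  # 1.4142... = sqrt(2)
```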
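And here's the rough sketch of the edge-splitting construction. Rather than the greedy one-edge-at-a-time version above, it subdivides every edge of the complete graph at a fixed resolution (an integer weight k stands in for k unit edges and k-1 extra vertices), which gets at the same point: the shortest-path distances become better and better rational approximations of the original metric as the resolution grows. The L1 distance on random points is an arbitrary stand-in for "your favorite metric":

```python
import itertools
import random

def floyd_warshall(w):
    """All-pairs shortest paths for a dense weight matrix (list of lists)."""
    n = len(w)
    dist = [row[:] for row in w]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def max_skew(original, approx):
    """Worst multiplicative gap between two metrics on the same point set."""
    worst = 1.0
    for i, j in itertools.combinations(range(len(original)), 2):
        r = approx[i][j] / original[i][j]
        worst = max(worst, r, 1.0 / r)
    return worst

# A small finite metric: random 3-d points under the L1 (taxicab) distance.
random.seed(0)
pts = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(6)]
d = [[sum(abs(a - b) for a, b in zip(p, q)) for q in pts] for p in pts]

for scale in (1, 10, 100, 1000):
    # Weight k on an edge == splitting it into k unit edges (k - 1 new
    # vertices), so shortest paths here are shortest paths in the
    # subdivided, unweighted graph G(N) from the list above.
    w = [[0 if i == j else max(1, round(scale * d[i][j]))
          for j in range(len(d))] for i in range(len(d))]
    sp = floyd_warshall(w)
    approx = [[sp[i][j] / scale for j in range(len(sp))] for i in range(len(sp))]
    print(f"scale={scale:5d}  skew={max_skew(d, approx):.4f}")
```

The printed skew heads toward 1.0 (i.e. 0 extra error), which is the "approaching 0 skew" step; the sqrt(2) factor then comes from applying the embedding result to each G(N).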
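For the Bolzano-Weierstrass point, here's the compactness step written out a bit more carefully (my phrasing, so treat it as a sketch rather than a proof):

```latex
% Normalize each near-optimal embedding to be non-contracting, and pin one
% point at the origin so the whole embedding sits in a bounded region:
f_k(x_1) = 0, \qquad
d(x_i, x_j) \;\le\; \lVert f_k(x_i) - f_k(x_j) \rVert
           \;\le\; \Bigl(\sqrt{2} + \tfrac{1}{k}\Bigr)\, d(x_i, x_j).
% Unrolled, each f_k is a point in a bounded subset of R^{n^2}, so
% Bolzano-Weierstrass gives a convergent subsequence f_{k_j} -> f.  Skew is
% continuous as long as distinct points stay distinct, which the left-hand
% inequality guarantees in the limit, so
\operatorname{skew}(f) \;=\; \lim_{j \to \infty} \operatorname{skew}(f_{k_j}) \;\le\; \sqrt{2}.
```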
I'm still not exactly sure how the result I'm basing this all on is proven.