Yeah, that's great. As a full-time tech writer, I use a similar system when editing someone else's work. I've been using stupid acronyms like JAA (just an alternative [way of writing X]), which would be something like a 1-2/10, and FBND (fix before next draft) for an 8-10/10. Adam's number scale is far more nuanced and concise.
If anyone else out there was put off by the potential for the title to conform to Betteridge’s Law of Headlines (“Any headline that ends in a question mark can be answered by the word no”), you can rest easy. This one is a “maybe”.
Really drawn to the idea of storing the final recommendations in a separate file. Didn't write it that way initially because most SSGs handle "data" files in their own unique way, and one of my goals was to be as "light touch" as possible. But I guess that could be solved with some config options (set `path/to/data.json` or similar).
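For the record, a minimal sketch of what I have in mind (the `data/related.json` path and the shape of the JSON are hypothetical, just for illustration):

```python
import json

# Hypothetical config option: where the SSG should pick up the data file.
OUTPUT_PATH = "data/related.json"

# `recommendations` maps each post's slug to its top related posts, e.g.
# {"my-post": [{"slug": "other-post", "score": 0.91}, ...]}
def write_recommendations(recommendations, path=OUTPUT_PATH):
    with open(path, "w") as f:
        json.dump(recommendations, f, indent=2)
```

The SSG's templates could then look up the current post's slug in that one file, rather than each page carrying its own copy of the data.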
I agree that "79% match" doesn't mean anything on its own (and is, yes, totally arbitrary in a lot of ways), but it does provide some context when browsing across the whole site. It's a way to indicate that a "90% match" is more similar than a "60% match". Felt like useful info to me, so that's why I included it.
As for "why only top two?" - I'm constantly paranoid about adding too many bells and whistles to my blog and overloading the patience of whoever's taken the time to read. If I had official rules, they'd read "no carousels, and as few Calls To Action on a page as possible". I'm not super strict about it, but 1 recommendation felt too stingy and 3 felt like too many.
BM25 is new to me, and I'm mostly a n00b when it comes to "proper" search. But I'll definitely do some more reading. Currently setting up a head-to-head with an embedding index and a fuzzy-search library, but don't have any scientific way to measure the results. Sounds like you may have pointed me in the direction of a missing piece of the puzzle. Thanks!
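(In case anyone's setting up a similar head-to-head: a minimal sketch of BM25 scoring in Python, assuming the `rank_bm25` package; the corpus and query below are just placeholders.)

```python
from rank_bm25 import BM25Okapi

# Toy corpus: one entry per blog post (placeholder text).
posts = [
    "drawing generative art with svg paths",
    "building a synth with the web audio api",
    "static site generators and data files",
]
bm25 = BM25Okapi([post.split() for post in posts])

# Score every post against a query, highest first.
scores = bm25.get_scores("svg generative art".split())
ranked = sorted(zip(posts, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)
```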
Most SSGs will copy a literal file without a problem. I dump all the HTML snippets under a /metadata/ prefix and then in Hakyll it's just a matter of doing a literal-copy of '/metadata/*'. (Our convention for annotations or backlinks or similar links is that they live at '/metadata/$TYPE/$ESCAPED_URL.html'. Then you can fill in the necessary links easily in the template as long as you have '$ESCAPED_URL' provided to the template by the SSG.) The real obstacle is that most SSGs want you to do any transclusion at compile time, even though this leads to potentially exponential explosions of size, and won't include any JS library for doing client-side transclusion.
(And you do need a JS library; it's not just a line or two of throwaway code. Client-side transclusion is a bit tricky to get right for use cases as advanced and general-purpose as ours - we use it for lots of things. Transclude other pages, transclude sections of pages, transclude section ranges, recursive transclusions... Needs to make sure styles get applied, render it off-page for acceptable performance, rewrite paths inside the transcluded HTML so links go where you expect them to - that sort of thing.)
The percent match is also misleading because there is no sense in which it is a percentage. It just isn't. '79% match' is not 1% more similar than '78% match'. My finding with the OA embedding is that a distance of 0.01 actually corresponds to a pretty large semantic distance, and after a few more increments, the suggestions are worthless. Also consider this: a distance of 0 (i.e. itself) may arguably be '100%' (hard to get more similar than itself!), but then what is a distance like 1? (And can't the cosine distance go higher?) Can you really be '0% similar', never mind '-10% similar'? It is true that 80% is better than 79%, but that's all that means, and you can present that by simply putting them in a list by distance, as you do already.
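To make that concrete, here's a rough sketch of why the naive scaling breaks down (assuming the percentage comes from multiplying cosine similarity by 100, which is a guess at how the numbers were produced):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Cosine similarity ranges over [-1, 1], so "similarity * 100" happily
# produces a "-100% match" for opposite vectors...
print(cosine_similarity([1, 0], [1, 0]) * 100)   # 100.0 (a post vs. itself)
print(cosine_similarity([1, 0], [0, 1]) * 100)   # 0.0
print(cosine_similarity([1, 0], [-1, 0]) * 100)  # -100.0
# ...and nothing guarantees that equal increments correspond to equal
# semantic differences.
```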
Thanks for the heads up, although the legacy shut-off won't affect this script.
Not sure if it's clear from the post or not, but this script uses the newer, recommended `v1/chat/completions` endpoint rather than the deprecated `v1/completions` endpoint.
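For reference, a minimal sketch of what a call to that endpoint looks like (the model name and prompt here are placeholders, not the script's actual values):

```python
import json
import urllib.request

# Placeholder model and prompt; the real script's values differ.
payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Summarise this post: ..."}],
}

req = urllib.request.Request(
    "https://api.openai.com/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```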
Very true! The real limiting factor on the success of this feature is the lack of posts on my blog :D
Works much more effectively for posts about topics I've covered a lot, like SVG or web audio. Happy with the outcomes in general, though, as I can now apply it to more content-rich projects.
Even that simple explanation makes my brain itch a little :D I never did master trig
I'm curious if there ARE alternative methods to cosine similarity. A lot of the things I've read mention that cosine similarity is "one of the ways to compute distance..." or "a simple way...". But I've not seen any real suggestions for alternatives. Guess everyone's thinking "if it ain't broke, don't fix it", as cosine similarity works pretty darn well.
Yeah, there are a few other ways. The most common is the “L2 norm” (Euclidean distance), which is the length of the hypotenuse of a right triangle. So if your points are (x1,y1), (x2,y2) then it is sqrt((x1-x2)^2 + (y1-y2)^2), which you might recognise from Pythagoras’ theorem (c^2 = a^2 + b^2). If you have 1000 dimensions then instead of just twice for x and y you are doing that a thousand times, but the principle is the same.
Another one is “Manhattan distance” (known as the L1 norm, or sometimes as “taxicab distance”), which is just abs(x1-x2)+abs(y1-y2) in that example. If you imagine a set of city blocks and you want to go from one place to another, the cab has to go north/south and east/west and can’t go diagonally. That’s the distance it travels. You’re adding up all the north/south parts and the east/west parts.
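A quick sketch of both, generalised to n dimensions (purely illustrative; any numeric library will have these built in):

```python
import math

def euclidean(a, b):  # L2 norm of the difference
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):  # L1 norm, aka taxicab distance
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0 (the hypotenuse: sqrt(3^2 + 4^2))
print(manhattan(p, q))  # 7   (3 blocks east + 4 blocks north)
```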
There are a bunch of other distance measures too, e.g. on one project I worked on we used Mahalanobis distance, which is a more complex measure that adjusts for covariance between the dimensions of your space. That wouldn’t be useful for this particular problem, though.
To be honest, I've sidestepped the issue now that GPT-4's context size has increased to ~8,000 tokens. None of my content goes over that limit.
I did build in a "chunking" mechanism to break the article into sections if it was over the limit, but I'm not entirely sure how effective the summarisation would be for those... Summarising the first part of an article and then doing a separate summary for the n-th part would probably produce confusing results when recombined.
Might be some "prompt magic" that can make it possible, but for the 90% case I bet you'd get perfectly usable results just by using the first under-limit chunk of the content. Not tested that idea in the real world yet, though.
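If anyone wants to try the same thing, here's a rough sketch of that "first under-limit chunk" idea using `tiktoken` (the 6,000-token limit is an assumption, leaving headroom for the prompt and the response):

```python
import tiktoken

def truncate_to_limit(text, model="gpt-4", limit=6000):
    """Keep only the first `limit` tokens of `text`."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    return text if len(tokens) <= limit else enc.decode(tokens[:limit])
```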
Yeah - it's a great idea. The size of the embeddings is the big restricting factor IMO. Even with my approach of embedding the entire article, my embeddings index was about the same size as my "regular" search index.
Once you start increasing the granularity of what you're embedding (either by paragraph or sentence) then the old-fashioned search index has a big advantage.
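Some back-of-the-envelope numbers, assuming OpenAI's 1,536-dimension embeddings stored as 32-bit floats (the post and paragraph counts are made up):

```python
DIMS = 1536          # text-embedding-ada-002 output size
BYTES_PER_FLOAT = 4  # float32

posts = 100
paragraphs_per_post = 20

per_embedding = DIMS * BYTES_PER_FLOAT                       # ~6 KB each
whole_posts = posts * per_embedding                          # ~600 KB
per_paragraph = posts * paragraphs_per_post * per_embedding  # ~12 MB
print(per_embedding, whole_posts, per_paragraph)
```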
Might be worth it in some scenarios because of the quality of the results. I bet there are places where an embedding search would be more effective by orders of magnitude.
My default approach is to use as few dependencies as possible for Proof of Concept experiments like this one, but it DOES favour convenience over long-term efficiency and performance.
I think a single-author blog is about the limit for what can be handled by my flat-file + working-memory approach. Any dataset that's much larger would almost certainly need a vector-friendly DB. Typesense looks like it might be a good fit, and so does the pgvector extension for Postgres.
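For example, a rough sketch of what the pgvector lookup could look like (the table and column names are made up, and this assumes the embeddings are already stored in Postgres):

```python
import psycopg2

conn = psycopg2.connect("dbname=blog")
cur = conn.cursor()

# `<=>` is pgvector's cosine-distance operator; smallest distance = most similar.
# Hypothetical schema: posts(slug text, embedding vector(1536)).
query_embedding = [0.1] * 1536  # placeholder; use the current post's embedding
cur.execute(
    """
    SELECT slug FROM posts
    WHERE slug != %s
    ORDER BY embedding <=> %s::vector
    LIMIT 2
    """,
    ("current-post", str(query_embedding)),
)
print(cur.fetchall())
```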