bclavie's comments

Hey! It’s more like comparing apples to apple pie.

BGE-M3 is a fine-tuned embedding model. This means they’ve taken a base language model, which was trained for just language modeling, then applied further fine-tuning to make it useful for a given application, in this case retrieval.

ModernBERT is one step back earlier in the pipeline: it’s the language model that application-specific models such as M3 build on.


They perform different roles, so they're not directly comparable.

Jina V3 is an embedding model: a base model further fine-tuned specifically for embedding-ish tasks (retrieval, similarity...). This is what we call "downstream" models/applications.

ModernBERT is a base model & architecture. It's not meant to be used out of the box, but to be fine-tuned for other use-cases, serving as their backbone. In theory (and, given early signal, most likely in practice too), it'll make for really good downstream embeddings once people build on top of it!
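
If it helps, here's roughly what "building on top of it" looks like with Sentence Transformers. This is a sketch, not a recipe: the Hugging Face id answerdotai/ModernBERT-base and the mean-pooling choice are assumptions, and you'd still need contrastive fine-tuning on retrieval/similarity data before it becomes a good embedder.

    from sentence_transformers import SentenceTransformer, models

    # Assumed HF id for the base encoder; treat it as an example backbone.
    backbone = models.Transformer("answerdotai/ModernBERT-base")
    pooling = models.Pooling(backbone.get_word_embedding_dimension(), pooling_mode="mean")

    # This only wraps the language model so it emits one vector per text. It is not yet
    # a strong embedding model: that takes the downstream fine-tuning step that models
    # like BGE-M3 or Jina V3 have already gone through.
    model = SentenceTransformer(modules=[backbone, pooling])
    print(model.encode("ModernBERT is the backbone, not the finished embedder.").shape)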


Sentence Transformers (https://sbert.net/), the most used library for embedding models (similarity, retrieval).


We had a bit of a discussion around it, but I figured that 6 years warranted the prefix, and it's easier to remember in the sea of new acronyms popping up everyday.

Besides, PostModernBERT will be there for us for the next generational jump.


Thank you! We're fixing the link.


Hey, Ben here, one of the paper's core authors. The responses you got were mostly spot on.

For (1), it's because BERT has both noticeably fewer parameters and we're comparing at short context length (in the interest of providing a broader comparison), so local attention is a lot less impactful than it is at longer context lengths.

For (2), most LLMs are actually decoder-only, so there is no "encoder" here. But also, there's not a lot of LLMs in the ±100M parameter range in the first place!


Hey! Thanks for posting this. I'm the author of this post -- please feel free to shout if you've got any questions


Longer Background/Explanation:

I’ve been working on RAG problems for quite a while now, and it’s very apparent that solving real-life problems with it is very, very different from the basic tutorials around.

There are a million moving parts, but a huge one is obviously the model you use to retrieve the data. The most common approach relies on just using dense embeddings (like OpenAI’s embedding models) and retrieving the documents whose embedding vectors are closest to the query’s own embedding.

The problem is that in practice, it’s a bit of a Sisyphean task: you’re asking a model to compress a document into a tiny vector. And then, it must also be able to encode a very differently worded query into another tiny vector, that must look similar to the previous vector. And it must do so in a way that can represent any specific aspect of the document that could be requested.
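
Concretely, the single-vector setup looks something like this (a minimal sketch using sentence-transformers; the model name is just an example off-the-shelf dense embedder):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # example dense embedding model

    documents = [
        "Our refund policy allows returns within 30 days of purchase.",
        "The API rate limit is 100 requests per minute per key.",
    ]
    doc_vectors = model.encode(documents)  # one small vector per *entire* document
    query_vector = model.encode("how long do I have to send an item back?")

    # Retrieval is a single cosine similarity per document: the whole document and the
    # whole query each had to be squeezed into one tiny vector for this to work.
    scores = util.cos_sim(query_vector, doc_vectors)
    print(documents[int(scores.argmax())])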

The result is that dense embeddings require tons of data to be trained (billions of pre-training examples), are relatively hard to fine-tune (you must find a hard-to-strike balance), and have been shown many times in the Information Retrieval (IR) literature to generalise worse outside of known benchmarks. This doesn’t mean they’re not a very useful tool, but there might be more suitable tools for retrieving your data.

In the IR literature again, late-interaction models or “sparse embedding” approaches like ColBERT or SparseEmbed are clear winners. They train quickly, need less data, fine-tune relatively easily, and generalise very well (their zero-shot performance is never far behind fine-tuned performance!)

This is because these models don’t encode full documents: they create bags-of-embeddings! It’s a twist on the old-timey keyword-based retrieval, except instead of hardcoded keywords, we now use contextualised semantic keywords. The models capture the meaning of all the “small units of content” within their context.

From there, a document’s represented as the sum of its parts. At retrieval time, “all you need to” is to match your query’s “semantic keywords” to the ones in your documents. It’s much easier for the model to learn representation for these tiny units, and much easier to match them. So what’s the catch? Why is this not everywhere? Because IR is not quite NLP — it hasn’t gone fully mainstream, and a lot of the IR frameworks are, quite frankly, a bit of a pain to work with in-production. Some solid efforts to bridge the gap like Vespa [1] are gathering steam, but it’s not quite there.

[1] https://vespa.ai
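
To make the “bag of semantic keywords” idea concrete, here’s the late-interaction (ColBERT-style MaxSim) scoring rule in plain numpy. The token embeddings below are random placeholders; in practice they’d be contextualised token embeddings produced by a ColBERT-style encoder.

    import numpy as np

    rng = np.random.default_rng(0)

    def normalise(x):
        # L2-normalise so dot products are cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Placeholder contextualised token embeddings (stand-ins for encoder outputs).
    query_tokens = normalise(rng.normal(size=(8, 128)))   # 8 query tokens, 128-dim each
    doc_tokens = normalise(rng.normal(size=(300, 128)))   # 300 document tokens

    # Late interaction / MaxSim: each query token picks its best-matching document token,
    # and the document score is the sum of those maxima. No single document vector needed.
    similarity_matrix = query_tokens @ doc_tokens.T       # shape (8, 300)
    score = similarity_matrix.max(axis=1).sum()
    print(score)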


Thanks! I agree -- I find it much easier to skim a few paragraphs than to skim through a video when trying to consume information quickly if I'm not sure I want to commit to a full, long vid. Hoping to make it useful enough that it ends up paying for its own server costs so I can keep it around!


Merci! It's early on but I'm quite happy with how the first prototype turned out.

