Hi gang, Jeremy from Answer.AI here. Nice to see this on HN! :) We're very excited about this model release -- it feels like it could be the basis of all kinds of interesting new startups and projects.
In fact, the stuff mentioned in the blog post is only the tip of the iceberg. There are a lot of opportunities to fine-tune the model in all kinds of ways, which I expect will go far beyond what we've managed to achieve in our limited exploration so far.
Anyhoo, if anyone has any questions, feel free to ask!
Jeremy, this is awesome! Personally excited for a new wave of sentence transformers built off ModernBERT. A poster below provided the link to a sample ST training script in the ModernBERT repo, so that's great.
Do you expect the ModernBERT STs to carry the same advantages over ModernBERT that BERT STs had over the original BERT? Or would you expect caveats based on ModernBERT's updated architecture and capabilities?
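For anyone who wants to experiment before the official recipes land, here's a minimal sketch of fine-tuning a SentenceTransformer on top of ModernBERT. This is not the repo's training script; the checkpoint ID, the toy sentence pairs, and the hyperparameters are all just placeholder assumptions.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Wrap ModernBERT as the word-embedding backbone, then mean-pool into a sentence vector.
# (Checkpoint ID assumed; requires a transformers version with ModernBERT support.)
word_emb = models.Transformer("answerdotai/ModernBERT-base", max_seq_length=512)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_emb, pooling])

# Toy similarity pairs purely for illustration; use a real STS/retrieval dataset in practice.
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."], label=0.9),
    InputExample(texts=["A man is eating food.", "The girl is playing guitar."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# Legacy fit API; the newer SentenceTransformerTrainer works too.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```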
Hey Jeremy, very exciting release! I'm currently building my first product with RoBERTa as one central component, and I'm very excited to see how ModernBERT compares. Quick question: When do you think the first multilingual versions will show up? Any plans of you training your own?
Thank you so much for doing this work. I expect many NLP projects and organizations are going to benefit from this, and I'm looking forward to all the models that will be derived from this. I'm already imagining the things I might try to build with it over the holiday break.
Tiny feedback maybe you can pass along to whoever maintains the HuggingFace blog — the GTE-en-MLM link is broken.
1) Going by the Runtime vs GLUE graph, ModernBERT-Base is roughly as fast as BERT-Base. Given its architecture (especially the alternating attention), I'm curious why the model isn't considerably faster than its predecessor. Any insight you could share on that?
2) Most modern LLMs are Encoder+Decoder models. Why not chop off the decoder of one of these (e.g. a small Llama or Mistral or other liberally-licensed model) and train a short head on top?
Hey, Ben here, one of the paper's core authors. The responses you got were mostly spot on.
For (1), it's because BERT has noticeably fewer parameters, and we're comparing at a short context length (in the interest of providing a broader comparison), so local attention is a lot less impactful than it is at longer context lengths.
For (2), most LLMs are actually decoder-only, so there is no "encoder" here. But also, there's not a lot of LLMs in the ±100M parameter range in the first place!
Beyond what the others have said about 1) ModernBERT-base being 149M parameters vs BERT-base's 110M and 2) most LLMs being decoder-only models, also consider that alternating attention (local vs global) only starts helping once you're processing longer texts. With short texts, local attention is equivalent to global attention.
I'm not sure what length was used in the picture, but GLUE is mostly pretty short text.
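To make that concrete, here's a small illustrative sketch (the 128-token window is just an example figure, and real implementations differ in the exact masking details): a bidirectional sliding-window mask is identical to a full-attention mask whenever the sequence fits inside the window, and only becomes sparser, hence cheaper, on long inputs.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where token i may attend to token j (bidirectional local attention).
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window // 2

def global_mask(seq_len: int) -> torch.Tensor:
    # Full bidirectional attention: every token attends to every token.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

# A short GLUE-style input fits inside the window, so local == global.
assert torch.equal(sliding_window_mask(64, 128), global_mask(64))

# At long context the local mask is much sparser, which is where the speedup comes from.
local = sliding_window_mask(8192, 128)
print(f"fraction of attended pairs at 8192 tokens: {local.float().mean():.4f}")
```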
On your second point, most modern LLMs are decoder-only. As for why adding a classification head to one isn't ideal: the decoders you're referring to have roughly 10x the parameters and aren't trained on encoder-style objectives like MLM, so there's no advantage on any dimension, really.
Llama and Mistral are decoder-only models; there is no encoder you could put a head on.
You could put it on the decoder instead, but then you have the problem that in the causal language-modeling setting that the model was trained for, every token can only attend to preceding tokens and is blind to subsequent ones.
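To make the contrast concrete, here's a minimal sketch of what "adding a classification head" looks like for an encoder versus a decoder-only model in transformers. The checkpoint IDs and label count are placeholders, and loading ModernBERT this way assumes a recent enough transformers version.

```python
from transformers import AutoModelForSequenceClassification

# Encoder with a classification head: bidirectional attention means the pooled
# representation is informed by both left and right context of every token.
enc_clf = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base", num_labels=2
)

# The same head on a decoder-only model: attention is causal, so only the final
# token has seen the full input; transformers therefore pools the last
# non-padding token's hidden state for the classification head.
dec_clf = AutoModelForSequenceClassification.from_pretrained(
    "mistralai/Mistral-7B-v0.1", num_labels=2
)
```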
A multilingual version will probably be needed, like with BERT and RoBERTa. I should hasten to add that for multilingual tasks (beyond language detection), either simpler methods for classification/prediction (e.g. word frequency, BERTopic-like approaches, or SVMs; see the sketch after the list below) or LLMs are generally better candidates.
There are a couple of reasons:
1) Covering multiple languages with good BLEU scores is too much to ask of a model that size (even the large one).
2) Encoder and encoder-decoder models don't tend to get trained on translation as much as e.g. GPT-style models, whose datasets include large amounts of translated text across multiple languages (with exceptions such as T5's translation task).
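Here's the kind of simple baseline I mean, as a rough sketch. The toy data and labels are made up purely for illustration; character n-grams plus a linear SVM side-step tokenization differences across languages and often go surprisingly far.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy multilingual examples; in practice you'd use a real labelled dataset.
texts = [
    "the package never arrived",
    "das Paket kam nie an",
    "great service, thank you",
    "super Service, danke",
]
labels = ["complaint", "complaint", "praise", "praise"]

# Character n-gram TF-IDF works across languages without a shared vocabulary.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
clf.fit(texts, labels)
print(clf.predict(["le colis n'est jamais arrivé"]))
```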
Hey! It’s more like comparing apples to apple pie.
BGE-M3 is a fine-tuned embedding model. That means its creators took a base language model, which was trained for just language modeling, then applied further fine-tuning to make it useful for a given application, in this case retrieval.
ModernBERT sits one step earlier in the pipeline: it's the language model that application-specific models such as M3 are built on.
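A quick sketch of the difference in practice (checkpoint IDs, the mask token, and ModernBERT support in your installed transformers version are all assumptions here):

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# ModernBERT is the raw language model: out of the box it predicts masked
# tokens rather than producing retrieval-ready embeddings.
fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
print(fill("Paris is the [MASK] of France.")[0]["token_str"])

# BGE-M3 has already been fine-tuned for embeddings/retrieval, so it exposes
# a ready-to-use sentence encoder.
embedder = SentenceTransformer("BAAI/bge-m3")
vecs = embedder.encode([
    "What is the capital of France?",
    "Paris is the capital of France.",
])
print(vecs.shape)
```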