Hi gang, Jeremy from Answer.AI here. Nice to see this on HN! :) We're very excited about this model release -- it feels like it could be the basis of all kinds of interesting new startups and projects.
In fact, the stuff mentioned in the blog post is only the tip of the iceberg. There are a lot of opportunities to fine-tune the model in all kinds of ways, which I expect will go far beyond what we've managed to achieve in our limited exploration so far.
Anyhoo, if anyone has any questions, feel free to ask!
Jeremy, this is awesome! Personally excited for a new wave of sentence transformers built off ModernBERT. A poster below provided the link to a sample ST training script in the ModernBERT repo, so that's great.
Do you expect the ModernBERT STs to carry the same advantages over ModernBERT that BERT STs had over the original BERT? Or would you expect caveats based on ModernBERT's updated architecture and capabilities?
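For anyone who wants to experiment before the official recipes land, here's a minimal sketch of fine-tuning a SentenceTransformer on top of ModernBERT. This is not the repo's training script; the checkpoint ID, the toy sentence pairs, and the hyperparameters are all just placeholder assumptions.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Wrap ModernBERT as the word-embedding backbone, then mean-pool into a sentence vector.
# (Checkpoint ID assumed; requires a transformers version with ModernBERT support.)
word_emb = models.Transformer("answerdotai/ModernBERT-base", max_seq_length=512)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_emb, pooling])

# Toy similarity pairs purely for illustration; use a real STS/retrieval dataset in practice.
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."], label=0.9),
    InputExample(texts=["A man is eating food.", "The girl is playing guitar."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# Legacy fit API; the newer SentenceTransformerTrainer works too.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```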
Hey Jeremy, very exciting release! I'm currently building my first product with RoBERTa as one central component, and I'm very excited to see how ModernBERT compares. Quick question: When do you think the first multilingual versions will show up? Any plans of you training your own?
Thank you so much for doing this work. I expect many NLP projects and organizations are going to benefit from this, and I'm looking forward to all the models that will be derived from this. I'm already imagining the things I might try to build with it over the holiday break.
Tiny feedback maybe you can pass along to whoever maintains the HuggingFace blog — the GTE-en-MLM link is broken.
1) Going by the Runtime vs GLUE graph, ModernBERT-Base is roughly as fast as BERT-Base. Given its architecture (especially the alternating attention), I'm curious why the model isn't considerably faster than its predecessor. Any insight you could share on that?
2) Most modern LLMs are Encoder+Decoder models. Why not chop off the decoder of one of these (e.g. a small Llama or Mistral or other liberally-licensed model) and train a short head on top?
Hey, Ben here, one of the paper's core authors. The responses you got were mostly spot on.
For (1), it's because BERT has noticeably fewer parameters, and we're comparing at a short context length (in the interest of providing a broader comparison), so local attention is a lot less impactful than it is at longer context lengths.
For (2), most LLMs are actually decoder-only, so there is no "encoder" here. But also, there's not a lot of LLMs in the ±100M parameter range in the first place!
Beyond what the others have said about 1) ModernBERT-base being 149M parameters vs BERT-base's 110M and 2) most LLMs being decoder-only models, also consider that alternating attention (local vs global) only starts helping once you're processing longer texts. With short texts, local attention is equivalent to global attention.
I'm not sure what length was used in the picture, but GLUE is mostly pretty short text.
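To make that concrete, here's a small illustrative sketch (the 128-token window is just an example figure, and real implementations differ in the exact masking details): a bidirectional sliding-window mask is identical to a full-attention mask whenever the sequence fits inside the window, and only becomes sparser, hence cheaper, on long inputs.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where token i may attend to token j (bidirectional local attention).
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window // 2

def global_mask(seq_len: int) -> torch.Tensor:
    # Full bidirectional attention: every token attends to every token.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

# A short GLUE-style input fits inside the window, so local == global.
assert torch.equal(sliding_window_mask(64, 128), global_mask(64))

# At long context the local mask is much sparser, which is where the speedup comes from.
local = sliding_window_mask(8192, 128)
print(f"fraction of attended pairs at 8192 tokens: {local.float().mean():.4f}")
```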
On your second point, most modern LLMs are decoder-only. As for why adding a classification head to one isn't ideal: the decoders you're referring to have roughly 10x the parameters and aren't trained on encoder-style objectives like MLM, so there's no advantage on any dimension, really.
Llama and Mistral are decoder-only models; there is no encoder you could put a head on.
You could put it on the decoder instead, but then you have the problem that in the causal language-modeling setting that the model was trained for, every token can only attend to preceding tokens and is blind to subsequent ones.
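To make the contrast concrete, here's a minimal sketch of what "adding a classification head" looks like for an encoder versus a decoder-only model in transformers. The checkpoint IDs and label count are placeholders, and loading ModernBERT this way assumes a recent enough transformers version.

```python
from transformers import AutoModelForSequenceClassification

# Encoder with a classification head: bidirectional attention means the pooled
# representation is informed by both left and right context of every token.
enc_clf = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base", num_labels=2
)

# The same head on a decoder-only model: attention is causal, so only the final
# token has seen the full input; transformers therefore pools the last
# non-padding token's hidden state for the classification head.
dec_clf = AutoModelForSequenceClassification.from_pretrained(
    "mistralai/Mistral-7B-v0.1", num_labels=2
)
```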
A multilingual version will probably be needed, like with BERT and RoBERTa. I should hasten to add that for multilingual tasks (beyond language detection), either simpler methods for classification/prediction (e.g. word frequency, BERTopic-like approaches, or SVMs; see the sketch after the list below) or LLMs are generally better candidates.
There are a couple of reasons:
1) Covering multiple languages with good BLEU scores is too much to ask of a model that size (even the large one).
2) Encoder and encoder-decoder models don't tend to get trained on translation as much as e.g. GPT-style models, whose datasets include large amounts of translated text across multiple languages (with exceptions such as T5's translation task).
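Here's the kind of simple baseline I mean, as a rough sketch. The toy data and labels are made up purely for illustration; character n-grams plus a linear SVM side-step tokenization differences across languages and often go surprisingly far.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy multilingual examples; in practice you'd use a real labelled dataset.
texts = [
    "the package never arrived",
    "das Paket kam nie an",
    "great service, thank you",
    "super Service, danke",
]
labels = ["complaint", "complaint", "praise", "praise"]

# Character n-gram TF-IDF works across languages without a shared vocabulary.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
clf.fit(texts, labels)
print(clf.predict(["le colis n'est jamais arrivé"]))
```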
Hey! It’s more like comparing apples to apple pie.
BGE-M3 is a fine-tuned embedding model. That means its creators took a base language model, which was trained for just language modeling, then applied further fine-tuning to make it useful for a given application, in this case retrieval.
ModernBERT sits one step earlier in the pipeline: it's the language model that application-specific models such as M3 are built on.
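A quick sketch of the difference in practice (checkpoint IDs, the mask token, and ModernBERT support in your installed transformers version are all assumptions here):

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# ModernBERT is the raw language model: out of the box it predicts masked
# tokens rather than producing retrieval-ready embeddings.
fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
print(fill("Paris is the [MASK] of France.")[0]["token_str"])

# BGE-M3 has already been fine-tuned for embeddings/retrieval, so it exposes
# a ready-to-use sentence encoder.
embedder = SentenceTransformer("BAAI/bge-m3")
vecs = embedder.encode([
    "What is the capital of France?",
    "Paris is the capital of France.",
])
print(vecs.shape)
```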