They are better at separating clusters, and they preserve the property that distances under the correct metric also carry semantic information. The downside is that training takes longer, and you need at least 32-bit, and ideally 64-bit, floats during both training and inference.
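To make the precision point concrete, here is a minimal sketch. It assumes plain Euclidean distance as a stand-in for the actual metric, which is not specified here, and the points `p` and `q` are made-up values chosen only to show what happens when two nearby points sit far from the origin.

```python
import numpy as np


def distance(a, b, dtype):
    """Euclidean distance computed entirely in the given float dtype.

    Euclidean distance is an assumption for illustration only; the
    metric used by the models discussed above is not specified.
    """
    a = np.asarray(a, dtype=dtype)
    b = np.asarray(b, dtype=dtype)
    diff = a - b
    return float(np.sqrt(np.sum(diff * diff, dtype=dtype)))


# Hypothetical points: close to each other, far from the origin.
p = [1000.0, 0.0]
q = [1000.1, 0.0]

results = {
    name: distance(p, q, dtype)
    for name, dtype in [
        ("float16", np.float16),
        ("float32", np.float32),
        ("float64", np.float64),
    ]
}

for name, d in results.items():
    print(f"{name}: {d!r}")
```

In half precision the two points round to the same value, so the distance between them collapses to zero. Single precision keeps the distance but with a visible rounding error, and double precision recovers it almost exactly.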

And possibly.

The company I did the work for kept it very quiet. BERT-like models are small enough that you can train them on a workstation today, so there is a lot less prestige in them than there was five years ago, which is why for-profit companies don't write papers on them anymore.