
I disagree. 1.5b is strictly superior to 117M or 345M on everything we've trained it on, from a 20MB contemporary-poetry corpus on up, assuming of course you don't screw it up or train too long. The only time we've concluded the smaller models were worth using is when the transfer learning is basically useless (eg our ABC/MIDI models: there's hardly anything English-like in that data, so there's no transfer, and 1.5b will just overfit the way 117M does, so we might as well stick with the small model, which lets us do things like use >25k context windows).



For example, look at OpenAI's latest paper on scaling Transformers, "Scaling Laws for Neural Language Models" (Kaplan et al 2020): https://arxiv.org/abs/2001.08361

Larger models are better across the entire range they test, up to billion-parameter models, in pretty much every way: they need hardly any additional data, achieve lower losses, train faster, parallelize better, and are even more compute-efficient and sample-efficient (!).
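
To make the parameter scaling concrete, here's a minimal sketch (mine, not code from the paper) of the power-law fit for loss vs non-embedding parameter count that Kaplan et al report, L(N) ~ (N_c/N)^alpha_N; the constants below are the approximate values quoted in the paper, and predicted_loss is just an illustrative name:

    # Approximate power-law fit from Kaplan et al 2020:
    # L(N) ~ (N_c / N)^alpha_N, with alpha_N ~ 0.076 and
    # N_c ~ 8.8e13 non-embedding parameters. Constants are approximate
    # and quoted from memory of the paper; this is an illustration,
    # not their code.
    ALPHA_N = 0.076
    N_C = 8.8e13

    def predicted_loss(n_params: float) -> float:
        """Predicted test loss (nats/token) for a model with n_params
        non-embedding parameters, when data and compute are not the
        bottleneck."""
        return (N_C / n_params) ** ALPHA_N

    for n in (117e6, 345e6, 1.5e9):
        print(f"{n/1e6:>6.0f}M params -> predicted loss ~{predicted_loss(n):.2f}")

Running this shows the predicted loss dropping monotonically from 117M to 345M to 1.5b, which is the "larger models achieve lower losses" point in miniature.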



