
I disagree. 1.5b is strictly superior to 117M or 345M on everything we've trained it on, from a 20MB contemporary-poetry corpus on up, assuming of course you don't screw it up or train too long. The only time we've concluded the smaller models were worth using is when the transfer learning is basically useless (eg our ABC/MIDI models: there's hardly anything English-like in that data, so there's no transfer, and 1.5b will just overfit the way 117M does, so we might as well stick with the small model, which lets us do things like use >25k context windows).



For example, look at OpenAI's latest paper on scaling Transformers, "Scaling Laws for Neural Language Models" (Kaplan et al 2020): https://arxiv.org/abs/2001.08361

Larger models are better across the entire range they test, up to billion-parameter models, in pretty much every way: they need hardly any additional data, achieve lower losses, train faster, parallelize better, and are even more compute-efficient and sample-efficient (!).
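
To make the parameter scaling concrete, here's a minimal sketch (mine, not code from the paper) of the power-law fit for loss vs non-embedding parameter count that Kaplan et al report, L(N) ~ (N_c/N)^alpha_N; the constants below are the approximate values quoted in the paper, and predicted_loss is just an illustrative name:

    # Approximate power-law fit from Kaplan et al 2020:
    # L(N) ~ (N_c / N)^alpha_N, with alpha_N ~ 0.076 and
    # N_c ~ 8.8e13 non-embedding parameters. Constants are approximate
    # and quoted from memory of the paper; this is an illustration,
    # not their code.
    ALPHA_N = 0.076
    N_C = 8.8e13

    def predicted_loss(n_params: float) -> float:
        """Predicted test loss (nats/token) for a model with n_params
        non-embedding parameters, when data and compute are not the
        bottleneck."""
        return (N_C / n_params) ** ALPHA_N

    for n in (117e6, 345e6, 1.5e9):
        print(f"{n/1e6:>6.0f}M params -> predicted loss ~{predicted_loss(n):.2f}")

Running this shows the predicted loss dropping monotonically from 117M to 345M to 1.5b, which is the "larger models achieve lower losses" point in miniature.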



