
Do you have sources for "The MFU can be above 40% and certainly well above the 35% in the estimate"?

Looking at [1], the authors claim that their improvements were needed to push BERT training beyond 30% MFU, and that the "default" training only reaches 10%. Certainly those numbers don't translate exactly; with a different stack, model, etc., it might well be easier to surpass them, but 35% doesn't seem like a terribly off estimate to me. Especially so if you are training a whole suite of different models (with different parameters, sizes, etc.), you can't realistically optimize all of them.

It might be that the real figure is around 40% instead of the 35% used here (frankly, it might be 30% or less, for that matter), but I doubt it's so high as to make the estimates in this blog post terribly off, and I doubt even more that you can get that "also for small models with plain pytorch and trivial tuning".

[1] https://www.databricks.com/blog/mosaicbert
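For concreteness, here is a minimal sketch of how MFU is usually computed (achieved model FLOPs per second over the hardware's theoretical peak), assuming the common ~6N FLOPs-per-token approximation for a dense transformer and an A100's ~312 TFLOPS BF16 peak; the throughput numbers are made up purely for illustration:

    # Minimal MFU sketch. Assumes the common ~6 * N FLOPs-per-token
    # approximation (forward + backward) for a dense transformer, and
    # the A100 BF16 peak (~312 TFLOPS) as the default hardware ceiling.

    def mfu(n_params: float, tokens_per_sec: float, n_gpus: int,
            peak_flops_per_gpu: float = 312e12) -> float:
        """Model FLOPs Utilization: achieved model FLOPs/s over peak FLOPs/s."""
        achieved_flops = 6 * n_params * tokens_per_sec   # model FLOPs actually performed
        peak_flops = n_gpus * peak_flops_per_gpu         # theoretical hardware ceiling
        return achieved_flops / peak_flops

    # Made-up example: a 1.3B-parameter model at 110k tokens/s on 8 GPUs
    print(f"MFU: {mfu(1.3e9, 110_000, 8):.1%}")   # -> MFU: 34.4%

The point being that whether the right constant is 30%, 35%, or 40% only shifts the cost estimates linearly, which is exactly the range under discussion.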




Please look at any of the plain PyTorch code by Karpathy that complements llm.c. If you want scalable code, look at Megatron-LM.



