Training an LLM requires roughly 6 * parameters * tokens FLOPs[1]. A single H100 can therefore process (flop/s of H100 * MFU) / (6 * parameters) tokens per second. Assuming 40% MFU and a 1B-parameter model, that is (1000 * 10^12 * 0.4) / (6 * 10^9) ≈ 67,000 tokens/sec.
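The arithmetic above can be sketched as a small helper; the peak FLOP/s, MFU, and parameter count are the same assumed numbers as in the text:

```python
def tokens_per_second(peak_flops: float, mfu: float, n_params: float) -> float:
    """Estimated training throughput from the 6 * params * tokens rule:
    achieved FLOP/s divided by the ~6 FLOPs needed per parameter per token."""
    return (peak_flops * mfu) / (6 * n_params)

# Assumed: ~1000 TFLOP/s peak for an H100, 40% MFU, 1B-parameter model
print(round(tokens_per_second(1e15, 0.40, 1e9)))  # ≈ 66,667 tokens/sec
```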
This repo[2] by Meta achieves 48% MFU, or about 80,000 tokens/sec.
[1]: https://arxiv.org/pdf/2001.08361
[2]: https://github.com/facebookresearch/lingua