Training an LLM requires roughly 6 * parameters * tokens FLOPs[1]. A single H100 can therefore process (flop/s of H100 * MFU) / (6 * parameters) tokens per second. Assuming 40% MFU and a 1B-parameter model, that is (1000 * 10^12 * 0.4) / (6 * 10^9) ≈ 67,000 tokens/sec.
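The arithmetic above can be sketched as a small helper; the peak FLOP/s, MFU, and parameter count are the same assumed numbers as in the text:

```python
def tokens_per_second(peak_flops: float, mfu: float, n_params: float) -> float:
    """Estimated training throughput from the 6 * params * tokens rule:
    achieved FLOP/s divided by the ~6 FLOPs needed per parameter per token."""
    return (peak_flops * mfu) / (6 * n_params)

# Assumed: ~1000 TFLOP/s peak for an H100, 40% MFU, 1B-parameter model
print(round(tokens_per_second(1e15, 0.40, 1e9)))  # ≈ 66,667 tokens/sec
```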
This repo[2] by Meta achieves 48% MFU, or about 80,000 tokens/sec.
[1]: https://arxiv.org/pdf/2001.08361
[2]: https://github.com/facebookresearch/lingua