It's at about 7 minutes into the video, and they really do say it several times in a few ways. He starts by saying it's trained on 570 billion megabytes, which is probably where the confusion starts. Looking again at the paper, Common Crawl after filtering is 570GB, i.e. 570 billion bytes. So he makes two main mistakes: first multiplying by an extra factor of a million, then assuming one byte is equivalent to one word. Add a bit more on top because less than half of the filtered Common Crawl is actually used, and that probably puts the figure out by a factor of about ten million or more.
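A rough back-of-envelope in Python, if it helps. The bytes-per-word average and the "fraction of Common Crawl actually seen" are my own assumptions, not numbers from the video:

```python
# Back-of-envelope check of the "factor of ten million" claim.
# Assumptions (mine): ~5 bytes per English word on average, and roughly
# 180B of ~410B filtered Common Crawl tokens actually seen in training.

claimed_bytes = 570e9 * 1e6      # "570 billion megabytes" as stated in the video
actual_bytes = 570e9             # 570GB of filtered Common Crawl in the paper

bytes_per_word = 5               # rough average, assumption
claimed_words = claimed_bytes    # the video effectively treats each byte as a word
actual_words = actual_bytes / bytes_per_word

fraction_used = 180e9 / 410e9    # "less than half of it is used"

overestimate = claimed_words / (actual_words * fraction_used)
print(f"Overestimate: ~{overestimate:,.0f}x")   # comes out around 11 million
```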
300B tokens is then the "training budget" in a sense: not every dataset is used in its entirety, and some are processed more than once, but each of the GPT-3 sizes was trained on 300B tokens.
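Here's roughly how that budget explains the "less than half" part. The 60% Common Crawl sampling weight and ~410B-token dataset size are my recollection of Table 2.2 in the GPT-3 paper, so treat them as approximate:

```python
# Sketch of why less than half of filtered Common Crawl is seen,
# using my recollection of the GPT-3 paper's data-mix table (approximate).

token_budget = 300e9             # tokens seen during training, all model sizes
cc_weight = 0.60                 # sampling weight of Common Crawl in the mix
cc_tokens = 410e9                # tokens in filtered Common Crawl

cc_tokens_seen = token_budget * cc_weight
epochs = cc_tokens_seen / cc_tokens
print(f"Common Crawl tokens seen: {cc_tokens_seen/1e9:.0f}B (~{epochs:.2f} epochs)")
# -> ~180B tokens, under half an epoch of the filtered Common Crawl
```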
Some words are broken up into several tokens, which might also explain why the figure is quoted as 300B tokens rather than words.
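A quick way to see that, assuming you have the tiktoken package and that its "r50k_base" encoding matches the GPT-2/GPT-3 BPE vocabulary:

```python
# Illustration of words vs. tokens; assumes tiktoken is installed and that
# "r50k_base" corresponds to the GPT-2/GPT-3 BPE (my assumption).
import tiktoken

enc = tiktoken.get_encoding("r50k_base")
for word in ["the", "tokenizer", "antidisestablishmentarianism"]:
    tokens = enc.encode(word)
    print(word, "->", len(tokens), "token(s):", [enc.decode([t]) for t in tokens])
# Common words are usually a single token; rarer words split into several,
# so a token count comes out higher than a word count.
```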