> The pre-training process consists of three stages. In the first stage (S1), the model was pretrained on over 30 trillion tokens with a context length of 4K tokens. This stage provided the model with basic language skills and general knowledge.
Since this is measured in trillions of tokens, where does that much material come from?
Raw Common Crawl is around 100 trillion tokens, admittedly with a lot of duplication. RedPajama has 30T deduplicated. That gets you most of the way there before even counting PDFs and Alibaba's other data sources. (Does Common Crawl include Chinese pages? Edit: yes.)
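A quick back-of-the-envelope sketch in Python, using the rough figures quoted in this thread (not official dataset sizes), to show that public web crawls alone roughly cover a 30T-token budget:

```python
# Rough check: do the datasets mentioned above cover a ~30T-token pre-training budget?
# The token counts below are the approximate figures quoted in this thread,
# treated as order-of-magnitude estimates only.

target_tokens = 30e12  # reported S1 pre-training budget (~30 trillion tokens)

sources = {
    "Common Crawl (raw, with duplicates)": 100e12,
    "RedPajama (deduplicated)": 30e12,
}

for name, tokens in sources.items():
    coverage = tokens / target_tokens
    print(f"{name}: ~{tokens / 1e12:.0f}T tokens, ~{coverage:.0%} of the 30T target")
```

This ignores quality filtering, which typically discards a large fraction of raw crawl tokens, so the real pipeline presumably leans on deduplicated and curated sources plus the extra data mentioned above.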