
> The pre-training process consists of three stages. In the first stage (S1), the model was pretrained on over 30 trillion tokens with a context length of 4K tokens. This stage provided the model with basic language skills and general knowledge.

As this is in trillions, where does this amount of material come from?



The raw CommonCrawl has 100 trillion tokens, admittedly some duplicated. RedPajama has 30T deduplicated. That’s most of the way there, before including PDFs and Alibaba’s other data sources (Does Common Crawl include Chinese pages? Edit: Yes)
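
Quick back-of-envelope on those figures (the numbers are the rough ones from this thread, not official ones), as a Python sketch:

    # Rough tally using the approximate figures above; none are official numbers.
    raw_common_crawl = 100e12  # ~100T tokens in raw CommonCrawl, duplicates included
    redpajama_dedup  = 30e12   # ~30T tokens in RedPajama after deduplication
    qwen_s1_budget   = 30e12   # "over 30 trillion tokens" in pre-training stage S1

    coverage = redpajama_dedup / qwen_s1_budget
    print(f"Deduplicated open web alone covers ~{coverage:.0%} of the S1 budget,")
    print("before PDFs, code, books, or Alibaba's in-house data sources.")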


Synthetic data (after the reasoning breakthroughs, it feels like more AI labs are betting on synthetic data to scale).

wonder at what price
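
If anyone wants to put a number on it, here's a cost sketch; the rate per million tokens below is a pure placeholder for illustration, not any provider's actual price:

    # Cost of generating synthetic pre-training tokens at API-style rates.
    def synthetic_token_cost(tokens: float, usd_per_million: float) -> float:
        return tokens / 1e6 * usd_per_million

    # e.g. 1 trillion synthetic tokens at a hypothetical $1 per million output tokens
    print(f"${synthetic_token_cost(1e12, 1.0):,.0f}")  # -> $1,000,000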


If they’re using vision models to extract PDF data, then they can’t be shy about throwing money at it.
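
For reference, the general pattern looks something like this; the model name, endpoint, and key are placeholders, and this is just a sketch of VLM-based PDF extraction, not Alibaba's actual pipeline:

    # Render each PDF page to an image and ask a vision-language model to transcribe it.
    # Any OpenAI-compatible vision endpoint works; names below are placeholders.
    import base64
    import fitz                # PyMuPDF, for rasterizing PDF pages
    from openai import OpenAI

    client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")

    def extract_pdf_text(path: str, model: str = "some-vision-model") -> str:
        pages = []
        for page in fitz.open(path):
            png = page.get_pixmap(dpi=150).tobytes("png")
            data_url = "data:image/png;base64," + base64.b64encode(png).decode()
            resp = client.chat.completions.create(
                model=model,
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Transcribe this page to plain text."},
                        {"type": "image_url", "image_url": {"url": data_url}},
                    ],
                }],
            )
            pages.append(resp.choices[0].message.content)
        return "\n\n".join(pages)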



