> The pre-training process consists of three stages. In the first stage (S1), the model was pretrained on over 30 trillion tokens with a context length of 4K tokens. This stage provided the model with basic language skills and general knowledge.
Since this is measured in trillions of tokens, where does that much material come from?
Raw Common Crawl is around 100 trillion tokens, admittedly with a lot of duplication. RedPajama has 30T deduplicated. That gets you most of the way there before even counting PDFs and Alibaba's other data sources. (Does Common Crawl include Chinese pages? Edit: yes.)
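A quick back-of-the-envelope sketch in Python, using the rough figures quoted in this thread (not official dataset sizes), to show that public web crawls alone roughly cover a 30T-token budget:

```python
# Rough check: do the datasets mentioned above cover a ~30T-token pre-training budget?
# The token counts below are the approximate figures quoted in this thread,
# treated as order-of-magnitude estimates only.

target_tokens = 30e12  # reported S1 pre-training budget (~30 trillion tokens)

sources = {
    "Common Crawl (raw, with duplicates)": 100e12,
    "RedPajama (deduplicated)": 30e12,
}

for name, tokens in sources.items():
    coverage = tokens / target_tokens
    print(f"{name}: ~{tokens / 1e12:.0f}T tokens, ~{coverage:.0%} of the 30T target")
```

This ignores quality filtering, which typically discards a large fraction of raw crawl tokens, so the real pipeline presumably leans on deduplicated and curated sources plus the extra data mentioned above.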