
A 1GB file would contain roughly 166,000,000 words, assuming the average word is 5 characters plus the space after it, so 6 bytes per word.

A typical single-spaced page is 500 words long, which works out to about 332,000 pages per GB.

That’s 179,280,000 full pages of text.
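For anyone who wants to check the arithmetic, here is a rough sketch in Python. It assumes a decimal gigabyte (10^9 bytes) and one byte per character, neither of which is stated in the thread, and the function names are just made up for illustration. Working backwards, the 179,280,000 figure is consistent with a corpus of about 540GB at this rate; that size is inferred from the numbers, not quoted from anywhere.

```python
# Back-of-the-envelope check of the page estimate.
# Assumptions (not from the thread): 1 GB = 10**9 bytes, 1 byte per character.
BYTES_PER_GB = 1_000_000_000
BYTES_PER_WORD = 6        # 5 characters plus one trailing space
WORDS_PER_PAGE = 500      # typical single-spaced page, per the comment


def words_per_gb() -> int:
    """Unrounded word count for one gigabyte of plain text."""
    return BYTES_PER_GB // BYTES_PER_WORD


def pages_per_gb(words: int) -> int:
    """Pages per gigabyte for a given words-per-GB figure."""
    return words // WORDS_PER_PAGE


def implied_corpus_gb(total_pages: int, pages_each_gb: int) -> float:
    """Corpus size in GB implied by a total page count."""
    return total_pages / pages_each_gb


if __name__ == "__main__":
    exact_words = words_per_gb()          # the comment rounds this down to 166,000,000
    pages = pages_per_gb(166_000_000)     # uses the comment's rounded figure
    print(f"words per GB (unrounded): {exact_words:,}")
    print(f"pages per GB: {pages:,}")
    print(f"implied corpus size: {implied_corpus_gb(179_280_000, pages):.0f} GB")
```

Using the unrounded 166,666,666 words per GB instead would nudge the page count up slightly, so treat all of this as an order-of-magnitude estimate.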

I wonder if they excluded any duplicated text.




But it’s not just words…


I thought LLMs were fed only text in their training datasets?

I’ve only done image classifiers and object detectors, so I was assuming they must be trained on similarly pure datasets.



