It's "BooksCorpus" (with an 's'), a 800M word dataset described in Zhu et al. (2015) IEEE ICCV, and also available on AWS at: https://aws.amazon.com/marketplace/pp/prodview-d3ghxqzkitn6y

The Google BERT paper (Devlin et al., 2018) also references it: https://aclanthology.org/N19-1423/

Privacy questions aside (as important as they of course are), it's essential to know exactly what a model was trained on: if Wikipedia was in the training set, you can't use questions from Wikipedia to evaluate the model (that would be cheating) - test data must be as "unseen" as a good exam. A minimal sketch of such a contamination check follows.
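
For illustration, here's a minimal sketch of a train/test contamination check (hypothetical example data; an n-gram overlap test is one common heuristic, not a complete decontamination pipeline):

    # Sketch of a contamination check: flag test items whose 8-grams
    # also appear in the training corpus. Hypothetical data; real
    # pipelines also normalize text and tune the n-gram size.

    def ngrams(text, n=8):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def contaminated(test_items, train_texts, n=8):
        train_grams = set()
        for t in train_texts:
            train_grams |= ngrams(t, n)
        # A test item is flagged if any of its n-grams was seen in training.
        return [item for item in test_items if ngrams(item, n) & train_grams]

    train = ["the quick brown fox jumps over the lazy dog near the river bank"]
    test = ["quick brown fox jumps over the lazy dog near the river",
            "completely unrelated question about something else entirely new here"]
    print(contaminated(test, train))  # flags only the first test item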




Actually, it's 'BookCorpus'. OpenAI spelt it wrong in their GPT-1 paper.

The dataset has also been analyzed in the following:

https://arxiv.org/abs/2105.05241

https://lifearchitect.ai/whats-in-my-ai/



