It's "BooksCorpus" (with an 's'), a 800M word dataset described in Zhu et al. (2015) IEEE ICCV, and also available on AWS at: https://aws.amazon.com/marketplace/pp/prodview-d3ghxqzkitn6y

The Google BERT paper (Devlin et al., 2018) also references it: https://aclanthology.org/N19-1423/

Privacy questions aside (as important as they of course are), it's essential to know exactly what a model was trained on: if Wikipedia was in the training set, you can't use questions from Wikipedia to evaluate the model (that would be cheating) - test data must be as "unseen" as a good exam. A minimal sketch of such a contamination check follows.
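
For illustration, here's a minimal sketch of a train/test contamination check (hypothetical example data; an n-gram overlap test is one common heuristic, not a complete decontamination pipeline):

    # Sketch of a contamination check: flag test items whose 8-grams
    # also appear in the training corpus. Hypothetical data; real
    # pipelines also normalize text and tune the n-gram size.

    def ngrams(text, n=8):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def contaminated(test_items, train_texts, n=8):
        train_grams = set()
        for t in train_texts:
            train_grams |= ngrams(t, n)
        # A test item is flagged if any of its n-grams was seen in training.
        return [item for item in test_items if ngrams(item, n) & train_grams]

    train = ["the quick brown fox jumps over the lazy dog near the river bank"]
    test = ["quick brown fox jumps over the lazy dog near the river",
            "completely unrelated question about something else entirely new here"]
    print(contaminated(test, train))  # flags only the first test item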




Actually, it's 'BookCorpus'. OpenAI spelt it wrong in their GPT-1 paper.

The dataset has also been analyzed in the following:

https://arxiv.org/abs/2105.05241

https://lifearchitect.ai/whats-in-my-ai/



