I have been working with a group that is trying to clone this dataset and make it publicly available (https://github.com/jcpeterson/openwebtext), and I have noticed quite a bit of code in the scraped dataset. Future releases of our dataset will be pre-filtered with another LSTM language model, which will score each sentence by its probability under a model trained on more conversational / literary datasets so that low-probability sentences (such as code) can be filtered out.
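
To make the filtering idea concrete, here is a rough sketch of probability-based filtering. It is not our actual pipeline: a tiny add-one-smoothed character bigram model stands in for the LSTM, and the reference sentences, the function names, and the threshold of -2.6 are all made up for illustration.

```python
import math
from collections import Counter

def train_char_bigram(corpus):
    """Count character bigrams over a reference corpus (stand-in for the LSTM)."""
    pair_counts, context_counts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        chars = ["<s>"] + list(sentence)  # "<s>" marks the sentence start
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, vocab

def avg_log_prob(sentence, model):
    """Mean per-character log-probability with add-one smoothing."""
    pair_counts, context_counts, vocab = model
    v = len(vocab) + 1  # +1 reserves probability mass for unseen characters
    chars = ["<s>"] + list(sentence)
    if len(chars) < 2:
        return float("-inf")  # empty sentences are always filtered out
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        prob = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
        total += math.log(prob)
    return total / (len(chars) - 1)

def filter_sentences(sentences, model, threshold):
    """Keep sentences whose average log-probability meets the threshold."""
    return [s for s in sentences if avg_log_prob(s, model) >= threshold]

# Hypothetical "conversational" reference data, for illustration only.
reference = [
    "i think that is a good idea",
    "she said that the weather was nice",
    "we went to the store and then home",
    "that was the best thing about it",
]
model = train_char_bigram(reference)

prose = "that is the thing she said"
code = "for(i=0;i<n;i++){x[i]=0;}"
kept = filter_sentences([prose, code], model, threshold=-2.6)  # threshold is an arbitrary choice
print(kept)
```

Averaging over length keeps long sentences from being penalized just for being long. In practice the threshold would need to be tuned on held-out data rather than picked by hand.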