
I have been working with a group that is trying to clone this dataset and make it publicly available (https://github.com/jcpeterson/openwebtext), and I have noticed quite a bit of code in the scraped data. Future releases of our dataset will be pre-filtered with another LSTM language model, which will score sentences by their probability under more conversational / literary datasets and filter out the unlikely ones.
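
To make the filtering idea concrete, here is a rough sketch of what scoring by probability under a reference corpus can look like. This is not the project's code: it swaps the LSTM for a tiny add-one-smoothed unigram model so it runs with no dependencies, and the names (train_unigram, filter_sentences), the toy reference sentences, and the -3.0 threshold are all made up for illustration.

```python
import math
from collections import Counter


def train_unigram(reference_sentences):
    """Return a scorer giving the average per-token log-probability of a
    sentence under an add-one-smoothed unigram model of the reference text.
    (Stand-in for the LSTM language model; hypothetical helper.)"""
    counts = Counter(
        tok for s in reference_sentences for tok in s.lower().split()
    )
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def avg_logprob(sentence):
        toks = sentence.lower().split()
        if not toks:
            return float("-inf")  # empty lines are always filtered out
        logp = sum(math.log((counts[t] + 1) / (total + vocab)) for t in toks)
        return logp / len(toks)  # length-normalize so long lines are not penalized

    return avg_logprob


def filter_sentences(sentences, score, threshold):
    """Keep only sentences whose score is at or above the threshold."""
    return [s for s in sentences if score(s) >= threshold]


# Toy "conversational / literary" reference corpus (illustrative only).
reference = [
    "the cat sat on the mat",
    "she said the dog was on the mat",
    "he said it was a good day",
]
score = train_unigram(reference)

scraped = [
    "the dog sat on the mat",
    "for ( int i = 0 ; i < n ; i++ )",
]
kept = filter_sentences(scraped, score, threshold=-3.0)  # threshold is arbitrary
```

With this toy corpus the code-like line scores lower than the prose line, because none of its tokens appear in the reference text. A real filter would need a much larger reference corpus and a threshold tuned on held-out data.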

