That's not ideal, and it's probably picking up some of GPT-2's original training set (this model is bolted on top of it)! How about DUOLINGOLOGY instead: https://bit.ly/3fPGP8q
I'm using a blacklist to reject "real" words, but it's surprisingly hard to build one that covers rare words. I'm up to ~600K items after parsing Wikipedia tokens, and it still doesn't capture everything.
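For anyone curious what that kind of filter might look like, here's a rough sketch in Python. It is not the actual code behind the site: the function names (`normalize`, `build_blacklist`, `is_novel`), the token regex, and the two-sentence stand-in corpus are all assumptions for illustration. The idea is just to collect every letter-only token from a corpus into a set, then reject any generated word that is already in it.

```python
import re
import unicodedata

# Hypothetical tokenizer: runs of letters, optionally joined by ' or -.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def normalize(word):
    """Fold case and Unicode variants so 'AirPods' and 'airpods' compare equal."""
    return unicodedata.normalize("NFKC", word).casefold().strip()


def build_blacklist(texts):
    """Collect every normalized token seen in the corpus into a set."""
    blacklist = set()
    for text in texts:
        for token in TOKEN_RE.findall(text):
            blacklist.add(normalize(token))
    return blacklist


def is_novel(candidate, blacklist):
    """A generated word is kept only if the corpus never contained it."""
    return normalize(candidate) not in blacklist


# Stand-in corpus; a real run would stream tokens from a Wikipedia dump.
corpus = [
    "Apple released AirPods in 2016.",
    "A paraglider needs steady wind.",
]
blacklist = build_blacklist(corpus)

for word in ["airpods", "duolingology"]:
    print(word, "->", "keep" if is_novel(word, blacklist) else "reject")
```

The weakness is the same one described above: a set like this only rejects what the corpus happened to contain, so rare words, brand names, and new coinages slip through unless the source text mentions them.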
airpods
air·pods
a large pair of wings and wings of a bird or other flying
animal, typically used as a guide for a figure skater and
paraglider
"a pair of airpods"