This is from 2021, so not really news, and both the arXiv version and the version published at NeurIPS have quite a few citations. No one's suppressing this; people who don't reflect on their datasets or how they use them either don't care or fail to acknowledge that these are actual issues.
IANA AI developer, but I have been looking into this in detail recently for other purposes. I was puzzled by the lack of info about "books", and when I searched for detail (in what I believe was a reasonably diligent manner) I found surprisingly little. I assumed there would be more knowledge out there, so I asked for it here. Now I will go look up those papers to get a better sense of things. Thank you for the tip.
I note that neither this paper nor any discussion of "BookCorpus" or even "book corpus" has appeared on HN previously.
Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus, 2021, Jack Bandy and Nicholas Vincent