A closer look at BookCorpus, a key dataset in machine learning (towardsdatascience.com)
122 points by Kaibeezy on Sept 20, 2023 | 32 comments



This is really a fantastic analysis of a dataset, and it's something that should be a mandatory form of screening before proceeding with actual model training, in every organization or research group. Whether you are using public datasets, buying third-party data, doing in-house data collection and annotation, or paying someone to do it for you, you must check for class imbalance and over/under-representation within your data - inevitably human biases will creep in. Ultimately you have to evaluate whether this data distribution is compatible with the target distribution your model will be applied to in production. Doing this post hoc is a real pain.
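To make that concrete, here's a minimal sketch of the kind of audit I have in mind, assuming each record carries a genre/label tag and a text field (the schema and field names are just illustrative):

  from collections import Counter

  def audit(docs):
      # docs: list of {"genre": str, "text": str} records (illustrative schema)
      per_doc = Counter(d["genre"] for d in docs)
      per_token = Counter()
      for d in docs:
          per_token[d["genre"]] += len(d["text"].split())  # crude whitespace token count
      total_docs, total_tokens = sum(per_doc.values()), sum(per_token.values())
      for genre in per_doc:
          print(f"{genre}: {per_doc[genre]/total_docs:.1%} of documents, "
                f"{per_token[genre]/total_tokens:.1%} of tokens")

Even something this crude surfaces the kind of genre skew the article describes, before a single training run.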


I agree that this is extremely valuable, but it's worth flagging that it's harder to reason about the impacts of class imbalance for generative models than e.g. classifiers. For example, should we think about genre imbalance per novel, per token, or on some more complex basis? Which genres are most relevant to a target distribution of chatbot queries?

This isn't to suggest that organizations shouldn't invest in actively understanding their training data, but that post-hoc bias analysis is going to be a critical component of evaluation for the foreseeable future.


The Wikipedia article on BookCorpus raises an important point about how researchers used the word "unpublished" to describe the books in this corpus -- this word appears in both the original Aligning Books paper and OpenAI's papers, which don't even bother to acknowledge Smashwords. The books aren't "unpublished" -- Smashwords is a self-publishing platform! Whether deliberately or not, the word choice diminishes the human effort that was appropriated by the researchers to train the models.


OP here. I had a lot of difficulty finding info on Books1 and Books2, even on HN. If there's a better source of info, please link or post.

What's the value of these scant few thousand unpublished romance and fantasy novels in the context of the rest of the corpus -- vast scrapings, all of Wikipedia, etc.? A sample of how people write? Why aren't more public domain works included?


> What's the value of these

The Pile (the 800GB dataset by EleutherAI) contains BookCorpus2, along with two much larger datasets of books (and a whole lot of not-book stuff). From their paper [0], the reasoning for the book datasets is that they "are invaluable for long-range context modeling research and coherent storytelling". The reasoning for including BookCorpus2 next to Books3 and Project Gutenberg boils down to "no significant overlap with the other datasets, and others use it".

In general, books are a great source of extremely high-quality long-form content. They are longer than most content found on the web, and are generally of high quality, having gone through many rounds of revision between editor and author. The trouble is that neither of these is really true of BookCorpus. Even a dump of highly rated stories from fanfiction.org might be better.

0: https://arxiv.org/pdf/2101.00027.pdf


Are these primarily fiction books, or a mix of fiction and non-fiction?


The books in the Books3 collection aren't categorized. The source, however, currently runs at about a 2:1 ratio of nonfiction to fiction, and from what I've seen, whoever created the Books3 archive simply attempted to gather all the EPUBs they could, with availability being their only criterion.


[flagged]


The bots are not capable of reflection; they do not know, and cannot check, what is in their training data.


Why is that list ^ not interesting? Illustrative?


It's not useful. You can ask ChatGPT again in a new session and you'll get a different list. You can then ask it about them and find out it's making it up. For example, "Wuthering Heights" is on your list; when I ask, "Pride and Prejudice" is on it. In another session, I can ask it for opening lines, character lists, etc. from those works and then ask for a crossover story where the characters from each meet each other. The model isn't likely to regurgitate the original texts in their entirety, but it clearly knows them.


Additionally, I believe a general norm has been developing not to post raw ChatGPT output without commentary/editing, because it's low value. If anyone wanted similar output, they could just ask ChatGPT themselves. It also often gives false info, as this one demonstrates. If you're posting it as a source of information, you need to fact-check it and show that you have, IMO.


[flagged]


With the way copyright now lasts roughly forever and covers roughly everything, it can't expect to be respected. Copyright owners have used their power to tilt the bargain entirely in their favour. They have only themselves to blame if people increasingly don't want to obey their rules.


Yeah, you're right, it's totally justified to steal the work of still-living artists without their permission to build massive for-profit systems to automate their jobs away, because some corporations have abused copyright.


Not sure why they are complaining. Even if having an AI read their work is "stealing", they'll still be making money from their copyright for the next 100 years at least.

The "artists" seem to be as happy to abuse copyright as the corporations.


> A lot of people, especially on hacker news, believe something being on the internet is a license to do whatever the fuck they want with that data.

A license to do whatever the fuck they want? No. To do things other than make copies? Sure. Copyright is the right to copy, no more than that. Learning from something is not copying it. If you want to complain about memorisation, that’s fair. But learning from something is not something copyright was intended to restrict, so no, plenty of people absolutely will not care about copyright when it comes to this kind of thing, and rightly so.


I think you're reading too much into AI specific stuff, I would happily violate copyright for no reason at all.


ChatGPT told me it doesn't have the text of the US Declaration of Independence due to copyright. It does not have the English Magna Carta within its accessible text. That seems unexpected. It does have the US Constitution.


Large language models don’t know what they know. Asking them what they know is often going to give you an incorrect answer.

  $ llm -m 4 "Quote the declaration of independence:"
IN CONGRESS, JULY 4, 1776

The unanimous Declaration of the thirteen united States of America

When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.

We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed...

(Note: The full text of the Declaration of Independence is quite lengthy, so I have included the most well-known portion here. The full text, including a list of grievances against King George III and the signatures of the signers, can be found elsewhere.)


It made that answer up.



Perhaps there's a part of the prompt that tells GPT not to give users details about written works, to avoid embarrassing leaks of copyrighted text, but the prompt is slightly too strict and ends up preventing GPT from talking about 'open' texts as well? Pure speculation.


I don't think that's the case, although there was an interesting bug a while back where it would freeze after each word when asked to quote the opening to, e.g., Moby Dick.


I wonder how much better an LLM could be given even better training data.

For example, the total number of tokens contained in the physical and digital collections of a moderately-sized university library is probably on par with the size of the training data for GPT-3.5.
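
A crude sanity check of that claim (the library figures are pure guesses; GPT-3.5's token count isn't public, so I'm using GPT-3's reported ~300B training tokens as a stand-in):

  # Assumed figures; only the GPT-3 number comes from its paper.
  books_in_collection = 2_000_000        # guess: holdings of a mid-sized university library
  tokens_per_book     = 100_000          # guess: ~75k words/book at ~1.3 tokens per word
  library_tokens      = books_in_collection * tokens_per_book   # 200B tokens
  gpt3_train_tokens   = 300_000_000_000  # ~300B tokens reported for GPT-3
  print(library_tokens / gpt3_train_tokens)  # ~0.67, i.e. same order of magnitude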

What would happen if you could train just on that? I know we're using huge training sets, but how much of it is just junk from the internet?

(There should be some representative junk in the dataset, but nowhere near the majority.)


Isn't this what "tiny stories" / Phi LLM are doing? https://arxiv.org/abs/2306.11644


https://arxiv.org/abs/2306.11644 is along these lines.


I see many courses and papers base their work on BookCorpus, so this should be somewhat significant. What are the 'places' where these concerns can be highlighted to the wider machine learning community? Not everyone will be here. Is there a forum, a 'reddit', or a conference that machine learning people regularly visit?

The cynical me is thinking that because this is inconvenient news and would require rework, a lot of people would prefer to suppress or ignore the author's findings (assuming true).


This is from 2021, so not really news, and the paper version on arXiv, published at NeurIPS, has quite a few citations. No one's suppressing this; people who don't reflect on their datasets or how they use them either don't care or fail to acknowledge that these are actual issues.


IANA AI developer, but I have been looking into this in detail recently for other purposes. I was puzzled at the lack of info about "books" and, when searching for details (in what I believe was a reasonably diligent manner), found surprisingly little. I assumed there would be more knowledge and did ask for it here. So now I will go look up those papers to get a better sense of things. Thank you for the tip.

I note neither this paper nor any discussion of "BookCorpus" or even "book corpus" has appeared on HN previously.

Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus, 2021, Jack Bandy and Nicholas Vincent

https://arxiv.org/pdf/2105.05241.pdf


The correct title is “Dirty Secrets of BookCorpus, a Key Dataset in Machine Learning”.


HN sometimes changes titles to make them less lurid, clickbait-y, etc. Whether that's a bot or not ¯\_(ツ)_/¯

In this case, the modified title utilizes the more descriptive language of the article's subtitle. Editors edit.


I didn’t realize; I thought it was an ironclad rule to use the original title. Thank you.


I regret reading about half of that article, and suggest you save the precious moments of your life that reading it would take and do something that is valuable or interesting instead.



