I've been trying to figure out what prompts they used. The https://github.com/mungg/FABLES GitHub repo says this:

"Summary -- (str) Entire book summarized by one of five models: Mixtral, GPT-3.5-Turbo, GPT-4, GPT-4-Turbo, and Claude-3-Opus, using the hierarchical merging method described in Chang et al."

With a link to https://arxiv.org/pdf/2310.00785.pdf, which then links to another GitHub repository, https://github.com/lilakk/BooookScore, which has a bunch of prompts in https://github.com/lilakk/BooookScore/tree/main/prompts.
Which makes me think that this original paper isn't evaluating LLMs so much as it's evaluating that one particular prompting technique for long summaries.
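For anyone who hasn't seen it, the trick is roughly: split the book into chunks that fit the model's context window, summarize each chunk, then repeatedly merge adjacent partial summaries until a single summary remains. A minimal sketch of that shape (not the actual FABLES/BooookScore code or prompts; summarize() below is a placeholder for whatever LLM call you'd use):

```python
# Rough sketch of hierarchical merging for long-document summarization,
# in the spirit of Chang et al. / BooookScore. Not the actual FABLES code
# or prompts; summarize() is a stand-in for a real LLM API call.

def summarize(text: str, instruction: str) -> str:
    # Swap in a real chat-completion call here (OpenAI, Anthropic, Gemini, ...).
    return text[:500]  # placeholder so the sketch runs end to end

def chunk(book: str, max_chars: int = 8000) -> list[str]:
    # Naive fixed-size chunking; real implementations split on chapter or
    # paragraph boundaries so each chunk stays coherent.
    return [book[i:i + max_chars] for i in range(0, len(book), max_chars)]

def hierarchical_summary(book: str) -> str:
    # Level 0: summarize each chunk of the book independently.
    summaries = [summarize(c, "Summarize this passage of the book.") for c in chunk(book)]
    # Higher levels: merge adjacent partial summaries until only one remains.
    while len(summaries) > 1:
        summaries = [
            summarize(
                "\n\n".join(summaries[i:i + 2]),
                "Merge these partial summaries into one coherent summary.",
            )
            for i in range(0, len(summaries), 2)
        ]
    return summaries[0]
```

The real prompts, including the separate chunk-level and merge-level instructions, are in the BooookScore prompts directory linked above.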
Gemini 1.5 Pro has a 1 million token context length, which should remove the need for weird hierarchical summary tricks. I wonder how well it would score?
The issue with non-fiction is that some information may come from the model's parametric memory rather than from the source text supplied to it. So the question becomes whether the model is actually processing the text and summarizing the book, or whether it's cheating. Fiction published in the last few months is your best shot, though sure, depending on the nature of the non-fiction the evaluation can look different (but some things remain the same, like wanting the summary to be faithful/factual and not omit important information).
No prob! There wasn't much explanation of this in the paper (if any). As for the prompts used for merging, there is now a link in the GitHub repo for that.
I didn't read the paper (just skimmed it), but no mention of Gemini 1.5 Pro? It is supposed to have the longest context window (1M available, 10M claimed in lab tests).
The Gemini 1.5 Pro API was literally released yesterday. It took 11+ hours for a person to evaluate a single book (and that's if they sit down and do it all at once), and it takes weeks to run something like this, so yes... no Gemini... yet...
I would imagine that summaries of non-fiction books are evaluated quite differently from summaries of fiction.