I've been trying to figure out what prompts they used. The https://github.com/mungg/FABLES GitHub repo says this:

"Summary -- (str) Entire book summarized by one of five models: Mixtral, GPT-3.5-Turbo, GPT-4, GPT-4-Turbo, and Claude-3-Opus, using the hierarchical merging method described in Chang et al."

With a link to https://arxiv.org/pdf/2310.00785.pdf, which then links to another GitHub repository, https://github.com/lilakk/BooookScore, which has a bunch of prompts in https://github.com/lilakk/BooookScore/tree/main/prompts.
Which makes me think that this original paper isn't evaluating LLMs so much as it's evaluating that one particular prompting technique for long summaries.
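For anyone who hasn't seen it, the trick is roughly: split the book into chunks that fit the model's context window, summarize each chunk, then repeatedly merge adjacent partial summaries until a single summary remains. A minimal sketch of that shape (not the actual FABLES/BooookScore code or prompts; summarize() below is a placeholder for whatever LLM call you'd use):

```python
# Rough sketch of hierarchical merging for long-document summarization,
# in the spirit of Chang et al. / BooookScore. Not the actual FABLES code
# or prompts; summarize() is a stand-in for a real LLM API call.

def summarize(text: str, instruction: str) -> str:
    # Swap in a real chat-completion call here (OpenAI, Anthropic, Gemini, ...).
    return text[:500]  # placeholder so the sketch runs end to end

def chunk(book: str, max_chars: int = 8000) -> list[str]:
    # Naive fixed-size chunking; real implementations split on chapter or
    # paragraph boundaries so each chunk stays coherent.
    return [book[i:i + max_chars] for i in range(0, len(book), max_chars)]

def hierarchical_summary(book: str) -> str:
    # Level 0: summarize each chunk of the book independently.
    summaries = [summarize(c, "Summarize this passage of the book.") for c in chunk(book)]
    # Higher levels: merge adjacent partial summaries until only one remains.
    while len(summaries) > 1:
        summaries = [
            summarize(
                "\n\n".join(summaries[i:i + 2]),
                "Merge these partial summaries into one coherent summary.",
            )
            for i in range(0, len(summaries), 2)
        ]
    return summaries[0]
```

The real prompts, including the separate chunk-level and merge-level instructions, are in the BooookScore prompts directory linked above.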
Gemini 1.5 Pro has a 1 million token context length, which should remove the need for weird hierarchical summary tricks. I wonder how well it would score?
The issue with non-fiction is that some information may come from the model's parametric memory rather than from the source text supplied to it. So the question becomes whether the model is actually processing the text and summarizing the book, or whether it's cheating. Fiction published in the last few months is your best shot, though sure, depending on the nature of the non-fiction the evaluation can look different (but some things remain the same, like wanting the summary to be faithful/factual and not omit important information).
No prob! There wasn't much explanation of this in the paper (if any). As for the prompts used for merging, there is now a link in the GitHub repo for that.
I didn't read the paper (just skimmed it), but no mention of Gemini 1.5 Pro? It is supposed to have the longest context window (1M available, 10M claimed in lab tests).
The Gemini 1.5 Pro API was literally released yesterday. It took 11+ hours for a person to evaluate a single book (and that's if they sit down and do it all at once), and it takes weeks to run something like this, so yes... no Gemini... yet...
I would imagine that summaries of non-fiction books are evaluated quite differently from summaries of fiction.