Hacker News

The Long Context benchmark numbers seem super impressive. 91% vs 49% for GPT 4.5 at 128k context length.


Google has the upper hand here because they are not dependent on Nvidia for hardware. They make and use their own AI accelerators.


Keen to hear more about this benchmark. Is it representative of chat-to-document style use cases with big docs?


Looks like it's this benchmark [1]. It's certainly less artificial than most long context benchmarks (that are basically just a big lookup table) but probably not as representative as Fiction.LiveBench [2], which asks specific questions about works of fanfiction (which are typically excluded from training sets because they are basically porn).

[1] https://arxiv.org/pdf/2409.12640

[2] https://fiction.live/stories/Fiction-liveBench-Feb-20-2025/o...
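To make the "big lookup table" criticism concrete, here's a minimal sketch of what those artificial long-context tests amount to: thousands of random key-value lines plus one queried key. The function names and parameters are made up for illustration, not from any actual benchmark's code.

```python
import random
import string

def make_lookup_benchmark(n_pairs=1000, seed=0):
    # Hypothetical helper: builds a synthetic "lookup table" long-context
    # test -- many random key-value lines, one of which is the needle.
    rng = random.Random(seed)
    def rand_key():
        return "".join(rng.choices(string.ascii_lowercase, k=8))
    pairs = {rand_key(): rng.randrange(10**6) for _ in range(n_pairs)}
    needle = rng.choice(list(pairs))
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    prompt = f"{context}\n\nWhat value is associated with key '{needle}'?"
    return prompt, str(pairs[needle])

def score(answer, expected):
    # Exact-match retrieval accuracy, the usual metric for these tests.
    return float(expected in answer)

prompt, expected = make_lookup_benchmark()
```

A model that effectively grep-matches the key scores 100% here, which is why near-perfect results on this style of test can overstate real long-context comprehension compared to something like Fiction.LiveBench's plot questions.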


Update: Gemini 2.5 also crushes fiction.livebench


"MRCR (multi-round coreference resolution)" for those looking for the link to Michaelangelo



