Hacker News

The Long Context benchmark numbers seem super impressive. 91% vs 49% for GPT 4.5 at 128k context length.


Google has the upper hand here because they are not dependent on Nvidia for hardware. They make and use their own AI accelerators.


Keen to hear more about this benchmark. Is it representative of chat-to-document style use cases with big docs?


Looks like it's this benchmark [1]. It's certainly less artificial than most long context benchmarks (that are basically just a big lookup table) but probably not as representative as Fiction.LiveBench [2], which asks specific questions about works of fanfiction (which are typically excluded from training sets because they are basically porn).

[1] https://arxiv.org/pdf/2409.12640

[2] https://fiction.live/stories/Fiction-liveBench-Feb-20-2025/o...
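To make the "big lookup table" criticism concrete, here's a minimal sketch of what those artificial long-context tests amount to: thousands of random key-value lines plus one queried key. The function names and parameters are made up for illustration, not from any actual benchmark's code.

```python
import random
import string

def make_lookup_benchmark(n_pairs=1000, seed=0):
    # Hypothetical helper: builds a synthetic "lookup table" long-context
    # test -- many random key-value lines, one of which is the needle.
    rng = random.Random(seed)
    def rand_key():
        return "".join(rng.choices(string.ascii_lowercase, k=8))
    pairs = {rand_key(): rng.randrange(10**6) for _ in range(n_pairs)}
    needle = rng.choice(list(pairs))
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    prompt = f"{context}\n\nWhat value is associated with key '{needle}'?"
    return prompt, str(pairs[needle])

def score(answer, expected):
    # Exact-match retrieval accuracy, the usual metric for these tests.
    return float(expected in answer)

prompt, expected = make_lookup_benchmark()
```

A model that effectively grep-matches the key scores 100% here, which is why near-perfect results on this style of test can overstate real long-context comprehension compared to something like Fiction.LiveBench's plot questions.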


Update: Gemini 2.5 also crushes fiction.livebench


"MRCR (multi-round coreference resolution)" for those looking for the link to Michaelangelo



