
Are we plateauing with those LLM benchmarks?



Maybe, but I don't see how this release would support that conclusion: their 72B model surpasses Llama 3 70B on so many metrics, and by such a wide margin, that I find it a little hard to believe.


That benchmark is by them; the community has to evaluate and report back. I never believe self-reported benchmarks.


Yeah, the arena leaderboard will show where it really stands in a week or two.


Arena lets you use 1k out of those 128k available tokens. It's not a good test.


It's a human-preference benchmark. Useful but not sufficient.


Well, they only have so much compute, right? It still beats the usual multiple-choice, one-token-output benchmarks.


I'm impressed by how many of the new benchmarks the Qwen team ran. As the old benchmarks get saturated/overfit, new ones are of course required. Some of the latest ones they use include:

* MMLU-Pro https://github.com/TIGER-AI-Lab/MMLU-Pro - a new, more challenging (and otherwise improved) version of MMLU that does a better job of separating out the current top models

* MixEval(-Hard) https://github.com/Psycoy/MixEval - a very quick/cheap eval that has high correlation w/ Chatbot Arena ELOs and can be run w/ dynamically swappable (statistically correlated) question sets

* Arena Hard https://github.com/lm-sys/arena-hard-auto - another automatic eval tool that uses LLM-as-a-Judge, w/ high correlation w/ Chatbot Arena / human rankings (see the sketch after this list)

* LiveCodeBench https://livecodebench.github.io/ - a coding test with different categories based off of LeetCode problems that also lets you filter/compare scores by problem release month, to see the impact of overfitting/contamination
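For anyone curious what "LLM-as-a-Judge" means in practice, here is a minimal, hypothetical sketch of the pairwise-judging idea these tools build on. The prompt wording, the judge_fn callable, and the verdict parsing are my own assumptions for illustration, not Arena Hard's actual code:

    # Hypothetical sketch of the LLM-as-a-judge pattern (not any project's real implementation).
    from typing import Callable

    def judge_pair(question: str, answer_a: str, answer_b: str,
                   judge_fn: Callable[[str], str]) -> str:
        """Ask a judge model which answer is better; returns 'A', 'B', or 'TIE'."""
        prompt = (
            "You are grading two answers to the same question.\n"
            f"Question: {question}\n\n"
            f"Answer A: {answer_a}\n\n"
            f"Answer B: {answer_b}\n\n"
            "Reply with exactly one of: A, B, tie."
        )
        verdict = judge_fn(prompt).strip().upper()
        return verdict if verdict in {"A", "B", "TIE"} else "TIE"

    def win_rate(verdicts: list[str]) -> float:
        """Fraction of battles model A wins, counting ties as half a win."""
        if not verdicts:
            return 0.0
        return sum(1.0 if v == "A" else 0.5 if v == "TIE" else 0.0
                   for v in verdicts) / len(verdicts)

In real harnesses the judge is a strong model, answer positions get swapped to control for order bias, and the resulting win rates are what get correlated against human rankings.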


No, but getting better benchmark scores tends to require more shenanigans (e.g. mixture-of-experts).

Qwen2 72B doesn't score that high on the leaderboard relative to brute-forced finetunes: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...


Going from 79.5% up to 84.2%, for example, is a ~23% reduction in error rate; it's quite a big difference.
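To spell out the arithmetic (a quick sketch; the only inputs are the two accuracy figures above):

    # Relative error-rate reduction implied by going from 79.5% to 84.2% accuracy.
    old_acc, new_acc = 0.795, 0.842
    old_err, new_err = 1 - old_acc, 1 - new_acc    # 20.5% and 15.8% error
    reduction = (old_err - new_err) / old_err      # ~0.23, i.e. ~23% fewer errors
    print(f"{old_err:.1%} -> {new_err:.1%} error, a {reduction:.0%} relative reduction")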


We are plateauing with respect to compute. The unreleased Llama 3 400B has significantly better benchmark scores. Also, Zuckerberg said that Llama 3 continued to improve even after 15T tokens.


Better benchmarks with higher ceilings are also needed to tell how much better the top models are compared to the rest.


They actually used some of the newest benchmarks, including MixEval, which seems to be in line with the LMSYS crowdsourced ELO scores and is super efficient to run.
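For reference, a rough sketch of how pairwise "battle" outcomes turn into those ELO scores; the ratings and K-factor here are illustrative, and Chatbot Arena has reportedly since moved to a Bradley-Terry style fit, but the intuition is the same:

    # Standard Elo update from a single pairwise battle (illustrative numbers).
    def expected_score(r_a: float, r_b: float) -> float:
        # Probability that A beats B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
        # score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
        e_a = expected_score(r_a, r_b)
        return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

    # A 1200-rated model beating a 1300-rated one gains ~20 points.
    print(elo_update(1200.0, 1300.0, 1.0))   # ~ (1220.5, 1279.5)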


I doubt it; they could be a whole lot smarter. We need to solve alignment in the meantime.



