
Are we plateauing with those LLM benchmarks?



Maybe, but I don't see how this release would support that conclusion: their 72B model surpasses Llama 3 70B on so many metrics, and by such a wide margin, that I find it a little hard to believe.


That benchmark is by them; the community has to evaluate and report back. I never believe self-reported benchmarks.


Yeah, the arena leaderboard will show where it really stands in a week or two.


Arena lets you use 1k out of those 128k available tokens. It's not a good test.


It's a human-preference benchmark. Useful but not sufficient.


Well, they only have so much compute, right? It still beats the usual multiple-choice, one-token-output benchmarks.


I'm impressed by how many of the new benchmarks the Qwen team ran. As the old benchmarks get saturated/overfit, new ones are of course required. Some of the latest ones they use include:

* MMLU-Pro https://github.com/TIGER-AI-Lab/MMLU-Pro - a new, more challenging (and otherwise improved) version of MMLU that does a better job of separating out the current top models

* MixEval(-Hard) https://github.com/Psycoy/MixEval - a very quick/cheap eval that has high correlation w/ Chatbot Arena ELOs and can be run w/ dynamically swappable (statistically correlated) question sets

* Arena Hard https://github.com/lm-sys/arena-hard-auto - another automatic eval tool that uses LLM-as-a-Judge, w/ high correlation w/ Chatbot Arena / human rankings (see the sketch after this list)

* LiveCodeBench https://livecodebench.github.io/ - a coding test with different categories based off of LeetCode problems that also lets you filter/compare scores by problem release month, to see the impact of overfitting/contamination
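For anyone curious what "LLM-as-a-Judge" means in practice, here is a minimal, hypothetical sketch of the pairwise-judging idea these tools build on. The prompt wording, the judge_fn callable, and the verdict parsing are my own assumptions for illustration, not Arena Hard's actual code:

    # Hypothetical sketch of the LLM-as-a-judge pattern (not any project's real implementation).
    from typing import Callable

    def judge_pair(question: str, answer_a: str, answer_b: str,
                   judge_fn: Callable[[str], str]) -> str:
        """Ask a judge model which answer is better; returns 'A', 'B', or 'TIE'."""
        prompt = (
            "You are grading two answers to the same question.\n"
            f"Question: {question}\n\n"
            f"Answer A: {answer_a}\n\n"
            f"Answer B: {answer_b}\n\n"
            "Reply with exactly one of: A, B, tie."
        )
        verdict = judge_fn(prompt).strip().upper()
        return verdict if verdict in {"A", "B", "TIE"} else "TIE"

    def win_rate(verdicts: list[str]) -> float:
        """Fraction of battles model A wins, counting ties as half a win."""
        if not verdicts:
            return 0.0
        return sum(1.0 if v == "A" else 0.5 if v == "TIE" else 0.0
                   for v in verdicts) / len(verdicts)

In real harnesses the judge is a strong model, answer positions get swapped to control for order bias, and the resulting win rates are what get correlated against human rankings.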


No, but getting better benchmark scores tends to require more shenanigans (e.g. mixture-of-experts).

Qwen2 72B doesn't score that high on the leaderboard relative to brute-forced finetunes: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...


Going from 79.5% up to 84.2%, for example, is a ~23% reduction in error rate; it's quite a big difference.
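To spell out the arithmetic (a quick sketch; the only inputs are the two accuracy figures above):

    # Relative error-rate reduction implied by going from 79.5% to 84.2% accuracy.
    old_acc, new_acc = 0.795, 0.842
    old_err, new_err = 1 - old_acc, 1 - new_acc    # 20.5% and 15.8% error
    reduction = (old_err - new_err) / old_err      # ~0.23, i.e. ~23% fewer errors
    print(f"{old_err:.1%} -> {new_err:.1%} error, a {reduction:.0%} relative reduction")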


We are plateauing with respect to compute. The unreleased Llama 3 400B has significantly better benchmark scores. Also, Zuckerberg said that Llama 3 continued to improve even after 15T tokens.


Better benchmarks with higher ceilings are also needed to tell how much better the top models are compared to the rest.


They actually used some of the newest benchmarks, including MixEval, which seems to be in line with the LMSYS crowdsourced ELO scores and is super efficient to run.
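For reference, a rough sketch of how pairwise "battle" outcomes turn into those ELO scores; the ratings and K-factor here are illustrative, and Chatbot Arena has reportedly since moved to a Bradley-Terry style fit, but the intuition is the same:

    # Standard Elo update from a single pairwise battle (illustrative numbers).
    def expected_score(r_a: float, r_b: float) -> float:
        # Probability that A beats B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
        # score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
        e_a = expected_score(r_a, r_b)
        return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

    # A 1200-rated model beating a 1300-rated one gains ~20 points.
    print(elo_update(1200.0, 1300.0, 1.0))   # ~ (1220.5, 1279.5)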


I doubt it; they could be a whole lot smarter. We need to solve alignment in the meantime.



