Maybe, but I don't see how this release supports that conclusion: their 72B model surpasses Llama 3 70B on so many metrics, and by such a wide margin, that I find it a little hard to believe.
I'm impressed by how many new benchmarks the Qwen team ran. As the old benchmarks get saturated/overfit, new ones are of course required. Some of the latest ones they use include:
* MMLU-Pro https://github.com/TIGER-AI-Lab/MMLU-Pro - a new, more challenging (and otherwise improved) version of MMLU that does a better job of separating out the current top models
* MixEval(-Hard) https://github.com/Psycoy/MixEval - a very quick/cheap eval that correlates highly with Chatbot Arena Elo ratings and uses dynamically swappable (statistically correlated) question sets
* LiveCodeBench https://livecodebench.github.io/ - a coding test based on LeetCode problems, split into different categories, that also lets you filter/compare scores by problem release month to see the impact of overfitting/contamination (a rough sketch of that comparison follows this list)
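To make the contamination check concrete, here's a rough sketch of what filtering by release month buys you: compare pass rates on problems released before vs. after a model's training cutoff. The cutoff date, results, and field layout below are all made up for illustration; this is not LiveCodeBench's actual schema or tooling.

```python
# Hypothetical contamination check: compare pass rates on problems released
# before vs. after an assumed training cutoff. All data here is illustrative.
from datetime import date
from statistics import mean

TRAINING_CUTOFF = date(2024, 3, 1)  # assumed cutoff for the model being checked

# Each entry: (problem release month, did the model solve it?)
results = [
    (date(2023, 11, 1), True), (date(2023, 12, 1), True),
    (date(2024, 4, 1), False), (date(2024, 5, 1), True),
]

before = [solved for released, solved in results if released < TRAINING_CUTOFF]
after = [solved for released, solved in results if released >= TRAINING_CUTOFF]

print(f"pass rate on pre-cutoff problems:  {mean(before):.2f}")
print(f"pass rate on post-cutoff problems: {mean(after):.2f}")
# A big drop right after the cutoff suggests the older problems leaked into training data.
```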
We are plateauing with respect to compute. The unreleased Llama 3 400B has significantly better benchmarks. Also, Zuckerberg said that Llama 3 continued to improve even after 15T tokens.
They actually used some of the newest benchmarks, including MixEval, which seems to line up with the LMSYS crowdsourced Elo scores and is super efficient to run.