Wild! So if this indeed holds up, it looks like OpenAI were about a year ahead of the open-source world when GPT-4 was released. However, given that the gap between matching GPT-3.5 (Mixtral, perhaps?) and matching GPT-4 has been just a few weeks, I wonder if the open-source models have more momentum.
That said, I am very curious what OpenAI has in their labs... Are they actually barely ahead? Or do they have something much better that is not yet public? Perhaps they were waiting for Llama 3 to show it? Exciting times ahead either way!
You've also got to consider that we don't really know where OpenAI are, though. What they have released in the past year have been tweaks to GPT-4, while I am sure the real work is going into GPT-5 or whatever it gets called.
While all the others are catching up, and in some cases edging slightly ahead, I wouldn't be surprised to see a rather large leap back into the lead from OpenAI fairly soon, followed by a scramble as others take some time to get close again. We will really see who has the momentum when OpenAI's next full release lands.
I find it somewhat interesting that there is a common perception of GPT-4 at release being genuinely smart, but that it was gradually nerfed for speed with turbo, which is better tuned but doesn't exhibit the original's intelligence.
There were times when I felt that too, but nowadays I predominantly use turbo. That's probably because turbo is faster and cheaper, but on lmsys turbo sits about 100 Elo points above the original, so by and large people simply find turbo to be... better?
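For a sense of what that gap means: under the standard Elo formula (a sketch of the usual model, not necessarily lmsys's exact methodology), a 100-point advantage corresponds to winning roughly 64% of head-to-head comparisons.

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 100-point gap: A is preferred in roughly 64% of pairwise votes.
p = elo_win_prob(1300, 1200)  # ~0.64
```

So "100 Elo higher" is a consistent but not overwhelming preference, which fits the "better on average, still debatable per-task" experience.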
Nevertheless, I do wonder whether, not just in benchmarks but in how people actually use LLMs, intelligence is somewhat underutilised, or possibly offset by other qualities.
Given the incremental improvement between GPT-4 and its turbo variant, I would weight "vibes" more heavily than this gain on MMLU. OpenAI isn't exactly a very honest or transparent company, and the metric is imperfect. As a longtime user of ChatGPT, I observed that it got markedly worse at coding after the turbo release, specifically in its refusal to complete code as specified.
Have you tried Claude 3 Opus? I've been using that predominantly since release and find its "smarts" as good as or better than my experience with GPT-4 (pre-turbo).
I did. It definitely exudes more all-around personality. Unfortunately, in my private test suite (mostly about coding), it did somewhat worse than turbo or phind 70b.
Since price influences my calculus, I can't say this for sure, but being slightly smarter doesn't seem to be much of an edge, because it's still dumb by human standards. For most non-coding uses (like summarisation) the extra smarts don't make much difference; I find that cheaper options like mistral-large do just as well as Opus.
In the last month I have used Command R+ more and more. Finally had some excuse to write some function calling stuff. I have also been highly impressed by Gemini Pro 1.5 finding technical answers from a dense 650 page pdf manual. I have enjoyed chatting with the WizardLM2 fine-tune for the past few days.
Somehow I haven't quite found a consistent use case for Opus.
I think it might just be a subjective feeling (GPT-4-turbo seeming dumber): the joy is always strongest when you first taste it, and it decays as you get used to it and the bar keeps rising.
This is mostly from OpenAI's technical report[1]. The API performs better, as I said in my previous comment. The API models (0613/0125, etc.) also use user data for training, which could leak the benchmark data.
Divide, not multiply. If a size is estimated at 8-bit, reducing to 4-bit halves the size (and the entropy of each value). It's like the difference between INT_MAX and SHORT_MAX (assuming you have such defs).
I could be wrong too but that’s my understanding. Like float vs half-float.
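To make the arithmetic concrete, here's a minimal sketch of the raw weight-storage calculation (hypothetical parameter count; real quantized files carry some extra overhead for scales/zero-points, which this ignores):

```python
def model_size_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate raw weight storage in GB, ignoring quantization metadata."""
    return n_params * bits_per_param / 8 / 1e9

# e.g. a 70B-parameter model: halving the bit width halves the size.
size_8bit = model_size_gb(70_000_000_000, 8)  # 70.0 GB
size_4bit = model_size_gb(70_000_000_000, 4)  # 35.0 GB
```

Hence "divide, not multiply": going from 8-bit to 4-bit divides the footprint by two.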
[1]: https://deepmind.google/technologies/gemini/