Note that the free version of ChatGPT that most people use is based on GPT-3.5, which is much worse than GPT-4. I haven't found comprehensive eval numbers for the latest GPT-3.5, but I believe Llama 3 70B handily beats it and even the 8B is close. It's very exciting to have models this good that you can run locally and modify!
Wild! So if this indeed holds up, it looks like OpenAI were about a year ahead of the open source world when GPT-4 was released. However, given that the gap between matching GPT-3.5 (Mixtral, perhaps?) and matching GPT-4 has been just a few weeks, I am wondering if the open source models have more momentum.
That said, I am very curious what OpenAI has in their labs... Are they actually barely ahead? Or do they have something much better that is not yet public? Perhaps they were waiting for Llama 3 to show it? Exciting times ahead either way!
You've also got to consider that we don't really know where OpenAI is. What they have released in the past year has been tweaks to GPT-4, while I am sure the real work is going into GPT-5 or whatever it gets called.
While all the others are catching up and in some cases edging slightly ahead, I wouldn't be surprised to see a rather large leap back into the lead from OpenAI pretty soon, and then a scramble for some time for others to get close again. We will really see who has the momentum soon, when we see OpenAI's next full release.
I find it somewhat interesting that there is a common perception that GPT-4 at release was actually smart, but that it got gradually nerfed for speed with turbo, which is better tuned but doesn't exhibit intelligence like the original.
There were times when I felt that too, but nowadays I predominantly use turbo. It's probably because turbo is faster and cheaper, but on lmsys turbo sits about 100 Elo higher than the original, so by and large people simply find turbo to be... better?
Nevertheless, I do wonder whether, not just in benchmarks but in how people actually use LLMs, intelligence is somewhat underutilised, or possibly offset by other qualities.
Given the incremental increase between GPT-4 and its turbo variant, I would weight "vibes" more heavily than this improvement on MMLU. OpenAI isn't exactly a very honest or transparent company, and the metric is imperfect. As a longtime user of ChatGPT, I observed it got markedly worse at coding after the turbo release, specifically in its refusal to complete code as specified.
Have you tried Claude 3 Opus? I've been using that predominantly since release and find its "smarts" as good as or better than my experience with GPT-4 (pre-turbo).
I did. It definitely exudes more all-around personality. Unfortunately, in my private test suite (mostly about coding), it did somewhat worse than turbo or phind 70b.
Since price influences my calculus, I can't say this for sure, but it seems being slightly smarter is not much of an edge, because it's still dumb by human standards. For most non-coding use (like summarisation), the extra smarts don't make much difference; I find that cheaper options like mistral-large do just as well as Opus.
In the last month I have used Command R+ more and more. Finally had some excuse to write some function calling stuff. I have also been highly impressed by Gemini Pro 1.5 finding technical answers from a dense 650 page pdf manual. I have enjoyed chatting with the WizardLM2 fine-tune for the past few days.
Somehow I haven't quite found a consistent use case for Opus.
I think it might just be subjective feelings (GPT-4-turbo seeming dumber): the joy is always stronger when you first taste it, and it decays as you get used to it and the bar keeps rising.
This is mostly from OpenAI's technical report [1]. The API performs better, as I said in my previous comment. The API models (0613/0125, etc.) also use user data for training, which could leak benchmark data.
Divide, not multiply. If a size is estimated at 8-bit, reducing to 4-bit halves the size (and the entropy of each value). It's the difference between INT_MAX and SHORT_MAX (assuming you have such defs).
I could be wrong too but that’s my understanding. Like float vs half-float.
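A quick sanity check of that arithmetic (my own back-of-envelope sketch; the 70B parameter count is just an illustrative example, not tied to any particular model):

    # Weight footprint scales linearly with bits per weight:
    # going from 8-bit to 4-bit divides the size by 2, it never multiplies it.
    def weight_gb(n_params, bits_per_weight):
        return n_params * bits_per_weight / 8 / 1e9

    n = 70e9                 # e.g. a 70B-parameter model (illustrative)
    print(weight_gb(n, 16))  # ~140 GB at fp16
    print(weight_gb(n, 8))   # ~70 GB at 8-bit (half)
    print(weight_gb(n, 4))   # ~35 GB at 4-bit (half again)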
Not quite there yet, but very close and not done training! It's quite plausible that this model could be state of the art over GPT-4 in some domains when it finishes training, unless GPT-5 comes out first.
Although 400B will be pretty much out of reach for any PC to run locally, it will still be exciting to have a GPT-4 level model in the open for research so people can try quantizing, pruning, distilling, and other ways of making it more practical to run. And I'm sure startups will build on it as well.
The real question will be, how much you can quantize that while still retaining sanity. 400b at 2-bit would be possible to run on a Mac Studio - probably at multiple seconds per token, but sometimes that's "fast enough".
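To put rough numbers on that (assuming a 192 GB unified-memory Mac Studio and counting weights only, ignoring KV cache and quantization overhead):

    # Back-of-envelope: weight footprint of a 400B model at various bit widths.
    N_PARAMS = 400e9
    MAC_STUDIO_RAM_GB = 192  # top-spec unified memory configuration (assumption)

    for bits in (16, 8, 4, 3, 2):
        gb = N_PARAMS * bits / 8 / 1e9
        fits = "fits" if gb < MAC_STUDIO_RAM_GB else "does not fit"
        print(f"{bits}-bit: ~{gb:.0f} GB of weights ({fits} in {MAC_STUDIO_RAM_GB} GB)")

At 2-bit that's about 100 GB of weights, which is why it squeaks in; 4-bit (about 200 GB) already doesn't.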
Yes. I expect an explosion of research and experimentation in model compression. The good news is I think there are tons of avenues that have barely been explored at all. We are at the very beginning of understanding this stuff, and my bet is that in a few years we'll be able to compress these models 10x or more.
Once benchmarks exist for a while, they become meaningless - even if it's not specifically training on the test set, actions (what used to be called "graduate student descent") end up optimizing new models towards overfitting on benchmark tasks.
Even the random seed can cause a big shift in HumanEval performance; if you know, you know. There's nothing stopping you from picking the one checkpoint that looks best on those benchmarks and moving along.
HumanEval is meaningless regardless; those 164 problems have been overfit to a tee.
Hook this up to the lmsys arena and we'll get a better picture of how powerful they really are.
It's a really funny story that I comment about at least once a week because it drives me nuts.
1. After ChatGPT's release, Twitter spam from influencers claimed ChatGPT is one billion parameters and GPT-4 is 1 trillion.
2. Semianalysis publishes a blog post claiming 1.8T sourced from insiders.
3. The way info diffusion works these days, everyone heard from someone else other than Semianalysis.
4. Up until about a month ago, you could confidently say "hey, it's just that one blog post" and work through it with people to trace their initial hearing of it back to the post.
5. An Nvidia press conference sometime in the last month used the rumor as an example with "apparently" attached, and now people will tell you Nvidia confirmed 1.8 trillion.
my $0.02: I'd bet my life GPT-4 isn't 1.8T, and I very much doubt it's over 1 trillion. Like lightning striking the same person 3 times in the same week.
You're ignoring geohot, who is a credible source (is an active researcher himself, is very well-connected) and gave more details (MoE with 8 experts, when no-one else was doing production MoE yet) than the Twitter spam.
Geohot? I know enough people at OpenAI to know how four of them reacted when he started claiming 1T based on timing the per-token latency in the ChatGPT web UI.
In general, not someone you want to be citing with lengthy platitudes: he's an influencer who speaks engineer, and he's burned out of every community he's been in, acrimoniously.
I'm not OP, but George Hotz said on the Lex Fridman podcast a while back that it was an MoE of 8x250B; subtract out the duplication of attention nodes and you get something right around 1.8T.
It's a very plausible rumor, but it is misleading in this context, because the rumor also states that it's a mixture of experts model with 8 experts, suggesting that most (perhaps as many as 7/8) of those weights are unused by any particular inference pass.
That might suggest that GPT-4 should be thought of as something like a 250B model. But there's also some selection for the 1/8 of weights used by the chosen expert as being the "most useful" weights for that pass (as chosen/defined by the mixture routing), so now it feels like 250B is undercounting the parameter size, whereas 1.8T was overcounting it.
I think it's not really defined how to compare parameter counts with a MoE model.
But from an output quality standpoint, the total parameter count still seems more relevant. For example, 8x7B Mixtral only executes 13B parameters per token, but it behaves comparably to 34B and 70B models, which tracks with its total size of ~45B parameters. You get some of the training and inference advantages of a 13B model, with the strength of a 45B model.
Similarly, if GPT-4 is really 1.8T you would expect it to produce output of similar quality to a comparable 1.8T model without MoE architecture.
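For concreteness, here's the kind of arithmetic behind those numbers; the Mixtral figures are approximate public ones, and the GPT-4 shape is just the rumor from upthread, not a confirmed fact:

    # Total vs. active parameters in a mixture-of-experts model.
    # Only the expert FFN blocks are replicated; attention and embeddings are
    # shared, which is why Mixtral 8x7B totals ~46B rather than 8 x 7B = 56B.
    def moe_params(shared, per_expert, n_experts, top_k):
        total = shared + n_experts * per_expert
        active = shared + top_k * per_expert
        return total, active

    # Mixtral 8x7B: ~1.6B shared, ~5.6B per expert, 8 experts, top-2 routing.
    t, a = moe_params(1.6e9, 5.6e9, 8, 2)
    print(f"Mixtral 8x7B: ~{t/1e9:.0f}B total, ~{a/1e9:.0f}B active per token")

    # Rumored GPT-4 shape (pure speculation): 8 experts of ~225B plus ~25B shared.
    t, a = moe_params(25e9, 225e9, 8, 2)
    print(f"Rumored GPT-4: ~{t/1e12:.1f}T total, ~{a/1e9:.0f}B active per token")

Whether the "effective" size is closer to the active count or the total count is exactly the ambiguity being discussed above.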
From Databricks:
"DBRX has 16 experts and chooses 4, while Mixtral and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments."
A 19" server chassis is wide enough for 8 vertically mounted GPUs next to each other, with just enough space left for the power supplies. Consequently 8 GPUs is a common and cost efficient configuration in servers.
Everyone seems to put each expert on a different GPU in training and inference, so that's how you get to 8 experts, or 7 if you want to put the router on its own GPU too.
You could also do multiples of 8. But from my limited understanding it seems like more experts don't perform better. The main advantage of MoE is the ability to split the model into parts that don't talk to each other, and run these parts in different GPUs or different machines.
I think it's almost certainly using at least two experts per token. It helps a lot during training to have two experts to contrast when putting losses on the expert router.
I actually can't wrap my head around this number, even though I have been working on and off with deep learning for a few years. The biggest models we've ever deployed on production still have less than 1B parameters, and the latency is already pretty hard to manage during rush hours. I have no idea how they deploy (multiple?) 1.8T models that serve tens of millions of users a day.
But I'm waiting for the fine-tuned/merged models. Many devs produced great models based on Llama 2 that outperformed the vanilla one, so I expect similar treatment for the new version. Exciting nonetheless!
https://ai.meta.com/blog/meta-llama-3/ has it about a third of the way down. It's a little bit better on every benchmark than Mixtral 8x22B (according to Meta).
Llama 3 70B takes half the VRAM of Mixtral 8x22B, but it does need almost twice the FLOPS/bandwidth per token. Yes, Llama's context is smaller, although that should be fixable in the near future. Another thing is that Llama is English-focused while Mixtral is more multilingual.
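Rough numbers behind that trade-off (approximate published parameter counts; per-token compute and bandwidth taken as proportional to the parameters actually touched per token):

    # Dense vs. MoE serving trade-off, back-of-envelope.
    models = {
        # name: (total params, active params per token), approximate
        "Llama 3 70B":   (70e9, 70e9),
        "Mixtral 8x22B": (141e9, 39e9),
    }

    for name, (total, active) in models.items():
        vram_gb = total * 2 / 1e9  # 2 bytes per weight at fp16
        print(f"{name}: ~{vram_gb:.0f} GB fp16 weights, ~{active/1e9:.0f}B params per token")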
GPT-4 numbers are from https://github.com/openai/simple-evals, gpt-4-turbo-2024-04-09 (ChatGPT).