
I was curious how the numbers compare to GPT-4 in the paid ChatGPT Plus, since they don't include that comparison themselves.

           Llama 3 8B Llama 3 70B GPT-4
 MMLU      68.4       82.0        86.5
 GPQA      34.2       39.5        49.1
 MATH      30.0       50.4        72.2
 HumanEval 62.2       81.7        87.6
 DROP      58.4       79.7        85.4
Note that the free version of ChatGPT that most people use is based on GPT-3.5, which is much worse than GPT-4. I haven't found comprehensive eval numbers for the latest GPT-3.5, but I believe Llama 3 70B handily beats it and even the 8B is close. It's very exciting to have models this good that you can run locally and modify!

GPT-4 numbers are from https://github.com/openai/simple-evals for gpt-4-turbo-2024-04-09 (the model behind ChatGPT).



The bottom of https://ai.meta.com/blog/meta-llama-3/ has in-progress results for the 400B model as well. Looks like it's not quite there yet.

  Llama 3 400B  Base   Instruct
  MMLU          84.8   86.1
  GPQA           -     48.0
  MATH           -     57.8
  HumanEval      -     84.1
  DROP          83.5    -


For the still-training 400B:

          Llama 3 GPT-4 (published)
    BBH   85.3    83.1
    MMLU  86.1    86.4
    DROP  83.5    80.9
    GSM8K 94.1    92.0
    MATH  57.8    52.9
    HumEv 84.1    74.4
Although it should be noted that the API numbers were generally better than the published numbers for GPT-4.



Wild! So if this indeed holds up, it looks like OpenAI were about a year ahead of the open-source world when GPT-4 was released. However, given that the gap between matching GPT-3.5 (Mixtral, perhaps?) and matching GPT-4 has been just a few weeks, I am wondering if the open-source models have more momentum.

That said, I am very curious what OpenAI has in their labs... Are they actually barely ahead? Or do they have something much better that is not yet public? Perhaps they were waiting for Llama 3 to show it? Exciting times ahead either way!


You've also got to consider that we don't really know where OpenAI is, though. What they have released in the past year has been tweaks to GPT-4, while I am sure the real work is going into GPT-5 or whatever it gets called.

While all the others are catching up, and in some cases getting slightly ahead, I wouldn't be surprised to see a rather large leap back into the lead from OpenAI pretty soon, and then a scramble for some time for others to get close again. We will really see who has the momentum when OpenAI's next full release arrives.


Those numbers are for the original GPT-4 (Mar 2023). Current GPT-4-Turbo (Apr 2024) is better:

          Llama 3 GPT-4   GPT-4-Turbo* (Apr 2024)
    MMLU  86.1    86.4    86.7
    DROP  83.5    80.9    86.0
    MATH  57.8    52.9    73.4
    HumEv 84.1    74.4    88.2
*using API prompt: https://github.com/openai/simple-evals


I find it somewhat interesting that there's a common perception that GPT-4 at release was actually smart, but that it got gradually nerfed for speed with Turbo, which is better tuned but doesn't exhibit the same intelligence as the original.

There were times when I felt that too, but nowadays I predominantly use Turbo. That's probably because Turbo is faster and cheaper, but on the LMSYS leaderboard Turbo sits about 100 Elo higher than the original, so by and large people simply find Turbo to be... better?

Nevertheless, I do wonder whether, not just in benchmarks but in how people actually use LLMs, intelligence is somewhat under-utilised, or possibly offset by other qualities.


Given the incremental increase between GPT-4 and its Turbo variant, I would weight "vibes" more heavily than this improvement on MMLU. OpenAI isn't exactly a very honest or transparent company, and the metric is imperfect. As a longtime user of ChatGPT, I observed it got markedly worse at coding after the Turbo release, specifically in its refusal to complete code as specified.


Have you tried Claude 3 Opus? I've been using that predominantly since release and find its "smarts" as good as or better than my experience with GPT-4 (pre-Turbo).


I did. It definitely exudes more all-around personality. Unfortunately, in my private test suite (mostly about coding), it did somewhat worse than Turbo or Phind 70B.

Since price influences my calculus, I can't say this for sure, but it seems being slightly smarter is not much of an edge, because it's still dumb by human standards. For most non-coding use (like summarisation) the extra smarts don't make much difference; I find that cheaper options like mistral-large do just as well as Opus.

In the last month I have used Command R+ more and more; I finally had some excuse to write some function-calling stuff. I have also been highly impressed by Gemini Pro 1.5 finding technical answers in a dense 650-page PDF manual. And I have enjoyed chatting with the WizardLM 2 fine-tune for the past few days.

Somehow I haven't quite found a consistent use case for Opus.


I think it might just be subjective feeling (GPT-4 Turbo being dumber) - the joy is always strongest when you first taste it, and it decays as you get used to it and the bar keeps rising.


Which specific GPT-4 model is this? gpt-4-0613? gpt-4-0125-preview?


This is mostly from the OpenAI technical report [1]. The API performs better, as I said in my previous comment. The API models (0613/0125, etc.) also use user data for training, which could leak the benchmark data.

[1]: https://arxiv.org/pdf/2303.08774.pdf


IIRC this model had finished pretraining in the summer of 2022.


Hm, how much VRAM would this take to run?


My guess is around 256 GiB, but it depends on what level of quantization you are okay with. At full 16-bit it will be massive, near 512 GiB.

I figure we will see some Q4s that can probably fit on four 4090s with CPU offloading.


With 400 billion parameters and 8 bits per parameter, wouldn't it be ~400 GB? Plus context size which could be quite large.


he said "Q4" - meaning 4-bit weights.


Ok but at 16-bit it would be 800GB+, right? Not 512.


Divide, not multiply. If a size is estimated at 8-bit, reducing to 4-bit halves the size (and the entropy of each value). It's the difference between INT_MAX and SHORT_MAX (assuming you have such defs).

I could be wrong too, but that's my understanding. Like float vs. half-float.
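
To make the arithmetic concrete, here's a minimal weights-only sketch (real memory use also needs room for the KV cache and runtime overhead):

    # Rough model-memory estimate: parameters x bits per parameter / 8.
    def model_size_gb(n_params, bits_per_param):
        return n_params * bits_per_param / 8 / 1e9  # bytes -> GB

    for bits in (16, 8, 4, 2):
        print(f"400B @ {bits}-bit: ~{model_size_gb(400e9, bits):.0f} GB")
    # 400B @ 16-bit: ~800 GB
    # 400B @ 8-bit:  ~400 GB
    # 400B @ 4-bit:  ~200 GB
    # 400B @ 2-bit:  ~100 GB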


yes


Back of the envelope, maybe 0.75TB? More than you have, probably ...


"More than you can afford, pal--NVidia."


Not quite there yet, but very close and not done training! It's quite plausible that this model could be state of the art over GPT-4 in some domains when it finishes training, unless GPT-5 comes out first.

Although 400B will be pretty much out of reach for any PC to run locally, it will still be exciting to have a GPT-4 level model in the open for research so people can try quantizing, pruning, distilling, and other ways of making it more practical to run. And I'm sure startups will build on it as well.


There are rumors about an upcoming M3 or M4 Extreme chip... which would certainly have enough RAM, and probably 1600-2000 GB/s of memory bandwidth.

Still wouldn't be super performant as far as token generation goes, ~4-6 per second, but certainly runnable.

Of course, by the time that lands in 6-12 months, we'll probably have a 70-100B model that is similarly performant.
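
For what it's worth, the ~4-6 tokens/s figure lines up with a simple memory-bandwidth-bound estimate, since each generated token has to stream roughly the whole weight set from memory once. A rough sketch, assuming a 4-bit quant and ~50% effective bandwidth (both made-up numbers):

    # Decode speed ~= memory bandwidth / bytes of weights read per token.
    def tokens_per_sec(model_gb, bandwidth_gbps, efficiency=0.5):
        return bandwidth_gbps / model_gb * efficiency

    model_gb = 400e9 * 4 / 8 / 1e9  # 400B weights at 4-bit ~= 200 GB
    for bw in (1600, 2000):         # the rumored bandwidth range above
        print(f"{bw} GB/s -> ~{tokens_per_sec(model_gb, bw):.1f} tok/s")
    # 1600 GB/s -> ~4.0 tok/s
    # 2000 GB/s -> ~5.0 tok/s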


The real question will be how much you can quantize that while still retaining sanity. 400B at 2-bit would be possible to run on a Mac Studio - probably at multiple seconds per token, but sometimes that's "fast enough".


Yes. I expect an explosion of research and experimentation in model compression. The good news is I think there are tons of avenues that have barely been explored at all. We are at the very beginning of understanding this stuff, and my bet is that in a few years we'll be able to compress these models 10x or more.


This is tantalizingly close in multiple benchmarks though. Pretty sure this one will finally be the open GPT-4 match.


Wild, considering GPT-4 is 1.8T.


Once benchmarks exist for a while, they become meaningless - even if it's not specifically training on the test set, actions (what used to be called "graduate student descent") end up optimizing new models towards overfitting on benchmark tasks.


Also, the technological leader focuses less on the benchmarks


Interesting claim, is there data to back this up? My impression is that Intel and NVIDIA have always gamed the benchmarks.


NVIDIA needs T models not B models to keep the share price up.


Even the random seed can cause a big shift in HumanEval performance, if you know you know. And nothing stops you from choosing the one checkpoint that looks best on those benchmarks and moving along.

HumanEval is meaningless regardless; those 164 problems have been overfit to death.

Hook this up to the LMSYS arena and we will get a better picture of how powerful these models really are.


"graduate student descent"

Ahhh that takes me back!


The original GPT-4 may have been around that size (16x 110B).

But it's pretty clear GPT-4 Turbo is a smaller and heavily quantized model.


Yeah, it’s not even close to doing inference on 1.8T weights for turbo queries.


Where did you find this number? Not doubting it, just want to get a better idea of how precise the estimate may be.


It's a really funny story that I comment about at least once a week because it drives me nuts.

1. After the ChatGPT release, Twitter spam from influencers claimed ChatGPT is one billion parameters and GPT-4 is 1 trillion.

2. Semianalysis publishes a blog post claiming 1.8T sourced from insiders.

3. The way info diffusion works these days, everyone heard it from someone other than Semianalysis.

4. Up until about a month ago, you could confidently say "hey, it's just that one blog post" and work through it with people to trace their initial hearing of it back to the post.

5. An Nvidia press conference some time in the last month used the rumor as an example, with "apparently" attached, and now people will tell you Nvidia confirmed 1.8 trillion.

My $0.02: I'd bet my life GPT-4 isn't 1.8T, and I very much doubt it's over 1 trillion. It'd be like lightning striking the same person 3 times in the same week.


You're ignoring geohot, who is a credible source (is an active researcher himself, is very well-connected) and gave more details (MoE with 8 experts, when no-one else was doing production MoE yet) than the Twitter spam.


Geohot? I know enough people at OpenAI to have heard four people's reactions when he started claiming 1T based on timing the per-token latency in the ChatGPT web UI.

In general, not someone you want to be citing with lengthy platitudes. He's an influencer who speaks engineer, and he's burned out of every community he's been in, acrimoniously.


Probably from Nvidia's GTC keynote: https://www.youtube.com/live/USlE2huSI_w?t=2995.

In the keynote, Jensen uses 1.8T in an example and suggests that this is roughly the size of GPT-4 (if I remember correctly).


I'm not OP, but George Hotz said on his Lex Fridman podcast appearance a while back that it was an MoE of 8x250B. Subtract out the duplication of attention nodes and you get something right around 1.8T.


I'm pretty sure he suggested it was a 16-way 110B MoE.


The exact quote: "Sam Altman won’t tell you that GPT 4 has 220 billion parameters and is a 16 way mixture model with eight sets of weights."


It's a very plausible rumor, but it is misleading in this context, because the rumor also states that it's a mixture of experts model with 8 experts, suggesting that most (perhaps as many as 7/8) of those weights are unused by any particular inference pass.

That might suggest that GPT-4 should be thought of as something like a 250B model. But there's also some selection for the remaining 1/8 of weights that are used by the chosen expert as being the "most useful" weights for that pass (as chosen/defined by the mixture routing), so now it feels like 250B is undercounting the parameter size, whereas 1.8T was overcounting it.

I think it's not really defined how to compare parameter counts with a MoE model.
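
To illustrate the two ways of counting, here is a toy sketch with made-up splits between shared (attention/embedding) and per-expert weights - these numbers are assumptions for illustration, not the real GPT-4 config:

    shared = 55e9            # attention + embeddings, shared across experts (assumed)
    per_expert_ffn = 220e9   # feed-forward weights per expert (assumed)
    n_experts = 8
    experts_per_token = 2    # experts the router activates per token (assumed)

    total = shared + n_experts * per_expert_ffn            # headline size, ~1.8T
    active = shared + experts_per_token * per_expert_ffn   # touched per forward pass, ~0.5T
    print(f"total: ~{total/1e12:.1f}T, active per token: ~{active/1e12:.1f}T")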


But from an output quality standpoint the total parameter count still seems more relevant. For example 8x7B Mixtral only executes 13B parameters per token, but it behaves comparable to 34B and 70B models, which tracks with its total size of ~45B parameters. You get some of the training and inference advantages of a 13B model, with the strength of a 45B model.

Similarly, if GPT-4 is really 1.8T you would expect it to produce output of similar quality to a comparable 1.8T model without MoE architecture.


"For example 8x7B Mixtral only executes 13B parameters per token, but it behaves comparable to 34B and 70B models"

Are you sure about that? I'm pretty sure Miqu (the leaked Mistral 70b model) is generally thought to be smarter than Mixtral 8x7b.


What is the reason for settling on 7/8 experts for mixture of experts? Has there been any serious evaluation of what would be a good MoE split?


It's not always 7-8.

From Databricks: "DBRX has 16 experts and chooses 4, while Mixtral and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments."

https://www.databricks.com/blog/introducing-dbrx-new-state-a...
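
The "65x more possible combinations" is just the ratio of ways the router can pick which experts are active:

    from math import comb

    dbrx = comb(16, 4)     # 1820 possible expert subsets
    mixtral = comb(8, 2)   # 28 possible expert subsets
    print(dbrx / mixtral)  # 65.0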


A 19" server chassis is wide enough for 8 vertically mounted GPUs next to each other, with just enough space left for the power supplies. Consequently 8 GPUs is a common and cost efficient configuration in servers.

Everyone seems to put each expert on a different GPU in training and inference, so that's how you get to 8 experts, or 7 if you want to put the router on its own GPU too.

You could also do multiples of 8. But from my limited understanding it seems like more experts don't perform better. The main advantage of MoE is the ability to split the model into parts that don't talk to each other, and run these parts in different GPUs or different machines.


(For a model of GPT-4's size, it could also be 8 nodes with several GPUs each, each node comprising a single expert.)


I think it's almost certainly using at least two experts per token. It helps a lot during training to have two experts to contrast when putting losses on the expert router.


I actually can't wrap my head around this number, even though I have been working on and off with deep learning for a few years. The biggest models we've ever deployed in production still have less than 1B parameters, and the latency is already pretty hard to manage during rush hours. I have no idea how they deploy (multiple?) 1.8T models that serve tens of millions of users a day.


It's a mixture of experts model. Only a small part of those parameters are active at any given time. I believe it's 16x110B


But I'm waiting for the fine-tuned/merged models. Many devs produced great models based on Llama 2 that outperformed the vanilla one, so I expect similar treatment for the new version. Exciting nonetheless!


Has anyone prepared a comparison to Mixtral 8x22B? (Life sure moves fast.)


The comparison with Mixtral 8x22B is in the official post.


Where? I only see comparisons to Mistral 7B and Mistral Medium, which are totally different models.


https://ai.meta.com/blog/meta-llama-3/ has it about a third of the way down. It's a little bit better on every benchmark than Mixtral 8x22B (according to Meta).


Oh cool! But at the cost of twice the VRAM and only having 1/8th of the context, I suppose?


Llama 3 70B takes half the VRAM of Mixtral 8x22B, but it does need almost twice the FLOPS/bandwidth. Yes, Llama's context is smaller, although that should be fixable in the near future. Another thing is that Llama is English-focused while Mixtral is more multilingual.
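
A weights-only sketch of where "half the VRAM, almost twice the FLOPS" comes from, using the commonly cited Mixtral 8x22B figures (~141B total, ~39B active with 2 of 8 experts per token) - treat these as approximations:

    llama_total, llama_active = 70e9, 70e9        # dense: every weight is used each token
    mixtral_total, mixtral_active = 141e9, 39e9   # MoE: only the routed experts are used

    print(llama_total / mixtral_total)    # ~0.50: about half the memory footprint
    print(llama_active / mixtral_active)  # ~1.79: nearly twice the compute per token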


also curious how it compares to WizardLM 2 8x22B


I was particularly excited for the high HumanEval score, and this is before the 400B model and the CodeLlama tune!

I just added Llama 3 70B to our coding copilot https://www.double.bot if anyone wants to try it for coding within their IDE


Via Microsoft Copilot (and perhaps Bing?) you can get access to GPT-4 for free.


* With targeted advertising


Eh, no worse than any other free (and many paid!) products on the web.


Is Copilot free now?


There's a free tier and a 'pro' tier.



