
I was curious how the numbers compare to GPT-4 in the paid ChatGPT Plus, since they don't include that comparison themselves.

           Llama 3 8B Llama 3 70B GPT-4
 MMLU      68.4       82.0        86.5
 GPQA      34.2       39.5        49.1
 MATH      30.0       50.4        72.2
 HumanEval 62.2       81.7        87.6
 DROP      58.4       79.7        85.4
Note that the free version of ChatGPT that most people use is based on GPT-3.5, which is much worse than GPT-4. I haven't found comprehensive eval numbers for the latest GPT-3.5, but I believe Llama 3 70B handily beats it and even the 8B is close. It's very exciting to have models this good that you can run locally and modify!

GPT-4 numbers are from https://github.com/openai/simple-evals for gpt-4-turbo-2024-04-09 (the model behind ChatGPT).



The bottom of https://ai.meta.com/blog/meta-llama-3/ has in-progress results for the 400B model as well. Looks like it's not quite there yet.

  Llama 3 400B  Base   Instruct
  MMLU          84.8   86.1
  GPQA           -     48.0
  MATH           -     57.8
  HumanEval      -     84.1
  DROP          83.5    -


For the still-training 400B:

          Llama 3 GPT-4 (published)
    BBH   85.3    83.1
    MMLU  86.1    86.4
    DROP  83.5    80.9
    GSM8K 94.1    92.0
    MATH  57.8    52.9
    HumEv 84.1    74.4
Although it should be noted that the API numbers were generally better than the published numbers for GPT-4.



Wild! So if this indeed holds up, it looks like OpenAI were about a year ahead of the open-source world when GPT-4 was released. However, given that the gap between matching GPT-3.5 (Mixtral, perhaps?) and matching GPT-4 has been just a few weeks, I am wondering if the open-source models have more momentum.

That said, I am very curious what OpenAI has in their labs... Are they actually barely ahead? Or do they have something much better that is not yet public? Perhaps they were waiting for Llama 3 to show it? Exciting times ahead either way!


You've also got to consider that we don't really know where OpenAI is, though. What they have released in the past year has been tweaks to GPT-4, while I am sure the real work is going into GPT-5 or whatever it gets called.

While all the others are catching up, and in some cases getting slightly ahead, I wouldn't be surprised to see a rather large leap back into the lead from OpenAI pretty soon, and then a scramble for some time for others to get close again. We will really see who has the momentum when OpenAI's next full release arrives.


Those numbers are for the original GPT-4 (Mar 2023). Current GPT-4-Turbo (Apr 2024) is better:

          Llama 3 GPT-4   GPT-4-Turbo* (Apr 2024)
    MMLU  86.1    86.4    86.7
    DROP  83.5    80.9    86.0
    MATH  57.8    52.9    73.4
    HumEv 84.1    74.4    88.2
*using API prompt: https://github.com/openai/simple-evals


I find it somewhat interesting that there's a common perception that GPT-4 at release was actually smart, but that it got gradually nerfed for speed with Turbo, which is better tuned but doesn't exhibit the same intelligence as the original.

There were times when I felt that too, but nowadays I predominantly use Turbo. That's probably because Turbo is faster and cheaper, but on the LMSYS leaderboard Turbo sits about 100 Elo higher than the original, so by and large people simply find Turbo to be... better?

Nevertheless, I do wonder whether, not just in benchmarks but in how people actually use LLMs, intelligence is somewhat under-utilised, or possibly offset by other qualities.


Given the incremental increase between GPT-4 and its Turbo variant, I would weight "vibes" more heavily than this improvement on MMLU. OpenAI isn't exactly a very honest or transparent company, and the metric is imperfect. As a longtime user of ChatGPT, I observed it got markedly worse at coding after the Turbo release, specifically in its refusal to complete code as specified.


Have you tried Claude 3 Opus? I've been using that predominantly since release and find its "smarts" as good as or better than my experience with GPT-4 (pre-Turbo).


I did. It definitely exudes more all-around personality. Unfortunately, in my private test suite (mostly about coding), it did somewhat worse than Turbo or Phind 70B.

Since price influences my calculus, I can't say this for sure, but it seems being slightly smarter is not much of an edge, because it's still dumb by human standards. For most non-coding use (like summarisation) the extra smarts don't make much difference; I find that cheaper options like mistral-large do just as well as Opus.

In the last month I have used Command R+ more and more; I finally had some excuse to write some function-calling stuff. I have also been highly impressed by Gemini Pro 1.5 finding technical answers in a dense 650-page PDF manual. And I have enjoyed chatting with the WizardLM 2 fine-tune for the past few days.

Somehow I haven't quite found a consistent use case for Opus.


I think it might just be subjective feeling (GPT-4 Turbo being dumber) - the joy is always strongest when you first taste it, and it decays as you get used to it and the bar keeps rising.


Which specific GPT-4 model is this? gpt-4-0613? gpt-4-0125-preview?


This is mostly from the OpenAI technical report [1]. The API performs better, as I said in my previous comment. The API models (0613/0125, etc.) also use user data for training, which could leak the benchmark data.

[1]: https://arxiv.org/pdf/2303.08774.pdf


IIRC this model had finished pretraining in the summer of 2022.


Hm, how much VRAM would this take to run?


My guess is around 256 GiB, but it depends on what level of quantization you are okay with. At full 16-bit it will be massive, near 512 GiB.

I figure we will see some Q4s that can probably fit on four 4090s with CPU offloading.


With 400 billion parameters and 8 bits per parameter, wouldn't it be ~400 GB? Plus context size which could be quite large.


he said "Q4" - meaning 4-bit weights.


Ok but at 16-bit it would be 800GB+, right? Not 512.


Divide, not multiply. If a size is estimated at 8-bit, reducing to 4-bit halves the size (and the entropy of each value). It's the difference between INT_MAX and SHORT_MAX (assuming you have such defs).

I could be wrong too, but that's my understanding. Like float vs. half-float.
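
To make the arithmetic concrete, here's a minimal weights-only sketch (real memory use also needs room for the KV cache and runtime overhead):

    # Rough model-memory estimate: parameters x bits per parameter / 8.
    def model_size_gb(n_params, bits_per_param):
        return n_params * bits_per_param / 8 / 1e9  # bytes -> GB

    for bits in (16, 8, 4, 2):
        print(f"400B @ {bits}-bit: ~{model_size_gb(400e9, bits):.0f} GB")
    # 400B @ 16-bit: ~800 GB
    # 400B @ 8-bit:  ~400 GB
    # 400B @ 4-bit:  ~200 GB
    # 400B @ 2-bit:  ~100 GB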


yes


Back of the envelope, maybe 0.75TB? More than you have, probably ...


"More than you can afford, pal--NVidia."


Not quite there yet, but very close and not done training! It's quite plausible that this model could be state of the art over GPT-4 in some domains when it finishes training, unless GPT-5 comes out first.

Although 400B will be pretty much out of reach for any PC to run locally, it will still be exciting to have a GPT-4 level model in the open for research so people can try quantizing, pruning, distilling, and other ways of making it more practical to run. And I'm sure startups will build on it as well.


There are rumors about an upcoming M3 or M4 Extreme chip... which would certainly have enough RAM, and probably 1600-2000 GB/s of memory bandwidth.

Still wouldn't be super performant as far as token generation goes, ~4-6 per second, but certainly runnable.

Of course, by the time that lands in 6-12 months, we'll probably have a 70-100B model that is similarly performant.
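
For what it's worth, the ~4-6 tokens/s figure lines up with a simple memory-bandwidth-bound estimate, since each generated token has to stream roughly the whole weight set from memory once. A rough sketch, assuming a 4-bit quant and ~50% effective bandwidth (both made-up numbers):

    # Decode speed ~= memory bandwidth / bytes of weights read per token.
    def tokens_per_sec(model_gb, bandwidth_gbps, efficiency=0.5):
        return bandwidth_gbps / model_gb * efficiency

    model_gb = 400e9 * 4 / 8 / 1e9  # 400B weights at 4-bit ~= 200 GB
    for bw in (1600, 2000):         # the rumored bandwidth range above
        print(f"{bw} GB/s -> ~{tokens_per_sec(model_gb, bw):.1f} tok/s")
    # 1600 GB/s -> ~4.0 tok/s
    # 2000 GB/s -> ~5.0 tok/s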


The real question will be how much you can quantize that while still retaining sanity. 400B at 2-bit would be possible to run on a Mac Studio - probably at multiple seconds per token, but sometimes that's "fast enough".


Yes. I expect an explosion of research and experimentation in model compression. The good news is I think there are tons of avenues that have barely been explored at all. We are at the very beginning of understanding this stuff, and my bet is that in a few years we'll be able to compress these models 10x or more.


This is tantalizingly close in multiple benchmarks though. Pretty sure this one will finally be the open GPT-4 match.


Wild, considering GPT-4 is 1.8T.


Once benchmarks exist for a while, they become meaningless - even if it's not specifically training on the test set, actions (what used to be called "graduate student descent") end up optimizing new models towards overfitting on benchmark tasks.


Also, the technological leader focuses less on the benchmarks


Interesting claim, is there data to back this up? My impression is that Intel and NVIDIA have always gamed the benchmarks.


NVIDIA needs T models not B models to keep the share price up.


Even the random seed can cause a big shift in HumanEval performance, if you know you know. And nothing stops you from choosing the one checkpoint that looks best on those benchmarks and moving along.

HumanEval is meaningless regardless; those 164 problems have been overfit to death.

Hook this up to the LMSYS arena and we will get a better picture of how powerful these models really are.


"graduate student descent"

Ahhh that takes me back!


The original GPT-4 may have been around that size (16x 110B).

But it's pretty clear GPT-4 Turbo is a smaller and heavily quantized model.


Yeah, it’s not even close to doing inference on 1.8T weights for turbo queries.


Where did you find this number? Not doubting it, just want to get a better idea of how precise the estimate may be.


It's a really funny story that I comment about at least once a week because it drives me nuts.

1. After the ChatGPT release, Twitter spam from influencers claimed ChatGPT is one billion parameters and GPT-4 is 1 trillion.

2. Semianalysis publishes a blog post claiming 1.8T sourced from insiders.

3. The way info diffusion works these days, everyone heard it from someone other than Semianalysis.

4. Up until about a month ago, you could confidently say "hey, it's just that one blog post" and work through it with people to trace their initial hearing of it back to the post.

5. An Nvidia press conference some time in the last month used the rumor as an example, with "apparently" attached, and now people will tell you Nvidia confirmed 1.8 trillion.

My $0.02: I'd bet my life GPT-4 isn't 1.8T, and I very much doubt it's over 1 trillion. It'd be like lightning striking the same person 3 times in the same week.


You're ignoring geohot, who is a credible source (is an active researcher himself, is very well-connected) and gave more details (MoE with 8 experts, when no-one else was doing production MoE yet) than the Twitter spam.


Geohot? I know enough people at OpenAI to have heard four people's reactions when he started claiming 1T based on timing the per-token latency in the ChatGPT web UI.

In general, not someone you want to be citing with lengthy platitudes. He's an influencer who speaks engineer, and he's burned out of every community he's been in, acrimoniously.


Probably from Nvidia's GTC keynote: https://www.youtube.com/live/USlE2huSI_w?t=2995.

In the keynote, Jensen uses 1.8T in an example and suggests that this is roughly the size of GPT-4 (if I remember correctly).


I'm not OP, but George Hotz said on his Lex Fridman podcast appearance a while back that it was an MoE of 8x250B. Subtract out the duplication of attention nodes and you get something right around 1.8T.


I'm pretty sure he suggested it was a 16-way 110B MoE.


The exact quote: "Sam Altman won’t tell you that GPT 4 has 220 billion parameters and is a 16 way mixture model with eight sets of weights."


It's a very plausible rumor, but it is misleading in this context, because the rumor also states that it's a mixture of experts model with 8 experts, suggesting that most (perhaps as many as 7/8) of those weights are unused by any particular inference pass.

That might suggest that GPT-4 should be thought of as something like a 250B model. But there's also some selection for the remaining 1/8 of weights that are used by the chosen expert as being the "most useful" weights for that pass (as chosen/defined by the mixture routing), so now it feels like 250B is undercounting the parameter size, whereas 1.8T was overcounting it.

I think it's not really defined how to compare parameter counts with a MoE model.
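
To illustrate the two ways of counting, here is a toy sketch with made-up splits between shared (attention/embedding) and per-expert weights - these numbers are assumptions for illustration, not the real GPT-4 config:

    shared = 55e9            # attention + embeddings, shared across experts (assumed)
    per_expert_ffn = 220e9   # feed-forward weights per expert (assumed)
    n_experts = 8
    experts_per_token = 2    # experts the router activates per token (assumed)

    total = shared + n_experts * per_expert_ffn            # headline size, ~1.8T
    active = shared + experts_per_token * per_expert_ffn   # touched per forward pass, ~0.5T
    print(f"total: ~{total/1e12:.1f}T, active per token: ~{active/1e12:.1f}T")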


But from an output quality standpoint the total parameter count still seems more relevant. For example 8x7B Mixtral only executes 13B parameters per token, but it behaves comparable to 34B and 70B models, which tracks with its total size of ~45B parameters. You get some of the training and inference advantages of a 13B model, with the strength of a 45B model.

Similarly, if GPT-4 is really 1.8T you would expect it to produce output of similar quality to a comparable 1.8T model without MoE architecture.


"For example 8x7B Mixtral only executes 13B parameters per token, but it behaves comparable to 34B and 70B models"

Are you sure about that? I'm pretty sure Miqu (the leaked Mistral 70b model) is generally thought to be smarter than Mixtral 8x7b.


What is the reason for settling on 7/8 experts for mixture of experts? Has there been any serious evaluation of what would be a good MoE split?


It's not always 7-8.

From Databricks: "DBRX has 16 experts and chooses 4, while Mixtral and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments."

https://www.databricks.com/blog/introducing-dbrx-new-state-a...
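
The "65x more possible combinations" is just the ratio of ways the router can pick which experts are active:

    from math import comb

    dbrx = comb(16, 4)     # 1820 possible expert subsets
    mixtral = comb(8, 2)   # 28 possible expert subsets
    print(dbrx / mixtral)  # 65.0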


A 19" server chassis is wide enough for 8 vertically mounted GPUs next to each other, with just enough space left for the power supplies. Consequently 8 GPUs is a common and cost efficient configuration in servers.

Everyone seems to put each expert on a different GPU in training and inference, so that's how you get to 8 experts, or 7 if you want to put the router on its own GPU too.

You could also do multiples of 8. But from my limited understanding it seems like more experts don't perform better. The main advantage of MoE is the ability to split the model into parts that don't talk to each other, and run these parts in different GPUs or different machines.


(For a model of GPT-4's size, it could also be 8 nodes with several GPUs each, each node comprising a single expert.)


I think it's almost certainly using at least two experts per token. It helps a lot during training to have two experts to contrast when putting losses on the expert router.


I actually can't wrap my head around this number, even though I have been working on and off with deep learning for a few years. The biggest models we've ever deployed in production still have less than 1B parameters, and the latency is already pretty hard to manage during rush hours. I have no idea how they deploy (multiple?) 1.8T models that serve tens of millions of users a day.


It's a mixture of experts model. Only a small part of those parameters are active at any given time. I believe it's 16x110B


But I'm waiting for the fine-tuned/merged models. Many devs produced great models based on Llama 2 that outperformed the vanilla one, so I expect similar treatment for the new version. Exciting nonetheless!


Has anyone prepared a comparison to Mixtral 8x22B? (Life sure moves fast.)


The comparison with Mixtral 8x22B is in the official post.


Where? I only see comparisons to Mistral 7B and Mistral Medium, which are totally different models.


https://ai.meta.com/blog/meta-llama-3/ has it about a third of the way down. It's a little bit better on every benchmark than Mixtral 8x22B (according to Meta).


Oh cool! But at the cost of twice the VRAM and only having 1/8th of the context, I suppose?


Llama 3 70B takes half the VRAM of Mixtral 8x22B, but it does need almost twice the FLOPS/bandwidth. Yes, Llama's context is smaller, although that should be fixable in the near future. Another thing is that Llama is English-focused while Mixtral is more multilingual.
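
A weights-only sketch of where "half the VRAM, almost twice the FLOPS" comes from, using the commonly cited Mixtral 8x22B figures (~141B total, ~39B active with 2 of 8 experts per token) - treat these as approximations:

    llama_total, llama_active = 70e9, 70e9        # dense: every weight is used each token
    mixtral_total, mixtral_active = 141e9, 39e9   # MoE: only the routed experts are used

    print(llama_total / mixtral_total)    # ~0.50: about half the memory footprint
    print(llama_active / mixtral_active)  # ~1.79: nearly twice the compute per token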


also curious how it compares to WizardLM 2 8x22B


I was particularly excited for the high HumanEval score, and this is before the 400B model and the CodeLlama tune!

I just added Llama 3 70B to our coding copilot https://www.double.bot if anyone wants to try it for coding within their IDE


Via Microsoft Copilot (and perhaps Bing?) you can get access to GPT-4 for free.


* With targeted advertising


Eh, no worse than any other free (and many paid!) products on the web.


Is Copilot free now?


There's a free tier and a 'pro' tier.



