Wild! So if this indeed holds up, it looks like OpenAI were about a year ahead of the open-source world when GPT-4 was released. However, given that the gap between matching GPT-3.5 (Mixtral, perhaps?) and matching GPT-4 has been just a few weeks, I wonder if the open-source models have more momentum.
That said, I am very curious what OpenAI has in their labs... Are they actually barely ahead? Or do they have something much better that is not yet public? Perhaps they were waiting for Llama 3 to show it? Exciting times ahead either way!
You've also got to consider that we don't really know where OpenAI are, though. What they have released in the past year have been tweaks to GPT-4, while I am sure the real work is going into GPT-5 or whatever it gets called.
While all the others are catching up, and in some cases edging slightly ahead, I wouldn't be surprised to see a rather large leap back into the lead from OpenAI fairly soon, followed by a scramble as others take some time to get close again. We will really see who has the momentum when OpenAI's next full release lands.
I find it somewhat interesting that there is a common perception of GPT-4 at release being genuinely smart, but that it was gradually nerfed for speed with turbo, which is better tuned but doesn't exhibit the original's intelligence.
There were times when I felt that too, but nowadays I predominantly use turbo. That's probably because turbo is faster and cheaper, but on lmsys turbo sits about 100 Elo points above the original, so by and large people simply find turbo to be... better?
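For a sense of what that gap means: under the standard Elo formula (a sketch of the usual model, not necessarily lmsys's exact methodology), a 100-point advantage corresponds to winning roughly 64% of head-to-head comparisons.

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 100-point gap: A is preferred in roughly 64% of pairwise votes.
p = elo_win_prob(1300, 1200)  # ~0.64
```

So "100 Elo higher" is a consistent but not overwhelming preference, which fits the "better on average, still debatable per-task" experience.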
Nevertheless, I do wonder whether, not just in benchmarks but in how people actually use LLMs, intelligence is somewhat underutilised, or possibly offset by other qualities.
Given the incremental improvement between GPT-4 and its turbo variant, I would weight "vibes" more heavily than this gain on MMLU. OpenAI isn't exactly a very honest or transparent company, and the metric is imperfect. As a longtime user of ChatGPT, I observed that it got markedly worse at coding after the turbo release, specifically in its refusal to complete code as specified.
Have you tried Claude 3 Opus? I've been using that predominantly since release and find its "smarts" as good as or better than my experience with GPT-4 (pre-turbo).
I did. It definitely exudes more all-around personality. Unfortunately, in my private test suite (mostly about coding), it did somewhat worse than turbo or phind 70b.
Since price influences my calculus, I can't say this for sure, but being slightly smarter doesn't seem to be much of an edge, because it's still dumb by human standards. For most non-coding uses (like summarisation) the extra smarts don't make much difference; I find that cheaper options like mistral-large do just as well as Opus.
In the last month I have used Command R+ more and more. Finally had some excuse to write some function calling stuff. I have also been highly impressed by Gemini Pro 1.5 finding technical answers from a dense 650 page pdf manual. I have enjoyed chatting with the WizardLM2 fine-tune for the past few days.
Somehow I haven't quite found a consistent use case for Opus.
I think it might just be a subjective feeling (GPT-4-turbo seeming dumber): the joy is always strongest when you first taste it, and it decays as you get used to it and the bar keeps rising.
This is mostly from OpenAI's technical report[1]. The API performs better, as I said in my previous comment. The API models (0613/0125, etc.) also use user data for training, which could leak the benchmark data.
Divide, not multiply. If a size is estimated at 8-bit, reducing to 4-bit halves the size (and the entropy of each value). It's like the difference between INT_MAX and SHORT_MAX (assuming you have such defs).
I could be wrong too but that’s my understanding. Like float vs half-float.
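To make the arithmetic concrete, here's a minimal sketch of the raw weight-storage calculation (hypothetical parameter count; real quantized files carry some extra overhead for scales/zero-points, which this ignores):

```python
def model_size_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate raw weight storage in GB, ignoring quantization metadata."""
    return n_params * bits_per_param / 8 / 1e9

# e.g. a 70B-parameter model: halving the bit width halves the size.
size_8bit = model_size_gb(70_000_000_000, 8)  # 70.0 GB
size_4bit = model_size_gb(70_000_000_000, 4)  # 35.0 GB
```

Hence "divide, not multiply": going from 8-bit to 4-bit divides the footprint by two.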
[1]: https://deepmind.google/technologies/gemini/