
It’s great that this can run on a laptop but FWIW, the Llama 70B model is nowhere near “GPT-4 class” in my own use cases. 405B might be, though I haven’t tested it.


Are you sure about that?

When I say GPT-4 class I'm talking about being comparable to the GPT-4 that was released in March 2023.

The Llama 3.3 70B model is clearly nowhere near as good as today's GPT-4o family of models, or the other top-ranking models today like Gemini 1.5 Pro and Claude 3.5 Sonnet.

To my surprise, Llama 3.3 70B is ranking higher than Claude 3 Opus on https://livebench.ai/ - I'm suspicious of that result, personally. I think Opus was the best available model for a few months earlier this year.


I guess it's because it has the highest instruction-following score of all models, 20 points higher than Opus, which compensates for shortcomings elsewhere (e.g. in language) and wouldn't necessarily translate to human evaluation of usefulness.


Wow, yeah I think you're right - 3.3 somehow gets top position on the entire leaderboard for that category, I bet that skews the average up a lot.


The model you are running isn't the one used in the benchmarks you link.

The default llama3.3 model in ollama is heavily quantized (~4 bit). Running the full fp16 model, or even an 8-bit quant, wouldn't be possible on your laptop with 64 GB of RAM.
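The back-of-envelope math supports this. A rough sketch (weights only; the real footprint is a bit higher because it ignores the KV cache and runtime overhead, and the ~4.5 bits-per-weight figure for a typical 4-bit quant is an assumption, not an exact value for ollama's default):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory in GB needed to hold the model weights alone."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 70B parameters at various quantization levels
for label, bits in [("fp16", 16), ("8-bit", 8), ("~4-bit quant", 4.5)]:
    print(f"{label:14s} ~{weight_memory_gb(70, bits):.0f} GB")
# fp16           ~140 GB
# 8-bit          ~70 GB
# ~4-bit quant   ~39 GB
```

So fp16 (~140 GB) and even 8-bit (~70 GB) blow past 64 GB before you account for anything else, while a ~4-bit quant (~39 GB) leaves headroom for the KV cache and the OS.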


Thanks - yeah, I should have mentioned that. I just added a note directly above this heading https://simonwillison.net/2024/Dec/9/llama-33-70b/#honorable...


How do you reliably compare it with the GPT-4 released in March 2023?


Vibes, based on what I can remember using that model for.

There's still a gpt-4 model available via the OpenAI API, but it's gpt-4-0613 from June 2023 - the March 2023 snapshot gpt-4-0314 is no longer available.

I ran one of my test prompts against that old June 2023 GPT-4 model here: https://gist.github.com/simonw/de4951452df2677f2a1a3cd415168...

I'm not going to try for an extensive evaluation comparing it with Llama 3.3 though - life's too short, and that's already been done better than I could by https://livebench.ai/


Why not ask it to solve math questions?

The bar for GPT-4 was so low that unambiguously clearing that threshold should be pretty easy.


I am not particularly interested in those benchmarks that deliberately expose weaknesses in models: I know that models have weaknesses already!

What I care about is the things that they're proven to be good at - can I do those kinds of things (RAG, summarization, code generation, language translation) directly on my laptop?


The new 3.3 70B model has comparable benchmarks to the 405B model, which is probably what people mean by GPT-4 class.


> when I ran Llama 3.3 70B on the same laptop for the first time.

There is no Llama 3.3 405B to test - 3.3 only comes in 70B. Are you sure you aren't thinking of Llama 3 or 3.1?


No, I meant Llama 3.3 70B.



