
These big models are getting pumped out like crazy; that is the business of these companies. But basically, it feels like private industry just figured out how to scale up a scalable process (deep learning), and it required not $M research grants but $BB "research grants"/funding. The scaling laws seem fun to play with, and people keep tweaking more interesting things out of these models and finding cool "emergent" behavior as billions of data points get correlated.

But is pumping out models and putting artifacts on HuggingFace actually a business? What are these models being used for? New ones are appearing at a decent clip.



There are a lot of models coming out, but in my view, most don't really matter or move the needle. There are the frontier models, which aren't open (like GPT-4o), and then there are the small "elite" local LLMs like Llama 3 8B. The rest seem mostly aimed at gaming benchmarks; whenever I try them, they are worse in actual practice than the Llama 3 models.


I don’t see any indication this beats Llama 3 70B, yet it still requires a beefy GPU, so I’m not sure what the use case is. I have an A6000 which I use for a lot of things; Mixtral was my go-to until Llama 3 came out, then I switched over.

If you could run this on, say, a stock CPU, that would increase the use cases dramatically, but if you still need a 4090 I’m either missing something or this is useless.


You don't need a 4090 at all. 16-bit requires about 24GB of VRAM; 8-bit quants (99% of the same performance) require only 12GB of VRAM.

That's without the context window, so depending on how much context you want to use you'll need some more GB.

That is, assuming you'll be using llama.cpp (which is the standard for consumer inference; Ollama is also llama.cpp under the hood, as is kobold).

This thing will run fine on a 16GB card, and a q6 quantization will run fine on a 12GB card.

You'll still get good performance on an 8GB card with offloading, since you'll be running most of it on the GPU anyway.
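
A minimal sketch of the arithmetic behind those VRAM numbers (weights only, ignoring the KV cache / context overhead the parent mentions; the effective bits-per-weight figures for the quants are rough assumptions):

    # Back-of-the-envelope VRAM estimate for the weights of a dense 12B model.
    # Effective bits per weight for q6/q4 are approximate (GGUF quants carry
    # some per-block overhead), so treat the output as a rough guide.

    def weight_vram_gib(n_params_billion: float, bits_per_weight: float) -> float:
        """Approximate weight memory in GiB: params * bits / 8 bytes."""
        total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
        return total_bytes / 2**30

    for label, bits in [("fp16", 16), ("q8", 8), ("q6", 6.5), ("q4", 4.5)]:
        print(f"12B @ {label}: ~{weight_vram_gib(12, bits):.1f} GiB")

    # Roughly: fp16 ~22 GiB, q8 ~11 GiB, q6 ~9 GiB, q4 ~6 GiB,
    # which is why q8 fits a 16GB card and q6 fits a 12GB card,
    # with the remainder left over for context.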


Comparing this to 70b doesn't make sense: this is a 12b model, which should easily fit on consumer GPUs. A 70b has to be quantized to near-braindead levels to fit on a consumer GPU; 4-bit is about as small as you can go without serious degradation, and a 70b quantized to 4-bit is still ~35GB before accounting for context space. Even a 4090 can't run a 70b.

Supposedly Mistral NeMo is better than Llama-3-8b, which is the more apt comparison, although benchmarks usually don't tell the full story; we'll see how it does on the LMSYS Chatbot Arena leaderboard. The other (huge) advantage of Mistral NeMo over Llama-3-8b is the massive context window: 128k (and supposedly 1M with RoPE scaling, according to their HF repo), vs 8k.
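
To give a sense of what a long window costs on top of the weights, here's a rough KV-cache estimate. The layer/head numbers below are assumptions for illustration (roughly what a 12B GQA model looks like), not figures from the model card; check the model's config.json:

    # Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens.
    # n_layers / n_kv_heads / head_dim are assumed values for illustration only.

    def kv_cache_gib(ctx_len, n_layers=40, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
        total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len
        return total / 2**30

    for ctx in (8_192, 32_768, 131_072):
        print(f"{ctx:>7} tokens: ~{kv_cache_gib(ctx):.1f} GiB of KV cache")

    # Under these assumptions: ~1.3 GiB at 8k, ~5 GiB at 32k, ~20 GiB at 128k (fp16 cache),
    # so actually using the full 128k window takes serious VRAM even though the weights fit.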

Also, this was trained with 8-bit quantization awareness, so it should handle quantization better than the Llama 3 series in general, which will help more people run it locally. You don't need a 4090.



