Mistral AI Launches New 8x22B MOE Model

freeqaz · 2024-04-10T02:40:20 1712716820

What's the easiest way to run this assuming that you have the weights and the hardware? Even if it's offloading half of the model to RAM, what tool do you use to load this? Ollama? Llama.cpp? Or just import it with some Python library?

Also, what's the best way to benchmark a model to compare it with others? Are there any tools to use off-the-shelf to do that?

fbdab103 · 2024-04-10T02:44:45 1712717085

I think the llamafile[0] system works the best. Binary works on the command line or launches a mini webserver. Llamafile offers builds of Mixtral-8x7B-Instruct, so presumably they may package this one up as well (potentially a quantized format).

You would have to confirm with someone deeper in the ecosystem, but I think you should be able to run this new model as is against a llamafile?

[0] https://github.com/Mozilla-Ocho/llamafile

jart · 2024-04-10T05:35:01 1712727301

llamafile author here. I'm downloading Mixtral 8x22b right now. I can't say for certain it'll work until I try it, but let's keep our fingers crossed! If not, we'll be shipping a release as soon as possible that gets it working.

My recent work optimizing CPU evaluation https://justine.lol/matmul/ may have come at just the right time. Mixtral 8x7b always worked best at Q5_K_M and higher, which is 31GB. So unless you've got 4x GeForce RTX 4090's in your computer, CPU inference is going to be the best chance you've got at running 8x22b at top fidelity.

moffkalast · 2024-04-10T11:16:16 1712747776

Correct me if I'm wrong, but in the tests I've run, the matmul optimizations only have an effect if there's no other blas acceleration. If one can at least offload the KV cache to cublas or run with openblas it's not really used, right? At least I didn't see any speedup in with that config when comparing that PR to the main llama.cpp branch.

jart · 2024-04-10T12:10:33 1712751033

The code that launches my code (see ggml_compute_forward_mul_mat) comes after CLBLAST, Accelerate, and OpenBLAS. The latter take precedence. So if you're not seeing any speedup in enabling them, it's probably because tinyBLAS has reached terms of equality with the BLAS. It's obviously nowhere near as fast as cuBLAS, but maybe PCIE memory transfer overhead explains it. It also really depends on various other factors, like quantization type. For example, the BLAS doesn't support formats like Q4_0 and tinyBLAS does.

noman-land · 2024-04-10T02:52:11 1712717531

+1 on llamafile. You can point it to a custom model.

varunvummadi · 2024-04-10T02:44:00 1712717040

The easiest is to use vllm (https://github.com/vllm-project/vllm) to run it on a Couple of A100's, and you can benchmark this using this library (https://github.com/EleutherAI/lm-evaluation-harness)

sheepscreek · 2024-04-10T22:21:21 1712787681

In that regard, it’s even easier to use one Apple Studio with sufficient RAM and llama.cpp or even PyTorch for inference.

hmottestad · 2024-04-10T16:14:56 1712765696

LM Studio is a great way to test out LLMs on my MacBook: https://lmstudio.ai/

Really easy to search huggingface for new models to test directly in the app.

LeoPanthera · 2024-04-10T19:34:16 1712777656

Make sure you get the prompt template set correctly, the defaults are wrong for a lot of models.

unifer1 · 2024-04-10T23:50:59 1712793059

Could you explain how to do this properly ? I've been having problems with the app and am wondering if this is ehy

LeoPanthera · 2024-04-11T00:23:24 1712795004

Look at the HuggingFace page for the model you are using. (The original page, not the page for the GGUF conversion, if necessary.) This will explain the chat format you need to use.

bevekspldnw · 2024-04-11T15:14:34 1712848474

There is a user called The Bloke on hugging face- they release pre quantized models pretty soon after the full size drop. Just watch their page and pray you can fit the 4 bit in your GPU.

I’m sure they are already working on it.

nathanasmith · 2024-04-11T16:54:55 1712854495

TheBloke stopped uploading in January. There are others that have stepped up though.

bevekspldnw · 2024-04-11T18:20:49 1712859649

Oh really? Who else should I be looking at?

That person is a hero, super bummed!

fzzzy · 2024-04-11T19:38:08 1712864288

TheBloke's grant ran out.

MPSimmons · 2024-04-11T17:18:52 1712855932

I think 4b for this is support to be over 70GB, so definitely still heavy hardware.

bevekspldnw · 2024-04-11T18:40:07 1712860807

Fucking hell, my A6000 is shy of that and I can’t reasonably justify picking up a second.

mritchie712 · 2024-04-10T14:08:40 1712758120

you can try it on together here:

https://api.together.xyz/playground/language/mistralai/Mixtr...

SushiHippie · 2024-04-10T02:02:06 1712714526

[dupe] https://news.ycombinator.com/item?id=39986047

Which has the link to the tweet instead of the profile:

https://twitter.com/MistralAI/status/1777869263778291896

mlsu · 2024-04-10T01:40:35 1712713235

8x22b. If this is as good as Mixtral 8x7b we are in for a wonderful time.

cchance · 2024-04-10T01:57:21 1712714241

I've heard command-r is first opensource to beat gpt4 in benchmarks

jxy · 2024-04-10T03:12:30 1712718750

It's "Command R+". "Command R" is a smaller model.

varunvummadi · 2024-04-10T02:02:23 1712714543

It beats the old GPT4 version in lmsys benchmark you can check it out here https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar... but Command R is commercially licensed We can assume that mistral will do a better job.

skissane · 2024-04-10T03:11:54 1712718714

> but Command R is commercially licensed

It is licensed under CC-BY-NC-4.0. That license means you are free to use, modify and redistribute it, so long as you aren't doing so "commercially". What exactly counts as "commercial" use is a complex legal question, and the answer may vary from jurisdiction to jurisdiction (different courts may interpret the phrase differently). But, for example, if you are just using it at home for private experimentation on your own personal time, with no plans to make money from doing so (whether now or in the future), I think pretty much everyone will agree that counts as "non-commercial".

Other cases – e.g., if a government agency uses the software to provide some government function, is that "non-commercial"? – are far less clear. Those are really the kind of questions you need to ask a lawyer (which I am not).

ryao · 2024-04-10T04:24:38 1712723078

I am not a lawyer, but lately, I have been wondering whether the contra proferentem rule interacts with these licenses.

column · 2024-04-11T14:19:21 1712845161

For anybody else not in the know : "Contra proferentem is a legal principle that suggests when there is ambiguity in the terms of a contract, the ambiguity should be resolved against the party that drafted the contract."

skissane · 2024-04-10T04:59:10 1712725150

I think the correct answer to your question is almost certainly some combination of "it depends on the jurisdiction" and also (in many cases) "nobody can be entirely sure because no court has considered the issue yet"

There have been a handful of court decisions on what "non-commercial" use means – the Creative Commons legal case database records [0] records three cases involving non-commercial CC licenses in the US, one in Belgium, one in Israel, plus I also know of one in Germany [1] which their database seems to be missing. I don't know if any of them addressed the contra proferentem rule which you mention.

The German and US cases on this topic appear contradictory – from what I understand, the German case assumed that all government use is commercial, interpreting "non-commercial" to basically mean "private home use", whereas two of the US cases (Great Minds v FedEx Office and Great Minds v Office Depot) were about use by commercial entities acting under contract to public school districts, and the holdings of those cases assume that government-operated schools are "non-commercial" (and furthermore, the commercial entities were engaging in "non-commercial" use, even though they were acting commercially, because they were doing so on behalf of a "non-commercial" customer). That said, all these cases have somewhat limited precedential value – the US cases are binding precedent in two federal judicial circuits (2nd and 9th) but have merely persuasive value in the remainder of the US; I don't know what the ultimate outcome of the German case was (Deutschlandradio said they were going to appeal but I don't know if they did and what the outcome was if they did), and German law doesn't view precedent as "binding" in quite the same sense that common law systems do anyway

[0] https://legaldb.creativecommons.org/cases/?keywords=&tags%5B...

[1] https://www.techdirt.com/2014/03/27/german-court-says-creati... and if you can read German, here is the actual court judgement: https://netzpolitik.org/wp-upload/OLG-K%C3%B6ln-CC-NC-Entsch...

refulgentis · 2024-04-11T01:19:33 1712798373

I have a weird problem where I want to charge per month for you to use my app that allows you to use N different paid models and any llama.cpp model you want. Im curious if you have any thoughts in what situation I'm in if it's one of 5 built in local options highlighted in the app

Morally I feel 100% fine because the app would be just as appealing without it, and subscribing means you get sync, you could theoretically not pay me and use Command R

cyanydeez · 2024-04-10T22:56:24 1712789784

The His website tends to move towards things that can make money.

That's typically synonymous with commercial.

moralestapia · 2024-04-10T02:29:41 1712716181

You mean better, right?

Why would you want another 8x7b, if you already have it ...

nazka · 2024-04-10T06:34:35 1712730875

Out of topic but are we now back at the same performance than ChatGPT 4 at the time people said it worked like magic (meaning before the nerf to make it more politically correct but making his performance crash)?

hmottestad · 2024-04-10T16:19:49 1712765989

I’ve been testing a lot of LLMs on my MacBook and I would say that all of them are far away from being as good as GPT-4, at any time. Many are as good as GPT-3 though. There are also a lot of models that are fine tuned for specific tasks.

Language support is one big thing that is missing from open models. I’ve only found one model that can do anything useful with Norwegian, which has never been an issue GPT-4.

Eisenstein · 2024-04-11T17:35:54 1712856954

Which ones have you tested? There were some huge ones released recently.

hmottestad · 2024-04-12T07:26:22 1712906782

Samantha, llama 2 pubmed, marcoroni, openchat, fashiongpt, falcon 180B, deepseek llm chat, orca 2, orca 2 alpac uncersored, meditron, tigerbot, mixtral instruct, wizardcoder, gemma, nouse hermes 2 solar, yarn solar 64k, nouse hermes 2 yi, nous hermes 2 mixtral, nouse hermes llama 2, starcode2, hermes 2 pro mistral, norskgpt mistral and norskgpt llama.

Nouse Hermes 2 Solar is the best model for Norwegian that I've tried so far. It's much better than NorskGPT Mistral/Llama. I actually got it to make fairly decent summaries of news articles, though it wouldn't follow any stricter commands like producing 5 keywords in a json list. Kept producing more than 5 keywords and if I doubled down on the restriction on the number of keywords it would start messing up the json.

The best competitor to GPT-4 was falcon 180b, it's still terrible compared to GPT-4. Mixtral is my new favourite though, it's faster than falcon and in general as good or better. Though I would still pick GPT-4 over Mixtral any day of the week, it's leagues ahead of Mixtral.

Tigerbot has a very interesting trait. It tends to disagree when you try to convince it that it's wrong.

I haven't been able to test out the new 8x22 mixtral or command r plus. These are the next ones on my list!

hmottestad · 2024-04-12T10:32:57 1712917977

Just tested out Command R+ with some niche SHACL constraint questions and it performs considerably worse than GTP-4. Might be a bit better than GPT-3.5 though, which is actually pretty amazing.

Eisenstein · 2024-04-13T17:13:07 1713028387

You need to use their beginning and end token scheme and set rep pen to 1 to get good quality out of cr+.

segmondy · 2024-04-10T09:45:03 1712742303

With open models, yes we are at the performance of at least the first release of ChatGPT 4.

sp332 · 2024-04-10T22:02:00 1712786520

Could you recommend one or a few in particular?

sanjiwatsuki · 2024-04-10T22:58:00 1712789880

The current best open weights model is probably Cohere Command-R+. The memory requirements on it are quite high, though.

bevekspldnw · 2024-04-11T18:18:59 1712859539

I really want to see some benchmarks with performance weighted by energy use. I think Mistral 7B performance to watt would be the leader by a huge margin. On many tasks I get equal performance on zero shot classification tasks on Mistral than in bigger models.

zmmmmm · 2024-04-10T02:08:30 1712714910

A pre-Llama3 race for everyone to get their best small models on the table?

moffkalast · 2024-04-10T11:19:28 1712747968

262 GB is not exactly small. But yes it seems they're all getting them out the door in case they end up being worse than llama-3 in which case it'll be too embarrassing to release later.

hmottestad · 2024-04-10T16:23:51 1712766231

Since it’s a MOE model it will only need to load a few of the 8 sub models into vram in order to answer a query. So it may look large, but I think a quantized model will easily fit on a Mac with 64GB of memory and maybe even a bit fewer bits and it’ll fit into 32GB.

I think it might be the end for 24GB 4090 cards though :(

dragonwriter · 2024-04-11T17:03:27 1712855007

MOE models don’t, in practice, selectively load experts on activation (and if a runtime for them could be designed that would do that, it would make them perform worse, since the experts activated may differ from token to token, so you’d be churning a whole lot swapping portions of the model into and out of VRAM.) But they do less computation per token for their size than monolithic so you can often get tolerable performance on CPU or split between GPU/CPU at a ratio that would work poorly with a similarly-sized monolithic model.

But, still, its going to need 262GB for weights + a variable amount based on context without quantization, and 66GB+ at 4-bit quantization.

brandall10 · 2024-04-10T19:40:39 1712778039

Unless something has changed, it needs to load the full 8 models at the same time. During inference it performs like a 2 x base model.

Mixtral 7B @ 5 bit takes up over 30gb on my M3 Max. That's over 90 for this at the same quantization. Realistically you probably need a 128gb machine to run this with good results.

fzzzy · 2024-04-10T23:44:36 1712792676

A 4 bit quant of the new one would still be about 70 gb, so yeah. Gonna need a lot more ram.

Kubuxu · 2024-04-11T17:27:01 1712856421

The 8x is misleading; there are 8 sets of weights (experts) per token and per layer. If it is similar to the previous MoE Mistral models, then two experts get activated per token per layer. This reduces the amount of compute and memory bandwidth you need to perform inference but doesn't reduce the amount of memory you need as you cannot load the experts into GPU memory on demand without performance impact.

mark_l_watson · 2024-04-11T12:08:45 1712837325

I think you are an optimist here. I can barely run mixtral-8x-7B on my M2 Pro 32G Mac, but I am grateful to be able to run it at all.

JanisErdmanis · 2024-04-11T12:38:40 1712839120

Which quantization level are you using?

mark_l_watson · 2024-04-11T22:12:17 1712873537

Q2, so not so great. I usually run other models. I would be embarrassed to tell you how long my “ollama list” is.

swyx · 2024-04-10T03:05:22 1712718322

this is likely v true given llama 3 rumored to release in next 2 weeks

nen-nomad · 2024-04-10T01:53:31 1712714011

Mixtral 8x7b has been good to work with, and I am looking forward to trying this one as well.

ZeljkoS · 2024-04-11T13:28:25 1712842105

Here is the unofficial benchmark: https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1/...

bevekspldnw · 2024-04-11T15:15:58 1712848558

Wish it had GPT-4, that’s the one to beat still.

GuB-42 · 2024-04-11T16:13:55 1712852035

It is there, not for all the benchmarks, but for those where it is included, GPT-4 scores much higher.

Not surprising since GPT-4 is still state-of-the-art and much bigger. Where Mistral has been particularly impressive is when you take the size of the model into account.

mirekrusin · 2024-04-11T17:23:20 1712856200

GPT-4 is instruct tuned model, of course it's going to score higher, apples and oranges.

bevekspldnw · 2024-04-11T18:20:02 1712859602

Yeah and the instruct tunes provided by Mistral on other models are pretty great.

deoxykev · 2024-04-10T03:09:24 1712718564

4 bit quants should require 85GB VRAM, so this will fit nicely on 4x 24G consumer GPUs, plus some leftover for KV cache optimization.

qeternity · 2024-04-10T10:12:37 1712743957

4bit should take up less than this, there are quite a few shared parameters between experts.

But unless you’re running bs=1 it will be painful vs 8x GPU as you’re almost certain to be activating most/all of the experts in a batch.

hedgehog · 2024-04-10T03:58:14 1712721494

I've found the 2 bit quant of Mixtral 8x7B is usable for some purposes with an 8GB GPU. I'm curious how this new model will work in similar cheap 8-16GB GPU configurations.

reissbaker · 2024-04-10T08:43:31 1712738611

16GB will be way too small unfortunately — this has over 3x the param count, so at best you're looking at a 24GB card with extreme 2bit quantization.

Really though if you're just looking to run models personally and not finetune (which requires monstrous amounts of VRAM), Macs are the way to go for this kind of mega model: Macs have unified memory between the GPU and CPU, and you can buy them with a lot of RAM. It'll be cheaper than trying to buy enough GPU VRAM. A Mac Studio with 192GB unified RAM is under $6k — two A6000s will run you over $9k and still only give you 96GB VRAM (and God help you if you try to build the equivalent system out of 4090s or A100s/H100s).

Or just rent the GPU time as needed from cloud providers like RunPod, although that may or may not be what you're looking for.

timschmidt · 2024-04-10T15:09:49 1712761789

Reasonably priced Epyc systems with up to 12 memory channels and support for several TB of system memory are now available. Used datacenter hardware is even less expensive. They are on par with the memory bandwidth available to any one of the CPU, GPU, or NPU in the highest end Macs, but capable of driving MUCH more memory. And much simpler to run Linux or Windows on.

reissbaker · 2024-04-11T01:21:51 1712798511

I would be very curious to see pricing on Epyc systems with terabytes of RAM that cost less than $6k including the RAM...

timschmidt · 2024-04-11T04:32:08 1712809928

Well the motherboard and CPU can be had for $1450. As they're built around standard cases and power supplies and storage, many folks like me will have those already - far less costly than buying the same from Apple if you don't. Spend what you want on ram, unlike with Apple, you can upgrade it any time.

Can't reuse my old parts on a brand new Mac, or upgrade it later if I find I need more. Lock-in is rough.

https://www.ebay.com/itm/315029731825?itmmeta=01HV561YV4AJG5...

Manabu-eo · 2024-04-11T20:31:10 1712867470

Note that this is a "QS" CPU, very likely B0 stepping ES by posts elsewhere. A new one of those is around 3k USD alone. The 16-core can be had for around $1200 however and the board for $780.

12x32=384 GB of RAM seems to be about $1400 right now. Going for less capacity don't save that much, unlike the insanely marked up apple memory. And then you need the CPU heatsink for $130.

timschmidt · 2024-04-12T00:42:32 1712882552

The errata isn't particularly scary: https://www.amd.com/content/dam/amd/en/documents/epyc-techni...

Manabu-eo · 2024-04-12T19:08:20 1712948900

I'm only seeing the errata for the B1 stepping there, not the B0 stepping that are those "QS".

timschmidt · 2024-04-13T00:31:46 1712968306

Good catch! I'd missed that. Still, experiences of folks in the level1tech forums are positive: https://forum.level1techs.com/t/genoa-9654-qs-experiences/19...

hmottestad · 2024-04-10T16:28:57 1712766537

Do you have any feel for the performance compared to the M3 Max?

Manabu-eo · 2024-04-10T18:57:07 1712775427

LLM inference is mostly memory bound. An 12-channel Epyc Genoa with 4800MT/s DDR5 ram clocks at 460.8 GB/sec. It's more than the 400GB/s of the M3 Max, only part of that accessible for the CPU.

And on the Epyc System you can plug much more memory for when you need larger memory and PCI-E gpus, for when you need less faster memory.

Threadripper PRO is only 8-channel, but with memory overclocking it might reach numbers similar to those too.

hedgehog · 2024-04-10T20:35:25 1712781325

I'm curious how the newer consumer Ryzens might fare. With LPDDR5X they have >100 GB/s memory bandwidth and the GPUs have been improved quite a bit (16 TFLOPS FP16 nominal in the 780M). There are likely all kinds of software problems but setting that aside the perf/$ and perf/watt might be decent.

cjbprime · 2024-04-11T04:49:20 1712810960

Consumer Ryzens only have two-channel memory controllers. Two dual-rank (double sided) DIMMs per channel, which you would need to use to get enough RAM for LLMs, drops the memory bandwidth dramatically -- almost all the way back down to DDR4 speeds.

timschmidt · 2024-04-11T07:03:12 1712818992

Yup. Strix Halo will change this, with a 256bit memory bus (4 channel) which CPU and GPU have access to. However it is only likely to be available in laptop designs and probably with soldered-down RAM to reduce timing and board space issues. So it won't be easy to get enough memory for large LLMs with either. But it should be faster than previous models for LLM work.

hedgehog · 2024-04-11T20:36:21 1712867781

For consumer Ryzen to pencil out it would require a cluster of APU-equipped machines with the model striped across them. Given say 16GB of model per machine and 60GBps actual memory bandwidth @ $500 it's favorable vs A100s if the software is workable (which my guess is it's not today due to AMD's spotty support). This is for inference, training probably would be too slow due to interconnect overhead.

sliken · 2024-04-11T03:40:09 1712806809

If you Epyc's are too pricey, there's the Threadripper pro, 8 channels. AMD Siena/8000 series with 6 channels, and and Threadripper with 4 channels.

hmottestad · 2024-04-12T10:26:53 1712917613

That's interesting. It's about the same speed as the M3 Max then.

Have you tested it yourself?

Manabu-eo · 2024-04-18T14:09:39 1713449379

Nope, but this guy has a similar build: https://www.reddit.com/r/LocalLLaMA/comments/1bt8kc9/compari...

It seems to reach only a little above half the theoretical speed, and scale only up to 32 threads for some reason. Might be a temporary software limitation or something more fundamental.

timschmidt · 2024-04-12T10:48:27 1712918907

Should be at least twice the speed of the M3 Max, as the M3 CPU or GPU only get about half the memory bandwidth available to the package each. M3 Max can't take full advantage of it's memory bandwidth unless CPU, GPU, and NPU are all working at the same time.

hmottestad · 2024-04-12T16:45:01 1712940301

I tried looking for some info on this but could only find the M1 Max review over at anandtech that managed to push 200 GB/s when using multiple cores on the CPU, but couldn’t really get any numbers for just the GPU that seemed realistic.

Do you have a source for the GPU only having access to half the bandwidth of the memory?

dannyw · 2024-04-10T13:54:25 1712757265

You can QLoRA decent models on 24GB VRAM. There’s also optimised kernels like Unsloth that are really VRAM efficient and good for hobbyists.

reissbaker · 2024-04-11T01:32:25 1712799145

Yes, but I still don't think you'll be able to run Mixtral 8x22b with 16GB VRAM, or QLoRA it, even with Unsloth. It's much bigger than the original Mixtral.

aydyn · 2024-04-10T19:45:02 1712778302

AFAIK, 2-bit quant leads to too much loss of performance, such that you're better off using a different smaller model altogether. See here:

https://www.reddit.com/r/LocalLLaMA/comments/18ituzh/mixtral...

cjbprime · 2024-04-10T06:39:28 1712731168

Wouldn't expect that to work at all.

hedgehog · 2024-04-10T15:06:02 1712761562

Ollama (which wraps llama.cpp) supports splitting a model across devices so you get some acceleration even on models too big to fit entirely in GPU memory.

zone411 · 2024-04-11T16:33:44 1712853224

Very important to note that this is a base model, not an instruct model. Instruct fine-tuned models are what's useful for chat.

haolez · 2024-04-11T17:02:28 1712854948

What's the feeling of playing with a powerful base model? Will it just complete the prompt text like a continuation of it?

MPSimmons · 2024-04-11T17:16:05 1712855765

Generally, yes, it literally just tries to predict the next token again and again and again.

This model is apparently surprisingly good at chat, even though it is a base model, and will take part it it to some extent. It should be really interesting once it's fine-tuned.

talsperre · 2024-04-10T02:30:20 1712716220

Right on time as LLama 3 is released.

jimmySixDOF · 2024-04-10T09:00:15 1712739615

And the same day Google Gemini Pro gets almost complete open long context multimodal access and OpenAI upgrade to GPT4-Turbo it was a big day in general for news drops that's for sure!

abdullahkhalids · 2024-04-10T02:20:45 1712715645

Why are some of their models open, and others closed? What is their strategy?

Jackson__ · 2024-04-10T06:02:44 1712728964

My personal speculation is that their closed models are based on other companies' models.

For example on EQbench[0], Miqu[1], a leaked continued pretrain based on LLama2, performs extremely similar to the mistral medium model their API offers.

Maybe they're thinking it'd be bad PR for them to release models they didn't create from scratch, or there is some contractual obligation preventing the release.

[0]https://eqbench.com/index.html

[1]https://huggingface.co/miqudev/miqu-1-70b

moffkalast · 2024-04-10T11:22:08 1712748128

That's quite likely, some have also speculated that Mistral 7B got some EU grant funding that stipulated it had to be openly released later, and Mixtral is based on Mistral 7B so it would likely be subject to the same terms. I haven't found any source to substantiate it though.

unraveller · 2024-04-10T03:12:12 1712718732

Mistral have stated they want to chase the fine-tune dollar to support le research. We should get thrown a bone of hard to tune mid-range stuff occasionally. Especially when big announcements about small models are expected later in the week (llama3) or when haiku is stealing the thunder from mixtral 8x7b.

kvmet · 2024-04-10T02:23:09 1712715789

It's gotta be either perceived value or training data/licensing restrictions.

blackeyeblitzar · 2024-04-10T02:27:35 1712716055

I am not sure why some are open and some are closed - if I had to speculate, it’s perhaps that the commercial models help fund the team. They come with safety features built-in as well as API-based access (instead of needing to self-host). They word their mission (https://mistral.ai/company/#missions) as follows:

> Our mission is to make frontier AI ubiquitous, and to provide tailor-made AI to all the builders. This requires fierce independence, strong commitment to open, portable and customisable solutions, and an extreme focus on shipping the most advanced technology in limited time.

wkat4242 · 2024-04-11T08:05:49 1712822749

Weird, the last post I see at that link is from the 8th of December 2023 and it's not about this.

Edit: Ah, it's the wrong link. https://news.ycombinator.com/item?id=39986047

Thanks SushiHippie!

intellectronica · 2024-04-11T12:54:31 1712840071

It's weird that more than a day after the weights dropped, there still isn't a proper announcement from Mistral with a model card. Nor is it available on Mistral's own platform.

tosh · 2024-04-11T13:09:42 1712840982

at least they confirmed it is Apache 2.0

https://twitter.com/arthurmensch/status/1778308399144333411

ein0p · 2024-04-10T02:48:32 1712717312

To this day 8x7b Mixtral remains the best model you can run on a single 48GB GPU. This has the potential to become the best model you can run on two such GPUs, or on an MBP with maxed out RAM, when 4-bit quantized.

ryao · 2024-04-11T02:33:33 1712802813

I am looking forward to the pricing of those dropping. It is a shame that high memory graphics cards are not mainstream.

rspoerri · 2024-04-10T07:21:46 1712733706

I hope i get it to run on my 96gb m2 in q4.

rspoerri · 2024-04-10T17:11:41 1712769101

It actually does, in case anybody wonders. But it seems as if it's not fine tuned to chat, or i'm doing it wrong at the moment. Getting a lot of duplicates and non useful answers.

cyanydeez · 2024-04-10T23:02:27 1712790147

They might've tweaked the prompt tokens.

noman-land · 2024-04-10T02:51:13 1712717473

My first thought was how much RAM? Will it work on 64GB M1?

jwitthuhn · 2024-04-10T02:57:57 1712717877

It is ~260GB with presumably fp16 weights. Should fit into 64GB at 3-bit quantization (~49GB).

Edit: To add to this, I've had good luck getting solid output out of mixtral 8x7b at 3-bit, so that isn't small enough to completely kill the model's quality.

wkat4242 · 2024-04-11T08:15:31 1712823331

I wonder, can you quantize it yourself with some tool?

pja · 2024-04-11T13:54:58 1712843698

llama.cpp can quantize a model for you:

https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#qu...

wkat4242 · 2024-04-11T17:08:05 1712855285

Thanks!!

ein0p · 2024-04-10T02:54:45 1712717685

Nope. Just the weights would take 88GB at 4 bit. 128GB MBP ought to be able to run it. If I were to guess, a version for Apple MLX should be available within a few days, for those of us fortunate enough to own such a thing.

Art9681 · 2024-04-11T12:13:41 1712837621

It’s already available. I had it running yesterday morning in an M3 MAX 128GB. I get about 6tps.

https://www.reddit.com/r/LocalLLaMA/s/MSsrqWHYga

varunvummadi · 2024-04-10T01:33:33 1712712813

They Just announced their new model on Twitter, which you can download using torrent

aurareturn · 2024-04-10T07:26:39 1712733999

Might be a dumb question but does this mean this model has 176B params?

idiliv · 2024-04-10T07:52:42 1712735562

In Mixtral 8x7B, the 8 means that the model uses Mixture-of-Experts (MoE) layers with 8 experts. The 7B means that if you were to remove 7 of the 8 experts in each layer, then you would end up with a 7B model (which would have exactly the same architecture as Mistral 7B). Therefore, a 1x7B model has 7B params. An 8x7B model has 1 * 7B + (8-1) * sz_expert params, where sz_expert is some constant value that the MoE layers increase by when adding one expert. In the case of Mixtral 8x7B the model size is 46.3GB, so, sz_expert ≈ 5.6B.

If these assumptions port over to 8x22B, then 8x22B has, at 281GB, sz_expert ≈ 13.8B.

KTibow · 2024-04-10T12:24:34 1712751874

I tried to check this for myself.

I agreed for the first one, (46.3 - 7) / 7 = 5.61b.

The second one doesn't match up, (281 - 22) / 7 = 37b or (140.5 - 22) / 7 = 16.92b. Am I doing something wrong?

idiliv · 2024-04-10T20:09:54 1712779794

Just tried this again and I also arrive at 16.92B. Not sure what I did wrong the first time, thanks for double-checking this!

idiliv · 2024-04-10T07:55:42 1712735742

Oh, and to answer your actual question: Assuming that the model is released with 16 bits per parameter, then it as 281GB / 16 bit = 140.5 parameters.

hovering_nox · 2024-04-10T07:28:50 1712734130

8x7 had 46B or so.

resource_waste · 2024-04-11T16:09:26 1712851766

What is the excitement around models that arent as good as llama?

This is clearly an inferior model that they are willing to share for marketing purposes.

If it was an improvement over llama, sure, but it seems like just an ad for bad AI.

Me1000 · 2024-04-11T19:31:56 1712863916

Mixtral 7x8b was way better than llama2 70b and used less RAM and compute at the same time. This model is way better than llama.

In fact I would go as far as saying llama2 isn’t that good compared to some of the most recent models.

jeppebemad · 2024-04-11T16:49:13 1712854153

We use their earlier Mixtral model because it outperforms llama for our use case. They do not release full models for marketing purposes, though it definitely grabs attention! You may need to revise your views..

cma · 2024-04-11T16:20:26 1712852426

It beats llama on the benchmark posted below (though maybe leaked into training data). But also you can run it on cheaper split up hardware with less individual vram than the big llama.

zone411 · 2024-04-11T16:30:03 1712853003

What makes it you think it's not as good as LLaMA? It's likely much better. There are multiple open-weight models that are better than LLaMA 2 out there already.

swalsh · 2024-04-10T01:33:32 1712712812

Is this Mistral large?

Jackson__ · 2024-04-10T06:04:47 1712729087

Unlikely, this model has a max sequence length of 65k, while mistral large is 32k.

varunvummadi · 2024-04-10T01:33:58 1712712838

Not sure trying to download the torrent and checking it out

fbdab103 · 2024-04-10T02:54:58 1712717698

For those of us without twitter, how many GB is the model?

KTibow · 2024-04-10T03:01:00 1712718060

(hope this isn't against rules but) If you don't have Twitter, the magnet link is

  magnet:?xt=urn:btih:9238b09245d0d8cd915be09927769d5f7584c1c9&dn=mixtral-8x22b&tr=udp%3A%2F%http://2Fopen.demonii.com%3A1337%2Fannounce&tr=http%3A%2F%http://2Ftracker.opentrackr.org%3A1337%2Fannounce

confused_boner · 2024-04-10T03:06:32 1712718392

262 gb

fbdab103 · 2024-04-10T03:20:11 1712719211

Ooof. I really need to pick up another HD, these model sizes are killer.

Lacking a godly GPU, I will probably hold off for a quanitized version which has the potential to run okish on CPU or my modest GPU, but really appreciate the info.

stainablesteel · 2024-04-11T15:43:48 1712850228

has anyone had success making an auto-gpt concept for mistral/llama models? i haven't found one

dkasper · 2024-04-11T15:58:10 1712851090

Has anyone had success making an auto-gpt with any models? Besides toy use cases

danenania · 2024-04-11T16:04:06 1712851446

I built one using GPT-4[1]. It's not perfect but is working quite well and is now being used by hundreds of users, apart from me, to work on real, non-toy tasks. For example, I used it to build most of a production-ready AWS infrastructure (and accompanying deploy script) with the AWS CDK.

I want to add Mistral support soon, probably via together.ai or a similar service.

1 - https://github.com/plandex-ai/plandex

Filligree · 2024-04-11T16:06:24 1712851584

Your link is broken.

danenania · 2024-04-11T16:07:43 1712851663

Sorry, just fixed it.

freeqaz · 2024-04-11T16:07:27 1712851647

Your link 404s fyi

angilly · 2024-04-10T02:21:46 1712715706

The lack of a corresponding announcement on their blog makes me worry about a Twitter account compromise and a malicious model. Any way to verify it’s really from them?

simonw · 2024-04-10T03:16:11 1712718971

Their https://twitter.com/MistralAI account has 5 tweets since the account opened, three of which were model release magnet links.

https://twitter.com/MistralAILabs is their other Twitter account, which is very slightly more useful though still very low traffic.

swyx · 2024-04-10T02:22:08 1712715728

you must be new to mistral releases. they invented the magnet first blog later meta

angilly · 2024-04-10T02:24:44 1712715884

At 3:30a France local? Alrighty. I still wait a lil bit ;)

moralestapia · 2024-04-10T02:28:12 1712716092

What could a malicious model do, though? Curse at you?

Teever · 2024-04-10T02:33:04 1712716384

https://arstechnica.com/security/2024/03/hugging-face-the-gi...

Tiberium · 2024-04-10T02:40:31 1712716831

Not .safetensors though

Aissen · 2024-04-10T08:29:17 1712737757

Exploit a memory safety issue in the tokenizer/or other parts of your LLM infra written in a native language.

moralestapia · 2024-04-10T11:57:46 1712750266

??? With weights?

fzzzy · 2024-04-10T23:55:19 1712793319

There was a buffer overflow or some other exploit like that in llama.cpp and the gguf format. It has been fixed now, but it's definitely possible. Also weights distributed as python pickles can run arbitrary code.

bevekspldnw · 2024-04-11T02:33:47 1712802827

Distributing anything as python pickles seems utterly batshit to me.

fzzzy · 2024-04-11T10:46:23 1712832383

Completely agree.

abound · 2024-04-10T13:10:41 1712754641

There are plenty of exploits where the payload is just "data" read by some vulnerable program (PDF readers, image viewers, browsers, compression tools, messaging apps, etc)

sp332 · 2024-04-10T22:18:12 1712787492

Yes, there's a reason weights are now distributed as "safetensors" files. Malicious weights files in the old formats are possible, and while I haven't seen evidence of the new format being exploitable, I wouldn't be surprised if someone figures out how to do it eventually.

llm_trw · 2024-04-10T02:29:26 1712716166

This is how they released every model so far.

tjtang2019 · 2024-04-10T02:19:47 1712715587

What are the advantages compared to GPT? Looking forward to using it!

qball · 2024-04-11T16:52:46 1712854366

>What are the advantages compared to GPT?

It actually does what you tell it, and won't try to silently change your prompt to conform to a specific flavor of Californian hysterics, which is what OpenAI's products do.

Also, since it's a local model, your queries aren't being datamined nor can access to the service be revoked on a whim.