One thing that's worth mentioning about llama.cpp wrappers like ollama, LM Studio and Faraday is that they don't yet support[1] sliding window attention, and instead use vanilla causal attention from llama2. As noted in the Mistral 7B paper[2], SWA has some benefits in terms of attention span over regular causal attention.
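For anyone who hasn't seen the difference spelled out, here's a toy sketch of the two masks in plain numpy (the window size here is arbitrary; Mistral 7B's actual sliding window is 4096 tokens):

    import numpy as np

    def causal_mask(seq_len):
        # every position may attend to itself and all earlier positions
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))

    def sliding_window_mask(seq_len, window):
        # every position may attend only to the last `window` positions (itself included)
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        return (j <= i) & (j > i - window)

    print(causal_mask(6).astype(int))
    print(sliding_window_mask(6, window=3).astype(int))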
Disclaimer: I have a competing universal macOS/iOS app[3] that does support SWA with Mistral models (using mlc-llm).
I like your app Private LLM. I run it on an old M1 iPad Pro with 16GB of memory. That said, most of my experiments with integrating LLMs with programs written in Python, Racket, and Common Lisp use Ollama's REST service. I have a Mac Mini with 32GB and it blows my mind how effective an inexpensive, energy-efficient home computer is for running LLMs.
Ah, I tried your app out just the other day. It sounded interesting, but I couldn't see a way to download models (on iOS, anyway), and it seemed quite slow compared to MLC Chat. Perhaps that was the built-in model?
Currently, the iOS app only allows downloading the 7B Llama 2 Uncensored model on iPhone 13 Pro and newer phones and also Apple Silicon iPads. I'm working on an update to allow downloading multiple models from the StableLM 3B, Llama 2 7B and Mistral 7B based models. As before, due to memory and compute constraints, 7B models can only be downloaded on the aforementioned devices.
Just to be clear, Mistral 7B based models work best with sliding window attention. But they seem to have recently disabled[1][2] SWA for Mixtral, their MoE model. I learnt that from this[3] r/LocalLLaMA post yesterday.
Not on iOS. On macOS, I personally think WizardLM 13B v1.2 is a very strong model, and I keep hearing good things about it from users on our Discord and in support emails. Now that there's OmniQuant support for Mixtral models[1], I plan to add support for Mixtral-8x7B-Instruct-v0.1 in the next version of the macOS app; in my tests, it looks like a very good all-purpose model that's also pretty good at coding. It's pretty memory hungry (~41GB of RAM), but that's the price to pay for an uncompromising implementation. Existing quantized implementations quantize the MoE gates, leading to a significant increase in perplexity compared with fp16 inference.
The insane complexity of these ML models lives in the data being handled: billions of numbers in hundreds of tensors. The compute kernels that run on the GPU are surprisingly simple; they're just called many times with different input tensors.
This makes these compute kernels relatively portable across GPU APIs and languages.
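For a sense of scale, a typical kernel like RMSNorm is only a few lines of arithmetic. Here's a reference version in plain numpy (the Metal/CUDA/HLSL versions are essentially the same math, just mapped over GPU threads):

    import numpy as np

    def rmsnorm(x, weight, eps=1e-5):
        # normalize each row by its root mean square, then apply a learned scale
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
        return (x / rms) * weight

    activations = np.random.randn(4, 4096).astype(np.float32)  # a small batch of hidden states
    gain = np.ones(4096, dtype=np.float32)                     # per-channel scale from the checkpoint
    print(rmsnorm(activations, gain).shape)                    # (4, 4096)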
I think that on Windows, Direct3D is often a better tech choice for GPGPU. (1) It runs on GPUs from all vendors. (2) Direct3D is an essential Windows component; the OS even composes desktop windows with D3D11. Unlike CUDA, the D3D runtime libraries are already present in the OS by default, which simplifies software installation and support.
Running Mixtral on a 32GB M3 Max is surprisingly hard in LM Studio... turning on "Apple Metal" mode just crashes it. Otherwise it takes about 10 minutes to load into memory, and inference is super slow.
Anyone got suggestions? Would love for it to run summarization over GBs of ncbi data.
32GB of memory only leaves about 24-26GB for the GPU by default, which is quite low for a larger model like that. For comparison, it runs great on an M2 Max with 96GB.
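Rough back-of-the-envelope for why that's tight (the parameter count and bit widths are ballpark figures, not measurements of any specific GGUF):

    # Mixtral 8x7B has roughly 47B total parameters
    params = 46.7e9
    for bits_per_param in (4.0, 4.5, 5.0):
        weights_gb = params * bits_per_param / 8 / 1e9
        print(f"~{bits_per_param} bits/param -> ~{weights_gb:.0f} GB of weights, before KV cache and overhead")

Even an aggressive 4-bit quant already brushes up against that 24-26GB ceiling once the KV cache and runtime overhead are added.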
I'm using GPT-4 as an assistant for basic computing questions, such as helping with quick shell one-liners or looking up the simplest way to do some manipulation of Rust values (Vec of Result into Result of Vec, which I always forget how to do, etc.).
I wanted to see how local models compare for that use case of mine.
Strange. I remember trying to get this to work on a 16GB machine, and all the comments on a GitHub issue about it said it needs at least 32GB or more.
Try llamafile https://github.com/Mozilla-Ocho/llamafile
I have Mistral 7B running this way on a 10-year-old laptop, and it only seems to use a few GB thanks to its memory-mapping approach.
If it is lazy-loading just what it needs, that seems like an efficient use of memory. In any case, this 4GB model will easily fit into the commenter's 16GB machine.
It’s really just a difference in accounting. Memory used for memory-mapped files isn’t shown under the “used” header, but under the disk cache instead. And it doesn’t need to be swapped out to be discarded, so if you lack the memory, everything just slows down without an obvious cause.
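A quick way to see that accounting for yourself (a toy sketch; the model path is just a placeholder for whatever large local file you have):

    import mmap

    path = "mistral-7b-instruct.Q4_K_M.gguf"  # placeholder: any large local file works
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # the mapping is created instantly; pages are faulted in lazily as bytes are touched,
        # and they show up under the OS disk cache rather than the process's "used" memory
        print(len(mm), mm[:4])
        mm.close()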
I’m sure this question has been asked plenty of times, but what is the best place to start with LLMs (no theory, just installing them locally, fine-tuning them, etc.)?
While true, I'm not a fan of the fact that it's closed source. I use Ollama since it's trivial to swap models on the fly and exposes a very simple REST api.
It also manages to have both command injection vulnerabilities (passing untrusted data directly to a shell) as well as json injection vulnerabilities (templating untrusted data directly into json).
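For anyone who hasn't run into the JSON one before, the failure mode is roughly this (an illustrative sketch, not the project's actual code):

    import json

    prompt = 'hi", "stream": true, "extra": "'   # attacker-controlled text

    # unsafe: template untrusted text straight into a JSON string
    bad = '{"model": "mistral", "prompt": "%s"}' % prompt

    # safe: let the serializer handle quoting and escaping
    good = json.dumps({"model": "mistral", "prompt": prompt})

    print(bad)   # the injected quote adds new keys and changes the request
    print(good)  # the quote is escaped; the prompt stays an ordinary string

The shell version is the same story: pass an argv list instead of interpolating user text into a shell=True command string.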
It is not so bad. I routinely fork processes to integrate useful code with Common Lisp and Racket code. What is a millisecond process startup time between friends :-)
I agree: since Ollama provides a nice REST API, why not use that?
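For reference, the whole integration can be a single POST to the local server; something like this works against Ollama's /api/generate endpoint (the model name is just whatever you've pulled locally):

    import json, urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",       # Ollama's default local port
        data=json.dumps({
            "model": "mistral",                      # any model pulled with `ollama pull`
            "prompt": "Summarize sliding window attention in one sentence.",
            "stream": False,                         # return a single JSON object instead of a stream
        }).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])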
[1]: https://github.com/ggerganov/llama.cpp/issues/3377
[2]: https://arxiv.org/abs/2310.06825
[3]: https://apps.apple.com/us/app/private-llm/id6448106860