Run Mistral 7B on M1 Mac (wandb.ai)
111 points by byyoung3 on Dec 16, 2023 | 51 comments


One thing that's worth mentioning about llama.cpp wrappers like ollama, LM Studio and Faraday is that they don't yet support[1] sliding window attention, and instead use vanilla causal attention from llama2. As noted in the Mistral 7B paper[2], SWA has some benefits in terms of attention span over regular causal attention.
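
To make the difference concrete, here's a minimal sketch (mine, not from the paper) of a vanilla causal mask versus a sliding window mask; the sequence length and window size are arbitrary illustration values:

  # Illustrative only: causal attention lets token i attend to every earlier
  # token, while SWA restricts it to the last `window` tokens.
  import numpy as np

  def causal_mask(n: int) -> np.ndarray:
      # token i may attend to all tokens j <= i
      return np.tril(np.ones((n, n), dtype=bool))

  def sliding_window_mask(n: int, window: int) -> np.ndarray:
      # token i may attend only to tokens j with i - window < j <= i
      i = np.arange(n)[:, None]
      j = np.arange(n)[None, :]
      return (j <= i) & (j > i - window)

  n, window = 8, 4
  print(causal_mask(n).astype(int))
  print(sliding_window_mask(n, window).astype(int))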

Disclaimer: I have a competing universal macOS/iOS app[3] that does support SWA with Mistral models (using mlc-llm).

[1]: https://github.com/ggerganov/llama.cpp/issues/3377

[2]: https://arxiv.org/abs/2310.06825

[3]: https://apps.apple.com/us/app/private-llm/id6448106860


I like your app Private LLM. I run it on an old M1 iPad Pro with 16GB of memory. That said, most of my experiments with integrating LLMs with programs written in Python, Racket, and Common Lisp use Ollama's REST service. I have a Mac Mini with 32GB and it blows my mind how effective an inexpensive, energy-efficient home computer is for running LLMs.
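
As a point of reference, this is the sort of minimal Ollama REST call I mean from Python (the model name and prompt are just placeholders; Ollama's local API listens on port 11434 by default):

  # Minimal example of hitting Ollama's local /api/generate endpoint.
  # Assumes Ollama is running and the "mistral" model has already been pulled.
  import requests

  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={"model": "mistral", "prompt": "Write a haiku about the M1 Mac.", "stream": False},
      timeout=120,
  )
  resp.raise_for_status()
  print(resp.json()["response"])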


Ah, I tried your app out just the other day. It sounded interesting, but I couldn't see a way to download models (on iOS anyway), and it seemed quite slow compared to MLC Chat; perhaps that was the built-in model?


Currently, the iOS app only allows downloading the 7B Llama 2 Uncensored model on iPhone 13 Pro and newer phones and also Apple Silicon iPads. I'm working on an update to allow downloading multiple models based on StableLM 3B, Llama 2 7B and Mistral 7B. As before, due to memory and compute constraints, 7B models can only be downloaded on the aforementioned devices.


It is important to note here that SWA is not supported by all models; e.g., Mixtral does not work correctly with it.


Just to be clear, Mistral 7B based models work best with sliding window attention. But they seem to have recently disabled[1][2] SWA for Mixtral, their MoE model. I learnt that from this[3] r/LocalLLaMA post yesterday.

[1]: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/...

[2]: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/commit/58...

[3]: https://old.reddit.com/r/LocalLLaMA/comments/18k0fek/


Do the models on your apps work well for coding tasks?


Not on iOS. On macOS, I personally think WizardLM 13B v1.2 is a very strong model, and I keep hearing good things about it from users on our Discord and in support emails. Now that there's OmniQuant support for Mixtral models[1], I plan to add support for Mixtral-8x7B-Instruct-v0.1 in the next version of the macOS app; in my tests it looks like a very good all-purpose model that's also pretty good at coding. It's pretty memory hungry (~41GB of RAM), but that's the price to pay for an uncompromising implementation. Existing quantized implementations quantize the MoE gates, leading to a significant increase in perplexity compared with fp16 inference.

[1]: https://github.com/OpenGVLab/OmniQuant/commit/798467


For anyone who only has 8GB of RAM:

I can run orca-mini:3b. It's dumber than just flipping a coin, but it still feels cool to have an LLM running on your own computer.



Trivial to run Mistral 7B on an M1 MacBook Air using LM Studio. Just make sure you use a quantized version.


Trivial to run it from the command line with llama.cpp

Takes 5 minutes really.
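
Roughly like this, shown through Python's subprocess for illustration; the model path is a placeholder for whatever quantized GGUF you downloaded, and -m/-p/-n are llama.cpp's model, prompt and token-count flags:

  # Rough sketch: invoke the llama.cpp "main" binary on a quantized Mistral GGUF.
  import subprocess

  subprocess.run(
      [
          "./main",
          "-m", "models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # placeholder path
          "-p", "Explain sliding window attention in one sentence.",
          "-n", "128",
      ],
      check=True,
  )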



Windows equivalent: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral...

Runs on GPUs, uses about 5GB of VRAM. On integrated GPUs it generates 1-2 tokens/second; on discrete ones, often over 20 tokens/second.


Wow this sounds cool:

  Replacement of CUDA with Direct3D 11


The insane complexity of these ML models is limited to the data being handled: billions of numbers in hundreds of tensors. The compute kernels that run on the GPU are surprisingly simple; they're just called many times with different input tensors.

This makes these compute kernels relatively portable across GPU APIs and languages.

I think on Windows, Direct3D is often a better tech choice for GPGPU: (1) it runs on GPUs from all vendors, and (2) Direct3D is an essential Windows component; the OS even composes desktop windows with D3D11. Unlike CUDA, the D3D runtime libraries are already available in the OS by default, which simplifies software installation and support.


Running Mixtral on a 32GB M3 Max is surprisingly hard in LM Studio... turning on "Apple Metal" mode just crashes it. Otherwise it takes about 10 minutes to load into memory, and inference is super slow.

Anyone got suggestions? Would love for it to run summarization over GBs of NCBI data.


Try a quantized model on llama.cpp; I have an M1 Max with 32GB and it runs fairly fast in 20GB of memory.


32GB of memory only leaves about 24-26GB for the GPU by default, which is quite low for a larger model like that. For comparison, it runs great on an M2 Max with 96GB.


It runs OK on my 32GB Mac Mini, but I have mostly been running it using the inexpensive Mistral and AllScale APIs.


Do you need to use Mixtral? The typical quantized Mistral 7B and similar models can do summarization fine.


I'm using GPT-4 as an assistant for basic computing questions, such as helping with quick shell one-liners or looking up the simplest way to do some manipulation of Rust values (Vec of Result into Result of Vec, which I always forget how to do, etc.).

I wanted to try how local models compare with that use case of mine.

The only one that worked decently was Mixtral.


Haha no not at all, but it’s fun!


Get a dirt-cheap Windows machine with >32GB, or rent a VM?


Strange. I remember trying to get this to work on a 16GB machine, and all of the comments on a GitHub issue mentioning it were saying it needs at least 32GB or more.

Edit: this was with llama.cpp though, not Ollama.


Try llamafile: https://github.com/Mozilla-Ocho/llamafile. I have Mistral 7B running this way on a 10-year-old laptop, and it only seems to use a few GB with its memory-mapping approach.
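
To see why it barely registers, here's a tiny illustration (the file name is a stand-in) of how a memory-mapped file only faults pages in as they're touched:

  # Pages of a memory-mapped file are backed by the file itself and show up in
  # the disk cache rather than as ordinary process memory; untouched regions
  # cost no RAM until they're read.
  import mmap

  with open("model.gguf", "rb") as f:  # stand-in for the actual model file
      mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
      first_byte = mm[0]  # faults a single page in from disk on demand
      mm.close()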


It doesn't count most of the model, since it's memory-mapped; it only shows up as memory used by the disk cache.

Though if your machine can’t keep it all in memory, then speed will still fall off a cliff.


If it is lazy-loading just what it needs, that seems like an efficient use of memory. In any case, this 4GB model will easily fit into the commenter's 16GB machine.


If you're running on GPU then it would need to be wired, and wired file-backed pages do count as process memory and have to physically fit in DRAM.


Wow, that's incredible. And legit too. I was reading through issues on llama.cpp about implementing memory swapping, so I didn't think it had been done.

Thanks!


It's really just a difference in accounting. Memory used for memory-mapped files isn't shown under the "used" header, but under the disk cache one instead. And it doesn't need to be swapped out to be discarded, so if you lack the memory it just slows everything down without an obvious cause.


I've got a 3.x-billion-parameter model running on an 8GB machine which is fine-tuned to be better than 13B models; here's how I did it: https://christiaanse.ca/posts/running_llm/


That's not memory swapping but something else, right? I ask because it looks like the new Mistral model, but slightly different.


It's a fine-tuned Mistral model based on Microsoft's Orca model.


What is the performance? I didn't see any benchmarks listed.


I'm sure this question has been asked plenty of times, but what is the best place to start with LLMs (no theory, just installing them locally, fine-tuning them, etc.)?



Does this run on CPU or GPU?

Would it be advantageous to divide the load and run on both?


Obviously CPU.


No. Ollama takes advantage of Metal on Mac M1.


Not obvious to me. I thought these AI things required vast amounts of GPU power. Or is that for the training part?


Inference works fine on CPUs; it's a bit slow, but fine for a chatbot.


Is there a llamafile?


Aside from LM Studio, there's also Faraday: https://faraday.dev/


LM Studio is even easier.


While true, I'm not a fan of the fact that it's closed source. I use Ollama since it's trivial to swap models on the fly and it exposes a very simple REST API.


I find it hard to trust a tutorial that tells me forking curl is an exemplary way to interact with a local restful API in Python.


curl is the default method for a lot of simple AI API tutorials, unfortunately, although running a curl command in Python is a new one.

It's very easy to port it to Python requests or a similar HTTP library, at the least.


It also manages to have both command injection vulnerabilities (passing untrusted data directly to a shell) and JSON injection vulnerabilities (templating untrusted data directly into JSON).
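
For contrast, a small sketch of the safer pattern: let the HTTP library serialize the JSON instead of templating untrusted text into a shell command (the endpoint and model name mirror the Ollama example above, and the "untrusted" string is purely illustrative):

  # Passing untrusted text through requests' json= parameter avoids both issues:
  # nothing goes through a shell, and the JSON is escaped by the library.
  import requests

  untrusted = 'user text with "quotes" and $(dangerous) shell syntax'  # illustrative input

  resp = requests.post(
      "http://localhost:11434/api/generate",  # Ollama's local endpoint, as above
      json={"model": "mistral", "prompt": untrusted, "stream": False},
      timeout=120,
  )
  resp.raise_for_status()
  print(resp.json()["response"])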


It is not so bad. I routinely fork processes to integrate useful code with Common Lisp and Racket code. What is a millisecond process startup time between friends :-)

I agree that since Ollama provides a nice REST API, why not use that.


Yeah. WTF? Have they not heard of urllib2?



