One thing that's worth mentioning about llama.cpp wrappers like ollama, LM Studio and Faraday is that they don't yet support[1] sliding window attention, and instead use vanilla causal attention from llama2. As noted in the Mistral 7B paper[2], SWA has some benefits in terms of attention span over regular causal attention.
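For anyone who hasn't seen the difference spelled out, here's a toy sketch of the two masks in plain numpy (the window size here is arbitrary; Mistral 7B's actual sliding window is 4096 tokens):

    import numpy as np

    def causal_mask(seq_len):
        # every position may attend to itself and all earlier positions
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))

    def sliding_window_mask(seq_len, window):
        # every position may attend only to the last `window` positions (itself included)
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        return (j <= i) & (j > i - window)

    print(causal_mask(6).astype(int))
    print(sliding_window_mask(6, window=3).astype(int))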
Disclaimer: I have a competing universal macOS/iOS app[3] that does support SWA with Mistral models (using mlc-llm).
I like your app Private LLM. I run it on an old M1 iPad Pro with 16GB of memory. That said, most of my experiments with integrating LLMs with programs written in Python, Racket, and Common Lisp use Ollama's REST service. I have a Mac Mini with 32GB and it blows my mind how effective an inexpensive, energy-efficient home computer is for running LLMs.
Ah, I tried your app out just the other day. It sounded interesting, but I couldn't see a way to download models (on iOS, anyway), and it seemed quite slow compared to MLC Chat. Perhaps that was the built-in model?
Currently, the iOS app only allows downloading the 7B Llama 2 Uncensored model on iPhone 13 Pro and newer phones and also Apple Silicon iPads. I'm working on an update to allow downloading multiple models from the StableLM 3B, Llama 2 7B and Mistral 7B based models. As before, due to memory and compute constraints, 7B models can only be downloaded on the aforementioned devices.
Just to be clear, Mistral 7B based models work best with sliding window attention. But they seem to have recently disabled[1][2] SWA for Mixtral, their MoE model. I learnt that from this[3] r/LocalLLaMA post yesterday.
Not on iOS. On macOS, I personally think WizardLM 13B v1.2 is a very strong model, and I keep hearing good things about it from users on our Discord and in support emails. Now that there's OmniQuant support for Mixtral models[1], I plan to add support for Mixtral-8x7B-Instruct-v0.1 in the next version of the macOS app; in my tests, it looks like a very good all-purpose model that's also pretty good at coding. It's pretty memory hungry (~41GB of RAM), but that's the price to pay for an uncompromising implementation. Existing quantized implementations quantize the MoE gates, leading to a significant increase in perplexity compared with fp16 inference.
The insane complexity of these ML models lives in the data being handled: billions of numbers in hundreds of tensors. The compute kernels that run on the GPU are surprisingly simple; they're just called many times with different input tensors.
This makes these compute kernels relatively portable across GPU APIs and languages.
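For a sense of scale, a typical kernel like RMSNorm is only a few lines of arithmetic. Here's a reference version in plain numpy (the Metal/CUDA/HLSL versions are essentially the same math, just mapped over GPU threads):

    import numpy as np

    def rmsnorm(x, weight, eps=1e-5):
        # normalize each row by its root mean square, then apply a learned scale
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
        return (x / rms) * weight

    activations = np.random.randn(4, 4096).astype(np.float32)  # a small batch of hidden states
    gain = np.ones(4096, dtype=np.float32)                     # per-channel scale from the checkpoint
    print(rmsnorm(activations, gain).shape)                    # (4, 4096)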
I think that on Windows, Direct3D is often a better tech choice for GPGPU. (1) It runs on GPUs from all vendors. (2) Direct3D is an essential Windows component; the OS even composes desktop windows with D3D11. Unlike CUDA, the D3D runtime libraries are already present in the OS by default, which simplifies software installation and support.
Running Mixtral on a 32GB M3 Max is surprisingly hard in LM Studio... turning on "Apple Metal" mode just crashes it. Otherwise it takes about 10 minutes to load into memory, and inference is super slow.
Anyone got suggestions? Would love for it to run summarization over GBs of ncbi data.
32GB of memory only leaves about 24-26GB for the GPU by default, which is quite low for a larger model like that. For comparison, it runs great on an M2 Max with 96GB.
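Rough back-of-the-envelope for why that's tight (the parameter count and bit widths are ballpark figures, not measurements of any specific GGUF):

    # Mixtral 8x7B has roughly 47B total parameters
    params = 46.7e9
    for bits_per_param in (4.0, 4.5, 5.0):
        weights_gb = params * bits_per_param / 8 / 1e9
        print(f"~{bits_per_param} bits/param -> ~{weights_gb:.0f} GB of weights, before KV cache and overhead")

Even an aggressive 4-bit quant already brushes up against that 24-26GB ceiling once the KV cache and runtime overhead are added.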
I'm using GPT-4 as an assistant for basic computing questions, such as helping with quick shell one-liners or looking up the simplest way to do some manipulation of Rust values (Vec of Result into Result of Vec, which I always forget how to do, etc.).
I wanted to see how local models compare for that use case of mine.
Strange. I remember trying to get this to work on a 16GB machine, and all the comments on a GitHub issue about it said it needs at least 32GB or more.
Try llamafile https://github.com/Mozilla-Ocho/llamafile
I have Mistral 7B running this way on a 10-year-old laptop, and it only seems to use a few GB thanks to its memory-mapping approach.
If it is lazy-loading just what it needs, that seems like an efficient use of memory. In any case, this 4GB model will easily fit into the commenter's 16GB machine.
It’s really just a difference in accounting. Memory used for memory-mapped files isn’t shown under the “used” header, but under the disk cache instead. And it doesn’t need to be swapped out to be discarded, so if you lack the memory, everything just slows down without an obvious cause.
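A quick way to see that accounting for yourself (a toy sketch; the model path is just a placeholder for whatever large local file you have):

    import mmap

    path = "mistral-7b-instruct.Q4_K_M.gguf"  # placeholder: any large local file works
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # the mapping is created instantly; pages are faulted in lazily as bytes are touched,
        # and they show up under the OS disk cache rather than the process's "used" memory
        print(len(mm), mm[:4])
        mm.close()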
I’m sure this question has been asked plenty of times, but what is the best place to start with LLMs (no theory, just installing them locally, fine-tuning them, etc.)?
While true, I'm not a fan of the fact that it's closed source. I use Ollama since it's trivial to swap models on the fly and exposes a very simple REST api.
It also manages to have both command injection vulnerabilities (passing untrusted data directly to a shell) as well as json injection vulnerabilities (templating untrusted data directly into json).
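For anyone who hasn't run into the JSON one before, the failure mode is roughly this (an illustrative sketch, not the project's actual code):

    import json

    prompt = 'hi", "stream": true, "extra": "'   # attacker-controlled text

    # unsafe: template untrusted text straight into a JSON string
    bad = '{"model": "mistral", "prompt": "%s"}' % prompt

    # safe: let the serializer handle quoting and escaping
    good = json.dumps({"model": "mistral", "prompt": prompt})

    print(bad)   # the injected quote adds new keys and changes the request
    print(good)  # the quote is escaped; the prompt stays an ordinary string

The shell version is the same story: pass an argv list instead of interpolating user text into a shell=True command string.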
It is not so bad. I routinely fork processes to integrate useful code with Common Lisp and Racket code. What is a millisecond process startup time between friends :-)
I agree: since Ollama provides a nice REST API, why not use that?
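For reference, the whole integration can be a single POST to the local server; something like this works against Ollama's /api/generate endpoint (the model name is just whatever you've pulled locally):

    import json, urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",       # Ollama's default local port
        data=json.dumps({
            "model": "mistral",                      # any model pulled with `ollama pull`
            "prompt": "Summarize sliding window attention in one sentence.",
            "stream": False,                         # return a single JSON object instead of a stream
        }).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])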
[1]: https://github.com/ggerganov/llama.cpp/issues/3377
[2]: https://arxiv.org/abs/2310.06825
[3]: https://apps.apple.com/us/app/private-llm/id6448106860