
I've never run one of these models locally, but their README has some pretty easy-to-follow instructions, so I tried it out...

> RuntimeError: Found no NVIDIA driver on your system.

It's true that I don't have an NVIDIA GPU in this system. But I have 64GB of memory and 32 CPU cores. Are these useless for running these kinds of large language models? I don't need blazing fast speed, I just need a few tokens a second to test-drive the model.




Use the code/model included here: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1

Change the initial device line from "cuda" to "cpu" and it'll run.

(Edit: just a note, use the main/head version of transformers which has merged Mistral support. Also saw TheBloke uploaded a GGUF and just confirmed that latest llama.cpp works w/ it.)
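For anyone else trying it, this is roughly what the change amounts to. A minimal sketch using the standard transformers API (not the exact snippet from Mistral's README), with the device set to "cpu":

    # Minimal sketch: load Mistral-7B-Instruct and run it on CPU.
    # Requires a transformers build recent enough to include Mistral support.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cpu"  # the README example uses "cuda"
    model_id = "mistralai/Mistral-7B-Instruct-v0.1"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

    prompt = "[INST] Explain what sliding window attention is. [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Full-precision inference on CPU will be slow, but it runs.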


You gotta wait until GGML and the like repackage the model; early releases are almost always targeted at ML folks with dedicated GPUs.


I think it's really lame that ML, which is just math really, doesn't have some system-agnostic language to define what math needs to be done, so it can then run easily on CPU/GPU/TPU/whatever...

A whole industry being locked into Nvidia seems bad all round.


https://onnx.ai/ sounds close to what you're thinking of, it's an open interchange format.
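As a toy sketch of the idea (a tiny model, not an LLM): you export the graph once and then run it wherever an execution provider exists, e.g. onnxruntime's CPU provider:

    # Export a small PyTorch module to ONNX, then run it on CPU via onnxruntime.
    import torch
    import onnxruntime as ort

    model = torch.nn.Linear(16, 4)
    dummy = torch.randn(1, 16)
    torch.onnx.export(model, dummy, "linear.onnx",
                      input_names=["x"], output_names=["y"])

    sess = ort.InferenceSession("linear.onnx", providers=["CPUExecutionProvider"])
    print(sess.run(None, {"x": dummy.numpy()}))

Swap the provider list for CUDA etc. and the same .onnx file runs there instead.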


It's not Nvidia's fault that the competition (AMD) does not provide the right software. There is an open alternative to CUDA called OpenCL.


Nvidia has wrapped their CUDA language in patents and licensing so tightly that there is no way AMD could release anything CUDA-compatible.


Yes, but AMD could release a ROCm that actually works and then put meaningful resources into some of the countless projects out there that have been successfully building on CUDA for 15 years.

There was a recent announcement that after six years AMD finally sees the $$$ and will start putting some real effort into ROCm[0]. That announcement was two days ago, and they claim they started on this last year. My occasional experience with ROCm doesn't show much progress or promise.

I'm all for viable Nvidia competition in the space but AMD has really, really, really dropped the ball on GPGPU with their hardware up to this point.

[0] - https://www.eetimes.com/rocm-is-amds-no-1-priority-exec-says...


As sad as it is, this is true. AMD has never spent much money on software, while Nvidia always has. That was fine for traditional graphics, but with ML it really doesn't cut it. AMD could have ported PyTorch to OpenCL or Vulkan or WebGPU, but they just... can't be bothered???


It's not entirely their fault: they rely on xformers, and that library is GPU-only.

Other models will happily run in CPU-only mode. Depending on your environment there are super easy ways to get started, and 32 cores should be OK for a Llama 2 13B and bearable, with some patience, for 33B models. For reference, I'm willingly running 13B Llama 2 in CPU-only mode so I can leave the GPU to diffusers, and it's just enough to generate at a comfortable reading speed.
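For example, a minimal sketch with the llama-cpp-python bindings (the model path is just a placeholder for whatever quantized GGUF you've downloaded):

    # Sketch: CPU-only inference with llama-cpp-python and a quantized GGUF.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",  # placeholder path
        n_ctx=4096,     # context window
        n_threads=32,   # match your core count
    )

    out = llm("Q: Why is the sky blue? A:", max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"])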


Use llama.cpp to run models locally.


Can llama.cpp run this yet? That would be surprising



Birds fly, sun shines, and TheBloke always delivers.

Though I can't figure out the right prompt, and with Llama 2's template it's... weird. It responds half in Korean and adds unnecessary numbering to paragraphs.

Just one big sigh towards those supposed efforts on prompt template standardization. Every single model just has to do something unique that breaks all compatibility but has never resulted in any performance gain.


I used the prompt included in llama.cpp and it worked for me in English (for fun general-knowledge-type questions):

MODEL=./models/mistral-7b-v0.1.Q5_K_M.gguf N_THREAD=16 ./examples/chat-13B.sh


I have yet to get any useful output out of the Q5_K_S version; haven't tried any others yet.


Linked is the base model. What you want is the instruct model (also on TheBloke's profile), which has been trained on following instructions.


I used mistral-7b-v0.1.Q5_K_M.gguf and it responded to basic questions.


Wow, awesome!


I'm getting about 7 tokens per second for Mistral with the Q6_K on a bog-standard Intel i5-11400 desktop with 32GB of memory and no discrete GPU (the CPU has Intel UHD Graphics 730 built in).

So great performance on a cheap CPU from 2 years ago which costs, what, $130 or so?

I tried Llama 65B on the same hardware and it was way slower, but it worked fine. Took about 10 minutes to output some cooking recipe.

I think people way overestimate the need for expensive GPUs to run these models at home.

I haven't tried fine-tuning, but I suspect instead of 30 hours on high-end GPUs you can probably get away with fine-tuning in, what, about a week? Two weeks? Just on a comparable CPU. Has anybody actually run that experiment?

Basically any kid with an old rig can roll their own customized model given a bit of time. So much for alignment.


It would be very surprising.

Mistral AI's github page has more information on their sliding window attention method to achieve this performance: https://github.com/mistralai/mistral-src

If Mistral 7b lives up to the claims, I expect these techniques will make their way into llama.cpp. But I would be surprised if the required updates were quick or easy.
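Not the actual Mistral implementation, just a sketch of the core idea: the attention mask stays causal but also drops anything older than the window, so each token only attends to the last W positions:

    # Toy sketch of a sliding-window causal mask (window size `window`).
    import torch

    def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
        i = torch.arange(seq_len).unsqueeze(1)  # query positions
        j = torch.arange(seq_len).unsqueeze(0)  # key positions
        causal = j <= i                         # never attend to the future
        recent = (i - j) < window               # only the last `window` tokens
        return causal & recent                  # True where attention is allowed

    print(sliding_window_mask(6, 3).int())

Per-token attention cost then scales with the window size rather than the full sequence length.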


ollama runs it pretty well on CPUs like that.



