
I've never run one of these models locally, but their README has some pretty easy-to-follow instructions, so I tried it out...

> RuntimeError: Found no NVIDIA driver on your system.

It's true that I don't have an NVIDIA GPU in this system. But I have 64GB of memory and 32 CPU cores. Are these useless for running these kinds of large language models? I don't need blazing fast speed, I just need a few tokens a second to test-drive the model.




Use the code/model included here: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1

Change the initial device line from "cuda" to "cpu" and it'll run.

(Edit: just a note, use the main/head version of transformers which has merged Mistral support. Also saw TheBloke uploaded a GGUF and just confirmed that latest llama.cpp works w/ it.)
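For anyone else trying it, this is roughly what the change amounts to. A minimal sketch using the standard transformers API (not the exact snippet from Mistral's README), with the device set to "cpu":

    # Minimal sketch: load Mistral-7B-Instruct and run it on CPU.
    # Requires a transformers build recent enough to include Mistral support.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cpu"  # the README example uses "cuda"
    model_id = "mistralai/Mistral-7B-Instruct-v0.1"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

    prompt = "[INST] Explain what sliding window attention is. [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Full-precision inference on CPU will be slow, but it runs.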


You gotta wait until GGML and the like repackage the model; early releases are almost always targeted at ML folks with dedicated GPUs.


I think it's really lame that ML, which is just math really, doesn't have some system-agnostic language to define what math needs to be done, so it can then run easily on CPU/GPU/TPU/whatever...

A whole industry being locked into Nvidia seems bad all round.


https://onnx.ai/ sounds close to what you're thinking of, it's an open interchange format.
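As a toy sketch of the idea (a tiny model, not an LLM): you export the graph once and then run it wherever an execution provider exists, e.g. onnxruntime's CPU provider:

    # Export a small PyTorch module to ONNX, then run it on CPU via onnxruntime.
    import torch
    import onnxruntime as ort

    model = torch.nn.Linear(16, 4)
    dummy = torch.randn(1, 16)
    torch.onnx.export(model, dummy, "linear.onnx",
                      input_names=["x"], output_names=["y"])

    sess = ort.InferenceSession("linear.onnx", providers=["CPUExecutionProvider"])
    print(sess.run(None, {"x": dummy.numpy()}))

Swap the provider list for CUDA etc. and the same .onnx file runs there instead.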


It's not Nvidia's fault that the competition (AMD) does not provide the right software. There is an open alternative to CUDA called OpenCL.


Nvidia has wrapped their CUDA language in patents and licensing so tightly that there is no way AMD could release anything CUDA-compatible.


Yes, but AMD could release a ROCm that actually works and then put meaningful resources into some of the countless projects out there that have been successfully building on CUDA for 15 years.

There was a recent announcement that after six years AMD finally sees the $$$ and will start putting some real effort into ROCm[0]. That announcement was two days ago, and they claim they started on this last year. My occasional experience with ROCm doesn't show much progress or promise.

I'm all for viable Nvidia competition in the space but AMD has really, really, really dropped the ball on GPGPU with their hardware up to this point.

[0] - https://www.eetimes.com/rocm-is-amds-no-1-priority-exec-says...


As sad as it is, this is true. AMD has never spent much money on software, while Nvidia always has. That was fine for traditional graphics, but with ML it really doesn't cut it. AMD could have ported PyTorch to OpenCL or Vulkan or WebGPU, but they just... can't be bothered???


It's not entirely their fault: they rely on xformers, and that library is GPU-only.

Other models will happily run in CPU-only mode. Depending on your environment there are super easy ways to get started, and 32 cores should be OK for a Llama 2 13B and bearable, with some patience, for 33B models. For reference, I'm willingly running 13B Llama 2 in CPU-only mode so I can leave the GPU to diffusers, and it's just enough to generate at a comfortable reading speed.
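For example, a minimal sketch with the llama-cpp-python bindings (the model path is just a placeholder for whatever quantized GGUF you've downloaded):

    # Sketch: CPU-only inference with llama-cpp-python and a quantized GGUF.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",  # placeholder path
        n_ctx=4096,     # context window
        n_threads=32,   # match your core count
    )

    out = llm("Q: Why is the sky blue? A:", max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"])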


Use llama.cpp to run models locally.


Can llama.cpp run this yet? That would be surprising



Birds fly, sun shines, and TheBloke always delivers.

Though I can't figure out the right prompt, and with Llama 2's template it's... weird. It responds half in Korean and adds unnecessary numbering to paragraphs.

Just one big sigh towards those supposed efforts on prompt template standardization. Every single model just has to do something unique that breaks all compatibility but has never resulted in any performance gain.


I used the prompt included in llama.cpp and it worked for me in English (for fun general-knowledge-type questions):

MODEL=./models/mistral-7b-v0.1.Q5_K_M.gguf N_THREAD=16 ./examples/chat-13B.sh


I have yet to get any useful output out of the Q5_K_S version; haven't tried any others yet.


Linked is the base model. What you want is the instruct model (also on TheBloke's profile), which has been trained on following instructions.


I used mistral-7b-v0.1.Q5_K_M.gguf and it responded to basic questions.


Wow, awesome!


I'm getting about 7 tokens per second for Mistral with the Q6_K on a bog-standard Intel i5-11400 desktop with 32GB of memory and no discrete GPU (the CPU has Intel UHD Graphics 730 built in).

So great performance on a cheap CPU from 2 years ago which costs, what, $130 or so?

I tried Llama 65B on the same hardware and it was way slower, but it worked fine. Took about 10 minutes to output some cooking recipe.

I think people way overestimate the need for expensive GPUs to run these models at home.

I haven't tried fine-tuning, but I suspect instead of 30 hours on high-end GPUs you can probably get away with fine-tuning in, what, about a week? Two weeks? Just on a comparable CPU. Has anybody actually run that experiment?

Basically any kid with an old rig can roll their own customized model given a bit of time. So much for alignment.


It would be very surprising.

Mistral AI's github page has more information on their sliding window attention method to achieve this performance: https://github.com/mistralai/mistral-src

If Mistral 7b lives up to the claims, I expect these techniques will make their way into llama.cpp. But I would be surprised if the required updates were quick or easy.
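Not the actual Mistral implementation, just a sketch of the core idea: the attention mask stays causal but also drops anything older than the window, so each token only attends to the last W positions:

    # Toy sketch of a sliding-window causal mask (window size `window`).
    import torch

    def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
        i = torch.arange(seq_len).unsqueeze(1)  # query positions
        j = torch.arange(seq_len).unsqueeze(0)  # key positions
        causal = j <= i                         # never attend to the future
        recent = (i - j) < window               # only the last `window` tokens
        return causal & recent                  # True where attention is allowed

    print(sliding_window_mask(6, 3).int())

Per-token attention cost then scales with the window size rather than the full sequence length.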


ollama runs it pretty well on CPUs like that.



