I've never run one of these models locally, but their README has some pretty easy-to-follow instructions, so I tried it out...
> RuntimeError: Found no NVIDIA driver on your system.
It's true that I don't have an NVIDIA GPU in this system. But I have 64GB of memory and 32 CPU cores. Are these useless for running these types of large language models? I don't need blazing-fast speed, I just need a few tokens a second to test-drive the model.
Change the initial device line from "cuda" to "cpu" and it'll run.
(Edit: just a note, use the main/head version of transformers which has merged Mistral support. Also saw TheBloke uploaded a GGUF and just confirmed that latest llama.cpp works w/ it.)
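For anyone else hitting the same error: a minimal sketch of the fix, assuming a typical transformers-style demo script that hard-codes its device (the exact variable name in the README's script may differ):

```python
import torch

# The stock demo script begins with something like:
#     device = "cuda"
# Replacing that hard-coded line with a fallback lets the same script
# run (slowly, but correctly) on a CPU-only box:
device = "cuda" if torch.cuda.is_available() else "cpu"

# The rest of the script then places the model accordingly, e.g.:
#     model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
print(device)
```

With no NVIDIA driver present, `torch.cuda.is_available()` returns False and everything lands on the CPU.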
I think it's really lame that ML, which is just math really, hasn't got some system-agnostic language to define what math needs to be done, and then it can run easily on CPU/GPU/TPU/whatever...
A whole industry being locked into NVidia seems bad all round.
Yes, but AMD could release a ROCm that actually works, and then put meaningful resources into some of the countless projects out there that have been building successfully on CUDA for 15 years.
There was a recent announcement that after six years AMD finally sees the $$$ and will be starting to put some real effort into ROCm[0]. That announcement was two days ago, and they claim they started on this last year. My occasional experience with ROCm doesn't show much progress or promise.
I'm all for viable Nvidia competition in the space but AMD has really, really, really dropped the ball on GPGPU with their hardware up to this point.
As sad as it is, this is true. AMD has never spent lots of money on software, while Nvidia always has, which was fine for traditional graphics, but with ML this really doesn't cut it. AMD could have ported Pytorch to OpenCL or Vulkan or WebGPU, but they just... can't be bothered???
It's not entirely their fault: they rely on xformers, and that library is GPU-only.
Other models will happily run in CPU-only mode. Depending on your environment there are super easy ways to get started, and 32 cores should be OK for a Llama 2 13B and bearable, with some patience, for running 33B models. For reference, I'm willingly running Llama 2 13B in CPU-only mode so I can leave the GPU to diffusers, and it's just enough to generate at a comfortable reading speed.
Birds fly, sun shines, and TheBloke always delivers.
Though I can't figure out that prompt, and with Llama 2's template it's... weird. It responds half in Korean and does unnecessary numbering of paragraphs.
Just one big sigh at those supposed efforts at prompt-template standardization. Every single model just has to do something unique that breaks all compatibility but has never resulted in any performance gain.
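For the record, the format Mistral 7B Instruct's model card documents is the bare `[INST] ... [/INST]` wrapper, which is close to but not the same as Llama 2's chat template (Llama 2 additionally uses `<<SYS>>` blocks for the system prompt). A minimal sketch:

```python
# Builds a single-turn prompt in the Mistral 7B Instruct format.
# Multi-turn conversations repeat the [INST] ... [/INST] wrapper per
# user turn; this sketch covers only the simple one-shot case.
def mistral_prompt(user_message: str) -> str:
    return f"<s>[INST] {user_message} [/INST]"

print(mistral_prompt("Write a haiku about CPUs."))
```

Feeding a model the wrong template is a common cause of exactly this kind of degraded, off-language output.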
I'm getting about 7 tokens per second for Mistral with the Q6_K quant on a bog-standard Intel i5-11400 desktop with 32GB of memory and no discrete GPU (the CPU has Intel UHD Graphics 730 built in).
So: great performance on a cheap CPU from two years ago which costs, what, $130 or so?
I tried Llama 65B on the same hardware and it was way slower, but it worked fine. It took about 10 minutes to output a cooking recipe.
I think people way overestimate the need for expensive GPUs to run these models at home.
I haven't tried fine-tuning, but I suspect that instead of 30 hours on high-end GPUs you can probably get away with fine-tuning in, what, a week? Two weeks? Just on a comparable CPU. Has anybody actually run that experiment?
Basically any kid with an old rig can roll their own customized model given a bit of time. So much for alignment.
If Mistral 7b lives up to the claims, I expect these techniques will make their way into llama.cpp. But I would be surprised if the required updates were quick or easy.