Noob question, and probably being asked in the wrong place.
Is there any way to find out the minimum system requirements for running ollama run commands with different models?
On my 32G M2 Pro Mac, I can run up to about 30B models using 4-bit quantization. It is fast unless I am generating a lot of text: asking a 30B model to generate 5 pages can take over a minute. Smaller models like Mistral 7B are very fast.
Install Ollama from https://ollama.ai and experiment with it using the command line interface. I mostly use Ollama’s local API from Common Lisp or Racket - so simple to do.
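For reference, here is a minimal sketch in Racket of what calling that local API looks like, assuming Ollama is serving on its default port 11434 and the model named here ("mistral") has already been pulled:

    #lang racket
    (require net/url json)

    ;; Minimal sketch of hitting Ollama's local /api/generate endpoint.
    ;; With "stream": false the server returns a single JSON object;
    ;; the generated text is under the 'response key.
    (define (ollama-generate model prompt)
      (define body
        (string->bytes/utf-8
         (jsexpr->string (hasheq 'model model
                                 'prompt prompt
                                 'stream #f))))
      (define in
        (post-pure-port (string->url "http://localhost:11434/api/generate")
                        body
                        '("Content-Type: application/json")))
      (hash-ref (read-json in) 'response))

    (displayln (ollama-generate "mistral" "Why is the sky blue?"))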
EDIT: if you only have 8G RAM, try some of the 3B models. I suggest using at least 4-bit quantization.
You can easily experiment with smaller models, for example, Mistral 7B or Phi-2 on M1/M2/M3 processors. With more memory, you can run larger models, and better memory bandwidth (M2 Ultra vs. M2 base model) means improved performance (tokens/second).
They have a high-level summary of RAM requirements for each parameter size, and how much storage each model uses, on their GitHub: https://github.com/ollama/ollama#model-library. At the time of writing it suggests at least 8 GB of RAM for 7B models, 16 GB for 13B, and 32 GB for 33B.
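As a rough rule of thumb (my own back-of-the-envelope, not an official Ollama figure): the weights of an N-billion-parameter model quantized to q bits per weight take about N * q / 8 GB, plus some overhead for the KV cache and runtime. A quick Racket sketch:

    #lang racket
    ;; Back-of-the-envelope RAM estimate for a quantized model.
    ;; The 1.2x overhead factor (KV cache, runtime) is an assumption.
    (define (approx-ram-gb params-billions bits-per-weight)
      (* params-billions (/ bits-per-weight 8.0) 1.2))

    (approx-ram-gb 7 4)   ; => 4.2  -- why Mistral 7B fits comfortably in 8G
    (approx-ram-gb 30 4)  ; => 18.0 -- why a 30B q4 model fits on a 32G Mac

The estimates line up with the comments above: a 30B model at 4-bit leaves headroom on a 32G machine, and 3B models fit comfortably in 8G.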