if you want to be lazy, 7B ≈ 7 GB of VRAM and 12B ≈ 12 GB of VRAM (that's the roughly-one-byte-per-parameter rule at 8-bit), but with quantization you might be able to do it with ~6-8 GB. So any 16 GB MacBook could run it (but not much else).
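Here's the back-of-the-envelope version of that rule in Python, if anyone wants to plug in other sizes. The bytes-per-parameter values and the overhead fudge factor are assumptions, not measurements:

    def estimate_vram_gb(params_billions: float, bits_per_param: int = 8,
                         overhead: float = 1.2) -> float:
        """Rough VRAM estimate: weights only, plus a fudge factor for
        activations / KV cache. Real usage varies."""
        weight_gb = params_billions * (bits_per_param / 8)  # ~1e9 params * bytes each ≈ GB
        return weight_gb * overhead

    # A 12B model at different quantization levels (illustrative only)
    for bits in (16, 8, 4):
        print(f"{bits}-bit: ~{estimate_vram_gb(12, bits):.1f} GB")

Which also shows why 8 GB cards struggle: even at 8-bit, 12B weights alone are over 12 GB before any activations or KV cache.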
Welp, my data point of one shows you need more than 8 GB of VRAM.
When I run mistral-chat with Nemo-Instruct it crashes in 5 seconds with the error: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU"
This is on Ubuntu 22.04.4 with an NVIDIA GeForce RTX 3060 Ti (8192 MiB). I ran "nvidia-smi -lms 10" to watch memory usage, and the last reading before the crash was 7966 MiB.
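For anyone else checking before they load the model, you can ask PyTorch for free VRAM from the same process (assumes a CUDA build of PyTorch; the 12 GiB threshold is just my guess for this model):

    import torch

    free_bytes, total_bytes = torch.cuda.mem_get_info()  # (free, total) in bytes on the current device
    free_gb, total_gb = free_bytes / 1024**3, total_bytes / 1024**3
    print(f"GPU memory: {free_gb:.2f} GiB free of {total_gb:.2f} GiB")

    # 12B weights alone are roughly 12 GiB at 8-bit and ~24 GiB at fp16,
    # so on an 8 GiB card the load is doomed before the first prompt.
    if free_gb < 12:
        print("Probably not enough VRAM for a 12B model without aggressive quantization.")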
When I run mistral-chat on Ubuntu 22.04 after clearing some smaller processes off the GPU (like gnome-remote-desktop-daemon), I can start Mistral-Nemo 2407 and get a prompt on an RTX 4090, but after entering the prompt it still fails with an OOM error. So, as someone noted, it only narrowly fits on a 4090.
Agreed, it narrowly fits on an RTX 4090. Yesterday I rented an RTX 4090 on vast.ai and set up Mistral-Nemo-2407. I got it to work, but just barely: I can run mistral-chat, get the prompt, and it starts generating a response after 10 to 15 seconds. The second prompt always crashes it immediately with an OOM error. At first I almost bought an RTX 4090 from Best Buy, but it would have cost $2,000 after tax, so I'm glad I only spent 40 cents instead.
What about fine-tuning? Are the memory requirements comparable to inference? If not, is there a rule of thumb for the difference? Would it be realistic to do it on a MacBook with 96 GB of unified memory?
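My own back-of-the-envelope, happy to be corrected: the usual rule of thumb for full fine-tuning with Adam is weights + gradients + optimizer states, which lands around 16 bytes per parameter before activations. All the multipliers below are rule-of-thumb assumptions, not measurements:

    def finetune_memory_gb(params_billions: float, bytes_weights: int = 2,
                           bytes_grads: int = 2, bytes_optimizer: int = 12) -> float:
        """Very rough full fine-tuning estimate with Adam in mixed precision:
        bf16 weights + bf16 grads + fp32 optimizer states (m, v, master copy).
        Activations and batch size add more on top of this."""
        return params_billions * (bytes_weights + bytes_grads + bytes_optimizer)

    print(f"12B full fine-tune: ~{finetune_memory_gb(12):.0f} GB")  # ~192 GB
    print(f"12B inference fp16: ~{12 * 2:.0f} GB")                  # ~24 GB for comparison

Under those assumptions a full 12B fine-tune wouldn't fit in 96 GB, though parameter-efficient methods like LoRA/QLoRA cut the trainable state down to the point where it might.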
Yes, but it's not common for the original model to be released as 8-bit int. The community can quantize any model down to 8-bit int, but it always comes with some quality loss.
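For what it's worth, loading a model in 8-bit with transformers + bitsandbytes looks roughly like this. Treat it as a sketch: the repo name and the exact options are from memory:

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mistral-Nemo-Instruct-2407"      # assumed HF repo name
    quant_config = BitsAndBytesConfig(load_in_8bit=True)   # quantize weights to int8 at load time

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # spill layers to CPU if the GPU can't hold them all
    )

Quantizing at load time like this is lossy relative to the original bf16 weights, which is the quality trade-off mentioned above.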