if you want to be lazy, 7B ≈ 7 GB of VRAM and 12B ≈ 12 GB of VRAM (that's the roughly-one-byte-per-parameter rule at 8-bit), but with quantization you might be able to do it with ~6-8 GB. So any 16 GB MacBook could run it (but not much else).
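Here's the back-of-the-envelope version of that rule in Python, if anyone wants to plug in other sizes. The bytes-per-parameter values and the overhead fudge factor are assumptions, not measurements:

    def estimate_vram_gb(params_billions: float, bits_per_param: int = 8,
                         overhead: float = 1.2) -> float:
        """Rough VRAM estimate: weights only, plus a fudge factor for
        activations / KV cache. Real usage varies."""
        weight_gb = params_billions * (bits_per_param / 8)  # ~1e9 params * bytes each ≈ GB
        return weight_gb * overhead

    # A 12B model at different quantization levels (illustrative only)
    for bits in (16, 8, 4):
        print(f"{bits}-bit: ~{estimate_vram_gb(12, bits):.1f} GB")

Which also shows why 8 GB cards struggle: even at 8-bit, 12B weights alone are over 12 GB before any activations or KV cache.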
Welp, my data point of one shows you need more than 8 GB of VRAM.
When I run mistral-chat with Nemo-Instruct it crashes in 5 seconds with the error: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU"
This is on Ubuntu 22.04.4 with an NVIDIA GeForce RTX 3060 Ti (8192 MiB). I ran "nvidia-smi -lms 10" to watch memory usage, and the last reading before the crash was 7966 MiB.
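For anyone else checking before they load the model, you can ask PyTorch for free VRAM from the same process (assumes a CUDA build of PyTorch; the 12 GiB threshold is just my guess for this model):

    import torch

    free_bytes, total_bytes = torch.cuda.mem_get_info()  # (free, total) in bytes on the current device
    free_gb, total_gb = free_bytes / 1024**3, total_bytes / 1024**3
    print(f"GPU memory: {free_gb:.2f} GiB free of {total_gb:.2f} GiB")

    # 12B weights alone are roughly 12 GiB at 8-bit and ~24 GiB at fp16,
    # so on an 8 GiB card the load is doomed before the first prompt.
    if free_gb < 12:
        print("Probably not enough VRAM for a 12B model without aggressive quantization.")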
When I run mistral-chat on Ubuntu 22.04 after clearing some smaller processes off the GPU (like gnome-remote-desktop-daemon), I can start Mistral-Nemo 2407 and get a prompt on an RTX 4090, but after entering the prompt it still fails with an OOM error. So, as someone noted, it only narrowly fits on a 4090.
Agreed, it narrowly fits on an RTX 4090. Yesterday I rented an RTX 4090 on vast.ai and set up Mistral-Nemo-2407. I got it to work, but just barely: I can run mistral-chat, get the prompt, and it starts generating a response after 10 to 15 seconds. The second prompt always crashes it immediately with an OOM error. At first I almost bought an RTX 4090 from Best Buy, but it would have cost $2,000 after tax, so I'm glad I only spent 40 cents instead.
What about fine-tuning? Are the memory requirements comparable to inference? If not, is there a rule of thumb for the difference? Would it be realistic to do it on a MacBook with 96 GB of unified memory?
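My own back-of-the-envelope, happy to be corrected: the usual rule of thumb for full fine-tuning with Adam is weights + gradients + optimizer states, which lands around 16 bytes per parameter before activations. All the multipliers below are rule-of-thumb assumptions, not measurements:

    def finetune_memory_gb(params_billions: float, bytes_weights: int = 2,
                           bytes_grads: int = 2, bytes_optimizer: int = 12) -> float:
        """Very rough full fine-tuning estimate with Adam in mixed precision:
        bf16 weights + bf16 grads + fp32 optimizer states (m, v, master copy).
        Activations and batch size add more on top of this."""
        return params_billions * (bytes_weights + bytes_grads + bytes_optimizer)

    print(f"12B full fine-tune: ~{finetune_memory_gb(12):.0f} GB")  # ~192 GB
    print(f"12B inference fp16: ~{12 * 2:.0f} GB")                  # ~24 GB for comparison

Under those assumptions a full 12B fine-tune wouldn't fit in 96 GB, though parameter-efficient methods like LoRA/QLoRA cut the trainable state down to the point where it might.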
Yes, but it's not common for the original model to be released as 8-bit int. The community can quantize any model down to 8-bit int, but it always comes with some quality loss.
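For what it's worth, loading a model in 8-bit with transformers + bitsandbytes looks roughly like this. Treat it as a sketch: the repo name and the exact options are from memory:

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mistral-Nemo-Instruct-2407"      # assumed HF repo name
    quant_config = BitsAndBytesConfig(load_in_8bit=True)   # quantize weights to int8 at load time

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # spill layers to CPU if the GPU can't hold them all
    )

Quantizing at load time like this is lossy relative to the original bf16 weights, which is the quality trade-off mentioned above.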