I have a single RTX 3060. It can't handle a 70b model. I got something like 1-2 ...

brucethemoose2 · on Sept 27, 2023

That can handle a 20B model, either in llama.cpp or exLLaMA:

https://huggingface.co/models?sort=modified&search=20B

https://huggingface.co/Kooten/U-Amethyst-20B-3bpw-exl2?not-f...
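
For a rough sense of why a 20B quant fits on the card and a 70B one does not, here is a back-of-the-envelope sketch. It counts the weights only (no KV cache or runtime overhead), and both the 12 GB VRAM figure and the 4-bit setting for the 70B model are assumptions for illustration, not numbers from this thread. `weight_gib` is a hypothetical helper name.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


VRAM_GIB = 12.0  # assumption: the 12 GB variant of the RTX 3060

# 3 bpw matches the linked exl2 quant; 4 bpw for 70B is an assumed setting.
for name, params, bpw in [("20B @ 3 bpw", 20, 3.0), ("70B @ 4 bpw", 70, 4.0)]:
    size = weight_gib(params, bpw)
    verdict = "fits" if size < VRAM_GIB else "does not fit"
    print(f"{name}: ~{size:.1f} GiB of weights, {verdict} in {VRAM_GIB:.0f} GiB")
```

Real usage will be somewhat higher than this estimate, since context length adds memory on top of the weights.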

coolspot · on Sept 27, 2023

With this setup you may as well throw your 3060 out and just use the CPU: your bottleneck is RAM-to-VRAM bandwidth, so the 3060 is basically idle.
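
The bandwidth argument can be sketched with simple arithmetic: when generating one token at a time with a dense model, every weight is read once per token, so the slowest link the weights cross sets a ceiling on tokens per second. The model size and bandwidth figures below are illustrative assumptions (theoretical peak numbers, not measurements from this thread), and `max_tokens_per_sec` is a hypothetical helper.

```python
def max_tokens_per_sec(streamed_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if `streamed_gb` must cross the link per token."""
    return bandwidth_gb_s / streamed_gb


MODEL_GB = 35.0  # assumption: 70B weights at roughly 4 bits each
VRAM_GB = 12.0   # assumption: 12 GB card, so the remainder lives in system RAM

# Assumed theoretical peaks; real throughput is lower.
PCIE4_X16_GB_S = 32.0   # weights streamed from RAM to the GPU
DDR4_3200_GB_S = 51.2   # dual-channel RAM read directly by the CPU

gpu_bound = max_tokens_per_sec(MODEL_GB - VRAM_GB, PCIE4_X16_GB_S)
cpu_bound = max_tokens_per_sec(MODEL_GB, DDR4_3200_GB_S)

print(f"GPU, streaming the overflow over PCIe: at most {gpu_bound:.2f} tokens/s")
print(f"CPU only, reading from RAM: at most {cpu_bound:.2f} tokens/s")
```

Under these assumptions the two ceilings land in the same range, which is the sense in which the GPU adds little once most of the model sits outside VRAM.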

LoganDark · on Sept 27, 2023

I would love to throw the 3060 out and replace it with a 3090... once money permits. (It's only about $800 nowadays.)

But yes, I'm aware how laughably insane it is to run a 70b model that way. That's why I was pointing it out to the commenter who suggested just running a 70b model instead.

freedomben · on Sept 27, 2023

downvoters: why did you downvote? is this comment technically incorrect or inaccurate?

LoganDark · on Sept 27, 2023

To a comment that suggested I try the 70b model, I replied "my card can't run that model". Someone replies back with "you may as well throw the card out if you're going to be trying to run that model". My point exactly.

More seriously, using all-CPU is not much faster as my computer only has 16GB of actual memory, which I'm aware is also hugely underspecced for a 70b model, even with memory mapping.

I have a nice NVMe SSD, so there's not much else for me to do here except upgrade my memory or graphics card.

freedomben · on Sept 27, 2023

That would make sense of the downvotes, thank you!