
I have a single RTX 3060. It can't handle a 70b model.

I got something like 1-2 tokens per second the last time I tried, with CPU offloading and an absolutely offensive page file (32GB).




With this setup you may as well throw your 3060 out and just use the CPU: your bottleneck is RAM-to-VRAM bandwidth, so the 3060 is basically idle.
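
A rough back-of-envelope for the bandwidth point, as a sketch only: if generating one token has to read every weight once, then tokens per second is at most bandwidth divided by model size. The 35 GB weight size and the 25 GB/s transfer rate below are illustrative assumptions, not measurements from this machine, and `max_tokens_per_second` is a made-up helper name.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    return bandwidth_bytes_per_s / model_bytes


GB = 1e9
weights = 35 * GB  # assumption: 70b parameters at roughly 4 bits each
link = 25 * GB     # assumption: illustrative RAM-to-VRAM transfer rate

print(f"{max_tokens_per_second(weights, link):.2f} tokens/s at best")
```

Under those assumed numbers the ceiling is below one token per second regardless of how fast the GPU itself is, which is the sense in which the card sits idle.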


I would love to throw the 3060 out and replace it with a 3090... once money permits. (It's only about $800 nowadays.)

But yes, I'm aware how laughably insane it is to run a 70b model that way. That's why I was pointing it out to the commenter who suggested I just run a 70b model instead.


Downvoters: why did you downvote? Is this comment technically incorrect or inaccurate?


To a comment that suggested I try the 70b model, I replied "my card can't run that model". Someone replies back with "you may as well throw the card out if you're going to be trying to run that model". My point exactly.

More seriously, using all-CPU is not much faster as my computer only has 16GB of actual memory, which I'm aware is also hugely underspecced for a 70b model, even with memory mapping.
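
For a sense of scale, here is a minimal sketch of the weight footprint at a few precisions. It counts weights only (no KV cache or runtime overhead), the bit widths are assumed examples rather than the quantization actually used here, and `weight_bytes` is a hypothetical helper.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes needed to hold the weights alone at a given precision."""
    return params * bits_per_weight / 8


params = 70e9  # a 70b model
ram_gb = 16    # system memory mentioned above

for bits in (16, 8, 4):  # assumed example precisions
    gb = weight_bytes(params, bits) / 1e9
    verdict = "fits in" if gb <= ram_gb else "exceeds"
    print(f"{bits}-bit: {gb:.0f} GB, {verdict} {ram_gb} GB of RAM")
```

Even at 4 bits per weight the weights alone come to about 35 GB, more than double 16 GB, so memory mapping ends up paging from disk on every token.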

I have a nice NVMe SSD, so there's not much else for me to do here except upgrade my memory or graphics card.


That would explain the downvotes, thank you!



