You have to change the --percent flag, and it takes some experimentation. It takes six integers (0-100) as three pairs, one pair each for the parameters, the attention cache, and the hidden states. Within each pair, the first number is the percent on GPU, the second is the percent on CPU (system RAM), and whatever's left over goes to disk.
For disk offloading to work you may also have to specify --offload-dir.
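To make the pairing concrete, here's a small hypothetical helper (not part of FlexGen itself) that takes the six numbers you'd pass to --percent and works out the full GPU/CPU/disk split for each tensor group:

```python
def split_percent(percent):
    """percent: six ints, (gpu, cpu) pairs for the parameters, attention
    cache, and hidden states, in that order. Returns the gpu/cpu/disk
    split per group; whatever a pair doesn't cover goes to disk."""
    assert len(percent) == 6, "--percent takes exactly six integers"
    groups = ["parameters", "attention cache", "hidden states"]
    placement = {}
    for name, (gpu, cpu) in zip(groups, zip(percent[::2], percent[1::2])):
        assert 0 <= gpu <= 100 and 0 <= cpu <= 100 and gpu + cpu <= 100
        placement[name] = {"gpu": gpu, "cpu": cpu, "disk": 100 - gpu - cpu}
    return placement

# e.g. --percent 20 50 100 0 100 0 puts 30% of the parameters on disk
# and keeps the cache and hidden states fully on GPU
print(split_percent([20, 50, 100, 0, 100, 0]))
```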
I have opt-30B running on a 3090 with --percent 20 50 100 0 100 0, although I think those could be tweaked to be faster.
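For reference, the full invocation would look roughly like this (flag names per the FlexGen CLI; the offload directory path is a placeholder):

```shell
# ~30% of the weights spill to disk with this split, so --offload-dir is needed
python3 -m flexgen.flex_opt \
  --model facebook/opt-30b \
  --percent 20 50 100 0 100 0 \
  --offload-dir ~/flexgen-offload
```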
128GB, but by turning on compression I managed to fit the whole thing on the GPU. I did try running it off a mix of RAM and SSD as well, and it was slower but still usable. Presumably disk speed matters a lot.
Well, I just got some more sticks. While I wait for the RAM to arrive, I'll try it with compress_weight and compress_cache. If you're in any Discord or other space where people are tinkering with this, I'd love to join!
After turning on compression I was able to fit the whole thing in GPU memory and then it became much faster. Not ChatGPT speeds or anything, but under a minute for a response in their chatbot demo. A few seconds in some cases.