You have to change the --percent flag, and it takes some experimentation. It takes six integers (0-100) as three pairs, one pair each for the parameters, the attention cache, and the hidden states. Within each pair, the first number is the percent on GPU, the second is the percent on CPU (system RAM), and whatever's left over goes to disk.
For disk offloading to work you may also have to specify --offload-dir.
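To make the pairing concrete, here's a small hypothetical helper (not part of FlexGen itself) that takes the six numbers you'd pass to --percent and works out the full GPU/CPU/disk split for each tensor group:

```python
def split_percent(percent):
    """percent: six ints, (gpu, cpu) pairs for the parameters, attention
    cache, and hidden states, in that order. Returns the gpu/cpu/disk
    split per group; whatever a pair doesn't cover goes to disk."""
    assert len(percent) == 6, "--percent takes exactly six integers"
    groups = ["parameters", "attention cache", "hidden states"]
    placement = {}
    for name, (gpu, cpu) in zip(groups, zip(percent[::2], percent[1::2])):
        assert 0 <= gpu <= 100 and 0 <= cpu <= 100 and gpu + cpu <= 100
        placement[name] = {"gpu": gpu, "cpu": cpu, "disk": 100 - gpu - cpu}
    return placement

# e.g. --percent 20 50 100 0 100 0 puts 30% of the parameters on disk
# and keeps the cache and hidden states fully on GPU
print(split_percent([20, 50, 100, 0, 100, 0]))
```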
I have opt-30B running on a 3090 with --percent 20 50 100 0 100 0, although I think those could be tweaked to be faster.
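For reference, the full invocation would look roughly like this (flag names per the FlexGen CLI; the offload directory path is a placeholder):

```shell
# ~30% of the weights spill to disk with this split, so --offload-dir is needed
python3 -m flexgen.flex_opt \
  --model facebook/opt-30b \
  --percent 20 50 100 0 100 0 \
  --offload-dir ~/flexgen-offload
```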
128GB, but by turning on compression I managed to fit the whole thing on the GPU. I did try running it off a mix of RAM and SSD as well, and it was slower but still usable. Presumably disk speed matters a lot.
Well, I just got some more sticks. While I wait for the RAM to arrive, I'll try it with compress_weight and compress_cache. If you're in any Discord or other space where people are tinkering with this, I'd love to join!
After turning on compression I was able to fit the whole thing in GPU memory and then it became much faster. Not ChatGPT speeds or anything, but under a minute for a response in their chatbot demo. A few seconds in some cases.