It definitely runs. It uses almost 20GB of RAM, so I had to quit my browser and VS Code to keep memory usage down.
But it produces completely garbled output. Either there's a bug in the program, the tokens are different from the 13B model's, I performed the conversion wrong, or the 4-bit quantization breaks it.
I've finally managed to download the model and it seems to be working well for me. There have been some updates to the quantization code, so if you do a 'git pull && make' and rerun the quantization script, it may work for you. I'm getting about 350ms per token with the 30B model.
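For reference, the sequence I have in mind looks roughly like this (a sketch assuming the standard llama.cpp layout; the model paths and the quantize arguments are assumptions and may differ in your checkout):

```sh
# Pull the latest quantization fixes and rebuild
git pull && make

# Re-quantize the 30B model from the f16 GGML file down to 4-bit
# (paths assumed; the trailing "2" selects the q4_0 format)
./quantize ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q4_0.bin 2

# Quick smoke test with the freshly quantized weights
./main -m ./models/30B/ggml-model-q4_0.bin -p "Hello" -n 64
```

If the output is still garbled after re-quantizing from a fresh conversion, that would point back at the conversion step rather than the quantizer.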