>is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU ?
Not simply, no.
But as a starting point, pick a model whose weights (parameter count × bytes per parameter at your chosen precision) come in just under your VRAM, check whether the speed is satisfactory, and adjust from there. You can trade quality for fit by quantizing the model, or accept slower inference by offloading layers you can't fit instead of loading the entire model into VRAM.
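As a rough back-of-envelope sketch of that sizing step (my own illustration, not from the thread): weights take roughly params × bits/8 bytes, and since single-batch decoding tends to be memory-bandwidth bound, tok/s is bounded by bandwidth divided by the bytes of weights read per token. The 1.2× overhead factor for KV cache and activations and the example GPU numbers below are assumptions.

```python
def estimate(params_b: float, bits_per_weight: float,
             vram_gb: float, bandwidth_gbps: float,
             overhead: float = 1.2) -> None:
    """Rough fit/speed estimate for single-batch local inference."""
    weights_gb = params_b * bits_per_weight / 8   # 1e9 params * (bits/8) bytes = GB
    needed_gb = weights_gb * overhead             # assumed headroom for KV cache etc.
    tok_s = bandwidth_gbps / weights_gb           # bandwidth-bound upper limit, batch 1
    fits = "fits" if needed_gb <= vram_gb else "does NOT fit"
    print(f"{params_b:.0f}B @ {bits_per_weight:.0f}-bit: ~{weights_gb:.1f} GB weights "
          f"(~{needed_gb:.1f} GB with overhead) -> {fits} in {vram_gb:.0f} GB; "
          f"~{tok_s:.0f} tok/s upper bound at {bandwidth_gbps:.0f} GB/s")

# Hypothetical 24 GB desktop card with ~1000 GB/s memory bandwidth:
estimate(7, 16, 24, 1000)   # fp16 7B: fits, ~71 tok/s ceiling
estimate(7, 4, 24, 1000)    # 4-bit 7B: fits easily, much faster ceiling
estimate(70, 4, 24, 1000)   # 4-bit 70B: doesn't fit; would need offloading
```

Real throughput will land below these ceilings, but it makes the "just under VRAM" heuristic concrete.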