
Tried that one. Quality is great, but generations sometimes fail and it's rather slow. It also needs ~13 GB of VRAM, so it's not my first choice for voice agents tbh.


alright, dumb question.

(1) I assume these things can do multiple languages

(2) Given (1), can you strip all the languages you aren't using and speed things up?


Actually good question.

I'd say probably not. You can't easily "unlearn" things from the model weights (and even if you could, that alone wouldn't help). You could retrain/finetune the model heavily on a single language, but again, that alone does not speed up inference.

To gain speed you'd have to bring the parameter count down and train the model from scratch on a single language only. That might work, but it's also quite probable that it introduces other issues in the synthesis. In a perfect world the model would take all those "free parameters" no longer needed for other languages and use them for better synthesis of the single trained language. That might be true to a certain degree, but it's not exactly how AI parameter scaling works.
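
To put rough numbers on why dropping languages alone doesn't buy much speed, here is a back-of-the-envelope sketch. It assumes a generic dense transformer with made-up sizes (not the real config of XTTS or any other TTS model), uses the common approximation of about 2 FLOPs per matmul weight per token, and ignores the attention-over-context cost. The function names are hypothetical.

```python
# Rough per-token cost model for a dense transformer.
# All sizes are hypothetical, chosen only for illustration.

def body_params(n_layers: int, d_model: int) -> int:
    # ~12 * d_model^2 weights per layer (attention + 4x-wide MLP),
    # ignoring biases and norms.
    return 12 * n_layers * d_model * d_model

def forward_flops_per_token(n_layers: int, d_model: int, vocab: int) -> int:
    # ~2 FLOPs per weight used in a matmul. The input embedding is a table
    # lookup (treated as free); the output projection costs vocab * d_model.
    return 2 * (body_params(n_layers, d_model) + vocab * d_model)

# Same network body, with and without 80% of the vocabulary entries.
full = forward_flops_per_token(n_layers=24, d_model=1024, vocab=50_000)
pruned = forward_flops_per_token(n_layers=24, d_model=1024, vocab=10_000)

savings = 1 - pruned / full
print(f"compute saved by dropping 80% of the vocabulary: {savings:.1%}")
```

Under these assumptions the layers themselves dominate the cost, so trimming language-specific vocabulary only shaves off a small slice. The big lever is the layer count and width, which is exactly the part you can't shrink without retraining.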


I don't know what I'm talking about, but could you use distillation techniques?


Maybe possible, I did not look into that much for Coqui XTTS. What I know is that the quantized versions of Orpheus sound noticeably worse. I feel audio models are quite sensitive to quantization.
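
For a feel of what quantization does to the weights, here is a toy sketch. It assumes plain symmetric round-to-nearest quantization on random Gaussian "weights", which is a simplification and not how the Orpheus quants were actually produced. It only shows how the rounding error grows as the bit width drops; it says nothing about perceived audio quality.

```python
import random

def quantize_roundtrip(weights, bits):
    # Symmetric uniform quantization: scale to signed integers,
    # round to nearest, then scale back to floats.
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def rms_error(a, b):
    # Root-mean-square difference between two equally long lists.
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
# Stand-in for one weight tensor; the distribution is an assumption.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

err8 = rms_error(weights, quantize_roundtrip(weights, 8))
err4 = rms_error(weights, quantize_roundtrip(weights, 4))
print(f"RMS error int8: {err8:.2e}, int4: {err4:.2e}")
```

Going from 8 to 4 bits leaves far fewer levels to represent the same range, so the per-weight error gets much larger. Whether that is audible depends on the model, which fits the impression that some audio models tolerate it worse than others.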



