Is there any better model you can point at? I would be interested in having a li...

Uehreka · 2025-09-03T14:45:33 1756910733

It’s tough to name the best local TTS since they all seem to trade off on quality and features and none of them are as good as ElevenLabs’ closed-source offering.

However Kokoro-82M is an absolute triumph in the small model space. It curbstomps models 10-20x its size in terms of quality while also being runnable on like, a Raspberry Pi. It’s the kind of thing I’m surprised even exists. Its downside is that it isn’t super expressive, but the af_heart voice is extremely clean, and Kokoro is way more reliable than other TTS models: It doesn’t have the common failure mode where you occasionally have a couple extra syllables thrown in because you picked a bad seed.

If you want something that can do convincing voice acting, either pay for ElevenLabs or keep waiting. If you’re trying to build a local AI assistant, Kokoro is perfect, just use that and check the space again in like 6 months to see if something’s beaten it. https://huggingface.co/hexgrad/Kokoro-82M

refulgentis · 2025-09-03T15:38:29 1756913909

There's a certain know-nothing feeling I get that makes me worried if we start at the link (which has data showing it > ElevenLabs quality), jump to "eh, it's actually worse than anything I've heard in the last 2 years", and end up at "none are as good as ElevenLabs" - the recommendation and commentary on it, of course, has nothing to do with my feeling, cheers

sandreas · 2025-09-03T16:48:45 1756918125

What is your opinion about F5-TTS or Fish-TTS?

brettpro · 2025-09-04T04:17:21 1756959441

I recently implemented Fish for a project and found it adequate for TTS but wildly impressive in voice cloning. My POC originally required 3-10 audio samples but I removed the minimum because it could usually one shot it.

The model is good, but I will say their inference code leaves a lot to be desired. I had to rewrite large portions of it for simple things like correct chunking and streaming. The advertised expressive keywords are very much hit and miss, and the devs have gone dark unfortunately.
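To illustrate what "correct chunking and streaming" can mean in practice, here is a minimal sketch in plain Python. It is not the Fish inference code or my rewrite of it: `chunk_text`, `stream_chunks`, the `max_chars` limit, and the `synthesize` callable are all hypothetical names made up for this example. The idea is to split on sentence boundaries, keep each chunk under a length limit, and yield audio chunk by chunk so playback can start before the whole text is synthesized.

```python
import re


def chunk_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Split after ., ! or ? followed by whitespace (a simple heuristic, not a full tokenizer).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the last space that fits,
        # or mid-word if it contains no space at all.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        if not sentence:
            continue
        # Pack consecutive short sentences into one chunk while they fit.
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def stream_chunks(text, synthesize, max_chars=200):
    """Yield the output of synthesize() one chunk at a time.

    synthesize is a hypothetical callable (text -> audio) standing in for
    whatever TTS backend is actually used.
    """
    for chunk in chunk_text(text, max_chars):
        yield synthesize(chunk)
```

A caller would iterate over `stream_chunks(text, synthesize)` and hand each result to the audio player as it arrives. Real models may want a limit based on tokens or phonemes rather than characters, so treat `max_chars` as a placeholder.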

sandreas · 2025-09-11T21:32:00 1757626320

Did you consider contributing your improvements?

lynx97 · 2025-09-03T15:04:53 1756911893

I cobbled together llm-tts to run as many local (and remote) TTS models as I could find and get working.

https://github.com/mlang/llm-tts

Strictly speaking, even music generation fits the usage pattern: text in, audio out.

llm-tts is far from complete, but it makes it relatively "easy" to try a few models in a uniform way.

nipponese · 2025-09-03T15:41:25 1756914085

Not OS or local, but just try ChatGPT Voice Conversation mode. To my ears, it's a generation ahead of these VibeVoice samples.

riquito · 2025-09-03T16:53:29 1756918409

Probably not even the best ones, but among some recent models I find Dia and Orpheus more natural

- http://dia-tts.com/

- https://github.com/canopyai/Orpheus-TTS

popalchemist · 2025-09-04T07:26:37 1756970797

Higgs Audio v2 is currently SOTA in OSS TTS.

satellite2 · 2025-09-04T02:54:32 1756954472

ElevenLabs v3 (not local)

whimsicalism · 2025-09-03T21:17:01 1756934221

i think orpheus and sesame sound better