I got this running on macOS using mlx-audio thanks to Prince Canuma: https://x.c...

genewitch · 2026-01-23T03:26:42 1769138802

If I am ever in the same city as you, I'll buy you dinner. I poked around during my free time today trying to figure out how to run these models, and here is the estimable Simon Willison just presenting it on a platter.

Hopefully I can make this work on Windows (or Linux, I guess).

Thanks so much.

cube00 · 2026-01-23T10:57:03 1769165823

> Hopefully I can make this work on Windows (or Linux, I guess).

mlx-audio only works on Apple Silicon.

bigyabai · 2026-01-23T17:45:37 1769190337

The original script supports CPU inference, nonetheless.

rahimnathwani · 2026-01-23T20:44:27 1769201067

If you want to do custom voice cloning, record a sample WAV file of a sentence or two, then try this:

  uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
    
  python -m mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 --text "Hello, this is a test." --ref_audio path_to_audio.wav --ref_text "Transcript of the reference audio." --play
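
In case it helps, here is a minimal Python sketch of the same steps: check that the reference clip really is only a sentence or two long, then assemble the command shown above. The helper names (`wav_duration_seconds`, `build_generate_command`) are made up for illustration and are not part of mlx-audio; the flags are copied verbatim from the command above, and the duration check assumes an uncompressed PCM WAV that the standard-library `wave` module can read.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the length of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def build_generate_command(ref_audio: str, ref_text: str, text: str) -> list[str]:
    """Assemble the argument list for the voice-cloning command above.

    Hypothetical helper: every flag here is copied from the command in the
    comment above; nothing new is introduced.
    """
    return [
        "python", "-m", "mlx_audio.tts.generate",
        "--model", "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16",
        "--text", text,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--play",
    ]
```

On an Apple Silicon machine with mlx-audio installed, the list could be handed to `subprocess.run(cmd, check=True)` after confirming that `wav_duration_seconds("path_to_audio.wav")` comes out at a few seconds.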

indigodaddy · 2026-01-23T01:36:50 1769132210

Simon, how do you think this would perform on CPU only? Let's say a Threadripper with 20GB of RAM. (Voice cloning in particular.)

simonw · 2026-01-23T03:06:43 1769137603

No idea at all, but my guess is it would work but be a bit slow.

You'd need to use a different build of the model though; I don't think MLX has a CPU implementation.

genewitch · 2026-01-23T03:30:45 1769139045

The old voice cloning and/or TTS models were CPU only, and they weren't realtime, but they were no worse than 2:1: 30 seconds of audio would take roughly 60 seconds to generate. In 2021, one-shot TTS/cloning on GPUs was getting there, and that was close enough to realtime: if you were willing to deal with it, you could wire microphone audio to the model, speak, and have the model modify the voice in real time. Phil Hendrie is jealous.

Anyhow, with faster CPUs and optimizations, you won't be waiting too long. Also, 20GB is overkill for an audio model. Only text models (LLMs) are huge and take seemingly infinite memory. SD/FLUX models, for instance, use under 16GB of RAM (uh, mine do, at least!).
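
To make the 2:1 figure concrete, here is a rough sketch of the arithmetic, usually called the real-time factor (generation time divided by audio length). The function names are mine and purely illustrative; the only numbers used are the 60 seconds and 30 seconds from above.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def keeps_up_with_playback(generation_seconds: float, audio_seconds: float) -> bool:
    """True when audio is generated at least as fast as it plays back."""
    return real_time_factor(generation_seconds, audio_seconds) <= 1.0


# The CPU-only case described above: 60 seconds to generate 30 seconds of audio.
rtf = real_time_factor(60, 30)
print(f"real-time factor: {rtf:.1f}")
```

Anything at or below 1.0 is fast enough to stream; at 2:1 you wait twice as long as the clip itself.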

gcr · 2026-01-22T23:44:13 1769125453

This is wonderful, thank you. Another win for uv!