I recently implemented Fish for a project and found it adequate for TTS but wildly impressive at voice cloning. My POC originally required 3-10 audio samples, but I removed the minimum because the model could usually one-shot the clone from a single sample.
The model is good, but I will say their inference code leaves a lot to be desired. I had to rewrite large portions of it for simple things like correct chunking and streaming. The advertised expressive keywords are very much hit-or-miss, and unfortunately the devs have gone dark.