Can anyone in the field comment on what the best offline TTS available today is? This project is old, hardly maintained, and the samples sound horrible. Festival is... not very good, but 'acceptable'. The MS built-in TTS engines are OK-ish. Dragon has some TTS, but it seems hard to use outside the use cases it was designed for, and I'm not sure if they're even using their own TTS engine? Haven't had a chance to try it out myself. Any major ones I'm missing? And just how much better are the online ones, like Google's? Anyone have any quantitative data on that?
AFAIK there are a few that produce results that can't be distinguished from a real person. None of the really good ones are open source, and the best ones I have heard of are not even for sale.
The best implementations are a competitive advantage, so they are well guarded. We don't know about them precisely because we never noticed a thing. For example, some phone operators have replaced their customer service with TTS/STT solutions. Because people tend to lock up when they realize they are talking to a computer, those systems have had to be made to sound very natural.
I know a few that are pretty crappy, but a few are plain spooky. Customers who joke and flirt with the computer, hoping for an emotional response, are probably the most likely to notice them.
Then there's the case where the US intelligence services demonstrated their capabilities to a politician (a senator or congressman) by recording him and producing a voice clip of him saying something like "death to America", done so well that no one could tell it apart from the real speaker. It also seemingly passed further voice analysis. Google it, pretty interesting read.
> I know a few that are pretty crappy, but a few are plain spooky. Customers who joke and flirt with the computer, hoping for an emotional response, are probably the most likely to notice them.
I got a call two years ago from a telemarketer who kept asking me yes/no questions in a robotic voice and didn't leave me any space in the conversation to say anything. I really felt like I was talking to a computer, so I tried to Turing-test him by forcing him to answer an open-ended question. It took two minutes, and the caller turned out to be human after all. That wasn't very reassuring, though. Humans should not speak with such a zombified voice.
For English (and I think Japanese), HTS with the STRAIGHT vocoder can be pretty amazing. Very amazing. Licensing on both of those pieces is problematic. (Edit: I should probably mention that HTS is not easy to use. It will require some mental fortitude.)
Festival can produce very good results, but you have to get into the weeds with it and do a lot of planning. If you don't like Scheme, you're not going to like dealing with Festival, and if you do like Scheme, you're going to be frustrated by Festival's particular implementation of it.
I've also noticed a lot of folks looking at post filtering, but that is a sea of dragons at the moment.
Assuming that for their demo at http://flite-hts-engine.sp.nitech.ac.jp/ they selected a good data set and parameters, I have to say I'm not particularly impressed. Do you know of any online examples of a well-tuned Festival setup? I didn't find the 'official' Festival or Festvox examples especially great, certainly worse than the TTS services available online, and I have never been able to find an example of well-tuned Festival output.
I don't know of any online examples, sorry. The problem is that a well-aligned Festival voice is not academically interesting, so people generally don't publish it. Festival performs well when it can rely solely on unit selection and has multiple choices for the same phoneme sequence. Once it has to do some guessing during synthesis, you'll see a drop in quality.
With a small bit of work, you can have Festival use an HTS voice directly. I usually use Festival to generate my label files, post-filter the phoneme timing, synthesize with HTS, then apply a post-filter.
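In rough outline, the back half of that pipeline looks something like the sketch below. Treat it as a sketch only: the file names and the .htsvoice model are placeholders, and producing the full-context labels that hts_engine expects takes more Festival setup than is shown here.

    import subprocess

    LABELS = "utt.lab"            # full-context label file generated with Festival
    VOICE = "my_voice.htsvoice"   # trained HTS voice model (placeholder name)

    # Steps 1-2 (generating the labels with Festival and tweaking the phoneme
    # timings) are assumed to have happened before this script runs.

    # Step 3: synthesize the labels with the hts_engine command-line tool.
    subprocess.run(["hts_engine", "-m", VOICE, "-ow", "speech.wav", LABELS],
                   check=True)

    # Step 4: apply a waveform post-filter of your choice, e.g. a gentle
    # high-pass with sox to clean up low-frequency rumble.
    subprocess.run(["sox", "speech.wav", "speech_filtered.wav", "highpass", "80"],
                   check=True)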
Mary (MaryTTS) sounds quite good. Apparently, if you know how to write the input files in the correct format, you can get it to speak much more like a real human by including stress/speed/tone variations too.
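For example, with a MaryTTS server running locally you can feed it SSML with prosody markup over its HTTP interface. A rough sketch, with the parameter names taken from the MaryTTS 5.x HTTP API (so treat the details as assumptions):

    import urllib.parse
    import urllib.request

    # SSML input with a prosody hint: slow down and raise the pitch of one phrase.
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">The next train '
        '<prosody rate="-20%" pitch="+10%">departs in five minutes</prosody>.'
        '</speak>'
    )

    params = urllib.parse.urlencode({
        "INPUT_TYPE": "SSML",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": "en_US",
        "INPUT_TEXT": ssml,
    })

    # The default MaryTTS server listens on port 59125.
    with urllib.request.urlopen("http://localhost:59125/process?" + params) as resp:
        with open("mary_out.wav", "wb") as out:
            out.write(resp.read())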
Can't confirm how well espeak works, since I didn't try it, but I have tried pyttsx on Windows. (pyttsx is a Python library which uses SAPI5 on Windows and espeak on Linux.)
It worked okay for me, though I haven't tested it a lot. Here's my simple Python snippet for trying pyttsx on Windows:
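    # Minimal pyttsx usage; on Windows this goes through SAPI5.
    import pyttsx

    engine = pyttsx.init()
    engine.setProperty('rate', 150)   # speaking rate, roughly words per minute
    engine.say('Hello from pyttsx on Windows.')
    engine.runAndWait()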
"The pyttsx library is a cross-platform wrapper that supports the native text-to-speech libraries of Windows and Linux at least, using SAPI5 on Windows and eSpeak on Linux."
Also, check out the synthesized train announcer's voice in my blog post above - somewhat eerie and cool :)
The TTS engine built into OS X works offline and is good. In my opinion it's as good as any of the online ones for most uses. But the OS X license limits it to "personal, non-commercial use". I'm not sure whether it's possible to get the same technology under a commercial license. Some of the voices are licensed from Nuance (parent company of Dragon), but it's not clear whether the engine is too.
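If you just want to hear it, the stock say command drives the built-in synthesizer from the shell; a minimal example ("Alex" is just one of the installed voices, and 'say -v ?' lists the rest):

    # Shelling out to the stock OS X `say` command from Python.
    import subprocess

    # Speak through the speakers with a specific voice.
    subprocess.run(["say", "-v", "Alex",
                    "This is the built-in OS X synthesizer."], check=True)

    # Or render to an audio file instead of playing it.
    subprocess.run(["say", "-v", "Alex", "-o", "demo.aiff",
                    "Reading to a file instead."], check=True)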
Some better free-software offerings would really open this area up for experimentation, since the commercial offerings tend to be pretty black-box and targeted at very specific use-cases.
Last time I looked, the stock Festival voices were intentionally not production quality - they were demonstrations of how well the Festvox automated voice-building scripts worked without manual tweaking. They're basically academic research that the researchers let everyone use. If you want high-quality voices you'll have to pay for them, because so far no one who's put in the work has been willing to give the results away for free.
Egads! It was so obvious that I kept calling it GNUSpeak in my head, enjoying the cleverness of it all, and only now, following your comment, noticed it's actually GNUSpeech...
It's probably the best technology that the early 90s had to offer.
FTA:
> gnuspeech is currently fully available as a NextSTEP 3.x version in the SVN repository along with the Gnu/Linux/GNUStep version, which is incomplete though functional
Wikipedia[1] lists the NeXTSTEP 3.x releases:
3.0 - September 8, 1992
3.1 - May 25, 1993
3.2 - October 1993
3.3 - February 1995
From what I can tell, gnuspeech is fully synthesized speech. Siri, for example, uses a voice actor to record tens of thousands of utterances, and to generate words/phrases that weren't recorded, it concatenates (more or less) segments from the recorded phrases.
I think it may sound worse because it's based on a more accurate and complex model of the human voice, so while it sounds worse now, it has the potential to get better and perhaps surpass implementations that use the more common methods.
Can the voice synthesis method used on Android sing, scream, or make other sounds that are not speech?
I wonder if we will see any of the deep learning libraries try to do voice recognition (a good open source implementation is lacking) or voice generation.
I am not sure of the source of the confusion. I meant TTS in the first part of my comment ("voice generation" isn't far from "speech synthesis", which is the title of the Wikipedia article for it in English) and both ASR and TTS in the second. Regarding ASR I'd not heard of Kaldi, thanks for the tip.
It might be just me, but I can't figure out how to even start compiling this to test it out on Linux. That said, I'm always happy to see alternatives to Festival TTS. Hopefully it'll become a bit more user-friendly in time.
Interesting! Thanks! Also, I interpreted the statement about the v1.0 release as being the initial release; perhaps there was a v0.1 release before that...
This seems super interesting! They say that they are using a novel approach that models the actual way humans create sounds. Is this completely new? I would assume this implies that you can reuse much more across languages and have a more flexible system - like getting it to shout or sing, make strange noises, whistle, etc. But I don't know a lot about it. Is what I'm saying science fiction?
This is the kind of article in which I'd love (and have almost come to expect!) to find comments that provide much more breadth and depth of information than the actual link, here on HN; this is one of the things that make this community so great.
The idea is actually quite old (as they half-way admit, this is based on a program that was developed on NeXT machines when those were current; the research goes back to the late 60s and 70s). Wikipedia has some background naturally:
https://en.wikipedia.org/wiki/Speech_synthesis#Articulatory_...
There is also research into using this type of physical modeling for singing[1] and laughing[2], so it's not far-fetched to imagine a model that can speak, sing, laugh, shout, and perform any of the various vocal articulations humans can make. The main obstacles seem to be figuring out how to control the model over time, how to smoothly transition between control states, and how to capture enough fine detail to cross the uncanny valley.
> The idea is actually quite old (as they half-way admit, this is based on a program that was developed on NeXT machines when those were current; the research goes back to the late 60s and 70s).
I don't think linking the papers that form the basis of the work, and describing the history in fair detail (including the original development for the NeXT, and the technical change from real-time DSP-based synthesis to real-time CPU-based synthesis enabled by CPU speed improvements since the NeXT was current), constitutes "half-way admitting" that the idea has been around for a while. It's pretty outright explicit.