I'm an author on a few of the papers referenced here (the Deep Voice papers from Baidu). I'm happy to answer any questions folks may have about neural speech synthesis, as I've been working on this for several years now.
In general, it's a fascinating space. There are challenges in text processing (not even mentioned in the blog), such as grapheme-to-phoneme conversion, part-of-speech detection, word sense disambiguation, and text normalization; challenges in utterance-level modeling (spectrograms); and challenges in "spectrogram inversion" / waveform synthesis. The NLP components of the pipeline are often overlooked but are no less important than they were a few years ago -- part of speech / word sense is the difference between "Time is a CONstruct" and "I'm going to conSTRUCT a tower", and the difference between "Let's drop that bass" being about a DJ or about a fish. The acoustic modeling phase (e.g. Tacotron, Deep Voice 3) works fairly well, and can produce some awesome demos with things like style tokens ("GST-Tacotron"), but still has a ways to go until it can encompass the full range of human inflection and emotion. At the waveform synthesis level, models like WaveRNN (with subscale modeling) and Parallel WaveNet make it possible to deploy modern waveform synthesis models, but deploying them on low-power devices remains a major issue due to compute restrictions. Overall, there are lots of interesting challenges to work on, and we're making a lot of progress quite quickly -- and I haven't even started talking about voice conversion or voice cloning!
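To make the front-end challenges above concrete, here's a toy sketch of text normalization followed by grapheme-to-phoneme conversion. The rule table and lexicon are invented stand-ins, not from any real system; real systems use large grammars (e.g. Kestrel) and pronunciation lexica backed by learned g2p models.

```python
# Toy front-end sketch: text normalization, then grapheme-to-phoneme.
# The tables below are illustrative placeholders only.

NORMALIZATION_RULES = {"5/10/2019": "may tenth twenty nineteen"}
PRONUNCIATION_DICT = {"may": ["M", "EY1"], "tenth": ["T", "EH1", "N", "TH"]}

def normalize_text(text):
    # Expand non-standard words (dates, numbers, ...) into spoken words.
    words = []
    for token in text.lower().split():
        words.extend(NORMALIZATION_RULES.get(token, token).split())
    return words

def grapheme_to_phoneme(words):
    # Look each word up in the lexicon; fall back to spelling out letters.
    # (A real system would use a learned g2p model for unknown words.)
    phonemes = []
    for word in words:
        phonemes.extend(PRONUNCIATION_DICT.get(word, list(word.upper())))
    return phonemes

print(normalize_text("Meeting on 5/10/2019"))
print(grapheme_to_phoneme(["may", "tenth"]))
```

The output of a front-end like this would then feed the acoustic model, which is where the learned, utterance-level modeling takes over.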
I don't think anything about the current set of tools is specific to sample rate; WaveNet, Tacotron, WaveRNN, etc., should work fine for generating 44.1 kHz audio. They might just need slightly different hyperparameters or model sizes to work well, or may take longer to train due to longer sequence lengths.
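For a sense of scale, the sequence-length cost of a higher sample rate is just arithmetic (illustrative numbers, not measurements from any particular model):

```python
# Pure arithmetic: the number of autoregressive steps a sample-level
# vocoder takes per utterance grows linearly with the sample rate.
utterance_seconds = 5
for sample_rate in (16_000, 24_000, 44_100):
    steps = sample_rate * utterance_seconds  # one prediction per sample
    print(f"{sample_rate:>6} Hz: {steps:>9,} steps")

# The same 5-second clip is ~2.76x more steps at 44.1 kHz than at 16 kHz.
ratio = 44_100 / 16_000
```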
Cool! Does text-to-speech require AI, or is there any active work in non-AI methods? Which bits are the AI bits? Do "deep" methods substantially improve over whatever classical methods we might have had?
1. Does text-to-speech require AI?

This one is a bit tricky to answer, since it requires defining "AI". AI as a moniker has been used to describe deep neural networks, search algorithms, expert systems and logic systems, particle filters, SVMs, etc. Almost all text-to-speech (TTS) systems are based on a combination of some of these machine-learning methods and digital signal processing (DSP), so I would say yes, text-to-speech is exactly the sort of thing "AI" describes, even if it doesn't resemble human-like thinking the way some other AI applications do.
2. Is there any active work in non-AI methods?
This one again is a bit tricky, for the same reason as before. However, there are a ton of pieces of the TTS pipeline that aren't AI in the current sense of the word (machine learning with neural networks or HMMs or other classifiers). For example, concatenative systems will traditionally take a large database of audio, divide it into chunks, and then recombine those chunks, using an overlap-add method such as OLA or PSOLA to blend them together. Choosing which chunks to combine to create the target speech becomes an AI / search problem: use some sort of acoustic model to predict the acoustic parameters of each frame, and then use a Viterbi search with target / join costs to find the optimal chunks. As another example of non-AI parts of the pipeline, text normalization tends to involve a lot of hand-written rules; for example, should you say "5/10/2019" as "May tenth, twenty nineteen", "the tenth of May, twenty nineteen", "the tenth of May, two thousand nineteen", or even "October fifth, twenty nineteen"? This decision and the conversion itself are often done with a ton of handwritten rules or grammars (see Kestrel, Google's text normalization system, and the open-source version, cleverly named Sparrowhawk). Anyways, the real answer is that TTS is always a combination of AI (machine learning) approaches with specialized text and audio processing algorithms.
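Here's a toy sketch of that unit-selection search, with made-up 1-D "acoustic features" standing in for real spectral parameters; the dynamic program is the same shape as what a real concatenative system runs, but the costs and candidates here are invented for illustration.

```python
# Toy unit-selection search: pick one candidate audio "chunk" per target
# frame, minimizing target cost (how well a chunk matches the desired
# acoustics) plus join cost (how smoothly adjacent chunks connect).

def unit_selection(targets, candidates, target_cost, join_cost):
    # best[i][j]: minimal cost of a path ending in candidate j at frame i
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = []
    for i in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[i]:
            costs = [best[-1][j] + join_cost(prev, c)
                     for j, prev in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k] + target_cost(targets[i], c))
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    # Trace back the optimal sequence of chunk indices.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return list(reversed(path))

# 1-D stand-in features: desired values, and candidate chunks per frame.
targets = [1.0, 2.0, 3.0]
candidates = [[0.9, 2.1], [1.8, 2.2], [2.9, 4.0]]
tc = lambda t, c: abs(t - c)        # target cost: match the acoustics
jc = lambda a, b: 0.5 * abs(a - b)  # join cost: penalize abrupt jumps
print(unit_selection(targets, candidates, tc, jc))
```

In a real system the features would be vectors of spectral parameters per frame, and the candidate lists would come from a large annotated audio database.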
3. Which bits are the AI bits?
The AI bits are the bits where you need to make some sort of heuristic decision, and you'd like to make it by imitating some target speech: for example, part-of-speech detection, predicting acoustic parameters (spectrograms, F0, etc.), and, more recently, waveform synthesis as well.
4. Do deep methods significantly improve on the state of the art?
Yes, though they also come at a cost. For example, deep sequence-to-sequence networks make great frame-level models: Tacotron and similar models can do things like emotional and stylized voice synthesis much better than what I've seen HMMs and other non-deep models do. As another example, WaveNet / WaveRNN / etc are some of the only parametric speech models (that is, generating the waveform from scratch instead of copying it from a database of audio) that can match the quality of concatenative models (copying audio from a database), but they can be quite difficult to deploy due to high computational cost. Overall, though, yeah, deep methods and all the improvements to neural networks in the past few years are having a profound impact on the quality and naturalness of TTS.
Thanks very much for your reply, super helpful!! Sorry if that was difficult to answer. I guess I'm interested in how far we've come from TTS engines like the LPC [1] engines we had in the 80s, or what you get from Festival [2]. Maybe there isn't as clear a separation between their methods and the modern Google-scale deep-learning approaches as I thought.
There are actually a few recent papers that show minor improvements from integrating LPC prediction into deep methods ([0], [1]). In my experience (some of which comes from reproducing these, some from my own experiments), this isn't actually too useful, and at most offers a minor modeling benefit.
The main difference between something like Festival and what we have now is the amount of domain-specific engineering. (This is generally the promise of deep learning -- replace hand-engineered features with simple-to-understand features and a deep model.) If you go and read the Festival manual, you'll find tons of domain-specific rules, heuristics, and subroutines; for example, there's a page on writing letter-to-sound rules as a grammar [2]. Nowadays, we may have a pipeline that resembles Festival's at a high level, but each step of the pipeline is learned as a deep model from data rather than being carefully hand-engineered by many people over the course of years. This yields much more fluid speech, as well as much, much faster iteration and experimentation times, leading to faster progress as well.
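To give a flavor of that hand-engineered style, here's a toy letter-to-sound rule matcher in the spirit of context-dependent rule grammars; these specific rules are invented for illustration and aren't taken from Festival.

```python
# Toy letter-to-sound rules: each rule is
# (left context, letters, right context, output phones), tried in order.
# The rules below are invented examples, not real Festival rules.

RULES = [
    ("", "c", "e", ["S"]),   # "c" before "e" is soft
    ("", "c", "i", ["S"]),   # "c" before "i" is soft
    ("", "c", "", ["K"]),    # default "c"
    ("", "a", "", ["AE"]),
    ("", "t", "", ["T"]),
]

def letter_to_sound(word):
    phones, i = [], 0
    while i < len(word):
        for left, letters, right, out in RULES:
            if (word.startswith(letters, i)
                    and word.endswith(left, 0, i)
                    and word.startswith(right, i + len(letters))):
                phones.extend(out)
                i += len(letters)
                break
        else:
            i += 1  # no rule matched; skip this letter
    return phones

print(letter_to_sound("cat"))   # default "c" rule fires -> K
print(letter_to_sound("cite"))  # soft "c" rule fires -> S
```

Covering a whole language this way takes hundreds of such rules written and debugged by hand, which is exactly the effort a learned grapheme-to-phoneme model replaces.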
Thanks for posting here! Do you see any chance of an open-source framework, like Mozilla's Tacotron, competing with something like Google's WaveNet in quality?
First of all, it's important to note that Tacotron and WaveNet are responsible for different parts of the speech synthesis pipeline, so the comparison here isn't quite accurate. Specifically, Tacotron takes a representation of the text (characters, phonemes, etc) and converts it into a frame-level acoustic representation (spectrograms, log mel spectrograms, etc, spaced every 5-25ms). WaveNet takes a frame-level representation of the audio (for example, the output of Tacotron, or phonemes with frame-level timing information) and converts it to a waveform.
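As a sketch of the interface between the two stages, here are the shapes involved under some typical (but purely illustrative) settings; none of these numbers is required by either model.

```python
# Illustrative shapes at the Tacotron -> WaveNet hand-off: an 80-band
# log-mel spectrogram with a 12.5 ms hop, synthesizing 22,050 Hz audio.

sample_rate = 22_050   # Hz
hop_ms = 12.5          # one spectrogram frame every 12.5 ms
n_mels = 80            # mel frequency bands per frame
hop_samples = int(sample_rate * hop_ms / 1000)  # samples per frame

seconds = 3.0          # a 3-second utterance
n_frames = int(seconds * 1000 / hop_ms)

# Stage 1 (Tacotron-style): text/phonemes -> (n_frames, n_mels) frames.
spectrogram_shape = (n_frames, n_mels)
# Stage 2 (WaveNet-style): frames -> waveform of n_frames * hop_samples.
waveform_len = n_frames * hop_samples

print(spectrogram_shape, waveform_len)
```

The key point is the compression ratio: the acoustic model only has to produce a few hundred frames where the vocoder produces tens of thousands of samples, which is why the two problems are modeled separately.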
Second, I don't see any reason why there shouldn't be an open-source Tacotron or WaveNet implementation that's as good as Google's model implementations. Implementing and training these models is expensive but not prohibitively so (nowadays, you could probably do it with $5,000 - $10,000, including experimentation costs).
That said, the quality of text-to-speech systems is only partially determined by the quality of these models -- much if not most of the work of building high-quality text-to-speech systems goes into things like high-quality data collection, good data annotations, good normalization and NLP tailored to the domain of the TTS system, multilanguage support, optimized inference implementations for server or mobile platforms, etc.