I'm an author on a few of the papers referenced here (the Deep Voice papers from Baidu). I'm happy to answer any questions folks may have about neural speech synthesis, as I've been working on this for several years now.
In general, it's a fascinating space. There are challenges in text processing (not even mentioned in the blog), such as grapheme-to-phoneme conversion, part-of-speech detection, word sense disambiguation, and text normalization; challenges in utterance-level modeling (spectrograms); and challenges in "spectrogram inversion" / waveform synthesis. The NLP components of the pipeline are often overlooked but are no less important than they were a few years ago -- part of speech / word sense is the difference between "Time is a CONstruct" and "I'm going to conSTRUCT a tower", and the difference between "Let's drop that bass" being about a DJ or about a fish. The acoustic modeling phase (e.g. Tacotron, Deep Voice 3) works fairly well, and can produce some awesome demos with things like style tokens ("GST-Tacotron"), but still has a ways to go until it can encompass the full range of human inflection and emotion. At the waveform synthesis level, models like WaveRNN (with subscale modeling) and Parallel WaveNet make it possible to deploy modern waveform synthesis models, but deploying them on low-power devices remains a major issue due to compute restrictions. Overall, there are lots of interesting challenges to work on, and we're making a lot of progress quite quickly -- and I haven't even started talking about voice conversion or voice cloning!
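To make the front-end challenges above concrete, here's a toy sketch of text normalization followed by grapheme-to-phoneme conversion. The rule table and lexicon are invented stand-ins, not from any real system; real systems use large grammars (e.g. Kestrel) and pronunciation lexica backed by learned g2p models.

```python
# Toy front-end sketch: text normalization, then grapheme-to-phoneme.
# The tables below are illustrative placeholders only.

NORMALIZATION_RULES = {"5/10/2019": "may tenth twenty nineteen"}
PRONUNCIATION_DICT = {"may": ["M", "EY1"], "tenth": ["T", "EH1", "N", "TH"]}

def normalize_text(text):
    # Expand non-standard words (dates, numbers, ...) into spoken words.
    words = []
    for token in text.lower().split():
        words.extend(NORMALIZATION_RULES.get(token, token).split())
    return words

def grapheme_to_phoneme(words):
    # Look each word up in the lexicon; fall back to spelling out letters.
    # (A real system would use a learned g2p model for unknown words.)
    phonemes = []
    for word in words:
        phonemes.extend(PRONUNCIATION_DICT.get(word, list(word.upper())))
    return phonemes

print(normalize_text("Meeting on 5/10/2019"))
print(grapheme_to_phoneme(["may", "tenth"]))
```

The output of a front-end like this would then feed the acoustic model, which is where the learned, utterance-level modeling takes over.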
I don't think anything about the current set of tools is specific to sample rate; WaveNet, Tacotron, WaveRNN, etc., should work fine for generating 44.1 kHz audio. They might just need slightly different hyperparameters or model sizes to work well, or may take longer to train due to longer sequence lengths.
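For a sense of scale, the sequence-length cost of a higher sample rate is just arithmetic (illustrative numbers, not measurements from any particular model):

```python
# Pure arithmetic: the number of autoregressive steps a sample-level
# vocoder takes per utterance grows linearly with the sample rate.
utterance_seconds = 5
for sample_rate in (16_000, 24_000, 44_100):
    steps = sample_rate * utterance_seconds  # one prediction per sample
    print(f"{sample_rate:>6} Hz: {steps:>9,} steps")

# The same 5-second clip is ~2.76x more steps at 44.1 kHz than at 16 kHz.
ratio = 44_100 / 16_000
```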
Cool! Does text-to-speech require AI, or is there any active work in non-AI methods? Which bits are the AI bits? Do "deep" methods substantially improve over whatever classical methods we might have had?
1. Does text-to-speech require AI?

This one is a bit tricky to answer, since it requires defining "AI". AI as a moniker has been used to describe deep neural networks, search algorithms, expert systems and logic systems, particle filters, SVMs, etc. Almost all text-to-speech (TTS) systems are based on a combination of some of these machine-learning methods and digital signal processing (DSP), so I would say yes, text-to-speech is exactly the sort of thing "AI" describes, even if it doesn't resemble human-like thinking the way some other AI applications do.
2. Is there any active work in non-AI methods?
This one again is a bit tricky, for the same reason as before. However, there are a ton of pieces of the TTS pipeline that aren't AI in the current sense of the word (machine learning with neural networks or HMMs or other classifiers). For example, concatenative systems will traditionally take a large database of audio, divide it into chunks, and then recombine those chunks, using an overlap-add method such as OLA or PSOLA to blend them together. Choosing which chunks to combine to create the target speech becomes an AI / search problem: use some sort of acoustic model to predict the acoustic parameters of each frame, and then use a Viterbi search with target / join costs to find the optimal chunks. As another example of non-AI parts of the pipeline, text normalization tends to involve a lot of hand-written rules; for example, should you say "5/10/2019" as "May tenth, twenty nineteen", "the tenth of May, twenty nineteen", "the tenth of May, two thousand nineteen", or even "October fifth, twenty nineteen"? This decision and the conversion itself are often done with a ton of handwritten rules or grammars (see Kestrel, Google's text normalization system, and the open-source version, cleverly named Sparrowhawk). Anyways, the real answer is that TTS is always a combination of AI (machine learning) approaches with specialized text and audio processing algorithms.
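Here's a toy sketch of that unit-selection search, with made-up 1-D "acoustic features" standing in for real spectral parameters; the dynamic program is the same shape as what a real concatenative system runs, but the costs and candidates here are invented for illustration.

```python
# Toy unit-selection search: pick one candidate audio "chunk" per target
# frame, minimizing target cost (how well a chunk matches the desired
# acoustics) plus join cost (how smoothly adjacent chunks connect).

def unit_selection(targets, candidates, target_cost, join_cost):
    # best[i][j]: minimal cost of a path ending in candidate j at frame i
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = []
    for i in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[i]:
            costs = [best[-1][j] + join_cost(prev, c)
                     for j, prev in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k] + target_cost(targets[i], c))
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    # Trace back the optimal sequence of chunk indices.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return list(reversed(path))

# 1-D stand-in features: desired values, and candidate chunks per frame.
targets = [1.0, 2.0, 3.0]
candidates = [[0.9, 2.1], [1.8, 2.2], [2.9, 4.0]]
tc = lambda t, c: abs(t - c)        # target cost: match the acoustics
jc = lambda a, b: 0.5 * abs(a - b)  # join cost: penalize abrupt jumps
print(unit_selection(targets, candidates, tc, jc))
```

In a real system the features would be vectors of spectral parameters per frame, and the candidate lists would come from a large annotated audio database.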
3. Which bits are the AI bits?
The AI bits are the bits where you need to make some sort of heuristic decision, and you'd like to make it by imitating some target speech: for example, part-of-speech detection, predicting acoustic parameters (spectrograms, F0, etc.), and, more recently, waveform synthesis as well.
4. Do deep methods significantly improve on the state of the art?
Yes, though they also come at a cost. For example, deep sequence-to-sequence networks make great frame-level models: Tacotron and similar models can do things like emotional and stylized voice synthesis much better than what I've seen HMMs and other non-deep models do. As another example, WaveNet / WaveRNN / etc are some of the only parametric speech models (that is, generating the waveform from scratch instead of copying it from a database of audio) that can match the quality of concatenative models (copying audio from a database), but they can be quite difficult to deploy due to high computational cost. Overall, though, yeah, deep methods and all the improvements to neural networks in the past few years are having a profound impact on the quality and naturalness of TTS.
Thanks very much for your reply, super helpful!! Sorry if that was difficult to answer. I guess I'm interested in how far we've come from TTS engines like the LPC [1] engines we had in the 80s, or what you get from Festival [2]. Maybe there isn't as clear a separation between their methods and the modern Google-scale deep-learning approaches as I thought.
There are actually a few recent papers that show minor improvements from integrating LPC prediction into deep methods ([0], [1]). In my experience (some of which comes from reproducing these, some from my own experiments), this isn't actually too useful, and at most offers a minor modeling benefit.
The main difference between something like Festival and what we have now is the amount of domain-specific engineering. (This is generally the promise of deep learning -- replace hand-engineered features with simple-to-understand features and a deep model.) If you go and read the Festival manual, you'll find tons of domain-specific rules, heuristics, and subroutines; for example, there's a page on writing letter-to-sound rules as a grammar [2]. Nowadays, we may have a pipeline that resembles Festival's at a high level, but each step of the pipeline is learned as a deep model from data rather than being carefully hand-engineered by many people over the course of years. This yields much more fluid speech, as well as much, much faster iteration and experimentation times, leading to faster progress as well.
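To give a flavor of that hand-engineered style, here's a toy letter-to-sound rule matcher in the spirit of context-dependent rule grammars; these specific rules are invented for illustration and aren't taken from Festival.

```python
# Toy letter-to-sound rules: each rule is
# (left context, letters, right context, output phones), tried in order.
# The rules below are invented examples, not real Festival rules.

RULES = [
    ("", "c", "e", ["S"]),   # "c" before "e" is soft
    ("", "c", "i", ["S"]),   # "c" before "i" is soft
    ("", "c", "", ["K"]),    # default "c"
    ("", "a", "", ["AE"]),
    ("", "t", "", ["T"]),
]

def letter_to_sound(word):
    phones, i = [], 0
    while i < len(word):
        for left, letters, right, out in RULES:
            if (word.startswith(letters, i)
                    and word.endswith(left, 0, i)
                    and word.startswith(right, i + len(letters))):
                phones.extend(out)
                i += len(letters)
                break
        else:
            i += 1  # no rule matched; skip this letter
    return phones

print(letter_to_sound("cat"))   # default "c" rule fires -> K
print(letter_to_sound("cite"))  # soft "c" rule fires -> S
```

Covering a whole language this way takes hundreds of such rules written and debugged by hand, which is exactly the effort a learned grapheme-to-phoneme model replaces.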
Thanks for posting here! Do you see any chance of an open-source framework, like Mozilla's Tacotron, competing with something like Google's WaveNet in quality?
First of all, it's important to note that Tacotron and WaveNet are responsible for different parts of the speech synthesis pipeline, so the comparison here isn't quite accurate. Specifically, Tacotron takes a representation of the text (characters, phonemes, etc) and converts it into a frame-level acoustic representation (spectrograms, log mel spectrograms, etc, spaced every 5-25ms). WaveNet takes a frame-level representation of the audio (for example, the output of Tacotron, or phonemes with frame-level timing information) and converts it to a waveform.
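As a sketch of the interface between the two stages, here are the shapes involved under some typical (but purely illustrative) settings; none of these numbers is required by either model.

```python
# Illustrative shapes at the Tacotron -> WaveNet hand-off: an 80-band
# log-mel spectrogram with a 12.5 ms hop, synthesizing 22,050 Hz audio.

sample_rate = 22_050   # Hz
hop_ms = 12.5          # one spectrogram frame every 12.5 ms
n_mels = 80            # mel frequency bands per frame
hop_samples = int(sample_rate * hop_ms / 1000)  # samples per frame

seconds = 3.0          # a 3-second utterance
n_frames = int(seconds * 1000 / hop_ms)

# Stage 1 (Tacotron-style): text/phonemes -> (n_frames, n_mels) frames.
spectrogram_shape = (n_frames, n_mels)
# Stage 2 (WaveNet-style): frames -> waveform of n_frames * hop_samples.
waveform_len = n_frames * hop_samples

print(spectrogram_shape, waveform_len)
```

The key point is the compression ratio: the acoustic model only has to produce a few hundred frames where the vocoder produces tens of thousands of samples, which is why the two problems are modeled separately.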
Second, I don't see any reason why there shouldn't be an open-source Tacotron or WaveNet implementation that's as good as Google's model implementations. Implementing and training these models is expensive but not prohibitively so (nowadays, you could probably do it with $5,000 - $10,000, including experimentation costs).
That said, the quality of text-to-speech systems is only partially determined by the quality of these models -- much if not most of the work of building high-quality text-to-speech systems goes into things like high-quality data collection, good data annotations, good normalization and NLP tailored to the domain of the TTS system, multilanguage support, optimized inference implementations for server or mobile platforms, etc.