Interesting. This is not TTS as we're accustomed to it — they are replicating a specific person's voice with TTS. Listen to the ground-truth recordings at the bottom and then the synthesized versions above. "Fake News" is about to get a lot more compelling when you can make anyone say anything as long as you have some previous recordings of their voice.
> you can make anyone say anything as long as you have some previous recordings of their voice.
That's not what this is doing. They're simply resynthesizing exactly what the person said, in the same voice. It's essentially cheating because they can use the real person's inflection. Generating correct inflection is the hardest part of speech synthesis because doing it perfectly requires a complete understanding of the meaning of the text.
The top two are representative of what it sounds like when doing true text to speech. The middle five are just resynthesis of a clip saying the exact same thing. And even in that case, it doesn't always sound good. The fourth one is practically unintelligible. But it's interesting because it demonstrates an upper bound on the quality of the voice synthesis possible with their system given perfect inflection as input.
To clarify, this is cool work, the real-time aspect sounds great, and I'm sure it will lead to even more impressive results in the future. But I don't want people to think that all of the clips on this page represent their current text-to-speech quality.
Thank you for clarifying this! We tried fairly hard to make this clear, because as you say, the hard part is generating inflection and duration that sounds natural. There's still a ton of work left to do in this direction – we're clearly nowhere near being able to generate human-level speech.
Our work is meant to make TTS easier for deep learning researchers to work with, by describing a complete system that can be trained entirely from data, and by demonstrating that neural vocoder substitutes can actually be deployed to streaming production servers. Future work (both by us and hopefully other groups) will make further progress on inflection synthesis!
My "Fake News" comment aside, I think what y'all are doing could be transformational for many reasons. Imagine a scenario where a person loses a loved one, and similar technology is able to allow them to "have conversations" with the deceased as a form of healing and closure. Not to mention, this could add a personal touch to assistant bots that will make them a pleasure to use.
>The top two are representative of what it sounds like when doing true text to speech. The middle five are just resynthesis of a clip saying the exact same thing.
>> They're simply resynthesizing exactly what the person said, in the same voice. It's essentially cheating because they can use the real person's inflection.
Yes, but imagine being able to take the voice from one person and the inflection from another. If you want to fake someone saying something, you don't need to do pure TTS — a human performer can supply the inflection for the faked voice.
Based upon what little is posted there, I thought they were taking the original recording, then training the model on that recording against the text of the recording... reproducing the recording. I would think the next step is to train on enough audio and text to be able to produce entirely new outputs. In theory it should even be able to learn when/where/how to use inflection.
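To make the distinction concrete, here's a toy numpy sketch (my own illustration, not their model) of the pipeline described above: fit a text-to-audio-frame map on a (recording, transcript) pair. Resynthesis just reproduces the training clip; true TTS requires the learned map to generalize to text it has never seen. All names and the linear-model choice here are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
embed = rng.normal(size=(64, 8))  # fixed per-character embeddings (hash-like, illustrative)

def featurize(text):
    # crude per-character features; stands in for a real text encoder
    return np.stack([embed[ord(c) % 64] for c in text])

transcript = "hello world"
# stand-in "audio": one 4-dim frame per character (think spectrogram frames)
audio = rng.normal(size=(len(transcript), 4))

X = featurize(transcript)                      # (11, 8) text features
W, *_ = np.linalg.lstsq(X, audio, rcond=None)  # linear "model" fit to one clip

resynth = X @ W                 # resynthesis: reproduce the training clip
new_audio = featurize("drone") @ W  # unseen text: this is where true TTS must generalize

print(resynth.shape, new_audio.shape)
```

The point of the toy: fitting one clip gives you good-sounding resynthesis almost for free, but says nothing about whether `new_audio` (inflection and all) will sound right — that's the hard generalization problem the thread is discussing.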
The word "Jordan" is not being synthesized. The speaker was recorded saying "Jordan" beforehand for this insertion demo and they're trying to play it off as though it was synthesized on the fly. This is a scripted performance and Jordan is feigning surprise.
This was a phony demonstration of a nonexistent product. Reporters parroted the claims and none questioned what they witnessed. By staging this fake demo right on the heels of the genuine interest generated by Google WaveNet, Adobe took credit for a breakthrough they had no hand in and received endless free publicity. I suppose they're hoping they'll have a real product ready by whatever deadline they've set for themselves.
To be clear, I like Adobe and I think it's a cunning move on their part.