Voice Synthesis for in-the-Wild Speakers via a Phonological Loop (ytaigman.github.io)
65 points by itamarb on July 25, 2017 | 25 comments



I still think emphasis on a word or syllable is important here, as far more information is conveyed with inflection than you might realize.

Consider:

*I* am going to eat the ham sandwich = Me, no one else

I *am* going to eat the ham sandwich = Nothing can stop me

I am *going* to eat the ham sandwich = On my way; got distracted

I am going *to* eat the ham sandwich = In case you doubt my intent

I am going to *eat* the ham sandwich = I will not be juggling it

I am going to eat *the* ham sandwich = The ultimate ham sandwich will be mine

I am going to eat the *ham* sandwich = Not turkey, not roast beef

I am going to eat the ham *sandwich* = Between two slices of bread is what I do


This made me chuckle, and it's a great illustration of how much meaning is changed by different emphasis. However, I would read the *to* example like this:

I am going *to* eat the ham sandwich = The sandwich is the reason I am going (to the party, or wherever...)


This has always been a vexing problem in TTS: going from context-free to contextual. First the machine would need to actually model its own intentions, and then figure out how to convey them verbally. Those are difficult problems without solving hard AI. It's easier if you can pre-determine the world/responses, as most "conversational interfaces" currently do.
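When the responses are pre-determined, one workaround is to mark the prosody yourself: most commercial TTS engines accept SSML, and its <emphasis> element covers roughly what the ham sandwich list above illustrates (support varies by engine and voice). A minimal Go sketch that only builds the markup, nothing engine-specific:

    package main

    import "fmt"

    // emphasize wraps the idx-th word of the sentence in an SSML <emphasis>
    // element; the surrounding <speak> element is required by SSML.
    func emphasize(words []string, idx int) string {
        out := "<speak>"
        for i, w := range words {
            if i > 0 {
                out += " "
            }
            if i == idx {
                out += `<emphasis level="strong">` + w + `</emphasis>`
            } else {
                out += w
            }
        }
        return out + "</speak>"
    }

    func main() {
        words := []string{"I", "am", "going", "to", "eat", "the", "ham", "sandwich"}
        fmt.Println(emphasize(words, 0)) // "Me, no one else"
        fmt.Println(emphasize(words, 6)) // "Not turkey, not roast beef"
    }

Sending the resulting string to an SSML-capable endpoint is left to whichever engine you use.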


Similar in quality to Lyrebird

https://soundcloud.com/user-535691776/dialog

Google WaveNet sounds almost perfect in comparison:

https://deepmind.com/blog/wavenet-generative-model-raw-audio...


Some of the generated speech clips are unsettlingly robotic while WaveNet sounds passable, but it's the piano compositions that I found unnerving. I can't explain why, but randomly generated music sounds so hollow and cold.


Note that WaveNet was not trained "in the wild" (like, on celebs) but rather on a curated speech dataset


Methinks we should compose a musical Turing test...


Mix this with AI creating video from audio (http://spectrum.ieee.org/tech-talk/robotics/artificial-intel...) and you can make anyone say anything.


Coming soon, audio ads with your friends' voices.


Or your own :(


Isn't this scary?


Given how much people hate the sound of their own voices, I think that would backfire on the advertiser.


If they could make it sound like my voice sounds from inside my head, and not like the weirdo that other people apparently hear, it would be scary on several levels.


Subvocalisation. Your own voice speaking quietly in the background to something else. Ideal for consumer indoctrination.


The voice we hear in our own head and the one everyone else hears are starkly different.


The voice I 'hear' in my head and the one I actually hear when talking are starkly different.


So, ads spoken with your voice, as you would hear it from inside your head.


To me this is very exciting. I'm already working on my own home digital assistant modeled on NeNe Leakes from The Real Housewives to add personality to otherwise boring conversations with a robot. I've been looking at various style transfer techniques, and having something a bit more plug & play will help me focus on the more unique parts. I predict that we'll see more celebrity voices used as conversational interfaces become more common.

Part of the complexity is going from 'context-free phonemes' to actually modeling personality. The voice needs some way to embed emotion, ideally inferred contextually from the sentences themselves. NeNe is an interesting example as she adds so many non-verbal sounds to her dialog (bleeps and bloops and eye rolls that she translates into affected speech). That's part of what makes her NeNe, and a big part of the entertainment value. Pursuing that is what will bring style transfer to the next level... total personality emulation. I fantasize about basic animatronics that can move her head side to side, twirl, and literally give eye rolls.

If anyone wants to work on this with me, give me a ping @azinman on Twitter. I've currently been thinking about this as an open source project, but I'm still keeping my options open as I continue development. I've got a ton more ideas I'm integrating into my bleeding-edge smart home, far more than just personality emulation (including what I believe to be a breakthrough in passive context-sensing... the real key to making the smart home actually smart).


What language are you working in? I've been working in PowerShell out of convenience, but am looking to port my speech bot to Node.


I'm working in Go right now, but I'll probably end up with a mix of various things. What's nice about Go is that it's very portable, faster than Python, has nicer RAM/storage usage than Node (not needing a JIT and all), and I can cross-compile binaries and distribute them to Raspberry Pis or whatever.
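A minimal sketch of what that looks like, assuming espeak (or any other command-line synthesizer) is installed on the target Pi; build it from any machine with GOOS=linux GOARCH=arm GOARM=7 go build:

    package main

    import (
        "log"
        "os/exec"
    )

    // say shells out to a local command-line synthesizer; espeak here is
    // only a stand-in for whatever TTS binary lives on the device.
    func say(text string) error {
        return exec.Command("espeak", text).Run()
    }

    func main() {
        if err := say("Hello from the smart home"); err != nil {
            log.Fatal(err)
        }
    }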


There are too many examples of fraud you could commit with this to list here.

One example: not too long ago I still handled the more important banking tasks over a quick phone call (they couldn't be done entirely online).


Would certainly make the phishing scheme where "fake CEO sends real CFO an email request to send a bank wire" more successful. Send the email, follow up with a voice mail.


For some reason this page gives Firefox a fit, and that is with multiprocessing enabled...


Anyone else having trouble with the audio samples?


I'm waiting for the code samples :)

Thanks



