Voice Synthesis for in-the-Wild Speakers via a Phonological Loop (ytaigman.github.io)
65 points by itamarb on July 25, 2017 | 25 comments



I still think emphasis on a word or syllable is important here, as far more information is conveyed with inflection than you might realize.

Consider:

*I* am going to eat the ham sandwich = Me, no one else

I *am* going to eat the ham sandwich = Nothing can stop me

I am *going* to eat the ham sandwich = On my way; got distracted

I am going *to* eat the ham sandwich = In case you doubt my intent

I am going to *eat* the ham sandwich = I will not be juggling it

I am going to eat *the* ham sandwich = The ultimate ham sandwich will be mine

I am going to eat the *ham* sandwich = Not turkey, not roast beef

I am going to eat the ham *sandwich* = Between two slices of bread is what I do


This made me chuckle, and it's a great illustration of how much meaning is changed by different emphasis. However, I would read the *to* example like this:

I am going *to* eat the ham sandwich = The sandwich is the reason I am going (to the party, or wherever...)


This has always been a vexing problem in TTS: going from context-free to contextual. First the machine would need to actually model its own intentions, and then figure out how to convey them verbally. Those are difficult problems without solving hard AI. It's easier if you can pre-determine the world/responses, as most "conversational interfaces" currently do.
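When the responses are pre-determined, one workaround is to mark the prosody yourself: most commercial TTS engines accept SSML, and its <emphasis> element covers roughly what the ham sandwich list above illustrates (support varies by engine and voice). A minimal Go sketch that only builds the markup, nothing engine-specific:

    package main

    import "fmt"

    // emphasize wraps the idx-th word of the sentence in an SSML <emphasis>
    // element; the surrounding <speak> element is required by SSML.
    func emphasize(words []string, idx int) string {
        out := "<speak>"
        for i, w := range words {
            if i > 0 {
                out += " "
            }
            if i == idx {
                out += `<emphasis level="strong">` + w + `</emphasis>`
            } else {
                out += w
            }
        }
        return out + "</speak>"
    }

    func main() {
        words := []string{"I", "am", "going", "to", "eat", "the", "ham", "sandwich"}
        fmt.Println(emphasize(words, 0)) // "Me, no one else"
        fmt.Println(emphasize(words, 6)) // "Not turkey, not roast beef"
    }

Sending the resulting string to an SSML-capable endpoint is left to whichever engine you use.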


Similar in quality to Lyrebird

https://soundcloud.com/user-535691776/dialog

Google WaveNet sounds almost perfect in comparison:

https://deepmind.com/blog/wavenet-generative-model-raw-audio...


Some of the generated speech clips are unsettlingly robotic while WaveNet sounds passable, but it's the piano compositions that I found unnerving. I can't explain why, but randomly generated music sounds so hollow and cold.


Note that WaveNet was not trained "in the wild" (like, on celebs) but rather on a curated speech dataset


Methinks we should compose a musical Turing test...


Mix this with AI creating video from audio (http://spectrum.ieee.org/tech-talk/robotics/artificial-intel...) and you can make anyone say anything.


Coming soon, audio ads with your friends' voices.


Or your own :(


Isn't this scary?


Given how much people hate the sound of their own voices, I think that would backfire on the advertiser.


If they could make it sound like my voice sounds from inside my head, and not like the weirdo that other people apparently hear, it would be scary on several levels.


Subvocalisation. Your own voice speaking quietly in the background to something else. Ideal for consumer indoctrination.


The voice we hear in our own head and the one everyone else hears are starkly different.


The voice I 'hear' in my head and the one I actually hear when talking are starkly different.


So, ads spoken with your voice, as you would hear it from inside your head.


To me this is very exciting. I'm already working on my own home digital assistant modeled on NeNe Leakes from The Real Housewives to add personality to otherwise boring conversations with a robot. I've been looking at various style transfer techniques, and having something a bit more plug & play will help me focus on the more unique parts. I predict that we'll see more celebrity voices used as conversational interfaces become more common.

Part of the complexity is going from 'context-free phonemes' to actually modeling personality. The voice needs some way to embed emotion, ideally inferred contextually from the sentences themselves. NeNe is an interesting example as she adds so many non-verbal sounds to her dialog (bleeps and bloops and eye rolls that she translates into affected speech). That's part of what makes her NeNe, and a big part of the entertainment value. Pursuing that is what will bring style transfer to the next level... total personality emulation. I fantasize about basic animatronics that can move her head side to side, twirl, and literally give eye rolls.

If anyone wants to work on this with me, give me a ping @azinman on Twitter. I've currently been thinking about this as an open source project, but I'm still keeping my options open as I continue development. I've got a ton more ideas I'm integrating into my bleeding-edge smart home, far more than just personality emulation (including what I believe to be a breakthrough in passive context-sensing... the real key to making the smart home actually smart).


What language are you working in? I've been working in PowerShell out of convenience, but am looking to port my speech bot to Node.


I'm working in Go right now, but I'll probably end up with a mix of various things. What's nice about Go is that it's very portable, faster than Python, has nicer RAM/storage usage than Node (not needing a JIT and all), and I can cross-compile binaries and distribute them to Raspberry Pis or whatever.
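A minimal sketch of what that looks like, assuming espeak (or any other command-line synthesizer) is installed on the target Pi; build it from any machine with GOOS=linux GOARCH=arm GOARM=7 go build:

    package main

    import (
        "log"
        "os/exec"
    )

    // say shells out to a local command-line synthesizer; espeak here is
    // only a stand-in for whatever TTS binary lives on the device.
    func say(text string) error {
        return exec.Command("espeak", text).Run()
    }

    func main() {
        if err := say("Hello from the smart home"); err != nil {
            log.Fatal(err)
        }
    }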


There are too many examples of fraud you could commit with this to list here.

One example: not too long ago I still handled the more important banking tasks over a quick phone call (they couldn't be done entirely online).


Would certainly make the phishing scheme where "fake CEO sends real CFO an email request to send a bank wire" more successful. Send the email, follow up with a voice mail.


For some reason this page gives Firefox a fit, and that is with multiprocessing enabled...


Anyone else having trouble with the audio samples?


I'm waiting for the code samples :)

Thanks



