Deep Learning for Siri’s Voice (machinelearning.apple.com)
234 points by subset on Aug 24, 2017 | 90 comments



The iOS 11 Siri sounds like a real person talking; it's amazing. Does anyone know if there's an open-source TTS library available with comparable quality (or if anyone is working on one based on this paper)?

I would love to have my home speakers announce things in this voice.


She sounds younger to me, but very natural sounding.

Will be interesting to see how Siri on the HomePod works out.


Yes. And the way it answers some questions, it also seems to be going for a more casual, enthusiastic sort of vibe. It wasn't clear to me to what degree the changes apply outside of the female American voice.


She does sound younger. I think it's because the 's' is more... lispy. She sounds a bit more valley-speak-ish, although that could just be a result of sounding more natural of course.


I'd love to have my Instapaper articles read to me in that TTS voice.

Hopefully it gets ported to macOS's `say` CLI utility. I typically use that with `pbpaste | say` to read my articles.
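If you want a bit more control than the one-liner, here's a rough Python sketch wrapping the same built-in tools (`pbpaste` and `say`); the voice name and rate are just examples, and `say -v '?'` lists the voices actually installed:

    #!/usr/bin/env python3
    # Read the current clipboard aloud using macOS's built-in `pbpaste` and `say`.
    # The voice name and rate below are just examples, not anything Siri-specific.
    import subprocess

    def speak_clipboard(voice="Samantha", rate_wpm=200):
        # pbpaste prints the clipboard contents to stdout
        text = subprocess.run(["pbpaste"], capture_output=True, text=True).stdout
        if text.strip():
            # -v picks the voice, -r sets the speaking rate in words per minute
            subprocess.run(["say", "-v", voice, "-r", str(rate_wpm), text])

    if __name__ == "__main__":
        speak_clipboard()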


I would assume that the OS X TTS engine is the same as Siri's as it would be available in High Sierra.

Is the one in Sierra different from Siri in iOS 10?


> I would assume that the OS X TTS engine is the same as Siri's as it would be available in High Sierra.

Nope, macOS uses a different set of voices for TTS, which are named differently as well. They work offline and are nowhere near as good as Siri's.


I haven't needed to use Siri on my Mac, but presumably that's using the same stuff as iOS?


We're talking about the MacOS TTS engine accessible via the `say` utility and (I believe) accessibility tools, not Siri which is presumably a different software stack and which uses remote processing.


A research paper published by Apple? About Siri?! Unheard of! Last time I was at an NLP conference with Apple employees, they wouldn't say anything about how Siri's speech worked, despite being very inquisitive about everyone else's publications. Good to see some change.


It's probably safe to assume a lot of that was due to some/most of Siri being licensed from Nuance initially. I mean, who wants to talk about a new product, which most people think is brand new and entirely innovative, just to say "Oh yeah, we paid someone else to work with us to create it."

Not that there's anything wrong with that and it certainly seems like Apple has been investing in-house pretty heavily in recent years for Siri improvement.


I think it has more to do with the fact that they are finally starting to allow their researchers to publish. They were platinum sponsors at INTERSPEECH 2017 this week, and actually published a paper there. I'm pretty sure that was the first time _ever_ despite their recruiters showing up every year.


3 papers at Interspeech 2017, actually :)


The part licensed from Nuance is the speech recognition, not the voice.


Yes, the post I replied to was talking about NLP.

That said, in my limited experience, organizations doing NLP also have something to offer by way of TTS too.


You can read some more discussion on it here: https://news.ycombinator.com/item?id=14804018.

One of the ideas is that most ML researchers want to publish their work and Apple wasn't allowing it. Allowing ML researchers working at Apple to publish in this journal was the only way they were going to get more ML researchers to work for them.


Yeah, they do seem to be opening up a bit. They posted their first article on this ML blog a few weeks ago.


It's a quite unattractive proposition for ML researchers to work for a company without being able to publish.

I'm guessing Apple simply had to make this change in order to stay in the game.


There was a lot of internal pushing for publication, apparently. But Apple decided they don't want individual engineers to get credit for ML work that whole teams are working on, so they came up with their ML blog to post papers without individual credit. At least, that's how it was explained by some of the ML researchers to this year's summer intern class.


Why not credit the whole team then, by making them all authors?

Did you ever see the list of authors on publications of CERN? Or take this example: [1].

[1] https://www.nature.com/news/physics-paper-sets-record-with-m...


Right, I totally agree with you! That's just how it was explained a couple months ago. shrug


Also glad to see this. Still curious as to why they wouldn't post it as a research paper on arXiv -- what's the point in reinventing the wheel here? I suppose it's nice for publicity, but would be great if they also played nicely with the ecosystem.



This was presented at Interspeech '17 this morning. Maybe the paper is embargoed or something until a later date?


Yeah, Interspeech IS as traditional an ecosystem as it gets in this field.


My favorite part is that the runtime runs on-device. I moved back to Android, but one thing Apple consistently does that I like is not moving things to the internet as often as Google does. On Android, you get degraded TTS if the internet is shoddy.


It's two different philosophies. With Apple it's about providing sufficient value such that the consumer will pay a premium for the product. With Google it's about providing the minimum viable value such that the user will provide as much of their data as possible.


Apple also cares strongly about privacy, so there's a lot of stuff they refuse to do in the cloud.

Google also cares about privacy, but only in reverse. They don't want you to have any ;)


That makes sense. I always wondered why improvements to Siri's voice required updating the device. I figured it had to be running there, just getting the text from the web service and not the audio.


On iOS, by contrast, TTS quality depends on available disk space. If there's too little of it, iOS removes Siri's higher-quality voice files.


Their TV service on Google Fiber does this too. It sucks; it's a tire fire. It's laggy and wedges until you have to reboot the network and the TV box. It's a horrible experience. I really wish they would stop trying to shove everything into the cloud. Seriously Google, STOP. JUST STOP!


Google's essentially a cloud company, any devices are just remote terminals. I doubt they'll ever change that.


In an increasingly connected world, it makes sense. Two or three years ago, Google Docs and Chromebooks were a pain to use for most purposes.


Fundamentally, offline TTS/ASR/NLP is going to be degraded because you can't fit cloud-sized models onto a mobile device.

Could offline models be better? Definitely. But the only way to make them as good as cloud models is to make the cloud models worse.


I haven't been able to read the paper yet, and I know very little about this, but listening to the audio samples, it seems one of the most notable changes is the intonation as phrases change. Did anyone else catch something like that? I'm not sure I'm doing a good job of explaining it. If you listen to all the iOS 11 samples it'll stand out.

Anyway, it's the only way I can still identify this as a fake voice. The intonation always follows the same cadence (not sure if that's the word?). We really shouldn't have overused the word awesome before this kind of thing came along.

There's also a kind of dread, tbh: this kind of seamless TTS has the potential to change a lot of things. For a start, criminals are going to love this, and YouTube pranksters too. Eventually it will shake up the voice acting industry in a way that may not be healthy for voice actors, while at the same time letting projects with smaller budgets have incredible voice work (also dubbing).

What I think is really important, though, is that as we move away from the uncanny valley, our relationship with these voices changes: our brains don't have the capacity to listen to a voice this real and not imagine a person behind it, even as adults.

Ironically, at this moment I'm wearing an old Threadless sweatshirt that says "this was supposed to be the future", but nowadays I can honestly say we're getting there.


Regarding voice acting, I think there is something to be said about human expression/ad-lib. Sure, you could generate a natural-sounding computer voice, but in the context of the arts we still have a ways to go before a computer can go off script and add just the perfect amount of intonation on a certain word that turns a phrase into an iconic quote.

Similarly, we don’t see CGI motion capture replacing Andy Serkis any time soon.


I think this is less likely to hit major films or TV shows, but it will hit the audiobook and video game markets pretty hard.

I'm pretty excited about the video game side.


I think you're overstating things. On the one hand, a lot of applications where quality wasn't that critical switched over ages ago. And, on the other hand, any application that would have spent the money on voice acting is still going to pay for both the higher quality and for a sound that isn't the same as everyone else is using. (Note that Siri's new iOS voice is based on a new training set from a new person.)

I do think there are applications we just don't have today because TTS isn't good enough. I've had some ideas around Alexa apps related to content that would be TTSed. But the current Polly just isn't human enough. I don't think this is there yet either, but it's getting close.


The difference between the Siri voices from iOS 9 to 11 is startling. I can still hear some issues, especially at the ends of phrases, but it's extremely good.


iOS 11 sounds almost as good as the WaveNet demo. Considering it runs in real time, that's very impressive.


I hope someone makes a YouTube video going back to when Siri first launched to show just how much it's evolved.

Listening to those samples I remember how big an advancement iOS 10 felt, but it's nothing compared to 11.


This just made me realize that every time you see a strong AI in fiction, it still has a computer-sounding voice. If we ever develop strong AI, we will probably already have perfectly natural speech synthesis. And if not, the AI could develop it for us.

But I suppose an AI might choose to use a computer-sounding voice to remind us that it is a computer. Kind of like those inaccurate sound effects in movies - they have become so common that it seems more wrong to omit them. (TV Tropes calls this "The Coconut Effect".)


I recommend watching the scifi film "Her", it has a different take on this.


That was a great movie.

There is always the chance that as we get better at this stuff we'll start to find it creepy that it's so realistic (either due to the uncanny valley or because we've crossed the valley), and we'll start to prefer devices that act robotic even though we know we could make them indistinguishable.

I'm trying to think of another example. I know I've heard a good one with Roombas but I can't remember it.

Basically, we may try to avoid a Blade Runner situation where we're not sure whether we're talking to a real person, and prefer the 'computery' voices.


Well there is always Data, and how he was made to be less human after the researchers found Lore to be unsettling.


Excellent example. I'd forgotten about Lore.


Anyone else find themselves thinking about Data, and why he was portrayed the way he was?


The prosody and continuity of the speech are dramatically improved. This is hard to do and very impressive (especially given that it is being done on-device).

Personally, I'm less pleased with the actual new voice itself, although that is more a subjective judgment. After listening to many hundreds of voice talent auditions for Alexa, it's hard to step back from that level of pickiness.


As I indicated in another comment, the visual that the voice (together with other tweaks in some of Siri's responses) suggests to me is a perky twenty-something.

I actually tend to prefer some of the female British accents in several current TTS systems. (Amy is probably my favorite Polly voice.) Perhaps because I'm American, the robotic-ness doesn't seem quite as obvious or grating.


I also prefer the female British accents, but that's exactly what excites me and is so awesome about this. These aren't just samples being stitched together anymore. The "learning" being done here can later be applied to any of the voices in Apple's catalog. Once they get the synthesis approach out there, they'll more than likely update all the languages and intonations to match. I would imagine the biggest hurdle is that different languages and accents have different nuances. As with most things, they're starting with English and will then move everything over to all the other options, including the British accents. I don't think we're too far off from a future where you'll be able to pick the age, gender, and voice of your assistant, the same way character selection is done in most modern video games.


How'd you get to listen to many hundreds of voice talent auditions for Alexa?


I led the product management and design team.


Story time?


Kinda sad to see that the names of the authors are omitted, although you can infer some of them from the quote:

> For more details on the new Siri text-to-speech system, see our published paper “Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System”

[9] T. Capes, P. Coles, A. Conkie, L. Golipour, A. Hadjitarkhani, Q. Hu, N. Huddleston, M. Hunt, J. Li, M. Neeracher, K. Prahallad, T. Raitio, R. Rasipuram, G. Townsend, B. Williamson, D. Winarsky, Z. Wu, H. Zhang. Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System, Interspeech, 2017.

Why not just add the names by default?


Because then it wouldn't be an Apple™ iNovation™.


Let's be honest, these people's names aren't displayed prominently for precisely the same reason as early Atari game developers names weren't.


I'm not familiar with this example. Could you elaborate on the Atari thing, please?


Atari wanted people to want Atari games, not Frank Jones games. They considered their programmers replaceable cogs and refused to give them credit.

That's why the Easter egg in Adventure with the programmer's name exists. It was the only way to get his name out there.

The developers didn't like this, so several of them left to start their own company, Activision, which made some of the best-remembered games on the 2600.

Apple already 'compromised' by letting their researchers publish at all. Maybe names will be allowed in the future but it's kind of surprising we're even getting this.


Thanks for posting the explanation -- I thought that Atari's early policies were pretty widely known. Anyhow, here's a further reference for the interested: https://en.wikipedia.org/wiki/Adventure_(Atari_2600)#Easter_... .


Oh I see! I didn't know about that. Thanks for the explanation!

I totally agree that this was an unprecedented move by Apple, considering their past stance on such things. I'm hopeful for the future, though! They seem to have realized (at least a little bit) that community cooperation is valuable.


The reports from a few months ago were that they basically had to do this because no one was willing to work for them if they weren't able to publish, since their career would basically stall.


It might seem silly, but I'm looking forward to the first AI talk therapist. Most of the benefit of therapy is the talking, so it's not as crazy as it sounds.


> It might seem silly, but I'm looking forward to the first AI talk therapist. Most of the benefit of therapy is the talking, so it's not as crazy as it sounds.

Not crazy at all. At least some therapies provide benefits even with simple non-AI processes: "A meta-analysis of 15 studies, published in this month’s volume of Administration and Policy in Mental Health and Mental Health Services Research, found no significant difference in the treatment outcomes for patients who saw a therapist and those who followed a self-help book or online program."[0]

[0] https://qz.com/1057345/researchers-say-you-might-as-well-be-...


The one grey area, though, is whether the patient actually does the self-help program. Adherence is a big problem, meaning a “self-helper” needs to be more disciplined because they don’t have the same pressure/accountability they might have with an actual therapist.


At my little company, iCouch, we have experimented with such things, but to actually make it effective — that requires a good amount of capital — capital that is very difficult to raise. I would need to hire 3 full time people just for the AI project and potentially more.

The VC world is interested in “traction” and not novel tech which means we have to divert effort into growing customers for our mental health practice management system to get “traction” before we can spend any notable time building AI therapists. As much as VCs talk about “looking for innovation” they really aren’t. They are just looking at current growth/revenue. The days of building something amazing and monetizing later seem to be over for all except for founders with marquee names.

We could launch AI therapists within a year, but in the meantime, I have to pay my team. So we are forced to subsidize moonshot R&D with our existing sales — but that is hard to do since existing sales have to finance customer acquisition. Finding an additional $500k per year to make AI therapy viable is impossible for us.

We are in a catch-22. The first question out of nearly every investor’s mouth is “how many paid users do you have?”, not what technology we have or could develop that is truly disruptive. We could start preparing AI therapy tomorrow for a Summer 2018 launch if we could afford it. But if we diverted resources to that, we’d be out of business long before launch. Clinically effective AI therapy isn’t a weekend side project.


I've looked into this space as a second project, in collaboration with psychologists and psychiatrists at a teaching institution with a dedicated, well-funded institute for similar work.

The biggest challenge I saw was gaining the trust of sufficient therapists and patients and then an IRB. If you've solved those problems and can actually demonstrate data collection and a computational workflow with any evidence of results whatsoever, I suspect money would not be a problem.


ELIZA was already there.


Yeah, just 50 years ago: https://en.wikipedia.org/wiki/ELIZA


Now we only need to attach some TTS to it!


The Dr. Sbaitso program that shipped with Sound Blaster cards in the 90s covered that angle decently. :)



You can try a version of it on your Echo device through this Alexa skill:

https://www.amazon.com/Asimov-Eliza/dp/B0184NR4P8

Not sure what their stance on privacy is though.


The other day I was thinking about Uber Therapy, get a mini therapy session on the way to your destination!


There are ongoing efforts in this direction, e.g. this paper from the just concluded Interspeech 2017: http://bit.ly/2wBgLKC


Heard of Emacs's M-x doctor?


M-x psychoanalyze-pinhead


The good blog post and audio samples notwithstanding, it's annoying that they didn't put the paper on arXiv. As they themselves point out in the blog post, the learning architecture was introduced in 2014's "Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis", so it's not clear how much of this is good engineering vs. novel research.


The paper was more than likely embargoed until the talk they gave about it was over. They're introducing some new things that they probably didn't want to release details on before they publicly made a statement.



The big difference here is the application to unit selection synthesis as opposed to parametric synthesis.
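For anyone unfamiliar with the distinction: parametric synthesis generates audio directly from model outputs, while unit selection stitches together recorded speech units chosen to minimize a combined target cost and join cost (the deep learning here goes into scoring those costs). Below is a toy sketch of the search step in Python, not Apple's implementation; the cost functions are assumed to be supplied by whatever model you like:

    # Toy unit-selection sketch (not Apple's code): a cost model scores how well each
    # candidate unit matches its linguistic target and how smoothly adjacent units
    # join; a Viterbi-style search then picks the cheapest path through the lattice.
    def select_units(targets, candidates, target_cost, join_cost):
        # best[i][j] = (cost of the best path ending at candidates[i][j], backpointer)
        best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
        for i in range(1, len(targets)):
            row = []
            for u in candidates[i]:
                prev_cost, prev_idx = min(
                    (best[i - 1][k][0] + join_cost(candidates[i - 1][k], u), k)
                    for k in range(len(candidates[i - 1]))
                )
                row.append((prev_cost + target_cost(targets[i], u), prev_idx))
            best.append(row)
        # backtrack from the cheapest final state
        j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
        path = []
        for i in range(len(targets) - 1, -1, -1):
            path.append(candidates[i][j])
            j = best[i][j][1]
        return list(reversed(path))

    # e.g. with trivial numeric "units" and made-up costs:
    # select_units([1, 2], [[0.9, 2.0], [1.8, 2.2]],
    #              lambda t, u: (t - u) ** 2, lambda a, b: 0.1 * abs(a - b))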


The obvious question would be a head-to-head qualitative comparison vs. WaveNet. It seems they have advanced Siri vs. the prior Siri, but does this work advance the field?


In terms of being feasible to actually use in production? Yes. It runs in real time locally on a mobile device at 48 kHz/16-bit; WaveNet doesn't run in real time even on a desktop GPU at 16 kHz/8-bit.

The WaveNet method of predicting the output sample by sample yields great results, but at a very high computational cost.
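To put a rough number on the sample-by-sample bottleneck, here's a toy Python illustration -- the dummy model and its 0.1 ms per-pass cost are made-up placeholders, only the idea of one sequential forward pass per sample reflects how such models generate audio:

    import random
    import time

    class DummyModel:
        """Stand-in for an autoregressive sample-level model (not WaveNet itself)."""
        def forward(self, context):
            time.sleep(0.0001)            # pretend one forward pass costs 0.1 ms
            return random.random()

    def generate(model, seconds, sample_rate=16_000):
        samples, context = [], []
        # one full network evaluation per output sample, and each pass depends on
        # the previous sample, so the loop can't be parallelized across time
        for _ in range(int(seconds * sample_rate)):
            nxt = model.forward(context)
            context.append(nxt)
            samples.append(nxt)
        return samples

    # Even at an optimistic 0.1 ms per pass, 1 s of 16 kHz audio needs 16,000
    # sequential passes, roughly 1.6 s of wall time -- already slower than real time.
    generate(DummyModel(), seconds=1.0)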


There's no question the diction of iOS 11 is much improved. But I liked the voice & timbre of the old speaker better - it sounds more authoritative.


Yes, it's a shame they didn't hire her to do the iOS 11 voice.


Now if only it didn't feel like, when I ask Siri to do a task, there's a very small pool of pre-set options to choose from. It still feels rather restricted, but I'm excited they're really investing in it.


The new voice sounds a lot like Google's current TTS voice.


I don't like the higher pitch/sharper tone in iOS 11. I prefer the warmer, deeper tone in iOS 10; it feels more like having a mature, experienced assistant.


It's also interesting how they made the pitch higher for the new voice, like Google has had all along.


This is amazing, and it's also beautifully written and presented!


Siri's voice update and no longer letting apps demand always-on location access were two of my favorite changes in iOS 11!



