The iOS 11 Siri sounds like a real person talking; it's amazing. Does anyone know of an open-source TTS library with comparable quality (or whether anyone is working on one based on this paper)?
I would love to have my home speakers announce things in this voice.
Yes. And in the way it answers some questions, it also seems to be going for a more casual, enthusiastic sort of vibe. It wasn't clear to me to what degree the changes apply outside the female American voice.
She does sound younger. I think it's because the 's' is more... lispy. She sounds a bit more valley-speak-ish, although that could just be a result of sounding more natural of course.
We're talking about the macOS TTS engine accessible via the `say` utility and (I believe) the accessibility tools, not Siri, which is presumably a different software stack and which uses remote processing.
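If you want to play with that engine yourself, here's a minimal sketch of driving it from Python via `say`. It assumes macOS with `say` on the PATH; the voice name is just an example, and `say -v ?` lists what's actually installed:

```python
# Minimal sketch: driving the built-in macOS synthesizer that `say` exposes.
# Assumes macOS; the voice name is illustrative and depends on installed voices.
import subprocess

def speak(text: str, voice: str = "Samantha") -> None:
    """Speak `text` aloud through the system speakers."""
    subprocess.run(["say", "-v", voice, text], check=True)

def speak_to_file(text: str, path: str, voice: str = "Samantha") -> None:
    """Render `text` to an audio file instead of playing it."""
    subprocess.run(["say", "-v", voice, "-o", path, text], check=True)

if __name__ == "__main__":
    speak("The new Siri voice runs entirely on device.")
```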
A research paper published by Apple? About Siri?! Unheard of! Last time I was at an NLP conference with Apple employees, they wouldn't say anything about how Siri's speech worked, despite being very inquisitive about everyone else's publications. Good to see some change.
It's probably safe to assume a lot of that was due to some or most of Siri being licensed from Nuance initially. I mean, who wants to talk about a new product, which most people think is brand new and entirely innovative, just to say, "Oh yeah, we paid someone else to work with us to create it"?
Not that there's anything wrong with that and it certainly seems like Apple has been investing in-house pretty heavily in recent years for Siri improvement.
I think it has more to do with the fact that they are finally starting to allow their researchers to publish. They were platinum sponsors at INTERSPEECH 2017 this week, and actually published a paper there. I'm pretty sure that was the first time _ever_ despite their recruiters showing up every year.
One theory is that most ML researchers want to publish their work and Apple wasn't allowing it. Letting ML researchers at Apple publish in this journal was the only way they were going to get more ML researchers to work for them.
There was a lot of internal pushing for publication, apparently. But Apple decided they don't want individual engineers to get credit for ML work that whole teams are working on, so they came up with their ML blog to post papers without individual credit. At least, that's how it was explained by some of the ML researchers to this year's summer intern class.
Also glad to see this. Still curious as to why they wouldn't post it as a research paper on arXiv -- what's the point in reinventing the wheel here? I suppose it's nice for publicity, but it would be great if they also played nicely with the ecosystem.
My favorite part is that the runtime runs on-device. I moved back to Android, but one thing Apple consistently does that I like is that they don't move things to the internet as often as Google does. On Android, you get degraded TTS if your internet connection is shoddy.
It's two different philosophies. With Apple it's about providing sufficient value such that the consumer will pay a premium for the product. With Google it's about providing the minimum viable value such that the user will provide as much of their data as possible.
That makes sense. I always wondered why improvements to Siri's voice required updating the device. I figured it had to be running there, just getting the text from the web service and not the audio.
Their TV service does this too on Google Fiber. It sucks; it's a tire fire. It's laggy and wedges until you have to reboot the network and the TV box. It's a horrible experience. I really wish they would stop trying to shove everything into the cloud. Seriously Google, STOP. JUST STOP!
I haven't been able to read the paper yet, and I know very little about this, but listening to the audio samples, it seems that one of the most notable changes is the intonation as phrases change. Did anyone else catch something like that? I'm not sure I'm explaining it well, but if you listen to all the iOS 11 samples it'll stand out.
Anyway, it's the only remaining way I can identify this as a fake voice: the intonation always follows the same cadence (not sure if that's the right word). We really shouldn't have overused the word "awesome" before this kind of thing came along.
There's also a kind of dread, tbh. This kind of seamless TTS has the potential to change a lot of things. First of all, criminals are going to love this, and YouTube pranksters too. Eventually this will shake up the voice acting industry in a way that may not be healthy for voice actors, while at the same time allowing projects with smaller budgets to have incredible voice work (also dubbing).
What I think is really important, though, is that as we move away from the uncanny valley we change our relationship with those voices: our brains don't have the capacity to listen to a voice this real and not imagine it as a person, even as adults.
Ironically, at this moment I'm wearing an old Threadless sweatshirt that says "this was supposed to be the future", but nowadays I can honestly say we're getting there.
Regarding voice acting, I think there is something to be said for human expression and ad-libbing. Sure, you can generate a natural-sounding computer voice, but in the context of the arts we're still a ways from a computer going off script and adding just the perfect amount of intonation on a certain word to turn a phrase into an iconic quote.
Similarly, we don’t see CGI motion capture replacing Andy Serkis any time soon.
I think you're overstating things. On the one hand, a lot of applications where quality wasn't that critical switched over ages ago. And, on the other hand, any application that would have spent the money on voice acting is still going to pay both for the higher quality and for a sound that isn't the same as what everyone else is using. (Note that Siri's new iOS voice is based on a new training set from a new person.)
I do think there are applications that we just don't have today because TTS isn't good enough. I've had some ideas around Alexa apps built on content that would be TTS'd, but the current Polly just isn't human enough. I don't think this is there yet either, but it's getting close.
The difference between the Siri voices from iOS 9 to 11 is startling. I can still hear some issues, especially at the ends of phrases, but it's extremely good.
This just made me realize that every time you see a strong AI in fiction, it still has a computer-sounding voice. If we ever develop strong AI, we will probably already have perfectly natural speech synthesis. And if not, the AI could develop it for us.
But I suppose an AI might choose to use a computer-sounding voice to remind us that it is a computer. Kind of like those inaccurate sound effects in movies - they have become so common that it seems more wrong to omit them. (TV Tropes calls this "The Coconut Effect".)
There is always the chance that as we get better at this stuff we'll start to find it creepy that it's so realistic (either because of the uncanny valley or because we've crossed it), and we'll start to prefer devices that act robotic even though we know we could make them indistinguishable.
I'm trying to think of another example. I know I've heard a good one with Roombas but I can't remember it.
Basically, we may try to avoid a Blade Runner situation where we're not sure whether we're talking to a real person, and prefer the 'computery' voices.
The prosody and continuity of the speech is dramatically improved. This is hard to do and very impressive (especially given that it is being done on-device).
Personally, I'm less pleased with the actual new voice itself, although that is more a subjective judgment. After listening to many hundreds of voice talent auditions for Alexa, it's hard to step back from that level of pickiness.
As I indicated in another comment, the visual that the voice (together with other tweaks in some of Siri's responses) suggests to me is a perky twenty-something.
I generally prefer some of the female British accents in several current TTS systems. (Amy is probably my favorite Polly voice.) Perhaps because I'm an American, the robotic-ness doesn't seem quite as obvious or grating.
I also prefer the female British accents, but that's exactly what excites me and is so awesome about this. These aren't just samples being stitched together anymore. The "learning" being done here can later be applied to any of the voices in Apple's catalog. Once they have the synthesis pipeline worked out, they'll more than likely update all the languages and accents to match. I imagine the biggest hurdle is that different languages and accents have different nuances. As with most things, they're starting with English and will then move everything over to the other options, including the British accents. I don't think we're too far off from a future where you'll be able to pick the age, gender, and voice of your assistant the same way character selection is done in most modern video games.
Kinda sad to see that the names of the authors are omitted, although you can infer some of them from the quote:
> For more details on the new Siri text-to-speech system, see our published paper “Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System”
[9] T. Capes, P. Coles, A. Conkie, L. Golipour, A. Hadjitarkhani, Q. Hu, N. Huddleston, M. Hunt, J. Li, M. Neeracher, K. Prahallad, T. Raitio, R. Rasipuram, G. Townsend, B. Williamson, D. Winarsky, Z. Wu, H. Zhang. Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System, Interspeech, 2017.
Atari wanted people to want Atari games, not Frank Jones games. They considered their programmers replaceable cogs and refused to give them credit.
That's why the Easter egg in Adventure with the programmer's name exists. It was the only way to get his name out there.
What happened was the developers didn't like this and left to start their own company, Activision, which made some of the best-remembered games on the 2600.
Apple already 'compromised' by letting their researchers publish at all. Maybe names will be allowed in the future but it's kind of surprising we're even getting this.
Oh I see! I didn't know about that. Thanks for the explanation!
I totally agree that this was an unprecedented move by Apple, considering their past stance on such things. I'm hopeful for the future, though! They seem to have realized (at least a little bit) that community cooperation is valuable.
The reports from a few months ago were that they basically had to do this because no one was willing to work for them if they couldn't publish, since their careers would stall.
It might seem silly, but I'm looking forward to the first AI talk therapist. Most of the benefit of therapy is the talking, so it's not as crazy as it sounds.
> It might seem silly, but I'm looking forward to the first AI talk therapist. Most of the benefit of therapy is the talking, so it's not as crazy as it sounds.
Not crazy at all. At least some therapies provide benefits even with simple non-AI processes: "A meta-analysis of 15 studies, published in this month's volume of Administration and Policy in Mental Health and Mental Health Services Research, found no significant difference in the treatment outcomes for patients who saw a therapist and those who followed a self-help book or online program."[0]
The one grey area, though, is whether the patient actually does the self-help program. Adherence is a big problem, meaning a "self-helper" needs to be more disciplined because they don't have the same pressure/accountability they might have with an actual therapist.
At my little company, iCouch, we have experimented with such things, but to actually make it effective — that requires a good amount of capital — capital that is very difficult to raise. I would need to hire 3 full time people just for the AI project and potentially more.
The VC world is interested in “traction” and not novel tech which means we have to divert effort into growing customers for our mental health practice management system to get “traction” before we can spend any notable time building AI therapists. As much as VCs talk about “looking for innovation” they really aren’t. They are just looking at current growth/revenue. The days of building something amazing and monetizing later seem to be over for all except for founders with marquee names.
We could launch AI therapists within a year, but in the meantime, I have to pay my team. So we are forced to subsidize moonshot R&D with our existing sales — but that is hard to do since existing sales have to finance customer acquisition. Finding an additional $500k per year to make AI therapy viable is impossible for us.
We are in a catch-22. The first question out of nearly every investor's mouth is "how many paid users do you have?", not what technology we have or could develop that is truly disruptive. We could start building AI therapy tomorrow for a Summer 2018 launch if we could afford it, but if we diverted resources to that, we'd be out of business long before launch. Clinically effective AI therapy isn't a weekend side project.
I've looked into this space as a second project, in collaboration with psychologists and psychiatrists at a teaching institution with a dedicated, well-funded institute for similar work.
The biggest challenge I saw was gaining the trust of sufficient therapists and patients and then an IRB. If you've solved those problems and can actually demonstrate data collection and a computational workflow with any evidence of results whatsoever, I suspect money would not be a problem.
Good blog post and audio samples notwithstanding, it's annoying that they don't put the paper on arXiv. As they themselves point out in the blog post, the learning architecture was introduced in 2014's "Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis", so it's not clear how much of this is good engineering vs. novel research.
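For anyone unfamiliar with that 2014 architecture, the core idea is small enough to sketch: a mixture density network doesn't predict one acoustic feature vector per frame, it predicts the parameters of a Gaussian mixture over it, and training minimizes the mixture's negative log-likelihood. A toy numpy illustration (made-up shapes and weights, not the actual Siri model) looks roughly like this:

```python
# Toy sketch of the mixture density network (MDN) idea: predict Gaussian-mixture
# parameters per acoustic frame instead of a single point estimate.
# All shapes and weights below are illustrative assumptions only.
import numpy as np

K, D, H = 4, 40, 64        # mixture components, acoustic feature dim, hidden dim
rng = np.random.default_rng(0)

def mdn_params(h, W_pi, W_mu, W_logsig):
    """Map a hidden vector h to mixture weights, means, and std devs."""
    logits = h @ W_pi                              # (K,) unnormalized weights
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                                 # softmax -> mixture weights
    mu = (h @ W_mu).reshape(K, D)                  # (K, D) component means
    sigma = np.exp((h @ W_logsig).reshape(K, D))   # (K, D) positive std devs
    return pi, mu, sigma

def mdn_nll(t, pi, mu, sigma):
    """Negative log-likelihood of a target frame t under the mixture."""
    # per-component diagonal-Gaussian log density
    log_norm = -0.5 * np.sum(((t - mu) / sigma) ** 2
                             + 2.0 * np.log(sigma) + np.log(2.0 * np.pi), axis=1)
    log_mix = np.log(pi) + log_norm                # log pi_k + log N_k
    m = log_mix.max()                              # stable log-sum-exp
    return -(m + np.log(np.exp(log_mix - m).sum()))

# Stand-ins for the network's hidden state, output weights, and one target frame.
h = rng.standard_normal(H)
W_pi = rng.standard_normal((H, K))
W_mu = rng.standard_normal((H, K * D)) * 0.1
W_logsig = rng.standard_normal((H, K * D)) * 0.01
t = rng.standard_normal(D)

pi, mu, sigma = mdn_params(h, W_pi, W_mu, W_logsig)
print("NLL of one frame:", mdn_nll(t, pi, mu, sigma))
```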
The paper was more than likely embargoed until the talk they gave about it was over. They're introducing some new things that they probably didn't want to release details on before they publicly made a statement.
The obvious question would be a head-to-head qualitative comparison vs. WaveNet. It seems that they have advanced siri vs. siri prior, but does this work advance the field?
In terms of being feasible to actually use in production? Yes. It runs in real time locally on a mobile device at 48 kHz, 16-bit. WaveNet doesn't run in real time even on a desktop GPU at 16 kHz, 8-bit.
The WaveNet approach of predicting the output sample by sample yields great results, but at a very high computational cost.
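To make that cost concrete, here's a toy sketch of why strictly autoregressive, sample-by-sample generation is so slow: each output sample needs its own forward pass and depends on the previous one, so at 16 kHz you need 16,000 sequential model evaluations per second of audio. The "model" below is a stand-in dot product, not an actual WaveNet:

```python
# Toy illustration of the sequential bottleneck in sample-by-sample synthesis.
# The "forward pass" is a stand-in (a cheap dot product), not WaveNet itself;
# the point is that samples must be produced strictly one after another.
import numpy as np

SAMPLE_RATE = 16_000        # WaveNet's commonly cited operating point
RECEPTIVE_FIELD = 1_024     # illustrative context length, in samples

def fake_forward_pass(context, weights):
    """Stand-in for one network evaluation that emits the next sample."""
    return np.tanh(context @ weights)

def generate(seconds, weights):
    n = int(seconds * SAMPLE_RATE)
    audio = np.zeros(RECEPTIVE_FIELD + n)
    for i in range(n):                              # one forward pass per sample;
        context = audio[i:i + RECEPTIVE_FIELD]      # each depends on the previous one,
        audio[RECEPTIVE_FIELD + i] = fake_forward_pass(context, weights)
    return audio[RECEPTIVE_FIELD:]                  # so the loop can't be parallelized over time

weights = np.random.default_rng(0).standard_normal(RECEPTIVE_FIELD) * 0.01
out = generate(1.0, weights)    # 16,000 strictly sequential evaluations for 1 s of audio
print(len(out), "samples generated one at a time")
```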
Now if only it didn't feel like Siri has a very small pool of pre-set options to choose from whenever I ask it to do a task. It still feels rather restricted, but I'm excited they're really investing in it.
I don't like the higher pitch/sharper tone in iOS 11. I prefer the warmer, deeper tone in iOS 10; it feels like having a more mature/experienced assistant.