I'm most interested in its potential for audiobooks, especially of the vast array of old and less-common books that don't have human-made audiobook equivalents. I find myself constrained by the limits of audiobook choices, which tend toward best-seller lists or pop-sci. Current attempts to use text-to-speech to generate audiobooks result in something that's frankly unlistenable. If TTS could get to a "good enough" point for audiobooks, that would open up a huge range of less-common content.
It has so many voices available in the voice store, and I've had no trouble reading books with it. I'm constantly amazed by the quality of some of these voices when played at high speed... They take breaths, express emotion, everything. My only problem is they occasionally mispronounce brands, names & acronyms: it's quite amusing to hear "Sharr-E-poynt" instead of "SharePoint", "Amd" instead of "A.M.D", or an incorrect pronunciation of a character's name in a book when talking about that book... Can't really see how WaveNet would fix that, but I look forward to seeing these voices turn up in Voice Dream.
One of my co-workers told me in 2012 that he was doing exactly this. He used an ebook reader to download free ebooks from Gutenberg and then the IVONA text to speech engine for Android to listen to them on his drives. He had already finished a few classics like Treasure Island this way.
I'm sure things have improved significantly since 2012 so what you're looking for is probably easily done.
It also works reasonably well for academic papers as PDF. Equations get butchered, of course, but overall it's not bad (you can set PDF margins so that headers/footers are not read aloud on every page). I use it to proof-read my own writing; it helps you spot things that spelling/grammar checkers miss.
I have been doing this on Android for a few years using Ivona and FBReader with the FBReader TTS+ plugin. I got Ivona (beta) from the Play Store before they were bought by Amazon - it looks like the Ivona voices have since been pulled and are no longer available.
Google's "cloud-based" TTS (not the local one, which was terrible) on the Play Books app was pretty good even years ago.
However, it was unusable because it would stop as soon as the screen turned off. Other ebook readers' read-aloud features worked with the screen off. So thanks for nothing, Google!
This was kind of the target of the 2016 Blizzard Challenge (http://festvox.org/blizzard/blizzard2016.html), as the training data was children's books (clean audio, 'reading voice' prosody, etc).
Some of the voices that came out of that were incredible.
I'm not sure - the exact voices themselves generally aren't published, but the papers are detailed enough that you could recreate the system if you wanted.
Wow, hadn’t thought about that possibility. Too bad for professional narrators of Audible (and other) audio books. I am a happy customer of Audible, and have some favorite narrators. Sad to think that they will lose work.
I am the tech manager for a machine learning team and potentially eliminating jobs is the bad side of my field. That said, I think AI, in some sense of the term, will also help us be more efficient. I imagine everyone having effective assistants that help us in our work, help with communications and scheduling, etc.
Yeah, I was thinking about exactly this. I'm actually going to try to reach out to the team to see if they're going to be looking into this. Google Books on the rise! Hopefully!
Also music generation - the original post (September of last year) had some really interesting music created by training the system on Chopin or something.
I don't have time at the moment to search it out, but in a previous speech-synth thread there's an in-depth discussion of this. In short, people were pessimistic about WaveNet for audiobooks, because audiobooks are all about knowing the story and speaking accordingly.
The ancient Kindle Keyboard can do text-to-speech for most Amazon ebooks with zero hassle (I think authors can disable it). 'Unlistenable' is in the ear of the beholder! My only gripe personally is having a single "narrator" for all content.
I agree "unlistenable" is a personal judgement, but there are very rare cases when buying an audiobook, even with a terrible reader, is not a better value/cost ratio than the original Kindle TTS, it was convenient, but not great.
I am wondering what their baseline is. They call it "Current Best Non-WaveNet". Quite frankly, Apple's most recent deep learning-based speech synthesis sounds superior, but there aren't enough samples for a proper comparison: https://machinelearning.apple.com/2017/08/06/siri-voices.htm...
It could just be a matter of opinion, but I prefer both Google's unit-selection synthesis and their WaveNet synthesis. The prosody in Apple's latest method is still annoying, nowhere near as good as the Google models of 2015 and 2016, and not remotely comparable to the WaveNet models.
Apple's change in voice talent is an improvement though, and they may have more units than before, which is helpful. I believe their model also works offline, which is a huge plus (though I think Google's prior model works offline as well).
I think the voice for the samples in your link still has the problems they talk about in that article.
There are noticeable blips in the speech that sound unnatural, particularly when certain sound combinations are used.
The very first sample with "Bruce Frederick" is clearly off. The intonation and timing between the end of "Bruce" and the beginning of "Frederick" is... mechanical.
There's a similar problem in the OP's link with the non-WaveNet English voice 1 when it says "WaveNet".
Those issues are much less apparent in the WaveNet voices. Timing problems are less noticeable, intonation problems are less noticeable.
Frankly, the voices there sound VERY good, compared to anything I've heard.
That said, I completely agree that there aren't enough samples there to make any real judgement.
I think I read "commercial" somewhere in there. So it'd be "the best you can buy", though not necessarily "the best other competing companies use" (i.e. Apple).
Still, they picked one that makes theirs look vastly superior.
It's a huge improvement, but still nowhere near the state of the art for Japanese TTS which has always been ahead of English (since the phonetics are much simpler).
Mm, I see what you mean. It seems like voicetext.jp has more "correct" prosody and seems to reproduce vocal fry correctly; I'm noticing that's missing from the WaveNet sample now. There's still some work to be done around filtering with voicetext, though - I hear pretty glaring artifacts between the units, whereas WaveNet doesn't produce any such artifacts.
I like the sound of Risa at http://voicetext.jp better. Try it with the Japanese text Google used in their example. If someone knows of a better one, it would be great to find out.
This is what I noticed too. I don't really have an ear for Japanese, but it sounded as natural as when I've heard it spoken by others. Huge, huge improvement.
I've always wondered why companies don't just take all the closed-captioned TV streams and use them as training data for their voice models. Seems like it would create a much more natural-sounding voice model (at least as far as humans are accustomed to).
I think a better data source would be professionally produced audiobooks. Better, clearer voices, nearly perfect "transcription", enormous supply already recorded on a wide range of topics potentially available from one or two sources, etc.
Even if there is no background noise present, the quality is nowhere near that of professional studio recordings, and the difference would be very noticeable in the output.
Also, traditional systems need a lot of data from one speaker only; they can't take advantage of other speakers' recordings (although WaveNet can now).
And "TV natural" might not be the style of natural you want from a TTS system.
Strictly speaking, the text is never 'entirely in sync' because spoken words inherently blur together and are seamless; individual letters in the text do not start and end at precise intervals. This is one of the things that makes speech recognition so hard: letters, syllables, and words do not really exist as discrete things on the raw audio level. So this problem exists for any speech transcription dataset. To provide a loss function, then, you would use something like CTC: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=FD2... Fortunately, NNs are good at handling noisy data, and in practice they work very well for speech recognition/transcription.
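For the curious, here's a rough sketch of how CTC is typically used (my own illustration in PyTorch, not the cited paper's code): the network emits a per-frame distribution over characters plus a "blank" symbol, and the loss marginalizes over every possible alignment of the transcript to those frames, so no frame-level timing labels are needed. All the dimensions below are placeholder toy values.

    import torch
    import torch.nn as nn

    # Toy dimensions: 50 audio frames, batch of 1, 27 characters + 1 blank,
    # and a 10-character target transcript.
    T, N, C, S = 50, 1, 28, 10

    log_probs = torch.randn(T, N, C).log_softmax(2)           # per-frame log-probabilities
    targets = torch.randint(1, C, (N, S), dtype=torch.long)   # transcript as class indices (0 = blank)
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), S, dtype=torch.long)

    ctc = nn.CTCLoss(blank=0)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)  # sums over all valid alignments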
Recordings are force-aligned to the transcriptions anyway (using essentially a speech recognition system) to obtain phone-level alignments. You don't need explicit timing information beforehand.
I still don't understand the difference between Google Assistant and Google Now (or whatever), other than when I accidentally launch Assistant instead of Now, my commands are never understood.
I don't understand the difference either other than Assistant seems to be replacing Now. The issue, I'm finding, is that Google Assistant requires that you turn on the recording of usage data and history to your account, such as Web & App activity, Location History, Voice & Audio activity. All recorded and tied to your account. I don't believe this was the case with Google Now.
Google, to their credit, normally allows you to erase all this data and turn recording off, but with Google Assistant, they require it all to be on and recording. I avoid using it because of this restriction.
IIRC one could use Google Now with voice recordings off and the extra-spying setting of "web and app activity" off (though web and app activity had to be on for many features). All this must now be on for the assistant, as you say.
I did try out Google Now for a time due to having an Android Wear watch, and found that as time went on more functionality required turning on these settings. One particular oddity for a time was that "OK Google, navigate home" would produce only a complaint about web and app activity being off, but "OK Google, navigate to $HOME_STREET, $HOME_TOWN" was fine.
To be honest, it still obviously sounds machine-generated. I guess it's a slight improvement, but the examples shown do not include any challenging words or phrases. We've been able to generate adequate-sounding speech for simple phrases for quite some time. I bet this was really fun to work on, however.
If you ever listen to people try to record sound that's clear and precise, they actually sound fairly robotic. See this Google 20% Project where they explore Google Assistant's voice creation: https://youtu.be/qnGNfz7JiZ8?t=5m23s
WaveNet is probably modeling the source data very well. It sounds like they just need more data with emotion and inflection, rather than having source data that is optimized for monotonicity and precision.
If you work in radio or voiceover you learn really quickly that the voice is so much more complex than people give it credit for. Subtle changes in delivery, timing, inflection, syllabic emphasis, pauses etc... make a massive difference.
Anyone can "talk like a robot" but speaking naturally is way more dynamical than just making the sounds of the words transition smoothly.
I'm not sure if it's way easier or way harder than we're doing it now.
Perhaps I'm just less demanding but I was pretty blown away. Perhaps it sounds robotic but it's coming from the same voice recording. The main difference was how perfectly the words blend together. Now I'm just waiting to get Stephen Fry samples instead.
I guess the hope is that even if it doesn't sound much better, this will open new doors, such as being able to iterate faster on new languages, new voices and new speaking styles (singing, whispering, etc).
EDIT: A big example of that is the Japanese example at the bottom. English has had far more effort put into the old model, so it's already pretty good. But the difference between the old and new Japanese voice is really striking, and they were most likely able to make the new one much faster.
I thought it was just me. I remember the first WaveNet demo sounded significantly more natural than the non-WaveNet stuff, but now they sound almost the same. The main difference is that with WaveNet you don't hear that robotic tone at the end of a phrase that's typical of old TTS technologies. But the way the phrase is spoken still sounds like a machine said it, rather than a human.
I wonder what compromises they made to improve the performance by 1,000x. There have to be some.
Oh man, I can already see the court cases of a certain robot's voice sounding a little too similar to a deceased human's that it should make royalty payments.
(choose English(USA) - "Little Creature". Borked on Firefox but works in Chrome)
"Little Creature" my ass.
Apparently, names are IP, voices not so much. But we can't have nice things because like you said - eventually, someone's going to figure out how to file an effective lawsuit.
I don't see it as being very different than using someone's face to promote something.
For example, should a lifelike digital model (or even a still image) of an actor's face still generate royalties for the actor's family after their death? Both cases are using a unique attribute of the person to promote something.
I sure hope so, although I am extremely angered that they don’t release the model.
Google is only having any success with this because they are getting training data from the public for free; they should also return the model to the public for free (at least for noncommercial use).
The amount of tweaking and finessing that goes on under the hood for specific scenarios can change outcomes quite a lot.
And from a product perspective, it can be taken quite far - for example, most of the 'common' things Siri says are not synthesized; they are literally recordings of the voice-over artist. The more arcane stuff is synthesized.
It's always comparing apples to oranges to bananas unless you really know what they're doing, even then it's hard.
I would love something like this for computer notifications, or any sort of automated system notification. For a concrete example, my RC controller has the ability to do voice prompts (e.g. "landing gear down"), and it would be great if I could get a high-quality voice to speak all this.
Hell, I'd settle for an API where I could send text and get high-quality voice back. Maybe I can somehow hack the Google Assistant app to do it...
Mainly because that price is over the "can be bothered" threshold. I'm not going to find someone on fiverr and contract them to record a few words so I can have aural notifications for finished downloads, for example, but I might write a simple script to take the text and store an MP3 with the audio if it tacks a cent on my AWS bill.
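For reference, here's roughly the shape of that script using AWS Polly (my own sketch; the text, voice name, and filename are just placeholders):

    import boto3

    # Send text to Polly and store the returned audio as an MP3 notification sound.
    polly = boto3.client("polly")

    response = polly.synthesize_speech(
        Text="Download finished.",
        OutputFormat="mp3",
        VoiceId="Joanna",
    )

    # The audio comes back as a streaming body; write it out for the notifier to play.
    with open("notification.mp3", "wb") as f:
        f.write(response["AudioStream"].read())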
While the 100x speedup sounds impressive, a raw speed number without details regarding hardware is pretty meaningless. I'm guessing they got the 20x realtime speed from running it on their new TPU hardware, which they say can do 180 Tflops. That means you would need 9 Tflops of computing power to run this in realtime -- still pretty far away from running on a phone, or PC for that matter.
He also missed the speedup by a factor of 10; it's 1,000x faster now :) I doubt it's just more hardware, they must have optimized the networks seriously.
It runs on the phone AFAIK, while speech recognition is backed by stuff Google runs in the cloud.
Edit: I might be wrong, at the end of a paragraph they say it runs on Google’s TPU cloud infrastructure, though it isn't clear to me whether they just use that for training.
Edit 2: I just tried it on my phone. At least stuff like asking it to "Turn on WiFi" works without an internet connection, and yields a TTS response.
> I just tried it on my phone. At least stuff like asking it to "Turn on WiFi" works without an internet connection, and yields a TTS response.
But this is the status quo. You would not expect Google to disable offline TTS just for slightly improved quality. The real question is: is it running WaveNet offline, or the previous version of its TTS engine?
Is the offline one improving also? Google Maps often falls back to it (much more often than necessary for some reason) and it sounds completely different and far worse.
Google TTS supports both. Maps prefers the cloud one so that you're not at the mercy of whatever TTS your manufacturer has installed (Samsung uses their own, right?), but will fall back to the local one if there's no connectivity. I can't remember if Google TTS implements that fallback or if it's custom to Maps.
I'm interested in using similar generative models to reduce artifacting in video streams. For example, highly compressed streams tend to show blocking artifacts on dark scenes, gradients, and static that could be smoothed in the decoder.
I haven't actually done much about it yet, but I'm interested.
Adding Laplacian matching to the optimization could be really helpful for this. I found this paper the other day and really enjoyed it: https://arxiv.org/abs/1707.01253
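Concretely, my reading of "Laplacian matching" is adding an edge-map term on top of the usual pixel loss, something like this sketch (my own illustration in PyTorch, not the paper's code):

    import torch
    import torch.nn.functional as F

    # 3x3 discrete Laplacian kernel, applied to grayscale frames of shape (N, 1, H, W).
    laplacian_kernel = torch.tensor([[0., 1., 0.],
                                     [1., -4., 1.],
                                     [0., 1., 0.]]).view(1, 1, 3, 3)

    def laplacian(x):
        return F.conv2d(x, laplacian_kernel, padding=1)

    def restoration_loss(pred, target, lam=0.1):
        # Pixel fidelity plus a weighted penalty on edge-map differences, which
        # pushes the decoder to smooth blocking without blurring real edges.
        return F.l1_loss(pred, target) + lam * F.l1_loss(laplacian(pred), laplacian(target))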
Maybe I'm just imagining it, but don't the WaveNet versions sound very formal and a little depressed? As if it were a reflection of our society. Listen to them again: very formal, lifeless, and depressed.
It has been known for some time that WaveNet can be greatly sped up, and exponentially so. See https://arxiv.org/abs/1611.09482 (Fast Wavenet) for an example.
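The core trick, as I understand it, is caching: each dilated layer keeps a small queue of its recent activations, so generating the next sample costs one step per layer instead of re-running the whole receptive field. A toy sketch of that idea (my own illustration with made-up weights, not the paper's code):

    import numpy as np

    class FastDilatedLayer:
        """One dilated causal 'layer' with a queue of cached past activations."""
        def __init__(self, dilation, w_cur=0.5, w_past=0.5):
            self.w_cur, self.w_past = w_cur, w_past
            self.queue = [0.0] * dilation          # activations from t - dilation .. t - 1

        def step(self, x):
            past = self.queue.pop(0)               # activation from exactly t - dilation
            self.queue.append(x)                   # cache the current input for later steps
            return np.tanh(self.w_past * past + self.w_cur * x)

    layers = [FastDilatedLayer(2 ** i) for i in range(8)]   # dilations 1, 2, 4, ..., 128
    sample = 0.0
    for _ in range(16000):                                  # one second at 16 kHz
        h = sample
        for layer in layers:
            h = layer.step(h)                               # O(num_layers) work per sample
        sample = h                                          # toy autoregressive feedback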
Guys, for those of you who would like to see how Microsoft Cognitive Services' TTS fares compared to Google's TTS: we launched a bot that gives voice summaries of web content on Messenger, Slack, Telegram & Twitter, with an in-line audio player on the first three. It's great for sharing audio summaries with our visually impaired friends.
Wonder -- in all seriousness -- if the BBC will do Attenborough. If you're British and you've grown up with his documentaries, all other nature voices simply sound wrong.