One of the domains where realistic, fully customizable speech synthesis could usher in a renaissance of innovation is video game development.
I’ve always been curious about how seemingly little work has gone into speech generation compared to graphics, physics, and other technologies.
Problems with voice actors:
• The costs, often unaffordable by solo developers and small teams.
• All the text and scripts have to be finalized and set in stone. You can’t change much, if anything, after you’ve finished recording everything. Being able to re-hire the same actors for future sequels is also not guaranteed.
• Complete lack of real-time customization, for example in MMORPGs and other online games. We’re stuck with a few set phrases or with using our own voice (not always desirable for everyone). Imagine if you could customize your voice as fully as you can customize your character’s appearance, have it realistically speak any text you type, or hear NPCs actually speak your name instead of generic placeholders.
• Players expect voice acting in all “major” titles. Plenty of games with great stories get skipped over if they don’t have voice acting.
> Being able to re-hire the same actors for future sequels is also not guaranteed.
Wow, it was not on my radar that AI could replace voice actors' jobs, but your comment has opened me up to the idea.
To give an example: in Spain (I'm Spanish), the much-beloved voice actor for Homer Simpson died after several seasons of The Simpsons. They had to replace him, of course, with another actor. Yet many people in my circle stopped watching The Simpsons at that point because they couldn't stand the new voice. Regardless of whether the new voice was "worse" or "better", it was certainly different, and we humans get weirded out by that.
In light of this, I could see animated films with fully AI-rendered voices. Incredible.
I seem to recall a company doing this for NPR personalities. It makes a lot of sense -- you have one person doing a boatload of speaking in front of a QA-controlled mic, and you generally have their transcripts. A pretty straightforward dataset for training. Sadly, I can't find the podcast that discussed this right now.
Yes, I just tried to record my voice, but the service didn't finish. Could be a temporary glitch.
The fake Donald Trump voice is uncannily close.
This definitely has repercussions for fake news and trust overall. We will simply not be able to trust what we see (computer-rendered images) or hear (rendered voices).
Related: I've been looking for a good solution to voice synthesis with style transfer. One obstacle many potential Fallout 4 mods will face is that the protagonist is fully voiced. Imagine a mod author being able to create new voice lines in the same tone+style as the original voice actor.
That does raise the question, though: would a voice actor have legal grounds to stop such a reproduction?
> I’ve always been curious about how so seemingly little work has gone into speech generation, compared to graphics and physics and other technologies.
The issue comes first from the lack of assets. Voice samples that you can use for speech synthesis are not easily available, apart from your own voice and maybe a few friends'. Then there's always been the issue that speech synthesis was not very good (we had working speech synthesis back in 1985 on the Amiga Workbench, but it was pretty much what you'd expect in terms of quality), and "not good enough" can break the experience.
On top of that, speech synthesis is not an easy problem; it's squarely in the realm of machine learning, and video game development hasn't touched much on machine learning so far. I think there are a ton of applications for machine learning in games (such as making less stupid, or more human-like, AI), and speech synthesis is definitely one of them.
Machine learning really doesn't have many uses in games outside of the development cycle (or non-critical gimmicks). If it's exposed to player input, or especially random input, it absolutely must have robust, stable, and controllable behaviour, at least on paper. Most of the time, getting any of those out of a neural network is either difficult or blatantly impossible.
Though speech synthesis specifically might be a good use case for some simpler ML, such as forming the final sound from processed phonemes (i.e. acting as a fuzzy lookup table).
It may not be applicable to all games, but there are many games where AI skill matters a lot to players, or where they just get annoyed or bored with stupid AI. Games with bad AI (which is most games) frustrate me a great deal.
AoE 2 let players write their own custom AIs. The developers took one of the best fan-made AI scripts and made it the default AI in the new release, and players loved it, because it provided a much harder, more human-like challenge and didn't cheat like the old AI did.
Obviously you shouldn't just train an NN to instant-headshot players with optimal strategy. You can add realistic handicaps to the AI, like noisy controls and human reaction times, so it isn't superhuman.
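A minimal sketch of what such handicaps might look like in code (all names and numbers here are made up for illustration, not from any real game engine):

```python
import random
from collections import deque

class HandicappedAim:
    """Hypothetical sketch: wrap a perfect aim-bot with human-like limits."""
    def __init__(self, reaction_frames=15, aim_noise_deg=3.0, seed=None):
        self.rng = random.Random(seed)
        # A queue of pending observations simulates reaction delay
        # (e.g. 15 frames at 60 fps is about 250 ms).
        self.delay = deque(maxlen=reaction_frames)
        self.aim_noise_deg = aim_noise_deg

    def tick(self, true_target_angle):
        # The AI only "sees" where the target was reaction_frames ago...
        self.delay.append(true_target_angle)
        if len(self.delay) < self.delay.maxlen:
            return None  # still reacting, can't shoot yet
        seen = self.delay[0]
        # ...and its hand shakes a little.
        return seen + self.rng.gauss(0.0, self.aim_noise_deg)

ai = HandicappedAim(reaction_frames=3, aim_noise_deg=1.0, seed=42)
shots = [ai.tick(angle) for angle in [10, 20, 30, 40, 50]]
# The first two ticks return None (reaction delay); later shots lag the target.
```

The point is that the handicaps are explicit, tunable parameters layered on top of whatever decision logic produces the "perfect" aim.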
Enemy AI is most certainly the perfectly wrong place to apply ML due to the wide range of possible inputs (scenarios) and being in the utterly wrong complexity class (I mean, it's ostensibly doable with an RNN, but definitely not worth it).
GOAP is still pretty much the end-all of game AI, but it can be pretty tricky to work with.
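For anyone unfamiliar, GOAP boils down to searching over actions that declare preconditions and effects on boolean world facts. A toy sketch (the actions and facts are invented examples, not from any shipped game):

```python
from collections import deque

# Each action: name -> (preconditions, effects), over sets of world facts.
ACTIONS = {
    "get_axe":   ({"axe_available"}, {"has_axe"}),
    "chop_wood": ({"has_axe"}, {"has_wood"}),
    "buy_wood":  ({"has_gold"}, {"has_wood"}),
    "make_fire": ({"has_wood"}, {"fire_lit"}),
}

def plan(start, goal):
    """Breadth-first search for the shortest action sequence reaching goal."""
    start = frozenset(start)
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:
            return steps
        for name, (pre, eff) in ACTIONS.items():
            if pre <= state:
                nxt = frozenset(state | eff)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None  # goal unreachable

print(plan({"axe_available"}, {"fire_lit"}))
# prints ['get_axe', 'chop_wood', 'make_fire']
```

Real implementations use A* with action costs rather than plain BFS, which is part of what makes GOAP tricky to tune, but the planning idea is the same.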
I feel like once there's something open-source/free that works serviceably for a big enough range of voices, you might see some mod projects jumping into it. Certainly the idea of being able to put together fully voiced new single-player content solo sounds appealing as hell to me.
https://grail.cs.washington.edu/projects/AudioToObama/ is a nice example. The tech is already there, we can expect the psyops people of any major nationstate and/or any political campaign with significant financial resources to be able to generate such fake content in 2018.
Huge kudos to the authors for being upfront about what doesn't work. I'm getting pretty tired of people not doing this consistently and only putting out the most attractive results, even when they must have observed many not-so-good ones.
> While our samples sound great, there are still some difficult problems to be tackled. For example, our system has difficulties pronouncing complex words (such as “decorum” and “merlot”), and in extreme cases it can even randomly generate strange noises.
Might it have been Lyrebird[0]?
They have (as far as I can tell) the best text-to-speech that you can actually use.
It's kinda annoying when Google makes these announcements about their advances in TTS only to reveal that it's not actually something you can make use of, and no, the dataset they used to train their model is not available.
I am constantly surprised by how robust and versatile Mel spectrograms (and Mel frequency cepstrum) are, despite the filterbanks and the transformations being relatively arbitrarily engineered and performed on evenly spaced frames.
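For reference, the mel mapping itself is just a fixed formula (one common variant), and the filterbank centers are spaced evenly in mel rather than in Hz. A small sketch of how 80 band centers end up distributed; the 0–8000 Hz range and exact spacing convention here are assumptions on my part:

```python
import math

# One common variant of the mel scale: mel(f) = 2595 * log10(1 + f / 700)
def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centers for an 80-band filterbank, evenly spaced in mel over 0-8000 Hz
# (the shape of the 80-dimensional frames the post describes).
lo, hi, n_mels = 0.0, 8000.0, 80
mels = [hz_to_mel(lo) + i * (hz_to_mel(hi) - hz_to_mel(lo)) / (n_mels + 1)
        for i in range(1, n_mels + 1)]
centers_hz = [mel_to_hz(m) for m in mels]
# The first bands sit only tens of Hz apart; the last are hundreds apart,
# mimicking the ear's finer resolution at low frequencies.
```

That front-loading of resolution at low frequencies is presumably a big part of why such an "arbitrary" engineered representation keeps working so well.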
These results seem fantastic nonetheless! Looking at WaveNet's trajectory, I anticipate they'll be able to optimize the system to real-time generation or better within the next year.
If you listen carefully, in some of the samples you can hear the human stressing a different word than the synthesized voice (e.g. which word in "that girl" or in "too busy for romance" gets the emphasis), but I couldn't tell which was the real recording based on that alone.
Haha, try again: the human is 1, 2, 2, 1 according to the filenames (I was fooled too).
I do think the difference would become obvious with a paragraph or more of speech, though. It's difficult to judge the correct intonation of these single sentences without context. Ultimately, correct intonation requires a complete understanding of meaning, which is still out of reach. An audiobook read by Tacotron 2 would still sound strange.
Depends on the audiobook. I think technical docs would be alright, which is mostly what I want this for. There are lots of technical docs I'd like to listen to while I work out.
I thought the first one was the clearest once you've read that the synthesized voice tries to guess from syntax which words should be stressed: sentences beginning with "that" often should stress it, because they're distinguishing that choice from some other, but probably not in this particular instance, where it's an offhand reference to some girl from some video.
I've been wondering lately: audiobooks seem like they'd be an amazing training resource for models like these, if you could get the script the reader was working from!
It's 1,000 hours of audiobook readings, segmented by sentence, with transcripts. All from Project Gutenberg, so it's maybe a little heavy on Victorian bodice-rippers and such, but certainly a great trove of training data.
That data is no good for this purpose, as it’s from a lot of different speakers and does not have speaker labels, i.e., you can’t tell which sentences were spoken by which speaker.
I wonder how hard it is to get this effect in other languages. "The Google Translate lady" is really helpful for my foreign-language studies, but I wonder how robotic she sounds to a native speaker.
Ars Technica mentioned this research in an approachable article [1] on the challenges of getting speech recognition and generation working for more languages. I found it fascinating.
I suppose an API to use this service will come to Google Cloud Platform sooner or later? Or is it still on the research side (not production-ready)?
It's still a research project and not a production system:
> We manually analyze the error modes of our system on the custom 100-sentence test set from Appendix E of [11]. Within the audio generated from those sentences, 0 contained repeated words, 6 contained mispronunciations, 1 contained skipped words, and 23 were subjectively decided to contain unnatural prosody, such as emphasis on the wrong syllables or words, or unnatural pitch. In one case, the longest sentence, end-point prediction failed.
Apparently not? I imagine it would have to be replicated by someone else. One also assumes that the volume of training data required (I don't actually know it) is not insignificant!
They say they split the audio into an "80-dimensional audio spectrogram with frames computed every 12.5 milliseconds". The picture in the post supports that.
For 24 hours, that would be about 7 million frames of 80 values each, roughly 550 million data points, or about 2 GB of raw data (assuming 4-byte floats).
So it's something that easily fits into RAM. One might even keep a copy of the whole dataset in GPU memory to avoid copying back and forth, even if matrix operations are done on smaller batches.
E.g. ImageNet is 50 GB of compressed data, and there are many much larger datasets in practical use.
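The arithmetic is quick to check; a sketch (the 12.5 ms frame period and 80 dimensions are from the post, the 24-hour corpus size and float32 storage are assumptions):

```python
# Back-of-envelope size of a spectrogram training corpus.
hours = 24
frame_period_ms = 12.5  # from the post: one frame every 12.5 ms
n_dims = 80             # 80-dimensional mel spectrogram frames
bytes_per_float = 4     # assuming float32

frames = hours * 3600 * 1000 / frame_period_ms   # about 6.9 million frames
values = frames * n_dims                         # about 553 million values
gigabytes = values * bytes_per_float / 1e9       # about 2.2 GB
print(f"{frames:.2e} frames, {values:.2e} values, {gigabytes:.2f} GB")
```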
Could the research team find a way to display a blog entry without requiring JavaScript? Perhaps some sort of markup language would be ideal. The current version pulls in all sorts of crap from 30 different domains, including at least 2 trackers. By default with uBlock, I got just the header.