Wowsers. This is a step change in quality compared to SOTA. I suspect that without evaluating samples as a correlated group, distinguishing between the generated samples and those recorded from a human will be little better than a coin toss.
And even when evaluating these samples as a group, I may be imagining the distinctions I am drawing from a relatively small selection that might be cherry-picked. Nevertheless:
The generated samples are more consistent as a group, and more even in quality, with few instances of emphasis that seem (however slightly) out of place.
The recorded human samples vary more between samples (by which I mean the sample as a whole may be emphasized with a bit of extra stress or a small raising or lowering of tone compared to the other samples), and within the sample there is a bit more emphasis on a word or two or slight variance in the length of pauses, mostly appropriate in context (as in, it is similar to what I, a non-professional[0], would have emphasized if I were being asked to record these).
In general for a non-dramatic voiceover you want to maintain consistency between passages (especially if they may be heard out of order) without completely flattening the in-passage variation, but tastes vary.
Conclusion: For many types of voice work, these generated samples are comparable in quality or slightly superior to recordings of an average professional. For semi-dramatic contexts (eg. audiobooks) the generated samples are firmly in the "more than good enough" zone, more or less comparable to a typical narrator who doesn't "act" as part of their reading.
[0] Decades ago in Los Angeles I tried my hand at voiceover and voice acting work, but gave up when it quickly became clear that being even slightly prone to stuffy noses, tonsillitis and sore throats was going to pose a major obstacle to being considered reliable unless I was willing to regularly use decongestants, expectorants, and the like.
> distinguishing between the generated samples and those recorded from a human will be little better than a coin toss.
The synthesis still doesn't know where to place emphasis. You may be unable to distinguish between POOR human voice work and TTS, but not GOOD human reading.
You can hear that the human readers place emphasis based upon an understanding of the meaning of the text that they're reading, and also based upon an understanding of the humans at the receiving end. It seeps through that they're human. The AI generated samples are good, but they're bland in comparison. The human emphasized words are typically not emphasized in the AI generated samples.
That might be true. But I think a good reader, such as a news anchor or a voice actor, will know where to put the emphasis and the pauses, in order to help the listener along. It's value-adding. I think most people who do it professionally will have this capacity.
It'd be nice to be able to tag specific words for emphasis in a sentence, where the tagging would be done via semantic NLU tasks and the voice alteration handled by the TTS model.
I think there's already research on "TTS after NLG" that does this, since an NLG system can export meta-information about emphasis in addition to the text (at least in the case of non-end-to-end NLG systems).
Whether that makes a big difference in practice, I don't know.
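For what it's worth, standard SSML already has vocabulary for exactly this kind of tagging. A minimal sketch follows; the sentence and the emphasis choices are only illustrative, and synthesize_ssml is a placeholder for whichever SSML-capable engine you plug in, not a real function:

    # Minimal sketch: the kind of markup an NLU stage could emit for the TTS model.
    # <emphasis> and <prosody> are standard W3C SSML elements; the sentence and the
    # choice of emphasis are only illustrative.
    ssml = """
    <speak version="1.0" xml:lang="en-US">
      I never said she <emphasis level="strong">stole</emphasis> my money,
      <prosody rate="90%" pitch="+10%">but somebody certainly did</prosody>.
    </speak>
    """

    # synthesize_ssml() stands in for whatever engine you feed this to; most cloud
    # engines and several open-source ones accept some SSML subset.
    # synthesize_ssml(ssml, out_path="emphasized.wav")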
> The synthesis still doesn't know where to place emphasis.
True. And yet, these samples aren't in a monotone! That's an enormous improvement.
> You may be unable to distinguish between POOR human voice work and TTS, but not GOOD human reading.
I think these are indistinguishable from AVERAGE human voice work. Keep in mind that the POOR voice work you may have in mind is probably still being done by someone who is, at least nominally, a paid professional.
It's interesting that TTS is getting better and better while consumer access to it is more and more restricted. A decade ago there were a half dozen totally separate TTS engines I could install on my phone and my Kindle came with its own that worked on any book.
This reminds me of the history of voice dictation. I remember using Dragon and other upstarts on a 486DX notebook. I could converse with the laptop, ask for instructions, receive answers, and verbally control all options, actions, and programs. At the same time, it could take dictation in different fields of study (training was required for geology or engineering vocabulary, for example), and it was remarkably accurate.
So you could turn on your notebook, start running programs, verbally type (dictate) a report, save, format, print, and email it, then turn on a music player or watch a movie.
All without touching a mouse or keyboard.
This was in a busy room, with a party by the way.
Everyone said, just wait until computer speeds are faster!
But all the software was bought out by Apple and Microsoft. It didn't get improved on for 20-25 years, and it's still not really functioning (except Siri, Google, etc.).
In case it's of interest, when I last explored this topic in terms of the Free/Open Source ecosystem I was very impressed with how well VOSK-API performed: https://github.com/alphacep/vosk-api
I agree so much that I've started learning ML to make a decent open-source, multi-language TTS that works on smartphones.
But really, the situation is pretty good, with a lot of code and datasets available as open source. Notably, if you're not constrained to smartphones and the like, you can run quite a number of modern models on your computer; see for instance https://github.com/coqui-ai/TTS/ (which itself contains many different models).
The work that needs to be done is """just""" to turn those models into something suitable for smartphones (which will most likely include re-training), and to plug them back into Android's TTS API.
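For the desktop case it's already only a few lines of Python with Coqui; roughly the following, going from memory of their README, so the exact model name and method names should be treated as approximate:

    # Rough sketch of local synthesis with Coqui TTS, from memory of the README;
    # treat the model name and method names as approximate.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/vits")   # one of the bundled English recipes
    tts.tts_to_file(text="Running a neural TTS model entirely on my own machine.",
                    file_path="local_tts.wav")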
Larynx in particular focuses on "faster than real-time" synthesis, while OpenTTS is an attempt to package & provide a common REST API for all Free/Open Source Text To Speech systems, so the FLOSS ecosystem can build on previous work supported by short-lived business interests rather than start from scratch every time.
AIUI the developer of the first two projects now works for Mycroft AI & is involved in the development of Mimic3 which seems very promising given how much of an impact on quality his solo work has had in just the past couple of years or so.
Just one more example of copyright stifling progress, innovation, and accessibility. The Authors Guild doesn't screw over the public by exploiting our insane copyright laws as often as the MPA or RIAA, but they've had their moments.
Or the publisher could license the audio rights from the content creators, so they get compensated for the increase in profits that the tech/publisher is making?
If it's an audiobook, then yes, it's sort of a performance and you could argue it needs a different licence. But with TTS you basically create the audio by your own means on your own device. And you still bought the book, so there is your profit.
If I read a book to my child, would you want me to buy audio license?
I know it’s a rhetorical example, but given how stingy the copyright regime is: yes, yes they would. You maybe could manage a 20% off credit for the child’s listening rights.
What's the difference between an audiobook and real-time text-to-audio translation? Isn't the outcome/experience the same?
The difference between you reading and the software is that one is 'mechanical'; whether you want to constrain someone doing that through the rights you grant is debatable.
Copyright enables you (gives the creator the 'freedom' to choose) to make such choices; it's what people choose to limit that is the problem, as they tend to be very stingy.
> What's the difference between an audiobook and real-time text-to-audio translation? Isn't the outcome/experience the same?
The "outcome" (text is read aloud) is the same if you read it aloud to yourself. Really the difference is that when a publisher releases an audiobook they hire someone (sometimes multiple someones) often an actor or the original author to sit in a recording studio and recite. They pay for things like studio time, sound engineering, editing, the narrators time, sometimes music or foley, etc. It's very much a different product.
If you have a book and you recite it, or if you pay someone to come into your home and read it to you, or you get a bunch of software and have your computer read it for you, that's your right and at your own expense in terms of time, money, and effort. Some kind of text to speech software is expected on pretty much every device. Including such features in devices or using those features (especially the accessibility features) of your own devices isn't copyright infringement, shouldn't open you up to demands for payments from publishers, and is in no way comparable to a professionally produced audiobook. Maybe one day the tech will advance to where a program can gather the context needed to speak with and convey the correct emotion for each line and will be capable of delivering a solid performance, but right now we're lucky if more than 2/3 of the words are even pronounced correctly and the inflection isn't bizarre enough to make you question what was being said or distract you from the material.
As a thought experiment, how about if some new fancy feature/AI could take your book/text and generate a movie from it. The technology/publishing platform only licensed the right to publish the text/book, but they now have a way that you can watch a movie (in the past they had to get actors, director, sometimes multiple, pay people, studios etc)...
As a follow-up, it would be cool if TTS used different voices for the narrator and characters in a book ... if someone patents that, you're welcome!
I'd guess that even a movie, made by AI using nothing but a book you paid for, would still be okay for personal use. Again, we already have the right to hire an entire theater troupe to come to our homes and perform whatever we want. As long as you weren't making your AI movie commercially available you'd probably be alright; that kind of thing might be transformative enough to be covered under fair use, although I'm guessing somebody would still object to you posting it online for free.
Honestly, if AI ever gets good enough at crafting films from literary source material that the AI movies have any chance of competing with a Hollywood production, the entertainment industry is screwed. I'm positive that by then whatever crazy stuff that AI is putting out will be everywhere, and playing with it will be way more fun than a movie theater ticket.
Different voices would be cool, but who is speaking which line can sometimes be ambiguous even for human readers. I'd be happy with just one voice that didn't sound like a robot or like a human voice spliced together from multiple sources.
One of the problems is if TTS becomes comparable to an actor's performance.
It's interesting that we are also now moving in the direction where you can copyright your voice.
yeah, actors are going to have to secure all kinds of "likeness rights" or something to prevent all kinds of media being made with them, but without them. There have already been cases like this:
https://www.theverge.com/2013/6/24/4458368/ellen-page-says-t...
I still expect it'll mean a lot of very amusing outbound voice mail messages.
Oh absolutely, the same way Amazon removed physical buttons so they could use them as leverage for their overpriced Oasis reader. My twelve year old Kindle 3 has TTS, page turn buttons, and a headphone jack, and it cost a hundred bucks.
Your average mobile processor doesn't have anywhere near enough processing power to run a state-of-the-art text-to-speech network in real time. Most text-to-speech on mobile hardware is streamed from the cloud.
I had a lot of success using FastSpeech2 + MB MelGAN via TensorFlowTTS: https://github.com/TensorSpeech/TensorFlowTTS. There are demos for iOS and Android which will allow you to run pretty convincing, modern TTS models with only a few hundred milliseconds of processing latency.
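For anyone curious what that looks like on the desktop side before moving to the mobile demos, the Python path is roughly the following. This is reconstructed from memory of the repo's README, so the model IDs and call signatures should be treated as approximate rather than exact:

    # Approximate sketch of FastSpeech2 + MB-MelGAN inference with TensorFlowTTS,
    # from memory of the repo's README; model IDs and signatures may differ.
    import tensorflow as tf
    import soundfile as sf
    from tensorflow_tts.inference import AutoProcessor, TFAutoModel

    processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
    fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
    mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

    input_ids = processor.text_to_sequence("A few hundred milliseconds of latency.")
    _, mel_after, _, _, _ = fastspeech2.inference(
        input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
        speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    )
    audio = mb_melgan.inference(mel_after)[0, :, 0]   # mel spectrogram -> waveform
    sf.write("fastspeech2_mb_melgan.wav", audio.numpy(), 22050)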
Not only is state of the art TTS much more demanding (and much much higher quality) than Dr. Sbaitso[0], but so are the not-quite-so-good TTS engines in both Android and iOS.
That said, having only skimmed the paper, I didn't notice a discussion of the compute requirements for usage (just training), but it did say it was a 28.7 million parameter model, so I reckon this could be used in real time on a phone.
[0] judging by the videos of Dr. Sbaitso on YouTube, it was only one step up from the intro to Impossible Mission on the Commodore 64.
OK, I get it, state-of-the-art TTS uses AI techniques and so eats processing power. But seeing that much older efforts, which ran on devices like old PCs, the Amiga, the original Macintosh, the Kindle, etc., used much less CPU for speech that you could (mostly) understand without problems, it may be worth exploring whether it's possible to write a better "dumb" (i.e. non-AI) speech synthesizer.
Better than the ones those systems already have? I assume they’ve already got some AI, because without AI, “minute” and “minute” get pronounced the same way because there’s no contextual clue to which instance is the unit of time and which is a fancy way of describing something as very small.
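To make the problem concrete, here's a toy, entirely hypothetical rule for that one homograph. Nothing below is any real engine's API, and the failure case in the last line is exactly why naive rules aren't enough:

    # Toy illustration only: a hand-rolled rule for the homograph "minute".
    # Real front ends use POS taggers and pronunciation lexica; this is not
    # any actual engine's API.
    def pronounce_minute(prev_word: str) -> str:
        """Crude guess at the pronunciation based only on the preceding word."""
        time_cues = {"a", "one", "per", "each", "every", "last", "first"}
        if prev_word.lower() in time_cues:
            return "MIN-it"    # the unit of time
        return "my-NOOT"       # "very small"

    print(pronounce_minute("every"))  # MIN-it  ("every minute counts")
    print(pronounce_minute("a"))      # MIN-it, but wrong for "a minute detail"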
Well, that's no surprise, because governments/corporations (in fact, The Govporation these days) strive to keep the status quo of being way beyond the capabilities of the people, hence the monopoly on weapon-grade tools - and I seriously think that AI-produced speech that is indistinguishable from the human one is a weapon-grade tool.
Your answer is in deep conspiracy territory. The reality is in my opinion more banal: money.
Microsoft and particularly Apple used to make cutting edge (at the time) TTS available as a marketing gimmick, but then did not develop this further because TTS was only relevant for screen readers and people with impaired vision weren't really their primary customers. Then TTS made advances and companies started selling high quality voices at high price points. Now companies want to make as much money with it as they can. Improving "free" TTS for ordinary customers is not really a top priority of Microsoft and Apple. Moreover, since network speeds have increased, any improved end-consumer TTS will send the text to a server and the audio back, so that a company can collect and analyze all the texts and make money with this spy data. That's how Google's free TTS server works.
Parse it as "seemingly independent governments of the world and international big business blended and interlocked together to the point of being indistinguishable" to ease the torture.
You vote the way you're told to vote by media that the state and capital dominate entirely (think about it: how would enough people even hear about an alternative idea or candidate that those outlets don't deem in their interest?). If you're American, you get to choose between two millionaires (or sometimes billionaires) with virtually identical economic politics, chosen in advance for you, who can safely ignore you once they get elected. "Democracy."
> Oh also it's land, labour, capital and entrepreneurship in classical economics.
First, no, you're wrong: there are only three factors of production in classical economics, which you would know if you had read any of those authors or at least bothered to check Wikipedia[1]. Second, the state is not a factor of production, so you got the wrong list anyway. "Entrepreneurship", which isn't an institution, was made up by neoclassical economics to justify profit in the 1930s, over 250 years later. It wasn't even popular until much later.
Nice pitch envelopes. But it's a bit uncanny, because natural human pitch envelopes encode and express what you understand and intend to convey about the meaning of the words you're saying, and what you want to emphasize about each individual word, emotionally. Like how you'll say a word you don't really mean sarcastically. It can figure out it's a question because the sentence ends in a question mark, and it raises the pitch at the end, but it can't figure out what the meaning or point of the question is, or which words to emphasize and stress to convey that meaning. (Not a criticism of this excellent work, just pointing out how hard a problem it is!)
For example, compare "rebuke and abash": in the NaturalSpeech, one goes down like she's sure and the other goes up like she's questioning, where in the recording, they are both more balanced and emphasized as equally important words in the sentence. And the pause after insolent in "insolent and daring" sounds uneven compared to the recording, which emphasizes the pair of words more equally and tightly.
Jiminy Glick interviews (and does an impression of) Jerry Seinfeld:
I had always thought a necessary step along the way to natural speech synthesis would be adding markup to the text. But I guess the use case being chased is reading factual information, where you don't use sarcasm or other emotional color. Trying to convert text to emotive speech requires markup IMO. Even the best actors will read their lines the wrong way and need correction by the director.
> Even the best actors will read their lines the wrong way and need correction by the director.
And sometimes even the director doesn't notice. Even when they also wrote the screenplay! My favourite example is from The Matrix, @1:01:40, where Cypher says:
> The image translators work FOR the construct program, but there's way too much information to decode the Matrix
Stress on the "for", which makes no sense; he sounds like he's revealing an employee/employer relationship. The stress should be on the start of "CONstruct", since he's saying the tech works for that but not for the other. Subtle, but it changes the whole sense of the line.
Wow, I've seen that movie dozens of times and I've always thought that was the stupidest line of meaningless technobabble because of how it was delivered. It made me question whether I knew what was meant by the phrase "construct program". It's clearly the name for the training/utility simulation used by the crew, but this line always had me questioning whether it also referred to the "front end" of the Matrix itself or something. Now it makes sense!
That's a good point. There are various XML markup formats for synthesizing text to speech that let you tag words for emphasis and pitch, but they're not granular or expressive enough to mark up individual syllables of words, and they're not useful for singing, for example.
It would get really messy if you had to put tags around individual letters, and letters don't even map directly to syllables, so you'd need to mark up a phonetic transcription. At that point you might as well use a binary file format, not XML.
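SSML does at least have a <phoneme> element for spelling out a word's pronunciation (IPA transcription), but as you say the pitch and emphasis controls still attach to whole elements rather than individual syllables. A tiny illustrative fragment, shown as a plain string:

    # Illustrative only: SSML's <phoneme> element gives a phonetic transcription
    # for a word, but prosody still applies to whole elements, not syllables.
    ssml = """
    <speak version="1.0" xml:lang="en-US">
      You say <phoneme alphabet="ipa" ph="t&#601;&#712;me&#618;to&#650;">tomato</phoneme>,
      I say <phoneme alphabet="ipa" ph="t&#601;&#712;m&#593;&#720;to&#650;">tomato</phoneme>.
    </speak>
    """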
But for that kind of stuff (singing), there are great tools like Vocaloid (which is a HUGE thing in Japan):
Here's a much simpler and cruder tool I made years ago (when XML was all the rage) for editing and "rotoscoping" speech pitch envelopes, called the "Phoneloper" -- not quite as polished and refined as Vocaloid, but it was sure fun to make and play with:
Phoneloper Demo
The Phoneloper is a toy/tool for creating and editing expressive speech "Phonelopes" that Don Hopkins developed for Will Wright's Stupid Fun Club in 2003, using Python + Tkinter + CMU's "Flite" open source speech synthesizer. It modified Flite so it could export and import the diphone/pitch/timing as XML "Phonelopes", so you could synthesize a sentence to get an initial stream of diphones and a pitch envelope. Then you could edit them by selecting and dragging them around, add and delete control points from the pitch and amplitude tracks to inflect the speech, stretch the diphones to change their duration, etc. It was not "fully automatic", but you could load an audio file and draw its spectrogram in the background of the pitch track, so you could stretch the diphones and "rotoscope" the pitch track to match it.
DECTalk (famously, the TTS used by Stephen Hawking) lets you mark up individual syllables, and it even had a singing feature circa 1985.[1] It is indeed messy though.
This is the example code to make it sing Happy Birthday:
How to use Eddie and Eedie to make free third-party long distance phone calls (it's OK, Bellcore had as much free long distance phone service as they wanted to give away):
>My mom refused to get touch-tone service, in the hopes of preventing me from becoming a phone phreak. But I had my touch-tone-enabled friends touch-tone me MCI codes and phone numbers I wanted to call over the phone, and recorded them on a cassette tape recorder, which I could then play back, with the cassette player's mic and speaker cable wired directly into the phone speaker and mic.
>Finally there was one long distance service that used speech recognition to dial numbers! It would repeat groups of 3 or 4 digits you spoke, and ask you to verify they were correct with yes or no. If you said no, it would speak each digit back and ask you to verify it: Was the first number 7? ...
>The most satisfying way I ever made a free phone call was at the expense of Bell Communications Research (who were up to their ears swimming in as much free phone service as they possibly could give away, so it didn't hurt anyone -- and it was actually with their explicitly spoken consent), and was due to in-band signaling of billing authorization:
>When you called (201) 644-2332, it would answer, say "Hello," pause long enough to let the operator ask "Will you accept a collect call from Richard Nixon?", then it would say "Yes operator, I will accept the charges." And that worked just fine for third party calls too!
>Peter Langston (working at Bellcore) created and wrote a classic 1985 Usenix paper about "Eedie & Eddie", whose phone number still rings a bell (in my head at least, since I called it so often): [...]
>(201) 644-2332 or Eedie & Eddie on the Wire: An Experiment in Music Generation. Peter S. Langston. Bell Communications Research, Morristown, New Jersey.
>ABSTRACT: At Bell Communications Research a set of programs running on loosely coupled Unix systems equipped with unusual peripherals forms a setting in which ideas about music may be "aired". This paper describes the hardware and software components of a short automated music concert that is available through the public switched telephone network. Three methods of algorithmic music generation are described.
Coincidentally enough I pretty much only know any of this because a few years ago I created a GUI for a client which enabled an assistive technology researcher to "draw" in the pitch contour required for a word/phrase (not unlike the project demonstrated in your video :) ) from which the SSML was then generated.
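Not the client's actual code, but the general idea can be sketched in a few lines: SSML's <prosody> element takes a contour attribute made of (time%, pitch-change) pairs, so drawn points map onto it almost directly. The helper below is hypothetical, just to show the shape of the mapping:

    # Hypothetical sketch (not the client's actual code): turning drawn contour
    # points into SSML's <prosody contour="..."> attribute, which takes pairs of
    # (percent position through the utterance, pitch change).
    def contour_to_ssml(word: str, points: list[tuple[int, int]]) -> str:
        """points: (percent position, pitch change in Hz) pairs from the GUI."""
        contour = " ".join(f"({t}%,{hz:+d}Hz)" for t, hz in points)
        return f'<prosody contour="{contour}">{word}</prosody>'

    print(contour_to_ssml("really", [(0, 0), (40, 25), (100, -10)]))
    # <prosody contour="(0%,+0Hz) (40%,+25Hz) (100%,-10Hz)">really</prosody>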
Given that a human is going to be listening to each sample, I wonder if a style-transfer style algorithm could be used to map the intent of a sentence to a simulated voice.
I tend to view most of these things through the perspective of what would help mod-makers for video games: VA is the one thing you basically can't do yourself, but you also tend to have a decent data set to pull from (and I suspect various open source voice sample sets would become pretty popular).
> I wonder if a style-transfer style algorithm could be used to map the intent of a sentence to a simulated voice.
There's definitely research/proprietary software that can enable a person speaking in desired manner to have their voice control the expression of the generated speech.
> I tend to view most of these things through the perspective of what would help mod-makers for video games
Yeah, I think there's some really cool potential in indie creatives having access to (even lower quality) voice simulation: for use in everything from the initial writing process (I find it quite interesting how engaging it is to hear one's words in what will be their final form, and even synthesis artifacts can prompt an emotion or thought to develop), to placeholder audio, and even final audio in some cases.
> (and I suspect various open source voice sample sets would become pretty popular).
That's definitely a powerful enabler for Free/Open Source speech systems. There's a list of current data sets for speech at the "Open Speech and Language Resources" site: https://openslr.org/resources.php
Encouraging people to provide their voice for Public Domain/Open Source use does come with some ethical aspects that I think people need to be made aware of so they can make informed decisions about it.
There's no way I'll find it but somewhere along the way there was a collection of samples in which one of these contemporary model-based speech synthesizers (possibly wavenet or tacotron) was forced to output data with no useful text (can't remember if it was just noise or literally zero input). The synthesizer just started creating weird breathy pops and purrs and gibberish utterances. Some of them sounded like panic breathing and it was one of the more jarring things I've heard in quite some time.
At the University of Maryland VAX Lab in the 80's, we had a DECTalk attached to the VAX over a serial line that we'd play around with, but I think the protocol must have used two-byte tokens, because sometimes it would get one byte out of sync and start going "BLLEEGH YAAUGH RAWGH BRAGHK SPROP BLOP BLOP GUKGUK BWAUGHK GYAAUGHT BLOBBLE SPLOP BLAP BLAP BEAUGH GUWK SPLAPPLE PLAP SPLORPLE BLAPPLE"! (*)
Just like it was channeling the Don Martin Sound Effects from random Mad Magazines.
I played with demos from various cloud providers. A lot of it comes down to differences of opinion, it seems, as I've had people on here exclaim that something was fantastic when it sounded like a robot to me.
My opinion is Microsoft's Azure TTS is the best and IBM's Watson TTS was a pretty close second. I remember finding Google's disappointing as well.
Industry research lab claims human parity on end-to-end text-to-speech and releases a web page with five samples as proof? Microsoft, you're a little late to the party - Google has been using this playbook for 5 years!
It's clear that their dataset contains a lot of newscasts. I wouldn't call this "natural" speech. But it certainly has an application for replacing newscasters/announcers.
As far as I can tell the choir instrument in that video is just playing back samples, which are admittedly very high quality samples, but still nothing particularly groundbreaking.
I’d love to see something that could actually do choir synthesis using a method like the one in the article.
While every sample they provide is suspiciously similar to the human version (indicating overtraining, either on the samples or on a single voice), where I would have expected a different but still human-quality voice from a fully functional system, this tech is coming, and soon. And when it does, voice acting will no longer prevent videogames from having complex stories, and we will find out if the industry is still capable of making them. Looking forward to it :)
One of the samples even has a breath intake at the same point as the recording. Not sure how they do that (didn’t read the paper) but I first thought it was the recording and compared the two to find the breathing wasn’t quite as natural as an actual person with lungs.
I don't suppose anyone could recommend a good text-to-speech for Linux?
Command line is fine but it would be much better if it could trivially take clipboard content for input. The last time I looked I found stuff that wasn't that great and was pretty inconvenient.
Very subtle differences, can be heard, but I have my headphones on. For example, in the last example, "borne" and "commission" seem to have some kind of artificial noise inside the "b" and "c" sounds. The "th" in "clothing" sounds artificial too. Still, it's extremely amazing, and probably in 90% of settings, people won't be able to find a difference at all. It even does breaths: "scientific certainty <breath> that".
This is pretty impressive work, except for this one:
"who had borne the Queen's commission, first as cornet, and then lieutenant, in the 10th Hussars"
Both the NaturalSpeech and the human said pretty much every word in that sentence completely incorrectly for the context of the words. It is the difference between "the car Seat" and "the car seat". "It's pronounced Ore-garh-no" to paraphrase the insufferable Hermione Granger.
Good quality overall, though it's difficult to tell from a small, hand picked set of examples (which appear to come from the training data, too — have the corresponding recordings been included in the voice build or held out?).
There is a rather obvious problem with the stress on "warehouses", and a more subtle problem with "warrants on them", where it's difficult to get the stress pattern just right.
I think they’re held out but all data comes from the same speaker and since they have like 10k samples times say 10 words per sample, practically every word will have been in the training data.
The Text-To-Speech service by https://vtts.xyz is the perfect choice for anyone who needs an instant human-sounding voiceover for their commercial or non-commercial projects. Got a product to sell online? Why not transform your boring text into a natural-sounding voiceover and impress your customers? What about adding a voiceover to your animation or instructional video? It will make it sound more professional and engaging! Our human-sounding voices add inflections that make them sound natural, and our custom text editor makes it easy to get exactly what you want. Both male & female voices are included, with over 30 different tones, including serious, joyful & normal.
I do, but I don’t like many text to speech voices at the moment because they don’t always enunciate in a way that makes sense. So I’ve been looking forward to the day when I can use custom human voices to read me news articles.
I'm in the same boat. I have tried GCP and AWS, and they sound too robotic. Do you convert articles you wrote or random articles? The reason I'm asking is that I'm thinking about a new project that works like Descript but is much easier to use.
TortoiseTTS might be the closest https://github.com/neonbjb/tortoise-tts
It's a few-shot multi-speaker model, so you need just 3-4 short clips to train new voices.
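From memory of the repo's README, cloning a voice looks roughly like this; the module paths, helper names, and preset values should all be treated as approximate rather than gospel:

    # Approximate sketch of voice cloning with tortoise-tts, from memory of the
    # README; module paths, helpers, and preset names may differ.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    # Drop three or four short clips into tortoise/voices/<myvoice>/ beforehand.
    tts = TextToSpeech()
    voice_samples, conditioning_latents = load_voice("myvoice")
    speech = tts.tts_with_preset(
        "Three or four short clips were enough to get this voice.",
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="fast",
    )
    torchaudio.save("cloned.wav", speech.squeeze(0).cpu(), 24000)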
Microsoft/Nuance has been doing great in this area. I am very impressed with TTS on Windows. It makes proofing documents that much easier. I do think there is a need for some type of markup (akin to sheet music) for supervised learning.
You could tell the difference in that the AI pronounced "Hussars" correctly whereas the human reader did not. Without our human error added in, the AI-trained version will certainly be the more educated one going forward.
I imagine that our concept of what a villain sounds like tends to be extremely personally biased but here's a couple of options [Advisory: Contains threatening language.]:
I created these samples in a relatively short time using the Free/Open Source (which I think is an important factor for indies) text-to-speech project Larynx & a narrative editor I finally released the other weekend:
Now, I would really like to link you directly to audio of the next two but considering it's currently in beta behind an (automated response) email address, I think that may not be appropriate, so, instead...
It's definitely a noticeable step up again in quality.
There's an alternate pair of voices if you move the "_" from one "name" attribute to the other in each "voice" element.
I intentionally didn't edit the text to remove some of the artifacts both to give a realistic impression of the current state & because sometimes they add interesting texture. :)
You can click play on any/all of the samples simultaneously, resulting in a neat sonic effect vaguely reminiscent of Steve Reich's famous "Come out." [1]
I wish for the "naturalspeech versus recording" comparisons they'd used a different voice for the synthesized speech. Otherwise the fact that we may not be able to tell them apart by ear (in a blindfold test) doesn't tell us much about how good it is as a speech synth engine with that evidence alone.
As a TTS daily user, sometimes I'm even fine with espeak quality for system messages. But one thing concerns me more than the beauty of the voice: the ability to process mixed-language text and abbreviations. And I don't see these problems addressed in this project. (
This is definitely human-level quality. In fact, the synthesized versions pronounce some words better than human. Kudos to MSFT! I think they've been longest in the game too...
I have a vague recollection that there was at least one (I think) Eastern European country that managed to get government funding to support the creation of local-language text-to-speech for assistive devices.
So, it doesn't seem like an impossible task but certainly a non-zero amount of work to collect & process appropriate audio data.
Hope you get your dream at some point in future. (As everyone deserves assistive devices in their own language.)
Most of the ex-Yugoslavian languages (the southwest Slavic group) are basically the same, much along the lines of US vs UK English. Slovenian and Macedonian are also quite understandable.
Hopefully advances in AI will allow for a more general approach which will accelerate the development of these technologies.
I have the same issue with other languages. One thing that I find fascinating is that while all the Western TTS and language-understanding frameworks allow only one language at a time, the Chinese ones happily do Chinese and English at the same time.
I recently tried out the open source android TTS engines and they don't seem all that great even though there's been years of development on them.
Can anyone that knows comment on what the complexity here is?
As a native Japanese and sort-of English speaker, I need to pause and flush the audio pipeline to switch between languages. "Konnichiwa" spoken in the middle of an English sentence and こんにちは in native Japanese are separate expressions and have to be pronounced differently. From my experience there likely isn't a single unified language model inside a human head, and if so it's not a surprise to me that one cannot be effortlessly made for computers.
Nothing BS about it. In ML, end-to-end means feeding raw data (e.g. text) to the model and getting raw data (e.g. waveform audio) out.
This is in contrast to approaches that involve pre- and postprocessing (e.g. sending pronunciation tokens to the model, or models returning FFT packets or TTS parameters instead of raw waveforms).
It's a common and well understood technical term in this context.
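Purely as an illustration of the distinction (no real library's API implied), the two shapes look like this:

    # Purely illustrative signatures; no real library's API implied.
    import numpy as np

    # Pipeline TTS: explicit intermediate representations between stages.
    def frontend(text: str) -> list[str]: ...                   # text -> phoneme tokens
    def acoustic_model(phonemes: list[str]) -> np.ndarray: ...  # phonemes -> mel spectrogram
    def vocoder(mel: np.ndarray) -> np.ndarray: ...             # mel -> waveform

    # End-to-end TTS: one model, raw text in, raw waveform out.
    def end_to_end_tts(text: str) -> np.ndarray: ...            # text -> waveform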