But I'm very curious: what are the emotional "parameters"? There are literally at least a thousand different ways of saying "I love you" (serious-romantic, throwaway to a buddy, reassuring a scared child, sarcastic, choking up, full of gratitude, irritated, self-questioning, dismissive, etc. ad infinitum). Anyone who's worked as an actor and done script analysis knows there are hundreds of variables that go into a line reading. Just three words, by themselves, can communicate roughly an entire paragraph's worth of meaning solely by the exact way they're said -- which is one of the things that makes acting, and directing actors, such a rewarding challenge.
Obviously it's far too complex to infer from text alone, so I'm curious how the team has simplified it. What are the emotional dimensions you can specify, and how were those dimensions chosen over others? Are they geared towards the kind of "everyday" expression of a normal conversation between friends, or towards the more "dramatic" or "high comedy" intensity that much of film and TV leans towards?
I've always imagined that this tech would need a markup language. Instead of a script that an actor needs to interpret, the script writer (or an editor, or a translator) would mark up the text.
There is Speech Synthesis Markup Language (SSML). Amazon Polly and Google Text-to-Speech support it, although the best neural-model-based voices only support a small subset.
So that's not markup along "emotional" lines, but rather along "technical" ones: attributes such as speed, pitch, volume, pauses between words, and so on.
Obviously coding those things in XML manually would be a nightmare. Now I find myself wondering 1) whether these technical parameters can be used to synthesize speech that sounds like a reasonable approximation of emotion (or whether they're insufficient because changes in resonance and timbre are crucial too), and 2) whether there are tools that can translate, say, 100 different basic emotional descriptions ("excitedly curious", "depressed but making an effort to show interest", etc.) into the appropriate technical parameters so it would be usable.
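To make that second idea concrete, here's a minimal sketch of what translating emotional descriptions into SSML prosody attributes might look like, using Amazon Polly through boto3. The emotion-to-prosody table and its values are purely illustrative guesses, not a vetted mapping, and as noted above, prosody markup only reaches the "technical" attributes; it can't touch resonance or timbre.

```python
# Minimal sketch (assumes boto3 and configured AWS credentials).
# EMOTION_TO_PROSODY is a hypothetical, hand-tuned table for illustration
# only; real emotional speech also depends on timbre and resonance,
# which SSML prosody cannot express. Note that the pitch attribute is
# only supported by Polly's standard (non-neural) voices.
import boto3

EMOTION_TO_PROSODY = {
    "excitedly curious": {"rate": "110%", "pitch": "+15%", "volume": "loud"},
    "depressed but making an effort to show interest": {
        "rate": "85%", "pitch": "-10%", "volume": "soft"},
}

def to_ssml(text: str, emotion: str) -> str:
    """Wrap text in an SSML prosody element derived from the emotion label."""
    p = EMOTION_TO_PROSODY[emotion]
    return (f'<speak><prosody rate="{p["rate"]}" pitch="{p["pitch"]}" '
            f'volume="{p["volume"]}">{text}</prosody></speak>')

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text=to_ssml("I love you", "excitedly curious"),
    TextType="ssml",
    VoiceId="Matthew",       # a standard Polly voice
    OutputFormat="mp3",
)
with open("line_reading.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```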
I hear the same expression with different "strength". There is no play, no motion. The expression should change after a response, but it doesn't. There is no dialogue. To me it sounds flat and boring; I'd rather not take part in a dialogue like that.
We can express emotions without words:
xxx: Distress
yyy: Support
xxx: Hope
It maps onto music, and we have a vocabulary to describe it. The track I'm listening to right now is sorrow and hopefulness for its entire length. That might be a good place to start: write the classification first.
The examples you gave feel like they live on the same scale, just at extreme values, so it's even harder.
I'd imagine it would work like autotune: enhancing human input.
Hey HN - Zeena Qureshi (Co-Founder and CEO at Sonantic) here.
Thanks for your thoughts and feedback thus far! I'd be happy to answer questions (within reason) about our latest cry demo / emotional TTS! Feel free to fire away on this thread.
Saw your YouTube videos a few days ago and was very impressed.
Clearly you can't give away too much on your "secret sauce" but is there any insight you could share on two questions:
1. Do the individual voice talents need to express the emotion types you use, or can you layer them on afterwards? (i.e., do they have to have recorded, say, "happy" to get happy outputs, or can that be added to neutral recordings retrospectively?)
2. What are the ballpark audio amounts you need per voice: 10 hours, 20 hours, or more?
Hey! Thanks so much. Yea, can't go into too much detail here, but I will say that more def isn't always better when it comes to the size of datasets. :) We aim for quality over quantity in order to achieve natural expressiveness from our actor recording sessions.
> more def isn't always better when it comes to the size of datasets
This sentiment definitely gives you lots of credibility, only those who have seriously endeavored in this space are able to acknowledge just how true this is.
It's quite antithetical to how some ML folks like to think.
Are you looking to make this accessible (read: affordable) for small time content creators / hobbyists? What sort of pricing model can we expect (one time license fee / subscription)?
Hey sorry for the delay on this. Our pricing model hasn't been published as of yet, but yes, we do aim to make the technology accessible to all levels of creators in the future.
From the first moments of our lives we express emotions with our voice. Not only that: adults understand them. I can express my own emotions without words, and I can change my mood by singing.
So the question is: what's in there? Is it formants? Is it universal? Can we map them like syllables?
And music touches the same emotions. Does it use the same mechanism?
Edit: found "Emotional speech synthesis: Applications, history and possible future" [1], looks like melody is part of emotion processing.
If mapping is possible, I'd love to see an application in dubbing: both translation plus TTS with mapped emotions, and evaluation/autotune of dubbing actors.
Thank you! We think our name is pretty great too :) Sonantic only creates AI voice models with the consent of the artist/voice owner. We take misuse and copyright infringement very seriously and therefore never train on data (voice recordings) where the original source is unaware that it will be repurposed. That said, we aim to include more recognisable voices on our platform in the future.
Is there really a copyright issue? Mimicry artists and tribute bands have duplicated original voices without a problem. Technically, you are not reproducing any recording owned by anyone; it's brand new synthesis! It would be interesting to see whether any court rules this is not the case. You can own your spoken speech, but can you really own a spectrogram of wave patterns?
"No offense but" - really? Again with that? Just be straight with what you mean.
I don't know either of the co-founders, but it seems like a logical, good idea to have a pair of co-founders where one is technical and the other is non-technical (maybe marketing, or sales, or very strong soft skills, etc).
Hence, I don't see the issue you (obviously) have with only one person having done the technical work. Is there any context you're not telling us?
The burden of getting a startup off the ground is centered around engineering, design, and actually getting your hands dirty and building the product.
As an engineer, I've been approached by people experienced in sales offering me equity to build a mutually agreed-upon product that we believed would make money. I would like to know if this model works. If so, how does it work? Is this common?
You can take offense if you like, but if you meet a guy at an entrepreneur conference who already built a prototype, you're not a co-founder.
You're making a mistake many engineers make in thinking that getting a startup off the ground is all about engineering. Frankly, engineering is often the easy part, and I say that as an engineer who co-founded a tech heavy successful venture backed startup. The hardest thing is choosing a compelling product and then selling it to paying customers. If you think business people undervalue techies, don't make the same mistake by under-valuing business people.
My daughter is dyslexic and would love to play things like Stardew Valley, Pokémon, or even Animal Crossing, but being text-only makes them such a slog for her.
The same goes for subtitles; she'd be perfectly fine with a robot voice for the actors if it sounded as real as this.
Thank you for your comment. One of the reasons we founded Sonantic was to improve accessibility so we are right there with you! We plan to do this by reducing the barriers (both financial and logistical) of voiced content for everyone from indie developers to big AAA studios. We've already begun to see progress on this through partnership with initial customers during our beta.
I have a print-related disability causing severe convergence insufficiency with my eyes, due to a rare neurological disease affecting my peripheral nervous system.
I generally use Kurzweil 3000 (http://KurzweilEdu.com), which is made by Kurzweil Educational Systems, as a screen reader. You should definitely consider partnering with them in particular, as it would be very strategic.
That's the holy grail right there, isn't it? :) We're definitely working towards runtime but still some work to be done there to account for additional complexities and balance trade-offs re: speed, quality, accuracy etc. of the rendered output.
OT, but if you are looking/curious about programming environments for her, please check out BlockStudio [0]. It's a text-free visual programming language for children (ages 8-13).
Touché! It is pretty bad, huh? Our new site is only a few days old (rolled out on Wednesday / around the same time as our demo) so we haven't got everything optimised just yet. Stay tuned for better cookies in the future!!!
Thanks for catching this! As mentioned, the site is newly rebuilt. We're still doing some link fixing, but if you'd like to read the Privacy Policy right now it can be found at: https://www.sonantic.io/pages/privacy-policy
Hi Zeena, I love this! I just filled out your form.
I was just mucking around with Nvidia's latest, called flowtron, and I know from that experience there's a significant amount of work between getting a tech demo out and launching a usable product, whether API-based, or with some visual workflow like your video shows.
One thing I think is worth considering on the commercialization front is whether the core offering is the workflow niceties around your engine, the engine-as-API, or both. I'm just a random person on the internet, so take these thoughts with a large grain of salt, but thinking about it, integrations with, say, Unity, Unreal Engine, video compositing tools, and blog posting tools all seem like interesting and viable market paths to prioritize. The underlying networks are going to keep improving for some time, so you're really trying to buy some long-term customers.
Some stuff that's obvious, but I can't resist:
I could, off the top of my head, imagine using this for massively reducing the cost to develop games, for script writers pulling comps together, for myself to create audio versions of my own writing, for better IoT applications inside the home... I'd really love to be able to play with this.
There still isn't a truly non-annoying virtual assistant voice; when the first Tacotron paper came out, I was hopeful I would see more prosody embedded in assistants by now, but the longer we live with Siri and Google, the more sensitive I think we are to their shortcomings. I have a preference for passive / ambient communication and updates, so I would place a really high value on something that could politely interrupt or say hello with information.
Impressive next step for text-to-speech. I wish there were some simple real demos. I also work on the same thing using DL, and hope to open-source the "emotional part" of it.
Soon we'll be able to create emotionally expressive YouTube videos with synthetic actors..
Thanks for your comments and nice to hear you're also working on TTS! We have a few more samples (without background music) further down on our homepage and plan to add a full dedicated subpage in time!
The comment: I noticed that your demo video also had "emotional" video layered on top of the dialogue. This could be considered manipulative; perhaps consider sharing a naked version so we could attempt to interpret the emotion based solely on the text to speech engine.
The question: You mention you met at EF. I was wondering if, beyond bringing you together, you found EF to be worth the cost of admission?
> The comment: I noticed that your demo video also had "emotional" video layered on top of the dialogue. This could be considered manipulative; perhaps consider sharing a naked version so we could attempt to interpret the emotion based solely on the text to speech engine.
The music is still there and has an obvious effect.
I thought the demo was impressive, but these things do seem like an effort to distract from (or more accurately bolster the effect of) the core technology.
Though maybe the right call since this is less a strict technical demo and more a way to drive interest/marketing.
The 'high levels of expressivity' comment was more of a flag to me; it's a meaningless phrase on its own, but it's offered as an obvious answer. It feels like a mysterious answer [0].
I recognize though this is a marketing video, the core tech demo is cool, and I'm probably being unfairly critical. Flags like that make me more skeptical than I would otherwise be by default.
Hi there - thanks for both your comment and question!
As others have said, this is first and foremost a marketing video aimed at attracting target customers. We've got additional clean samples (without background music) further down on our homepage and we plan on adding even more on their own subpage of the site in the future. We've also done a few technical demos at conferences over the past year and will continue to do so.
We did meet at EF and it was totally worth it! There is no cost of admission for EF, they actually pay you to complete the program! Granted the monetary funds they provide could be a heck of a lot less than what you are earning at a full time job, so everyone's opportunity cost is different. EF's biggest selling point is their world-class network of highly ambitious individuals, so if you're interested in founding a company (pre-team / pre-idea) I would absolutely recommend looking into it.
The prosody sounds nice. But two of the longer samples have a lot of vocal fry, and the third sounds like the voice has a stuffy nose and/or a slight lisp. I wonder whether those mannerisms were chosen to camouflage artifacts inherent in their current implementation.
Yep. Each of our TTS models is based on a real actor's voice that has its own nuanced characteristics. Some voices are naturally rougher / croakier while others are smooth. As in real life, our differences are what make us unique. Some voices will work better for certain character profiles / scenes - it's up to the user to decide.
Between this and Lyrebird, there seems to be a high number of cutting-edge TTS solutions being worked on in the private sector. Does anyone know why there hasn't been much advancement in the FOSS libraries?
I think a lot of the remaining gap is due to a lack of high-quality training data -- most of the open-source models are trained on public-domain audiobooks (e.g. LJ Speech).
However, good training data (large amounts of annotated recordings by professional voice actors) is expensive to create, and unlike code, there's not a tradition of people sharing it.
I think there are a couple factors here. This is an incredibly difficult problem space. The solutions going forward involve ML techniques, which require ML experts (currently being hired away by industry) and the resources to create models, which includes not only a large amount of computational resources but a big chunk of training data which needs to be sourced somehow.
I’m not really an expert. From what I understand, the “cutting-edge” stuff requires pushing past the point where we are splicing segments of speech together. Splicing segments together is hard enough.
There are a couple open-source efforts like Mozilla’s, but if you want something like Lyrebird, well, that technology isn’t even really productized commercially yet.
We have done Mean Opinion Score tests on Mozilla TTS [0] and gotten similar scores to real humans. The main problem for open sourcing higher quality models is licensing the dataset.
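For readers unfamiliar with the metric, a Mean Opinion Score is just the average of listener ratings on a 1-5 scale, usually reported with a confidence interval. A minimal sketch of the calculation, with made-up ratings purely for illustration:

```python
# Minimal MOS sketch: listeners rate each audio sample from 1 (bad) to
# 5 (excellent); the MOS is the mean, reported with a confidence
# interval. The ratings below are invented for illustration.
import math

ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]  # hypothetical listener scores

n = len(ratings)
mos = sum(ratings) / n
std = math.sqrt(sum((r - mos) ** 2 for r in ratings) / (n - 1))
ci95 = 1.96 * std / math.sqrt(n)          # normal approximation

print(f"MOS = {mos:.2f} +/- {ci95:.2f} (95% CI, n={n})")
```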
I'm convinced that the barrier to entry in this field, in terms of technological and financial investment, is too high for FOSS projects to compete with the commercial solutions.
We don't see FOSS pharmaceutical research, for instance, and I believe it's for the same reason. The amount of coordination needed and the difficulty of separating TTS projects into sub-parts could also be factors.
The problem with Common Voice is the same resource problem plaguing open source efforts in general - for TTS data, it's not just the size that matters, it's also the quality.
For something like Sonantic, you need clean recordings from professional actors in proper recording environments (not to mention the in-house expertise to then filter these down to curate the training/test datasets). That costs money. A million people with laptop microphones will just never get there.
Sure, open source TTS seems to be lagging behind recent commercial offerings, but pharmaceutical research is actually an excellent example of a field with massive FOSS software usage. TTS is also "purely virtual" which makes it significantly different, and I would say significantly more approachable to open source collaboration.
I'd recommend editing the video down to 43-60 seconds.
It would be nice to try it with actual text inputs right on the page; that this doesn't exist is a tiny flag.
Working with voice actors is a great choice, because there isn't any "pure" TTS that's good enough in the most general sense; having the actual voice actor as a working basis will help.
Perhaps small game houses can just use something off the shelf, while big houses can use a customized voice, and then not worry if they have to make tweaks or changes; they don't have to do a whole production.
Thanks for your feedback! We felt that this storyline / length was best in order to showcase the two different actors' artificial voices and build up to the actual cry.
As you've mentioned, we do work with real actors to create our TTS and take misuse of their (artificial) voices very seriously. Because they sound so lifelike, we've made the decision not to allow public access/personal use at this time.
Lastly, your assessment is spot on regarding standard vs custom voices. Lots of interest for both!
TTS=text-to-speech, so it's quite reasonable to showcase that chain instead of an edited video.
Not diminishing the quality of your product, just pointing out an obvious expectation of the audience that it's presented to. Perhaps, there could be a way to test-drive it directly, with limited choices or combinations of the input text.
Very cool demo, but the quality of the vocoding is not state of the art, and it's audibly artificial, which is probably why you covered it up with the obnoxiously loud music.
Next time be honest about what you have when presenting it; every human with functioning ears is attuned to the sound of speech. This sort of technology would be amazing for narrative video games even with the less than perfect vocoding.
I get the criticism (and "you need to find out for yourself" at ~0:44 sounds somewhat robotic), but given that it's aiming at the entertainment industry, where you will have background music most of the time, it also seems like a fair choice of representing real-world usage (where background music might always hide the imperfections a bit).
Is there any pay-to-use or open-source voice for Hebrew?
Amazon Polly's English voice, Matthew, is pretty nice, but they don't have Hebrew. Google doesn't have Hebrew either.
Bing has some attribution requirement that I haven't fully investigated.
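For what it's worth, a provider's language coverage can be checked programmatically. Here's a hedged sketch using boto3's describe_voices call for Amazon Polly; it assumes configured AWS credentials and, as the parent comment notes, currently shows no Hebrew voice:

```python
# Sketch: list the language codes Amazon Polly offers and check whether
# Hebrew ("he-*") is among them. Assumes boto3 and AWS credentials.
import boto3

polly = boto3.client("polly")
voices = polly.describe_voices()["Voices"]

languages = sorted({v["LanguageCode"] for v in voices})
print("Available language codes:", languages)
print("Hebrew available:", any(code.startswith("he") for code in languages))
```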
I wonder if attaching this to a modern-day ELIZA would improve its Turing test scores? Emotional load can reduce the requirement for semantic coherence.
As a non-native-speaker, I understood exactly four words from the monologue in the vid. Which might be on par for some movies, often having actors whisper and breathy-voice through the whole thing (ahem House of Cards cough). However, for actual TTS like webpages and audiobooks, the ‘Dina’ voice works much better.
Hey Zeena- will there be options to make the voices more unreal? The use case I imagine is for a character with a damaged vocoder or a broken speaker. Other glitchy affectations could be useful too.
It needs a human to annotate the text with the desired emotion.
Ideally, it would be able to infer the emotion from the text itself, but I think that level of sophistication is a long way off.
Edit: Actually, this might be a perfect candidate for some sort of crowdsourcing. Imagine Wikipedia pages containing hidden annotations for the proper text-to-speech "tone/cadence/whatever" of each sentence or paragraph.
This is a really cool idea -- amazing to think about something like Goodreads taken to the next level, where people share their emotion markups for texts. Imagine how you could try different mappings to see which ones you liked...
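To sketch what such a crowdsourced markup might look like as a data structure (the schema and every field name below are hypothetical, invented purely for illustration, not an existing standard):

```python
# Hypothetical sketch of a crowdsourced emotion-markup record for a
# single sentence; the schema and field names are invented for
# illustration only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EmotionAnnotation:
    text: str               # the sentence being marked up
    emotion: str            # free-text label, e.g. "reassuring"
    intensity: float = 0.5  # 0.0 (flat) .. 1.0 (extreme)
    votes: int = 0          # crowd agreement, Goodreads-style

@dataclass
class AnnotatedPassage:
    source_url: str
    annotations: List[EmotionAnnotation] = field(default_factory=list)

    def top_reading(self, sentence: str) -> Optional[EmotionAnnotation]:
        """Return the most-voted annotation for a given sentence."""
        candidates = [a for a in self.annotations if a.text == sentence]
        return max(candidates, key=lambda a: a.votes, default=None)

# Usage: two competing readings of the same line, resolved by votes.
page = AnnotatedPassage(source_url="https://en.wikipedia.org/wiki/Example")
page.annotations.append(EmotionAnnotation("I love you", "serious-romantic", 0.8, votes=12))
page.annotations.append(EmotionAnnotation("I love you", "sarcastic", 0.6, votes=3))
print(page.top_reading("I love you"))
```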
I hear them if I concentrate on isolating the voices from the music, and you can pick up on quite noticeable flaws, especially in cadence and intonation. What somebody described as vocal fry sounds more like synthesis artifacts. The bare samples further down the page also highlight issues in cadence.
This is obviously an early demo, but it isn't yet at the level where you could narrate an audiobook -- those little problems will quickly become noticeable.
I could see this being used for RPG games to fix the choice deficiency that has been caused by going for fully voiced dialogue. Also, making Hitler read copypastas even more convincingly.
Being as good as even not-top-shelf voice actor talent is a really high bar. I keep my eye on this space because there are a number of things I do where having even just a decent "radio voice" TTS would be useful (and better than I can do myself). But nothing is really there today. In some respects it's better than I can do myself, but certainly not consistently.
The bar really isn't "has to be good out of the box", if it requires some tweaking on a line to line basis that would probably be ok and still much, much cheaper and much quicker to iterate on than voice actors for these high volumes of speech. In a lot of these games the existing voice acting is often consistently poor (literally everything Bethesda ever released comes to mind); certainly quite a few notches below the average AAA voice acting (which is occasionally bad, but on average good).
One of the other challenges with using outside voice talent is that it can be inconvenient/expensive when you need to add/change something. I've been involved with podcasts using an external host and one of the negatives with that process is that if you discover a minor mistake/glitch in the narration late in the process you can't easily fix it.