Hey - developers behind ElevenLabs here. Thank you so much for the constructive and positive feedback - we're taking it on board!
We're currently focused on researching and deploying a different approach to speech synthesis, one that can generate nuanced intonation and emotion by understanding the text and taking context into account. Additionally, we give creators a way to clone their own voice from very short samples. With the published blog post, we're now deploying a way to help them design entirely new ones!
Anyone will be able to generate that level of quality with a simple copy-paste. We're planning to open up the Beta later this month. Our goal is to let you convert any written content into high-quality, compelling audio.
To address a few questions that frequently came up:
- Latency for our streaming TTS is <1s at the quality shown above; latency is the usual problem with existing good TTS models (like tortoise-tts)
- We can clone voices instantly, based on just 5s of speech, with no training required
- We are working on adding SSML-like support for better control; speed controls will be coming as part of that too
- The API is directly available as part of the Beta (rough sketch below); we are preparing the infrastructure to scale easily for the release!
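To give a flavour of the API, here is a rough Python sketch of a synthesis call. Treat the endpoint, header and field names as illustrative assumptions - the exact interface will be in the docs once the Beta opens:

    # Illustrative only - endpoint path, header and field names are
    # assumptions, not the final documented interface.
    import requests

    API_KEY = "your-api-key"
    VOICE_ID = "id-of-a-cloned-or-designed-voice"

    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": API_KEY},
        json={"text": "Convert any written content into compelling audio."},
    )
    resp.raise_for_status()

    with open("output.mp3", "wb") as f:
        f.write(resp.content)  # the response body is the encoded audio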
We are hiring researchers, frontend and full-stack developers! If you are interested, send over your GitHub account and a short message to founders[at]elevenlabs.io.
Hey Piotr - just wanted to say congrats on the awesome work so far, man. The quality is genuinely unbelievable. I don't know if you guys are ready to take clients at scale, but I don't see any reason why all newsletter creators wouldn't use your tech right now to address whole new markets. I'll be following the journey, excited for what's to come.
Maybe I'm late to the party -- but this [1] graphic is great in the linked article.
Could the designer share a little about how it was made? Does it represent one of the generated voices, or is it just 'artistic'? (both are cool, I think).
The voices are really amazing; I couldn't tell that they are synthetic even though I was looking for it.
The only issue is that the actual recordings sound like they have been overcompressed or poorly recorded - is there any way to improve this? Something like super-resolution, but for voice?
We are offering both Speech Synthesis (/TTS) and Voice Lab (Rapid Voice Cloning and Voice Design) as a standard SaaS model (w/ a fixed quota of characters you can voice per month). The API is directly available on the platform. Outside the standard package it flips to a usage-based model, and we do tailored deals for custom needs and discounts for high-volume usage.
We are currently testing the Beta with a range of storytelling and publishing use-cases, tackling relevant feedback and making sure the infrastructure supports it. We are planning to open up the Beta to everyone by the end of this month.
The Voice Design interface is currently a set of sliders and toggles, but we're iterating on what is most accessible.
They will be multi-lang - the tech scales to any language and we are working to add more (it is relatively easy). Here is a demo of Polish TTS:
https://www.youtube.com/watch?v=ra8xFG3keSs
What are the odds of this kind of thing being open source so I can use it at home? So far, most of the "good" text-to-speech systems are commercial services.
I tried using tortoise-tts on my M1. Generating a 7-minute speech took 3 days and, while better than the 15-year-old text-to-speech built into the OS, it wasn't close to the quality of the services above. Maybe I don't know how to use it, but of course it's not as simple as text-to-speech. You ideally need the system to understand the text so it can act out parts.
Of course see my username. I want to generate personal adult content so I'd prefer not to upload it to a service.
Any time I see AI model news on hn nowadays, my first question is whether I can run it locally, and if not, what are the alternatives that I can run locally.
The speed of progress on this front is increasing. These days even "cheap" Rockchip MCUs are packing 5 TOPS AI accelerators. And both AMD and Intel are working on much more powerful ones for their CPUs. Heck, I recently wrote a mobile (Android) app that runs pretty powerful AI for intensive image processing locally on mobile phones, thinking improved privacy would be more in demand than sending everything "to the cloud". I was mildly surprised to discover most people don't care (after writing the app). Still, I wouldn't be surprised if in 10 years the majority of AI people use runs on end-user devices.
Yeah, most people don't care, but it might also be the case that many people who care use iOS, since that's the platform where all photo machine learning provided by the system happens on device.
That's because you're running tortoise on a CPU. It does about a sentence a minute on my 3090 GPU. It's also quite good if you pick "high quality" and train it with 10 sec clips at the sample rate and bitrate it asks for.
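From memory, basic usage per the tortoise-tts README looks roughly like this (the voice folder name is made up; double-check names against the repo):

    # Sketch from the tortoise-tts README - verify exact names in the repo.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()  # uses CUDA if available; painfully slow on CPU

    # "myvoice" is a folder of ~10s WAV clips placed under tortoise/voices/
    voice_samples, conditioning_latents = load_voice("myvoice")

    gen = tts.tts_with_preset(
        "The text you want spoken.",
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="high_quality",  # the slow preset mentioned above
    )
    torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)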
I can't tell if I'm starting to get that old person "new things are scary" instinct or if my gut level of fear about the implications of these things is warranted.
As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse. We're already drowning in ad dominated cynical soulless computer generated search results. Are all online forums going to end up being drowned out by cynical pumped out super cheap to produce simulacrums of creative content now too?
If I want people to buy more Triscuits next year, what's stopping me from writing a bunch of prompts to insert subtle marketing cues to buy Triscuits, with entire fake ecosystems of users, fan art, radio call-ins, user stories, etc. in like every niche community in existence, flooding them with soulless fake interaction?
That exists to a certain extent already, but I don't see how this stuff won't make it way easier, way more effective, and way more widespread.
My YouTube feed is currently filled with videos of whitehats hacking into Indian scam call centers.
Most of the time, the giveaway is the callers' Indian accent. If you could simply type into a box and speak with an American accent, it would be really hard to get caught.
We're opening a pandora's box here if I'm honest. I'm hardly one for pro-regulation, but good God, we're playing with things here that can really hurt us down the line.
Yes, however if that were a problem in the scenario above, I'm pretty sure LLMs could fix that as well.
They're already very good at translation today, it stands to reason that they could do the needful when it comes to turning regional English into American English. Or Bri'ish English, if that's the accent you want your TTS model to have.
There are a lot of words and phrases that indicate that you are speaking Indian English, separate from the accent. Using "learning" as a noun is a very common one in tech.
>As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse.
My sentiments exactly. I think it's a bit of column A and a bit of column B. I'm reminded of the quote "everything has its pleasure and its price". The more expensive things are to produce, the less of it there will be, but what is produced will be higher quality across the board. The less expensive it becomes to produce, the more of it there will be, and the aggregate quality will be lower.
It's not always a bad thing, but the downsides are plain to see when you look at the amount of spam and low-effort content out there. That said, we've all massively enjoyed the upsides too, so it's a balancing act. I think where things were at before the recent wave of generative AI tools was perhaps right on the sweet spot of "it's democratized enough that anyone can have a go, but still requires effort and a degree of talent to do well". The knowledge (and entertainment) I've been able to access thanks to randoms on YouTube is pretty incredible, and I sort of always just accepted the avalanche of spam and clickbait that came with it.
These new tools potentially push that effort/reward ratio to the point where the signal/noise ratio simply gets too low. Of course the "make money online" community is all over this stuff, and today I watched a video of a guy showing how you could supposedly clone courses on Udemy using ChatGPT and other tools. The problem is the "course" would literally consist of generic advice: high-level information on a particular topic that suffices only as a very surface-level introduction and isn't enough to help you build any functional skills in that domain, so it's effectively useless. The only person it's not useless to is him, as he would pocket a cool $5-ish per sale. It was somewhat sad and somewhat sick to hear him cackling away about being able to con people out of money while passing himself off as an expert.
And yet, it's entirely what I would expect would happen.
>The knowledge (and entertainment) I've been able to access thanks to randoms on YouTube is pretty incredible, and I sort of always just accepted the avalanche of spam and clickbait that came with it.
YouTube lets you tell it which channels you don't want recommended... I don't know how well it works; I usually just say I'm not interested in a single video.
I suppose the optimistic view--such as it is--is that there is already a vast amount of low quality content out there that was created for pennies and plastered with ads and/or hoping someone will pay a modest amount. So I'm not sure that things like ChatGPT make things that much worse than they already are--and we can mostly live with things today. The pessimistic view of course is a whole new cohort of grifters decide to give it a run whether they ultimately make money or not.
I agree with this completely. Technology has always made us trade quality for low-quality quantity in exchange for convenience. People now interact more through technology which removes a lot of body language and other enriching experiences.
The most dangerous aspect of this is that each step seems relatively harmless: right now, ChatGPT and DALL-E are amusements, but each small step is building a monstrous and as you say, soulless machine that overloads us so much that we will forget what it's like to even be human.
I firmly believe (and I have given this a lot of thought) that technology is ultimately evil, and that tech companies are trading short term gain of enormous wealth for the very essence of humanity, preying upon the basic instincts of individuals who are also trading their personal worth for convenience.
If I could have one single wish fulfilled in this world, it would be that every single human being gain a natural and instinctual revulsion for advanced technology. If someone asked me what disease was the worst that ever plagued humanity, it would not be smallpox or the flu or COVID, it would be the tech company.
And it's mind-boggling that you criticize and demonize technology on a forum about technology, all while not only being on the internet but also using electricity and a computer, and certainly being surrounded by gadgets and other amenities of modern life.
Nature is SHIT; that is why people created technology. There is nothing preventing you from going to the middle of nowhere and rejecting modernity. No one is forcing you; you are here because you wanted it and liked it. You say people should have an "instinctual revulsion" towards technology, but not even you yourself have this reaction, because it is a stupid idea that not even luddites like you commit to.
If anything, the technology we have nowadays is not even 0.01% of what we should have. We should have the technology to make any movie anyone ever wanted to see in the blink of an eye, all done in the best quality ever imagined. We should have the power to build a Dyson Sphere around the sun to harness its energy. We should be able to construct fully immersive virtual reality, like San Junipero from the Black Mirror episode, and we should have the power to extend human life indefinitely.
Why are you so hostile? What sense does it make to attack him because he does not already have what he is wishing for?
Nature is not "SHIT", for whatever that should mean. Neither the blanket statement "Technology is evil" nor "Nature is shit" make sense. We are humans. We need nature - it is what we evolved to and our technology is not able to replace it without loss. Specific technology is great to overcome existential limitations, but most technology is not.
Sure, there is great technology out there that improves our lives. On the other hand, there is so much technology that makes our lives worse (because of how it is used: e.g., by benefiting a few people while being bad for everyone else, or by helping individuals now but having severe effects later on) that it can hardly be ignored that a better process for selecting or containing technology would be necessary to improve everybody's life. But mankind is bad at forgoing.
Current technology seems to be great at generating convenience and excitement. And the examples you mention (movies, infinite energy, VR and eternal life) feel like a teenager's wishes for more excitement (and this is not meant condescendingly), but life is so much more than excitement. Excitement is just the cherry on top. I'd rather see more tech that is wholesome - but that area seems to be left to nature.
> And it's mind-boggling that you criticize and demonize technology on a forum about technology, all while not only being on the internet but also using electricity and a computer, and certainly being surrounded by gadgets and other amenities of modern life.
I don't demonize all technology. There must be an optimum somewhere, and I would like to engage in open discourse in order to understand where that optimum is. I believe advanced AI takes us away from the optimum.
Extending human life indefinitely is a terrible idea. We have a natural lifespan and we need to function within it. We should not proceed towards being saturated in technology as that will surely destroy the natural life on this planet.
There's no lack of revolutionary tech that has made life overall better with higher quality.
Even like, a Bic lighter is so much better quality than flint and steel or fire sticks.
Smart phones are fantastic quality computers that enable cool stuff like meeting up with friends without first having to leave a note at their house some amount of time beforehand
Dishwashers and laundry machines and modern quality clothing let you avoid spending half your waking hours cleaning stuff, keeping us healthier, and enabling feminism
Electricity lets us stay awake at night without smoke inhalation from candles and fireplaces, with less likelihood of burning the house down, and advanced tech in housing standards make sure that when the building does catch fire, you'll be able to get out safely
Advancements in technology are mostly quite good, and improve both quality and convenience
Seems like for every advantage you list there's also a disadvantage.
> Even like, a Bic lighter is so much better quality than flint and steel or fire sticks.
And is part of the disposable society creating immense amounts of waste.
> Smart phones are fantastic quality computers that enable cool stuff like meeting up with friends without first having to leave a note at their house some amount of time beforehand
Smartphones reduce the quality of social interaction. People often check them when they should be paying attention to their friend, and they make cancelling last-minute easier thereby making people more flaky.
> Dishwashers and laundry machines and modern quality clothing let you avoid spending half your waking hours cleaning stuff, keeping us healthier, and enabling feminism
It's hard to argue with you there, though I suspect that all these "time-saving" inventions also make it more likely that we will spend more time on other things like more work and on electronic devices.
> Electricity lets us stay awake at night without smoke inhalation from candles and fireplaces, with less likelihood of burning the house down, and advanced tech in housing standards make sure that when the building does catch fire, you'll be able to get out safely
And electricity has also made it easier to stay awake at night, staying up later and reducing the quality of sleep. Countless people get worse sleep from being exposed to devices at night. I think it's actually nice to wind down activities when the sun goes down, though obviously that is not as easy in latitudes closer to the poles.
Basically, I think there are a lot of hidden dangers that people accept because in the short term they don't realize that technology makes life less fulfilling.
> If I could have one single wish fulfilled in this world, it would be that every single human being gain a natural and instinctual revulsion for advanced technology. If someone asked me what disease was the worst that ever plagued humanity, it would not be smallpox or the flu or COVID, it would be the tech company.
Time to go live in a cabin in the woods and go write your manifesto on a typewriter...
I think what gets lost in these doom and gloom predictions is that there is a large healthy portion of young adults that do not engage in internet forums or social media.
It is perfectly viable in the modern day to work a job, have passionate hobbies, regularly meet for social events, volunteer, etc., and spend minimal to zero time engaging on the internet, besides pragmatic things like map directions.
I would have died long ago without modern technology, and the many surgeries I have needed. It's hard to take your argument seriously when I consider the consequences of what you're advocating for.
Yeah but you have to balance the positives and negatives. Sure you being alive is all very well, but sometimes GP has to overhear teenagers talking about TikTok, and that is unacceptable.
>I firmly believe (and I have given this a lot of thought) that technology is ultimately evil, and that tech companies are trading short term gain of enormous wealth for the very essence of humanity, preying upon the basic instincts of individuals who are also trading their personal worth for convenience.
I've more or less come to a pretty similar conclusion. I wouldn't characterize it as evil per se, but it's a fool's errand at best. My line of thinking goes somewhat like this: before the Neolithic revolution, humans had an extremely small set of problems. The main problem was "what am I going to eat?", and to a large degree life must have revolved around this problem almost entirely. There weren't that many people, there weren't that many problems, and we somehow persisted in that state for hundreds of thousands of years with literally nothing to write home about. Any advance in technology has literally been trading one problem for at least three more. Now there are loads of problems, loads more people, and the standard approach to solving all the problems is to invent new technologies, which in practice seem to actually exacerbate the problems. So I just sort of view the current state of things as "somewhere around the turn of the Neolithic Revolution we took a wrong turn, and it has widely been regarded as a bad move."
It's a weird sort of defeatist, nihilistic, melancholy worldview, but to be honest, I don't think we're wrong. I mean... what's the endgame of technology?
I would put the optimal state around the Native American level of technology: at least some sense of medicine and first aid, food largely figured out, but no real oppressive technologies yet.
> Technology has always made us trade quality for low-quality quantity in exchange for convenience.
Technology evolves. Even if it may start with some low quality aspects, it doesn't need to stay that way.
> People now interact more through technology which removes a lot of body language and other enriching experiences.
Which is just different communication, not better nor worse in general. Of course this kinda sucks for people who do not know the new communication code well enough. But people do evolve communication to replace relevant missing parts. Body language, for example, was mostly replaced with emojis and memes, which can be better or worse.
> we will forget what it's like to even be human.
You can't forget what you are. You are you every day, every minute, every second of your existence. What you speak about is people having a different culture from the one you know and understand. That's something completely different.
> technology is ultimately evil
Technology is a tool; it can't be evil or good. It's up to the users how they handle it.
> Technology is a tool; it can't be evil or good. It's up to the users how they handle it.
I fundamentally disagree with this premise. I believe evil is roughly equivalent to the inevitability of bringing about evil, and I believe AI falls under such a classification.
> Technology has always made us trade quality for low-quality quantity in exchange for convenience. People now interact more through technology which removes a lot of body language and other enriching experiences.
I went to the mall today and you can tell malls are dying. I lived in a small town where the mall died and it had a zombie like existence a long time before it finally cratered. The mall here in this larger town has that feeling. I also thought about how nice it is to go to the mall just to be out among people. The same is true of the downtown. If the endgame is for everyone to stay home and shop online that's going to be a very soulless existence.
Or don't shop at all and use that extra time to walk with friends in nature. Or when you really do need to shop, avoid the commute and use that extra time to spend with friends in nature. Being forced to be around strangers to get chores done doesn't put soul into my life.
I also avoid laundromats and do laundry at home and it doesn't feel soulless.
Sometimes circumstances mean that going to a mall is the only way some folk can get to meet their fellow human beings. And that doesn't mean it doesn't have other advantages such as conversing with people one might not normally come across.
I don't hate all technology. Rather, I advocate a specific approach to technology, which is a cautious one. Such an approach is antithetical to the classic tech company, and so I hate the approach.
I don't believe all technology is bad. Rather, I believe that all technology needs to be handled in a specific way so that it does not overwhelm us. Though, I do believe that some technology is fundamentally evil.
I would consider myself a hacker, but I do not believe in the capitalistic approach to technology advancement for the sake of short-term profit. I think technology can be used wisely and I do not believe we are doing so.
In fact, I started out as a mathematician and programmer and I still appreciate the beauty of those fields, but I think we need to treat STEM knowledge like we treat knives: useful but dangerous.
I think ,it's not technology that's real problem ,it's that loss of ethics in domain of knowledge ,since industrial and scientific revolution ,we put more emphasis on reductionism and objectification ,even human are being objectified ,this disease of over rationalism plaguing to every domains of knowledge ,I think it's always been constant battle of rationalism vs romantics .
In what language do you put a space before comma but not after? Even without that the usage is puzzling enough to make me genuinely curious what their native tongue is. It almost reads like a haiku.
If we stop pursuing technological progress, we'll never be able to reach humanity's true potential. If we keep pursuing technological progress, those futures are still possible. We need to be wiser and mature about the way we pursue it but we still need to pursue it.
But at the same time, there are tons of positive uses for things like this too. Imagine being a creator who wants to share their interests with the world, but hates their own voice or doesn't have the confidence to speak on camera. You could make a lot of people's lives better by creating content for YouTube, Twitch, TikTok, Instagram, etc, but you wouldn't be brave enough to otherwise.
Something like this could be incredible for those people. A natural sounding alternative to text to speech for people who dislike how they sound.
And it could also be used to anonymise people in documentaries about serious topics (like, say, organised crime) without actors, letting people bring the atrocities of said folks to light without the need to trust others or the risk of being found out.
Other examples could include vTubers, artists creating characters for TV shows, films and video games, etc.
All technology can be abused, and sadly, given how humanity acts, it will be by a small percentage of the population. But for every person abusing it for dubious purposes, there are dozens or hundreds or thousands of others who can make the world better with it.
I think a great tool for this would be a crossover voice-changer AI, so you could still speak naturally but then sound like the model voice; that way it would be a little less soulless.
Honestly, that would be incredible for so many purposes! vTubers and amateur media creators would love to be able to just speak and have it translated into the voice of their characters in a more natural way!
Would also be an interesting one for theme parks, since it could let the costumed characters speak in the voices of the relevant characters rather than remaining silent, which would add a lot to the sense of immersion there too. (Something like this site's tech, on the other hand, could let the animatronics, CGI characters and others hold conversations with guests too, which would also be neat.)
Yes, I’m sure there are many positive uses as well, I just have a hard time seeing how that’s not going to be outweighed by the bad given the current environment. There’s going to need to be some sort of social/cultural/technological adaptation when the negative starts hitting with force to curb it towards positive uses. People need to start thinking about mitigation strategies now.
I am with you on this one. What defines us as a people is the ability to enjoy shared social experiences. The more tailored and personalized an experience becomes, the more it isolates us. We don't (at least I and my social circle don't) speak about TikToks the way we speak about YouTube videos.
But more importantly, boredom triggers innovation. As we are consuming ourselves to death, we might lose the ability to truly create. Maybe that's why the last 20 years of content feel quite generic and sterile.
> As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse.
Eh. I'll take a MAYBE over the past 10 or more years of human-driven social media manipulations and scams and poison. We've made almost literally fucking nothing of value in a decade. It's been ads, Ponzi schemes, and a race to the bottom of tolerance.
I’ll take the democratization of content. Knowing that it will allow the good and the bad.
… so how is it different from the radio or TV or "influencers" now? I have limited time to consume media and am not going to be less picky when it gets easier for people to make garbage.
There was some innovation, but 2010-2020 had some dead air as investors lavished Ponzi-scheme SaaS companies with cash and big firms poured the profits of the early internet into VR, AR, AI, drones, self-driving, etc.
The last year and a half, things have started to pop off. OpenAI, SpaceX, Comma, Helion, many more… that doomer "everything sucks and is collapsing" mentality is on the way out, in my opinion. The time for talk is over and it's time to build, or so they say.
Assuming the internet will soon be mostly generated content, and assuming this content is as dull and soulless as you describe it, I wonder if it's not going to make the real world and in-person interactions more interesting?
I could do with cutting my screen time and the best way to do that might be to make everything boring.
I'm optimistic. I think the progress in AI will make people more aware where the soul really is, as they will learn to distinguish. I think the human spirit will be faster in learning to recognize that which is not really interesting than AI will be able to make improvements faking it.
The ideal use case is someone who wants to be an influencer but is neither pretty nor intelligent and doesn't have a good voice: they could simply use face filters, GPT text, and a voice filter to make themselves sound and look beautiful.
I don't follow influencers, but my guess is that they already do this; at the least they use filters. If someone can use all these tools to gain a considerable amount of fame and fortune, are they really not intelligent? Of course, all these online personas will be lies, even bigger lies than today, but I don't think it really matters. I'd argue that most people following this content are not looking for reality.
I want to agree with you, but I have to admit I hate most human narrators of audiobooks. I would actually much prefer this company's voices to most of the humans reading books that I have encountered.
That's a pretty high bar. Even most Hollywood productions can't afford Meryl Streep, let alone a new site, podcast, or video game.
From wikipedia:
Mary Louise "Meryl" Streep [is] often described as "the best actress of her generation." Streep is particularly known for her versatility and accent adaptability. She has received numerous accolades throughout her career spanning over five decades, including a record 21 Academy Award nominations, winning three, and a record 32 Golden Globe Award nominations, winning eight. She has also received two British Academy Film Awards, two Screen Actors Guild Awards, and three Primetime Emmy Awards, in addition to nominations for a Tony Award and six Grammy Awards.
Maybe it's because I haven't heard the source material, but that Conversational voice really appeals to me. I wish my phone and assistants used that voice.
(and also I can't wait for a "real" ChatGPT-era AI to go with it, to put those braindead jokes of an "assistant" Siri, Alexa, and Google Assistant out to pasture)
When I listened to it, my first impression was that it must be the real actor included for comparison purposes, but that they had failed to label it correctly. I thought it was not machine-generated. I couldn't detect the slightest artifact except what sounded like low-bitrate encoding (maybe a codec geared toward speech). Can you tell anything "off" about it?
As for the encoding artifacts such as a tinny sound, that is the type you hear with an MP3 or a low-bitrate speech codec. For example, when I record a message on https://vocaroo.com/ (the "premier" voice recording service) it sounds 10x worse. Here is a sample I just recorded of my own speech: https://voca.ro/18oSJ1sHU5w5
After my first impression that the narrative example might be a real human mislabelled for comparison purposes, I listened to the next two, labelled News and Conversational. I found these very easy to identify as AI-generated.
Thinking back to why I found the narrative example so compelling, I thought perhaps the issue is that the first example is in British English which I'm less used to than American English. I grew up in the United States. Perhaps since the accent doesn't match my own, it is harder for me to perceive it as generated.
-> Can a native speaker of British English tell us whether, listening to the first example, you can tell in any way that it is a robot? Maybe it is as obvious to you as the next two are to me.
Still, I've listened to a fair amount of British English in my life so perhaps there is an alternative explanation for why the first one was better. For example, it could have been trained on a reader's voice who has narrated thousands of hours in very high studio quality in a fairly consistent way, leaving this type of text much easier to synthesize than the other two examples due to more training data or higher-quality audio.
For me, the first one is really indistinguishable from a narrator's true voice, though it does sound a bit tinny which could also happen as an artifact of the recording process.
In terms of "how confident are you that this is a real person" the second two examples I would put at 0 - it's totally obvious that it is not a real person, whereas the first one sounds like a 10 to me: obviously a real narrator. (With a bit of artifacting that sounds like an mp3.)
Hey! ElevenLabs here, confirming that all 3 samples (including the Narrative one) were AI-generated! We'll be opening up our platform later this month and would love for you to test it yourself!
I'm a native British English speaker and can confirm the first example is incredibly good. It would be very difficult/impossible for most people to tell that the voice is generated from that clip alone.
Okay can I ask a question that has been bothering me for a long time?
Why do seemingly all these text-to-speech programs attempt to produce spoken voice based solely on raw text? Why don't they consume a MIDI-like text-markup language where you can write phonetic pronunciations along with markup about the emotion, volume, speed, etc.? I feel like this is a huge unnecessary roadblock holding back this kind of technology. It'd be like if every music composition program rendered a wave file not by MIDI or VST, but by trying to visually read sheet music. I totally understand why TTS solutions that have to consume arbitrary content, like screen-readers, need to read purely raw text. But content creators don't need to be limited to raw text! Why is everyone doing it that way? Where is the TTS markup language for content creators?
In practice they are next to useless; the expressions are not very... expressive (just try it in the AWS editor). I suspect an LLM would be able to infer the context, or we could use prompt engineering to generate the appropriate tokens encoding emotions for the intermediate neural codecs directly (Mel spectrograms are so passé now, post VALL-E).
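For anyone who hasn't tried it, this is the kind of markup in question - a standard SSML snippet sent to AWS Polly via boto3. The tags are real SSML; how expressively the engine renders them is the complaint above:

    # Standard SSML rendered by AWS Polly via boto3.
    import boto3

    ssml = """<speak>
      Anyone will be able to generate
      <emphasis level="strong">that</emphasis> level of quality.
      <break time="400ms"/>
      <prosody rate="slow" pitch="low">Or so the demos claim.</prosody>
    </speak>"""

    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text=ssml,
        TextType="ssml",   # parse the markup instead of reading it literally
        OutputFormat="mp3",
        VoiceId="Joanna",
    )
    with open("out.mp3", "wb") as f:
        f.write(resp["AudioStream"].read())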
Something I always noticed is that they get Morgan Freeman to do voiceovers for science shows, but he’s not a scientist so he has a sort of generic inflection when he talks about the various ideas in the script. And then you watch Carl Sagan’s COSMOS, where he co-wrote the material, and there is so much depth and expression to his delivery. There’s a lifetime of public speaking, specifically delivering complex scientific topics to a general audience, that Sagan drew from when recording his show.
Sagan would have learned this through conversation with people, and careful updates to his expression and delivery as he matured.
I guess an LLM could improve upon previous methods but I would also say there is a gap that even humans struggle with, which requires really complex knowledge both of public speaking and of the material. It may be a long time before we can really master that with AI systems.
ElevenLabs dev here - we believe this is a 2-step process and agree it is needed!
First, we want the quality you get out of the box to already be brilliant by taking context into account. Granted, that sometimes only gets you 98% of the way there, and we are working to add manipulation options to get you to 100%; for long texts, though, the quality you get is great.
For the second part: current TTS providers give complicated toggles that frequently don't affect the speech in the way you want. Initially we are adding basic SSML-like support, and we have a more robust language-based idea which we hope will land over the next few months!
Your context-aware TTS already sounds very good. If I were using it to produce a narration that other people would be listening to, I would want to make at most a couple of minor adjustments every few sentences. Most of those adjustments would fall into a few categories: stronger or weaker stress on a particular word, rising or falling intonation on a phrase, longer or shorter pauses between words, and correction of the phonemes in a word. A half dozen toggles for those adjustments might be enough for most cases.
I wonder, though, how much training people would need to understand what adjustments need to be made. Experienced actors and narrators should have a good sense of what to fix, but many people might have trouble identifying what sounds strange in the initial TTS output and how it needs to be changed.
I feel like it would be much harder to create a set of hard controls, like MIDI, to affect the voice acting vs. trying to do a co-embedding space of voices and descriptions of the voices and just saying "Say this quietly and meanly". Thoughts?
> I feel like this is a huge unnecessary roadblock holding back this kind of technology.
There are speech synthesis markup languages, like SSML.
And targeting an even lower level has always been possible with commercial speech engines.
Think about how tedious and time-consuming it is to mark up a large amount of copy. Unless we're talking about little hints here and there (which is also doable), it rapidly becomes more cost effective to just pay for voice talent. For this stuff to be appealing it really must be close to fire-and-forget.
The first sweet spot is being able to correct a few things that sound off, as another poster pointed out. "Hey, that's not actually how you pronounce 'synecdoche', it should be 'sɪˈnɛkdəki'." Or "Less emphasis on the first word, more on the second". Little corrections like that. I imagine a two-stage process where the first stage generates 'best guess' SSML (or whatever markup) based on the text. Then the content creator can modify it as necessary before it goes into the second stage of actual voice synthesis.
The second sweet spot is when your text is dynamically generated. Marking up the entire copy might be a lot of work for pre-written text, but it's a great option for dynamically generated text.
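In that imagined pipeline, the hand edit between stage one and stage two could be as small as swapping in a phoneme override - the <phoneme> tag is standard SSML, and the strings here are just for illustration:

    # Stage-1 output (best-guess markup) and the creator's one-word fix,
    # using the standard SSML <phoneme> tag with an IPA pronunciation.
    auto_ssml = "<speak>In a synecdoche, the part stands for the whole.</speak>"

    fixed_ssml = (
        "<speak>In a "
        '<phoneme alphabet="ipa" ph="sɪˈnɛkdəki">synecdoche</phoneme>'
        ", the part stands for the whole.</speak>"
    )
    # fixed_ssml then goes to the stage-2 synthesizer, otherwise unchanged.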
Just my 2 cents, but it seems to me that too little focus in the tech world has been spent on understanding what speech is. Tonality, mood, facial expression and body language are all ignored, or people pretend there is no such thing. I believe this is broadly true in Western society by now: people went digital but do not yet realize why communication went to hell in the last decade.
I used to work in automotive navigation. Other colleagues handled our voice systems, but I do remember all our prompts were written in SSML[1] with varying amounts of specificity. We would use Lua to configure and customize the SSML, including some custom extensions for different voice renderers.
Even with the prompts marked up, there were huge differences between products. Some car OEMs would pay higher fees for better voices and some wouldn't. It's fairly tedious work and difficult to scale as the number of sentences grows. We basically built up a catalog over many years, and the prompts were always explicitly stated as part of our requirements docs. Of course the renderers could say anything you wanted, but letting it run free-form was too big a risk from a product point of view.
There's lots of text and audio already without this; that's probably the key factor practically. Similarly for use cases: converting text that already exists is much more approachable than creating new marked-up text.
Tortoise lets you add prompts into the text, like [I am angry], which modify the voice in interesting ways.
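If memory serves, per the tortoise-tts prompt-engineering notes the bracketed text conditions the delivery but is dropped from the spoken output - roughly like this (verify names against the repo; "tom" is one of the bundled voices):

    # Bracketed text conditions delivery but is removed from the audio.
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()
    voice_samples, conditioning_latents = load_voice("tom")  # bundled voice

    gen = tts.tts_with_preset(
        "[I am really angry,] Leave me alone!",  # speaks only "Leave me alone!"
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="fast",
    )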
But not nearly as manageable. Imagine saying the same thing about music, for example. Musical notation is clearly more work than just humming a tune, but there's still a need for it.
I remember in the 80s of the last century there was speech synthesis software I had on an 8-bit computer that accepted either normal text, or phonetic notation that had extra modifiers for basic things like "make this a question" etc.
Do you remember what that was? DECtalk was around in the 80s, so it might've been that, but it wasn't a generally available thing. Dr. Sbaitso was common, but that wasn't until 91/92.
Yes I do, it was a Commodore 64 cartridge called "Black Box 8". And it spoke Polish with the right accent, with all the sounds not present in English, etc.
I read back then that it was a domestic Polish make, but back then there was no such thing as IP protection, so it is very likely it was based on the work of Dennis Klatt (same as DECtalk). When I heard some DECtalk recordings in a YouTube video not long ago, it immediately reminded me of the Commodore 64 Black Box 8. Although DECtalk spoke English and Black Box 8 spoke Polish, there is some similarity that can be heard in their voices (not pitch, which was a user setting, but more of a rhythm, if that makes sense).
There are solutions that let you use curves like in an audio program to define inflection and pitch, speed of speaking, etc. Some of the competitors of this post's service do that.
I wonder if it would be possible to automate this by pairing the speech synthesis with a ML model that understands the context of the text it is parsing.
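A sketch of that pairing, with ask_llm standing in for whatever completion API you have (entirely hypothetical plumbing):

    # Hypothetical pairing of an LLM with a markup-driven TTS engine.
    # ask_llm is a placeholder for any text-completion API.

    PROMPT = """Annotate the following passage with SSML. Add <prosody>,
    <emphasis> and <break/> tags reflecting the mood of each segment
    (news, narration, tense first person, and so on). Return only SSML.

    Passage:
    {passage}
    """

    def contextual_ssml(passage: str, ask_llm) -> str:
        body = ask_llm(PROMPT.format(passage=passage))
        return f"<speak>{body}</speak>"

    # The output would feed any SSML-aware synthesizer as usual.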
The examples are insanely good. Insanely good. I can barely believe we really live in a world where this is possible. I don't have anything constructive to add.. just wow.
I work in TTS and I just don't believe this. If these really are random texts, not trained on literally the copy they are reading, and with no corrections, I would be surprised. Also, our competitors have good voices, but they also take ages to produce. Maybe these really are legit but take like 1 minute to produce or something. So while this is impressive, I doubt that in practice this would be this high quality and could even approach real time.
Thanks! ElevenLabs dev here - these are generated 6x faster than real-time, with latency of <1s. No corrections required.
We are working on long-form speech synthesis too; needless to say, the audio reading the article has also been synthesized, by a voice that does not exist.
I want to agree, but I searched on their website and found their narration service with 2 full book examples. I listened to the first one for a while, and it's the first time an AI narrator was good enough to keep me listening: https://www.audiostory.ai/2065785/11707800-alice-s-adventure...
Yeah, as I mentioned I work in TTS and agree with you. If this is legit it is pretty amazing. It would certainly put them as one of the top providers, especially given that they could ramp up voice selection. Also, if they truly are training on random stuff, they would not have to pay royalties to voice actors, since these voices don't exist. This is on par with or better than most competitors I am aware of.
I'm listening to an audiobook whose reader is not as good as some of these voices. At one level I'm impressed, but at another I'm saddened, since we are heading into uncharted territory. We are looking at a future where we'll have content (video, audio, and text) by the truckload. More does not mean better. It just means more blah stuff. I don't think that's a future I'm looking forward to living in.
The key will be authenticity and trust. And in a world where content with those qualities ends up being a vast minority of what's online, in-person expertise and meetings will have to make a return out of sheer necessity.
It's starting to feel very much like we're entering the age of information manipulation outlined in the Ghost in the Shell TV series. Except it isn't a 90s/00s depiction of the future; it's just got far fewer robots and prosthetics and is a lot more mundane.
I just keep coming back to the scene where they have satellite video footage of a nuclear submarine preparing for a nuclear attack and the discussion lamenting that it's just video, nobody will believe it as evidence.
I think you are overestimating the capabilities of AI to create novel content. Genuinely high-quality content will always be there, but the amount of BS content will increase.
Imagine if in-game voice chat automatically converted player speech into the voice of the character they're playing - this would resolve a lot of the gender-based harassment problems arising from competitive games requiring vocal communication, since then _everyone's_ default is hiding the actual player's voice, in contrast to the "just use a voice changer if you're a girl playing" suggestion, which itself draws attention by being out of the ordinary.
I feel like if Bethesda really wants another industry-defining game, this is the path they should be taking: AI-generated conversation with AI-generated voice acting and voice-to-text recognition. You could literally have microphone-voice conversations with NPCs that have rich, AI-generated backgrounds and personalities.
Even bigger than that (I think at least) is the potential for fully voiced mods. There’s nothing stopping modders at that point from adding content indistinguishable from the base game.
I'd love to see that. Voice acting for mods where you want to include new NPCs requires either getting someone to donate voice lines or paying for them to be recorded. If you want to patch existing NPCs, that's even harder, because getting the original voice actor to do the new lines would require both persuading them to do it and complying with any agreements they might have with the publisher that could prevent that.
I doubt Bethesda would facilitate this. They'd likely use voice actors to train the voices, and having a famous voice actor saying the saucy kink/BDSM/violent things that you tend to see in some mods wouldn't be great PR.
How would the union take to that, though? This is not meant as an anti-union comment. I'd just be really surprised if Bethesda ever got to work with union VAs ever again if they went all in on an all-AI voiced game.
A galaxy-scale exploration game like Elite Dangerous where you could have more complex and varied interaction would be pretty amazing. The way you could apply these new AI models to video games has some wild potential. I think video games are one of the areas where I see the most potential for positive impact rather than negative impact.
Most of it is high definition audio these days, and then that just gets replaced by a 10gb training set, or maybe the training set becomes a shared resource on the console
Generating quality voice is sufficiently compute-intensive that it would increase the file size: they would still ship all the audio (instead of computing it locally), there would just be so much more of it.
I'm working on a VR space game that actually uses SSML Azure cloud-generated voices for dialog, but I've ditched the roguelike procedural elements, which are wickedly hard to implement.
This would be incredible, especially with the thousands of unique characters games often have nowadays. Imagine every NPC having a unique voice, and the ability to dynamically respond to the players?
Even customisable like your character's appearance.
This was one of my criticisms of Fallout 4: the voice actors weren't bad, it just didn't fit some player characters very well.
Imagine if in-game voice chat automatically converted a % of guys' voices into girls' voices, so they would start getting harassed, realize how awful that is, and then over time stop doing it.
I have an AI service from my mobile company that talks to scammers. The idea is to keep the scammer on the call as long as possible. Then you can listen to or read transcripts of those calls.
I'd like to see this technology become cheap and ubiquitous enough that everyone can choose for themselves what voice they would like to hear right at the moment of consumption. It's always a huge bummer when there's a book I want to listen to on audible with terrible narration. Somebody must have liked that voice for the person to be hired, but people's tastes differ and sometimes the people they've selected just really grate on my ears.
It would also be cool if celebrities / existing voice talent could somehow license the synthesis of their voice. I read something about James Earl Jones doing this with Disney for future Star Wars projects. I'm sure there are people out there who would love to have every work they listen to be in the voice of their favorite narrator/celebrity.
This is cooler than ChatGPT and image generation as far as I'm concerned. If they're able to bring out the emotional connectivity and purposefulness of the human voice, it will be revolutionary...
Awesome. I think in a few years we'll hit levels of AI generative media tech where you can produce, as a lone greybeard, a Cyberpunk 2077-tier title. Same # of bugs too ;)
Still sounds pretty fake to me. There’s a hurriedness to the speech and a monotonic uniformity in enunciation that is uncannily machine. Good to know that voice actors will have jobs for a while longer…
> Good to know that voice actors will have jobs for a while longer…
They don't have to work anymore; just selling their voice and sitting at home collecting royalty payments is the future, according to TFA.
And they’ve been making progress on the roboticness with every new model that comes out. Just a matter of time (and data) for the AIs to figure out how words string together naturally.
This assumes that legislation/adjudication won't tell AI companies that grabbing any content they can find without reimbursing the original author is "fair use" or something equivalent in other jurisdictions. Here's to hoping.
The random voice generator is pretty bad but sometimes you actually get a reasonably good voice except you can hear clicking sounds that interrupt the voice.
I'm both scared and peeking through my fingers at the thought of the evolution of vocal-tuning plugins like Melodyne. Currently you can basically draw the pitch of a vocal performance, however using AI you could re-render the wavefile and adjust more parameters than simply pitch - such as timbre, inflection, vibrato, dynamics, distortion, openness, softness, breathiness, or a bunch of other vocal attributes.
Voice synthesizer plugins, such as Vocaloid or Synthesizer V, can already do that quite convincingly, so it is only a matter of time before it can be applied to existing voice recordings.
I have only ever listened to one audio book and that was "Hitchhiker's guide to the galaxy" by Stephen Fry. This is nowhere close to that.
It does mimic the ups and downs of a voice, but they don't add up. They don't make sense. They don't really have any connection with what is being spoken.
But since it can do expressions, it probably only needs special markers in text to tell it how to really read a sentence.
Stephen Fry is considered one of the best audiobook readers of all time. This AI voice is still better than 100% of AI audiobooks in the market, and likely better than a good portion of HUMAN readers as well.
Thanks (ElevenLabs dev here). We are constantly working on improving our model; we do our own research and train it completely from scratch.
We do support Polish already, and the quality is actually better IMO than English, as we use a newer-generation model: https://www.youtube.com/watch?v=ra8xFG3keSs
Some people think it is fake and that we hired a real voice actor to read it.
I’ve been reading up on this the last couple of days because…oh, look, squirrel!
This seems to me where The Big Guys are going to dominate, because it comes down to a big data problem. For example, Whisper (admittedly speech-to-text) was trained on 680,000 hours of speech data scraped from the web. The next 'contender' used something like 48,000 hours. Who can compete with that who doesn't own a whole cloud?
As someone working on singing synthesis, I know how hard it is to get that last 10% quality that makes a human listener instantly recognise if the voice is real or generated.
These are really impressive results! For anyone interested, here's my team's singing work: https://youtu.be/LPy20zSWhZA
If you are going to have such an intensive particle effect in your videos, at least bother to upload a 4k version so there is a tiny chance that not every single frame consists of nothing but artifacts.
Also, don't put Gumi and English in the same search query on YouTube. I don't know how they did it, but the voices from six years ago sound better than today's SOTA deep-learning TTS...
Clearly the point of the video is its AUDIO content, not the visuals. The lack of a "4k version" makes no difference other than saving you bandwidth :-)
Sounds damn good. Would it be possible to use your own voice for training, and replicate it?
Obviously that could come with some serious security risks, but it would also make content presentation much easier for many people. Gone are the days of doing voiceover recordings for videos.
Hey! ElevenLabs dev here - yes, exactly! We do rapid voice cloning (from just a few seconds of samples) that works really well for American accents - it's already available in Beta. We can also do a professional, near-identical copy with longer samples.
This is awesome for any kind of situation where you need a (human) speaker. No tripping over words, mumbling, or mispronouncing - all fluid and audible with perfect enunciation!
Nice timing, as I'm looking for a way to replace espeak. Are there any pretrained text-to-speech models available? Or some dataset that could be used to train a model?
I've found that it's much easier for me to read and remember when reading along with a voice assistant, for which I need real-time synthesis. Ages ago I bought Ivona text-to-speech and it served me very well for many years. The last few years I've used AWS Polly and espeak (via this: https://github.com/laszukdawid/cracker), but I've kept thinking there must be something better.
There seems to be a fairly wide spectrum between state of the art and just gluing together a bunch of phonemes; it's just that tortoise-tts is up there with the state of the art.
I haven't looked into the mid-range stuff, but there's probably something out there with pretty good quality if you don't mind doing some coding; end-user applications seem to be mostly in the startup SaaS charge-by-the-character domain.
Thanks! That was the first search result and it has a nicely written Colab, so I will definitely give it a try. However, I've seen in the readme that generating a sentence takes quite a long time.
> On a K80, expect to generate a medium sized sentence every 2 minutes.
This, and tools like it, could revolutionize video game voice acting. Have any video game engines integrated tools like this so developers can use them?
I think a great use case for this technology could be to preserve dying languages. I'm sure a lot of work has already gone into preserving the written form of these languages, but training models on data sets of native speakers could be a way to preserve pronunciation.
Waiting? Talk to some small business owners; they're already being bombarded. One common tactic is to ask the AI to do maths and watch it break down and say it's going to ask its "supervisor".
Say you're an indie game developer. In 2022 you'd pay someone on Fiverr to do a 'trailer' voiceover on your game trailer. This year, you'd use this - and also get a few more languages in there. Next year is gonna be an interesting year.
Is that even a thing? You can't copyright a voice. There can be a personality right under state law, but the main case on that was someone hired to sound like Bette Midler for a commercial.
> At Eleven, we're fully committed both to respecting intellectual property rights and to implementing safeguards against potential misuse of our technology
Unlike Stable Diffusion trampling over the copyright of artists without their permission, and OpenAI doing the same for code mangled with incompatible licenses, monetizing it, and outputting the training data verbatim, opening a Pandora's box and then attempting to write detectors and watermarks afterwards. I'm skeptical of Eleven Labs' statement about adding their detectors before release, but we'll see.
Should someone eventually release an open-source competing model, it should be trained on public domain sources. This was the case with Dance Diffusion, as Stability AI would have been sued into the ground by the RIAA had they trained on copyrighted music. [0] [1]
It will only be a matter of time before the legal system catches up with AI-generated content, with scrutiny over models trained on copyrighted content without permission and over how they were trained. Any output generated by an AI is automatically public domain and un-copyrightable. [2]
This AI hype is another VC scam to unload their investments in AI startups onto big tech once again, all while pretending AI is making the world better when they know it is actually doing the opposite, with far-reaching consequences. Of course it can't be stopped, but it also cannot go unchecked and unregulated forever.
Just like the clamping down of cryptocurrency markets and enforcement of regulations, a similar set of rules and regulations will be set for AI companies for complying with existing copyright laws.
The VCs know it is a scam and they are also smart enough to know that this won't go unchecked forever and they will have to unload their investment at the peak of the hype cycle.
Good to see that authors/maintainers of AI models are beginning to think about attribution. But it seems like this will be a hard problem to solve. For example, say my voice was part of the training data set, to what degree can I lay claim to the newly created voices? Also, will there be some sort of grading/ranking (e.g. it could be argued that some of the voices used in the training set are more desirable than others, and therefore their "owners" deserve better fees etc.)?
The text-to-speech player at the top of the article is the actual product, but they didn't go the extra mile and regenerate the audio for the other speed multipliers like 0.7x or 2.0x. You can clearly hear the mp3 struggling, especially at 0.7x speed.
It would have been interesting to hear how they perform in comparison. The fact that you can adjust the voices is one of their selling points, so I really wonder why they haven't done that.
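Presumably the player is just time-stretching client-side. For comparison, generating properly stretched versions offline without pitch artifacts is only a few lines; a sketch assuming librosa and soundfile are available:

    # Sketch: time-stretch a clip to 0.7x and 2.0x speed without shifting
    # pitch, via librosa's phase-vocoder-based stretch (rate > 1 is faster).
    import librosa
    import soundfile as sf

    y, sr = librosa.load("narration.mp3", sr=None)
    for rate in (0.7, 2.0):
        sf.write(f"narration_{rate}x.wav",
                 librosa.effects.time_stretch(y, rate=rate), sr)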
About a month ago, I made a toy bot that listens to your voice with OpenAI Whisper, generates a response with GPT-2 and vocalizes the response using the Eleven Labs. The TTS quality produced by the Eleven Labs algorithm was mind-blowing to me. The API that they provided was super easy to use. Good product!
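Roughly, the loop looked like this (a sketch from memory; the ElevenLabs URL, voice id and key below are illustrative placeholders, so check their docs for the real endpoint):

    # Sketch of the listen -> think -> speak loop described above.
    # The TTS URL, voice id and API key are illustrative placeholders.
    import requests
    import whisper
    from transformers import pipeline

    stt = whisper.load_model("base")                 # speech-to-text
    llm = pipeline("text-generation", model="gpt2")  # toy response model

    user_text = stt.transcribe("question.wav")["text"]
    reply = llm(user_text, max_new_tokens=60)[0]["generated_text"]

    resp = requests.post(
        "https://api.elevenlabs.io/v1/text-to-speech/<voice_id>",  # placeholder
        headers={"xi-api-key": "<api-key>"},
        json={"text": reply},
    )
    with open("reply.mp3", "wb") as f:
        f.write(resp.content)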
I always wondered why these generative voices don't capture the feeling of the text per segment and incorporate it into the output, e.g. news, narration, first person hunted by vampires, whatever. Seems like low-hanging fruit.
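Something like per-segment emotion tagging feeding a delivery style would be a start. A toy sketch of what I mean (the synthesize hook and the style names are made up, and the classifier is just one public emotion model):

    # Toy sketch: tag each text segment's emotion, then hand the segment
    # plus a matching delivery style to the TTS engine. `synthesize` and
    # the style names are hypothetical placeholders.
    from transformers import pipeline

    classifier = pipeline("text-classification",
                          model="j-hartmann/emotion-english-distilroberta-base")

    STYLES = {"fear": "tense", "joy": "bright", "sadness": "soft",
              "anger": "harsh", "neutral": "narration"}

    def voice_segments(segments, synthesize):
        for seg in segments:
            label = classifier(seg)[0]["label"]
            synthesize(seg, style=STYLES.get(label, "narration"))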
Disclaimer: I use tons of audiobooks, so that might not be what people need in general.
By the way if anyone is in this thread due to working on AI speech synthesis for any company, I am interested in AI as well as audio production and I would love to talk about joining the team as an AI researcher. Just send me some mail, my email is in my profile.
This is the major issue with the majority of this technology at the moment. There's a plethora of options available, and more soon to be unveiled by several startups talking up their tech... but they are almost all for "editing"/"after the recording" work. You have to have a complete recorded track to pass into their software (usually by uploading it to their service), which then crunches away at the file and works its magic.
The current real-time options I've found are... lacking. They are mostly fake/toys (not actually using voice cloning, just old-school pitch shifting) or tech demo videos, with a scattering of research papers that are highly variable in terms of "how easily can I reproduce this", ranging from "sure, if I want to waste money on a Google Colab instance" to "only works with a specific model of video card due to reasons".
If you know of any real-time (audio stream in -> audio stream out) voice cloning/transform/replacement tools, feel free to post about them in a reply, this is an area of tech I'm trying to keep on top of and I'm only human so I have no idea what new company or research I might miss.
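To be clear about what I mean by stream-in -> stream-out: the plumbing itself is trivial; it's the model in the middle that nobody ships. A skeleton with the sounddevice package, where transform() is a stub for the voice-conversion model (the hard part is a model that keeps up with the ~20ms block budget):

    # Stream-in -> stream-out skeleton using the sounddevice package.
    # transform() is a stub standing in for a voice-conversion model.
    import sounddevice as sd

    def transform(block):
        return block  # identity passthrough; the model would go here

    def callback(indata, outdata, frames, time, status):
        if status:
            print(status)  # report over/underruns
        outdata[:] = transform(indata)

    # 1024 frames at 48 kHz is roughly a 21 ms block budget per callback.
    with sd.Stream(samplerate=48000, blocksize=1024, channels=1,
                   callback=callback):
        sd.sleep(10_000)  # keep the stream open for 10 seconds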
Hey - ElevenLabs dev here. The quality above works with <1s latency, which for some real-time apps is already sufficient. On smaller chunks of text it can be as quick as ~500ms.
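For anyone wanting to sanity-check that kind of number themselves once the Beta opens, time-to-first-audio-chunk against a streaming endpoint can be measured like this (the URL, header and payload are placeholders, not a documented API):

    # Measure time-to-first-audio-chunk from a streaming TTS endpoint.
    # URL, auth header and payload are placeholders, not a documented API.
    import time
    import requests

    t0 = time.monotonic()
    with requests.post("https://api.example.com/tts/stream",  # placeholder
                       headers={"authorization": "Bearer <key>"},
                       json={"text": "Hello there."}, stream=True) as r:
        first = next(r.iter_content(chunk_size=4096))
    print(f"first audio after {time.monotonic() - t0:.3f}s ({len(first)} bytes)")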
They need to take this and similar AI and come up with better dubbing for movies in other languages. Netflix should really lead the way here with the amount of dubbed content that they currently possess.
If dubbing is where you are going... does that mean you're also going to pair it with deepfaking the videos to make the facial movements match the new vocalizations? Because that'd be a wild product.
I've been using Azure to generate speech audio for my game and it's extremely good. These samples seem even better. I'm wondering how less cherry-picked clips will turn out.
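For context, the Azure flow is only a few lines with their speech SDK; a sketch (subscription key and region are placeholders):

    # Sketch: render one voice line to a wav file with Azure's speech SDK
    # (pip install azure-cognitiveservices-speech). Key/region are placeholders.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription="<key>", region="<region>")
    audio_out = speechsdk.audio.AudioOutputConfig(filename="line.wav")
    synth = speechsdk.SpeechSynthesizer(speech_config=config,
                                        audio_config=audio_out)
    synth.speak_text_async("The drawbridge is closing at dusk.").get()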
More advanced scams potentiated by technology advancements are an arms race hard to keep ahead of. Despite all the possible positives, this seems almost inherently dystopian.
Absorbing information through your audio input while the visual input is busy with a mindless task is amazing. You can listen to an article or an audiobook while doing laundry.
Interestingly, some of the robot styles take a very obvious and dramatic fake breath. I say "fake" since a robot doesn't need to breathe and it's not exactly considered a phoneme. The fake breaths don't really make the robot sound more convincing.
When you listen to the first example labelled "Narrative" you can tell where a human speaker would have inhaled (which is something the AI could have picked up on from copious training data) though the inhale itself could be muted in post-editing, e.g. after the long 24-word first phrase[1] ending in "special magnificence", and then again at the end of the sentence. It could just be the way the AI reads the comma but it is very convincing.
The "News" and "Conversational" examples don't include that pause effect. In the cerulean monologue, there is no pause after "for instance" despite it being in the monologue.
However, the robot takes a deep dramatic breath after the words "I see"[2]: "Oh, okay. I see, [DEEP LOUD DRAMATIC BREATH BY ROBOT], you think this has nothing to do with you. [LOUD DRAMATIC HALF BREATH BY ROBOT] You go to your closet and you select I don't know that lumpy blue sweater for instance because you're trying to tell the world that you take yourself…". There is no pause on the comma around "for instance" though the script has one. I decided to check whether the robot is just copying the original film exactly, and that's not it either.[3]
Comparison:
Robot: "Oh, okay. I see, [DEEP LOUD DRAMATIC BREATH BY ROBOT], you think this has nothing to do with you. [LOUD DRAMATIC HALF BREATH BY ROBOT] You go to your closet [no breath] and you select I don't know that lumpy blue sweater for instance [QUICK HALF BREATH BY ROBOT] because you're trying to tell the world [no breath] that you take yourself too seriously to care about what you put on your back but [no breath] what you don't know is that sweater is not just blue it's not turquoise it's not lapis it's actually cerulean."
Original: "Oh, okay. I see [no breath] you think this has nothing to do with you. [loud long breath] You go to your closet [breath] and you select I don't know that lumpy blue sweater for instance [no breath] because you're trying to tell the world that you [breath] take yourself too seriously to care about what you put on your back but [breath] what you don't know is that sweater is not just blue it's not turquoise it's not lapis it's actually cerulean."
Text:
"Oh, okay. I see, you think this has nothing to do with you.
You… go to your closet, and you select… I don’t know, that lumpy blue sweater for instance, because you’re trying to tell the world that you take yourself too seriously to care about what you put on your back, but what you don’t know is that that sweater is not just blue, it’s not turquoise, it’s not lapis, it’s actually cerulean."
I've annotated the breaths in the "conversational" robot sample vs the original film:
After...           | Robot               | Original           | Same/different?
"I see..."         | [loud breath]       | [no breath]        | Different
"with you..."      | [loud quick breath] | [loud long breath] | Similar
"your closet..."   | [no breath]         | [breath]           | Different
"for instance..."  | [quick half breath] | [no breath]        | Different
"that you..."      | [no breath]         | [breath]           | Different
"back but..."      | [no breath]         | [breath]           | Different
The robot's loud dramatic breath is unmistakable, but it's clear it's not copying the source exactly, since it occurs at different places.
(ElevenLabs dev here) The generative voices and the way they sound are very much a function of all the training data, sampling, and interpolation, as you pointed out. Since a lot of the training recordings do involve deep breaths, the synthesized voice will have them present too, albeit sometimes at different points than a human. Punctuation is the biggest influence on where those pauses happen.
Users so far have actually found it enjoyable to listen to, and say the breathing and pauses are accurate!
I agree - the pauses in the first sample called "Narration" are incredibly accurate and pleasant to listen to.
As a developer, can you tell the difference between "Narration" and the human speaker? What can we listen for, or what gives it away? For my part, I listened to the "Narration" clip many times and, as a native British English speaker also confirms in another comment, it seems very difficult, if not impossible, to tell that the first clip is generated. Congratulations on such an achievement!
I noticed a breath in the demo audio in the linked article and while it stood out, I was impressed by it rather than thinking it felt forced. I'm sure if I listened to enough AI voice it would stand out more and feel forced.
Did you find the whole clip it was in convincing? For me, I didn't even notice the breath but the entire second and third clip felt obviously AI-generated. But the first clip sounded absolutely real (maybe with some compression artifacts - see my other comment.)
Later when I went back and listened carefully for why the first clip felt so "real" I noticed it had pauses. (No breaths per se but they are sometimes removed from edited audio.) However, I then noticed that the conversational clip, which felt unnatural to me, had very obvious breaths. The entire effect of the conversational clip didn't sound like a human at all. It sounded like an AI.
Did you find the whole conversational clip "convincing"? (Did it sound like a human to you?) How about the narration clip?
> Not only can they be more cost-effective without compromising on quality...
That feels dishonest. Even if this AI is just as good at speaking as a professional voice actor (which I'm not sold on), a voice actor does more than just read the line. In ideal circumstances, they have a lot of context for what their character is doing and feeling.
Is this potentially a good option for saving money on video game voices? Quite possibly yes. Is there no compromise on quality? No, not yet.
Past that, the whole "Ethical AI" section's arguments seem ridiculous. Of COURSE it puts the livelihoods of voice actors at risk. Your product's whole point is that fewer man hours are needed for voice work. Just accept that you're making those jobs obsolete. There's a perfectly good argument that it's okay to do that. Throwing bullshit at us to convince us that "no, the voice actors will still have lots of work, and they won't even have to talk!" just makes you sound like snake oil salesmen.
> Even if this AI is just as good at speaking as a professional voice actor (which I'm not sold on), a voice actor does more than just read the line. In ideal circumstances, they have a lot of context for what their character is doing and feeling.
How long do you think that advantage will last? Years? Months? Weeks?
This would put companies like Audm out of business, but it seems like they already only employ one voice actor for most gigs (ya gotta respect how much she gets done though!). I wish there was more work for professional voice actors, audiobooks done by the likes of Roy Dotrice are an absolutely fantastic ride