This Voice Doesn't Exist – Generative Voice AI (elevenlabs.io)
448 points by goleary on Jan 12, 2023 | 260 comments



Hey - developers behind ElevenLabs here. Thank you so much for the constructive and positive feedback - we're taking it on board!

We're currently focused on researching and deploying a different approach to speech synthesis, one that can generate nuanced intonation and emotion by understanding text and taking context into account. Additionally, we provide creators with a way to clone their own voice from very short samples. With the published blog post, we are now deploying a way to help them design entirely new ones!

Anyone will be able to generate that level of quality with a simple copy-paste. We are planning to open up the Beta later this month. Our goal is to let you convert any written content into high-quality, compelling audio.

To address a few questions that frequently came up:

- Latency for our streaming TTS is <1s at the quality shown above; latency is the usual problem with existing good TTS models (like tortoise-tts)

- We can clone voices instantly, from just 5s of speech, with no training required

- We are working on adding SSML-like support for better control; speed controls will be coming as part of that too

- The API is directly available as part of the Beta; we are preparing the infrastructure to scale easily for the release!

We are hiring researchers and frontend and full-stack developers! If you are interested, send your GitHub account and a short message to founders[at]elevenlabs.io.


Hey Piotr - just wanted to say congrats on the awesome work so far, man. The quality is genuinely unbelievable. I don't know if you guys are ready to take clients at scale, but I don't see any reason why all newsletter creators wouldn't use your tech right now to address whole new markets. I'll be following the journey, excited for what's to come.


Maybe I'm late to the party -- but this graphic [1] in the linked article is great.

Could the designer share a little about how it was made? Does it represent one of the generated voices, or is it just 'artistic'? (both are cool, I think).

[1] https://blog.elevenlabs.io/content/images/2023/01/Sequence-0...


The voices are really amazing - I couldn't tell that they are synthetic, even though I was looking for it.

The only issue is that the actual recordings sound like they have been overcompressed, or poorly recorded - is there any way to improve this? Something like superresolution, but for voice?


What is your business model? How are you deciding who gets Beta access? What does the voice generation interface look like?


We are offering both Speech Synthesis (TTS) and Voice Lab (Rapid Voice Cloning and Voice Design) as a standard SaaS model (with a fixed quota of characters you can voice per month). The API is directly available on the platform. Outside of the standard package, pricing flips to a usage-based model, and we do tailored deals for custom needs and discounts for high-volume usage.

We are currently testing the Beta with a range of storytelling and publishing use-cases, tackling relevant feedback and making sure the infrastructure supports it. We are planning to open up the Beta to everyone by the end of this month.

The Voice Design interface is currently a set of sliders and toggles, but we are iterating on what is most accessible.


Hi! Are your models English-only, or do you plan on tackling other languages?


They will be multi-language - the tech scales to any language and we are working to add more (it is relatively easy). Here is a demo of Polish TTS: https://www.youtube.com/watch?v=ra8xFG3keSs


What are the odds of this kind of thing being open source so I can use it at home? So far, most of the "good" text-to-speech systems are commercial services:

https://aws.amazon.com/polly/

https://cloud.google.com/text-to-speech

https://azure.microsoft.com/en-us/products/cognitive-service...

And now this one is also a service.

I tried using tortoise-tts on my M1. Generating a 7-minute speech took 3 days and, while better than the 15-year-old text-to-speech built into the OS, it wasn't close to the quality of the services above. Maybe I don't know how to use it, but of course it's not as simple as text-to-speech - ideally you need the system to understand the text so it can act out parts.
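For what it's worth, the hosted services above are only a few lines of code to call. A minimal sketch using AWS Polly through boto3, assuming AWS credentials are already configured ("Joanna" is one of Polly's stock US English voices):

    import boto3

    # Polly bills per character; the "neural" engine sounds noticeably
    # better than the older "standard" one.
    polly = boto3.client("polly", region_name="us-east-1")

    response = polly.synthesize_speech(
        Text="The quick brown fox jumps over the lazy dog.",
        VoiceId="Joanna",
        OutputFormat="mp3",
        Engine="neural",
    )

    # AudioStream is a streaming body; write it out as an MP3 file.
    with open("speech.mp3", "wb") as f:
        f.write(response["AudioStream"].read())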

Of course see my username. I want to generate personal adult content so I'd prefer not to upload it to a service.


Any time I see AI model news on hn nowadays, my first question is whether I can run it locally, and if not, what are the alternatives that I can run locally.


> what are the alternatives that I can run locally

...you will be disappointed by the answers to that question for the foreseeable future.


I'm the opposite of disappointed. The amount of public pretrained models that have been popping up recently is crazy.


Same model with random tweaks applied.

Just because there is a new toy doesn’t mean capitalism gave up.



There is much more than stable-diffusion out there :)

Of course capitalism doesn't give up, I wouldn't even want it to.


The speed of progress on this front is increasing. These days even "cheap" Rockchip MCUs are packing 5 TOPS AI accelerators, and both AMD and Intel are working on much more powerful ones for their CPUs. Heck, I recently wrote a mobile (Android) app that runs pretty powerful AI for intensive image processing locally on phones, thinking improved privacy would be more in demand than sending everything "to the cloud". I was mildly surprised to discover most people don't care (after writing the app). Still, I wouldn't be surprised if in 10 years the majority of the AI people use runs on end-user devices.


Yeah, most people don't care, but it might also be the case that many people who care use iOS, since that's the platform where all photo machine learning provided by the system happens on device.


That's because you're running tortoise on a CPU. It does about a sentence a minute on my 3090 GPU. It's also quite good if you pick "high quality" and train it with 10-second clips at the sample rate and bitrate it asks for.
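For reference, the GPU path is only a few lines. A rough sketch following the tortoise-tts README - the bundled 'tom' voice is used here; for a custom voice you'd drop your ~10-second clips into a folder under tortoise/voices/ instead:

    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()  # picks up CUDA automatically if available

    # Load the reference clips (and cached conditioning latents) for a voice.
    voice_samples, conditioning_latents = load_voice("tom")

    # Presets trade speed for quality: ultra_fast, fast, standard, high_quality.
    gen = tts.tts_with_preset(
        "Thanks for reading this article all the way to the end.",
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="high_quality",
    )

    torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)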



Effectively superseded by https://github.com/coqui-ai/TTS
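For the "run it at home" use case above, a minimal sketch with Coqui's Python API (recent versions; the model names here are from their released catalog, and the multilingual YourTTS model does rough zero-shot cloning from a short reference WAV):

    from TTS.api import TTS

    # Single-speaker English model; downloads weights on first use.
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text="Hello from a local, open source model.",
                    file_path="local_tts.wav")

    # Zero-shot voice cloning with YourTTS, given a short reference clip.
    clone = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
    clone.tts_to_file(text="Same text, different voice.",
                      speaker_wav="my_reference.wav",
                      language="en",
                      file_path="cloned.wav")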


What kind of personal adult content do you generate? We are curious about the details.


I can't tell if I'm starting to get that old person "new things are scary" instinct or if my gut level of fear about the implications of these things is warranted.

As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse. We're already drowning in ad-dominated, cynical, soulless, computer-generated search results. Are all online forums going to end up drowned out by cynical, pumped-out, super-cheap-to-produce simulacrums of creative content now too?

If I want people to buy more Triscuits next year, what's stopping me from writing a bunch of prompts to insert subtle marketing cues to buy Triscuits, with entire fake ecosystems of users, fan art, radio call-ins, user stories, etc. in like every niche community in existence, and flooding them with soulless fake interaction?

That exists to a certain extent already, but I don't see how this stuff won't make it way easier, way more effective, and way more widespread.


My YouTube feed is currently filled with videos of whitehats hacking into Indian scam call centers.

Most of the time, the giveaway is the callers' Indian accent. If you could simply type into a box and speak with an American accent, it would be really hard to get caught.

We're opening a Pandora's box here, if I'm honest. I'm hardly one for pro-regulation, but good God, we're playing with things that can really hurt us down the line.


> If you could simply type into a box and speak with an American accent, it would be really hard to get caught.

Not really. If they say "kindly do something", they are Indian scammers.


Yes, however if that were a problem in the scenario above, I'm pretty sure LLMs could fix that as well.

They're already very good at translation today, it stands to reason that they could do the needful when it comes to turning regional English into American English. Or Bri'ish English, if that's the accent you want your TTS model to have.


"Hey chatGPT, write a short script convincing someone that I'm from a small town in America"


And here I thought the giveaway was just them trying to blatantly scam you.


There are a lot of words and phrases that indicate that you are speaking Indian English, separate from the accent. Using "learning" as a noun is a very common one in tech.


If you can use AI to create a fake voice, you can also use AI to create a prompt for the voice


I'm sure that could be corrected by even a very basic language model


>As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse.

My sentiments exactly. I think it's a bit of column A and a bit of column B. I'm reminded of the quote "everything has its pleasure and its price". The more expensive things are to produce, the less of them there will be, but what is produced will be higher quality across the board. The less expensive things become to produce, the more of them there will be, and the aggregate quality will be lower.

It's not always a bad thing, but the downsides are plain to see when you look at the amount of spam and low-effort content out there. That said, we've all massively enjoyed the upsides too, so it's a balancing act. I think where things were at before the recent wave of generative AI tools was perhaps right on the sweet spot of "it's democratized enough that anyone can have a go, but still requires effort and a degree of talent to do well". The knowledge (and entertainment) I've been able to access thanks to randoms on YouTube is pretty incredible, and I sort of always just accepted the avalanche of spam and clickbait that came with it.

These new tools potentially push that effort/reward ratio to the point where the signal/noise ratio simply gets too low. Of course the "make money online" community is all over this stuff, and today I watched a video of a guy showing how you could supposedly clone courses on Udemy using ChatGPT and other tools. The problem is the "course" would literally consist of generic advice: high-level information on a particular topic that suffices only as a very surface-level introduction and isn't enough to help you build any functional skills in that domain, so it's effectively useless. The only person it's not useless to is him, since he would pocket a cool $5-ish per sale. It was somewhat sad and somewhat sick to hear him cackling away about being able to con people out of money while passing himself off as an expert.

And yet, it's entirely what I would expect would happen.


>The knowledge (and entertainment) I've been able to access thanks to randoms on YouTube is pretty incredible, and I sort of always just accepted the avalanche of spam and clickbait that came with it.

Adblockers are amazing


Are there "influencer" blockers?


YouTube lets you tell them which channels you don't want recommended... I don't know how well it works; I usually just say I'm not interested in a single video.


I suppose the optimistic view--such as it is--is that there is already a vast amount of low quality content out there that was created for pennies and plastered with ads and/or hoping someone will pay a modest amount. So I'm not sure that things like ChatGPT make things that much worse than they already are--and we can mostly live with things today. The pessimistic view of course is a whole new cohort of grifters decide to give it a run whether they ultimately make money or not.


I agree with this completely. Technology has always made us trade quality for low-quality quantity in exchange for convenience. People now interact more through technology which removes a lot of body language and other enriching experiences.

The most dangerous aspect of this is that each step seems relatively harmless: right now, ChatGPT and DALL-E are amusements, but each small step is building a monstrous and as you say, soulless machine that overloads us so much that we will forget what it's like to even be human.

I firmly believe (and I have given this a lot of thought) that technology is ultimately evil, and that tech companies are trading short term gain of enormous wealth for the very essence of humanity, preying upon the basic instincts of individuals who are also trading their personal worth for convenience.

If I could have one single wish fulfilled in this world, it would be that every single human being gain a natural and instinctual revulsion for advanced technology. If someone asked me what disease was the worst that ever plagued humanity, it would not be smallpox or the flu or COVID, it would be the tech company.


And it's mind-boggling that you criticize and demonize technology on a forum about technology, all while not only being on the internet but also using electricity and a computer, and certainly being surrounded by gadgets and other amenities of modern life.

Nature is SHIT; that is why people created technology. There is nothing preventing you from going to the middle of nowhere and rejecting modernity. No one is forcing you, but you are here because you wanted it and liked it. You say people should have an "instinctual revulsion" towards technology, but not even you yourself have this reaction, because it is a stupid idea that not even luddites like you commit to.

If anything, the technology we have nowadays is not even 0.01% of what we should have. We should have the technology to make any movie anyone ever wanted to see in the blink of an eye, all done in the best quality ever imagined. We should have the power to build a Dyson sphere around the sun to harness its energy. We should be able to construct fully immersive virtual reality, like San Junipero from the Black Mirror episode, and we should have the power to extend human life indefinitely.


Why are you so hostile? What sense does it make to attack him because he does not already have what he is wishing for?

Nature is not "SHIT", for whatever that should mean. Neither the blanket statement "Technology is evil" nor "Nature is shit" make sense. We are humans. We need nature - it is what we evolved to and our technology is not able to replace it without loss. Specific technology is great to overcome existential limitations, but most technology is not.

Sure, there is great technology out there that improves our lives. On the other hand, there is so much technology that makes our lives worse (because of how it is used: e.g., by benefiting a few people while being bad for everyone else, or by helping individuals now but having severe effects later on) that it can hardly be ignored that a better process for the selection or containment of technology would be necessary to improve everybody's life. But mankind is bad at forgoing.

Current technology seems to be great at generating convenience and excitement. And the examples you mention (movies, infinite energy, VR and eternal life) feel like a teenager's wishes for more excitement (and this is not meant condescendingly), but life is so much more than excitement. Excitement is just the cherry on top. I'd rather see more tech that is wholesome - but that area seems to be left to nature.


> And it's mind-boggling that you criticize and demonize technology on a forum about technology, all while not only being on the internet but also using electricity and a computer, and certainly being surrounded by gadgets and other amenities of modern life.

I don't demonize all technology. There must be an optimum somewhere, and I would like to engage in open discourse in order to understand where that optimum is. I believe advanced AI takes us away from the optimum.

Extending human life indefinitely is a terrible idea. We have a natural lifespan and we need to function within it. We should not proceed towards being saturated in technology as that will surely destroy the natural life on this planet.



There's no lack of revolutionary tech that has made life overall better with higher quality.

Even like, a Bic lighter is so much better quality than flint and steel or fire sticks.

Smart phones are fantastic quality computers that enable cool stuff like meeting up with friends without first having to leave a note at their house some amount of time beforehand

Dishwashers and laundry machines and modern quality clothing let you avoid spending half your waking hours cleaning stuff, keeping us healthier, and enabling feminism

Electricity lets us stay awake at night without smoke inhalation from candles and fireplaces, with less likelihood of burning the house down, and advanced tech in housing standards make sure that when the building does catch fire, you'll be able to get out safely

Advancements in technology are mostly quite good, and improve both quality and convenience


Seems like for every advantage you list there's also a disadvantage.

> Even like, a Bic lighter is so much better quality than flint and steel or fire sticks.

And is part of the disposable society creating immense amounts of waste.

> Smart phones are fantastic quality computers that enable cool stuff like meeting up with friends without first having to leave a note at their house some amount of time beforehand

Smartphones reduce the quality of social interaction. People often check them when they should be paying attention to their friend, and they make cancelling last-minute easier thereby making people more flaky.

> Dishwashers and laundry machines and modern quality clothing let you avoid spending half your waking hours cleaning stuff, keeping us healthier, and enabling feminism

It's hard to argue with you there, though I suspect that all these "time-saving" inventions also make it more likely that we will spend more time on other things like more work and on electronic devices.

> Electricity lets us stay awake at night without smoke inhalation from candles and fireplaces, with less likelihood of burning the house down, and advanced tech in housing standards make sure that when the building does catch fire, you'll be able to get out safely

And electricity has also made it easier to stay awake at night, staying up later and reducing the quality of sleep. Countless people get worse sleep from being exposed to devices at night. I think it's actually nice to wind down activities when the sun goes down, though obviously that is not as easy at latitudes closer to the poles.

Basically, I think there are a lot of hidden dangers that people accept because in the short term they don't realize that technology makes life less fulfilling.


> Dishwashers and laundry machines

> enabling feminism

That's all it took?

So if there's no electricity - it's back to square one?


> If I could have one single wish fulfilled in this world, it would be that every single human being gain a natural and instinctual revulsion for advanced technology. If someone asked me what disease was the worst that ever plagued humanity, it would not be smallpox or the flu or COVID, it would be the tech company.

Time to go live in a cabin in the woods and go write your manifesto on a typewriter...


Typewriter? You heathen! It must be chiseled into the cave wall with a bone.


Writing?! Surely not...


Ah yes... the oral tradition. How could I forget?


All this focus on writing things down has weakened your memory!


I think what gets lost in these doom-and-gloom predictions is that there is a large, healthy portion of young adults who do not engage in internet forums or social media.

It is perfectly viable in the modern day to work a job, have passionate hobbies, regularly meet for social events, volunteer, etc., and spend minimal to zero time engaging on the internet, besides pragmatic things like map directions.


I would have died long ago without modern technology, and the many surgeries I have needed. It's hard to take your argument seriously when I consider the consequences of what you're advocating for.


Yeah but you have to balance the positives and negatives. Sure you being alive is all very well, but sometimes GP has to overhear teenagers talking about TikTok, and that is unacceptable.


>I firmly believe (and I have given this a lot of thought) that technology is ultimately evil, and that tech companies are trading short term gain of enormous wealth for the very essence of humanity, preying upon the basic instincts of individuals who are also trading their personal worth for convenience.

I've more or less come to a pretty similar conclusion. I wouldn't characterize it as evil per se, but it's a fool's errand at best. My line of thinking goes somewhat like this: before the Neolithic Revolution, humans had an extremely small set of problems. The main problem was "what am I going to eat?", and to a large degree life must have revolved around this problem almost entirely. There weren't that many people, there weren't that many problems, and we somehow persisted in that state for hundreds of thousands of years with literally nothing to write home about. Any advance in technology has literally been trading one problem for at least three more. Now there are loads of problems, loads more people, and the standard approach to solving all the problems is to invent new technologies, which in practice seem to actually exacerbate the problems. So I just sort of view the current state of things as "somewhere around the turn of the Neolithic Revolution we took a wrong turn, and it has widely been regarded as a bad move."

It's a weird sort of defeatist, nihilistic, melancholy worldview, but to be honest, I don't think we're wrong. I mean... what's the endgame of technology?


I would put the optimal state around the Native American level of technology. At least some sense of medicine and first aid, food is largely figured out, but no real oppressive technologies figured out yet.


>but no real oppressive technologies figured out yet

And therein lies the arms race, for if you figure out an oppressive technology, you can oppress people who don't have the means to resist it.

Hmm... the evil argument starts to make more sense.


> Technology has always made us trade quality for low-quality quantity in exchange for convenience.

Technology evolves. Even if it may start with some low quality aspects, it doesn't need to stay that way.

> People now interact more through technology which removes a lot of body language and other enriching experiences.

Which is just different communication, not better, nor worse in general. Of course this kinda sucks for people who do not know the new communication code well enough. But people do evolve communication to replace relevant missing parts. Body language, for example, was mostly replaced with emojis and memes, which can be better, or worse.

> we will forget what it's like to even be human.

You can't forget what you are. You are you every day, every minute, every second of your existence. What you speak about is people having a different culture from the one you know and understand. That's something completely different.

> technology is ultimately evil

Technology is a tool; it can't be evil or good. It's up to the users how they handle it.


> Technology is a tool; it can't be evil or good. It's up to the users how they handle it.

I fundamentally disagree with this premise. I believe a thing is evil if it inevitably brings about evil, and I believe AI falls under such a classification.


> Which is just different communication, not better, nor worse in general.

That is where we disagree fundamentally. I do posit that the communication is actually absolutely and unequivocally worse.


> Technology has always made us trade quality for low-quality quantity in exchange for convenience. People now interact more through technology which removes a lot of body language and other enriching experiences.

I went to the mall today and you can tell malls are dying. I lived in a small town where the mall died, and it had a zombie-like existence for a long time before it finally cratered. The mall here in this larger town has that feeling. I also thought about how nice it is to go to the mall just to be out among people. The same is true of the downtown. If the endgame is for everyone to stay home and shop online, that's going to be a very soulless existence.


Or don't shop at all and use that extra time to walk with friends in nature. Or when you really do need to shop, avoid the commute and use that extra time to spend with friends in nature. Being forced to be around strangers to get chores done doesn't put soul into my life.

I also avoid laundromats and do laundry at home and it doesn't feel soulless.


I never went to malls to socialize.


Sometimes circumstances mean that going to a mall is the only way some folk can get to meet their fellow human beings. And that doesn't mean it doesn't have other advantages such as conversing with people one might not normally come across.


Why are you on a forum called Hacker News if you hate technology so much?


It's surprisingly on brand for this place.


I don't hate all technology. Rather, I advocate a specific approach to technology, which is a cautious one. Such an approach is antithetical to the classic tech company, and so I hate the approach.

I don't believe all technology is bad. Rather, I believe that all technology needs to be handled in a specific way so that it does not overwhelm us. Though, I do believe that some technology is fundamentally evil.

I would consider myself a hacker, but I do not believe in the capitalistic approach to technology advancement for the sake of short-term profit. I think technology can be used wisely and I do not believe we are doing so.

In fact, I started out as a mathematician and programmer and I still appreciate the beauty of those fields, but I think we need to treat STEM knowledge like we treat knives: useful but dangerous.


I think ,it's not technology that's real problem ,it's that loss of ethics in domain of knowledge ,since industrial and scientific revolution ,we put more emphasis on reductionism and objectification ,even human are being objectified ,this disease of over rationalism plaguing to every domains of knowledge ,I think it's always been constant battle of rationalism vs romantics .


May I suggest you research the proper way to format commas?

I know, proper form is a skill that has been lost with the advent of social media.

I posit that the mastery of orthography is a precondition to be taken seriously even today.


Sorry ,English isn't my native language ,thanks for the feedback anyway ,I will sure consider researching.


I think English just isn't their native language; this seems pretty snarky.


In what language do you put a space before comma but not after? Even without that the usage is puzzling enough to make me genuinely curious what their native tongue is. It almost reads like a haiku.


> technology is ultimately evil

If we stop pursuing technological progress, we'll never be able to reach humanity's true potential. If we keep pursuing technological progress, those futures are still possible. We need to be wiser and mature about the way we pursue it but we still need to pursue it.


But at the same time, there are tons of positive uses for things like this too. Imagine being a creator who wants to share their interests with the world, but hates their own voice or doesn't have the confidence to speak on camera. You could make a lot of people's lives better by creating content for YouTube, Twitch, TikTok, Instagram, etc, but you wouldn't be brave enough to otherwise.

Something like this could be incredible for those people. A natural sounding alternative to text to speech for people who dislike how they sound.

And it could also be used to anonymise people in documentaries about serious topics (like say, organised crime) without actors, letting people bring the atrocities of said folks to light without need to trust others or the risk of being found out.

Other examples could include vTubers, artists creating characters for TV shows, films and video games, etc.

All technology can be abused, and sadly, given how humanity acts, it will be by a small percentage of the population. But for every person abusing it for dubious purposes, there are dozens or hundreds or thousands of others who can make the world better with it.


I think a great tool for this would be a crossover voice-changer AI, so you could still speak naturally but then sound like the model voice; that way it would be a little less soulless.


Honestly, that would be incredible for so many purposes! vTubers and amateur media creators would love to be able to just speak and have it translated into the voices of their characters in a more natural way!

Would also be an interesting one for theme parks, since it could let the costumed characters speak in the voices of the relevant characters rather than remaining silent, which would add a lot to the sense of immersion there too. (Something like the website's tech, on the other hand, could let animatronics, CGI characters and others hold conversations with guests too, which would also be neat.)


Yes, I'm sure there are many positive uses as well; I just have a hard time seeing how that's not going to be outweighed by the bad, given the current environment. There's going to need to be some sort of social/cultural/technological adaptation to curb it towards positive uses when the negative starts hitting with force. People need to start thinking about mitigation strategies now.


I am with you on this one. What defines us as a people is the ability to enjoy shared social experiences. The more tailored and personalized an experience becomes, the more it isolates us. We don't (at least I and my social circle don't) speak about TikToks the way we speak about YouTube videos.

But more importantly, boredom triggers innovation. As we are consuming ourselves to death, we might lose the ability to truly create. Maybe that's why the last 20 years of content feel quite generic and sterile.


I think we already have a lot of soulless human generated search results.

I think there will be need for a greater level of filtering and curation yes, but I see it as an opportunity both for creators and curators.

The barriers to entry for media creation will go down, but with saturation the already low profit margins will get worse.


Also, AI will do the filtering, not just blocking uninteresting content, but actually removing known and uninteresting information from content.


I'm not confident that AI will ever keep up with scammers' new tactics.


We'll go back to the old way of consuming media - recommended by friends, vetted by known curators.


> As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse.

Eh. I'll take a MAYBE over the past 10 years or more of human-driven social media manipulations and scams and poison. We've made almost literally fucking nothing of value in a decade. It's been ads, Ponzi schemes, and a race to the bottom of tolerance.

I’ll take the democratization of content. Knowing that it will allow the good and the bad.

… so how is it different from the radio or TV or "influencers" now? I have limited time to consume media and am not going to be less picky when it gets easier for people to make garbage.


There was some innovation, but 2010-2020 had some dead air as investors lavished Ponzi-scheme SaaS companies with cash and big firms poured the profits of the early internet into VR, AR, AI, drones, self-driving, etc.

The last year and a half, things have started to pop off. OpenAI, SpaceX, Comma, Helion, many more... that doomer "everything sucks and is collapsing" mentality is on the way out, in my opinion. The time for talk is over and it's time to build, or so they say.


I’m hopeful that, as with most apocalypse-capable technologies, humans will adapt and overcome.

Humans will be different on the other side of this coming wave of simulation indistinguishable from reality, but we’ll be okay.

It’ll suck living through the transition though. Not looking forward to the crap tsunami on the horizon.

But as a species, we’ll adapt and survive as always.


Assuming the internet will soon be mostly generated content, and assuming this content is as dull and soulless as you describe it, I'm wondering if it's not going to make the real world and in-person interactions more interesting?

I could do with cutting my screen time and the best way to do that might be to make everything boring.


> to end up making an incredible amount of sterile soulless content that makes everyone's lives worse

Worse? We already have 8 billion people, a significant part of whom are pumping out sterile soulless content.

If anything it will heat up the competition in the creativity field and allow truly creative things to proliferate.


I'm optimistic. I think the progress in AI will make people more aware of where the soul really is, as they learn to distinguish. I think the human spirit will be faster at learning to recognize what is not really interesting than AI will be at faking it.


The ideal use case is someone who wants to be an influencer but is neither pretty nor intelligent and doesn't have a good voice; they could simply use face filters, GPT text, and a voice filter to make themselves sound and look beautiful.


I don't follow influencers, but my guess is that they already do this; at least they use filters. If someone can use all these tools to gain a considerable amount of fame and fortune, is (s)he really not intelligent? Of course, all these online personas will be lies, even bigger lies than today, but I don't think it really matters. I'd argue that most people following this content are not looking for reality.


I want to agree with you, but I have to admit I hate most human narrators of audiobooks. I would actually much prefer this company's voices to most of the humans reading books that I have encountered.


The “narrative” example is pretty good, but the “conversational” example is rather unpleasant to listen to.

(Especially if you know how well Meryl Streep delivers that monologue in the original: https://youtu.be/Ja2fgquYTCg)


That's a pretty high bar. Even most Hollywood productions can't afford Meryl Streep, let alone a new site, podcast, or video game.

From wikipedia:

Mary Louise "Meryl" Streep [is] often described as "the best actress of her generation." Streep is particularly known for her versatility and accent adaptability. She has received numerous accolades throughout her career spanning over five decades, including a record 21 Academy Award nominations, winning three, and a record 32 Golden Globe Award nominations, winning eight. She has also received two British Academy Film Awards, two Screen Actors Guild Awards, and three Primetime Emmy Awards, in addition to nominations for a Tony Award and six Grammy Awards.


Maybe it's because I haven't heard the source material, but that Conversational voice really appeals to me. I wish my phone and assistants used that voice.

(and also I can't wait for a "real" ChatGPT-era AI to go with it, to put those braindead jokes of an "assistant" Siri, Alexa, and Google Assistant out to pasture)


Let's talk about this "Narrative" example.

When I listened to it, my first impression was that it must be the real actor, included for comparison purposes but mislabelled. I thought it was not machine-generated. I couldn't detect the slightest artifact, except what sounded like low-bitrate audio encoding (maybe using a codec geared toward speech). Can you tell anything "off" about it?

As for the encoding artifact - a tinny, low-bitrate sound - that is the type you hear on an MP3 or a low-bitrate speech codec. For example, when I record a message on https://vocaroo.com/ the "premier" voice recording service, it sounds 10x worse. Here is a sample I just recorded of my own speech: https://voca.ro/18oSJ1sHU5w5

After my first impression that the narrative example might be a real human mislabelled for comparison purposes, I listened to the next two, labelled News and Conversational. I found these very easy to tell as AI-generated.

Thinking back to why I found the narrative example so compelling, I thought perhaps the issue is that the first example is in British English which I'm less used to than American English. I grew up in the United States. Perhaps since the accent doesn't match my own, it is harder for me to perceive it as generated.

-> Can a native speaker of British English tell us whether listening to the first example you can tell in any way that it is a robot? Maybe it is as obvious to you as the next two are to me.

Still, I've listened to a fair amount of British English in my life so perhaps there is an alternative explanation for why the first one was better. For example, it could have been trained on a reader's voice who has narrated thousands of hours in very high studio quality in a fairly consistent way, leaving this type of text much easier to synthesize than the other two examples due to more training data or higher-quality audio.

For me, the first one is really indistinguishable from a narrator's true voice, though it does sound a bit tinny which could also happen as an artifact of the recording process.

In terms of "how confident are you that this is a real person" the second two examples I would put at 0 - it's totally obvious that it is not a real person, whereas the first one sounds like a 10 to me: obviously a real narrator. (With a bit of artifacting that sounds like an mp3.)

[1] The text is here https://www.nytimes.com/2001/11/19/books/chapters/the-lord-o...


Hey! ElevenLabs here, confirming that all 3 samples (including the Narrative one) were AI-generated! We'll be opening up our platform later this month and would love for you to test it yourself!


Congratulations!! I can't tell the first isn't human no matter how much I try. That is an amazing achievement.


I'm a native British English speaker and can confirm the first example is incredibly good. It would be very difficult/impossible for most people to tell that the voice is generated from that clip alone.


Agreed, the intro and narrative ones are great. The news one is terrible.


Okay, can I ask a question that has been bothering me for a long time?

Why do seemingly all these text-to-speech programs attempt to produce spoken voice based solely on raw text? Why don't they consume a MIDI-like text markup language where you can write phonetic pronunciations along with markup about the emotion, volume, speed, etc.? I feel like this is a huge unnecessary roadblock holding back this kind of technology. It'd be like if every music composition program rendered a wave file not from MIDI or VST, but by trying to visually read sheet music. I totally understand why TTS solutions that have to consume arbitrary content, like screen readers, need to read purely raw text. But content creators don't need to be limited to raw text! Why is everyone doing it that way? Where is the TTS markup language for content creators?


I don’t know how many of the solutions offer this, but there is a markup language for TTS:

https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Langua...

Amazon Polly, (which seems kind of ancient with all these new solutions showing up) has supported SSML for some time.

AWS Polly SSML docs: https://docs.aws.amazon.com/polly/latest/dg/ssml.html
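To give a concrete feel for it, here is roughly what that markup looks like when sent through Polly - the tags below are standard SSML that Polly documents supporting:

    import boto3

    ssml = """
    <speak>
        Let me think about that. <break time="500ms"/>
        <prosody rate="slow" pitch="low">This part is read slowly and lower,</prosody>
        and I <emphasis level="strong">really</emphasis> mean it.
    </speak>
    """

    polly = boto3.client("polly", region_name="us-east-1")
    response = polly.synthesize_speech(
        TextType="ssml",  # tell Polly to parse the markup
        Text=ssml,
        VoiceId="Matthew",
        OutputFormat="mp3",
    )
    with open("ssml_demo.mp3", "wb") as f:
        f.write(response["AudioStream"].read())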


In practice they are next to useless; the expressions are not very... expressive (just try it in the AWS editor). I suspect an LLM would be able to infer the context, or we could use prompt engineering to generate the appropriate emotion-encoding tokens for the intermediate neural codecs directly (Mel spectrograms are so passé now, post-VALL-E).


Something I always noticed is that they get Morgan Freeman to do voiceovers for science shows, but he’s not a scientist so he has a sort of generic inflection when he talks about the various ideas in the script. And then you watch Carl Sagan’s COSMOS, where he co-wrote the material, and there is so much depth and expression to his delivery. There’s a lifetime of public speaking, specifically delivering complex scientific topics to a general audience, that Sagan drew from when recording his show.

Sagan would have learned this through conversation with people, and careful updates to his expression and delivery as he matured.

I guess an LLM could improve upon previous methods but I would also say there is a gap that even humans struggle with, which requires really complex knowledge both of public speaking and of the material. It may be a long time before we can really master that with AI systems.


Maybe the only way to express speech precisely is the speech itself?


ElevenLabs dev here - we believe this is a 2-step process and agree it is needed!

First, we want the quality you get out-of-the-box to already be brilliant by taking context into account. Granted, that sometimes gets you 98% there, and we are working to add manipulation possibilities to get you to 100%; for long texts, though, the quality you get is great.

For the second part: current TTS providers give complicated toggles that frequently don't affect the speech in the way you want. Initially we are adding basic SSML-like support, and we have a more robust language-based idea which we hope will land over the next few months!


Your context-aware TTS already sounds very good. If I were using it to produce a narration that other people would be listening to, I would want to make at most a couple of minor adjustments every few sentences. Most of those adjustments would fall into a few categories: stronger or weaker stress on a particular word, rising or falling intonation on a phrase, longer or shorter pauses between words, and correction of the phonemes in a word. A half dozen toggles for those adjustments might be enough for most cases.

I wonder, though, how much training people would need to understand what adjustments need to be made. Experienced actors and narrators should have a good sense of what to fix, but many people might have trouble identifying what sounds strange in the initial TTS output and how it needs to be changed.


I feel like it would be much harder to create a set of hard controls, like MIDI, to affect the voice acting vs. trying to do a co-embedding space of voices and descriptions of the voices and just saying "Say this quietly and meanly". Thoughts?


Exactly! The only issue is having a well-labelled dataset with those types of cues. We have an idea of how to do it, though!


> I feel like this is a huge unnecessary roadblock holding back this kind of technology.

There are speech synthesis markup languages, like SSML. And targeting even lower level has always been possible with commercial speech engines.

Think about how tedious and time-consuming it is to mark up a large amount of copy. Unless we're talking about little hints here and there (which is also doable), it rapidly becomes more cost-effective to just pay for voice talent. For this stuff to be appealing it really must be close to fire-and-forget.


I think there are two "sweet spots" here.

The first is being able to correct a few things that sound off, as another poster pointed out. "Hey, that's not actually how you pronounce 'synecdoche', it should be 'sɪˈnɛk.də.ki'." Or "Less emphasis on the first word, more on the second". Little corrections like that. I imagine a two-stage process where the first stage generates 'best guess' SSML (or whatever markup) based on the text. Then the content creator can modify it as necessary before it goes into the second stage of actual voice synthesis.

The second sweet spot is when your text is dynamically generated. Marking up the entire copy might be a lot of work for pre-written text, but it's a great option for dynamically generated text.
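To make the first sweet spot concrete: standard SSML already has tags for exactly those little corrections (pronunciation via phoneme, stress via emphasis, pauses via break), so the "patch the best guess" workflow could amount to a creator editing a string like the sketch below (the IPA and surrounding text are illustrative):

    # First-stage output, hand-patched by the creator before synthesis:
    corrected = (
        "<speak>"
        "In rhetoric, a "
        '<phoneme alphabet="ipa" ph="sɪˈnɛkdəki">synecdoche</phoneme>'
        " lets a part stand in for the "
        '<emphasis level="moderate">whole</emphasis>.'
        '<break time="300ms"/> Neat, right?'
        "</speak>"
    )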


Just my 2 cents, but it seems to me that too little focus in the tech world has been spent on understanding what speech is. Tonality, mood, facial expression and body language are all ignored, or people pretend there is no such thing. I believe this is broadly true in western society by now - people went digital but do not yet realize why communication went to hell in the last decade.


I used to work in automotive navigation. Other colleagues handled our voice systems, but I do remember all our prompts were written in SSML[1] with varying amounts of specificity. We would use Lua to configure and customize the SSML, including some custom extensions for different voice renderers.

Even with the prompts marked up, there were huge differences between products. Some car OEMs would pay higher fees for better voices and some wouldn't. It's fairly tedious work and difficult to scale as the number of sentences grows. We basically built up a catalog over many years, and the prompts were always explicitly stated as part of our requirements docs. Of course the renderers could say anything you wanted, but letting it free-form was too big a risk from a product point of view.

[1] https://en.m.wikipedia.org/wiki/Speech_Synthesis_Markup_Lang...


There's lots of text and audio already without this; that's probably the key factor in practice. Similarly for use cases: converting text that already exists is much more approachable than creating new marked-up text.

Tortoise lets you add prompts into the text like [I am angry] which modifies the voice interestingly.


There's a pretty advanced Mac OS speech markup language, I wrote about it here: https://www.mattmontag.com/personal/mac-os-x-speech-synthesi...

Going back further, there was also a prosody markup for Sound Blaster speech synthesis (Dr. Sbaitso, anyone?).


I think markup would always be more work and less effective than using your own voice input to guide its tone.


But not nearly as manageable. Imagine saying the same thing about music, for example. Musical notation is clearly more work than just humming a tune, but there's still a need for it.


I remember that back in the 80s there was speech synthesis software I had on an 8-bit computer that accepted either normal text, or phonetic notation with extra modifiers for basic things like "make this a question" etc.


Do you remember what that was? DECtalk was around in the 80s and so it might've been that, but it wasn't a generally available thing. Dr. Sbaitso was common, but that wasn't until 91/92.


Yes I do, it was a Commodore 64 cartridge called "Black Box 8". And it spoke Polish with the right accent with all the sounds not present in English etc.

I read back then that it was a domestic Polish make, but back then there was no such thing as IP protection, so it is very likely it was based on the work of Dennis Klatt (same as DECtalk). When I heard some DECtalk recordings in a YouTube video not long ago, it immediately reminded me of the Commodore 64 Black Box 8. Although DECtalk spoke English and Black Box 8 spoke Polish, there is some similarity that can be heard in their voices (not pitch - that was a user setting - but more of a rhythm, if that makes sense).


There are solutions that let you use curves like in an audio program to define inflection and pitch, speed of speaking, etc. Some of the competitors of this post's service do that.


I wonder if it would be possible to automate this by pairing the speech synthesis with a ML model that understands the context of the text it is parsing.


As a note, there are indeed markup formats to write the phonetic pronunciations, and also allowing everything you mentioned.

It's called SSML.


As the sibling comment notes, there is in fact markup for this, and the results are actually pretty great.


The examples are insanely good. Insanely good. I can barely believe we really live in a world where this is possible. I don't have anything constructive to add.. just wow.


I work in TTS and I just don't believe this. I would be surprised if these really are random texts, not trained on literally the copy they are reading, with no correction. Also, our competitors have good voices, but they also take ages to produce. Maybe these really are legit but take like 1 minute to produce or something. So while this is impressive, I doubt that in practice this would be this high quality and could even approach real time.


Thanks! ElevenLabs dev here - these are generated 6x faster than real-time, with latency of <1s. No corrections required.

We are working on long-form speech synthesis too - needless to say, the audio reading the article was also synthesized, by a voice that does not exist.


Ok I think it's fair to say you're either full of shit or the world leading experts in TTS.


I want to agree, but I searched their website and found their narration service with 2 full book examples. I listened to the first one for a while, and it's the first time an AI narrator was good enough to keep me listening: https://www.audiostory.ai/2065785/11707800-alice-s-adventure...


It's noticeably worse than the examples in the blog post. I mean, it's good enough for listening, but no better than the competition.


It's vastly better than any TTS system I have used, but then I've only used a few (mainly phone assistants and the thing built into Kindle).

What is the competition that you are referring to?


Yeah, as I mentioned, I work in TTS and agree with you. If this is legit it is pretty amazing. It would certainly put them among the top providers, especially given that they could ramp up voice selection. Also, if they truly are training on random material they would not have to pay royalties to voice actors, since these voices don't exist. This is on par with or better than most competitors I am aware of.


I'm listening to an audiobook whose reader is not as good as some of these voices. On one level I'm impressed, but on another I'm saddened, since we are heading into uncharted territory. We are looking at a future where we'll have content - video, audio, and text - by the truckload. More does not mean better. It just means more blah stuff. I don't think that's a future I'm looking forward to living in.


The key will be authenticity and trust. And in the world where the percentage of online content that contains this ends up in the vast minority of content, in person expertise and meetings will have to make a return out of sheer necessity.

It's starting to very much feel like we're entering the age of information manipulation outlined in the Ghost in the Shell TV series. Except it isn't a 90s/00s depiction of the future; it's just with far fewer robots and prosthetics, and a lot more mundane.

I just keep coming back to the scene where they have satellite video footage of a nuclear submarine preparing for a nuclear attack and the discussion lamenting that it's just video, nobody will believe it as evidence.


I think you are overestimating the capabilities of AI to create novel content. High-quality genuine content will always be there, but the amount of BS content will increase.


Imagine if in-game voice chat automatically converted player speech into the voice of the character they're playing - this would resolve a lot of the gender-based harassment problems arising from competitive games requiring vocal communication, since now _everyone's_ default is hiding the actual player's voice, in contrast to the "just use a voice changer if you're a girl playing" suggestion, which itself draws attention by being out of the ordinary.


I’m looking forward to NPCs having dynamic responses with real voices

Doesn't have to be prerecorded, just trained


Games could have more than three dialogue options again!


I feel like if Bethesda really wants another industry defining game, this is the path they should be taking. AI generated conversation with AI generated voice acting with voice-to-text recognition. You can literally have microphone-voice conversations with NPCs that have rich, AI generated backgrounds and personalities.


Even bigger than that (I think at least) is the potential for fully voiced mods. There’s nothing stopping modders at that point from adding content indistinguishable from the base game.


I'd love to see that. Voice acting for mods where you want to include new NPCs requires either someone to donate some voice lines, or paying for it to be recorded. If you want to patch existing NPCs, that's even harder because getting the original voice actor to do the new lines would require both persuading them to do it, and complying with any agreements they might have with the publisher that could prevent that.


I doubt Bethesda would facilitate this. They'd likely use voice actors to train the voice, and having a famous voice actor saying saucy kink/bdsm/violent things that you tend to see in some mods wouldn't be great PR


How would the union take to that, though? This is not meant as an anti-union comment. I'd just be really surprised if Bethesda ever got to work with union VAs ever again if they went all in on an all-AI voiced game.


If they went all in on generated voices, would they want to work with voice actors?


I could see Bethesda still wanting to select voice actors to train the model with, especially if they're household names. So they'd get paid.

You do raise an interesting point, though. Compensation for AI-derived transformations of your work (in this case, your voice) needs to be a thing.


A galaxy scale exploration game on the scale of Elite Dangerous where you could have more complex and varied interaction would be pretty amazing. The way you could apply these new AI models to video games has some wild potential. I think video games are one of the areas where I see the most potential for positive impact rather than negative impact.


The file sizes would drop by 50 gigabytes again.

Most of it is high-definition audio these days, and then that just gets replaced by a 10GB training set - or maybe the training set becomes a shared resource on the console.


Generating quality voice is sufficiently compute-intensive that it would increase the file size: they would still ship all the audio (instead of computing it locally), there would just be so much more of it.


I'm working on a VR space game that actually uses SSML-driven Azure cloud-generated voices for dialog, but I've ditched the rogue-like procedural elements, which are wickedly hard to implement.
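In case anyone wants to try the same route, the Azure Speech SDK takes SSML directly. A minimal sketch, assuming the azure-cognitiveservices-speech package plus your own key and region ("en-US-JennyNeural" is one of Azure's stock neural voices):

    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY",
                                           region="westeurope")
    audio_config = speechsdk.audio.AudioOutputConfig(filename="npc_line.wav")
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
                                              audio_config=audio_config)

    ssml = """
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
        <voice name="en-US-JennyNeural">
            <prosody rate="-10%">Docking clamps released. Good luck out there.</prosody>
        </voice>
    </speak>
    """

    # Blocks until the WAV file has been written.
    result = synthesizer.speak_ssml_async(ssml).get()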


This would be incredible, especially with the thousands of unique characters games often have nowadays. Imagine every NPC having a unique voice, and the ability to dynamically respond to the players?

Damn that would do a ton for immersion!


Even customisable, like your character's appearance. This was one of my criticisms of Fallout 4: the voice actors weren't bad, the voice just didn't fit some player characters very well.


Could really bring life to some older games with lots of text. Daggerfall springs to mind.


Imagine if in-game voice chat automatically converted a % of guys voices into girls voices, so they would start getting harassed, realize how awful that is, and then over time stop doing it.


Imagine if all POC wore white-people body suits - that would solve racism!


That's just kind of a lame workaround that doesn't tackle the actual issue though.


Less than a week ago, I said AI would upend the market for voice actors within the next couple of years: https://news.ycombinator.com/item?id=34271948


Not only voice actors - include radio hosts, documentary/news content, any voice-over for anything, as well as imitation of familiar voices.


This will really open Pandora's box for scammers and other bad actors. Grandma won't know she's speaking with an AI.


Grandma already falls for scams. Will I know I’m speaking with an AI?


I really can't tell half the time on easy-to-spam forums like Reddit if I'm chatting with an AI or a human.


Any interaction that you didn't kick off is a scam. Whether it is an AI or a human is irrelevant.


That would mean any interaction I initiate is a scam for the other party


I think this is the easiest question for a Turing test of an AI: "What would you choose as a Turing test for an AI?"


I have an AI service from my mobile company that talks to scammers. The idea is to keep the scammer on the call as long as possible. Then you can listen to or read transcripts of those calls.


Blade Runner: 2024

spotting open (closed) AI models by doing the Voight-KAPTCHA test



The "budget advantage" doesn't matter in the top half of the industry; directing a human voice talent is not going away anytime soon.

Budget clients are suspicious of AI voices and feel "cheated" if they think someone they hired is using one. This will change fastest.


I'd like to see this technology become cheap and ubiquitous enough that everyone can choose for themselves what voice they would like to hear right at the moment of consumption. It's always a huge bummer when there's a book I want to listen to on Audible with terrible narration. Somebody must have liked that voice for the person to be hired, but people's tastes differ, and sometimes the people they've selected just really grate on my ears.

It would also be cool if celebrities / existing voice talent could somehow license the synthesis of their voice. I read something about James Earl Jones doing this with Disney for future Star Wars projects. I'm sure there are people out there who would love to have every work they listen to be in the voice of their favorite narrator/celebrity.


This is cooler than ChatGPT and image generation as far as I'm concerned. If they're able to bring out the emotional connectivity and purposefulness of the human voice, it will be revolutionary...


The laughing examples are pretty impressive.

"The first AI that can laugh" - https://blog.elevenlabs.io/the_first_ai_that_can_laugh/


There are so many use cases for this, even with the current quality. Many game developers dream of having something like this.


Awesome. I think in a few years we'll hit levels of AI generative media tech where you can produce, as a lone greybeard, a Cyberpunk 2077-tier title. Same # of bugs too ;)


Still sounds pretty fake to me. There’s a hurriedness to the speech and a monotonic uniformity in enunciation that is uncannily machine. Good to know that voice actors will have jobs for a while longer…


I thought the Narrative one was 100% there. I'd still give the News one 99% and Conversational 98%.


Yes, for the sake of humanity, I hope the examples are cherry-picked and The Lord of the Rings audiobook is in the training set...


> Good to know that voice actors will have jobs for a while longer…

They don't have to work anymore; just selling their voice and sitting at home collecting royalty payments is the future, according to TFA.

And they've been making progress on the roboticness with every new model that comes out. It's just a matter of time (and data) for the AIs to figure out how words string together naturally.


This assumes that legislation/adjudication won't tell AI companies that grabbing any content they can find without reimbursing the original author is "fair use" or something equivalent in other jurisdictions. Here's to hoping.


The random voice generator is pretty bad, but sometimes you actually get a reasonably good voice, except that you can hear clicking sounds interrupting it.


Yeah right. You would never pass a blind test on this.


I think the narrative one would pass a blind test.

The conversational one wouldn't, although it could pass for a bad (human) voice actor.


I'm both scared and peeking through my fingers at the thought of the evolution of vocal-tuning plugins like Melodyne. Currently you can basically draw the pitch of a vocal performance, however using AI you could re-render the wavefile and adjust more parameters than simply pitch - such as timbre, inflection, vibrato, dynamics, distortion, openness, softness, breathiness, or a bunch of other vocal attributes.


Voice synthesizer plugins, such as Vocaloid or Synthesizer V, can already do that quite convincingly, so it is only a matter of time before it can be applied to existing voice recordings.


I have only ever listened to one audio book and that was "Hitchhiker's guide to the galaxy" by Stephen Fry. This is nowhere close to that.

It does mimic the ups and downs of voice, but they don't add up. They don't make sense. They don't really have any connection with what is being spoken.

But since it can do expressions, it probably only needs special markers in text to tell it how to really read a sentence.


Stephen Fry is considered one of the best audiobook readers of all time. This AI voice is still better than 100% of AI audiobooks on the market, and likely better than a good portion of HUMAN readers as well.


I found the samples incredibly good. But the samples in their other post about conveying emotions[0] are still far from acceptable.

In any case, I'm hoping this can be expanded to other languages as it would be an amazing tool for language learning.

[0] https://blog.elevenlabs.io/the_first_ai_that_can_laugh/


Thanks (ElevenLabs dev here)! We are constantly working on improving our model; we do our own research and train it completely from scratch.

We do support Polish already, and the quality is actually better IMO than English, as we use a newer-generation model: https://www.youtube.com/watch?v=ra8xFG3keSs Some people think it is fake and that we hired a real voice actor to read it.


I’ve been reading up on this the last couple of days because…oh, look, squirrel!

This seems to me like an area where The Big Guys are going to dominate, because it comes down to a big-data problem. For example, Whisper (admittedly speech-to-text) was trained on 680,000 hours of speech data scraped from the web. The next 'contender' used something like 48,000 hours. Who can compete with that who doesn't own a whole cloud?


As someone working on singing synthesis, I know how hard it is to get that last 10% quality that makes a human listener instantly recognise if the voice is real or generated.

These are really impressive results! For anyone interested, my team's singing work: https://youtu.be/LPy20zSWhZA


If you are going to have such an intensive particle effect in your videos, at least bother to upload a 4k version so there is a tiny chance that not every single frame consists of nothing but artifacts.

Also, don't put Gumi and English in the same search query on YouTube. I don't know how they did it, but the voices from six years ago sound better than SOTA deep-learning TTS today...


Clearly the point of the video is its AUDIO content, not the visuals. The lack of a "4k version" does not make any difference other than saving you bandwidth :-)


Very well done! Any suggestions on where/how one might learn to do something similar? I love the idea of being able to swap singers on a given track.


Sounds damn good. Would it be possible to use your own voice for training, and replicate it?

Obviously that could come with some serious security risks, but it would also make content presentation much easier for many people. Gone are the days of doing voiceover recordings for videos.


Hey! ElevenLabs dev here - yes, exactly! We do rapid voice cloning (on just a few seconds of samples), which works really well for American accents - it's already available in Beta. We can also do a professional, near-identical copy with longer samples too.



Their Steve Jobs voice simulation is creepily good:

https://www.youtube.com/shorts/34vB41lyQ-A


This probably breaks HN etiquette but...

wow


We should be allowed to break etiquette for the rare and the shocking!


This is awesome for any kind of situation where you need a (human) speaker. No tripping over words, mumbling, or mispronouncing -- all fluid and audible with perfect enunciation!


Nice timing, as I'm looking for a way to replace espeak. Are there any pretrained text-to-speech models available? Or some dataset that could be used to train a model?



That one requires a big GPU and isn’t real-time.

If you want to clone a voice and have a shitton of compute to fine-tune, it's a good one.

If you just want your computer to tell you that you need to be out the door in 30 seconds or you'll miss the bus, then not so much.
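For context, the cloning workflow looks roughly like this (a sketch based on the tortoise-tts README; "tom" is one of its bundled sample voices, and the text is arbitrary):

    # Sketch of voice cloning with tortoise-tts, following its README.
    # Expect minutes per sentence on weak GPUs, as noted downthread.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()
    voice_samples, conditioning_latents = load_voice("tom")
    gen = tts.tts_with_preset(
        "You need to be out the door in 30 seconds.",
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="fast",
    )
    torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)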


I found that it's much easier for me to read and remember when reading along with a voice assistant, for which I need real-time synthesis. Ages ago I bought Ivona text-to-speech and it served me very well for many years. The last few years I've used AWS Polly and espeak (via https://github.com/laszukdawid/cracker), but I've thought there must be something better.
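For reference, the Polly path is a single boto3 call (a minimal sketch; the voice and output filename are arbitrary):

    # Minimal AWS Polly sketch (pip install boto3); fast enough for
    # near-real-time prompts. VoiceId is just one example voice.
    import boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text="You need to be out the door in 30 seconds or you'll miss the bus.",
        OutputFormat="mp3",
        VoiceId="Joanna",
    )
    with open("prompt.mp3", "wb") as f:
        f.write(response["AudioStream"].read())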


There seems to be a fairly wide selection between state of the art and just gluing together a bunch of phonemes; it's just that tortoise-tts is up there with the state of the art.

I haven't looked into the mid-range stuff, but there's probably something out there with pretty good quality if you don't mind doing some coding; end-user applications seem to be mostly in the startup SaaS charge-by-the-character domain.


Thanks! That was the first search result and it has a nicely written Colab, so I will definitely give it a try. However, I've seen in the readme that generating a sentence takes quite a long time.

> On a K80, expect to generate a medium sized sentence every 2 minutes.

Are you aware of other options available?


This, and tools like it, could revolutionize video game voice acting. Have any video game engines integrated tools like this so developers can use them?


My voice is my passport, verify me.... aww fuck I have to do a voice activated "I am human" check now?!?


Robot test: "Hello there" and what's your reply?


I am dancer


I think a great use case for this technology could be to preserve dying languages. I'm sure a lot of work has already gone into preserving the written form of these languages, but training models on data sets of native speakers could be a way to preserve pronunciation.


I'm "waiting" for the time when scammers will start calling us with similar voices.


Waiting? Talk to some small business owners; they're already being bombarded. One common tactic is to ask the AI to do maths and watch it break down and say it's going to ask its "supervisor".


Say you're an indie game developer. In 2022 you'd pay someone on Fiverr to do a 'trailer' voiceover on your game trailer. This year, you'd use this - and also get a few more languages in there. Next year is gonna be an interesting year.


"voice owners and their licensors"

Is that even a thing? You can't copyright a voice. There can be a personality right under state law, but the main case on that was someone hired to sound like Bette Midler for a commercial.


> At Eleven, we're fully committed both to respecting intellectual property rights and to implementing safeguards against potential misuse of our technology

Unlike Stable Diffusion trampling over the copyright of artists without their permission, and OpenAI doing the same for code mangled with incompatible licenses, monetizing it, and outputting the training data verbatim, thus opening a Pandora's box and then attempting to write detectors and watermarks afterwards. I'm skeptical of ElevenLabs' statement on adding their detectors before release, but we'll see.

Should there eventually be an open-source version of a competing model by someone, it should be trained on public-domain sources. This was the case with Dance Diffusion, as Stability AI would have been sued into the ground by the RIAA otherwise. [0] [1]

It will only be a matter of time before the legal system catches up with AI-generated content and scrutinizes how models were trained on copyrighted content without permission. Any output generated by an AI is automatically public domain and un-copyrightable. [2]

This AI hype is another VC scam to unload their investments in AI startups onto big tech once again, and then pretend AI is making the world better when they know it is actually doing the opposite, with far-reaching consequences. Of course it can't be stopped, but it also cannot go unchecked and unregulated forever.

[0] https://www.musicbusinessworldwide.com/record-industry-clamp...

[1] https://techcrunch.com/2022/10/07/ai-music-generator-dance-d...

[2] https://www.copyright.gov/rulings-filings/review-board/docs/...


Are you really surprised though?

The crypto hype was a vehicle for VC backed companies to sell unregulated financial products to retail investors.

The "gig economy" was a vehicle for VC backed companies to skirt labor protections and zoning laws.

And now AI is a vehicle for VC backed companies to skirt copyright laws.

"Disruption" is often just about finding edge cases of existing laws and regulations and exploiting them for profit until legislation catches up.


> Are you really surprised though?

Hardly. It is no different to my observations years ago. [0] [1]

[0] https://news.ycombinator.com/item?id=21738233

[1] https://news.ycombinator.com/item?id=27493369

Just as cryptocurrency markets were clamped down on and regulations enforced, a similar set of rules and regulations will be set for AI companies to comply with existing copyright laws.

The VCs know it is a scam and they are also smart enough to know that this won't go unchecked forever and they will have to unload their investment at the peak of the hype cycle.


My word those female voices for news and controversy are AWFUL. I only made it 2-3 seconds in.

The male narrative voice is silky smooth. In fact, I prefer it to the classic YouTube male mystery voice that sounds like the narrator had a lobotomy.


Good to see that authors/maintainers of AI models are beginning to think about attribution. But it seems like this will be a hard problem to solve. For example, say my voice was part of the training data set, to what degree can I lay claim to the newly created voices? Also, will there be some sort of grading/ranking (e.g. it could be argued that some of the voices used in the training set are more desirable than others, and therefore their "owners" deserve better fees etc.)?


The text-to-speech function at the top of the article is the actual product, but they didn't go the extra mile and record it again for the other speed multipliers like x0.7 or x2.0. You can clearly hear the mp3 struggling, especially at 0.7 speed.

It would have been interesting to hear how they compare. The fact that you are able to adjust the voices is even one of their selling points. I really wonder why they haven't done that.


> severely underhyped: voice AI

These words made it all sound like they are just trying to ride the AI wave instead of actually solving a real-world problem.


There’s a pretty cool trinity audio bot that converts any Twitter thread into audio: https://twitter.com/trinityaudiobot/status/16131660716907970...


Twitter is bad enough to read, much less have to listen to.


About a month ago, I made a toy bot that listens to your voice with OpenAI Whisper, generates a response with GPT-2, and vocalizes the response using Eleven Labs. The TTS quality produced by the Eleven Labs algorithm was mind-blowing to me. The API that they provided was super easy to use. Good product!
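For anyone wanting to try the same thing, the glue code is short. A rough sketch: the whisper and transformers calls follow their published APIs, but the ElevenLabs endpoint and field names are from memory, so check the current docs; VOICE_ID, the key, and the filenames are placeholders:

    # Listen -> think -> speak toy bot: Whisper (STT), GPT-2 (reply), ElevenLabs (TTS).
    import requests
    import whisper
    from transformers import pipeline

    stt = whisper.load_model("base")
    llm = pipeline("text-generation", model="gpt2")

    user_text = stt.transcribe("input.wav")["text"]
    reply = llm(user_text, max_new_tokens=50)[0]["generated_text"]

    # Endpoint and headers as I recall them; verify against the live API docs.
    resp = requests.post(
        "https://api.elevenlabs.io/v1/text-to-speech/VOICE_ID",
        headers={"xi-api-key": "YOUR_KEY"},
        json={"text": reply},
    )
    with open("reply.mp3", "wb") as f:
        f.write(resp.content)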


I always wondered why those generative voices don't capture the feeling of the text per segment and incorporate it into the output, e.g. news, narration, first person hunted by vampires, whatever. Seems like low-hanging fruit.

Disclaimer: I use tons of audiobooks so that might not be what people need in general


By the way if anyone is in this thread due to working on AI speech synthesis for any company, I am interested in AI as well as audio production and I would love to talk about joining the team as an AI researcher. Just send me some mail, my email is in my profile.


Impressive. Any chance there will be an API version of this product for real-time apps?


This is the major issue with the majority of this technology at the moment. There's a plethora of options available and soon to be unveiled by several startups who are talking up their tech... but they are almost all for "editing"/"after the recording" work. You have to have a complete recorded track you can pass into their software (usually by uploading to their service), and then it will crunch away at the file and work its magic.

The current real-time options I've found are... lacking. They are mostly fake/toys (not actually using voice cloning, just old-school pitch shifting) or tech demo videos, with a scattering of research papers which are highly variable in terms of "how easily can I reproduce this", ranging from "sure, if I want to waste money on a Google Colab instance" to "only works with a specific model of video card due to reasons".

If you know of any real-time (audio stream in -> audio stream out) voice cloning/transformation/replacement tools, feel free to post about them in a reply; this is an area of tech I'm trying to keep on top of, and I'm only human, so I have no idea what new company or research I might have missed.


Hey - ElevenLabs dev here. The quality above works with <1s latency, which for some real-time apps is already sufficient. On smaller chunks of text it can be as quick as ~500ms.


Awesome, thanks for adding that extra color!


The conversational one doesn't sound like an AI but some of the emphasis is still a bit awkward.

If I didn't know better I would have thought it was recorded by a person who was uncomfortable having their voice recorded.

Still insanely impressive though.


They need to take this and similar AI and come up with better dubbing for movies in other languages. Netflix should really lead the way here with the amount of dubbed content that they currently possess.


Exactly (ElevenLabs dev here)! This is actually our mission - make all content available in any language and voice. Dubbing is where we are going!


If dubbing is where you are going... does that mean you're also going to pair it with deepfaking the videos to make the facial movements match the new vocalizations? Because that'd be a wild product.


I've been using Azure to generate speech audio for my game and it's extremely good. These samples seem even better. I'm wondering how less cherry-picked clips will turn out.


More advanced scams potentiated by technology advancements are an arms race hard to keep ahead of. Despite all the possible positives, this seems almost inherently dystopian.


Should mention this to the https://thisxdoesnotexist.com/ dev


No mention of any other competitors that've been doing this stuff for several years? Uberduck? Fakeyou? Coqui? 15?


This appears to give by far the best results that I have seen. Can you link to a speech sample that you think is similar in quality to the article?


RIP Auto-Tune


If the music that my grocery store foists upon me is anything to go by, then I say it can't come soon enough.


Impressive generated voices for TTS


Still, nothing comes close in terms of "install/usage accessibility" compared to the terrible-sounding espeak.


Respeecher has been doing this for years now. I don't see any major advancement.


Respeecher is a completely different product with different goals and use cases...


The "conversational" example should be named "Karen".


Why should someone listen to a voice when it is faster to read the blog post?


Why are all people exactly the same? It's odd that the audiobook business went out of business years ago because people only want to read.

And yes this was a snarky answer because even if you don't realize it, it was a snarky question.


Absorbing information through your audio input while the visual input is busy with a mindless task is amazing. You can listen to an article or an audiobook while doing laundry.


Everyone is different; I read rather slowly and can ingest an article much quicker with audio (esp. sped-up audio) than by reading it.


Where can I use this? Is it public? Is there an API?


Hey! ElevenLabs here - not public yet, but we will be opening up Beta later this month. API is available directly in the platform.


How long before we have a meta "This 'this X doesn't exist' doesn't exist"?


Interestingly, some of the robot styles take a very obvious and dramatic fake breath. I say "fake" since a robot doesn't need to breathe and it's not exactly considered a phoneme. The fake breaths don't really make the robot sound more convincing.

When you listen to the first example labelled "Narrative" you can tell where a human speaker would have inhaled (which is something the AI could have picked up on from copious training data) though the inhale itself could be muted in post-editing, e.g. after the long 24-word first phrase[1] ending in "special magnificence", and then again at the end of the sentence. It could just be the way the AI reads the comma but it is very convincing.

The "News" and "Conversational" examples don't include that pause effect. In the cerulean monologue, there is no pause after "for instance" despite it being in the monologue.

However, the robot takes a deep dramatic breath after the words "I see"[2]: "Oh, okay. I see, [DEEP LOUD DRAMATIC BREATH BY ROBOT], you think this has nothing to do with you. [LOUD DRAMATIC HALF BREATH BY ROBOT] You go to your closet and you select I don't know that lumpy blue sweater for instance because you're trying to tell the world that you take yourself". There is no pause on the comma around "for instance" though the script has one. I decided to check whether the robot is just copying the original film exactly, and that's not it either.[3]

Comparison:

    Robot: "Oh, okay. I see, [DEEP LOUD DRAMATIC BREATH BY ROBOT], you think this has nothing to do with you. [LOUD DRAMATIC HALF BREATH BY ROBOT] You go to your closet [no breath] and you select I don't know that lumpy blue sweater for instance [QUICK HALF BREATH BY ROBOT] because you're trying to tell the world [no breath] that you take yourself too seriously to care about what you put on your back but [no breath] what you don't know is that sweater is not just blue it's not turquoise it's not lapis it's actually cerulean."

    Original: "Oh, okay. I see [no breath] you think this has nothing to do with you. [loud long breath] You go to your closet [breath] and you select I don't know that lumpy blue sweater for instance [no breath] because you're trying to tell the world that you [breath] take yourself too seriously to care about what you put on your back but [breath] what you don't know is that sweater is not just blue it's not turquoise it's not lapis it's actually cerulean."
Text: "Oh, okay. I see, you think this has nothing to do with you.

You… go to your closet, and you select… I don’t know, that lumpy blue sweater for instance, because you’re trying to tell the world that you take yourself too seriously to care about what you put on your back, but what you don’t know is that that sweater is not just blue, it’s not turquoise, it’s not lapis, it’s actually cerulean. "

I've annotated the breaths in the "conversational" robot sample vs the original film:

                     Robot                  Original                Same/different?
     I see...        [Loud breath]          [no breath]             Different
     with you...     [Loud quick breath]    [loud long breath]      Similar
     your closet...  [no breath]            [breath]                Different
     for instance... [QUICK half breath]    [no breath]             Different
     that you...     [no breath]            [breath]                Different
     back but...     [no breath]            [breath]                Different
The robot's loud dramatic breath is unmistakable, but it's clear it's not copying the source exactly, since it occurs at different places.

[1] The text is here: https://www.nytimes.com/2001/11/19/books/chapters/the-lord-o...

[2] The text is here: https://artdepartmental.com/blog/devil-wears-prada-cerulean-...

[3] https://www.youtube.com/watch?v=us52N76XA28&t=1m24s


(ElevenLabs dev here) The generative voices and the way they sound are very much a function of all the training data, sampling, and interpolation, as you also pointed out. A lot of that data does involve deep breaths, which is why the synthesized voice will also have them present, albeit sometimes at different times than a human. Punctuation is the biggest influence on where those pauses will happen.

Users so far have found it actually enjoyable to listen to and say that the breathing and pauses are accurate!


I agree - the pauses in the first sample called "Narration" are incredibly accurate and pleasant to listen to.

As a developer, can you tell the difference between "Narration" and a human speaker? What can we listen for, or what gives it away? I listened to the "Narration" clip many times, and as a native British English speaker confirms in another comment, it seems very difficult/impossible to tell the first clip is generated. Congratulations on such an achievement!


I noticed a breath in the demo audio in the linked article and while it stood out, I was impressed by it rather than thinking it felt forced. I'm sure if I listened to enough AI voice it would stand out more and feel forced.


Did you find the whole clip it was in convincing? For me, I didn't even notice the breath but the entire second and third clip felt obviously AI-generated. But the first clip sounded absolutely real (maybe with some compression artifacts - see my other comment.)

Later when I went back and listened carefully for why the first clip felt so "real" I noticed it had pauses. (No breaths per se but they are sometimes removed from edited audio.) However, I then noticed that the conversational clip, which felt unnatural to me, had very obvious breaths. The entire effect of the conversational clip didn't sound like a human at all. It sounded like an AI.

Did you find the whole conversational clip "convincing"? (Did it sound like a human to you?) How about the narration clip?


Amazing.


> Not only can they be more cost-effective without compromising on quality...

That feels dishonest. Even if this AI is just as good at speaking as a professional voice actor (which I'm not sold on), a voice actor does more than just read the line. In ideal circumstances, they have a lot of context for what their character is doing and feeling.

Is this potentially a good option for saving money on video game voices? Quite possibly yes. Is there no compromise on quality? No, not yet.

Past that, the whole "Ethical AI" section's arguments seem ridiculous. Of COURSE it puts the livelihoods of voice actors at risk. Your product's whole point is that fewer man-hours are needed for voice work. Just accept that you're making those jobs obsolete. There's a perfectly good argument that it's okay to do that. Throwing bullshit at us to convince us that "no, the voice actors will still have lots of work, and they won't even have to talk!" just makes you sound like snake oil salesmen.


> Even if this AI is just as good at speaking as a professional voice actor (which I'm not sold on), a voice actor does more than just read the line. In ideal circumstances, they have a lot of context for what their character is doing and feeling.

How long do you think that advantage will last? Years? Months? Weeks?


>Even if this AI is just as good at speaking as a professional voice actor (which I'm not sold on)

I think the 'narration' example was voice actor quality. The other two were slightly off.


This would put companies like Audm out of business, but it seems like they already only employ one voice actor for most gigs (you gotta respect how much she gets done, though!). I wish there were more work for professional voice actors; audiobooks done by the likes of Roy Dotrice are an absolutely fantastic ride.


[flagged]


Were you listening to it with audiophile level sound drivers?


No. The problem is not sound quality. The problem is that the speech sounds unnatural and computer-generated.



