Hey - developers behind ElevenLabs here. Thank you so much for the constructive and positive feedback - we're taking it on board!
We're currently focused on researching and deploying a different approach to speech synthesis, one that can generate nuanced intonation and emotion by understanding the text and taking context into account. Additionally, we give creators a way to clone their own voice from very short samples. With the published blog post, we're now deploying a way to help them design entirely new ones!
Anyone will be able to generate that level of quality with a simple copy-paste. We're planning to open up the Beta later this month. Our goal is to let you convert any written content into high-quality, compelling audio.
To address a few questions that frequently came up:
- Latency for our streaming TTS is <1s at the quality shown above; latency is the usual problem with existing good TTS models (like tortoise-tts)
- We can clone voices instantly, based on just 5s of speech, with no training required
- We are working on adding SSML-like support for better control; speed controls will be coming as part of that too
- The API is directly available as part of the Beta (rough sketch below); we are preparing the infrastructure to scale easily for the release!
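To give a flavour of the API, here is a rough Python sketch of a synthesis call. Treat the endpoint, header and field names as illustrative assumptions - the exact interface will be in the docs once the Beta opens:

    # Illustrative only - endpoint path, header and field names are
    # assumptions, not the final documented interface.
    import requests

    API_KEY = "your-api-key"
    VOICE_ID = "id-of-a-cloned-or-designed-voice"

    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": API_KEY},
        json={"text": "Convert any written content into compelling audio."},
    )
    resp.raise_for_status()

    with open("output.mp3", "wb") as f:
        f.write(resp.content)  # the response body is the encoded audio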
We are hiring researchers, frontend and full-stack developers! If you are interested, send over your GitHub account and a short message to founders[at]elevenlabs.io.
Hey Piotr - just wanted to say congrats on the awesome work so far, man. The quality is genuinely unbelievable. I don't know if you guys are ready to take clients at scale, but I don't see any reason why all newsletter creators wouldn't use your tech right now to address whole new markets. I'll be following the journey, excited for what's to come.
Maybe I'm late to the party -- but this [1] graphic is great in the linked article.
Could the designer share a little about how it was made? Does it represent one of the generated voices, or is it just 'artistic'? (both are cool, I think).
The voices are really amazing; I couldn't tell that they are synthetic even though I was looking for it.
The only issue is that the actual recordings sound like they have been overcompressed or poorly recorded - is there any way to improve this? Something like super-resolution, but for voice?
We are offering both Speech Synthesis (/TTS) and Voice Lab (Rapid Voice Cloning and Voice Design) as a standard SaaS model (w/ a fixed quota of characters you can voice per month). The API is directly available on the platform. Outside the standard package it flips to a usage-based model, and we do tailored deals for custom needs and discounts for high-volume usage.
We are currently testing the Beta with a range of storytelling and publishing use-cases, tackling relevant feedback and making sure the infrastructure supports it. We are planning to open up the Beta to everyone by the end of this month.
The Voice Design interface is currently a set of sliders and toggles, but we're iterating on what is most accessible.
They will be multi-lang - the tech scales to any language and we are working to add more (it is relatively easy). Here is a demo of Polish TTS:
https://www.youtube.com/watch?v=ra8xFG3keSs
What are the odds of this kind of thing being open source so I can use it at home? So far, most of the "good" text-to-speech systems are commercial services.
I tried using tortoise-tts on my M1. Generating a 7-minute speech took 3 days and, while better than the 15-year-old text-to-speech built into the OS, it wasn't close to the quality of the services above. Maybe I don't know how to use it, but of course it's not as simple as text-to-speech. You ideally need the system to understand the text so it can act out parts.
Of course see my username. I want to generate personal adult content so I'd prefer not to upload it to a service.
Any time I see AI model news on hn nowadays, my first question is whether I can run it locally, and if not, what are the alternatives that I can run locally.
The speed of progress on this front is increasing. These days even "cheap" Rockchip MCUs are packing 5 TOPS AI accelerators. And both AMD and Intel are working on much more powerful ones for their CPUs. Heck, I recently wrote a mobile (Android) app that runs pretty powerful AI for intensive image processing locally on mobile phones, thinking improved privacy would be more in demand than sending everything "to the cloud". I was mildly surprised to discover most people don't care (after writing the app). Still, I wouldn't be surprised if in 10 years the majority of AI people use runs on end-user devices.
Yeah, most people don't care, but it might also be the case that many people who care use iOS, since that's the platform where all photo machine learning provided by the system happens on device.
That's because you're running tortoise on a CPU. It does about a sentence a minute on my 3090 GPU. It's also quite good if you pick "high quality" and train it with 10 sec clips at the sample rate and bitrate it asks for.
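From memory, basic usage per the tortoise-tts README looks roughly like this (the voice folder name is made up; double-check names against the repo):

    # Sketch from the tortoise-tts README - verify exact names in the repo.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()  # uses CUDA if available; painfully slow on CPU

    # "myvoice" is a folder of ~10s WAV clips placed under tortoise/voices/
    voice_samples, conditioning_latents = load_voice("myvoice")

    gen = tts.tts_with_preset(
        "The text you want spoken.",
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="high_quality",  # the slow preset mentioned above
    )
    torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)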
I can't tell if I'm starting to get that old person "new things are scary" instinct or if my gut level of fear about the implications of these things is warranted.
As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse. We're already drowning in ad dominated cynical soulless computer generated search results. Are all online forums going to end up being drowned out by cynical pumped out super cheap to produce simulacrums of creative content now too?
If I want people to buy more Triscuits next year, what's stopping me from writing a bunch of prompts to insert subtle marketing cues to buy Triscuits, with entire fake ecosystems of users, fan art, radio call-ins, user stories, etc. in like every niche community in existence, flooding them with soulless fake interaction?
That exists to a certain extent already, but I don't see how this stuff won't make it way easier, way more effective, and way more widespread.
My YouTube feed is currently filled with videos of whitehats hacking into Indian scam call centers.
Most of the time, the giveaway is the callers' Indian accent. If you could simply type into a box and speak with an American accent, it would be really hard to get caught.
We're opening a pandora's box here if I'm honest. I'm hardly one for pro-regulation, but good God, we're playing with things here that can really hurt us down the line.
Yes, however if that were a problem in the scenario above, I'm pretty sure LLMs could fix that as well.
They're already very good at translation today, it stands to reason that they could do the needful when it comes to turning regional English into American English. Or Bri'ish English, if that's the accent you want your TTS model to have.
There are a lot of words and phrases that indicate that you are speaking Indian English, separate from the accent. Using "learning" as a noun is a very common one in tech.
>As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse.
My sentiments exactly. I think it's a bit of column A and a bit of column B. I'm reminded of the quote "everything has its pleasure and its price". The more expensive things are to produce, the less of it there will be, but what is produced will be higher quality across the board. The less expensive it becomes to produce, the more of it there will be, and the aggregate quality will be lower.
It's not always a bad thing, but the downsides are plain to see when you look at the amount of spam and low-effort content out there. That said, we've all massively enjoyed the upsides too, so it's a balancing act. I think where things were at before the recent wave of generative AI tools was perhaps right on the sweet spot of "it's democratized enough that anyone can have a go, but still requires effort and a degree of talent to do well". The knowledge (and entertainment) I've been able to access thanks to randoms on YouTube is pretty incredible, and I sort of always just accepted the avalanche of spam and clickbait that came with it.
These new tools potentially push that effort/reward ratio to the point where the signal/noise ratio simply gets too low. Of course the "make money online" community is all over this stuff, and today I watched a video of a guy showing how you could supposedly clone courses on Udemy using ChatGPT and other tools. The problem is the "course" would literally consist of generic advice: high-level information on a particular topic that suffices only as a very surface-level introduction and isn't enough to help you build any functional skills in that domain, so it's effectively useless. The only person it's not useless to is him, as he would pocket a cool $5-ish per sale. It was somewhat sad and somewhat sick to hear him cackling away about being able to con people out of money while passing himself off as an expert.
And yet, it's entirely what I would expect would happen.
>The knowledge (and entertainment) I've been able to access thanks to randoms on YouTube is pretty incredible, and I sort of always just accepted the avalanche of spam and clickbait that came with it.
YouTube lets you tell it which channels you don't want recommended... I don't know how well it works; I usually just say I'm not interested in a single video.
I suppose the optimistic view--such as it is--is that there is already a vast amount of low quality content out there that was created for pennies and plastered with ads and/or hoping someone will pay a modest amount. So I'm not sure that things like ChatGPT make things that much worse than they already are--and we can mostly live with things today. The pessimistic view of course is a whole new cohort of grifters decide to give it a run whether they ultimately make money or not.
I agree with this completely. Technology has always made us trade quality for low-quality quantity in exchange for convenience. People now interact more through technology which removes a lot of body language and other enriching experiences.
The most dangerous aspect of this is that each step seems relatively harmless: right now, ChatGPT and DALL-E are amusements, but each small step is building a monstrous and as you say, soulless machine that overloads us so much that we will forget what it's like to even be human.
I firmly believe (and I have given this a lot of thought) that technology is ultimately evil, and that tech companies are trading short term gain of enormous wealth for the very essence of humanity, preying upon the basic instincts of individuals who are also trading their personal worth for convenience.
If I could have one single wish fulfilled in this world, it would be that every single human being gain a natural and instinctual revulsion for advanced technology. If someone asked me what disease was the worst that ever plagued humanity, it would not be smallpox or the flu or COVID, it would be the tech company.
And it's mind-boggling that you criticize and demonize technology on a forum about technology, all while not only being on the internet but also using electricity and a computer, and certainly being surrounded by gadgets and other amenities of modern life.
Nature is SHIT; that is why people created technology. There is nothing preventing you from going to the middle of nowhere and rejecting modernity. No one is forcing you; you are here because you wanted it and liked it. You say people should have an "instinctual revulsion" towards technology, but not even you yourself have this reaction, because it is a stupid idea that not even luddites like you commit to.
If anything, the technology we have nowadays is not even 0.01% of what we should have. We should have the technology to make any movie anyone ever wanted to see in the blink of an eye, all done in the best quality ever imagined. We should have the power to build a Dyson Sphere around the sun to harness its energy. We should be able to construct fully immersive virtual reality, like San Junipero from the Black Mirror episode, and we should have the power to extend human life indefinitely.
Why are you so hostile? What sense does it make to attack him because he does not already have what he is wishing for?
Nature is not "SHIT", for whatever that should mean. Neither the blanket statement "Technology is evil" nor "Nature is shit" make sense. We are humans. We need nature - it is what we evolved to and our technology is not able to replace it without loss. Specific technology is great to overcome existential limitations, but most technology is not.
Sure, there is great technology out there that improves our lives. On the other hand, there is so much technology that makes our lives worse (because of how it is used: e.g., by benefiting a few people while being bad for everyone else, or by helping individuals now but having severe effects later on) that it can hardly be ignored that a better process for selecting or containing technology would be necessary to improve everybody's life. But mankind is bad at forgoing.
Current technology seems to be great at generating convenience and excitement. And the examples you mention (movies, infinite energy, VR and eternal life) feel like a teenager's wishes for more excitement (and this is not meant condescendingly), but life is so much more than excitement. Excitement is just the cherry on top. I'd rather see more tech that is wholesome - but that area seems to be left to nature.
> And it's mind-boggling that you criticize and demonize technology on a forum about technology, all while not only being on the internet but also using electricity and a computer, and certainly being surrounded by gadgets and other amenities of modern life.
I don't demonize all technology. There must be an optimum somewhere, and I would like to engage in open discourse in order to understand where that optimum is. I believe advanced AI takes us away from the optimum.
Extending human life indefinitely is a terrible idea. We have a natural lifespan and we need to function within it. We should not proceed towards being saturated in technology as that will surely destroy the natural life on this planet.
There's no lack of revolutionary tech that has made life overall better with higher quality.
Even like, a Bic lighter is so much better quality than flint and steel or fire sticks.
Smart phones are fantastic quality computers that enable cool stuff like meeting up with friends without first having to leave a note at their house some amount of time beforehand
Dishwashers and laundry machines and modern quality clothing let you avoid spending half your waking hours cleaning stuff, keeping us healthier, and enabling feminism
Electricity lets us stay awake at night without smoke inhalation from candles and fireplaces, with less likelihood of burning the house down, and advanced tech in housing standards make sure that when the building does catch fire, you'll be able to get out safely
Advancements in technology are mostly quite good, and improve both quality and convenience
Seems like for every advantage you list there's also a disadvantage.
> Even like, a Bic lighter is so much better quality than flint and steel or fire sticks.
And is part of the disposable society creating immense amounts of waste.
> Smart phones are fantastic quality computers that enable cool stuff like meeting up with friends without first having to leave a note at their house some amount of time beforehand
Smartphones reduce the quality of social interaction. People often check them when they should be paying attention to their friend, and they make cancelling last-minute easier thereby making people more flaky.
> Dishwashers and laundry machines and modern quality clothing let you avoid spending half your waking hours cleaning stuff, keeping us healthier, and enabling feminism
It's hard to argue with you there, though I suspect that all these "time-saving" inventions also make it more likely that we will spend more time on other things like more work and on electronic devices.
> Electricity lets us stay awake at night without smoke inhalation from candles and fireplaces, with less likelihood of burning the house down, and advanced tech in housing standards make sure that when the building does catch fire, you'll be able to get out safely
And electricity has also made it easier to stay awake at night, staying up later and reducing the quality of sleep. Countless people get worse sleep from being exposed to devices at night. I think it's actually nice to wind down activities when the sun goes down, though obviously that is not as easy in latitudes closer to the poles.
Basically, I think there are a lot of hidden dangers that people accept because in the short term they don't realize that technology makes life less fulfilling.
> If I could have one single wish fulfilled in this world, it would be that every single human being gain a natural and instinctual revulsion for advanced technology. If someone asked me what disease was the worst that ever plagued humanity, it would not be smallpox or the flu or COVID, it would be the tech company.
Time to go live in a cabin in the woods and go write your manifesto on a typewriter...
I think what gets lost in these doom and gloom predictions is that there is a large healthy portion of young adults that do not engage in internet forums or social media.
It is perfectly viable in the modern day to work a job, have passionate hobbies, regularly meet for social events, volunteer, etc., and spend minimal to zero time engaging on the internet, besides pragmatic things like map directions.
I would have died long ago without modern technology, and the many surgeries I have needed. It's hard to take your argument seriously when I consider the consequences of what you're advocating for.
Yeah but you have to balance the positives and negatives. Sure you being alive is all very well, but sometimes GP has to overhear teenagers talking about TikTok, and that is unacceptable.
>I firmly believe (and I have given this a lot of thought) that technology is ultimately evil, and that tech companies are trading short term gain of enormous wealth for the very essence of humanity, preying upon the basic instincts of individuals who are also trading their personal worth for convenience.
I've more or less come to a pretty similar conclusion. I wouldn't characterize it as evil per se, but it's a fool's errand at best. My line of thinking goes somewhat like this: before the Neolithic revolution, humans had an extremely small set of problems. The main problem was "what am I going to eat?", and to a large degree life must have revolved around this problem almost entirely. There weren't that many people, there weren't that many problems, and we somehow persisted in that state for hundreds of thousands of years with literally nothing to write home about. Any advance in technology has literally been trading one problem for at least three more. Now there are loads of problems, loads more people, and the standard approach to solving all the problems is to invent new technologies, which in practice seem to actually exacerbate the problems. So I just sort of view the current state of things as "somewhere around the turn of the Neolithic Revolution we took a wrong turn, and it has widely been regarded as a bad move."
It's a weird sort of defeatist, nihilistic, melancholy worldview, but to be honest, I don't think we're wrong. I mean... what's the endgame of technology?
I would put the optimal state around the Native American level of technology: at least some sense of medicine and first aid, food largely figured out, but no real oppressive technologies yet.
> Technology has always made us trade quality for low-quality quantity in exchange for convenience.
Technology evolves. Even if it may start with some low quality aspects, it doesn't need to stay that way.
> People now interact more through technology which removes a lot of body language and other enriching experiences.
Which is just different communication, not better nor worse in general. Of course this kinda sucks for people who do not know the new communication code well enough. But people do evolve communication to replace relevant missing parts. Body language, for example, was mostly replaced with emojis and memes, which can be better or worse.
> we will forget what it's like to even be human.
You can't forget what you are. You are you every day, every minute, every second of your existence. What you speak about is people having a different culture from the one you know and understand. That's something completely different.
> technology is ultimately evil
Technology is a tool; it can't be evil or good. It's up to the users how they handle it.
> Technology is a tool; it can't be evil or good. It's up to the users how they handle it.
I fundamentally disagree with this premise. I believe evil is roughly equivalent to the inevitability of bringing about evil, and I believe AI falls under such a classification.
> Technology has always made us trade quality for low-quality quantity in exchange for convenience. People now interact more through technology which removes a lot of body language and other enriching experiences.
I went to the mall today and you can tell malls are dying. I lived in a small town where the mall died and it had a zombie like existence a long time before it finally cratered. The mall here in this larger town has that feeling. I also thought about how nice it is to go to the mall just to be out among people. The same is true of the downtown. If the endgame is for everyone to stay home and shop online that's going to be a very soulless existence.
Or don't shop at all and use that extra time to walk with friends in nature. Or when you really do need to shop, avoid the commute and use that extra time to spend with friends in nature. Being forced to be around strangers to get chores done doesn't put soul into my life.
I also avoid laundromats and do laundry at home and it doesn't feel soulless.
Sometimes circumstances mean that going to a mall is the only way some folk can get to meet their fellow human beings. And that doesn't mean it doesn't have other advantages such as conversing with people one might not normally come across.
I don't hate all technology. Rather, I advocate a specific approach to technology, which is a cautious one. Such an approach is antithetical to the classic tech company, and so I hate the approach.
I don't believe all technology is bad. Rather, I believe that all technology needs to be handled in a specific way so that it does not overwhelm us. Though, I do believe that some technology is fundamentally evil.
I would consider myself a hacker, but I do not believe in the capitalistic approach to technology advancement for the sake of short-term profit. I think technology can be used wisely and I do not believe we are doing so.
In fact, I started out as a mathematician and programmer and I still appreciate the beauty of those fields, but I think we need to treat STEM knowledge like we treat knives: useful but dangerous.
I think ,it's not technology that's real problem ,it's that loss of ethics in domain of knowledge ,since industrial and scientific revolution ,we put more emphasis on reductionism and objectification ,even human are being objectified ,this disease of over rationalism plaguing to every domains of knowledge ,I think it's always been constant battle of rationalism vs romantics .
In what language do you put a space before comma but not after? Even without that the usage is puzzling enough to make me genuinely curious what their native tongue is. It almost reads like a haiku.
If we stop pursuing technological progress, we'll never be able to reach humanity's true potential. If we keep pursuing technological progress, those futures are still possible. We need to be wiser and mature about the way we pursue it but we still need to pursue it.
But at the same time, there are tons of positive uses for things like this too. Imagine being a creator who wants to share their interests with the world, but hates their own voice or doesn't have the confidence to speak on camera. You could make a lot of people's lives better by creating content for YouTube, Twitch, TikTok, Instagram, etc, but you wouldn't be brave enough to otherwise.
Something like this could be incredible for those people. A natural sounding alternative to text to speech for people who dislike how they sound.
And it could also be used to anonymise people in documentaries about serious topics (like, say, organised crime) without actors, letting people bring the atrocities of said folks to light without the need to trust others or the risk of being found out.
Other examples could include vTubers, artists creating characters for TV shows, films and video games, etc.
All technology can be abused, and sadly, given how humanity acts, it will be by a small percentage of the population. But for every person abusing it for dubious purposes, there are dozens or hundreds or thousands of others who can make the world better with it.
I think a great tool for this would be a crossover voice-changer AI, so you could still speak naturally but then sound like the model voice; that way it would be a little less soulless.
Honestly, that would be incredible for so many purposes! vTubers and amateur media creators would love to be able to just speak and have it translated into the voice of their characters in a more natural way!
Would also be an interesting one for theme parks, since it could let the costumed characters speak in the voices of the relevant characters rather than remaining silent, which would add a lot to the sense of immersion there too. (Something like this site's tech, on the other hand, could let the animatronics, CGI characters and others hold conversations with guests too, which would also be neat.)
Yes, I’m sure there are many positive uses as well, I just have a hard time seeing how that’s not going to be outweighed by the bad given the current environment. There’s going to need to be some sort of social/cultural/technological adaptation when the negative starts hitting with force to curb it towards positive uses. People need to start thinking about mitigation strategies now.
I am with you on this one. What defines us as a people is the ability to enjoy shared social experiences. The more tailored and personalized an experience becomes, the more it isolates us. We don't (at least I and my social circle don't) speak about TikToks the way we speak about YouTube videos.
But more importantly, boredom triggers innovation. As we are consuming ourselves to death, we might lose the ability to truly create. Maybe that's why the last 20 years of content feel quite generic and sterile.
> As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse.
Eh. I'll take a MAYBE over the past 10 or more years of human-driven social media manipulations and scams and poison. We've made almost literally fucking nothing of value in a decade. It's been ads, Ponzi schemes, and a race to the bottom of tolerance.
I’ll take the democratization of content. Knowing that it will allow the good and the bad.
… so how is it different from the radio or TV or "influencers" now? I have limited time to consume media and am not going to be less picky when it gets easier for people to make garbage.
There was some innovation, but 2010-2020 had some dead air as investors lavished Ponzi-scheme SaaS companies with cash and big firms poured the profits of the early internet into VR, AR, AI, drones, self-driving, etc.
The last year and a half, things have started to pop off. OpenAI, SpaceX, Comma, Helion, many more… that doomer "everything sucks and is collapsing" mentality is on the way out, in my opinion. The time for talk is over and it's time to build, or so they say.
Assuming the internet will soon be mostly generated content, and assuming this content is as dull and soulless as you describe it, I wonder if it's not going to make the real world and in-person interactions more interesting?
I could do with cutting my screen time and the best way to do that might be to make everything boring.
I'm optimistic. I think the progress in AI will make people more aware where the soul really is, as they will learn to distinguish. I think the human spirit will be faster in learning to recognize that which is not really interesting than AI will be able to make improvements faking it.
The ideal use case is someone who wants to be an influencer but is neither pretty nor intelligent and doesn't have a good voice: they could simply use face filters, GPT text, and a voice filter to make themselves sound and look beautiful.
I don't follow influencers, but my guess is that they already do this; at the least they use filters. If someone can use all these tools to gain a considerable amount of fame and fortune, are they really not intelligent? Of course, all these online personas will be lies, even bigger lies than today, but I don't think it really matters. I'd argue that most people following this content are not looking for reality.
I want to agree with you, but I have to admit I hate most human narrators of audiobooks. I would actually much prefer this company's voices to most of the humans reading books that I have encountered.
That's a pretty high bar. Even most Hollywood productions can't afford Meryl Streep, let alone a new site, podcast, or video game.
From wikipedia:
Mary Louise "Meryl" Streep [is] often described as "the best actress of her generation." Streep is particularly known for her versatility and accent adaptability. She has received numerous accolades throughout her career spanning over five decades, including a record 21 Academy Award nominations, winning three, and a record 32 Golden Globe Award nominations, winning eight. She has also received two British Academy Film Awards, two Screen Actors Guild Awards, and three Primetime Emmy Awards, in addition to nominations for a Tony Award and six Grammy Awards.
Maybe it's because I haven't heard the source material, but that Conversational voice really appeals to me. I wish my phone and assistants used that voice.
(and also I can't wait for a "real" ChatGPT-era AI to go with it, to put those braindead jokes of an "assistant" Siri, Alexa, and Google Assistant out to pasture)
When I listened to it, my first impression was that it must be the real actor included for comparison purposes, but that they had failed to label it correctly. I thought it was not machine-generated. I couldn't detect the slightest artifact except what sounded like low-bitrate encoding (maybe a codec geared toward speech). Can you tell anything "off" about it?
As for the encoding artifacts such as a tinny sound, that is the type you hear with an MP3 or a low-bitrate speech codec. For example, when I record a message on https://vocaroo.com/ (the "premier" voice recording service) it sounds 10x worse. Here is a sample I just recorded of my own speech: https://voca.ro/18oSJ1sHU5w5
After my first impression that the narrative example might be a real human mislabelled for comparison purposes, I listened to the next two, labelled News and Conversational. I found these very easy to identify as AI-generated.
Thinking back to why I found the narrative example so compelling, I thought perhaps the issue is that the first example is in British English which I'm less used to than American English. I grew up in the United States. Perhaps since the accent doesn't match my own, it is harder for me to perceive it as generated.
-> Can a native speaker of British English tell us whether, listening to the first example, you can tell in any way that it is a robot? Maybe it is as obvious to you as the next two are to me.
Still, I've listened to a fair amount of British English in my life so perhaps there is an alternative explanation for why the first one was better. For example, it could have been trained on a reader's voice who has narrated thousands of hours in very high studio quality in a fairly consistent way, leaving this type of text much easier to synthesize than the other two examples due to more training data or higher-quality audio.
For me, the first one is really indistinguishable from a narrator's true voice, though it does sound a bit tinny which could also happen as an artifact of the recording process.
In terms of "how confident are you that this is a real person" the second two examples I would put at 0 - it's totally obvious that it is not a real person, whereas the first one sounds like a 10 to me: obviously a real narrator. (With a bit of artifacting that sounds like an mp3.)
Hey! ElevenLabs here, confirming that all 3 samples (including the Narrative one) were AI-generated! We'll be opening up our platform later this month and would love for you to test it yourself!
I'm a native British English speaker and can confirm the first example is incredibly good. It would be very difficult/impossible for most people to tell that the voice is generated from that clip alone.
Okay can I ask a question that has been bothering me for a long time?
Why do seemingly all these text-to-speech programs attempt to produce spoken voice based solely on raw text? Why don't they consume a MIDI-like text-markup language where you can write phonetic pronunciations along with markup about the emotion, volume, speed, etc.? I feel like this is a huge unnecessary roadblock holding back this kind of technology. It'd be like if every music composition program rendered a wave file not by MIDI or VST, but by trying to visually read sheet music. I totally understand why TTS solutions that have to consume arbitrary content, like screen-readers, need to read purely raw text. But content creators don't need to be limited to raw text! Why is everyone doing it that way? Where is the TTS markup language for content creators?
In practice they are next to useless; the expressions are not very... expressive (just try it in the AWS editor). I suspect an LLM would be able to infer the context, or we could use prompt engineering to generate the appropriate tokens encoding emotions for the intermediate neural codecs directly (Mel spectrograms are so passé now, post VALL-E).
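For anyone who hasn't tried it, this is the kind of markup in question - a standard SSML snippet sent to AWS Polly via boto3. The tags are real SSML; how expressively the engine renders them is the complaint above:

    # Standard SSML rendered by AWS Polly via boto3.
    import boto3

    ssml = """<speak>
      Anyone will be able to generate
      <emphasis level="strong">that</emphasis> level of quality.
      <break time="400ms"/>
      <prosody rate="slow" pitch="low">Or so the demos claim.</prosody>
    </speak>"""

    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text=ssml,
        TextType="ssml",   # parse the markup instead of reading it literally
        OutputFormat="mp3",
        VoiceId="Joanna",
    )
    with open("out.mp3", "wb") as f:
        f.write(resp["AudioStream"].read())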
Something I always noticed is that they get Morgan Freeman to do voiceovers for science shows, but he’s not a scientist so he has a sort of generic inflection when he talks about the various ideas in the script. And then you watch Carl Sagan’s COSMOS, where he co-wrote the material, and there is so much depth and expression to his delivery. There’s a lifetime of public speaking, specifically delivering complex scientific topics to a general audience, that Sagan drew from when recording his show.
Sagan would have learned this through conversation with people, and careful updates to his expression and delivery as he matured.
I guess an LLM could improve upon previous methods but I would also say there is a gap that even humans struggle with, which requires really complex knowledge both of public speaking and of the material. It may be a long time before we can really master that with AI systems.
ElevenLabs dev here - we believe this is a 2-step process and agree it is needed!
First, we want the quality you get out of the box to already be brilliant by taking context into account. Granted, that sometimes only gets you 98% of the way there, and we are working to add manipulation options to get you to 100%; for long texts, though, the quality you get is great.
For the second part: current TTS providers give complicated toggles that frequently don't affect the speech in the way you want. Initially we are adding basic SSML-like support, and we have a more robust language-based idea which we hope will land over the next few months!
Your context-aware TTS already sounds very good. If I were using it to produce a narration that other people would be listening to, I would want to make at most a couple of minor adjustments every few sentences. Most of those adjustments would fall into a few categories: stronger or weaker stress on a particular word, rising or falling intonation on a phrase, longer or shorter pauses between words, and correction of the phonemes in a word. A half dozen toggles for those adjustments might be enough for most cases.
I wonder, though, how much training people would need to understand what adjustments need to be made. Experienced actors and narrators should have a good sense of what to fix, but many people might have trouble identifying what sounds strange in the initial TTS output and how it needs to be changed.
I feel like it would be much harder to create a set of hard controls, like MIDI, to affect the voice acting vs. trying to do a co-embedding space of voices and descriptions of the voices and just saying "Say this quietly and meanly". Thoughts?
> I feel like this is a huge unnecessary roadblock holding back this kind of technology.
There are speech synthesis markup languages, like SSML.
And targeting an even lower level has always been possible with commercial speech engines.
Think about how tedious and time-consuming it is to mark up a large amount of copy. Unless we're talking about little hints here and there (which is also doable), it rapidly becomes more cost effective to just pay for voice talent. For this stuff to be appealing it really must be close to fire-and-forget.
The first sweet spot is being able to correct a few things that sound off, as another poster pointed out. "Hey, that's not actually how you pronounce 'synecdoche', it should be 'sɪˈnɛkdəki'." Or "Less emphasis on the first word, more on the second". Little corrections like that. I imagine a two-stage process where the first stage generates 'best guess' SSML (or whatever markup) based on the text. Then the content creator can modify it as necessary before it goes into the second stage of actual voice synthesis.
The second sweet spot is when your text is dynamically generated. Marking up the entire copy might be a lot of work for pre-written text, but it's a great option for dynamically generated text.
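In that imagined pipeline, the hand edit between stage one and stage two could be as small as swapping in a phoneme override - the <phoneme> tag is standard SSML, and the strings here are just for illustration:

    # Stage-1 output (best-guess markup) and the creator's one-word fix,
    # using the standard SSML <phoneme> tag with an IPA pronunciation.
    auto_ssml = "<speak>In a synecdoche, the part stands for the whole.</speak>"

    fixed_ssml = (
        "<speak>In a "
        '<phoneme alphabet="ipa" ph="sɪˈnɛkdəki">synecdoche</phoneme>'
        ", the part stands for the whole.</speak>"
    )
    # fixed_ssml then goes to the stage-2 synthesizer, otherwise unchanged.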
Just my 2 cents, but it seems to me that too little focus in the tech world has been spent on understanding what speech is. Tonality, mood, facial expression and body language are all ignored, or people pretend there is no such thing. I believe this is broadly true in Western society by now: people went digital but do not yet realize why communication went to hell in the last decade.
I used to work in automotive navigation. Other colleagues handled our voice systems, but I do remember all our prompts were written in SSML[1] with varying amounts of specificity. We would use Lua to configure and customize the SSML, including some custom extensions for different voice renderers.
Even with the prompts marked up, there were huge differences between products. Some car OEMs would pay higher fees for better voices and some wouldn't. It's fairly tedious work and difficult to scale as the number of sentences grows. We basically built up a catalog over many years, and the prompts were always explicitly stated as part of our requirements docs. Of course the renderers could say anything you wanted, but letting it run free-form was too big a risk from a product point of view.
There's lots of text and audio already without this; that's probably the key factor practically. Similarly for use cases: converting text that already exists is much more approachable than creating new marked-up text.
Tortoise lets you add prompts into the text, like [I am angry], which modify the voice in interesting ways.
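If memory serves, per the tortoise-tts prompt-engineering notes the bracketed text conditions the delivery but is dropped from the spoken output - roughly like this (verify names against the repo; "tom" is one of the bundled voices):

    # Bracketed text conditions delivery but is removed from the audio.
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()
    voice_samples, conditioning_latents = load_voice("tom")  # bundled voice

    gen = tts.tts_with_preset(
        "[I am really angry,] Leave me alone!",  # speaks only "Leave me alone!"
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="fast",
    )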
But not nearly as manageable. Imagine saying the same thing about music, for example. Musical notation is clearly more work than just humming a tune, but there's still a need for it.
I remember in the 80s of the last century there was speech synthesis software I had on an 8-bit computer that accepted either normal text, or phonetic notation that had extra modifiers for basic things like "make this a question" etc.
Do you remember what that was? DECtalk was around in the 80s, so it might've been that, but it wasn't a generally available thing. Dr. Sbaitso was common, but that wasn't until 91/92.
Yes I do, it was a Commodore 64 cartridge called "Black Box 8". And it spoke Polish with the right accent, with all the sounds not present in English, etc.
I read back then that it was a domestic Polish make, but back then there was no such thing as IP protection, so it is very likely it was based on the work of Dennis Klatt (same as DECtalk). When I heard some DECtalk recordings in a YouTube video not long ago, it immediately reminded me of the Commodore 64 Black Box 8. Although DECtalk spoke English and Black Box 8 spoke Polish, there is some similarity that can be heard in their voices (not pitch, which was a user setting, but more of a rhythm, if that makes sense).
There are solutions that let you use curves like in an audio program to define inflection and pitch, speed of speaking, etc. Some of the competitors of this post's service do that.
I wonder if it would be possible to automate this by pairing the speech synthesis with a ML model that understands the context of the text it is parsing.
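A sketch of that pairing, with ask_llm standing in for whatever completion API you have (entirely hypothetical plumbing):

    # Hypothetical pairing of an LLM with a markup-driven TTS engine.
    # ask_llm is a placeholder for any text-completion API.

    PROMPT = """Annotate the following passage with SSML. Add <prosody>,
    <emphasis> and <break/> tags reflecting the mood of each segment
    (news, narration, tense first person, and so on). Return only SSML.

    Passage:
    {passage}
    """

    def contextual_ssml(passage: str, ask_llm) -> str:
        body = ask_llm(PROMPT.format(passage=passage))
        return f"<speak>{body}</speak>"

    # The output would feed any SSML-aware synthesizer as usual.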
The examples are insanely good. Insanely good. I can barely believe we really live in a world where this is possible. I don't have anything constructive to add.. just wow.
I work in TTS and I just don't believe this. If these really are random texts, not trained on literally the copy they are reading, and with no corrections, I would be surprised. Also, our competitors have good voices, but they also take ages to produce. Maybe these really are legit but take like 1 minute to produce or something. So while this is impressive, I doubt that in practice this would be this high quality and could even approach real time.
Thanks! ElevenLabs dev here - these are generated 6x faster than real-time, with latency of <1s. No corrections required.
We are working on long-form speech synthesis too; needless to say, the audio reading the article has also been synthesized, by a voice that does not exist.
I want to agree, but I searched on their website and found their narration service with 2 full book examples. I listened to the first one for a while, and it's the first time an AI narrator was good enough to keep me listening: https://www.audiostory.ai/2065785/11707800-alice-s-adventure...
Yeah, as I mentioned I work in TTS and agree with you. If this is legit it is pretty amazing. It would certainly put them as one of the top providers, especially given that they could ramp up voice selection. Also, if they truly are training on random stuff, they would not have to pay royalties to voice actors, since these voices don't exist. This is on par with or better than most competitors I am aware of.
I'm listening to an audiobook whose reader is not as good as some of these voices. At one level I'm impressed, but at another I'm saddened, since we are heading into uncharted territory. We are looking at a future where we'll have content (video, audio, and text) by the truckload. More does not mean better. It just means more blah stuff. I don't think that's a future I'm looking forward to living in.
The key will be authenticity and trust. And in a world where content with those qualities ends up being a vast minority of what's online, in-person expertise and meetings will have to make a return out of sheer necessity.
It's starting to feel very much like we're entering the age of information manipulation outlined in the Ghost in the Shell TV series. Except it isn't a 90s/00s depiction of the future; it's just got far fewer robots and prosthetics and is a lot more mundane.
I just keep coming back to the scene where they have satellite video footage of a nuclear submarine preparing for a nuclear attack and the discussion lamenting that it's just video, nobody will believe it as evidence.
I think you are overestimating the capabilities of AI to create novel content. Genuinely high-quality content will always be there, but the amount of BS content will increase.
Imagine if in-game voice chat automatically converted player speech into the voice of the character they're playing - this would resolve a lot of the gender-based harassment problems arising from competitive games requiring vocal communication, since then _everyone's_ default is hiding the actual player's voice, in contrast to the "just use a voice changer if you're a girl playing" suggestion, which itself draws attention by being out of the ordinary.
I feel like if Bethesda really wants another industry-defining game, this is the path they should be taking: AI-generated conversation with AI-generated voice acting and voice-to-text recognition. You could literally have microphone-voice conversations with NPCs that have rich, AI-generated backgrounds and personalities.
Even bigger than that (I think at least) is the potential for fully voiced mods. There’s nothing stopping modders at that point from adding content indistinguishable from the base game.
I'd love to see that. Voice acting for mods where you want to include new NPCs requires either getting someone to donate voice lines or paying for them to be recorded. If you want to patch existing NPCs, that's even harder, because getting the original voice actor to do the new lines would require both persuading them to do it and complying with any agreements they might have with the publisher that could prevent that.
I doubt Bethesda would facilitate this. They'd likely use voice actors to train the voices, and having a famous voice actor saying the saucy kink/BDSM/violent things that you tend to see in some mods wouldn't be great PR.
How would the union take to that, though? This is not meant as an anti-union comment. I'd just be really surprised if Bethesda ever got to work with union VAs ever again if they went all in on an all-AI voiced game.
A galaxy-scale exploration game like Elite Dangerous where you could have more complex and varied interaction would be pretty amazing. The way you could apply these new AI models to video games has some wild potential. I think video games are one of the areas where I see the most potential for positive impact rather than negative impact.
Most of it is high definition audio these days, and then that just gets replaced by a 10gb training set, or maybe the training set becomes a shared resource on the console
Generating quality voice is sufficiently compute-intensive that it would increase the file size: they would still ship all the audio (instead of computing it locally), there would just be so much more of it.
I'm working on a VR space game that actually uses SSML Azure cloud-generated voices for dialog, but I've ditched the roguelike procedural elements, which are wickedly hard to implement.
This would be incredible, especially with the thousands of unique characters games often have nowadays. Imagine every NPC having a unique voice, and the ability to dynamically respond to the players?
Even customisable like your character's appearance.
This was one of my criticisms of Fallout 4: the voice actors weren't bad, it just didn't fit some player characters very well.
Imagine if in-game voice chat automatically converted a % of guys' voices into girls' voices, so they would start getting harassed, realize how awful that is, and then over time stop doing it.
I have an AI service from my mobile company that talks to scammers. The idea is to keep the scammer on the call as long as possible. Then you can listen to or read transcripts of those calls.
I'd like to see this technology become cheap and ubiquitous enough that everyone can choose for themselves what voice they would like to hear right at the moment of consumption. It's always a huge bummer when there's a book I want to listen to on audible with terrible narration. Somebody must have liked that voice for the person to be hired, but people's tastes differ and sometimes the people they've selected just really grate on my ears.
It would also be cool if celebrities / existing voice talent could somehow license the synthesis of their voice. I read something about James Earl Jones doing this with Disney for future Star Wars projects. I'm sure there are people out there who would love to have every work they listen to be in the voice of their favorite narrator/celebrity.
This is cooler than ChatGPT and image generation as far as I'm concerned. If they're able to bring out the emotional connectivity and purposefulness of the human voice, it will be revolutionary...
Awesome. I think in a few years we'll hit levels of AI generative media tech where you can produce, as a lone greybeard, a Cyberpunk 2077-tier title. Same # of bugs too ;)
Still sounds pretty fake to me. There’s a hurriedness to the speech and a monotonic uniformity in enunciation that is uncannily machine. Good to know that voice actors will have jobs for a while longer…
> Good to know that voice actors will have jobs for a while longer…
They don't have to work anymore; just selling their voice and sitting at home collecting royalty payments is the future, according to TFA.
And they’ve been making progress on the roboticness with every new model that comes out. Just a matter of time (and data) for the AIs to figure out how words string together naturally.
This assumes that legislation/adjudication won't tell AI companies that grabbing any content they can find without reimbursing the original author is "fair use" or something equivalent in other jurisdictions. Here's to hoping.
The random voice generator is pretty bad but sometimes you actually get a reasonably good voice except you can hear clicking sounds that interrupt the voice.
I'm both scared and peeking through my fingers at the thought of the evolution of vocal-tuning plugins like Melodyne. Currently you can basically draw the pitch of a vocal performance, however using AI you could re-render the wavefile and adjust more parameters than simply pitch - such as timbre, inflection, vibrato, dynamics, distortion, openness, softness, breathiness, or a bunch of other vocal attributes.
Voice synthesizer plugins, such as Vocaloid or Synthesizer V, can already do that quite convincingly, so it is only a matter of time before it can be applied to existing voice recordings.
I have only ever listened to one audio book and that was "Hitchhiker's guide to the galaxy" by Stephen Fry. This is nowhere close to that.
It does mimic the ups and downs of a voice, but they don't add up. They don't make sense. They don't really have any connection with what is being spoken.
But since it can do expressions, it probably only needs special markers in text to tell it how to really read a sentence.
Stephen Fry is considered one of the best audiobook readers of all time. This AI voice is still better than 100% of AI audiobooks in the market, and likely better than a good portion of HUMAN readers as well.
Thanks (ElevenLabs dev here). We are constantly working on improving our model; we do our own research and train it completely from scratch.
We do support Polish already, and the quality is actually better IMO than English, as we use a newer-generation model: https://www.youtube.com/watch?v=ra8xFG3keSs
Some people think it is fake and that we hired a real voice actor to read it.
I’ve been reading up on this the last couple of days because…oh, look, squirrel!
This seems to me where The Big Guys are going to dominate, because it comes down to a big data problem. For example, Whisper (admittedly speech-to-text) was trained on 680,000 hours of speech data scraped from the web. The next 'contender' used something like 48,000 hours. Who can compete with that who doesn't own a whole cloud?
As someone working on singing synthesis, I know how hard it is to get that last 10% quality that makes a human listener instantly recognise if the voice is real or generated.
These are really impressive results! For anyone interested, here's my team's singing work: https://youtu.be/LPy20zSWhZA
If you are going to have such an intensive particle effect in your videos, at least bother to upload a 4k version so there is a tiny chance that not every single frame consists of nothing but artifacts.
Also, don't put Gumi and English in the same search query on YouTube. I don't know how they did it, but the voices from six years ago sound better than today's SOTA deep-learning TTS...
Clearly the point of the video is its AUDIO content, not the visuals. The lack of a "4k version" makes no difference other than saving you bandwidth :-)
Sounds damn good. Would it be possible to use your own voice for training, and replicate it?
Obviously that could come with some serious security risks, but it would also make content presentation much easier for many people. Gone are the days of doing voiceover recordings for videos.
Hey! ElevenLabs dev here - yes, exactly! We do rapid voice cloning (from just a few seconds of samples) that works really well for American accents - it's already available in Beta. We can also do a professional, near-identical copy with longer samples.
This is awesome for any kind of situation where you need a (human) speaker. No tripping over words, mumbling, or mispronouncing - all fluid and audible with perfect enunciation!
Nice timing, as I'm looking for a way to replace espeak. Are there any pretrained text-to-speech models available? Or some dataset that could be used to train a model?
I've found that it's much easier for me to read and remember when reading along with a voice assistant, for which I need real-time synthesis. Ages ago I bought Ivona text-to-speech and it served me very well for many years. The last few years I've used AWS Polly and espeak (via this: https://github.com/laszukdawid/cracker), but I've kept thinking there must be something better.
There seems to be a fairly wide spectrum between state of the art and just gluing together a bunch of phonemes; it's just that tortoise-tts is up there with the state of the art.
I haven't looked into the mid-range stuff, but there's probably something out there with pretty good quality if you don't mind doing some coding; end-user applications seem to be mostly in the startup SaaS charge-by-the-character domain.
Thanks! That was the first search result and it has a nicely written Colab, so I will definitely give it a try. However, I've seen in the readme that generating a sentence takes quite a long time.
> On a K80, expect to generate a medium sized sentence every 2 minutes.
This, and tools like it, could revolutionize video game voice acting. Have any video game engines integrated tools like this so developers can use them?
I think a great use case for this technology could be to preserve dying languages. I'm sure a lot of work has already gone into preserving the written form of these languages, but training models on data sets of native speakers could be a way to preserve pronunciation.
Waiting? Talk to some small business owners; they're already being bombarded. One common tactic is to ask the AI to do maths and watch it break down and say it's going to ask its "supervisor".
Say you're an indie game developer. In 2022 you'd pay someone on Fiverr to do a 'trailer' voiceover on your game trailer. This year, you'd use this - and also get a few more languages in there. Next year is gonna be an interesting year.
Is that even a thing? You can't copyright a voice. There can be a personality right under state law, but the main case on that was someone hired to sound like Bette Midler for a commercial.
> At Eleven, we're fully committed both to respecting intellectual property rights and to implementing safeguards against potential misuse of our technology
Unlike Stable Diffusion trampling over the copyright of artists without their permission, and OpenAI doing the same for code mangled with incompatible licenses, monetizing it, and outputting the training data verbatim, opening a Pandora's box and then attempting to write detectors and watermarks afterwards. I'm skeptical of Eleven Labs' statement about adding their detectors before release, but we'll see.
Should someone eventually release an open-source competing model, it should be trained on public domain sources. This was the case with Dance Diffusion, as Stability AI would have been sued into the ground by the RIAA had they trained on copyrighted music. [0] [1]
It will only be a matter of time before the legal system catches up with AI-generated content, with scrutiny over models trained on copyrighted content without permission and over how they were trained. Any output generated by an AI is automatically public domain and un-copyrightable. [2]
This AI hype is another VC scam to unload their investments in AI startups onto big tech once again, all while pretending AI is making the world better when they know it is actually doing the opposite, with far-reaching consequences. Of course it can't be stopped, but it also cannot go unchecked and unregulated forever.
Just like the clamping down of cryptocurrency markets and enforcement of regulations, a similar set of rules and regulations will be set for AI companies for complying with existing copyright laws.
The VCs know it is a scam and they are also smart enough to know that this won't go unchecked forever and they will have to unload their investment at the peak of the hype cycle.
Good to see that authors/maintainers of AI models are beginning to think about attribution. But it seems like this will be a hard problem to solve. For example, say my voice was part of the training data set, to what degree can I lay claim to the newly created voices? Also, will there be some sort of grading/ranking (e.g. it could be argued that some of the voices used in the training set are more desirable than others, and therefore their "owners" deserve better fees etc.)?
The text-to-speech player at the top of the article is the actual product, but they didn't go the extra mile and regenerate the audio for the other speed multipliers like 0.7x or 2.0x. You can clearly hear the mp3 struggling, especially at 0.7x speed.
It would have been interesting to hear how they perform in comparison. The fact that you can adjust the voices is one of their selling points, so I really wonder why they haven't done that.
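Presumably the player is just time-stretching client-side. For comparison, generating properly stretched versions offline without pitch artifacts is only a few lines; a sketch assuming librosa and soundfile are available:

    # Sketch: time-stretch a clip to 0.7x and 2.0x speed without shifting
    # pitch, via librosa's phase-vocoder-based stretch (rate > 1 is faster).
    import librosa
    import soundfile as sf

    y, sr = librosa.load("narration.mp3", sr=None)
    for rate in (0.7, 2.0):
        sf.write(f"narration_{rate}x.wav",
                 librosa.effects.time_stretch(y, rate=rate), sr)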
About a month ago, I made a toy bot that listens to your voice with OpenAI Whisper, generates a response with GPT-2 and vocalizes the response using the Eleven Labs. The TTS quality produced by the Eleven Labs algorithm was mind-blowing to me. The API that they provided was super easy to use. Good product!
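Roughly, the loop looked like this (a sketch from memory; the ElevenLabs URL, voice id and key below are illustrative placeholders, so check their docs for the real endpoint):

    # Sketch of the listen -> think -> speak loop described above.
    # The TTS URL, voice id and API key are illustrative placeholders.
    import requests
    import whisper
    from transformers import pipeline

    stt = whisper.load_model("base")                 # speech-to-text
    llm = pipeline("text-generation", model="gpt2")  # toy response model

    user_text = stt.transcribe("question.wav")["text"]
    reply = llm(user_text, max_new_tokens=60)[0]["generated_text"]

    resp = requests.post(
        "https://api.elevenlabs.io/v1/text-to-speech/<voice_id>",  # placeholder
        headers={"xi-api-key": "<api-key>"},
        json={"text": reply},
    )
    with open("reply.mp3", "wb") as f:
        f.write(resp.content)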
I always wondered why these generative voices don't capture the feeling of the text per segment and incorporate it into the output, e.g. news, narration, first person hunted by vampires, whatever. Seems like low-hanging fruit.
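Something like per-segment emotion tagging feeding a delivery style would be a start. A toy sketch of what I mean (the synthesize hook and the style names are made up, and the classifier is just one public emotion model):

    # Toy sketch: tag each text segment's emotion, then hand the segment
    # plus a matching delivery style to the TTS engine. `synthesize` and
    # the style names are hypothetical placeholders.
    from transformers import pipeline

    classifier = pipeline("text-classification",
                          model="j-hartmann/emotion-english-distilroberta-base")

    STYLES = {"fear": "tense", "joy": "bright", "sadness": "soft",
              "anger": "harsh", "neutral": "narration"}

    def voice_segments(segments, synthesize):
        for seg in segments:
            label = classifier(seg)[0]["label"]
            synthesize(seg, style=STYLES.get(label, "narration"))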
Disclaimer: I use tons of audiobooks, so that might not be what people need in general.
By the way if anyone is in this thread due to working on AI speech synthesis for any company, I am interested in AI as well as audio production and I would love to talk about joining the team as an AI researcher. Just send me some mail, my email is in my profile.
This is the major issue with the majority of this technology at the moment. There's a plethora of options available, and more soon to be unveiled by several startups talking up their tech... but they are almost all for "editing"/"after the recording" work. You have to have a complete recorded track to pass into their software (usually by uploading it to their service), which then crunches away at the file and works its magic.
The current real-time options I've found are... lacking. They are mostly fake/toys (not actually using voice cloning, just old-school pitch shifting) or tech demo videos, with a scattering of research papers that are highly variable in terms of "how easily can I reproduce this", ranging from "sure, if I want to waste money on a Google Colab instance" to "only works with a specific model of video card due to reasons".
If you know of any real-time (audio stream in -> audio stream out) voice cloning/transform/replacement tools, feel free to post about them in a reply, this is an area of tech I'm trying to keep on top of and I'm only human so I have no idea what new company or research I might miss.
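To be clear about what I mean by stream-in -> stream-out: the plumbing itself is trivial; it's the model in the middle that nobody ships. A skeleton with the sounddevice package, where transform() is a stub for the voice-conversion model (the hard part is a model that keeps up with the ~20ms block budget):

    # Stream-in -> stream-out skeleton using the sounddevice package.
    # transform() is a stub standing in for a voice-conversion model.
    import sounddevice as sd

    def transform(block):
        return block  # identity passthrough; the model would go here

    def callback(indata, outdata, frames, time, status):
        if status:
            print(status)  # report over/underruns
        outdata[:] = transform(indata)

    # 1024 frames at 48 kHz is roughly a 21 ms block budget per callback.
    with sd.Stream(samplerate=48000, blocksize=1024, channels=1,
                   callback=callback):
        sd.sleep(10_000)  # keep the stream open for 10 seconds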
Hey - ElevenLabs dev here. The quality above works with <1s latency, which for some real-time apps is already sufficient. On smaller chunks of text it can be as quick as ~500ms.
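For anyone wanting to sanity-check that kind of number themselves once the Beta opens, time-to-first-audio-chunk against a streaming endpoint can be measured like this (the URL, header and payload are placeholders, not a documented API):

    # Measure time-to-first-audio-chunk from a streaming TTS endpoint.
    # URL, auth header and payload are placeholders, not a documented API.
    import time
    import requests

    t0 = time.monotonic()
    with requests.post("https://api.example.com/tts/stream",  # placeholder
                       headers={"authorization": "Bearer <key>"},
                       json={"text": "Hello there."}, stream=True) as r:
        first = next(r.iter_content(chunk_size=4096))
    print(f"first audio after {time.monotonic() - t0:.3f}s ({len(first)} bytes)")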
They need to take this and similar AI and come up with better dubbing for movies in other languages. Netflix should really lead the way here with the amount of dubbed content that they currently possess.
If dubbing is where you are going... does that mean you're also going to pair it with deepfaking the videos to make the facial movements match the new vocalizations? Because that'd be a wild product.
I've been using Azure to generate speech audio for my game and it's extremely good. These samples seem even better. I'm wondering how less cherry-picked clips will turn out.
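For context, the Azure flow is only a few lines with their speech SDK; a sketch (subscription key and region are placeholders):

    # Sketch: render one voice line to a wav file with Azure's speech SDK
    # (pip install azure-cognitiveservices-speech). Key/region are placeholders.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription="<key>", region="<region>")
    audio_out = speechsdk.audio.AudioOutputConfig(filename="line.wav")
    synth = speechsdk.SpeechSynthesizer(speech_config=config,
                                        audio_config=audio_out)
    synth.speak_text_async("The drawbridge is closing at dusk.").get()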
More advanced scams potentiated by technology advancements are an arms race hard to keep ahead of. Despite all the possible positives, this seems almost inherently dystopian.
Absorbing information through your audio input while the visual input is busy with a mindless task is amazing. You can listen to an article or an audiobook while doing laundry.
Interestingly, some of the robot styles take a very obvious and dramatic fake breath. I say "fake" since a robot doesn't need to breathe and it's not exactly considered a phoneme. The fake breaths don't really make the robot sound more convincing.
When you listen to the first example labelled "Narrative" you can tell where a human speaker would have inhaled (which is something the AI could have picked up on from copious training data) though the inhale itself could be muted in post-editing, e.g. after the long 24-word first phrase[1] ending in "special magnificence", and then again at the end of the sentence. It could just be the way the AI reads the comma but it is very convincing.
The "News" and "Conversational" examples don't include that pause effect. In the cerulean monologue, there is no pause after "for instance" despite it being in the monologue.
However, the robot takes a deep dramatic breath after the words "I see"[2]: "Oh, okay. I see, [DEEP LOUD DRAMATIC BREATH BY ROBOT], you think this has nothing to do with you. [LOUD DRAMATIC HALF BREATH BY ROBOT] You go to your closet and you select I don't know that lumpy blue sweater for instance because you're trying to tell the world that you take yourself…". There is no pause on the comma around "for instance" though the script has one. I decided to check whether the robot is just copying the original film exactly, and that's not it either.[3]
Comparison:
Robot: "Oh, okay. I see, [DEEP LOUD DRAMATIC BREATH BY ROBOT], you think this has nothing to do with you. [LOUD DRAMATIC HALF BREATH BY ROBOT] You go to your closet [no breath] and you select I don't know that lumpy blue sweater for instance [QUICK HALF BREATH BY ROBOT] because you're trying to tell the world [no breath] that you take yourself too seriously to care about what you put on your back but [no breath] what you don't know is that sweater is not just blue it's not turquoise it's not lapis it's actually cerulean."
Original: "Oh, okay. I see [no breath] you think this has nothing to do with you. [loud long breath] You go to your closet [breath] and you select I don't know that lumpy blue sweater for instance [no breath] because you're trying to tell the world that you [breath] take yourself too seriously to care about what you put on your back but [breath] what you don't know is that sweater is not just blue it's not turquoise it's not lapis it's actually cerulean."
Text:
"Oh, okay. I see, you think this has nothing to do with you.
You… go to your closet, and you select… I don’t know, that lumpy blue sweater for instance, because you’re trying to tell the world that you take yourself too seriously to care about what you put on your back, but what you don’t know is that that sweater is not just blue, it’s not turquoise, it’s not lapis, it’s actually cerulean."
I've annotated the breaths in the "conversational" robot sample vs the original film:
After...           | Robot               | Original           | Same/different?
"I see..."         | [loud breath]       | [no breath]        | Different
"with you..."      | [loud quick breath] | [loud long breath] | Similar
"your closet..."   | [no breath]         | [breath]           | Different
"for instance..."  | [quick half breath] | [no breath]        | Different
"that you..."      | [no breath]         | [breath]           | Different
"back but..."      | [no breath]         | [breath]           | Different
The robot's loud dramatic breath is unmistakable, but it's clear it's not copying the source exactly, since it occurs at different places.
(ElevenLabs dev here) The generative voices and the way they sound are very much a function of all the training data, sampling, and interpolation, as you pointed out. Since a lot of the training recordings do involve deep breaths, the synthesized voice will have them present too, albeit sometimes at different points than a human. Punctuation is the biggest influence on where those pauses happen.
Users so far have actually found it enjoyable to listen to, and say the breathing and pauses are accurate!
I agree - the pauses in the first sample called "Narration" are incredibly accurate and pleasant to listen to.
As a developer, can you tell the difference between "Narration" and the human speaker? What can we listen for, or what gives it away? For my part, I listened to the "Narration" clip many times and, as a native British English speaker also confirms in another comment, it seems very difficult, if not impossible, to tell that the first clip is generated. Congratulations on such an achievement!
I noticed a breath in the demo audio in the linked article and while it stood out, I was impressed by it rather than thinking it felt forced. I'm sure if I listened to enough AI voice it would stand out more and feel forced.
Did you find the whole clip it was in convincing? For me, I didn't even notice the breath but the entire second and third clip felt obviously AI-generated. But the first clip sounded absolutely real (maybe with some compression artifacts - see my other comment.)
Later when I went back and listened carefully for why the first clip felt so "real" I noticed it had pauses. (No breaths per se but they are sometimes removed from edited audio.) However, I then noticed that the conversational clip, which felt unnatural to me, had very obvious breaths. The entire effect of the conversational clip didn't sound like a human at all. It sounded like an AI.
Did you find the whole conversational clip "convincing"? (Did it sound like a human to you?) How about the narration clip?
> Not only can they be more cost-effective without compromising on quality...
That feels dishonest. Even if this AI is just as good at speaking as a professional voice actor (which I'm not sold on), a voice actor does more than just read the line. In ideal circumstances, they have a lot of context for what their character is doing and feeling.
Is this potentially a good option for saving money on video game voices? Quite possibly yes. Is there no compromise on quality? No, not yet.
Past that, the whole "Ethical AI" section's arguments seem ridiculous. Of COURSE it puts the livelihoods of voice actors at risk. Your product's whole point is that fewer man hours are needed for voice work. Just accept that you're making those jobs obsolete. There's a perfectly good argument that it's okay to do that. Throwing bullshit at us to convince us that "no, the voice actors will still have lots of work, and they won't even have to talk!" just makes you sound like snake oil salesmen.
> Even if this AI is just as good at speaking as a professional voice actor (which I'm not sold on), a voice actor does more than just read the line. In ideal circumstances, they have a lot of context for what their character is doing and feeling.
How long do you think that advantage will last? Years? Months? Weeks?
This would put companies like Audm out of business, but it seems like they already only employ one voice actor for most gigs (ya gotta respect how much she gets done though!). I wish there was more work for professional voice actors, audiobooks done by the likes of Roy Dotrice are an absolutely fantastic ride