Launch HN: Play.ht (YC W23) – Generate and clone voices from 20 seconds of audio

h1fra · on March 27, 2023

Congrats on launching. People already made a lot of feedback on the product itself so I'll keep mine.

Just a few note on the UX:

- Recording your own voice should contain a script too, that could help increase the quality of the sampling because I struggled to say anything relevant.

- Recording again, there is no time so it's hard to say when it's okay to stop

- You enforce the checkbox "not [...] to generate any sexual content" yet you have a filter to display only nswf

- It doesn't work at all with non-english voices, maybe you can add a warning or a way to fine tune depending on the language?

- There is no way to delete a voice nor an account, that's a huge red flag especially when dealing with PII like this.

- An other person has said it already, but generated voices are identified by an Auto Increment, making it easy to access PII of an other person. I would recommend at the very least a random string or an UUID

- All generated voices are public and no way to delete them

barbariangrunge · on March 28, 2023

The terms of service is terrifying for anybody who has a voice or anything of value they want made into speech

> you automatically grant, and you represent and warrant that you have the right to grant, to us an unrestricted, unlimited, irrevocable, perpetual, non-exclusive, transferable, royalty-free, fully-paid, worldwide right, and license to host, use, copy, reproduce, disclose, sell, resell, publish, broadcast, retitle, archive, store, cache, publicly perform, publicly display, reformat, translate, transmit, excerpt (in whole or in part), and distribute such Contributions (including, without limitation, your image and voice) for any purpose

There’s a bunch more in there too

Most of the ai companies have these terms, and it’s pretty sketchy

eutropia · on March 28, 2023

> irrevocable, perpetual,

Are complete, absolute, non-fucking-starters.

The rest, I roughly understand legally to an extent for actually "doing a SaaS with your voice" as being necessary.

HOWEVER.

I should be able to say "Stop using my voice" and there should be a default license duration equal to your paid subscription. If I have to occasionally click "By clicking this I certify that I renew the license under the originally agreed terms with a new duration" or something, fine, so be it.

JohnFen · on March 28, 2023

Woah, good catch. Yes, those are terms I'd never agree to.

hammadh · on March 27, 2023

Thanks, we intended the playground to be merely a testing tool for the new model we're building. We'll improve based on your feedback!

mywacaday · on March 28, 2023

I noticed that when i put in the following text from the BBC as a test and it pronounces 2008 as "two thousand eight" but I believe most people would pronounce it as "two thousand and eight"

Great work

A billionaire's son, who fled to Yemen within hours of the death of a student in London 15 years ago, has admitted his involvement to the BBC.

The body of Martine Vik Magnussen, 23, was discovered under rubble in a Great Portland Street basement in 2008.

Farouk Abdulhak, who is on the Met Police's most wanted list and is the subject of an international arrest warrant, has never spoken about the case before.

iamthemonster · on March 28, 2023

Two thousand eight = American English Two thousand and eight = UK and Australian English

mywacaday · on March 28, 2023

I'm based in Europe and a native English speaker, I thought I was aware of most of the differences between UK/US English. I can't believe I have worked with Americans for decades and never noticed this. Live and learn!

JohnFen · on March 28, 2023

I'm an American, and both methods sound right to my ears. I hear both variations quite a lot from the people around me. I assume it depends on what part of the US you're from.

jnwatson · on March 28, 2023

American English traditionally uses an “and” to separate the whole from the fraction, e.g. two thousand eight and two thirds.

ljlolel · on March 28, 2023

Pretty sure I would say two thousand eight?

the88doctor · on March 28, 2023

I would definitely just say "two thousand eight"

jeroenhd · on March 27, 2023

Listening to the demos I'm not entirely convinced by this (https://playground.play.ht/listen/189 was pretty funny). I wonder if this company will end up taking down (and subsequently pricing out most people using this tech for fun) arbitrary voice generation just like its competitors have so far.

Going to the demo page and hearing a random snippet of Musk-worship was pretty weird. Out of all audio tracks to place at the top of your demos, you chose this?

nico · on March 28, 2023

> (https://playground.play.ht/listen/189 was pretty funny)

Warning to others wanting to click on the link: damn that was creepy.

noduerme · on March 28, 2023

that's pretty fuckin funny. Did you train it to do that?

andrewstuart · on March 28, 2023

It was damn funny.

I’m still laughing five minutes later.

nullsense · on March 28, 2023

Ghost in the machine.

juggli · on March 29, 2023

damn. scared me

bunnyswipe_com · on March 28, 2023

sounds from hell

mugr · on March 27, 2023

Wow, I call to the team behind this. I really STRONGLY think you should at least implement some sort of URL stealthing. I'm not a Web Security expert, but it reminds me of a talk where some company just made medical records 'public' like this.

h1fra · on March 27, 2023

Oopsie, the infamous id int Auto Increment

airstrike · on March 27, 2023

https://playground.play.ht/listen/1339 and https://playground.play.ht/listen/210 are hilarious

mikrotikker · on March 27, 2023

On the contrary, this should be accessible so we can see what people are generating.

yreg · on March 27, 2023

The demo page says 'Recently generated', you have listened to the last snippet someone made.

jeroenhd · on March 27, 2023

I know the demo page was user generated. My Musk comment referred to this page: https://play.ht/ultra-realistic-voices/

SamBam · on March 28, 2023

Two(!) of them were about Musk.

bongobingo1 · on March 28, 2023

I see a bright future for play.ht in the "pre-event" audiolog generation market. Somebody get ubisoft on the phone.

WakoMan12 · on March 28, 2023

AI can now generate youtube poops

delgaudm · on March 27, 2023

How do you assert that the cloned voice has been truly permitted by the voice owner? I've had my voice cloned without my consent by other people using Descript and Eleven Labs.

What is your process for verifying consent?

ros86 · on March 27, 2023

When I tried this service previously, you had to read (out loud) something saying that you were giving consent.

Aachen · on March 27, 2023

I'd be curious what the false positive rate on that is. Can you clone anyone's by collecting a set of ten voices with unique timbre reading the required statement plus pitch control to get close enough? A hundred? Or can you trick the neural net by giving it something that sounds like white noise to humans until the NN triggers in the right way and goes "ok yep that's a match, you're authorised now"?

Probably not something we'll get to hear as part of the PR pitch.

Or is the consent statement the thing that will be cloned and is there no separate training audio? Then it might actually work and you'll just have to get close enough that the human you're trying to fool can't distinguish anymore (defeating the need for this tech in the first place, at least in targeted rather than automated cases).

ros86 · on March 27, 2023

Yeah, good point - don't know. When I tried I actually did get a (personal?) email saying that it didn't match closely enough. After uploading another sample (based on a different text) it went through.

I like your idea of just training on the consent text! That wasn't the case when I tried it as you needed around 3h (optimally) of training data.

cortesoft · on March 28, 2023

If someone has the capability to trick the service like that, they likely have the capability to recreate the functionality themselves.

Aachen · on March 28, 2023

With a couple soundalike voices and changing the pitch in Audacity? That's a far, far cry from doing cutting edge neural networks that clone voices with samples of less than half a minute.

If you mean the white noise, I meant that as a brute force attack because, to do it more targeted (to know what it'll accept as seeming like your target voice), you'd likely need their exact model rather than doing your own.

barking_biscuit · on March 27, 2023

Just use another voice cloning service to do that.

vishalchandra · on March 28, 2023

1xdevloper · on March 27, 2023

It's mentioned in the second demo video that they have a strict process to prevent cases like yours. I think Descript started asking for identity verification after its service was abused. This one probably has a similar process too.

dkdbejwi383 · on March 27, 2023

I think the previous comment wants to know what the "strict process" is exactly.

whywhywouldyou · on March 27, 2023

Right, and I'm sure their "strict process" is something like "we take it down after you notify us and provide proof that the voice is yours".

yellow_lead · on March 27, 2023

But they don't say what it is

mikecoles · on March 27, 2023

TIL, the Booth Junkie is on HN. Love your work, sir.

delgaudm · on March 27, 2023

Thanks my friend!

anigbrowl · on March 27, 2023

Hey HN, we are Mahmoud and Hammad

Are you though? You might just be computer-generated.

While I'm very impressed with this technically (and as a pro-audio person I feel validated to see my predictions of a few years back coming true so dramatically), I don't see anything about risk management in here. Your tech absolutely will get used by scammers, given the overabundance of voice data on the open internet. How are you going to hedge against that?

mahmoudfelfel · on March 27, 2023

We have many mitigations in place to increase the safety of this service, I mentioned some of that here https://news.ycombinator.com/item?id=35331310

anigbrowl · on March 27, 2023

That's interesting. But I think it's a mistake to focus on relying on price to prevent abuse at scale. The use case for abuse of this technology is in highly targeted frauds, not broad-spectrum scams like insurance robocalls. Additionally, this will be zero deterrent to deep-pocketed actors like political action committees that generate fakery to influence elections and the like.

I'm trying not to be reflexively dismissive, and I know the technology is evolving so fast that your individual company can't necessarily pre-empt it, any more than an email software supplier is responsible for the existence of phishing. But I work adjacent to the security space (studying violent extremists) and I can think of a ton of ways to abuse this where economics would be absolutely zero deterrent.

mmkos · on March 28, 2023

Wow, I haven't even thought of that. Imagine this being used together with a chatgpt equivalent. Scam rates are going to go through the roof.

Natfan · on March 27, 2023

This is already being used for scams.

https://playground.play.ht/listen/1079 (https://archive.ph/HKjue)

How exactly do you expect to combat this type of content?

hammadh · on March 27, 2023

The intention for this playground was to let people try the model. We actually have auto moderation on the user facing platform (https://play.ht/) and malicious text gets blocked and the user get flagged.

owlbynight · on March 28, 2023

Except this post is 8 hours old and I'm still able to view this link.

SamBam · on March 28, 2023

17 hours old, still there.

jprete · on March 28, 2023

Another 8 hours and I can still see it too.

Apocryphon · on March 28, 2023

How would your auto moderation detect that example is malicious?

bradleysz · on March 27, 2023

This is not a full solution, just spitballing, but I wonder how effective it would be to have a flagging system built with GPT4 where the prompt was some form of "This is text submitted to a text-to-voice model. Determine the probability that this is being used maliciously." Then manually review anything that returns >X%.

thisOtterBeGood · on March 28, 2023

Same: https://playground.play.ht/listen/2002 (https://archive.ph/wip/hsaqU)

spokeonawheel · on March 29, 2023

Sounds like old school AI, very similar to the spam problem google solved could easily take care of this.

Just stop it before they can generate it.

However. Its just a matter of time so I wouldnt put it on the author to stop this kind of stuff. The only defense is education.

MuffinFlavored · on March 27, 2023

https://play.ht/app/voice-cloning > Clone a voice now

Pops a modal: Try Voice Cloning for Free!

Enter a credit card for $0.00/mo with no other information on screen

Bounce.

Why not let me play around with it a little without asking for a credit card?

devmunchies · on March 27, 2023

I think if you are cloning voices, you should be required to have a credit card or some other KYC identifier. Even if it's free. This kind of highly abusable tech should have a paper trail IMO.

MuffinFlavored · on March 27, 2023

I guess I misunderstood/didn't think it all the way through. Not sure what the balance should be but... I just wanted to see how it would be at cloning my voice (not "a" voice that doesn't belong to me) as a quick gauge to "is this technology ready to play around with".

delgaudm · on March 27, 2023

As someone whose voice has been cloned without my consent, I could not agree more.

antibasilisk · on March 27, 2023

yeah because that's working great for crypto lol

flangola7 · on March 27, 2023

What do you mean? KYC is required on every US exchange.

Firmwarrior · on March 27, 2023

Exactly, that's why the crypto space doesn't have any scams

antibasilisk · on March 27, 2023

It's trivial to get your money to exchanges run by people/machines who don't care to comply with US law, or to render the KYC worthless in the first instance.

hammadh · on March 27, 2023

It's an effort to prevent abuse. We previously asked users to pay upfront but most people want to try it out first.

jameshiew · on March 27, 2023

I would mention something to that effect in the modal because it wasn't clear to me why it was asking for card details at that point for "$0.00/mo" (though I guessed the reason). Maybe something like "To prevent abuse, we require card details, but you won't be charged", but worded better.

MuffinFlavored · on March 27, 2023

> but you won't be charged

"no matter what based on your usage/you are locked into the free tier" would have helped for sure

i still would've bounced because i just wanted to goof off with it quickly while it captured my attention and requiring a payment method is just... terrible friction to capture users being able to quickly test one of the key features you advertise, but i guess if fraud concerns are that bad, that's the tradeoff you have to accept?

hammadh · on March 27, 2023

Thank you. Would fix this.

MuffinFlavored · on March 27, 2023

Do you accept anonymous Visa/Mastercard/etc. gift cards in this payment method? If you do... are you actually preventing abuse or just making it slightly more complicated to pull off?

joshmn · on March 27, 2023

Playing with this now, wow.

My mom passed away a few years ago. I always let her calls go to my voicemail so I could have them. I was using Google Voice at the time so this worked wonderfully. Unfortunately, I will not listen to many of them — she was an alcoholic and I can't bear to listen to her while drunk. The few I have of her when she's sober I listen to occasionally.

Having said, this is really nice.

testmasterflex · on March 27, 2023

Sorry man. :( I wish you well

gwerbret · on March 27, 2023

Given the very (very, VERY) obvious concerns associated with malicious deployment of this tech, and the minimal/largely ineffectual countermeasures deployed by the founders, what surprises me the most is that YC gave this startup its stamp of approval. It used to be that they offered at least a basic sanity check to anything they funded. Is this now getting lost as they scale up their funding operations?

rcme · on March 28, 2023

The constant worry about malicious deployment is so tired in my opinion. The technology to clone voices exists. Your trust is audio recording should already be shaken. Trying to hobble this product on the grounds of "it's dangerous" just serves to limit creativity.

mike_d · on March 28, 2023

I think this in an area where there are many more malicious use cases than legitimate.

It's like spyware developers that claim their software is for remote administration of computers you own.

rcme · on March 28, 2023

Don’t take this personally, but I don’t think you’ve thought too hard about what you can do with this technology. For instance, you could create audiobooks for every book ever published. You can change scripts for movies after shooting has already happened. Indie game developers can now afford high quality voices in their games. Even AAA games like The Elder Scrolls can vastly expand their in-game voice variety. I think it’s amazing.

Apocryphon · on March 28, 2023

You're not wrong, all of that is great. But the capacity for even more spam calls and scammers generating fraudulent content with cloned voice samples will be an immensely annoying issue. Anyone who owns a phone in the modern age absolutely cannot be a Pollyanna when looking at this technology. There are real issues that must be addressed.

rcme · on March 28, 2023

Is access to realistic voice really the limiting factor in spam calling?

mike_d · on March 28, 2023

Right now if your brother texts you and says they are stuck somewhere and need you to wire them $500, you call them and verify it is legitimate. That is the best answer we have for users.

Once your brothers voice has been cloned and generative AI can deep fake his face on a video call, we are pretty much screwed.

rcme · on March 29, 2023

I usually just make sure the number is right.

Apocryphon · on March 29, 2023

How would you do that if they're claiming they lost their phone and are thus calling for a pay phone or the phone in a police station?

jprete · on March 28, 2023

The technology never should have existed in the first place. Adding to it is still adding additional harm.

drewwwwww · on March 28, 2023

asking someone to take basic precautions like “have a fire extinguisher” or “hazmat labeling” just gets in the way of innovation!

rcme · on March 28, 2023

Generally hazmat labeling is only required when there is a clear and present harm possible. So far, the only argument presented is “theoretically someone could abuse the this” which is very different than “if you’re exposed to this, you will get cancer.”

If a theoretical argument that the masses can’t be trusted due to the actions of a few appeals to you, then I imagine you also support banning encryption to prevent terrorism.

MattRix · on March 28, 2023

well no, their argument is more like: "Everyone should keep advancing the art of cooking, but just make sure they have a fire extinguisher handy" and your response is "Don't use fire to cook because there's a chance it could burn your house down".

Apocryphon · on March 28, 2023

Well, jurisdictions are starting to ban gas stoves, so indoor cooking with fire might eventually become a thing of the past.

mahmoudfelfel · on March 28, 2023

What is in this demo is a very rate-limited, early version of our new model. We have many mitigations in place to increase the safety of our main product (Play.ht); I mentioned some of that here https://news.ycombinator.com/item?id=35331310

CJefferson · on March 28, 2023

Your mitigations are so bad, you must know that they are almost useless. No website has ever kept up with spam by individually checking every complaint, and you would only stop people after they've already recorded large amounts of output from a stolen voice.

On the other hand, as you also know, you could easily put a much stronger safeguard on by making people say a short prepared statement and only clone voices who have said that statement. You do not appear to have done that. Why not, except it would make it harder to voice steal?

neilv · on March 28, 2023

The mitigations don't seem to address targeted attacks that I'd think you can assume will happen.

How do you address the civil and criminal liability of that?

bryant · on March 28, 2023

Are there even any criminal laws against this? I mean, it strikes me that there should be, but my non lawyerly self has never heard of any.

newonhere · on March 28, 2023

Yes, absolutely. Using one‘s likeness without explicit consent first is illegal in (most of?) Europe and is a tricky subject in the US. For example look at Crispin Glover‘s lawsuit against Back to the Future II.

https://en.wikipedia.org/wiki/Personality_rights

bryant · on March 28, 2023

To be clear, I'm looking for references to criminal law, not civil or case law. This all seems a combination of the latter, and even here it's not obvious what applies in cases where a third party produces the infringing content.

newonhere · on March 28, 2023

I am not a lawyer.

I had to briefly look up the difference between criminal law and case law to see what you mean. I have no idea if there are any criminal cases in the US about this.

In Civil Law countries as opposed to Common Law countries the plaintiff could very well be the state for this type of legislation. For example, look at tech companies being fined for GDPR violations, same basic idea.

taytus · on March 28, 2023

Then why release this outdated version?

utopcell · on March 28, 2023

Can't you take a moment and appreciate the great technical achievement before you ?

kybernetyk · on March 28, 2023

I don’t want to topple US democracy. I just want to make walk through videos for my apps without my terrible accent coming through :)

So thanks for funding this.

anoy8888 · on March 28, 2023

Moral is the last thing YC cared about . In fact Paul wrote an essay about good founders being the kind of people who tend to break rules. Rules, laws , morality are for other people

hn_throwaway_99 · on March 28, 2023

I think it's impossible to write anything with nuance on the Internet, as it is always taken out of context and used as a gross caricature of the original point, as you have done.

I'm quite familiar with pg's essay, and the idea that he's arguing that "Rules, laws , morality are for other people" is laughable. Sure, it's fair to argue against his point, but one of the main thing he highlights in his essay is clearly knowing the difference between rules and laws that are morally important, versus ones that came about through regulatory capture or "tradition".

Most societies have at least some tradition of celebrating "good trouble".

rockemsockem · on March 28, 2023

I feel silly for saying this, but it's very obvious that rule breaking is not the same as acting immorally.

Machines should be able to speak in realistic voices. Every cool future I can imagine includes that. Why muzzle it just because people can be scammed with it?? People get scammed by jpgs, random people calling and saying they're from the bank or the IRS. I'm really not interested in limiting what humanity can accomplish to account for folks getting scammed. Many of them will get scammed anyway. What we should do is make it easier for these individuals to learn how to not get scammed or to come up with scam insurance for the ridiculously gullible. I'm just spitballing, but banning cool technology because people can be scammed is overkill.

CJefferson · on March 28, 2023

My problem isn't computers speaking in realistic voices. The problem is speaking in your voice, or my voice.

killdozer · on March 28, 2023

Just because there's some bad stuff out there doesn't mean that opening the flood gates to higher amounts of more advanced bad stuff is acceptable. It's absolutely reasonable to look at a new technology and try to figure out if it is likely to bring about more harm than good and take action based on that assessment.

dang · on March 28, 2023

That's not fair at all. YC has declined to invest in many startups that they didn't think were doing good things. I don't know anything about the thought process in this particular case but you're miles off base with "Moral is the last thing YC cared about".

nsxwolf · on March 27, 2023

This is going to be the shortest gold rush in history. Make your money now because in a couple years you'll be able to build and deploy your own Play.ht for free with a single ChatGPT prompt.

tanepiper · on March 27, 2023

"Trusted by 7000+ users and teams of all sizes" [posts a bunch of company logos]

You've just launched in beta, how can you claim this? I'm always very suspicious of this (I take this from the position of being a tech lead at a multi-billion euro retailer who's logo you'll never be able to use)

Is this one developer? A team? Or is this just marketing bullshit for VCs who somehow don't verify if this is true or not?

hammadh · on March 27, 2023

We launched the playground.play.ht in beta to share the new speech model we are working on. We've been operating play.ht for a while and have teams from these companies using the platform.

swyx · on March 27, 2023

exactly. i hate this practice of just spraying logos all over with no context. give me 3 logos but each with a written case study or zoom conversation or even a tweet saying what they use it for and you get more trust than 100 logos

pencilbow · on March 28, 2023

I've used play.ht for a few weeks now, they offer a solid product IMO. Wouldn't be surprised others are using them too

nanis · on March 28, 2023

What are the "legitimate" uses cases for this kind of service where they would expect to make money from individuals who want their voices cloned? Dub movies? Audiobooks?

mahmoudfelfel · on March 28, 2023

We have seen use cases in audiobooks, podcasts, marketing videos, explainer videos, Commercials, and Gaming, among others.

koolba · on March 28, 2023

I suppose audio books read by people who can’t be bothered with reading a book out loud?

Or maybe a comedic angle of audio books read by unusual accents? Imagine Harry Potter read by Arnold Schwarzenegger.

barbariangrunge · on March 28, 2023

None. The terms of service have creators surrender rights to their voices and their ip when they use this service

janalsncm · on March 28, 2023

If I’m an author and I want to create an audiobook, I might be able to create the whole thing after reading just the first page.

shw1n · on March 28, 2023

My product (www.dopplio.com) leverages this exact tech to reduce manual work done in sales

So I’m always excited to have more options

JohnFen · on March 27, 2023

This is a good reminder that we all need to have a "safe word" that we can use to verify to the important people in our life that the voice they may be hearing on the phone or elsewhere is really us.

Get a panicky call from "me" in the middle of the night? If I don't include my safe word, that call isn't from me.

gus_massa · on March 27, 2023

That scam was popular here in Argentina a few years ago. We call it "virtual kidnaping" https://www.fbi.gov/news/stories/virtual-kidnapping , nobody is kidnaped, it's just a scam using a phone call.

It's not very important that the voice is similar to the supposed victim. Usually the person in the call is weeping and it's very difficult to recognize the voice. Moreover a confusing voice at 2am may be interpreted as any of your relatives or friends, but an exact voice can be interpreted only as one and it's easier to know that that person is safe.

JohnFen · on March 27, 2023

Some scammers tried to pull this scam off on my stepfather years ago. He got a call that, through the wailing and tears, told him that I'd been thrown into a Mexican prison and needed bail money immediately.

He was 90% convinced that it was true, but my mother made him call me before doing anything, which saved him about $10k. She thought it was suspicious that I would have left the country without mentioning it to her.

If the person he was talking to was relatively calm and sounded like me, it might have been successful.

gus_massa · on March 27, 2023

At least here, most were just calling random telephone and letting people guess who is the kidnaped person

> Bububu. Hi, I'm ... bububu

> John?

> Bububu. Yes, I'm John. bububu. I'm in bububu ... the jail in ... bububu

> Mexico?

> Bububu. Yes, In Mexico. bububu. And I need money ... bububu

There are other that research the victim and have more data for a targeted call, but it's more difficult so most cases where random calls where they don't have a sample voice of the victim.

mahmoudfelfel · on March 28, 2023

Society definitely needs to adapt to this new norm; we are trying to roll this out as safely as possible, but others are not as careful, and this technology will just become more ubiquitous over time.

TheUndead96 · on March 28, 2023

It is frightening that we have gotten to this point already.

mlboss · on March 27, 2023

Very good suggestion

jascii · on March 27, 2023

I'm having a hard time coming up with a non-nefarious use case for this.

bovermyer · on March 27, 2023

I'd get a kick out of having my own blog posts read to me in James Earl Jones's voice.

Or, heck, my own voice. Though it'd be surreal to hear not-me-but-me saying things I've never said.

woodrowbarlow · on March 28, 2023

even this is ethically questionable. james earl jones's voice is his livelihood.

bovermyer · on March 29, 2023

While that is true, I'm not suggesting a pattern of behavior - just that it would be fun to hear.

mahmoudfelfel · on March 27, 2023

We have been seeing some of these genuine use cases: youtube creators, audiobooks, elearning videos, podcasts, commercials, dubbing, and gaming.

tgv · on March 28, 2023

BS. That could just be done without imitating someone's voice.

ksrm · on March 28, 2023

No-one is going to listen to an audiobook made with this. It's still fundamentally just TTS.

pencilbow · on March 28, 2023

Have you tried it? I've listened to 2 generated audiobooks so far, has been great

zanderwohl · on March 27, 2023

I am toying about with building a virtual puppet software in the style of watchmeforever. I have a number of voices I do for the stage and DnD that I would be willing to train a few models on so I could give my puppets unique voices.

rockemsockem · on March 27, 2023

Anything written can be listened to with this tech. Any news article, any short story, a draft of a piece of writing you're working on. There is too much text for human beings to read it all.

scrollaway · on March 27, 2023

> There is too much text for human beings to read it all.

so your logic is that all that text should be audio and people will consume more? Because I got news for you, reading is faster than listening.

rockemsockem · on March 28, 2023

When I said there's too much text for human beings to read it all I meant that it isn't feasible to pay people to read all text that someone might want to listen to into audio. Like a random blog written by someone in their spare time probably isn't going to hire a voice actor.

I think the case for having all text be listenable is pretty clear. We're all really busy and often our hands are busy but we're not doing something that mentally stimulating. This is an ideal time to listen to an audiobook, a blog, the news, or whatever else you'd like.

throwaway675309 · on March 27, 2023

Oh yeah, how does reading work out for you while you're driving a car? smh...

vincnetas · on March 28, 2023

And all AI bots are here to generate even more text. :( We will need to rethink and reevaluate lots of things that we are used to.

erichocean · on March 28, 2023

I'm using this kind of technology for temporary voice tracks in animated shorts.

I'd really like something like Img2Img for voices so I can translate a performance to an arbitrary (synthetic) voice.

nullsense · on March 28, 2023

Tortoise TTS can do this. You just pass it your example as a conditioning latent.

erichocean · on March 28, 2023

Thanks!

atentaten · on March 27, 2023

Generating audio for an audio book: If an author could speak for 20 minutes and then generate audio for an entire book from the book's text and the model, I think that would be very useful.

sva_ · on March 27, 2023

20 seconds*

atentaten · on March 27, 2023

The OP mentioned that for so called, "High-fidelity voice cloning", it would take 20 minutes of training. I think a book author would want the best quality possible to reproduce their voice.

JohnFen · on March 27, 2023

Why reproduce their voice? There's no value-add there.

sva_ · on March 27, 2023

Many people prefer an audiobook version of a book to be read by the original author, which isn't always the case. If an author could make that version happen by using 20 minutes of their time + text2speech of the whole book, that would be an immensely positive value proposition on the side of this company.

But I'm not sure. Part of why I'd prefer the original author to read a book is that they vocally emphasize certain parts of the book, and I don't think these models could do that at this point.

JohnFen · on March 27, 2023

> Many people prefer an audiobook version of a book to be read by the original author

Right, but having AI read the book in the author's voice is definitely not the author reading the work.

As you mention, the reason that people like to hear the author read it is because it's the author reading it, theoretically emphasizing and acting things out according to what was intended. It's not just to hear the author's voice.

So I don't see what the value-add is.

jeroenhd · on March 27, 2023

Voice generator tech has created some decent surreal memes (like audio recordings of Biden, Obama, and Trump playing video games together).

Outside of memes or maybe the occasional well-intentioned prank, I really can't think of anything either.

Rubinsalamander · on March 27, 2023

Massively reducing costs for Voice Over in Video Games. This should make it even feasible to create mods with audio which would be great :)

jeroenhd · on March 27, 2023

I would consider studios taking voice actors' voices and using them to generate new content beyond their contract to be abuse. I'm sure big corporations are rubbing their hands in anticipation, but I'm sure killing the VA industry will make the world just a tiny bit worse for everyone else.

Mods are more difficult to attach a moral judgement to. I don't think I'd really consider them malicious, as long as they're not sold, but there's a very thin line between a high quality mod and stealing someone's voice.

Rubinsalamander · on March 28, 2023

I think it will probably kill the current Business Model of the VA Industry. Having the ability to generate as much audio content as you like without the risk of the VA not being available anymore (dead, booked out,...) is just too good to pass up.

Instead we will probably see licenses for generated voices. And in case for games the game developer could make the voice model freely available for mods of his game.(The mods are already using assets from the game, why not also audio?)

jeroenhd · on March 29, 2023

Machine generated content cannot be copyrighted so I doubt companies will switch to AI generated voices for big games for that reason.

Voices can't be copyrighted either, so I don't see how a license for a generated voice would even work.

buu700 · on March 27, 2023

On the other hand, why shouldn't voice actors benefit from this tech?

I can easily imagine a future where AI-generated impersonations are deemed by courts or new legislation to be protected by personality rights. In that world, voice actors could expand their business by offering deeply discounted rates for AI-generated work.

Alternatively, if/when tech like Play.ht is consistently good enough, maybe it just becomes a standard practice for all voice acting work to include a combination of human- and AI-generated content, like a programmer using Copilot or a writer using GPT.

gamblor956 · on March 27, 2023

I'm sure programmers would love to expand their business opportunities by offering deeply discounted rates for creating AI-generated code.

No? Then why do you assume that someone else would want to do the same in their profession?

As AI-generated content is not protectable under IP law, it's a non-starter for games, film, TV, or music for anything except background filler.

buu700 · on March 27, 2023

Sure, why not? If you could earn more money and produce more value to society with the same amount of labor, and the legal/regulatory environment supported it, I wouldn't see a reason not to.

If you had a solo contracting business, and the technology existed to fully outsource a development project to AI based on carefully documented requirements, using it would be a cheaper alternative to subcontracting. Rather than writing every line of code by hand, you would transition to becoming an architect, project manager, code reviewer, and QA tester. Now you're one person with the resources and earning potential of an entire development shop.

I have my fair share of complaints about AI coding tools, but that isn't one of them. Maybe the increase in supply would result in a lower average software engineering income, but it wouldn't have to if demand kept pace with supply.

Furthermore, code is more fungible than a person's voice. If someone wants a particular celebrity's voice, that celebrity has a monopoly on it. Thus, it's not obvious that increasing the supply of one's voice acting work would decrease its value. (I suspect the opposite to be the case, until a point of diminishing returns.)

Although the voice acting case has a similar concern; will we get an explosion in new and/or higher-quality media, or will we see a consolidation to a smaller number of well known voice actors taking an outsized amount amount of work? Another issue, if we look beyond impersonation specifically, is that human voices may become marginalized over time in favor of entirely synthetic voices. I imagine that this would start with synthetic voices playing minor roles alongside human/human-impersonated voices, but over time certain synthetic voices would organically become recognizable in their own rights.

Again, I see plenty of concerns with AI in general, but more of a mixed bag than strictly negative, and there isn't anything inherently nefarious about this product in particular.

Personally, I'm optimistic about what society looks like in the long run if humanity proves to be a responsible steward of increasingly advanced AI. By the time we're at a point where 90% of people can be effectively automated out of a job, we'll have had to have figured out some alternative way of distributing resources among the population, i.e. a meaningful UBI backed by continued growth of our species' collective wealth and productivity. I can easily imagine a not-too-distant world that is effectively post-scarcity, where it's not frowned upon to spend years (or lifetimes) on non-income-generating pursuits, and where the only jobs performed by humans are entrepreneur, executive, politician, judge, general, teacher, and other things of that must be done by humans for one reason or another.

So am I happy that AI is encroaching on skilled labor? In the short term, not necessarily. But it's not necessarily bad either, it's the reality that we're in, and long-term I'm more optimistic than not.

Izkata · on March 28, 2023

Star Trek: Prodigy has already used audio from previous movies and TV to bring back to life several actors from previous series. It's not exactly the same as this, but their dialogue was taken out of context to create new scenes and story.

jeroenhd · on March 29, 2023

I know, and I almost wished they did use AI for that segment because it was pretty jarring (especially the TOS recordings).

There's still a huge difference between "reusing the work the studio paid for" and "recreating your voice forever after doing a single project".

inerte · on March 27, 2023

I think “talking” with dead relatives or friends will become real pretty soon.

If people can find comfort hearing their mom say words of encouragement in a tough situation, I think a lot of people would do it. Kinda hard because for some others that would mean never getting closure.

Weird stuff is certainly about to happen…

starkparker · on March 27, 2023

The last thing on earth I'd want is for any aspect of my dead relatives to be reanimated through technology. No. That's absolutely fucking horrific to consider. I don't need a hallucinating AI pretending to be my dead wife. That's literally shambolic.

There is vastly more potential for that to be abused by others than used in any emotionally or socially constructive way.

Rubinsalamander · on March 28, 2023

I would also find that very creepy and it would probably keep you from moving on. I think there is a big difference between remembering what happened by looking at a photo or hearing an audio recording and having newly generated "content" from a deceased loved one.

woodrowbarlow · on March 28, 2023

there has been some media coverage on this already (e.g. [1]). an emerging concern among mental healthcare professionals is that a sufficiently-convincing simulation could interfere with the progression of the stages of grief, prolonging the 'denial' stage and potentially heightening the intensity of the stages that follow.

[1] https://www.wired.com/story/a-sons-race-to-give-his-dying-fa...

selflesssieve · on March 27, 2023

I can’t wait for spoofed messages from my loved ones.

apazzolini · on March 27, 2023

The scam via voicemail possibilities are endless!

barking_biscuit · on March 27, 2023

What we really need is something on par with this or Eleven Labs that's open source. Then the real fun will begin. At this point I think it's just a matter of time.

nmfisher · on March 28, 2023

Join the LAION Discord #audio-generation - some of us are literally working on this right now.

rockemsockem · on March 28, 2023

Awesome to hear! Joined!!

barking_biscuit · on March 28, 2023

Awesome!

Dowwie · on March 27, 2023

I recommend you immediately add identity verification (state-issued identification verification), set up appropriate secrets store for PII, and audit trail EVERYTHING your users are doing, storing the contents in a secure location. Yesterday. This service will be used to harm others, shortly. I do think that there are exciting, honest things that can be done with this service but you need to set up some friction for use. Know-your-customer rules are going to apply to this category in short time.

People here are talking about taking this service offline but I think everyone needs to be thinking about countermeasures, working on those services next. The genie is already out of the bottle. The degree of effort to put this together is low enough that it will be replicated around the world.

KaiserPro · on March 27, 2023

Like this example here: https://playground.play.ht/listen/1554 which says:

> "Hi Mom, I need some help. Some guys hit me over the head and put me in a van, and they're saying they'll kill me if you don't wire money to this bank account."

top class.

EDIT this was about one page down on the "see what people are generating" page

_tom_ · on March 27, 2023

My stepmother tells she has been getting this type of scam, minus the accurate voice, for years. About one a year.

I'm not sure she would have spotted the scam if it had sounded right.

muyuu · on March 27, 2023

On the bright side, it's not a very convincing rendition of a human.

braingenious · on March 27, 2023

I agree. That guy sounds very nonchalant for being in life-threatening distress.

muyuu · on March 27, 2023

not just that, he sounds remarkably computer-like

PS on the downvote: sorry if I did hurt someone's feelings, but it's the truth

SirLJ · on March 27, 2023

Totally agree, this sounds awful

bentlegen · on March 27, 2023

It doesn’t matter. Given enough time and progress, it will be indistinguishable.

braingenious · on March 28, 2023

I mean, it kind of does matter? Since the start of this thread was a post about the imminent threat that play.ht posed, an example of it not appearing very dangerous is on topic.

It’s very possible that some other voice cloning software will be the undoing of the fabric of society, not this particular website.

bentlegen · on March 28, 2023

It could be this website in 12 months or a few years. Or someone else. Doesn’t matter.

To me your reaction is like dismissing version 1 of the iPhone as a nothingburger because it’s battery isn’t good enough to be practical at launch.

braingenious · on March 29, 2023

It could be like dismissing the iPhone 1. It could also be like dismissing the Nokia Lumia.

Either way an unconvincing audio file is an unconvincing audio file regardless of what other files may become possible in the future through any number of various platforms and software implementations.

In the same vein, it is prudent not to mix up the factual current state of things with possible futures.

I don’t see the purpose of panicking right now. Are you using this anxiety as motivation to design better audio fingerprinting solutions? Identifying bad actors and vulnerable groups? Educating people on how to avoid being manipulated by fake audio?

What does posting “I am scared of the future” online accomplish?

mdrzn · on March 27, 2023

Damn that's a perfect example.

braingenious · on March 27, 2023

How is it that

> I recommend you immediately add identity verification (state-issued identification verification)

and

> The genie is already out of the bottle. The degree of effort to put this together is low enough that it will be replicated around the world.

are thoughts that end up in the same post?

If the genie is out of the bottle, it’s your proposed solution that everybody that runs a model like this implements bank-style KYC?

What do you propose should happen when this sort of software becomes freely available for everyone? When (not if) that happens, what will your suggestion have accomplished?

nullsense · on March 28, 2023

It's more of a "Cover Your Ass with Paper" type thing.

twodave · on March 27, 2023

While I agree with you, the problem is far bigger than any one company in my opinion. These tools are already accessible enough to individuals that no audio or video is trustworthy, regardless of its source. I suspect we can still detect whether most faked audio/video is authentic or not algorithmically, but that's going to turn into an arms race eventually. And IMO none of the "answers" are ones that you really want to see made real, either.

We're in for some really strange times.

Pxtl · on March 27, 2023

I feel like this will be the thing that finally forces digital signing into the public eye. "Wait, is that video real?" "Well, it was signed by a reputable news source."

twodave · on March 27, 2023

Right, which leads to a place where nothing is trusted unless it came from some central authority or from some trusted piece of hardware. I'm not looking forward to the day when I have to use e.g. an Apple or Google piece of hardware or some locked down kiosk or "be famous" in order to conduct business.

Pxtl · on March 27, 2023

The film industry has been pointing cameras at screens for decades. Trusted hardware won't work.

twodave · on March 27, 2023

I assume trusted hardware would include things like LIDAR and biometrics, but if you're assuming those can be beaten then it will be a different kind of arms race, for sure.

mikrotikker · on March 27, 2023

I'll be living in a cabin in the woods by that point.

abirch · on March 27, 2023

I'm imagining the legal implications though I'm not a lawyer. If granny gets ripped off by someone impersonating me with this site, seems like Granny could sue Play.ht.

Play.ht will want to have as much information as possible about their users.

hammadh · on March 27, 2023

You are right, and unfortunately that is a possibility, and we are working on having measure in place to guard against such attempts. We have auto moderation on the input text that will block such audio being generated. Such users are flagged in the system.

Avicebron · on March 27, 2023

What are you filtering for in the input text that would block something like a phone scam?

yreg · on March 27, 2023

How would granny prove the scammer used play.ht?

tomesco · on March 27, 2023

If law enforcement ever busts a scammer and discovers a tool like this was essential to the scam, that would generate lawsuits.

yreg · on March 27, 2023

_fjb4 · on March 27, 2023

While verification could be done for a cloud service like this one, what's more concerning is that locally run models with this tech will be coming soon (think of LLAMA and Stable Diffusion). KYC is merely a stopgap and honestly we'll need effective solutions for detecting vocal cloning impersonation in the future.

perlwle · on March 27, 2023

A couple in Canada were reportedly scammed out of $21,000 after getting a call from an AI-generated voice pretending to be their son.

https://www.businessinsider.com/couple-canada-reportedly-los...

Frondo · on March 28, 2023

There's a podcast I listen to sometimes called "The Perfect Scam," sponsored by the AARP but, I suspect, is intended more for the kids of elderly people who are more at risk for these kinds of things:

https://www.aarp.org/podcasts/the-perfect-scam/

They have quite a few stories about "virtual kidnappings" and interview the people involved -- it's quite interesting, and has given me a lot of insight into how typical it would be for people to hear panic and react with panic...precisely how these scams are intended to go.

hammadh · on March 27, 2023

Couldn't agree more with your comment. We are working on counter measures like manual verification of voice, a classifier to detect cloned speech, etc. As of now we have auto moderation in place that detects and blocks hate/harmful speech.

Firmwarrior · on March 27, 2023

The cat's out of the bag, I'd say you guys should just go full steam ahead and make sure it's your names in the headlines

No need for a bunch of onerous kyc or anything IMO

woeirua · on March 27, 2023

Yes, definitely take this advice from some random user on HN. Can't possibly go wrong.

Firmwarrior · on March 27, 2023

I actually have one thousand HackerNews good boy points, so I'm kind of a big deal

I think that a few years from now this tech is going to be ubiquitous, real time, and work on a mobile device. Trying to slam the lid shut on Pandora's Box probably isn't going to work.. the best thing at this point would be for the word to get out to everyone that voices can now be doctored the same way photos can

gsich · on March 27, 2023

Or it will be used for memes.

nullsense · on March 28, 2023

Working on that as we speak. We will soon all be nostalgic for the memes of this era. Bear in mind 2024 is an election year. What a time to be alive.

chatmasta · on March 27, 2023

Gasp! Yawn. HN has become so pearl-clutchingly alarmist recently. Everybody relax.

The solution to scams is to educate people on scams, as quickly as you can do so in the changing environment, by publishing information about what's possible with the latest technology. The solution is not to require onerous identity verification for every software product that could be used by scammers, because they'll just move to the next product that doesn't require it, or they'll simply provide fraudulent documents. Or you'll get "resellers" who provide their own fraudulent KYC documents and then sell access to their account to other criminals on the black market, making it even more difficult to monitor for abuse.

If you want a startup offering such tools to protect people from scams, they can do it by collecting data on what the tools are used for - it should be pretty obvious based on transcripts who is using it to scam people.

godDLL · on April 8, 2023

I put in "m m m m m m m m m m!"

Got out all kinds. W's, v's, whatnot.

https://playground.play.ht/listen/18373

devmunchies · on March 27, 2023

How is the latency for real-time TTS? I remember kicking the tires several months back but went with one of the big 3 cloud providers since they had lower latency.

I also like that the cloud provider supports SSML and I can explicitly configure the emotion, whereas Playht dynamically changed the emotion based on context of the text.

hammadh · on March 27, 2023

The latency is not real-time yet but we're working on getting it to near real time. Regarding controlling the voice, we've added a few params like rate, voice guidance and temperature but for the most part the emotion is dependent on the text for now.