The Rise of Synthetic Audio Deepfakes (nisos.com)
183 points by ajay-d on July 27, 2020 | 42 comments



My all-time favorite audio deepfake is Nobel prize winner Milton Friedman reading the lyrics to the 50 Cent track "PIMP". It really captures Friedman's tell-tale cadence and idiosyncratic lilt: https://www.youtube.com/watch?v=4mUYMvuNIas


There is already a large problem with political ads cherry-picking and slicing up audio and video to mislead viewers. I really worry that deepfakes will take it to another level completely. I fully expect the current administration to eagerly adopt them if available.


I just assume that if a politician is speaking, they are lying, or at best misleading. Add in marketing and I assume it's a lie, regardless of political party.


You still lose. There are players who don't need you to believe them, as it is sufficient for them if people cannot trust each other.


People have been impersonating politicians' accents pretty well for decades, mostly for comedy value, without tripping over sentence structures and pronunciation (unless that's the joke...)

Other than lowering the effort bar, I'm not sure what deepfakes add. Sure, it means someone can make Donald Trump reading copypasta[1] or Dubya rapping [2] without knowing how to do impressions or even speaking the language natively, but a competent voice actor could do a much better job of putting words into politicians' mouths, and one thing political ads don't lack is budget.

1. https://www.youtube.com/watch?v=LEzIAixNkFI 2. https://www.youtube.com/watch?v=LEzIAixNkFI


This might be paranoid, but I've established a protocol with some people in my life: should someone with my voice ever contact them and ask for money (because emergency bla bla), nothing is to be done until a passphrase is mentioned. It's only a matter of time until someone gets significant voice data and related contact numbers and uses those voices to train a model. Afterwards, that model will be used to fake the original voice in real time in a scamming attempt.


That's good practice. Sadly, these kinds of scams already happen without the effort of synthetic voices. Scammers call an older person and say, "Hey Grandma, it's me, your grandson. I need $500 for bail right away!" With the help of Facebook, they can learn names and details to sound more convincing.


Recently a friend changed her number and told me via text. Before adding her number, I asked her a question that only she and I would know the answer to, like who sat next to you at the old office.

I think I'm going to keep doing this type of verification. It may annoy friends and family, but I'm not sure how a hacker could ever know such small details between you and another person.


Very good idea. A few years ago, I had a "friend" text me asking for emergency funds. It seemed wonky but within the realm of possibility. Asking a question about a mutual friend from the past revealed it as a scam.


There is an annual challenge for synthetic voice detection, ASVspoof, that evaluates submissions on different types of attacks against speaker verification systems: text-to-speech, voice conversion, and replay attacks.

The conclusion from the 2019 evaluation [1]: known synthetic deepfakes are fairly easy to detect using simple models with very low error rates (even high-fidelity techniques with WaveNet vocoders).

[1]: ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection (https://www.isca-speech.org/archive/Interspeech_2019/pdfs/22...)
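
For a flavor of how simple those models can be, here's a toy sketch; it is not the challenge baseline (which uses CQCC/LFCC features with GMMs), just the same shape of idea with MFCCs and logistic regression, and the file lists are hypothetical:

    import numpy as np
    import librosa
    from sklearn.linear_model import LogisticRegression

    def features(path):
        # Spectral features averaged over time -> fixed-size embedding.
        audio, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    real = ["real_0.wav", "real_1.wav"]   # hypothetical file lists
    fake = ["fake_0.wav", "fake_1.wav"]
    X = np.array([features(p) for p in real + fake])
    y = np.array([0] * len(real) + [1] * len(fake))

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.predict_proba(X)[:, 1])     # P(synthetic) per clip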


> Deepfake technology is not sophisticated enough to mimic an entire phone call with someone.

With modern voice conversion technology, it actually is perfectly possible.


Voice imitation isn't just timbre.

It's prosody, rhythm, accent, word choice, and so on.

By the time you've mastered all those, you're practically halfway to becoming a professional voiceover artist.

Remember, trained voiceover artists have been mimicking voices for a long, long time. Their timbre isn't always perfect, but faking voices doesn't need deepfakes.


If you're looking for it sure. But I'm willing to bet existing technology is sufficient to catch an awful lot of people off guard. Hearing a familiar voice is usually quite disarming.


Indeed, the text-to-speech conversion works extremely fast, even in the browser (e.g. in Colab), once you get your model tuned right.
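
For instance, a minimal sketch with the open-source Coqui TTS package and one of its published pretrained models (no fine-tuning; output path is arbitrary):

    # pip install TTS
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text="This sentence was never spoken by anyone.",
                    file_path="out.wav")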


Kids in 2020 are going to be doing prank phone calls with GPT-3 and voice models from untraceable SIP phone numbers.


And Facebook is currently generating billions of voice profiles based on recorded WhatsApp sessions. Soon these profiles will be sold to advertising agencies.


No, WhatsApp messages and calls are end-to-end encrypted.


Sounds like a conspiracy theory but those aren’t mutually exclusive.


Is this true? On Google I only see mentions of using voice data to improve their speech transcription.


Audio "deepfakes" have been worked on much longer than ones for video, although video deepfakes have the added issue of deep-faking synchronized audio. Today's consumers don't seem to be bothered by video deepfakes if they play to the beliefs of the audience.


A useful example is how the Joe Biden Burisma phone call that bubbled up through Russian media was fabricated. I pulled it apart with ffmpeg, and there were a number of artifacts that showed editing and splicing.

If you're handy with ffmpeg and Python, you can assess the veracity of a recording pretty easily. Of course, if I were on a political ratf'ing team, I'd use the same tools to add those artifacts to a copy of an offending (real but off-message) stream and amplify the distribution of that fake-faked version with a debunking press release handy, so YMMV. While the Biden thing wasn't directly a deepfake (a shallow fake?), we're going to see tons of actual deepfakes around the election.
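
The inspection itself is nothing exotic; a sketch (file names hypothetical; splices tend to show up as abrupt vertical seams or sudden changes in the noise floor):

    import subprocess
    import matplotlib.pyplot as plt
    from scipy.io import wavfile

    # Extract the audio track with ffmpeg, then eyeball the spectrogram.
    subprocess.run(["ffmpeg", "-y", "-i", "call.mp4", "-vn",
                    "-ar", "16000", "-ac", "1", "call.wav"], check=True)

    sr, samples = wavfile.read("call.wav")
    plt.specgram(samples, NFFT=1024, Fs=sr, noverlap=512)
    plt.xlabel("time (s)"); plt.ylabel("frequency (Hz)")
    plt.savefig("spectrum.png")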

IMO, elections are no longer between candidates, they are a war on truth for domination of the narrative - office is the effect. A campaign that focuses on what happens once the war is over is daydreaming about the future and distracted from the present and this will lose them key battles. For this reason, I think deepfakes are going to be the biggest weapon in campaign arsenals for the near future. Interesting times.


I’m legitimately not sure democracy will survive modern, sophisticated propaganda techniques, plus an open, international Web, plus losing the ability to more-or-less trust audio and video recordings that we’ve grown used to over the last hundredish years. Between state actors and, eventually if not already, transnational corporations waging info warfare, I kinda doubt the institution can take it. Too much info, too fast, from too many sources.


Not sure why you're being downvoted.

Democracy is historically not the default state of human interaction. Just look at Facebook/Nextdoor/Twitter: most people want to dictate how others behave (with whatever force is available), i.e. authoritarianism.

Any technology that makes it easier to vilify someone can, by definition, be used to weaken democracy.


If developing nations are any indication, we have other problems to worry about. While you've got the occasional genocide, the advent of mass literacy (decades ago) and modern information technology has made those nations less authoritarian and less corrupt. The leaders mostly now try to keep their misdeeds on the down-low, because once they're out they spread over social media like wildfire. It's not great, but it's progress.


I wonder what encryption and key-based techniques could be used to verify the authenticity of audio and video recordings in the future.


> I wonder what encryption and key-based techniques could be used to verify the authenticity of audio and video recordings in the future.

None, since encryption isn't the answer to this problem. Take Romney's leaked "47%" comment [1] or Hillary Clinton's leaked "deplorables" comment [2]: how would encryption have been useful either to verify the recordings' authenticity or to reject them if they had been deepfakes? It wouldn't have, as those comments were meant for private audiences, so neither of them would have officially signed the recordings. If the encryption could trace a recording back to the individual that made it, then the leaker might decide never to release it (since they don't want to be outed). And if all the encryption can do is trace back to a random device, why not just get a random device to sign your deepfake?

[1] https://www.npr.org/sections/itsallpolitics/2012/09/17/16131...

[2] https://www.npr.org/2016/09/10/493427601/hillary-clintons-ba...
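
To make that last point concrete, a minimal sketch with the Python cryptography package; the verification passes, yet it only proves which key signed the bytes, not that the audio is genuine:

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    device_key = Ed25519PrivateKey.generate()       # any "random device"
    recording = open("deepfake.wav", "rb").read()   # hypothetical file

    signature = device_key.sign(recording)
    # Raises no exception: the fake is now "authentic" as far as crypto cares.
    device_key.public_key().verify(signature, recording)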


There are DRM and signing schemes, steganography, etc., but in the moment, people don't care. Or rather, the message registers in their minds whether it happens to be true or not. It's how advertising works. Beliefs are essentially tribal, and we all believe our tribal sources. If a source of news isn't part of your tribe, you're probably not going to believe what it's saying until someone from your ingroup verifies it. Crypto doesn't do that.

The irony is that we all trust crypto because of the perceived tribal affiliation of the developers as well, which doubly reduces the case for crypto verifying media.


We can barely get actual security devices to keep their keys secret. Do you expect a rando $49 Chinese video recorder to have a trusted key-management solution?


It looks like a problem! :-) I've had a solution for this for years, but the demand is not there yet.


The solution is to make media literacy and logical fallacies a required course in democratic basic education.


Half of the population has an IQ lower than average, so they rely on those with larger brains to help them make decisions. They are vulnerable. You are trying to make them smarter, like you, but this will not help them. You need to assist them, every day of the week, to combat enemy propaganda.


I don't get it. By the time someone is running for president, they've been around a while. Someone who doesn't have an understanding of the character or political positions of Joe Biden or Donald Trump (or whoever) by now hasn't been paying attention to anything for a good long while, so why would they start with a random deepfake?

I mean, sophisticated propaganda techniques have been utilized forever, no? How do you defeat them without getting into checksums, etc.? Would you agree that critical thinking skills gained as part of a rounded education would help you see through the BS? Of course, one side of the political duopoly in the US is trying very hard to keep Americans from getting educated...


> Someone who doesn't have an understanding of the character or political positions of Joe Biden or Donald Trump (or whoever) by now hasn't been paying attention to anything for a good long while, so why would they start with a random deepfake?

Because they saw it on an ad, perhaps even a targeted one. I think deepfakes are going to be a "push" kind of thing: more used to corrupt the background information environment than be engaged with directly.


And unfortunately, quite a lot of people will just lap it up, either as they don't know the tech exists or because something about the message meshes with their preconceived notions.

The art will be to release something just scandalous enough to make a difference, but just believable enough to pass a very basic smell test.

It's horrifying to watch against a background of politics becoming more polarised and more vitriolic on all sides.


The spectrogram analysis images comparing the real and fake voices seemed distinguishable by the human eye. Could an image model be trained to detect fake-voice spectrograms based on pitch and tonal choppiness?
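
E.g., a minimal PyTorch sketch of the obvious starting point; shapes and labels here are stand-ins for a real dataset:

    import torch
    import torch.nn as nn

    # Tiny CNN over (batch, 1, mel_bins, frames) spectrogram "images".
    model = nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 2),                   # logits: [real, fake]
    )

    spec = torch.randn(8, 1, 128, 256)      # stand-in mel spectrograms
    labels = torch.randint(0, 2, (8,))      # stand-in labels
    loss = nn.CrossEntropyLoss()(model(spec), labels)
    loss.backward()                         # one illustrative training step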


The issue is that if you can measure it, you can train an AI to beat the AI that's detecting it.

As Pilate said: 'Quid est veritas?' ('What is truth?')
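
Concretely, that arms race is just gradient descent against a frozen detector, as in this sketch (the detector here is a stand-in linear model):

    import torch
    import torch.nn as nn

    detector = nn.Sequential(nn.Flatten(), nn.Linear(128 * 256, 2))
    for p in detector.parameters():
        p.requires_grad_(False)             # freeze the detector

    fake = torch.randn(1, 128, 256, requires_grad=True)
    opt = torch.optim.Adam([fake], lr=0.05)
    for _ in range(200):
        # Push the fake toward the detector's "real" class (index 0).
        loss = nn.functional.cross_entropy(detector(fake), torch.tensor([0]))
        opt.zero_grad(); loss.backward(); opt.step()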


Generation is a much harder problem than discrimination though.


Would you not just then take this and feed it into the training?


This makes me wonder: how would one go about adding an authentication key to audio? We have seen encryption for text shared via email and watermarks embedded in images, but I haven't come across something similar for audio. Happy to hear if someone has worked in this field.
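
One classic trick, analogous to image watermarking, is hiding bits in the least significant bit of each PCM sample. A sketch, assuming a 16-bit WAV with enough samples; note it's fragile, since any re-encode destroys the mark:

    import wave
    import numpy as np

    def embed(in_path, out_path, message):
        with wave.open(in_path, "rb") as w:
            params = w.getparams()
            samples = np.frombuffer(w.readframes(w.getnframes()),
                                    dtype=np.int16).copy()
        bits = np.unpackbits(np.frombuffer(message, dtype=np.uint8))
        # Overwrite the low bit of the first len(bits) samples.
        samples[:len(bits)] = (samples[:len(bits)] & ~1) | bits
        with wave.open(out_path, "wb") as w:
            w.setparams(params)
            w.writeframes(samples.tobytes())

    def extract(path, n_bytes):
        with wave.open(path, "rb") as w:
            samples = np.frombuffer(w.readframes(w.getnframes()),
                                    dtype=np.int16)
        bits = (samples[:n_bytes * 8] & 1).astype(np.uint8)
        return np.packbits(bits).tobytes()

    embed("original.wav", "marked.wav", b"signed:alice")  # hypothetical files
    print(extract("marked.wav", 12))                      # b'signed:alice'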


> adding an authentication key to audio

I'm afraid that's not going to work, any more than adding an authentication key to handwriting would.

The voice itself was an authentication key, as was the hand. But now these keys are too weak and too easy to forge.

I'd say we should completely stop trusting the authenticity of any recorded voice, and maybe of all voice by phone. Trust the content and the style, and have a set of agreed-upon keywords with your partner and/or your parents to check that the other side is indeed who they claim to be in an extraordinary situation. Maybe not today, but tomorrow it may prove useful to thwart a scam directed against you, with the scammer perfectly imitating the voice of your loved one(s).


How long does it take to train the model?


Next milestone I'm waiting for: Trump audio to Sarah Cooper video on the fly



