That is a truly terrifying reality. Scammers are already having a field day cold-calling seniors. Just imagine how confusing it will be when they think it's a relative.
It's very impressive, but the samples sound slightly garbled throughout, as if some heavy noise suppression was applied to the whole clip. I wonder if that's pre-processing for the samples on this site, or if that's a side effect of this model.
It sounds to me like the edits are accomplished by taking the original text and feeding it back through the model, so that the edited tracks contain zero original audio. Of course, when your text precisely matches the training material it comes out pretty well, but not perfectly.
In late March 2024, this results in some noticeable distortion. Don't ask me whether you'll be able to notice by, say, September 2024 though.
It probably comes out less jarring overall than trying to insert the edits, but you still really want a clean voice sample going in. For instance, the Pulp Fiction voice had a background music track, and it does interesting things to the resulting voice once it has to start synthesizing novel text. (Second sample in the first table, Irving Ramses.) I wish they'd tacked on a bit more text to that one; I can just barely hear it doing "weird things", but there's not a lot of time to analyze it before it's over.
Sorry for the derail, but I really hate that published papers with audio/video clips often don't work at all on iPadOS or iOS. I do most of my leisure reading on my iPad Pro, purposefully separating that leisure time from when I'm on my MacBook, which I try to use exclusively for working.
You should let your family members know that the scammers are going to start sounding like people they know.