
As someone who's an idiot about machine learning, is it possible to run this code in reverse? e.g. take the generated (or novel) vectors and convert them back into audio/waveforms?



If you look at the architecture diagram for Wav2Vec-U, the "generator" is doing exactly that - generating output from the vectors. All GANs work this way, and it's how websites like https://thispersondoesnotexist.com/ work. Of course, as the sibling comment notes, the results today might not be great for this task, and it is open research, but it's not as if it just can't be done at all.
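To make the "generating from vectors" idea concrete, here is a toy sketch of a GAN-style generator in NumPy: a small MLP that maps a latent vector to a short synthetic waveform. This is purely illustrative - the network shape, sizes, and the idea of emitting raw audio samples are my assumptions, not Wav2Vec-U's actual generator (which, as the reply below notes, emits phoneme distributions rather than waveforms).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generator: a two-layer MLP mapping a latent vector to 256 waveform
# samples. All sizes are arbitrary; a real GAN generator is trained
# adversarially and is far more elaborate.
LATENT_DIM, HIDDEN, SAMPLES = 16, 64, 256
W1 = rng.standard_normal((LATENT_DIM, HIDDEN)) * 0.1
W2 = rng.standard_normal((HIDDEN, SAMPLES)) * 0.1

def generate(z):
    h = np.tanh(z @ W1)       # hidden nonlinearity
    return np.tanh(h @ W2)    # "waveform" samples squashed into [-1, 1]

z = rng.standard_normal(LATENT_DIM)   # a novel latent vector
wave = generate(z)
print(wave.shape)  # (256,)
```

Feeding in different latent vectors yields different outputs; training would adjust W1 and W2 so those outputs fool a discriminator.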


My reading of the generator diagram (Figure 6) is that it isn't generating waveforms, but phoneme probabilities.

You can train a similar system to produce audio from wav2vec's output, though it probably won't sound similar to the input audio (accent/voice) unless you expose more features of the input than just phonemes.


Generalized reverse projection through even non-recurrent neural networks is still an open research problem.

So no in this case.
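For context on why there's no clean "run it in reverse": lacking an analytic inverse, the usual workaround is to gradient-descend an input until the network's output matches a target. The sketch below does this for a tiny fixed tanh MLP (all sizes and weights are my own invented assumptions); note it only finds *an* input producing the target output, not necessarily the original one, since the mapping need not be injective.

```python
import numpy as np

rng = np.random.default_rng(1)

# A fixed, non-recurrent network: y = W2 @ tanh(W1 @ x).
D, H, O = 8, 32, 8
W1 = rng.standard_normal((H, D)) * 0.2
W2 = rng.standard_normal((O, H)) * 0.2

def forward(x):
    return W2 @ np.tanh(W1 @ x)

x_true = rng.standard_normal(D)
y = forward(x_true)              # target output to "invert"

# Naive inversion: minimise ||forward(x) - y||^2 over x by gradient descent.
x = np.zeros(D)
initial_err = np.linalg.norm(forward(x) - y)
for _ in range(3000):
    h = np.tanh(W1 @ x)
    e = W2 @ h - y
    grad = W1.T @ ((1 - h**2) * (W2.T @ (2 * e)))  # chain rule by hand
    x -= 0.05 * grad
final_err = np.linalg.norm(forward(x) - y)

print(initial_err, final_err)  # error shrinks, but x need not equal x_true
```

Even in this toy case the recovered x matches y's output without guaranteeing x == x_true; at the scale of a real speech model, and with information deliberately discarded along the way, this is why inversion remains open research.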


I wouldn't rule it out entirely; you could use these as a replacement for linguistic inputs in a tts system, for example, and I imagine it wouldn't be totally terrible. It would still end up being a pretty heavy system, though, with many other parts.


That doesn't sound like a particularly realistic problem to solve.


I agree, but all the more glory if someone does solve it. And the field is still new enough that I don't want to be cited for decades like the Slashdot comment on the iPod release: "No wireless. Less space than a Nomad. Lame."



