How does the "accent conversion model" work? Is it entirely embedding-based?

If so, and if you want to transfer-learn new downstream models from the embeddings, then it seems to me you are onto a very effective way of doing data augmentation. Augmenting raw waveforms is expensive, since you have to run the STFT again every time; but if you've precomputed and cached the embeddings and can do the augmentation there, it would be super fast.
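
To make concrete what I mean by augmenting in embedding space, here is a rough sketch. Everything in it is my own assumption, not a description of how the model actually works: the `augment_embeddings` helper is hypothetical, the cached embeddings are faked with random numbers, and the two perturbations (per-dimension Gaussian jitter and mixup-style interpolation between examples) are just generic choices that happen to be cheap on fixed-size vectors.

```python
import numpy as np

def augment_embeddings(embeddings, rng, noise_scale=0.05, mix_alpha=0.2):
    """Augment an (n, d) batch of cached embeddings (hypothetical helper).

    Returns the augmented batch plus the partner indices and mixing weights,
    so that labels can be interpolated the same way if the task needs it.
    """
    n, _ = embeddings.shape
    # Jitter: Gaussian noise scaled to each dimension's spread in the batch.
    std = embeddings.std(axis=0, keepdims=True)
    noisy = embeddings + rng.normal(size=embeddings.shape) * noise_scale * std
    # Mixup-style interpolation: blend each row with a randomly chosen partner.
    partner = rng.permutation(n)
    lam = rng.beta(mix_alpha, mix_alpha, size=(n, 1))
    mixed = lam * noisy + (1.0 - lam) * noisy[partner]
    return mixed, partner, lam

rng = np.random.default_rng(0)
cached = rng.normal(size=(8, 16))  # stand-in for precomputed embeddings
augmented, partner, lam = augment_embeddings(cached, rng)
```

No STFT or model forward pass appears anywhere in that loop, which is the whole point: each augmented batch costs a few array operations. Whether perturbations like these stay meaningful for a given embedding space is something you would have to check empirically.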