When Google's announcement [1] was posted a few days ago, I listened to their samples and heard an odd effect in the "chocolate bread" sample (the video chat example) [1], which is not mirrored in this article.
On that sample, I felt [2] that the Lyra version exaggerates the pronunciation of the phrase 'with chocolate' in a way that meaningfully differs from the speaker's original. It weakens the voiced 'th' to nothingness, and overshoots both the lead consonant and first vowel of 'choc', and then proceeds to wash the entire rest of the sentence with a peculiar brightened voice that's high, lacks consonant definition, and is close to ringing.
I'm guessing it's actually style transfer, because though the result sounds not much like the speaker's original, the result is reminiscent of the speech pattern and accent that people with East Asian and Southeast Asian ancestry adopt when speaking American English. It was surprising, given that the speaker doesn't sound like that in the original. I wonder if others hear this too.
While Lyra sounds richer and wider-band than Opus or Speex at these bitrates, the degradations and artifacts of those codecs are universally recognized (through years of familiarity with telephones) as compression artifacts and not innate features of the speaker themselves. Therefore listeners can be expected to be sympathetic to the quality issues and not attribute the whole of the sound to the speaker.
If AI-trained voice synthesizer codecs become the norm, and they perform well on most speakers, that expectation will go away, and the resulting audio will be attributed wholly to the speaker. That increases the impact of mistakes and misrepresentations introduced by the codec, unbeknownst to the speaker and listener.
> 'with chocolate' in a way that meaningfully differs from the speaker's original. It weakens the voiced 'th' to nothingness
I honestly don't hear a 'th' in the original.
> It was surprising, given that the speaker doesn't sound like that in the original.
I disagree. Note that the speaker says "these bread". The three possibilities for those two words—"these bread", "thiiiis bread", and "these breads" with a dropped "s"—would all be odd things for a native English speaker to say, for different reasons: either a wrong pronunciation of "this" or "breads", or the fact that "bread" is a collective noun, so we typically need a separate counter like "these buns", "these loaves", or "these pieces" when picking out multiple individual items. We ask for "some bread" or "a piece of bread", but we don't say "a bread" or "some breads" unless we are discussing categorical types of bread ("ciabatta and rye are breads") rather than instances of them, and only one type of bread is represented in the video.
The Lyra reproduction has a band-pass filtered quality to it, but I find it still remarkably representative of the reference.
Yes yes yes, please, somebody look into latency with these fancy ML methods! With modern-day CPU optimizations, most ML models can quite literally be approximated to a very good degree as very fast DSP using very few processor cycles. Or, heck, plug an ISA simulator into another fancy ML model and have it minimize instruction count while recreating the same signal (having a model optimize its own inference is a neat trick, but I digress). I'm sure ML is just one bottleneck among many (looking at you, Chromium), but I so desperately wish people would start caring about latency again.
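To make the cycle-budget point concrete, here is a back-of-envelope sketch. All the numbers are made up for illustration (the sample rate, frame size, clock speed, and per-frame multiply-add count are assumptions, not Lyra's actual figures); the point is just how the real-time budget arithmetic works for a frame-based codec.

```python
# Back-of-envelope real-time budget for a frame-based speech codec.
# Hypothetical figures: 16 kHz audio, 40 ms frames, one 2 GHz core,
# and a model costing ~3 million multiply-adds per frame.
SAMPLE_RATE_HZ = 16_000
FRAME_MS = 40
CPU_HZ = 2_000_000_000
MODEL_MACS_PER_FRAME = 3_000_000

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # samples the codec must emit per frame
cycles_per_frame = CPU_HZ * FRAME_MS // 1000            # cycles available before the deadline

# Optimistically assume ~1 cycle per multiply-add (SIMD, good cache behavior):
# what fraction of the real-time budget does the model alone consume?
budget_used = MODEL_MACS_PER_FRAME / cycles_per_frame

print(f"{samples_per_frame} samples/frame, {cycles_per_frame:,} cycles/frame")
print(f"model uses {budget_used:.1%} of the real-time budget")
```

Under these assumed numbers the model eats only a few percent of one core, which is why a DSP-style approximation with far fewer operations can plausibly leave the rest of the pipeline (buffering, the browser, the OS audio stack) as the dominant latency source.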
"When a man looks for something beyond his reach ..."
The word "looks" sounds completely wrong to me with Lyra, to the point that I can't tell what the word is supposed to be (first example with your [1] link).
For me "looks" sounds fine, but the word before it, "man", sounds like "lan". So to me the Opus sample is more understandable. Even though the "quality" of Lyra is better, that shouldn't be the score to optimize for; fidelity of the compression should be. It's not helpful if the compression algorithm generates a beautiful flower from a flower image but it's a red flower instead of a blue one like the original. Gives me Xerox vibes...
Similarly, for me, the word "miracle" in the noisy environment becomes something like "vericle" with Lyra, where in Opus it is clearer. (Speex does fairly badly, but in a way that's a clear failure overall rather than making it sound like something else.)
That remains to be seen. In my experience, performance with anything other than (US) English is mediocre at best, and the less common the language, the worse the results get.
So while Spanish, French, or German might get there eventually, don't even try Polish, Czech, or Farsi (Persian) dialects.
[1] https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...
[2] https://news.ycombinator.com/item?id=26282519