When Google's announcement [1] was posted a few days ago, I listened to their samples and heard an odd effect in the "chocolate bread" sample (the video chat example) [1], which is not mirrored in this article.
On that sample, I felt [2] that the Lyra version exaggerates the pronunciation of the phrase 'with chocolate' in a way that meaningfully differs from the speaker's original. It weakens the voiced 'th' to nothingness, and overshoots both the lead consonant and first vowel of 'choc', and then proceeds to wash the entire rest of the sentence with a peculiar brightened voice that's high, lacks consonant definition, and is close to ringing.
I'm guessing it's actually style transfer, because though the result sounds not much like the speaker's original, the result is reminiscent of the speech pattern and accent that people with East Asian and Southeast Asian ancestry adopt when speaking American English. It was surprising, given that the speaker doesn't sound like that in the original. I wonder if others hear this too.
While Lyra sounds richer and wider-band than Opus or Speex at these bitrates, the degradations and artifacts of those codecs are universally recognized (through years of familiarity with telephones) as compression artifacts and not innate features of the speaker themselves. Therefore listeners can be expected to be sympathetic to the quality issues and not attribute the whole of the sound to the speaker.
If AI-trained voice synthesizer codecs become the norm, and they perform well on most speakers, that expectation will go away, and the resulting audio will be attributed wholly to the speaker. That increases the impact of mistakes and misrepresentations introduced by the codec, unbeknownst to the speaker and listener.
> 'with chocolate' in a way that meaningfully differs from the speaker's original. It weakens the voiced 'th' to nothingness
I honestly don't hear a 'th' in the original.
> It was surprising, given that the speaker doesn't sound like that in the original.
I disagree. Note that the speaker says "these bread". The three possibilities for those two words—"these bread", "thiiiis bread", and "these breads" with a dropped "s"—would all be odd things for a native English speaker to say, for different reasons: either a wrong pronunciation of "this" or "breads", or the fact that "bread" is a collective noun, so we typically need a separate counter like "these buns", "these loaves", or "these pieces" when picking out multiple individual items. We ask for "some bread" or "a piece of bread", but we don't say "a bread" or "some breads" unless we are discussing categorical types of bread ("ciabatta and rye are breads") rather than instances of them, and only one type of bread is represented in the video.
The Lyra reproduction has a band-pass filtered quality to it, but I find it still remarkably representative of the reference.
Yes yes yes, please, somebody look into latency with these fancy ML methods! With modern-day CPU optimizations, most ML models can quite literally be approximated to a very good degree as very fast DSP using very few processor cycles. Or, heck, plug an ISA simulator into another fancy ML model and have it minimize instruction count while recreating the same signal (having a model optimize its own inference is a neat trick, but I digress). I'm sure ML is just one bottleneck among many (looking at you, Chromium), but I so desperately wish people would start caring about latency again.
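To make the cycle-budget point concrete, here is a back-of-envelope sketch. All the numbers are made up for illustration (the sample rate, frame size, clock speed, and per-frame multiply-add count are assumptions, not Lyra's actual figures); the point is just how the real-time budget arithmetic works for a frame-based codec.

```python
# Back-of-envelope real-time budget for a frame-based speech codec.
# Hypothetical figures: 16 kHz audio, 40 ms frames, one 2 GHz core,
# and a model costing ~3 million multiply-adds per frame.
SAMPLE_RATE_HZ = 16_000
FRAME_MS = 40
CPU_HZ = 2_000_000_000
MODEL_MACS_PER_FRAME = 3_000_000

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # samples the codec must emit per frame
cycles_per_frame = CPU_HZ * FRAME_MS // 1000            # cycles available before the deadline

# Optimistically assume ~1 cycle per multiply-add (SIMD, good cache behavior):
# what fraction of the real-time budget does the model alone consume?
budget_used = MODEL_MACS_PER_FRAME / cycles_per_frame

print(f"{samples_per_frame} samples/frame, {cycles_per_frame:,} cycles/frame")
print(f"model uses {budget_used:.1%} of the real-time budget")
```

Under these assumed numbers the model eats only a few percent of one core, which is why a DSP-style approximation with far fewer operations can plausibly leave the rest of the pipeline (buffering, the browser, the OS audio stack) as the dominant latency source.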
"When a man looks for something beyond his reach ..."
The word "looks" sounds completely wrong to me with Lyra, to the point that I can't tell what the word is supposed to be (first example with your [1] link).
For me "looks" sounds fine, but the word before it, "man", sounds like "lan". So to me the Opus sample is more understandable. Even though the "quality" of Lyra is better, that shouldn't be the score to optimize for; fidelity of the compression should be. It's not helpful if the compression algorithm generates a beautiful flower from a flower image but it's a red flower instead of a blue one like the original. Gives me Xerox vibes...
Similarly, for me, the word "miracle" in the noisy environment becomes something like "vericle" with Lyra, where in Opus it is clearer. (Speex does fairly badly, but in a way that's a clear failure overall rather than making it sound like something else.)
That remains to be seen. In my experience, performance with anything other than (US) English is mediocre at best, and the less common the language, the worse the results get.
So while Spanish, French, or German might get there eventually, don't even try Polish, Czech, or Farsi (Persian) dialects.
[1] https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...
[2] https://news.ycombinator.com/item?id=26282519