Definitely a real effect, but it seems like Google accounted for that in their listening tests.
The Google blog post links to the Lyra paper[1], and Section 5.2 of the paper says:
> To evaluate the absolute quality of the different systems on different SNRs a Mean Opinion Score (MOS) listening test was performed. Except for data collection, we followed the ITU-T P.800 (ACR) recommendation.
You can download those ITU test procedures[2]. Skimming through them, I see that they mention making "the necessary gain adjustments, so as to bring each group of sentences to the standardized active speech level", along with a 1000 Hz calibration test tone related to that. (See sections B.1.7 and B.1.8.)
So, if I skimmed correctly, and if the ITU's method of distilling speech loudness into a single number is an effective way to match the volume levels[3], then it seems like they did what they could to avoid cheating at the listening tests.
It is still interesting that Lyra makes things louder, though.
That's good information, thanks. My comment is mostly directed at the misleading blog post. I have no direct reason to believe that the study itself was compromised, though it would be great to have confirmation from the authors that it was not.
The part about matching volume levels in the ITU recommendation seems to be about making sure the source recordings are balanced. All their clips might well have been exactly at the ITU recommended level of -26 dB, but if Lyra introduced a level mismatch, it would have had to be corrected at a later stage, and it's at least possible that it wasn't. The Lyra paper does explicitly say that they didn't follow the ITU rec for "data collection".
Interestingly, the Opus and Reference sources are almost exactly -26 dB relative to full scale (according to several measurements of loudness), but the Lyra clip is about 6 dB hotter. So the source (the reference clip) exactly follows the ITU rec. Did they remember to fix the levels on the Lyra clips? I hope so!
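For anyone who wants to check levels themselves, here is a rough sketch of the kind of measurement and correction I mean. It is not what Google or the ITU procedure actually does: it uses plain RMS over the whole clip as a stand-in for loudness, whereas the ITU "active speech level" is a different measure that ignores pauses. The function names, the -26 dB target default, and the synthetic 1000 Hz tones are all my own assumptions for illustration, and "full scale" here means an RMS of 1.0 on float samples in [-1.0, 1.0].

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale (RMS of 1.0 == 0 dB).

    Assumption: plain RMS over the whole clip, not the ITU active
    speech level, which excludes silent stretches.
    """
    if not samples:
        raise ValueError("no samples")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def gain_to_target(samples, target_dbfs=-26.0):
    """Linear gain factor that moves the clip's RMS level to target_dbfs."""
    return 10.0 ** ((target_dbfs - rms_dbfs(samples)) / 20.0)


def apply_gain(samples, gain):
    """Scale every sample by a linear gain factor."""
    return [s * gain for s in samples]


def sine(amplitude, n=48000, freq=1000.0, rate=48000.0):
    """Hypothetical test signal: one second of a 1000 Hz tone."""
    return [amplitude * math.sin(2.0 * math.pi * freq * i / rate) for i in range(n)]


# A tone whose RMS sits at -26 dB: a sine's RMS is amplitude / sqrt(2).
reference = sine(math.sqrt(2.0) * 10.0 ** (-26.0 / 20.0))

# Simulate a codec output that comes back 6 dB hotter (about 2x amplitude).
hot = apply_gain(reference, 10.0 ** (6.0 / 20.0))

# Level-match the hot clip back to the target before any comparison.
matched = apply_gain(hot, gain_to_target(hot))

print(round(rms_dbfs(reference), 2), round(rms_dbfs(hot), 2), round(rms_dbfs(matched), 2))
```

The point is only that the correction has to be measured on each codec's output, not on the source clip, since the codec itself can change the level.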
---
[1] https://arxiv.org/pdf/2102.09660.pdf
[2] https://www.itu.int/rec/T-REC-P.800-199608-I
[3] and even for speech that passes through different codecs before its loudness is determined