
Any sufficiently fancy compression for communication formats immediately makes me worry about the Xerox Effect[1], where the reconstructed content is valid yet semantically different from the original in some important way.

[1] I propose we call it that unless there's already a snappy name for it?



Indeed. I also expect this failure mode will go undetected for a long time because of how our sense of hearing works. My last neuroscience class was many years ago, but I do remember that, in some sense, we hear what we expect to hear (more so than with vision, if I recall correctly, though plenty happens in our visual processing too), in that our ears tune to particular frequencies to filter out ambiguities.

Suppose a person says something that the codec interprets differently. Perhaps they have one of the many ever-evolving accents that almost certainly were not, and could not possibly all be, included in the training set (ongoing vowel shifts might be a big cause of this). The algorithm removes the ambiguity, but the speaker can't tell, because they hear themselves through their own sense of hearing. Even assuming the user has somehow overcome the odd psychological effects that come with hearing computer-generated audio played back, if that audio is mixed with what they're already hearing, they likely still won't notice, because they still hear themselves. They would have to listen to a recording some time later and detect that it doesn't match what they thought they said... which happens all the time anyway, because memory is incredibly lossy and malleable.

Most of the time, it won't matter. People have context (assuming they're actually listening, which is a whole other tier of hearing what you expect to hear), and people pick the wrong word or pronounce things incorrectly (as in, not recognizable to the listener as what the speaker intended) all the time. But it'll be really hard to know that the recording doesn't reflect what was actually said. You'd need an accurate local copy, the receiver's processed copy, and to know what to look for in what will likely be many hours of audio. It's also possible that "the algorithm said that" will be a common enough argument, due to other factors (faulty memory and increasing awareness of ML-based algorithms), that it'll outnumber the cases where it really happens.


This seems similar to being able to read your own handwriting, when others can't. If it's an important recording, someone else should listen, and it would be better to verify a transcription.

In a live situation, they will ask you to repeat if it's unclear.


Yep, it's kind of happening with the music example on the page: the Lyra (3 kbps) sample has some human-sounding parts even though the original reference is just music with no speech, probably because Lyra was trained on speech.


It's a valid concern, but I think it can also be solved: compress, decompress, and compare to the original using a method that isn't susceptible to the Xerox effect. If the sound has materially changed, use a fallback method, robust but less efficient, for that particular window.

But idk this may be too slow for real time.
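For illustration, here's a minimal sketch of that per-window check in Python. Everything in it is hypothetical: neural_codec, fallback_codec, perceptual_distance, and the threshold are stand-in names for the idea, not any real Lyra or codec API.

```python
import numpy as np

FRAME = 960        # e.g. 20 ms windows at 48 kHz (illustrative)
THRESHOLD = 0.1    # what counts as "materially changed"; would need tuning

def perceptual_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Stand-in for a comparison that isn't fooled by plausible-but-wrong
    # reconstructions (e.g. a spectral or intelligibility-based measure).
    # Plain MSE is used here only to keep the sketch self-contained.
    n = min(len(a), len(b))
    return float(np.mean((a[:n] - b[:n]) ** 2))

def encode_with_fallback(signal: np.ndarray, neural_codec, fallback_codec):
    """Encode each window, decode it locally, and fall back when the
    round trip has drifted too far from the original audio."""
    packets = []
    for start in range(0, len(signal), FRAME):
        window = signal[start:start + FRAME]
        bits = neural_codec.encode(window)
        roundtrip = neural_codec.decode(bits)
        if perceptual_distance(window, roundtrip) > THRESHOLD:
            # Robust but less efficient codec for just this window.
            packets.append(("fallback", fallback_codec.encode(window)))
        else:
            packets.append(("neural", bits))
    return packets
```

The catch is that the sender pays for an extra decode per window on top of the encode, which is where the real-time concern above comes in.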


I remember that: https://news.ycombinator.com/item?id=6156238

I agree, neural networks are exactly the type of system that works well "most of the time" but then can fail unexpectedly in odd and subtle ways.





