
I've just taken a minute to confirm what my ears told me in Audacity. Please have a look at this screenshot: https://cloudflare-ipfs.com/ipfs/Qma41RMzieQ6ZGdGem9rLxnxEL1...

The Lyra version is clearly much louder. This is a serious problem, and it borders on being fair to call it "cheating".

It's well known in the audio biz that if you ask people to compare two experiences, and one of them is a bit louder than the other, people will say that the louder one was better, or came through more clearly, or whatever it is you're trying to market for. For the purpose of comparing artifacts in two samples, it's absolutely crucial that they be the same volume. You might as well compare two image compression codecs where one of them "enhances" the colors of the original image.

Note: I took the clips for this comparison from the "clean speech" examples at the original source on Googleblog, not the blogspam.
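
(If you want to level-match the clips yourself before listening, here is a rough sketch, assuming the clips are saved locally as FLAC and that plain RMS matching is close enough for short speech samples; the filenames are made up:)

    import numpy as np
    import soundfile as sf  # pip install soundfile

    def rms_db(x):
        # RMS level in dB relative to full scale
        return 20 * np.log10(np.sqrt(np.mean(x ** 2)))

    ref, rate = sf.read("reference.flac")
    lyra, _ = sf.read("lyra.flac")

    # Gain (in dB) needed to bring the Lyra clip down to the reference level
    gain_db = rms_db(ref) - rms_db(lyra)
    sf.write("lyra_matched.flac", lyra * 10 ** (gain_db / 20), rate)
    print(f"Applied {gain_db:.1f} dB of gain to the Lyra clip")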



Definitely a real effect, but it seems like Google accounted for that in their listening tests.

The Google blog post links to the Lyra paper[1], and Section 5.2 of the paper says:

> To evaluate the absolute quality of the different systems on different SNRs a Mean Opinion Score (MOS) listening test was performed. Except for data collection, we followed the ITU-T P.800 (ACR) recommendation.

You can download those ITU test procedures[2], and skimming through that, it does mention making "the necessary gain adjustments, so as to bring each group of sentences to the standardized active speech level" and a 1000 Hz calibration test tone related to that. (See sections B.1.7 and B.1.8.)

So, if I skimmed correctly, and if the ITU's method of distilling speech loudness into a single number is an effective way to match the volume levels[3], then it seems like they did what they could to avoid cheating at the listening tests.
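
(The gain adjustment itself is simple once you have a level number; the hard part is the measurement, since the recommendation uses an *active* speech level that ignores silent gaps rather than plain RMS. A very rough sketch of the adjustment, with RMS standing in for active speech level and a made-up filename:)

    import numpy as np
    import soundfile as sf

    TARGET_DB = -26.0  # standardized speech level mentioned in the ITU recommendation

    x, rate = sf.read("sample.flac")
    # Plain RMS here; the real procedure measures *active* speech level,
    # i.e. it excludes the silent gaps between sentences.
    level_db = 20 * np.log10(np.sqrt(np.mean(x ** 2)))
    sf.write("sample_adjusted.flac", x * 10 ** ((TARGET_DB - level_db) / 20), rate)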

It is still interesting that Lyra makes things louder, though.

---

[1] https://arxiv.org/pdf/2102.09660.pdf

[2] https://www.itu.int/rec/T-REC-P.800-199608-I

[3] and even for speech that passes through different codecs before its loudness is determined


That's good information, thanks. My comment is mostly directed at the misleading blog post. I have no direct reason to believe that the study itself was compromised, though it would be great to have confirmation from the authors that it was not.

The part about matching volume levels in the ITU recommendation seems to be talking about making sure the source recordings were balanced. All their clips might well have been exactly at the ITU recommended level of -26 dB, but if Lyra introduced a level mismatch this would have to have been corrected at a later stage, and it's at least possible that it might not have been. The Lyra paper does explicitly say that they didn't follow the ITU rec for "data collection".

Interestingly, the Opus and Reference sources are almost exactly -26 dB relative to full scale (according to several measurements of loudness), but the Lyra clip is about 6 dB hotter. So the source (the reference clip) exactly follows the ITU rec. Did they remember to fix the levels on the Lyra clips? I hope so!


Excellent catch. To be precise you need a measure of perceptual loudness rather than raw waveform excursion, but I would expect the results to be in line with what you've found.

> It's well known in the audio biz that if you ask people to compare two experiences, and one of them is a bit louder than the other, people will say that the louder one was better, or came through more clearly, or whatever it is you're trying to market for.

As a former mastering engineer, you're absolutely right that this is well understood in the audio industry. I used to present my clients with level-matched comparisons of source audio vs. processed so they would understand exactly what was being done, aesthetically.


Here's an EBU R 128 measure using r128gain:

    File 'reference.flac': loudness = -25.7 LUFS, sample peak = -9.2 dBFS
    File 'lyra.flac': loudness = -19.9 LUFS, sample peak = -4.1 dBFS
    File 'opus.flac': loudness = -25.9 LUFS, sample peak = -9.7 dBFS
So that also matches pretty closely what my ears heard.
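
(Roughly the same measurement can be done in Python with the pyloudnorm package, which implements the ITU-R BS.1770 gated loudness that EBU R 128 is based on; filenames assumed:)

    import numpy as np
    import soundfile as sf
    import pyloudnorm as pyln  # pip install pyloudnorm

    for name in ("reference.flac", "lyra.flac", "opus.flac"):
        data, rate = sf.read(name)
        loudness = pyln.Meter(rate).integrated_loudness(data)  # LUFS
        peak_db = 20 * np.log10(np.max(np.abs(data)))          # dBFS
        print(f"{name}: loudness = {loudness:.1f} LUFS, sample peak = {peak_db:.1f} dBFS")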


So nearly 6 dB louder. That is quite a bit.
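
(For perspective, the 5.8 LU gap works out to roughly double the amplitude and almost four times the power:)

    diff_db = -19.9 - (-25.7)      # Lyra vs. reference, from the numbers above
    print(10 ** (diff_db / 20))    # amplitude ratio, ~1.95
    print(10 ** (diff_db / 10))    # power ratio, ~3.8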


Yes. Assuming this was done by the Lyra encoder directly, and not the person who wrote the blog post pushing the slider, you have to wonder how it would respond to an input with a peak around -3 dB. Would it clip? Is it performing some kind of normalization? Who knows!

It's also interesting that the Lyra clip is ever so slightly longer than the other two. The Opus clip has exactly the same number of samples as the reference. Maybe they didn't use a decoder for Lyra at all, just played the file on one system and recorded it using a line-in on another?
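
(The sample counts are easy to check from the file headers, e.g. with the soundfile package; filenames assumed:)

    import soundfile as sf

    for name in ("reference.flac", "opus.flac", "lyra.flac"):
        info = sf.info(name)
        print(f"{name}: {info.frames} samples @ {info.samplerate} Hz")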


Well the blog post states they use a generative model. If that means what I think that means, they are doing in audio what folks have done in images which is sketch where a rabbit should be and have the model generate a rabbit there. Great encoding because the notion of 'rabbit-ness' is in the model not the data.

Again, assuming I understand correctly, that isn't a "transcoder"; it's a "here is a seed, generate what this seed creates" kind of thing.

Another way to look at it would be to think about text to speech. That takes words, as characters, and applies a model for how the speech is spoken, and generates audio. You could think of that as really low bit rate audio but the result doesn't "sound" like the person who wrote the text, it sounds like the model. If instead, you did speech to text and captured the person's timbre and allophones as a model, sent the model and then sent the text, you would get what they said, and it would sound like them.

It is a pretty neat trick if they are doing it that way, since it seems reasonably obvious that, for speech, the combination of model deltas and phonemes would be a VERY dense encoding.


From that I would naively expect that the performance of the codec could be very language dependent.

It would be interesting to see how well it does in other languages.


But that is the thing: what if it isn't a codec? What if it is simply a set of model parameters, a generative model, and a stream of fiducial bits which trigger the model? We have already seen some of this with generative models that let you generate voices that sound like the speaker data used to train the model, right? What if, instead of say "i-frames" (or whatever their equivalent would be in an audio codec), you sent "m-frames" which were tweaks to the generative model for the next few bits of data?


I think he's saying that if it is in fact a generative model, we will see significant differences when we try different languages.


I think I understand what he is saying; what I am struggling with is why a 'sound' GAN would care about different languages when an 'image' GAN doesn't care about different images.

What I'm getting at is this: do they use the sample as a training data set, with a streamlined model generation algorithm, so that they can send new initial model parameters as a blob before the rest of the data arrives?

It has my head spinning but the possibilities seem pretty tantalizing here.


I think you would agree that a GAN, or any generative model, can only generate something in the same domain as what it was trained on. If you trained it mostly on human faces with a little bit of rabbits, it's not going to generate rabbits well. If you trained it mostly on English text and a little bit on Mandarin, it's not going to generate good text in Mandarin. Same with sounds. Different languages use different sounds.

If they use any generative model in their codec, they had to train it first, offline, on some dataset. They can't possibly train it equally well on all languages, so we should be able to tell the difference in quality when comparing English to more exotic languages.


I agree with you 100%! This is the part I am wondering about:

> If they use any generative model in their codec, they had to train it first, offline, on some dataset.

One thing I'm wondering is whether they have a model that can be "retrained" on the fly.

Let's assume for this discussion that you've got a model with 1024 weights in it. You train it on spoken text, all languages, just throw anything at it that is speech. That gets you a generalized model that isn't specialized for any particular kind of speech, and the results will be predictably mixed when you generate random speech from it. But if you take it and run a "mini" training pass on just the sample of interest, then you have this general model, you digitize the speech, you run it through your trainer, and now the generalized model is better at generating exactly this kind of speech, agreed? So now you take the weights and generate a set of changes from the previous "generic" set, bundle those changes in the header of the data you are sending, and label them appropriately. Now you send only the data bits from the training set that are needed to activate the parts of the model that were updated. Your data product becomes (<model deltas>, <sound deltas>).
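
(Purely to illustrate the shape of that idea, not anything Lyra is known to do: a toy numpy sketch where the "model" is just a weight vector, the "mini" training pass is faked, and only the weights that actually moved get shipped along with the payload.)

    import numpy as np

    rng = np.random.default_rng(0)
    base_weights = rng.normal(size=1024)   # generic model both ends already share

    def mini_train(weights, sample):
        # Stand-in for a real fine-tuning pass on the incoming speech sample
        update = np.zeros_like(weights)
        update[:32] = 0.1 * sample[:32]    # pretend only a few weights move
        return weights + update

    sample = rng.normal(size=2048)         # the digitized speech
    tuned = mini_train(base_weights, sample)

    # Sender: ship only the weights that changed, plus the (reduced) sound data
    delta = tuned - base_weights
    idx = np.nonzero(np.abs(delta) > 1e-6)[0]
    packet = (idx, delta[idx], sample[::64])   # (<model deltas>, <sound deltas>)

    # Receiver: patch its local copy of the generic model, then decode with it
    local = base_weights.copy()
    local[idx] += packet[1]
    assert np.allclose(local, tuned)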

What I'm wondering is this: if every digitization is used to train the model, and you can send the model deltas in a way that the receiver can incorporate into its local model predictably, can you then send just the essential features of the digitized sound and have the model on the other end (which has incorporated the deltas you sent) regenerate it?

Here is an analogy for how I'm thinking about this, and it can be completely wrong, just speculating. If you wanted to "transport" a human with the least number of bits you could simply take their DNA and their mental state and transmit THAT to a cloning facility. No need to digitize every scar, every bit of tissue, instead a model is used to regenerate the person and their 'state' is sent as state of mind.

That is clearly science fiction, but some of the GAN models I've played with have this "feel" where they will produce reliably consistent results from the same seed. Not exact results necessarily, but very consistent.

From that, and this article, I'm wondering if they figured out how to compute, given the model, the 'seed' + 'initial conditions' that will reproduce what was just digitized. If they have, then it's a pretty amazing result.


What you described could work in principle, but in practice, "mini" training on a single sample is not likely to produce good results, unless the sample is very large. Also, this finetuning would most likely be quite resource intensive. I recall older speech recognition systems where they would ask you to read a specific text sample to adapt the model to your voice, so yes, this can work.

If you can fit a large generative model (e.g. an rnn or a transformer) in the codec, you might be able to offer something like "prompt engineering" [1], where the weights of the model don't change, but the hidden state vectors are adjusted using the current input. So, using your analogy, weights would be DNA, and the hidden state vectors would be the "mental state". By talking to this person you adjust their mental state to hopefully steer the conversation in the right direction.

[1] https://www.gwern.net/GPT-3#prompts-as-programming


how do the samples compare when loudness is made constant/normalized?


I do still prefer Lyra overall, though not as much as some others (see sibling comment). To me, Lyra is cleaner and easier to understand, but the artifacts it introduces are more annoying and fatiguing than those introduced by Opus. Some people in this thread have reported trouble understanding Lyra, which I attribute to the strange artifacts it introduces.


Just reduce the volume of the lyra one by hand. Doesn't change the fact that it sounds leagues above the others.


When I was doing amateur audio engineering from my parents' basement 15 years ago this phenomenon was easily noticeable and extremely difficult to avoid, particularly when doing things where the entire point is to change the loudness of everything (one aspect of mastering) or to change the loudness of things relative to other things (mixing). My "solution" was to simply take a long break (perhaps overnight) and see if I still thought the newer version sounded better with clear ears than I remember the old version sounding with clear ears.


Just as a reference

    Title                   RMS     Peak    Diff    (dBFS; Diff = Peak - RMS)
    clean_p257_011_lyra     -20.07  -1.13   18.93
    clean_p257_011_opus     -26.07  -6.65   19.41
    clean_p257_011_refer    -25.77  -6.15   19.63
PSD (Welch's method, window=2^13)

https://i.imgur.com/Y8A4kkx.png
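
(In case anyone wants to reproduce numbers like these: a rough sketch with numpy/scipy, assuming the clips are saved as FLAC and the levels are dB relative to full scale:)

    import numpy as np
    import soundfile as sf
    from scipy.signal import welch

    for name in ("clean_p257_011_lyra.flac", "clean_p257_011_opus.flac",
                 "clean_p257_011_refer.flac"):
        x, rate = sf.read(name)
        rms = 20 * np.log10(np.sqrt(np.mean(x ** 2)))
        peak = 20 * np.log10(np.max(np.abs(x)))
        print(f"{name}: RMS {rms:.2f}  Peak {peak:.2f}  Diff {peak - rms:.2f}")

        # Power spectral density, Welch's method with a 2^13-sample window
        f, pxx = welch(x, fs=rate, nperseg=2 ** 13)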


Isn't that due to audio (frequency) compression coming out of the generative model?

I guess that can be tweaked either way but they're going to tend towards that exactly because it sounds louder and thus clearer.


There are a couple of effects here:

1. Lossy codecs will use a low-pass filter to get rid of hard-to-compress higher frequencies. This is often inaudible, but either way it should lower the level, not raise it, unless some kind of compensation is applied (see the sketch below).

2. It's true that lossy codecs treat different frequency bands differently, but that's not usually done in a way that amounts to applying EQ to the signal.

3. Even if the relative balance of frequencies did shift as a result of lossy compression, that alone shouldn't change the overall loudness of the audio. In this case the Lyra output is significantly and audibly louder (about +6 dB). You could easily get the same effect in Opus just by amplifying (or applying compression to) the result, but Opus is doing things correctly.
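
(A quick way to convince yourself of point 1: low-passing can only remove energy, so the level should drop, never rise. Synthetic white-noise demo with an arbitrary cutoff:)

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    rng = np.random.default_rng(0)
    rate = 16000
    x = rng.normal(size=rate)             # one second of white noise

    # 4th-order Butterworth low-pass at 4 kHz (arbitrary cutoff for the demo)
    sos = butter(4, 4000, btype="low", fs=rate, output="sos")
    y = sosfiltfilt(sos, x)

    def rms_db(s):
        return 20 * np.log10(np.sqrt(np.mean(s ** 2)))

    print(f"before: {rms_db(x):.1f} dB   after low-pass: {rms_db(y):.1f} dB")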


I wouldn't call this cheating though. Audio compression exploits the way we perceive sound. If an artefact is perceived as clearer when it is louder than another one at the same compression bitrate but lower volume, I would say this falls into the category of psychoacoustic compression.


If simply turning up the volume made it easier to understand the speech, then not turning up the volume on the other codecs would make for an unfair comparison.


Off topic but how do you put images on ipfs and what’s the advantage over e.g. Imgur?


Cloudflare are kindly hosting [1] a free HTTP gateway for the IPFS [2] network. So I can host an image myself on a server with IPFS, and Cloudflare will cache it for me. This is better than Imgur because the latter has been redirecting users to annoying "social" pages with ads instead of showing them the actual image, at least in some cases. I also can't be sure whether Imgur recompresses your uploads or not - I assume they usually do.

It's also more generally useful because I can host other files too, not just images.

[1] https://www.cloudflare.com/distributed-web-gateway/

[2] https://ipfs.io/


Is hosting the image yourself, on like a $5 Digital Ocean Droplet and a $10 personal domain, out of the question? That would seem to be the ideal setup for simple, decentralized file hosting. What are the downsides of this approach?

(I can imagine a server package that can modify index.html sub-resource URLs depending on current server load, preferring private, locally hosted sub-resources but willing to use 3rd party solutions like Cloudflare, too, if required by a black swan event.)


Out of the question? No. As convenient as running one command on a desktop computer? Also no.

> the ideal situation in terms of simple, decentralized file hosting solution

Not sure what you mean by "decentralized" if you are in fact hosting it yourself.

> What are the downsides of this approach?

Well, for the casual person it has the obvious downside that you have to have your own VPS. Most people don't have those. Even if you do, IPFS has a couple of advantages: you can host images anonymously, and anyone anywhere in the world can "pin" the image to make sure it stays live. If you're using a server and you forget to pay DO your $5 one month, all your images go poof into the ether.


There's also https://imgz.org, that doesn't have annoying social stuff (I made it specifically for that!).


I keep getting 524 errors when trying to access files I uploaded. What am I doing wrong?


Not OP, and I haven't done it myself yet, but it makes a lot of sense. It's basically free image hosting if you can get the file cached by Cloudflare.

Imgur these days is slow and riddled with ads. A page view will sometimes load many times the image size in Javascript, stylesheets and images. It also doesn't let the user just view the raw image, going so far as to redirect requests for the raw image to a web page if you access the URL directly.

The only downside I see is that the URL is less user friendly without the IPFS toolset installed. Sounds like a pretty good idea to me.


Are these imgur problems a USA thing? Because I literally never had any of the behavior described. Direct image links always go to the image, there is no JS or HTML or anything.


Yep same here. Maybe the issue is about non-direct links, but it could be that imgur changes what it responds with depending on the request. If the url ends with .jpg it can still serve an HTML page.


If you upload and then link to an image on Imgur and the person clicking the link has not run Imgur's javascript yet within $timeperiod, the image will not display. Instead you'll be given javascript to run.

Cloudflare as a gateway is distasteful and this won't last long, but for now at least when you click an ipfs image over cloudflare you get an image and not javascript code.


Why is CF distasteful?


Not OP but I assume because it kind of defeats the purpose of IPFS. IPFS is all about links that refer to content rather than location; a Cloudflare link is back to being a location, and when the CF mirror goes down, the link will be broken.

But it's also the only way normal users can see the content.


I lowered the volume on the Lyra one and the sound is still clearly WAY clearer than the other two.



