Yes. Assuming this was done by the Lyra encoder directly, and not the person who wrote the blog post pushing the slider, you have to wonder how it would respond to an input with a peak around -3 dB. Would it clip? Is it performing some kind of normalization? Who knows!
It's also interesting that the Lyra clip is ever so slightly longer than the other two. The Opus clip has exactly the same number of samples as the reference. Maybe they didn't use a decoder for Lyra at all, just played the file on one system and recorded it using a line-in on another?
Well, the blog post states they use a generative model. If that means what I think it means, they are doing in audio what folks have done in images: sketch where a rabbit should be and have the model generate a rabbit there. It makes for great compression because the notion of 'rabbit-ness' lives in the model, not the data.
Again, assuming I understand correctly, that isn't a "transcoder"; that's a "here is a seed, generate what this seed creates" kind of thing.
Another way to look at it would be to think about text to speech. That takes words, as characters, applies a model for how speech is spoken, and generates audio. You could think of that as really low bit rate audio, but the result doesn't "sound" like the person who wrote the text; it sounds like the model. If instead you did speech to text and captured the person's timbre and allophones as a model, sent the model, and then sent the text, you would get what they said, and it would sound like them.
It is a pretty neat trick if they are doing it that way, since for speech it seems reasonably obvious that, if you could do it this way, the combination of model deltas and phonemes would be a VERY dense encoding.
But that is the thing: what if it isn't a codec? What if it is simply a set of model parameters, a generative model, and a stream of fiducial bits which trigger the model? We have already seen some of this with generative models that let you generate voices that sound like the speaker data used to train the model, right? What if, instead of, say, "i-frames" (or whatever their equivalent would be in an audio codec), you sent "m-frames" which were tweaks to the generative model for the next few bits of data?
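To make the "m-frame" idea concrete, here's a toy Python sketch of a decoder loop over such a stream. Everything here is invented for illustration (`MFrame`, `SFrame`, `ToyModel`, the trivial `generate`); it only shows the shape of the idea: model-update frames patch the local weights in place, and the remaining frames carry just the trigger/seed bits.

```python
from dataclasses import dataclass


@dataclass
class MFrame:
    """An "m-frame": sparse tweaks to the receiver's generative model."""
    weight_deltas: dict  # weight index -> additive delta


@dataclass
class SFrame:
    """A data frame: just the seed/trigger bits for the next audio chunk."""
    seed: int


@dataclass
class ToyModel:
    """Stand-in for a generative model: some weights and a fake generator."""
    weights: list

    def generate(self, seed):
        # Fake "synthesis": the output depends on both the seed and the
        # current weights, so an m-frame visibly changes what later
        # s-frames decode to.
        return seed * sum(self.weights)


def decode(stream, model):
    out = []
    for frame in stream:
        if isinstance(frame, MFrame):
            # Patch the local model rather than decoding any audio.
            for idx, delta in frame.weight_deltas.items():
                model.weights[idx] += delta
        else:
            out.append(model.generate(frame.seed))
    return out
```

Feeding the same seed before and after an m-frame produces different audio, which is the whole point: the stream steers the model, not just the samples.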
I think I understand what he is saying, what I am struggling with is why would a 'sound' GAN care about different languages when an 'image' GAN doesn't care about different images?
What I'm getting at is this: do they use the sample as a training data set, with a streamlined model generation algorithm, so that they can send new initial model parameters as a blob before the rest of the data arrives?
It has my head spinning but the possibilities seem pretty tantalizing here.
I think you would agree that a GAN, or any generative model, can only generate something in the same domain as what it was trained on. If you trained it mostly on human faces with a little bit of rabbits, it's not going to generate rabbits well. If you trained it mostly on English text and a little on Mandarin, it's not going to generate good text in Mandarin. Same with sounds. Different languages use different sounds.
If they use any generative model in their codec, they had to train it first, offline, on some dataset. They can't possibly train it equally well on all languages, so we should be able to tell the difference in quality when comparing English to more exotic languages.
I agree with you 100%! This is the part I'm wondering about:
> If they use any generative model in their codec, they had to train it first, offline, on some dataset.
One thing I'm wondering is whether they have a model that can be "retrained" on the fly.
Let's assume for this discussion that you've got a model with 1024 weights in it. You train it on spoken text, all languages; just throw anything at it that is speech. That gets you a generalized model that isn't specialized for any particular kind of speech, and the results will be predictably mixed when you generate random speech from it. But if you take it and run a "mini" training pass on just the sample of interest, then you have this general model, you digitize the speech, you run it through your trainer, and now the generalized model is better at generating exactly this kind of speech, agreed? So now you take the weights and generate a set of changes from the previous "generic" set, bundle those changes in the header of the data you are sending, and label them appropriately. Then you send only the data bits that were needed to activate the parts of the model that were updated. Your data product becomes (<model deltas>, <sound deltas>).
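The delta step above could be sketched like this (all names hypothetical; a real codec would presumably also quantize and entropy-code the deltas rather than ship raw floats):

```python
def sparse_weight_deltas(generic, finetuned, threshold=1e-3):
    """Sender side: keep only the weight changes big enough to matter,
    as a sparse {index: delta} map. Small changes are dropped entirely."""
    return {
        i: ft - g
        for i, (g, ft) in enumerate(zip(generic, finetuned))
        if abs(ft - g) > threshold
    }


def apply_deltas(generic, deltas):
    """Receiver side: patch a copy of the shared generic model so both
    ends agree on the specialized weights before audio data arrives."""
    patched = list(generic)
    for i, d in deltas.items():
        patched[i] += d
    return patched
```

If the fine-tune only moves a handful of the 1024 weights meaningfully, the header blob stays tiny, which is what would make the (<model deltas>, <sound deltas>) split pay off.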
What I'm wondering is this: if every digitization is used to train the model, and you can send the model deltas in a way that the receiver can incorporate into its local model predictably, can you then send just the essential features of the digitized sound and have the model on the other end (which has incorporated the model deltas you sent) regenerate it?
Here is an analogy for how I'm thinking about this, and it could be completely wrong, I'm just speculating. If you wanted to "transport" a human with the fewest bits, you could simply take their DNA and their mental state and transmit THAT to a cloning facility. No need to digitize every scar and every bit of tissue; instead a model is used to regenerate the person, and their 'state' is sent as a state of mind.
That is clearly science fiction, but some of the GAN models I've played with have this "feel" where they will produce reliably consistent results from the same seed. Not exact results necessarily, but very consistent.
From that, and this article, I'm wondering if they figured out how to compute the 'seed' + 'initial conditions', given the model, that will reproduce what was just digitized. If they have, then it's a pretty amazing result.
What you described could work in principle, but in practice, "mini" training on a single sample is not likely to produce good results, unless the sample is very large. Also, this finetuning would most likely be quite resource intensive. I recall older speech recognition systems where they would ask you to read a specific text sample to adapt the model to your voice, so yes, this can work.
If you can fit a large generative model (e.g. an RNN or a transformer) in the codec, you might be able to offer something like "prompt engineering" [1], where the weights of the model don't change, but the hidden state vectors are adjusted using the current input. So, using your analogy, the weights would be the DNA, and the hidden state vectors would be the "mental state". By talking to this person you adjust their mental state to hopefully steer the conversation in the right direction.
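A toy illustration of that split, with a made-up one-unit recurrence (the numbers and update rule are invented, just to show which part is fixed and which part moves): the weights (the "DNA") stay frozen during decoding, while only the hidden state (the "mental state") is nudged by each input.

```python
import math


class TinyRNN:
    """A single-unit recurrent cell, purely for illustration."""

    def __init__(self, w_in, w_rec):
        self.w_in = w_in    # fixed "DNA": never touched after training
        self.w_rec = w_rec  # fixed "DNA": never touched after training
        self.h = 0.0        # evolving "mental state"

    def step(self, x):
        # Each input nudges the hidden state; the weights stay frozen.
        self.h = math.tanh(self.w_in * x + self.w_rec * self.h)
        return self.h
```

Feeding a few "prompt" samples through `step` moves `h` to a useful starting point before generation begins, without shipping any weight updates at all.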
I do still prefer Lyra overall, though not as much as some others (see sibling comment). To me, Lyra is cleaner and easier to understand, but the artifacts it introduces are more annoying and fatiguing than those introduced by Opus. Some people in this thread have reported trouble understanding Lyra, which I attribute to the strange artifacts it introduces.