One thing that slightly worries me about "machine learning" in compression, rather than conventional everything-is-sines mathematical approaches, is the possibility of odd nonlinear errors. Remember the photocopier that worked by OCR and would occasionally mis-transcribe numbers?
I don't mind compressing a phoneme to <unintelligible> as much as I would mind it compressing it to a clearly audible different phoneme.
While fascinating, that’s not the same as a codec failing silently by literally changing one word into another, equally clear word instead of getting fuzzy or unintelligible.
At the end of the day, it all comes to using the right tool for the job, and this is just another codec in your toolbox.
This is no different than using, for example, a probabilistic algorithm to solve some NP-hard problem in your real world software. As long as you understand the limitations, I don't see an issue with using an algorithm that has a small non-significant (for your use-case) rate of failure. I would definitely not use this to communicate with the space station, but in the right context (Google Duo, low bandwidth), it's the perfect tool.
[disclaimer: Personal opinion, not that of my employer.]
I had a coworker play me before/after of an early version of the codec "babbling" and it was definitely uncanny valley. It looks like some work has been done on the problem since then.
The second paper linked in the README.md of the repo talks about a few strategies to reduce 'babbling' or 'babble'. For your reference, here's the citation and a link.
Denton, T., Luebs, A., Lim, F. S., Storus, A., Yeh, H., Kleijn, W. B., & Skoglund, J. (2021). Handling Background Noise in Neural Speech Generation. arXiv preprint arXiv:2102.11906. https://arxiv.org/abs/2102.11906
This already happens with existing compression algorithms. Certain vowel sounds get collapsed, so someone will say, for example, "66" and it will come out on the other side as "6". Very annoying because you can't exactly coach a layperson on how to talk "the right way" to not trigger this vowel collapse.
Not suggesting it as a fix, but this did remind me of the military phonetic alphabet, which includes numbers too.
3 is "tree", 4 is "fow er", 5 is "fife", 9 is "niner". The rest of the numbers are mostly as-is, but you'll hear very deliberate enunciation, like "Zee Row" for 0.
whiskey hotel yankee delta oscar india hotel alpha victor echo tango oscar sierra papa echo alpha kilo tango hotel echo lima alpha november golf uniform alpha golf echo oscar foxtrot tango hotel echo mike alpha charlie hotel india november echo ? tango hotel alpha tango india sierra india november sierra alpha november echo!
Humans adapt a whole hell of a lot easier than machines.
Sure, it would be nice to have clean, high bandwidth, low latency voice channels to everywhere so you could drop pins and expect the other side to hear them. Unfortunately, high bandwidth never really happened, some places never ran land lines to everyone's home, and nobody wants to pay the high price of circuit-switched voice when packet-switched voice mostly works well enough and is enormously cheaper.
But is Lyra a significant improvement over modern Opus at 8Kbps? You can buy a Grandstream HT802 for ~$30 and its DSP can decode Opus today, whereas Lyra will require orders of magnitude more power to decode while providing much worse reproduction accuracy.
I'm having a little trouble following this, could you explain a bit more? It seems to me like "66" would be pronounced "SIKSIKS", so for that to become "SIKS" would mean the "KS" (consonants) would be collapsed, no? (Not trying to refute you or anything, just understand :) )
As someone with a weird sibilant that doesn't seem to compress well, I want to say that it goes across as "sɪkɪks" and I got used to saying "double six" on the phone.
So I would say "seven nine double six", which is another problem if I'm talking to an American.
This applies to GSM digitization and other "regular phone" compression; the newer computer-based calls have been better at getting the words across.
I don’t know if it’s improved over the last 6 months, but Zoom sucks for native Spanish speakers speaking English. Zoom would not pick up the J/H sound at all in English words.
> you can't exactly coach a layperson on how to talk "the right way" to not trigger this vowel collapse
I've never noticed. At any rate, we should not coach people to adapt to technology in this way. It is Procrustean and anti-human and unnecessarily places a burden on people that belongs to the software and the developer.
For what it's worth, amateur radio operators already have specialized rules and techniques for speech, to improve clarity over a muffled noisy analog radio channel.
I've always suspected the optimal experience is a balance...we define some intermediate language that both the computers need to be programmed to understand and humans need to be trained to adopt.
The most obvious example is learning to type...I've had by far the most fun working with computers in a keyboard-centric environment, mostly because I'm good enough at pressing keys and the computer is good enough at understanding them.
That said, I agree with both you and GP: trying to train a layperson to talk differently based on the quirks of the codec used to encode their voice seems like a poor choice!
A supersampling algorithm is a (de)compression algorithm. You give it an image and it gives you a "decompressed" image. It's not a very good one, though.
The OCR issue was the first thing I thought about. Machine learning is probabilistic, not deterministic, so in the case of S being converted to 5 (or 6 to 8, etc.), which definitely impacts numerical data in the case of the OCR stuff, we can expect similar voice mis-classifications. Perhaps "You're fine" might get misclassified as "you're fired".
IMO: the output of machine learning is correlated garbage. This is confusing to most people who are used to programs that implement an algorithm (reminder that "correctness" is part of the definition of an algorithm.)
It was ordinary compression, something called JBIG2. It did not mistranscribe; it marked slightly different number or character blocks as the same, which resulted in replaced parts of the image.
In other words, its match tolerance is a bit too lax, so it gets poisoned by blocks in its own dictionary, thinking it already has the blocks for things it has just scanned.
Yes! This is why I always turn off autocorrect! It’s true that I absolutely make more typos without it, but at least they’re obvious as typos, and not different words that potentially change the meaning of the sentence.
One day, voice cloning may become so powerful that only word data and intonations will become part of the datastream. There could be various 'layers' in which encodes/decodes can occur. Voice Cloning would be at the very top of the stack.
"Please note that there is a closed-source kernel used for math operations that is linked via a shared object called libsparse_inference.so. We provide the libsparse_inference.so library to be linked, but are unable to provide source for it. This is the reason that a specific toolchain/compiler is required.*
- README
Ah, so Lyra today will not work on RISC-V, i386, Power, MIPS, lower end or older ARM chips like the Allwinner H3 (very popular in Single Board Computers) and any other new architecture that comes out?
Doesn't seem that much better compared to Codec2, which is already fully Open Source (LGPL), even taking into account that the original samples in Codec2's examples are of much worse quality than the ones on Lyra's website. I'd be curious to hear both working on the same set of audio samples.
Ya I was just coming here to say the same thing. 40ms _just in the codec_ feels like a lot. Because that's not even including time to pull in audio from the hardware (could be 20ms or more in Android devices), time to upload, and time to have it across the Internet, and then time to decode + play on the receiver. That adds up pretty quickly. I'm guessing 40ms was chosen because it is some sweet spot of having enough data to get a worthwhile compression on, but it's one of these things where technology, however impressive it might be, is slowly giving us a worse experience over time in the pursuit of digitization.
>These speech attributes, also called features, are extracted in chunks of 40ms, then compressed and sent over the network.
So while encoding doesn't take 40ms, the framing latency + encoding will indeed be 40ms+.
150ms is the end-to-end latency, which is basically everything: encoding + network + decoding. We can't beat the speed of light on our fibre network, but we can certainly do something about encoding and decoding, and Lyra doesn't seem to help with that here. Something I pointed out last time Lyra was on HN.
I think Opus defaults to 20ms frames, with the option of 10ms (excluding encoding time) at the expense of quality. What we really need is a higher bitrate, lower latency, higher quality codec, which is sort of the exact opposite of what Lyra is offering.
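To see how it adds up, here is a rough one-way budget. The 40 ms and ~20 ms figures come from this thread; the remaining numbers are placeholder assumptions, not measurements:

```python
# Rough one-way latency budget. Only the 40 ms and ~20 ms figures come from this
# thread; the remaining numbers are placeholder assumptions, not measurements.
budget_ms = {
    "audio capture (Android, per the comment above)": 20,
    "Lyra frame accumulation (40 ms chunks)": 40,
    "encode + decode compute (guess)": 10,
    "network propagation + jitter buffer (guess)": 60,
    "playout buffer (guess)": 20,
}
print(sum(budget_ms.values()), "ms one-way")  # ~150 ms, in the ballpark of the E2E figure above
```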
That is only in the case of "opposite points of the earth"; otherwise you are just adding ~700 km of distance between two points. The point is that even with perfect speed-of-light data transfer over a direct line, we are fundamentally limited by it and nothing can be done about that. But encoding, decoding, time slots and quality are all things we do have control over and should be looked into more seriously.
Yes, because they are convenient for other reasons (they don't require infrastructure over land), which makes them suitable for connecting rural areas where it doesn't make sense to run fiber. But fiber will always be the fastest you can get, and if you ran fiber in a vacuum, you could theoretically achieve near speed-of-light communication. Satellites won't get you anywhere close to that, even if you use lasers, because there are always atmospheric disturbances that introduce latency.
There is no such thing as 5 ms VoIP audio latency at 6 Kbps: the IPv4+UDP headers alone would amount to 44.8 Kbps at minimum, so it's irrelevant whether one encoder is tuned to encode 5 ms chunks instead of 40 ms chunks. 40 ms intervals require a minimum of 5.6 Kbps plus the codec rate.
I.e. at 10 Kbps it's impossible to have a lower VoIP latency than 32 ms. Likely why they tuned for the 40 ms number in the real world.
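For concreteness, here is the arithmetic behind those numbers as a small sketch. It assumes bare IPv4+UDP with no RTP, one packet per codec frame, and reads the 32 ms floor as assuming a 3 kbps payload within a 10 kbps total budget (my reading, not stated above):

```python
# IPv4 (20 B) + UDP (8 B) headers per packet, no RTP, one packet per codec frame.
HEADER_BITS = (20 + 8) * 8

def header_kbps(frame_ms):
    """Header-only bitrate when one packet is sent every `frame_ms` milliseconds."""
    return (1000 / frame_ms) * HEADER_BITS / 1000

for frame_ms in (5, 10, 20, 40):
    print(f"{frame_ms:>2} ms frames: {header_kbps(frame_ms):4.1f} kbps of headers")
# 5 ms -> 44.8 kbps, 40 ms -> 5.6 kbps

# The 32 ms floor at a 10 kbps total budget, assuming a 3 kbps codec payload:
min_interval_ms = HEADER_BITS / ((10 - 3) * 1000) * 1000
print(f"minimum frame interval: {min_interval_ms:.0f} ms")   # 32 ms
```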
I don't think the article is discussing encoding time; it says "features are extracted in chunks of 40ms". My reading is that it's breaking the speech down into 40ms chunks, compressing them, and sending that.
Sure, latency ends up being 40 ms, but that's a function of needing to wait to send the encoded data + network headers at 6 Kbps, not a function of the encoder being slow and holding everything up.
This seems kind of unnecessary, compared to Opus at ~10 kbps. If you're sending IPv6+UDP in 40 ms chunks, that's 9.6 kbps just from the packet headers (25 Hz * (40+8) bytes * 8 bits/byte).
When the voice payload is smaller than the packet headers, you're well into diminishing returns territory.
Opus at 8Kbps sounds better, and commodity, inexpensive hardware like the Grandstream HT802 Analog Telephone Adapter supports this codec today (along with any cheap Android phone).
Lyra as it stands today will not support anything outside of x86-64 and ARM64 without rewriting the proprietary kernel it relies on.
PCMU/PCMA (G.711μ and G.711a) are not original landline quality audio, but rather what the Bell System felt it could get away with passing off as a toll-quality call in 1972.
Lyra will likely sound better, but the reproduction accuracy is apt to be quite a bit poorer, as many others have commented. G.711 was created to require nearly no processing (it's nearly raw PCM data from a sound card, after all) while operating at reasonable bitrates; Lyra looks much more computationally intensive and will likely only run on smartphones in the next few years.
Edit: Is Lyra a significant improvement over modern Opus at 8Kbps? You can buy a Grandstream HT802 analog telephone adapter for ~$30 and its DSP can decode Opus today, whereas Lyra will require orders of magnitude more power to decode while providing much worse reproduction accuracy.
The overhead from packet headers to send data every 40ms is 9.6kbps; is the difference between 12.6Kbps and 17.6Kbps meaningful at that point? We are sending the same number of packets, likely with the same packet loss rate.
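As a quick sanity check of those totals, assuming IPv6+UDP headers of 40+8 bytes and one packet per 40 ms frame (as in the comment above):

```python
# Header overhead for one packet every 40 ms with IPv6 (40 B) + UDP (8 B) headers.
packets_per_second = 1000 / 40                              # 25 Hz
header_kbps = packets_per_second * (40 + 8) * 8 / 1000      # 9.6 kbps

for name, codec_kbps in (("Lyra", 3), ("Opus", 8)):
    print(f"{name}: {codec_kbps + header_kbps:.1f} kbps on the wire")
# Lyra: 12.6 kbps, Opus: 17.6 kbps
```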
> A RaspberryPi Zero will provide more than sufficient power for Lyra
A Raspberry Pi Zero can't run Lyra, as the proprietary math kernel is only offered in compiled form for x86-64 and android-arm64: https://github.com/google/lyra#license
> > You can buy a Grandstream HT802 analog telephone adapter for ~$30 and its DSP can decode Opus today
That's almost entirely irrelevant if you're making calls to or from the PSTN, since your SIP trunking provider most likely only supports G.711 alaw/ulaw, and even if they support you handing them a call as G.722 or any other codec, their upstreams almost certainly don't support anything other than G.711.
Not an authoritative source, but (as a point of interest) analog landlines seem to be specified as 24 dB SNR and 300-3000 Hz passband [1], giving ~21.5 kbps information rate [2].
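For what it's worth, the ~21.5 kbps figure does fall out of the Shannon capacity formula with those two numbers (a rough check assuming a flat 300-3000 Hz passband at 24 dB SNR; real line conditions vary):

```python
from math import log2

# Shannon capacity check, assuming a 300-3000 Hz passband and 24 dB SNR.
bandwidth_hz = 3000 - 300                 # 2700 Hz
snr_linear = 10 ** (24 / 10)              # 24 dB is roughly a 251x power ratio

capacity_bps = bandwidth_hz * log2(1 + snr_linear)
print(f"{capacity_bps / 1000:.1f} kbps")  # ~21.5 kbps
```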
There are huge wins, but the grandiosity of "enabling voice calls" is grating. I don't think this will open voice communication up to many new users. It will reduce data costs in a way that impacts a significant number of people's bottom line. But I feel manipulated by the current headline, and by the persistent inability to temper the very real hope with some measure of humility.
In practical terms, very impressive. Anyone know what the latency is like? Feels like a domain where people who have not experienced low latency full duplex cannot fully appreciate why voice has faded in everyday life...
I have been using Duo more for audio calls lately and the call quality has been excellent. Compared to WhatsApp, which oftentimes can only mimic the sound quality of a regular phone call, it's much, much better. I've tested in the US and with my family in India, where the connection isn't the greatest.
The Github README shows that the public API uses types from Abseil, a library that "promises no ABI stability, even from one day to the next." That seems problematic.
>"Lyra’s architecture is separated into two pieces, the encoder and decoder. When someone talks into their phone the encoder captures distinctive attributes from their speech. These speech attributes, also called features, are extracted in chunks of 40ms, then compressed and sent over the network. It is the decoder’s job to convert the features back into an audio waveform that can be played out over the listener’s phone speaker. The features are decoded back into a waveform via a
generative model.
Generative models are a particular type of
machine learning model
well suited to recreate a full audio waveform from a limited number of features. The Lyra architecture is very similar to traditional audio codecs, which have formed the backbone of internet communication for decades. Whereas these traditional codecs are based on digital signal processing (DSP) techniques, the key advantage for Lyra comes from the ability of the generative model to reconstruct a high-quality voice signal."
PDS: Audio Codec meets Machine Learning! I love it!!!
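Reading the quoted description, the overall flow is roughly the sketch below; the function names are purely illustrative stand-ins, not Lyra's actual API:

```python
# Very rough sketch of the pipeline described in the quoted paragraph.
# All callables (extract_features, compress, generative_model, ...) are
# illustrative placeholders, not Lyra's real API.

FRAME_MS = 40  # features are extracted in 40 ms chunks

def sender(mic_frames, extract_features, compress, send):
    for frame in mic_frames:                  # 40 ms chunks of raw audio
        features = extract_features(frame)    # distinctive speech attributes
        send(compress(features))              # ~3 kbps over the network

def receiver(packets, decompress, generative_model, play):
    for packet in packets:
        features = decompress(packet)
        play(generative_model(features))      # waveform regenerated from features
```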
Since this is explicitly targeted at "the next billion users," do we have any sense of how well-optimized this is on non-English audio corpuses? I can't imagine that a model trained primarily on English/Western phonemes would perform as well on the rest of the world.
Unlikely, there's too much pride in each local language. Might all converge on English over a couple of generations, though, but more for commercial reasons.
My immediate thought: train a Transformer (or Tacotron2) that transforms text to the encoded Lyra codes... And, we will finally have a good real-time open-source text-to-speech system running on mobile devices.
I find that Lyra sounds good at first but it can chop off hard consonants in certain scenarios. It sort of sounds like slightly slurred speech. Anyone else getting that impression from their samples?
Another reason for end-to-end speech encryption: to keep your cleartext voice signal away from these overaggressive codecs changing the words. I can understand the need for a super low bandwidth codec on top of Mt. Everest, but 64 kbit PCM was good enough for our grandparents' landlines (or 13 kbit GSM for their mobiles) and it's good enough for us.
Spectacular failure of imagination? I mean it's a speech codec, an incremental improvement over the many out there that work fine. We're no longer in a world where voice calls dominate the world's telecom bandwidth usage. We routinely receive a megabyte of Javascript and ads and crap to display a 288 character tweet. Soon there will be 5G everywhere, so we'll get 10MB of JS etc. to see the same 288 character tweet. 1MB is 10+ minutes of full-rate GSM, or a lot more than that of Opus. If Lyra is really free (no blobs) and its computational requirements don't make us churn our phone hardware yet again, then great, it can reduce the already very low cost of voice calls by another smidgen, increasing carrier margins while almost certainly not showing in lower prices to the end user. So at that end of things, it's tolerable, while it would be horrible if (say) it were patented and became a standard, so that FOSS VOIP clients became non-interoperable with what the big vendors were using.
Lyra is more transformative in some extreme niche areas of extremely limited bandwidth, say spacecraft radios or handheld satellite phones or whatever. Those applications already use super low bandwidth codecs that sound like crap. So Lyra won't really save bits, but it will help intelligibility a lot by sounding better in the same bits.
A more useful system would take Opus-compressed data as input and feature-extract that, presumably faster than this thing. Bonus for not requiring a proprietary library like libsparse_inference.so.
Also, instead of encoding independent 40ms segments, it should be much better to encode 10ms segments given the previous 30ms.
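A minimal sketch of that framing idea, assuming a 16 kHz sample rate and a hypothetical encode_frame callable (not Lyra's API) that sees 30 ms of already-transmitted history plus 10 ms of new samples:

```python
# Minimal sketch of the "10 ms given the previous 30 ms" framing idea above.
# `encode_frame` is a hypothetical callable, not part of Lyra's actual API.

SAMPLE_RATE = 16000
HOP = SAMPLE_RATE * 10 // 1000       # 10 ms of new samples per packet
CONTEXT = SAMPLE_RATE * 30 // 1000   # 30 ms of history the decoder already has

def encode_stream(samples, encode_frame):
    packets = []
    for start in range(CONTEXT, len(samples) - HOP + 1, HOP):
        context = samples[start - CONTEXT:start]   # no new bits needed for this part
        new = samples[start:start + HOP]           # only this span needs to be coded
        packets.append(encode_frame(context, new))
    return packets
```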
Is there any difference from other audio codecs? It's great to see another player in the market; this time it's machine learning that produces high-quality calls. I'll keep an eye on its impact in the future. This architecture will surely disrupt our communication industry.
Does anyone know how it compares to Codec2? Opus is great down to ~12kbps but Codec2 is the real contender down at the bottom. And I bet it uses way less CPU than Lyra
Bad internet connectivity in the developing world isn't "only 56kbps" as some people think.
It's "random bursts of fast with random 30 second gaps of no connectivity at all". It's routed through 3 layers of proxies and firewalls which block random stuff and not others, while disconnecting long running connections.
Oh, and it'll be expensive per MB.
To that end, Lyra helps with the expense of a data connection, but is unusable for long voice calls. What would help more is a text chat system like WhatsApp.
Oh right - WhatsApp is already wildly popular in most of the developing world for mostly this reason.
> Oh right - WhatsApp is already wildly popular in most of the developing world for mostly this reason.
Not only that, but carriers will often advertise plans with "unlimited Internet for Facebook and WhatsApp" (a punch in the face of net neutrality).
So not only does WhatsApp have more impact with audio messages when audio calls are too unstable, its audio calls already substitute for the bulk of phone calls even for people who have shitty data plans.
This is what my carrier says on their most basic offering:
> What does WhatsApp Unlimited mean?
> The benefit is granted automatically, without the need for activation. Use of the app is unlimited for sending messages, audio, photos, and videos, as well as for making voice calls. Only video calls are deducted from the internet package, as is access to external links.
Heya, please could you unpack your reasoning a little bit more?
You said:
> WhatsApp is already wildly popular in most of the developing world for mostly this reason.
I can't speak for the majority of the developing world, but here in South Africa, WhatsApp is indeed the predominant communications app.
That being said, WhatsApp voice calls are also used here quite a bit.
So with that in mind, and reading from the article:
> Lyra compresses raw audio down to 3kbps for quality that compares favourably to other codecs
To me 3kbps sounds pretty great, and might actually work out cheaper / better than one might imagine.
So I'm just wondering, how does WhatsApp voice call data usage compare to Lyra?
Also whilst South Africa is indeed a developing country (where, among other things, the price of data is proportionately high relative to average household income), the cellular network infrastructure is excellent.
So I don't think the random bursts of connectivity you describe are as big of an issue here, whereas the price of data most certainly is.
In which case, I can definitely see a market for Lyra (assuming the 3kbps is indeed vastly superior to WhatsApp's data usage for a voice call).
Hope that makes sense, but I'd be happy to elaborate a little further :-)
Lyra is a good candidate for replacing the protocol already used in WhatsApp's voice calls. The binary size of WhatsApp matters, so it would depend on Lyra not requiring a multi-megabyte neural net too. The 40 millisecond extra enforced delay might have a negative impact on user experience.
It might be a good candidate for use in the voice message feature of WhatsApp. That feature doesn't require low latency audio, so there might be even better compression schemes that use forward and backward compression techniques.
In the middle east I noticed a baffling-to-me usage of whatsapp: people were simply exchanging voice messages back and forth instead of calling. [0]
Presumably for exactly the reason you've stated.
[0] I later tried it myself with a friend, but you end up losing the benefits of both worlds -- you can't search or review old messages effectively (as you would text), and it's significantly slower than calling.
This is all framed as a machine learning and optimization story, but the end goal is that Google can easily transcribe your voice calls and store them as text. Then it can apply all the shady practices that were previously too expensive because storing voice and extracting information from it required huge storage costs and actual human labour.
Or worse, just imagine what some government you don't trust could do with all those voice call transcripts.
This codec has nothing to do with what you're worried about. There's no current technical limitation preventing what you're describing. Google doesn't do it because it makes no sense for their business and because your phone calls aren't routed through Google's servers. Governments outside the US are already doing it.
It sounds more like an "offline" codec, not a Google service compressing your voice, so I don't immediately see how Google would violate our privacy here this time.
This will make voices radically more correlatable, most likely. It's a more effective model for voice: it has run endless regressions & found better patterns to model human sounds on. That could well make processing & comparing pieces of speech data less computationally expensive.
I don't see much relation to surveillance & transcription issues. This technology does not, and would not, change the field of battle significantly, if such a battle were underway. Which it probably is, in some countries, perhaps even applying to Google-touched, -relayed, or Google-held data.