Lyra V2 – a better, faster, and more versatile speech codec (googleblog.com)
229 points by HieronymusBosch on Sept 30, 2022 | 68 comments



Also check out Codec2, which is open source as well and offers really good quality down to 700 bit/s. It has been ported to small MCUs such as the ESP32 and STM32, and it is also supported by Arduino libraries.

https://www.rowetel.com/?page_id=452

https://github.com/deulis/ESP32_Codec2

https://www.arduino.cc/reference/en/libraries/codec2/
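
For a rough feel of how simple codec2 is to drive from code, here is a minimal sketch using the pycodec2 Python bindings. This is illustrative only: the class and method names (Codec2, samples_per_frame, encode, decode) are my recollection of that binding's API, so verify against its documentation.

    import numpy as np
    import pycodec2  # assumed third-party binding for the codec2 C library

    c2 = pycodec2.Codec2(1200)         # one of the standard modes (bit/s); 700C also exists upstream
    n = c2.samples_per_frame()         # frame size in 16-bit samples at 8 kHz

    pcm = np.zeros(n, dtype=np.int16)  # one frame of mono speech (silence here)
    packet = c2.encode(pcm)            # a few bytes per frame
    decoded = c2.decode(packet)        # back to int16 PCM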


> Also check out Codec2, which is open source as well and offers really good quality down to 700 bit/s

Yes, but it tops out at 3200 bit/s:

* https://en.wikipedia.org/wiki/Codec_2

Lyra V2 seems to start there and then go up into the teens of kbit/s, at which point Opus can perhaps take over the job.


Yes, that is by design. The project is aimed at HF/VHF voice communications, so it makes sense. Such low bandwidth usage allows it to run on everything from ordinary ham gear down to cheap LoRa modules, which opens huge possibilities like building point-to-point or multipoint encrypted communications with portable devices not tied to cell towers; in some areas of the world that would be very useful these days.


> run on everything from ordinary ham gear down to cheap LoRa modules, which opens huge possibilities like building point-to-point or multipoint encrypted communications

I hope you aren't suggesting that encryption should be used on amateur bands.


Of course not; I'm aware that encryption is illegal on ham bands. I was referring to other uses in emergency situations.


It's not a big deal; there's rarely any enforcement of this, and no one cares except for the usual angry hams.


That’s how the commons gets tragedied…


Is that not a good idea?


It's not legal. Most countries prohibit encryption of Amateur Radio transmissions in most cases. Some countries have exceptions such as emergency communications or satellite control. [0]

[0] https://ham.stackexchange.com/questions/72/encrypted-traffic...


It's a can-of-worms topic...


You mean like the Helium network, where they are trying to get enough users with LoRa boxes to replace the telcos? [1]

[1] https://www.helium.com/


Helium's LoRa network has a vanishingly small number of paying users despite its size.

Great move for a pump-and-dump scheme though; now they are moving on to CBRS LTE with a whole new token separate from HNT.


It's astounding to me that you can get intelligible speech at 700 bps.

700 bits is less than the ASCII encoding of this comment (at 8 bits per character, 700 bits covers only 87 characters).


Have fun reading that comment out loud in 1 second.


Is there an open codec that concentrates on low CPU usage? I'm fine with it not being very bandwidth efficient.

Opus is a very good codec, but it's not amazing CPU-wise. I work on a VR world, and audio encoding is usually our most limiting factor when running on a VPS. We have the capability to negotiate codecs, so the high-CPU/low-bandwidth use case is already covered.

What I'm looking for specifically:

* Low CPU usage

* Support for high bitrate, suitable for music and sounds other than voice

* Low latency


It sounds like you’re asking for uncompressed audio? That meets all of your listed requirements: 48 kHz × 16 bit, single channel = 768 kbit/s.


We support that already, yup. But it never hurts to see if there's something better than that out there.


You can bootleg your own fast lossless codec by doing delta-encoding on the raw PCM to get a lot of zeros and then feed it through an off-the-shelf fast compressor like snappy/lz4/zstandard/etc. It won't get remotely close to the dedicated audio algorithms, but I wouldn't be surprised if you cut your data size by a factor 2-4 and essentially no CPU cost compared to raw uncompressed audio.


You’ve not done this before, have you?


I haven't, but now I have. I took https://opus-codec.org/static/examples/samples/music_orig.wa... from https://opus-codec.org/examples/. Then I wrote the following snippet of Python code:

    from scipy.io import wavfile
    import numpy as np
    import zstd

    sampling_rate, samples = wavfile.read(r'data/bootleg-compress/music_orig.wav')
    orig = samples.tobytes()

    naive_compressed = zstd.ZSTD_compress(orig)
    deltas = np.diff(samples, prepend=samples.dtype.type(0), axis=0) # Per-channel deltas.
    compressed_deltas = zstd.ZSTD_compress(deltas.ravel()) # Interleave channels and compress.

    decompressed_deltas = np.frombuffer(zstd.ZSTD_uncompress(compressed_deltas), dtype=samples.dtype)
    decompressed = np.cumsum(decompressed_deltas.reshape(deltas.shape), axis=0, dtype=samples.dtype)
    assert np.array_equal(samples, decompressed)

    print(len(orig))
    print(len(naive_compressed))
    print(len(compressed_deltas))
giving:

    17432876
    15518973
    12817602
Looks like my initial estimate of 2-4 was way off (when FLAC achieves ~2, that should've been a red flag), but you do get a ~1.36x reduction in space at basically memory-read speed.

Using an encoding for second-order differences, storing -127 <= d <= 127 in 1 byte and the others in 3 bytes (an escape byte plus 2 payload bytes, for 16-bit input audio), I got a ratio of ~1.50 for something that can still operate entirely at RAM speed:

    orig = samples.tobytes()
    deltas = np.diff(samples, prepend=samples.dtype.type(0), axis=0)      # Per-channel deltas.
    delta_deltas = np.diff(deltas, prepend=samples.dtype.type(0), axis=0) # Per-channel second-order differences.

    # Many small differences, encode almost all 1-byte differences using 1 byte,
    # using 3 bytes for larger differences. Interleave channels and encode.
    small = np.sum(np.abs(delta_deltas.ravel()) <= 127)
    bootleg = np.zeros(small + (len(delta_deltas.ravel()) - small) * 3, dtype=np.uint8)
    i = 0
    for dda in delta_deltas.flatten():
        if -127 <= dda <= 127:
            bootleg[i] = dda + 127
            i += 1
        else:
            bootleg[i] = 255
            bootleg[i + 1] = (dda + 2**15) % 256
            bootleg[i + 2] = (dda + 2**15) // 256
            i += 3

    compressed_bootleg = zstd.ZSTD_compress(bootleg)
    print(len(compressed_bootleg))

    decompressed_bootleg = zstd.ZSTD_uncompress(compressed_bootleg)
    result = []

    i = 0
    while i < len(decompressed_bootleg):
        if decompressed_bootleg[i] < 255:
            result.append(decompressed_bootleg[i] - 127)
            i += 1
        else:
            lo = decompressed_bootleg[i + 1]
            hi = decompressed_bootleg[i + 2]
            result.append(256*hi + lo - 2**15)
            i += 3

    decompressed_delta_deltas = np.array(result, dtype=samples.dtype).reshape(delta_deltas.shape)
    decompressed_deltas = np.cumsum(decompressed_delta_deltas, axis=0, dtype=samples.dtype)
    decompressed = np.cumsum(decompressed_deltas, axis=0, dtype=samples.dtype)
    assert np.array_equal(samples, decompressed)
Prints 11593846.


While I also want a low-computation codec that can save space, the historical use cases unfortunately assume a lot more CPU power in exchange for a lot less bandwidth, so there's little research in this area. There's also no real incentive to make an audio equivalent of ProRes or DNxHD: if you are editing audio, SSDs have become so fast that you'll run into CPU problems first.


Either that or G.711.


G.711 is neither high bitrate nor usable for music.


Then use G.722, it works fine for music.


No, G.722 is still a wideband speech codec; its available frequency range goes up to 7 kHz. The uncompressed audio this thread began with goes up to 22 kHz. With G.722 you're losing most overtones, or even all overtones from the top of a piano. Please don't use G.722 for music, apart from on-hold muzak.


How is audio encoding the most limiting factor in a VR project? :o AFAIK the Opus encoder eats something like 30-50 MHz of one CPU core.


It sounds plausible that it's the most expensive thing on the server side, if you have cheap simulation/behaviour and many concurrent users.

But unless it's a non-commercial project, the cost shouldn't be a big deal, so it's still a bit strange.


We work on a community-led fork of the dead commercial High Fidelity project. The server requirements are indeed very light except for audio.

Physics is actually farmed out to the clients themselves; it's a bit of a quirky idea, but it actually works if one isn't concerned with accuracy.


I did a prototype of a 3D low-latency server-side mixing system, based on a hypothetical 4k clients at 48 kHz, each hearing a mix of the 64 loudest clients, using Opus forced to CELT-only mode and running 256-sample stereo frames at 128 kbit/s. It worked well, using only 6 cores for that workload. The mixing was trivial, and the decode and encode of 4k streams was entirely doable; the issue at that rate was 1.5M network packets a second. If I were to revisit it, I'd look at using a simple MDCT-based codec with a simple psychoacoustic model based on MPC (minus CVD), modified for shorter frames and for MDCT versus PQMF behaviour, without any Huffman or entropy coding, and put that codec on the GPU. Small tests I did using a 1080 Ti indicated ~1M clients could be decoded, mixed and encoded (same specs as above); the problem is then how to handle ~370M network packets a second :)

Edit: I had high hopes for High Fidelity, and came very close to asking for a job there ;) Shame it's kaput; I didn't know that :(
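
To make the "mix the 64 loudest clients" step concrete, here is a rough Python sketch of that selection-and-sum core (my own illustration, not the poster's code; the RMS loudness measure and all names are assumptions):

    import numpy as np

    N_LOUDEST = 64

    def mix_frame(client_frames):
        """client_frames: dict mapping client id -> one decoded float32 frame."""
        # Rank clients by RMS loudness of their current frame.
        loudness = {cid: float(np.sqrt(np.mean(f ** 2)))
                    for cid, f in client_frames.items()}
        loudest = sorted(loudness, key=loudness.get, reverse=True)[:N_LOUDEST]
        # Sum the top N frames and apply a crude limiter to avoid clipping.
        mix = np.zeros_like(next(iter(client_frames.values())))
        for cid in loudest:
            mix += client_frames[cid]
        return np.clip(mix, -1.0, 1.0)

In a real system each listener would get a personalized mix (excluding their own stream, with per-source spatialization), but the ranking-and-sum core is the same.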


Those are interesting ideas, thanks! I'll have to try and play with that.

High Fidelity the company is still around, but they pivoted radically multiple times. Initially their plan was social VR of sorts. Then they tried to make a corporate product for meetings and such, and gave up on that right before COVID-19 hit!

And after that they ripped out all the 3D and VR and scaled down to a 2D, overhead spatial audio web thing. Think something like Zoom, only you have an icon that you can move around to get closer or further to other people.

The original code still lives on; we picked it up and are working on improvements. Feel free to visit our Discord (see my profile).


Apparently the RP1 team handles bigger crowd loads through muxing on the server, but I'm not sure exactly how that works out for spatial audio. There is a Kent Bye Voices of VR podcast discussing how they got 4k users in the same shard.


Ventrilo/TeamSpeak servers run great on shared hosting. https://www.myteamspeak.com/addons/9ddfa0b2-25c2-4302-8a43-0... gives you positional audio support on a TeamSpeak server.


Why is your VPS server encoding rather than the clients? Are you combining talkers into one source to handle crowds and avoid N^2 or something, and need to re-encode after combining?


Correct, server does spatial audio.

It's a community-led continuation of High Fidelity, a dead commercial project. They made their own proprietary codec with excellent performance, which we can't use, and managed to have a couple thousand people on the same server.


I would like to know the answer to this question from dale_glass too.


Replied to the parent


The Bluetooth codecs are all designed to be very cheap on CPU and low latency - e.g. LC3, AptX or SBC.


I'm skeptical: those are almost always going to be implemented in hardware, so the complexity of a software encoder isn't a design concern.

There is some correlation between the cost of a hardware implementation and complexity of a software implementation. SBC is a very simple codec, but AptX and LC3 might not be much better than Opus.


I couldn't find data on CPU requirements for encode/decode versus, say, Opus, but Apple uses AAC-LD for similar scenarios.


If you are OK with moderately high bitrates, you might prefer something simpler like an ADPCM scheme. ADPCM is pretty easy to implement, certainly a lot less math-heavy than MDCT-based schemes, and it achieves good quality at a somewhat higher bitrate (I have no data, but I'd guess ~200-250%).
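
To give a sense of how little math ADPCM needs, here is a minimal sketch of the per-sample encode/decode step of 4-bit IMA/DVI ADPCM in Python (the tables are the standard IMA ones; framing and headers are left out):

    # Standard IMA ADPCM tables.
    STEP_TABLE = [
        7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31, 34, 37,
        41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143, 157, 173,
        190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658,
        724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
        2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358, 5894,
        6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899, 15289,
        16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767]
    INDEX_TABLE = [-1, -1, -1, -1, 2, 4, 6, 8]

    def decode_sample(code, state):
        predicted, index = state
        step = STEP_TABLE[index]
        diff = step >> 3
        if code & 4: diff += step
        if code & 2: diff += step >> 1
        if code & 1: diff += step >> 2
        predicted += -diff if code & 8 else diff
        predicted = max(-32768, min(32767, predicted))          # clamp to int16
        index = max(0, min(88, index + INDEX_TABLE[code & 7]))  # adapt step size
        return predicted, index

    def encode_sample(sample, state):
        predicted, index = state
        step = STEP_TABLE[index]
        diff = sample - predicted
        code = 8 if diff < 0 else 0
        if code:
            diff = -diff
        if diff >= step:
            code |= 4
            diff -= step
        if diff >= step >> 1:
            code |= 2
            diff -= step >> 1
        if diff >= step >> 2:
            code |= 1
        # Update state exactly as the decoder will, so both stay in sync.
        return code, decode_sample(code, state)

At 4 bits per sample this turns 16-bit 48 kHz mono (768 kbit/s) into 192 kbit/s, i.e. 4:1, using only adds, shifts, and table lookups per sample.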


I believe codec2 is pretty easy computationally. The M17 project uses it IIRC, and implements it on an STM32.


That'd be a good choice except for the requirement to support non-speech audio.


LC3plus or AAC-LD, although they likely don't fit the definition of an open codec.


Vorbis might be a good choice there


That sample at 3200 bits per second is fantastic for such a low bitrate. I also love how that works out to 1.44 MB/hr: one floppy disk per hour!


Imagine a portable podcast player for fans of vintage tech.

(Won't work with music, of course.)


Books-on-floppy would work too. Like a digital version of books on tape, with about as many disk changes.


I really wish something like that would come out. Same with modern takes on the cassette and MiniDisc: something new, but with that kind of hardware. I love physical media!


A Diskman, if you will..


I still remember the times when I regularly visited my friend with 2 floppies just to bring one ~4-minute MP3 song back home.


Subjectively beating Opus on quality at a given bitrate is quite impressive, but I noticed the samples had some interesting audible artifacts. I wonder where these come from, and whether they're related in any way to this codec's use of machine learning techniques.


Very impressive.

It'd be interesting to see what the lift would be to get encoding & decoding running in WebAssembly/wasm. Further, it'd be really neat to try to take something like the tflite_model_wrapper[1] and get it backed by something like tfjs-tflite[2], perhaps running atop, for example, tfjs-backend-webgpu[3].

Longer run, the WebNN[4] spec should hopefully simplify things and bake some of these libraries into the web platform, making running inference much easier. But there's still an interesting challenge and question that I'm not sure how to tackle: how to take native code, compile it to wasm, but have some of the implementation provided elsewhere.

At the moment, Lyra V2 can already use XNNPACK[5], which does have a pretty good wasm implementation. But being able to swap out implementations, so that for example we might use the GPU or other accelerators, could still bring good benefits on various platforms.

[1] https://github.com/google/lyra/pull/89/files#diff-ed2f131a63...

[2] https://www.npmjs.com/package/@tensorflow/tfjs-tflite

[3] https://www.npmjs.com/package/@tensorflow/tfjs-backend-webgp...

[4] https://www.w3.org/TR/webnn/

[5] https://github.com/google/XNNPACK


Why would you want to run codecs in WASM? Makes no sense to me.


Forwards and backwards compatibility and not having to rely on every vendor to ship support for the codec you want to use in your software.


Because the web is awesome and anyone can use something built for it with zero friction.


Things missing that I think would have to be added before this could become a widely used standard:

* Variable bitrate. For many uses, the goal isn't 'fill this channel' but rather 'transmit this audio stream at the best quality possible'. That means filling the channel sometimes, but at other times transmitting less data (i.e. when there is silence, or when the entropy of the speech being sent is low, for example because the user is saying something very predictable); see the sketch after this list.

* Handling all types of audio. Even something designed for phone calls will occasionally be asked by users to transmit music, sound effects, etc. The codec should do an acceptable job at those other tasks.
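
As a crude illustration of the silence-suppression half of the first point, here is a Python sketch of energy-gated transmission (the frame size and threshold are made-up values; real codecs use proper voice activity detection plus comfort-noise signalling):

    import numpy as np

    FRAME_SAMPLES = 320  # 20 ms at 16 kHz, a typical speech-codec frame size
    SILENCE_RMS = 100.0  # hypothetical threshold for 16-bit PCM

    def frames_to_send(pcm):
        """Yield (index, frame) only for frames loud enough to be worth encoding."""
        for i in range(len(pcm) // FRAME_SAMPLES):
            frame = pcm[i * FRAME_SAMPLES:(i + 1) * FRAME_SAMPLES]
            rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
            if rms >= SILENCE_RMS:
                yield i, frame  # encode and transmit this frame
            # else: send nothing; the receiver fills in comfort noise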


Alas, the sources require Bazel to build. That's going to limit adoption because Bazel is difficult to deal with and most projects don't use it.


1. Install numpy (pip3 install numpy)

2. Download a bazel binary (https://github.com/bazelbuild/bazel/releases or use package manager)

3. bazel build -c opt :encoder_main

4. bazel-bin/encoder_main --input_path=testdata/sample1_16kHz.wav --output_dir=$HOME/temp --bitrate=3200

Done!


This is an amazing contribution by Google. I wonder if there is a simple WebRTC demo app available with this codec plugged in?


What's the difference between googleblog.com and blog.google ?


blog.google.com = blogger.l.google.com = 172.217.14.105

googleblog.com = 172.217.14.65
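
If you want to check what each name resolves to yourself, a quick lookup will show it (results will differ by resolver, region, and time, so don't expect those exact addresses):

    import socket

    for host in ("googleblog.com", "blog.google"):
        print(host, socket.gethostbyname(host))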


It would be nice to compare it to a higher-quality sample; currently the samples sound like they were recorded through a telephone (4 kHz).


It's meant to be a phone codec.


"HD" phone codecs seem higher quality than the example given.


That's true, but in this era of high-quality voice and video calls, it's not that uncommon for someone to want to play a song or even a live instrument, so some capability for handling that intelligently seems important.


The mics built into modern smartphones aren't limited to the fidelity level of ancient telephones, so it seems reasonable to hope for more.


I believe the limitation was actually not the mics. You can fit a lot more phone calls into a given bandwidth if you heavily restrict the bandwidth of each phone call.


As an ML person I totally get how the NN works to accomplish this, and it's very cool. What's really cool is how they get this to work in real time with little/no latency.



