TSAC: Low Bitrate Audio Compression (bellard.org)
236 points by ajitk 9 months ago | 98 comments



Always love a good bellard ship.

This is definitely better than some of the others out there. I threw together some comparisons here at 7kb/s for mp3/opus/aac: https://non.io/TSAC-Comparisons

Happy to add other comparisons if others want any.

Overall, it's FAR better at these lower bit rates, but that doesn't mean it's necessarily good. One issue I see off the bat is that volume is fairly inconsistent in the output for TSAC, which makes stereo in particular quite hard to listen to with the volume "flickering" in each channel independently.


Since Bellard's codec is "AI"-based, can you add Google's Lyra v2 ( https://github.com/google/lyra ) and Facebook/Meta's EnCodec ( https://github.com/facebookresearch/encodec )?

Also, I don't seem to be able to access your page, so there might be an error.

Finally, when doing the Opus comparison it's good to denote whether it is using the LACE or NoLACE decoder post-processing filters that became available in Opus 1.5 (note: this feature needs to be enabled at compile time, and during decode an extra API call needs to be made to force the higher-complexity decoder). See https://opus-codec.org/demo/opus-1.5/


> Also, I don't seem to be able to access your page, so there might be an error.

Interesting, do you have JavaScript turned off? Can you access this page? https://html.non.io/TSAC-Comparisons/


The page works; earlier when I tried I got a login page. This page is good.

Also awesome to see a comparison to EnCodec, which I think is one of the better ones available: https://ai.honu.io/papers/encodec/samples.html

Also, can you confirm whether the Opus decode is classic, or uses the LACE or NoLACE post-processing filters that are available in Opus 1.5?


Also, I added EnCodec, but wasn't able to get the prereqs working for Lyra v2.


I've encoded xHE-AAC at 6 kbit/s mono, which is the closest match I could get. It performs much better and is widely supported (Android 9/iOS 13/macOS 10.15/Windows 11), although there are no free low bitrate encoders available yet. I used the Fraunhofer IIS Pro with EZ CD Audio Converter. It would be great if you could add it: https://filebin.net/x46m1x7n6d2t7e6b


Yes! This is even at 44.1 kHz, so you might find that you get perceptually better results by trying a lower sampling rate (32 kHz, 22 kHz, etc.).


That's a much more helpful comparison. TSAC clearly works far better than MP3 and Opus at low bitrates.


Another useful model to compare to would be DAC https://github.com/descriptinc/descript-audio-codec

This is the codec that TSAC extended, so it could be a nice comparison to see. I'd also echo Vocos (from the sibling comment); it operates on the same EnCodec tokens but generally has better reconstruction quality.


Which Opus settings did you use? Note that Opus recently got new ML features: https://opus-codec.org/demo/opus-1.5/


Could you add Vocos at 1.5, 3.0 and 6.0 kbps? https://gemelo-ai.github.io/vocos/


if you patch out the CRC check in the binary with

echo -ne "\x90\x90" | dd if=/dev/stdin of=tsac bs=1 seek=23914 conv=notrunc
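(0x90 is the x86 NOP opcode; presumably this stomps the two-byte conditional jump guarding the CRC check. The 23914 offset is specific to this exact binary build.)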

you can corrupt the compressed files with very interesting results: https://meow.social/@mimir/112238998609778334

the fast mode (you don't have to patch the binary for this one; it seems to not do the CRC check?) and the normal (non-fast) mode sound different, but both are quite interesting


That’s incredible, especially the second one that kind of creates a new song. I guess that’s the danger when it’s based on a generative model.


I like how it degrades in such an analog way


Reminds me of what it sounds like to tune an old FM radio


Does anyone know how I would recreate this effect with 2 MP3s acting as 2 radio stations? I have tried fading in samples of static, but it doesn't feel the same.


The design is very human.


Really? It sounds super tinny and digital, more like when Neo is first pulled out of the Matrix.


It’s breaking into Cotton Eye Joe by Rednex (which is another piece of contemporary Eurodance in the same vein).


Note that the second one is actually a mix of different parts/songs.


The voices in my head speak in MP3 to MIDI...


The first track's corruption sounds to me a bit like the intro to "Knuckles" by The Presets.


I can clearly hear Cotton Eye Joe in the second sample!


This doesn't have much of a use case.

- Can't use it in telephony (obvious application for low bitrates); phone handsets and headsets don't have the power to do it in real time.

- Very small files of good quality would be useful in tiny embedded systems that have little flash space, but what systems of that type have the processing power for decoding? Very low storage more or less goes hand in hand with weak processing.

The quality is astonishing for the bit rate, though.


Sometimes you have enough CPU and not enough bandwidth. Remote expeditions, rural schools in underdeveloped parts of the world, etc. You can stream a bunch of stuff (news, audiobooks, daily lectures, etc.) via (otherwise pricey) satellite links, and then a Raspberry Pi or whatever solar-powered device can decode the audio without having to do it in real time.

It's not a use for "everybody", but it might reduce the costs for those people who need this (or make new things viable).


Archiving over a long period of time might be a use case.

I often wonder: how much of the data currently in circulation will be lost at some point? HDDs/SSDs last a couple of years. Most of the data in the cloud will be copied over, but some will be lost. If you extrapolate to 1,000 or 1,000,000 years, how much will remain? Will anything survive a civilization collapse? I guess most people don't care, but some will...

One way to make storage media last longer is to make them lower density, and for that such a super-low bitrate could be useful.


No. Historical archives shouldn't use overly clever compression algorithms. Remember the JBIG2 fiasco. https://en.wikipedia.org/wiki/JBIG2#Character_substitution_e...


Not only was the JBIG2 fiasco not an inherent flaw of JBIG2 itself, but any historical archive would want to use a bounded error model for any lossy compression algorithm anyway. We don't know exactly how much error is tolerable for given content, but we know that some error is definitely tolerable for most content, and its upper bound can be used to specify a safe and reasonable compression level. Once that constraint has been met, the choice of algorithm is no longer relevant.
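
As a sketch of what a bounded error model could look like in practice (the 30 dB threshold and the synthetic "round trip" here are stand-ins; a real archive would pick bounds per content type):

    import numpy as np

    def snr_db(original, decoded):
        # reconstruction signal-to-noise ratio in dB
        noise = original - decoded
        return 10 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))

    fs = 44100
    x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)           # 1 s of A440
    x_hat = x + np.random.default_rng(0).normal(0, 1e-3, fs)   # stand-in for a lossy round trip

    MIN_SNR_DB = 30.0  # assumed bound; accept the lossy copy only above it
    assert snr_db(x, x_hat) >= MIN_SNR_DB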


If you use a simple encoding (e.g. an uncompressed bitmap), your archival capacity will be extremely limited, especially if you use a low-density medium (optimized for longevity). There's an obvious trade-off between encoding complexity and how much you can archive.

One approach would be to have a layered strategy - simple (but inefficient) encoding for an initial set of data, accompanied by a bootstrap for the next level which would unlock access to a much larger collection of efficiently stored data.


The only data that survives a civilization-level collapse is that which requires as little decoding as possible; in other words, plaintext. Future archaeologists aren't going to have a working copy of your GAN-based audio decoder. Translate your data into text (in as many major languages as possible), carve it into stone, and stuff it in a cave in the desert.


I would worry more about future archaeologists not being able to access e.g. Nvidia and TSMC engineering secrets than their ability to decode my cat pictures and shitty piano practice.


> Can't use it in telephony (obvious application for low bitrates); phone handsets and headsets don't have the power to do it in real time.

Not now, but in another 5 years they will start to, and in 10 years all new ones will probably have the power for this. I find it really exciting, although it will consume more battery to run; but if that is less than what the radio antenna requires, then it might make sense.


People are still bullish on Moore's law.


There are plenty of use cases; just because you can't think of them doesn't mean there aren't any. Neural-network-based audio codecs like TSAC, Descript, EnCodec, and SoundStream are used for music and speech generation, audio upsampling, pre-training acoustic language models, speech-to-speech translation, etc.

Check out the citations of Encodec (Facebook's open sourced audio codec) for more examples: https://scholar.google.com/scholar?cites=1126914113099467682...


Something like this might be useful to put into hardware, e.g. with custom silicon or FPGAs. Maybe then the processing power won't be a big issue?

Edit: OK, I just saw the thing is over 200 MB; might not be feasible for a while.


I think the big use case is satellite telephony and satellite radio/audio/podcast playback. You could do all audio applications off a 3 kbit/s connection; that's completely insane.

It would probably have to be optimized a bit further though, both in terms of compute as well as size. The goal would probably be real-time encoding on an iPhone SE without breaking too much of a sweat, and an encoder/decoder of perhaps less than 200 MB?

I am curious how well this works with full orchestral music; that's where encoders usually croak. Give me a sample of the Star Wars theme.


How does satellite telephony work? Alice calls Bob over satellite. Alice's telephone is on AC power, and contains a GPU cluster, and Bob's is the same?


Today's GPU cluster is tomorrow's cell phone. But this codec doesn't require that much power anyway. I haven't tried it, but on some forum somebody claimed 0.5x speed on a Core i3-7100U CPU @ 2.40GHz [1]. It sounds plausible that with some more optimization and slightly better hardware, a bit more specialized for AI, it could do real-time encoding and decoding on cell phones.

[1] https://hydrogenaud.io/index.php/topic,125765.0.html


I remember playing Opus files at 16 kbps over a... 2G? connection with mplayer's caching options. The audio sounded a bit better than MP3 at 32 kbps or RealAudio back in the day.

As the music was "avant-garde", it almost fit the genre.


Don't be such a Debbie Downer! Celebrate it, and we'll find a use for it some day.


Innovation goes in steps and iterations ;) When MP3 came out, I could just barely play a song encoded from 44.1 kHz/16-bit stereo on my PC, taking almost 100% CPU. Today they can be played on a cheap microcontroller.

I like that they share their work; it can lead to something some day.


MP3s were playable on cheap boom-box stereos and portable CD players 20+ years ago. Such consumer devices capable of decoding MP3s appeared within less than half a decade of MP3 itself, by my recollection.


I think you are correct on that one. How long will it take to run this neural net on cheap consumer devices? It might take more than 5 years. But if all the new AI stuff is not just hype, and continues to be used, we will probably see hardware for running it on cheap circuits in the not too distant future. Maybe using a GPU+RAM-like structure. Maybe analog circuits with analog flash will win? The future will show us :)

Maybe add this URL to the calendar on today's date in 5 years and go back and reply with the answer :-D


When MP2 came out, my computer was barely able to play a song in 44.1 kHz 16-bit mono. I think the bitrate was 192 kbps, but I'm not sure.

(Later on I was so surprised MP4 didn't replace MP3!)


If you thought it was weird that they went to video with MP4, imagine my shock that the next generation they got into firearms.


That escalated quickly


Am I missing something in thinking that this could be alleviated like every other compression algorithm by implementing it via a hardware codec?


It's a neural network, not a traditional compression algorithm. It would be difficult to implement efficiently in an ASIC AFAIK, but if there are any hardware designers who disagree, please chime in.


Traditional codecs also use a lot of “magic” tables with numbers (see e.g. AMR codecs used in GSM telephony).

I think this codec could be optimized to run relatively efficiently on the various AI accelerator chips modern phones have, which is “kind-of” doing it in hardware.


I figured some sort of NPU + tuned hardware would be enough, but I'm just going off of intuition.


Ham Radio enthusiasts love to do stuff with $1000 radios. If this can run on any reasonable laptop it could be amazing.

They're putting neural accelerators in everything these days, I wouldn't be surprised if they got it to where it could work on a phone, in which case you could do voice over Meshtastic.


Storage of large amounts of voice conversations for regulatory purposes? (Say, a trading floor.)


Since music quality / stereo are not required, a speech codec could be used. I think TSAC outperforms most of them on raw bit rate, but not on energy efficiency and speed. E.g. SILK goes down to 6 kbps; that could be a contender.

Or maybe you do want really good quality in order to fingerprint the voices. Vocoder artifacts can give parties plausible deniability (that's not my voice).


Or mass surveillance.


> phone handsets and headsets don't have the power to do it in real time.

Until quite recently my phone was the fastest computer I owned.

What phone cannot decode these?


Clicked the download link wanting to take a look at the source... and was a bit perplexed before quickly canceling it. 237MB, compressed, for an audio codec!? At that point one can't help but think that the samples are already in the decoder itself.

I wonder how it compares to https://en.wikipedia.org/wiki/Codec2 and related codecs, which go even lower for bitrate.


> one can't help but think that the samples are already in the decoder itself

In a certain sense, maybe they are. Or more accurately, small fragments of samples, and how to mix them together, are what is transmitted. It reminds me of pre-generated dictionaries in classic LZ compression. If an algorithm is going to work on mostly English text, then it might make sense to include an English dictionary with the algorithm. Brotli does this [Wikipedia]:

> Unlike most general-purpose compression algorithms, Brotli uses a predefined dictionary, roughly 120 KiB in size, in addition to the dynamically populated ("sliding window") dictionary. The predefined dictionary contains over 13000 common words, phrases and other substrings derived from a large corpus of text and HTML documents
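
zlib exposes the same idea through a preset-dictionary parameter, so here's a toy sketch of the effect (the strings are made up, and this is zlib's zdict rather than Brotli's built-in dictionary):

    import zlib

    # Priming the compressor with a preset dictionary lets matches point
    # into the dictionary from the very first byte, which helps short inputs.
    preset = b"the quick brown fox jumps over the lazy dog"
    data = b"the quick brown fox is quick"

    c = zlib.compressobj(zdict=preset)
    with_dict = c.compress(data) + c.flush()
    plain = zlib.compress(data)

    d = zlib.decompressobj(zdict=preset)
    assert d.decompress(with_dict) == data
    print(len(with_dict), "vs", len(plain))  # the dictionary version should come out smaller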


The difference is that Codec2 encodes audio at 8 kHz sampling rate while this TSAC codec encodes audio at 44.1 kHz, which makes a pretty big difference in terms of audio fidelity.


Also, codec2 is a vocoder, meaning it is specialised for compressing speech. Give it any old 8 kHz sampled audio and it probably wouldn't sound so good.


That sounded interesting, so I did the download and had a look in the archive. Here's the list of contents:

    $ tar tvzf ~/Downloads/tsac-2024-04-08.tar.gz 
    drwxrwxr-x bellard/bellard   0 2024-04-08 14:47 tsac-2024-04-08/
    -rw-rw-r-- bellard/bellard 3040 2024-04-08 14:47 tsac-2024-04-08/readme.txt
    -rwxrwxr-x bellard/bellard 3979504 2024-04-08 14:47 tsac-2024-04-08/libnc_cuda.so
    -rwxrwxr-x bellard/bellard  565336 2024-04-08 14:47 tsac-2024-04-08/libnc.so
    -rw-rw-r-- bellard/bellard 49639706 2024-04-08 14:47 tsac-2024-04-08/tsac_stereo_q8.bin
    -rw-rw-r-- bellard/bellard 85407494 2024-04-08 14:47 tsac-2024-04-08/dac_stereo_q8.bin
    -rw-rw-r-- bellard/bellard 49633561 2024-04-08 14:47 tsac-2024-04-08/tsac_mono_q8.bin
    -rw-rw-r-- bellard/bellard       31 2024-04-08 14:47 tsac-2024-04-08/Changelog
    -rwxrwxr-x bellard/bellard   287536 2024-04-08 14:47 tsac-2024-04-08/tsac
    -rw-rw-r-- bellard/bellard 85143422 2024-04-08 14:47 tsac-2024-04-08/dac_mono_q8.bin
So yeah, it's not exactly a compact stand-alone implementation, but on the other hand it does advanced GPU stuff so I guess nobody expected it to ... or perhaps I did, just a little, based on the author's reputation. :)


But that's the point of it.


An Nvidia GPU is necessary for fast operation.

Compression is getting so heavy that soon it won't be possible to perform it on normal hardware. AV1 already proved that; future audio/video codecs will be even heavier.

Decompression is also getting heavier. Poor mobile devices.

I'm starting to appreciate well written algorithms which don't require massive computing power. JPEG XL is a good example. It has the same compression ratio as AVIF, but requires less processing power.


If I'm understanding the specs correctly, it is basically an LLM, but for audio. So it requires some serious power to encode, because it is using the latest AI hype to achieve the result.


Clearly we need a generic LLM PCIe card!


One of the DAC authors here (the codec that this builds off of). Very cool work! Would love to see some more detail on the modifications to DAC. Boosting the capacity with a transformer makes sense to me.

Makes me happy to see DAC getting built on! Thanks!


Digital-Analog-Converter?


Descript Audio Codec: https://github.com/descriptinc/descript-audio-codec, mentioned in the original post. But yes, that is why we called it DAC! :)


I might be missing something obvious, but it's not clear to me how to get an mp3 out of this on Ubuntu 22.04.

Following the docs, `./tsac c myfile.mp3 myfile.tsac` generates a tsac file that's unplayable with mpv. Trying ffmpeg to convert to mp3 didn't work: `ffmpeg -i myfile.tsac compressed.mp3` ("myfile.tsac: Invalid data found when processing input"). Using a wav input file has the same result.

I can use `./tsac d myfile.tsac output.wav` (I don't really want to decompress anything, but it's worth a try), but then after compressing output.wav with `ffmpeg -i output.wav output.mp3`, output.mp3 is the same size as if I hadn't used tsac (of course). If I use ffmpeg with a low bitrate like `-b:a 16k`, I get the usual low-quality gargle rather than the tsac output.


FYI (and in case Mr Bellard is reading), for the "Greatest Love of All" demo, the sample labeled "mono 5.02 kb/s" is in fact linked to the 6.79 kb/s stereo sample. The correct file is available at https://bellard.org/tsac/Greatest_Love_mono.wav


This is quite similar to the models used by all the AI music generators. Some feed the tokens into a language model to generate music; some replace the tokenization part with an alternative that gives a continuous representation for diffusion models.


New advancements in media compression always seem to focus on low bitrates, be it audio, video, or image.

Which is totally fair given their applications, but I always wonder how much improvement they bring in high-bitrate scenarios. For example, are there codecs that have much better (perceptible) quality than Apple AAC at 256 kbps (or achieve similar quality at, say, 160 kbps)? How much better is AV1 at 10 Mbps compared to H.265/H.264? (The improvement of H.265 over H.264 in "transparent" encoding was pretty disappointing, IMHO.)


> are there codecs that have much better (perceptible) quality than Apple AAC 256kbps (or achieving similar quality at, say, 160kbps?)

Opus achieves ABX transparency at around 128 kbps (as in, the threshold where the vast majority of users taking a fidelity test are unable to tell the difference between the Opus-encoded and lossless versions).

https://abx.digitalfeed.net/opus.html


Thank you! The note there:

> NOTE:Opus doesn't support 44.1kHz sample rates, so encodes to 48kHz sample rate. As this causes browser playback issues, it has been resampled back to 44.1kHz. This may affect the sound quality, so this test should be taken with caution.

Is very surprising to me, in two ways.

Firstly, I knew 44100 Hz is a relic kept for historical reasons, but it's still a quite widely used sample rate in the audio world. I had no idea Opus does not support it.

Secondly, it seems to imply browsers can't play back 48 kHz audio properly. I didn't dig into the details, but this sounds weird. Just like 44100 Hz, 48 kHz is a very common sample rate; I can't imagine browsers would have trouble with it (or any arbitrary sample rate, to be honest).


Like jasomill mentions, the browser playback issues statement has not been true on desktops for a long time. The most recent (or only) example I know of is iPhones, which finally added passable support for non-44.1 kHz audio somewhere between iOS 15.7 (late 2022) and last August. Until then they'd sound like a broken vinyl deck when playing 48 kHz audio, oscillating in playback rate and crackling like crazy, especially when passing through an AudioContext.


Opus doesn't support 44.1 kHz because of compatibility and the effort/benefit ratio:

https://github.com/xiph/opus/issues/43

The browser audio limitation is presumably a workaround for some bug or performance limitation that was relevant at some point in history (the site was created in 2014).


Kinda related: I was exploring how much complexity you really need with QOA ( https://qoaformat.org/ ), which compresses at 278 kbit/s but is much simpler than even MP2.


> The Transformer model is evaluated in a deterministic and reproducible way. Hence the result does not depend on the exact GPU or CPU model nor on the number of configured threads. This key point ensures that a compressed file can be decompressed using a different hardware or software configuration.

How is this possible? Does it use floating point and concurrency?

Cross-platform floating-point determinism is seriously difficult. The Rapier physics engine could do it [0], at the expense of disabling SIMD and multithreading. It also works only on platforms that strictly comply with IEEE 754-2008, which I think GPUs usually don't (regarding subnormal numbers etc.). Another thing that may have issues is fused multiply-add, which may give higher precision than doing the multiplication and addition separately (I think some platforms don't have FMA in hardware).

For example, it seems that TSAC currently runs on CPUs and nvidia GPUs. Could porting to AMD GPUs affect determinism?

[0] https://rapier.rs/docs/user_guides/rust/determinism/


It's possible, but you have to make sure that floating-point operations always happen in the same order (for example, you could operate on blocks concurrently and then merge them serially). You also have to be careful with optimizations like FMA, because they produce a different result than a multiply followed by an add.
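
A tiny illustration of the ordering problem, using plain Python doubles:

    # Floating-point addition is not associative, so a parallel reduction
    # that merges partial sums in a different order yields different bits.
    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)  # 0.6000000000000001
    print(a + (b + c))  # 0.6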


Are you sure this cross-platform determinism works for GPUs? I can't find any reference about that.


I attempted some ML-as-'compression' experiments ~2 years ago and ended up hitting a wall. Check out the samples/pitch here: https://lorinhalpert.com/ipoc/ala/

If someone has audio encoding, playback, and/or DSP experience, email me to be invited to our Discord server so we can take another crack at it! :)


So, let me get this straight.

Using a ~300 MB model, on a 1 TB hard drive, at 8 kb/s, we can store... ~30 years of music.
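
(The arithmetic: 1 TB is about 8×10^12 bits, and at 8,000 bits/s that's 10^9 seconds, i.e. roughly 31.7 years of continuous audio.)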


This appears to sit right in the middle: something that could be used for music at a higher bit rate (still much lower than competing codecs), or something very effective for voice communication, shrinking the bandwidth (and thus the bit rate) while also limiting artifacts. I'm not an expert in the field, but I think the supplied examples aren't the best ones to show its potential.


Finally, a bit-rate where I can tell the difference between compressed and original!


Next step: use a 1000B LMM (Large Music Model) trained on 1000+ TB of music for zero shot retrieval of any possible sound


Reading this, I was wondering how far along video compression (with transformers) is. It turns out decoding is still too expensive in practice (under 10 FPS for 1080p video).

https://arxiv.org/abs/2206.07307 https://arxiv.org/abs/2210.13827


FYI it looks to be MIT/BSD license.

Separately:

>The Transformer model is evaluated in a deterministic and reproducible way. Hence the result does not depend on the exact GPU or CPU model nor on the number of configured threads.

That's neat. So even though it's "AI-based" its output is guaranteed to be the same for a given input?


I wonder where this codec sits on the complexity/bitrate graph from this post: https://phoboslab.org/log/2023/02/qoa-time-domain-audio-comp...


Pretty good! EnCodec also comes to mind as a neural codec: https://ai.honu.io/papers/encodec/samples.html


Are there standard-ish codec comparison processes that we can run to see how much perceived fidelity is lost in compression here?


Perceptual codecs must, almost by definition, be perceived. So the "standard" comparison is an ABX listening test.
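
The trial logic itself is trivial; here's a minimal sketch, where the play() callback is a placeholder for actually rendering audio:

    import random

    def abx_trial(play, a, b):
        # Play A, B, then X (secretly A or B); return True if the
        # listener identified X. Chance performance is 50%.
        x_is_a = random.random() < 0.5
        play("A", a)
        play("B", b)
        play("X", a if x_is_a else b)
        guess = input("X sounded like [a/b]? ").strip().lower()
        return (guess == "a") == x_is_a

Run enough trials and a simple binomial test tells you whether the listener beats chance, i.e. whether the codec is transparent to them.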


Ah, so not like movie or image encoding where we have perceptual scoring algorithms that evolve over time. Thanks!


Reminds me of the old IBM 'RECOVC' codec from around 2000 where they compressed Mel-bank speech.

https://ieeexplore.ieee.org/document/7075313

All the patents around that are long dead, so it's a good time to do an updated version, I guess.

If you wanted to do something similar but with way lower bitrates (e.g. 300bps), then look at the NRV codec:

https://www.researchgate.net/publication/224209493_300_bps_n...


Addendum: playing with it at the 'low quality' end, it generates recognizable speech even down to 200 bps and in some cases 100 bps. Crazy.


What's Nvidia specific about it?


Bellard strikes again...

Are we almost converting music to MIDI at this point?

As I understand it, the model is learning the landscape of sound combinations that are interesting to humans, and as such there will be no combination of raw bytes in the compressed file that results in white noise (for example) being heard, because that was never trained for.

What if it was though?


There are plenty of interesting musical pieces using white noise.


How well does it work on any song outside of its training set?


Well, I hope these samples were not part of the training set. Otherwise, this showcase would be quite useless.



