This is definitely better than some of the others out there. I threw together some comparisons here at 7kb/s for mp3/opus/aac: https://non.io/TSAC-Comparisons
Happy to add other comparisons if others want any.
Overall, it's FAR better at these lower bit rates, but that doesn't mean it's necessarily good. One issue I see off the bat is that volume is fairly inconsistent in the output for TSAC, which makes stereo in particular quite hard to listen to with the volume "flickering" in each channel independently.
Also, I don't seem to be able to access your page, so there might be an error.
Finally, when doing the Opus comparison it's good to note whether it is using the LACE or NoLACE decoder post-processing filters that became available in Opus 1.5 (note: this feature needs to be enabled at compile time, and during decoding a new API call needs to be made to force the higher-complexity decoder). See https://opus-codec.org/demo/opus-1.5/
I've encoded xHE-AAC at 6 kbit/s mono, which is the closest match I could get. It performs much better and is widely supported (Android 9/iOS 13/macOS 10.15/Windows 11), although there are no free low-bitrate encoders available yet. I used the Fraunhofer IIS Pro encoder via EZ CD Audio Converter. It would be great if you could add it: https://filebin.net/x46m1x7n6d2t7e6b
This is the codec that TSAC extended, so it could be a nice comparison to see. I'd also echo Vocos (from a sibling comment); it operates on the same Encodec tokens but generally has better reconstruction quality.
The fast mode (you don't have to patch the binary for this one; it seems not to do the CRC check?) and the normal (non-fast) mode sound different, but both are quite interesting.
Does anyone know how I would recreate this effect with 2 MP3s acting as 2 radio stations? I have tried fading in samples of static, but it doesn't feel the same.
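Part of what makes real tuning feel different is that the station signal drops out at the same time as the static comes up, rather than static just being layered on top. Here's a rough sketch of that idea in Python (assuming numpy and soundfile are installed; the filenames are placeholders, and you'd decode your MP3s to WAV first):

```python
import numpy as np
import soundfile as sf

# Two "stations", decoded to WAV at the same sample rate (placeholder names).
a, sr = sf.read("station_a.wav")
b, _ = sf.read("station_b.wav")
n = min(len(a), len(b))
a, b = a[:n], b[:n]

# Tuning position sweeps from station A (0.0) to station B (1.0).
t = np.linspace(0.0, 1.0, n)
gain_a = np.clip(1.0 - 2.0 * t, 0.0, 1.0)
gain_b = np.clip(2.0 * t - 1.0, 0.0, 1.0)

# Static fills whatever signal level is "missing" between the stations;
# low-pass filtering this noise would get closer to real radio static.
static = (1.0 - gain_a - gain_b) * np.random.uniform(-0.3, 0.3, n)

if a.ndim > 1:  # broadcast the mono gain/noise curves over stereo channels
    gain_a, gain_b, static = (x[:, None] for x in (gain_a, gain_b, static))

sf.write("tuning.wav", gain_a * a + gain_b * b + static, sr)
```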
- Can't use it in telephony (obvious application for low bitrates); phone handsets and headsets don't have the power to do it in real time.
- Very small files of good quality would be useful in tiny embedded systems that have low flash space: but what systems of that type have the processing power for decoding? Very low storage more or less goes hand in hand with weak processing.
The quality is astonishing for the bit rate, though.
Sometimes you have enough CPU and not enough bandwidth. Remote expeditions, rural schools in underdeveloped parts of the world, etc. You can stream a bunch of stuff (news, audiobooks, daily lectures, etc.) via (otherwise pricey) satellite links, and then a Raspberry Pi or whatever solar-powered device can decode the audio without having to be real-time.
It's not a use for "everybody", but it might reduce the costs for those people who need this (or make new things viable).
Archiving over a long period of time might be a use case.
I often wonder: how much of the data currently in circulation will be lost at some point? HDDs/SSDs last a couple of years. Most of the data in the cloud will be copied over, but some will be lost. If you extrapolate to 1,000 or 1,000,000 years, how much will remain? Will anything survive a civilization collapse? I guess most people don't care, but some will ...
One way to make data media last longer is to make them lower density, and for that such a super-low bitrate could be useful.
Not only was the JBIG2 fiasco not an inherent flaw of JBIG2 itself, but any historical archive would want to use a bounded-error model for any lossy compression algorithm anyway. We don't know exactly how much error is tolerable for given content, but we know that some error is definitely tolerable for most content, and that upper bound can be used to specify a safe and reasonable compression level. Once that constraint has been met, the choice of algorithm is no longer relevant.
If you use a simple encoding (e.g. uncompressed bitmaps), your archival capacity will be extremely limited, especially if you use a low-density medium (optimized for longevity). There's an obvious trade-off between encoding complexity and how much you can archive.
One approach would be to have a layered strategy - simple (but inefficient) encoding for an initial set of data, accompanied by a bootstrap for the next level which would unlock access to a much larger collection of efficiently stored data.
The only data that survives a civilization-level collapse is that which requires as little decoding as possible; in other words, plaintext. Future archaeologists aren't going to have a working copy of your GAN-based audio decoder. Translate your data into text (in as many major languages as possible), carve it into stone, and stuff it in a cave in the desert.
I would worry more about future archaeologists not being able to access e.g. Nvidia and TSMC engineering secrets than their ability to decode my cat pictures and shitty piano practice.
> Can't use it in telephony (obvious application for low bitrates); phone handsets and headsets don't have the power to do it in real time.
Not now, but in another 5 years they will start to, and in 10 years all new ones will probably have the power for this. I find it really exciting, although it will consume more battery to run; but if that is less than what the radio antenna requires, then it might make sense.
There are plenty of use cases; just because you can't think of them doesn't mean there aren't any. Neural-network-based audio codecs like TSAC, Descript, Encodec, and Soundstream are used for music and speech generation, audio upsampling, pre-training acoustic language models, speech-to-speech translation, etc.
I think the big use case is satellite telephony and satellite radio/audio/podcast playback. You could run all audio applications off a 3 kbit/s connection - that's completely insane.
It would probably have to be optimized a bit further though, both in terms of compute and size. The goal would probably be real-time encoding on an iPhone SE without breaking too much of a sweat, and an encoder/decoder of perhaps less than 200 MB?
I am curious how well this works with full orchestral music - that's where encoders usually croak. Give me a sample of the Star Wars theme.
How does satellite telephony work? Alice calls Bob over satellite. Alice's telephone is on AC power, and contains a GPU cluster, and Bob's is the same?
Today's GPU cluster is tomorrow's cell phone. But this codec doesn't require that much power anyway. I didn't try it, but on some forum somebody claimed 0.5x speed on a Core(TM) i3-7100U CPU @ 2.40GHz [1]. It sounds plausible that with some more optimization, and slightly better hardware that's a bit more specialized for AI, it could do real-time encoding and decoding on cell phones.
I remember playing Opus files at 16 kbps over a... 2G? connection with mplayer's caching options. The audio sounded a bit better than MP3@32 or RealAudio back in the day.
As the music was "avant-garde", it almost fitted the genre.
Innovation goes in steps and iterations ;) When mp3 came out, I could just barely play a song encoded from 44.1kHz/16 bit stereo on my PC, taking almost 100% CPU. Today they can be played on a cheap microcontroller.
I like that they share their work; it can lead to something someday.
MP3s were playable on cheap boombox stereos and portable CD players 20+ years ago. Such consumer devices capable of decoding MP3s appeared within less than half a decade of MP3 itself, by my recollection.
I think you are correct on that one. How long will it take to run this neural net on cheap consumer devices? It might take more than 5 years. But if all the new AI stuff is not just hype, but continues to be used, we will probably see hardware for running it in cheap circuits in the not-too-distant future. Maybe using a GPU+RAM-like structure. Maybe analog circuits with analog flash will win? The future will show us :)
Maybe add this URL to the calendar on today's date in 5 years and go back and reply with the answer :-D
It's a neural network, not a traditional compression algorithm. It would be difficult to implement this efficiently in an ASIC AFAIK, but if there are any hardware designers that disagree please chime in.
Traditional codecs also use a lot of “magic” tables with numbers (see e.g. AMR codecs used in GSM telephony).
I think this codec could be optimized to run relatively efficiently on the various AI accelerator chips modern phones have, which is “kind-of” doing it in hardware.
Ham Radio enthusiasts love to do stuff with $1000 radios. If this can run on any reasonable laptop it could be amazing.
They're putting neural accelerators in everything these days, I wouldn't be surprised if they got it to where it could work on a phone, in which case you could do voice over Meshtastic.
Since music quality / stereo are not required, a speech codec could be used. I think this TSAC outperforms most of them on raw bit rate, but not energy efficiency and speed. E.g. SILK goes down to 6 kbps; that could be a contender.
Or maybe you do want really good quality in order to fingerprint the voices. Vocoder artifacts can give parties plausible deniability (that's not my voice).
Clicked the download link wanting to take a look at the source... and was a bit perplexed before quickly canceling it. 237MB, compressed, for an audio codec!? At that point one can't help but think that the samples are already in the decoder itself.
> one can't help but think that the samples are already in the decoder itself
In a certain sense, maybe they are. Or more accurately, small fragments of samples, and just how to mix them together, is what is transmitted. It reminds me of pre-generated dictionaries with classic LZ compression. If an algorithm is going to work on mostly English text, then it might make sense to include an English dictionary with the algorithm. Brotli does this [Wikipedia]:
> Unlike most general-purpose compression algorithms, Brotli uses a predefined dictionary, roughly 120 KiB in size, in addition to the dynamically populated ("sliding window") dictionary. The predefined dictionary contains over 13000 common words, phrases and other substrings derived from a large corpus of text and HTML documents
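To make the preset-dictionary idea concrete, here's a tiny Python sketch using zlib's zdict parameter (the same concept Brotli bakes into its format; the dictionary below is a toy stand-in, not Brotli's real ~120 KiB one):

```python
import zlib

# Toy shared dictionary: both sides agree on it ahead of time.
dictionary = b"the quick brown fox jumps over the lazy dog"
message = b"the lazy dog jumps over the quick brown fox"

plain = zlib.compress(message)

comp = zlib.compressobj(zdict=dictionary)
with_dict = comp.compress(message) + comp.flush()
print(len(plain), len(with_dict))  # the preset dictionary shrinks the output

# The decoder needs the same dictionary to reconstruct the message.
decomp = zlib.decompressobj(zdict=dictionary)
assert decomp.decompress(with_dict) == message
```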
The difference is that Codec2 encodes audio at 8 kHz sampling rate while this TSAC codec encodes audio at 44.1 kHz, which makes a pretty big difference in terms of audio fidelity.
So yeah, it's not exactly a compact stand-alone implementation, but on the other hand it does advanced GPU stuff so I guess nobody expected it to ... or perhaps I did, just a little, based on the author's reputation. :)
Compression is getting so heavy that soon it won't be possible to perform on normal hardware. AV1 already proved that; future audio/video codecs will be even heavier.
Decompression is also getting heavier. Poor mobile devices.
I'm starting to appreciate well written algorithms which don't require massive computing power. JPEG XL is a good example. It has the same compression ratio as AVIF, but requires less processing power.
If I'm understanding the specs correctly, it is basically an LLM, but for audio. So it requires some serious power to encode, because it is using the latest AI hype to achieve the result.
One of the DAC authors here (the codec that this builds off of). Very cool work! Would love to see some more detail on the modifications to DAC. Boosting the capacity with a transformer makes sense to me.
Makes me happy to see DAC getting built on! Thanks!
I might be missing something obvious, but it's not clear to me how to get an mp3 out of this on Ubuntu 22.04.
Following the docs, `./tsac c myfile.mp3 myfile.tsac` generates a tsac file that's unplayable with mpv. Trying ffmpeg to convert to mp3 didn't work: `ffmpeg -i myfile.tsac compressed.mp3` ("myfile.tsac: Invalid data found when processing input"). Using a wav input file has the same result.
I can use `./tsac d myfile.tsac output.wav` (I don't really want to decompress anything, but worth a try) but then after compressing `output.wav` with `ffmpeg -i output.wav output.mp3`, output.mp3 is the same size as if I hadn't used tsac (of course). If I use ffmpeg with a low bitrate like `-b:a 16k`, I get the usual low-quality gargle rather than the tsac output.
FYI (and in case Mr Bellard is reading), for the "Greatest Love of All" demo, the sample labeled "mono 5.02 kb/s" is in fact linked to the 6.79 kb/s stereo sample. The correct file is available at https://bellard.org/tsac/Greatest_Love_mono.wav
This is quite similar to the models used by all the AI music generators. Some feed the tokens into a language model to generate music, some replace the tokenization part with an alternative that gives a continuous representation for diffusion models.
New advancements in media compression always seem to focus on low bitrates, be it audio, video, or images.
Which is totally fair given their applications, but I always wonder how much improvement they bring in high-bitrate scenarios. For example, are there codecs that have much better (perceptible) quality than Apple AAC 256kbps (or that achieve similar quality at, say, 160kbps)? How much better is AV1 at 10 Mbps compared to H.265/H.264 (the improvement of H.265 over H.264 in "transparent" encoding was pretty disappointing, IMHO)?
> are there codecs that have much better (perceptible) quality than Apple AAC 256kbps (or achieving similar quality at, say, 160kbps?)
Opus achieves ABX transparency at around 128kbps (as in, the threshold where the vast majority of users taking a fidelity test are unable to tell the difference between the opus-encoded and lossless version).
> NOTE: Opus doesn't support 44.1kHz sample rates, so encodes to 48kHz sample rate. As this causes browser playback issues, it has been resampled back to 44.1kHz. This may affect the sound quality, so this test should be taken with caution.
This is very surprising to me, in two ways.
Firstly, I knew 44100 Hz is a relic due to historical reasons, but it's still a quite widely used sample rate in the audio world. I had no idea Opus does not support it.
Secondly, it seems to imply browsers can't play back 48kHz audio properly. I didn't dig into the details, but this sounds weird. Just like 44100 Hz, 48kHz is a very common sample rate; I can't imagine a browser would have trouble with it (or any arbitrary sample rate, to be honest).
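For what it's worth, the 48 kHz to 44.1 kHz conversion itself is benign: the ratio reduces exactly to 147/160, so a polyphase resampler handles it with an exact rational ratio. A quick sketch (assuming scipy is installed):

```python
import numpy as np
from scipy.signal import resample_poly

# 44100 / 48000 reduces to 147 / 160, an exact rational ratio.
sr_in, sr_out = 48000, 44100
x = np.sin(2 * np.pi * 1000 * np.arange(sr_in) / sr_in)  # 1 s, 1 kHz tone
y = resample_poly(x, up=147, down=160)  # polyphase 48 kHz -> 44.1 kHz
print(len(x), len(y))  # 48000 44100
```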
Like jasomill mentions, the browser playback issues statement has not been true on desktops for a long time. The most recent (or only) example I know of is iPhones, which finally added passable support for non-44.1kHz audio somewhere between iOS 15.7 (late 2022) and last August. Until then they'd sound like a broken vinyl deck when playing 48 kHz audio, oscillating in playback rate and crackling like crazy, especially when passing through an AudioContext.
The browser audio limitation is presumably a workaround to some bug or performance limitation that was relevant at some point in history (the site was created in 2014).
Kinda related: I was exploring how much complexity you really need with https://qoaformat.org/ - it compresses at 278 kbit/s, but is much simpler than even MP2.
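For a flavor of how simple that end of the spectrum is, here's a toy Python sketch of the sign-sign LMS prediction idea that QOA-style codecs are built around (illustrative only; this is not the actual QOA bitstream or its fixed-point details):

```python
def lms_residuals(samples, taps=4, mu=0.001):
    """Predict each sample from the last few; the (quantized) residuals
    are what a codec would actually store."""
    weights = [0.0] * taps
    history = [0.0] * taps
    residuals = []
    for s in samples:  # samples assumed normalized to roughly [-1, 1]
        pred = sum(w * h for w, h in zip(weights, history))
        err = s - pred
        residuals.append(err)
        # Sign-sign LMS update: nudge each weight toward reducing the error.
        for i in range(taps):
            weights[i] += mu if (err >= 0) == (history[i] >= 0) else -mu
        history = history[1:] + [s]
    # A decoder running the same predictor reconstructs the samples from
    # the residuals alone, so only the residuals need transmitting.
    return residuals
```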
> The Transformer model is evaluated in a deterministic and reproducible way. Hence the result does not depend on the exact GPU or CPU model nor on the number of configured threads. This key point ensures that a compressed file can be decompressed using a different hardware or software configuration.
How is this possible? Does it use floating point and concurrency?
Cross-platform floating-point determinism is seriously difficult. The Rapier physics engine could do it [0], at the expense of disabling SIMD and multithreading. It also works only on platforms that strictly comply with IEEE 754-2008, which I think GPUs usually don't (regarding subnormal numbers, etc.). Another thing that may cause issues is fused multiply-add, which can give higher precision than doing the multiplication and addition separately (I think some platforms don't have FMA in hardware).
For example, it seems that TSAC currently runs on CPUs and nvidia GPUs. Could porting to AMD GPUs affect determinism?
It's possible, but you have to make sure that floating-point operations always happen in the same order (for example, you could operate on blocks concurrently, then merge them serially). You also have to be careful with optimizations like FMA, because they produce a different result than a multiply followed by an add.
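A toy Python illustration of both points: float addition isn't associative, so a reproducible implementation has to pin down the reduction order, e.g. with the block trick above:

```python
import numpy as np

# Same three numbers, different grouping, different answers.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(0.5)
print((a + b) + c)  # 0.5
print(a + (b + c))  # 0.0 -- (b + c) rounds back to -1e8 in float32

# Reproducible reduction: fixed-size blocks summed independently
# (parallelizable), then merged in a fixed serial order.
def det_sum(x, block=4096):
    partials = [x[i:i + block].sum(dtype=np.float32)
                for i in range(0, len(x), block)]
    total = np.float32(0.0)
    for p in partials:  # fixed merge order => same result on any machine
        total += p
    return total
```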
I attempted some ML-as-'compression' experiments ~2 years ago and ended up hitting a wall. Check out the samples/pitch here: https://lorinhalpert.com/ipoc/ala/
If anyone has audio encoding, playback, and/or DSP experience, email me to be invited to our Discord server so we can take another crack at it! :)
This appears to sit right in the middle between something that could be used for music, at a higher bit rate that is still much lower than competing codecs, and something very effective for voice communication, shrinking the bandwidth (and thus the bit rate) while also limiting artifacts. Not an expert in the field, but I think the supplied examples aren't the best ones to show its potential.
Reading this, I was wondering how far along video compression with transformers is; it turns out decoding is still too expensive in practice (under 10 FPS for 1080p video).
>The Transformer model is evaluated in a deterministic and reproducible way. Hence the result does not depend on the exact GPU or CPU model nor on the number of configured threads.
That's neat. So even though it's "AI-based" its output is guaranteed to be the same for a given input?
Are we almost converting music to MIDI at this point?
As I understand it, the model is learning the landscape of sound combinations that are interesting to humans, and as such no combination of raw bytes in the recorded file will result in white noise (for example) being heard, because that was never trained for.