
For those too impatient to read the details, check out the "Hear for yourself" examples toward the bottom of the page. They're reproducing decent-sounding speech at 1.6 kbps.

1.6 kbps is nuts! I like to re-encode audio books or podcasts in Opus at 32 kbps and I consider that stingy. The fact that speech is even comprehensible at 1.6 kbps is impressive. As the article explains, their technique is analogous to speech-to-text, then text-to-speech.

The original recordings are a little stiff, and the encoded speech is a little more stiff. It isn't perfect, but it's decent. It'll be interesting to hear this technique applied to normal conversation. If regular speech holds up as well as their samples, it should be perfectly adequate for conversational speech. At 1.6 kbps, which is absurd.

Also, I wonder how well this technique could be applied to music. My guess is that it won't do justice to great musicians ... but it might be good enough for simple pop tunes.




Actually, this won't work at all for music because it makes fundamental assumptions that the signal is speech. For normal conversations, it should work, though for now the models are not yet as robust as I'd like (in case of noise and reverberation). That's next on the list of things to improve.


Here we go! This is the first minute or so of Penny Lane by The Beatles converted down to a 10KB .bin and then back to a .wav: http://no.gd/pennylane.wav .. unsurprisingly the vocals remain recognizable, but the music barely at all.


As imagined by Marilyn Manson...


Pretty much! It shows off how the codec works to a great extent though as it seems to be misinterpreting parts of the music to be the pitch of the speech, so Paul's voice sounds weird at the start of most lines but okay throughout the lines.

I've also run a BBC news report through the program with better results although it demonstrates that any background noise at all can throw things off significantly: https://twitter.com/peterc/status/1111736029558517760 .. so at this low bitrate, it really is only good for plain speech without any other noise.


Well, in the case of music, what happens is that due to the low bit-rate there are many different signals that can produce the same features. The LPCNet model is trained to reproduce whatever is the most likely to be a single person speaking. The more advanced the model, the more speech-like the music is likely to turn out.

When it comes to noisy speech, it should be possible to improve things by actually training on noisy speech (the current model is trained only on clean speech). Stay tuned :-)


Can you try it with Tom's Diner by Suzanne Vega? It's sung without any instruments, and an early version of MP3 reportedly was a disaster on that song.


Here you go: http://no.gd/vega2.wav

It holds up ridiculously well considering the entire song compresses down to 25392 bytes.


The lyrics of the song are 1200 characters long, so this version of the song takes up only about twenty times as much space as the written lyrics.
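As a rough check on those numbers, here is a small sketch of the arithmetic. It assumes the 25392-byte file size and 1200-character lyric count quoted in this thread, one byte per lyric character, and the 1.6 kbps rate from the article; the variable names are only illustrative.

```python
# Figures quoted in the thread; one byte per lyric character is an assumption.
compressed_bytes = 25392
lyrics_chars = 1200
bitrate_bps = 1600  # 1.6 kbps, as stated in the article

# How many times larger the compressed audio is than the plain-text lyrics.
size_ratio = compressed_bytes / lyrics_chars

# How many seconds of audio that many bytes would hold at 1.6 kbps.
implied_seconds = compressed_bytes * 8 / bitrate_bps

print(f"compressed audio is {size_ratio:.1f}x the size of the lyrics")
print(f"implied duration at 1.6 kbps: {implied_seconds:.0f} s")
```

That works out to roughly 21 times the size of the lyric text, and a little over two minutes of audio.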


At some point she sings "loose" instead of "nice" in the compressed version, and a bit later it also sounds like "lulk" instead of "milk". So it's a bit lossy even with respect to the lyrics!


Compare it with this now: https://youtu.be/lHjn8ffnEKU :-)


Could you also try "I Feel Love" by Donna Summer?

I am curious how it sounds when there is a really active bassline and lead synth.


http://no.gd/donna2.mp3

The vocal sections just sound like someone clearing their throat out.


I'm getting a 404 on this


Curious, it definitely works, but the domain is "weird" enough that certain firewalls or proxies may have trouble, perhaps. I've put it at https://gofile.io/?c=F5gle3 as an alternative.


That one works! And now I'm going to have nightmares.


    the music barely at all.
I suspect the reason that excerpt sounds so bad is that the music has several instruments playing at once. One doesn't generally design a vocoder to deal with more than one voice. As that excerpt plays, you can hear that the most prominent instruments (eg: the bass at several moments) sound pleasing, albeit speech-like.

It would probably be different from the original music, but pleasant, if one processed each track separately.


Right. This form of compression assumes a primary single pitch, plus variations from that tone. You can hear it locking into different components of the song and losing almost everything else.

Heavy compression of voice is vulnerable to background noise.

I miss the classic telco 8K samples per second, 8 bits. We used to think that was crappy audio.
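For scale, a quick sketch of what that telco format costs in bits, compared with the 1.6 kbps figure discussed in this thread. It assumes a single uncompressed channel; the variable names are illustrative.

```python
# Classic telephone audio as described above: 8000 samples/s, 8 bits each.
sample_rate_hz = 8000
bits_per_sample = 8
telco_bps = sample_rate_hz * bits_per_sample

# Bitrate of the codec discussed in the article.
codec_bps = 1600

# How many times more bits per second the telco format uses.
ratio = telco_bps / codec_bps

print(f"telco: {telco_bps} bps, which is {ratio:.0f}x the 1.6 kbps codec")
```

So the "crappy" telco audio runs at 64 kbps, about forty times the bitrate of these samples.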


Hilariously nightmarish. I'm going to use this for my alarm clock...


Sounds like a typical LPC encoder at a low bitrate, like maybe 5 kbps.


I tried it with music and the results were spooky. Very ethereal and ghostly. It was only with some classical music though, so I might have to do a pop song next and share the results!


I’ve done experiments with Opus that produce intelligible (but ugly) speech at 2.3 kbit/s. It involves downsampling the Opus stream at the packet level—e.g. transmit only one out of every three packets. It was surprisingly easy. Nothing as sophisticated as what’s going on here.
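A minimal sketch of the packet-level decimation idea, assuming the encoded stream is already available as a list of packets. The `keep_every_nth` helper and the dummy packets are hypothetical stand-ins, not part of any Opus API, and how the receiver fills in the dropped packets isn't described in the comment, so that side is left out.

```python
def keep_every_nth(packets, n=3):
    """Return every n-th packet, starting with the first one."""
    return packets[::n]


# Hypothetical stand-in for encoded packets: nine dummy 20-byte payloads.
packets = [bytes([i]) * 20 for i in range(9)]
kept = keep_every_nth(packets, 3)

# If one packet in three yields 2.3 kbit/s, the undecimated stream
# would have been running at roughly three times that rate.
implied_full_rate_kbps = 2.3 * 3

print(f"kept {len(kept)} of {len(packets)} packets")
print(f"implied original rate: about {implied_full_rate_kbps:.1f} kbit/s")
```

Under that one-in-three assumption, the original stream would have been around 6.9 kbit/s.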

Also based on work by Xiph. Possibly using the same LPC used here.


For comparison, adaptive GSM encodings, which are in use for cellphones today, are also in the single-digit kbps.

https://en.wikipedia.org/wiki/Adaptive_Multi-Rate_audio_code...


I use --vbr --bitrate 16 and it feels indistinguishable from the original for podcasts. As opusenc takes only wav for input and does not use multiple cores, I had to write scripts for parallel re-encoding of stuff.


I like to use Makefiles for parallel encoding.

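    OPUSFLAGS = --vbr --bitrate 16
    # Build one .opus file for every .wav file in the current directory.
    all: $(patsubst %.wav,%.opus,$(wildcard *.wav))
    .PHONY: all
    # Pattern rule: $< is the .wav input, $@ is the .opus output.
    # The recipe line below must start with a tab in a real Makefile.
    %.opus: %.wav
        opusenc $(OPUSFLAGS) $< $@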
Run make -j4 or whatever. There are a few other ways to do this (e.g. xargs).
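Another way to do it, sketched in Python with only the standard library. It assumes opusenc is on the PATH and reuses the flags from the Makefile; `build_command` and `encode_all` are hypothetical helper names, not part of any existing tool.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Same flags as the Makefile's OPUSFLAGS.
OPUSFLAGS = ["--vbr", "--bitrate", "16"]


def build_command(wav):
    """Build the opusenc command line for one .wav file."""
    return ["opusenc", *OPUSFLAGS, str(wav), str(wav.with_suffix(".opus"))]


def encode_all(directory=".", jobs=4):
    """Encode every .wav file in `directory`, running `jobs` encoders at once."""
    wavs = sorted(Path(directory).glob("*.wav"))
    # Threads are enough here: each worker just waits on an external process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() drains the iterator so a failed encode raises here.
        list(pool.map(lambda w: subprocess.run(build_command(w), check=True), wavs))
```

Calling encode_all(jobs=4) from the directory with the recordings is the rough equivalent of make -j4. Unlike make, this sketch re-encodes every file each time instead of skipping outputs that are already up to date.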



