RNNoise: Learning Noise Suppression (xiph.org)
257 points by clouddrover on Sept 29, 2017 | 54 comments



Oddly, I found the RNNoise suppression more distracting than the less sophisticated Speex suppressor. If you'll forgive the figurative language: compared with the noisy tracks, Speex de-emphasises the noise into a quieter but smoother robotic 'bokeh', which allowed me to concentrate on the main speakers.

RNNoise on the other hand seemed to detect silences well, but left artifacts in the speech such that it had a choppy and artificial feel. Lacking the smoothness in the background I found I was more distracted by the distortions in the words.


Interesting -- and unexpected. I also wrote the Speex suppressor and one of the things that specifically annoyed me about it was the robotic noise and the pseudo-reverberation it adds to the speech, but it seems like some people (like you) like that. Trying to understand exactly what you don't like about RNNoise... is it how the remaining background noise sounds or how sharply it turns on/off?

I did a quick hack to RNNoise to smooth out the attenuation and prevent it from cancelling more than 30 dB. I'd be curious if it improves or makes things worse for you (compared to the samples in the demo): https://jmvalin.ca/misc_stuff/rnn_hack1/
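The hack described above (smoothing the attenuation and capping it at 30 dB) might look something like this illustrative Python sketch. This is a guess at the idea, not the actual patch; the function name and the one-pole smoothing coefficient are made up for illustration:

```python
import numpy as np

def smooth_gains(gains, floor_db=-30.0, alpha=0.6):
    """Hypothetical sketch: smooth per-frame band gains over time
    and floor them so attenuation never exceeds floor_db."""
    floor = 10.0 ** (floor_db / 20.0)   # -30 dB -> ~0.0316 linear
    out = np.empty_like(gains)
    prev = np.ones(gains.shape[1])      # start with the gate fully open
    for t, g in enumerate(gains):
        g = np.maximum(g, floor)        # cap attenuation at 30 dB
        prev = alpha * prev + (1.0 - alpha) * g  # one-pole smoothing
        out[t] = prev
    return out
```

The smoothing softens the sharp on/off transitions several commenters describe below, at the cost of letting a little more noise through around word boundaries.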


I also much prefer the Speex version. The robotic noise is consistent and easy to ignore. The NN version has a "choppy" feel to it that catches my attention. It reminds me of H264 vs VP9 video codecs at the same bitrate. VP9 is supposed to be better, but H264 artifacts are more artificial looking. VP9 artifacts look more natural, almost like mold/decay on film, which I find more distracting.

I prefer your hacked RNNoise version to the original, but I still prefer the Speex version. Robotic is predictable, and predictable is good. I don't want denoising artifacts to feel like there's some intelligent agent behind them, just as I don't want any software tool to feel intelligent. The smarter the tool the more jarring it is when it misreads my intentions. It might help average performance but it harms worst-case performance, and worst-case performance is subjectively more important because humans pay attention to outliers.


Subjectively it seems to me that the RNNoise sample doesn't trigger my brain to attempt to fill in the gaps.

With the Speex/raw ones I have all the data so if I listen to it again over and over I can get more out of it eventually.

With the RNNoise one I obviously don't even have enough extra data to even try doing that so all I can do is blame the algorithm.

Perhaps what you really want is an algorithm that lets through a bit more of the 'possible noise' for the human brain to have another go at.


What you're describing is more or less why noise suppression algorithms in general cannot really improve intelligibility of the speech. Unless they're given extra cues (like with a microphone array), there's nothing they can do in real-time that will beat what the brain is capable of with "delayed decision" (sometimes you'll only understand a word 1-2 seconds after it's spoken). So the goal of noise suppression is really just making the speech less annoying when the SNR is high enough not to affect intelligibility.

That being said, I still have control over the tradeoffs the algorithm makes by changing the loss function, i.e. how different kinds of mistakes are penalized.
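One way to picture "how different kinds of mistakes are penalized" is an asymmetric loss on the per-band gains. The sketch below is purely illustrative (it is not RNNoise's actual loss function, and the function name and weighting are invented): it penalizes over-suppression (cutting into speech) more heavily than under-suppression (letting noise through):

```python
import numpy as np

def asymmetric_gain_loss(g_pred, g_true, over_penalty=4.0):
    """Illustrative only: weight errors where the predicted gain is
    too low (speech gets attenuated) more than errors where it is
    too high (noise leaks through)."""
    err = g_pred - g_true
    w = np.where(err < 0, over_penalty, 1.0)  # err < 0: over-suppression
    return float(np.mean(w * err ** 2))
```

Changing `over_penalty` shifts the tradeoff between residual noise and speech artifacts without touching the network architecture.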


Perhaps being more lenient in noisier situations could be an interesting tradeoff then. At lower noise levels it's already pretty good...


I found terminal sounds, fricatives, and sibilants to be at minimum distracting with RNN for "car" and "street", and at worst unintelligible. In particular, the terminal sound in "Christmas" was completely lost in the noise for me at 10 dB with RNN, but was perfectly fine in Speex.

For "car" RNN sounded as good or better than Speex at all noise levels.

That rnn_hack is significantly better for me. 5 dB on that sounds strictly better than 10 dB on the original to my ear for "babble" and "street". I also noticed that for the parts that sound the worst to me at 10-15 dB in the original RNN, the signal is completely missing in the 0 dB RNN version, so perhaps the signal is in the same band as the noise at that point?

Either way it's a tough tradeoff, because I suspect that low-bitrate encodings will love the nearly empty signal in the bands that the original generates, but the seemingly rectangular cutoff/introduction of the noise was much more jarring to me than the reverberation added by Speex (though I didn't like that in Speex either, it didn't seem to add to my effort to understand the way the cutoff did).


It reminds me of the introduction of line noise into VoIP systems to replicate the natural electrical noise present in POTS systems. Without something, the hard attenuation brings more attention to the noise that still exists.


I think the biggest issue for me was the sharpness of the cutoff -- it actually sounded like a simple noise gate to me. The hack here helps a lot in smoothing out the sharp attack.


FWIW I much prefer the RNNoise version to the Speex version. With the Speex version the noise is much more consistently present/noticeable/distracting.


> a choppy and artificial feel.

To me it sounds like the kind of flange-y, wafty MP3 glitches you used to get. At 0 dB on the babble sample, it's painful to listen to, whilst Speex is perfectly fine.

For me, Speex wins on all samples.


I didn't do any technical analysis, but to me RNN sounded as if they had just cut all noise between the narrator's speech and played all sound during the words, giving you a sharp amplitude difference and a "stabbing" sensation.

At least that's my take on why Speex is subjectively more pleasant, with which I agree.


That's it, yeah, a kind of gated ramp effect at the beginning and end of words.


That means that if we had access to the gate we could change that to our taste. Hope that is an option in the function.


Sounds like the difference between (e.g. soft-knee) compression and a noise gate.
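Loosely, that difference can be pictured as two gain-versus-level curves: a hard gate switches abruptly at the threshold, while a soft-knee downward expander fades attenuation in gradually. The sketch below is an illustrative comparison; the function names, thresholds, and knee shape are made up, not taken from either algorithm:

```python
import numpy as np

def gate_gain_db(level_db, threshold_db=-50.0, floor_db=-80.0):
    """Hard noise gate: unity gain above threshold, heavy fixed
    attenuation below it (the abrupt on/off behavior)."""
    return np.where(level_db >= threshold_db, 0.0, floor_db)

def soft_knee_gain_db(level_db, threshold_db=-50.0, ratio=4.0, knee_db=10.0):
    """Downward expander with a soft knee: attenuation ramps up
    gradually near the threshold instead of switching abruptly."""
    under = np.maximum(threshold_db - level_db, 0.0)  # dB below threshold
    # quadratic blend inside the knee region, linear beyond it
    under = np.where(under < knee_db,
                     under ** 2 / (2.0 * knee_db),
                     under - knee_db / 2.0)
    return -(ratio - 1.0) * under
```

A signal hovering around the threshold gets the jarring chatter from the gate but only a gentle, continuous pumping from the soft-knee curve.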


The Speex suppressor is subjectively much nicer in all babble scenarios, but in some scenarios the RNN is better. E.g. it really excels at 20 dB car noise.


Personally I prefer Speex even for car noise, as the RNN adds clicks and scratches that annoy me a lot more than the more frequent soft distortions.


I didn't notice any issue with RNN at the lower levels of background noise, but as the background got noisier, I too heard what sounded like speaker distortion which seemed more distracting than the background noise.

I was slightly confused by the text that claimed that it was expected for the intelligibility to go down though, so maybe that's all working as intended.


I was curious how RNNoise would perform on a noisy street scene. I grabbed a section from a random noisy video and ran it through RNNoise as well as light naive use of the noise removal plugin in Audacity sampled from the 43rd second. The speaker distortion, as noted by ZeroGravitas from their fancy example, is quite evident but I'm still pretty impressed.

audacity screenshot https://d4344e4d9b25f298d9ea-790118db7dd23376c2de685644429e7...

input https://d4344e4d9b25f298d9ea-790118db7dd23376c2de685644429e7...

RNNoise https://d4344e4d9b25f298d9ea-790118db7dd23376c2de685644429e7...

naive audacity filter https://d4344e4d9b25f298d9ea-790118db7dd23376c2de685644429e7...

source https://www.youtube.com/watch?v=4HpF-IoK2y8


I'm not sure how the Audacity filter works exactly, but keep in mind that one goal of RNNoise is real-time operation, so it cannot look ahead when denoising. OTOH, if you're denoising an entire file, then you should look at the whole file at once. This makes it easier to make accurate decisions about what to keep and what to discard.
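The look-ahead point can be illustrated with two toy smoothers over per-frame decisions. This is a simplified sketch (function names invented, not how either tool is implemented): a real-time smoother may only use current and past frames, while an offline one can center its window and "peek ahead":

```python
import numpy as np

def causal_smooth(x, n=5):
    """Real-time: each output depends only on current and past frames."""
    out = np.empty_like(x, dtype=float)
    for t in range(len(x)):
        out[t] = x[max(0, t - n + 1): t + 1].mean()
    return out

def offline_smooth(x, n=5):
    """Offline: a centered window sees future frames too, so the
    smoother reacts before a change actually arrives."""
    h = n // 2
    out = np.empty_like(x, dtype=float)
    for t in range(len(x)):
        out[t] = x[max(0, t - h): t + h + 1].mean()
    return out
```

On a step input (noise suddenly starting), the offline smoother begins ramping before the step while the causal one cannot, which is exactly the advantage a whole-file denoiser has.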


Thank you for these samples, very cool.

To me it sounds like we've got some room for improvement all-around... but color me impressed. I'm also always impressed by Audacity's noise removal when I use it for stupid simple voice-overs. I'd bet this deep learning approach will do nothing but improve, quickly.


Well, there are different tools to clean up an audio track.

$50 will get you Audio Cleaning Lab. If I were home I would do a quick auto-clean and then hand-scrub the file. http://www.magix.com/us/audio-cleaning-lab/detail/

$1199 will get you iZotope's RX. I actually normally get better results from the $50 one than when I use this tool at a friend's shop. https://www.izotope.com/en/products/repair-and-edit/rx-post-...

Spectral editors are amazing for removing certain sounds while keeping the overall sound intact. This is where we need to move: editing at the spectral level, which will have a much higher CPU overhead.

Here is a great article showing different hands-on techniques for noise removal: https://www.soundonsound.com/techniques/noise-reduction-tool...


It would be great to have a good open-source noise-suppressing tool in VST format. The leading software solutions are fairly expensive, e.g. iZotope RX. Only Accusonus Era-N is kind of affordable, but at the price of not being tweakable at all.

By the way, good noise suppression hardware is also comparatively expensive, see for example the Cedar DNS 2 [1]. There could be some business opportunities in that area.

[1] https://www.cedar-audio.com/products/dns2/dns2.shtml


Xiph's implementation of RNNoise is licensed under the 3-Clause BSD license, so it would be easy to wrap a VST around it with a simple GUI.


I also really, really want that. Things that go some of the way are Audacity (which has noise reduction and spectral editing but is not remotely on a par with RX) and https://github.com/lucianodato/noise-repellent.

If somebody credible put together a kickstarter for a FOSS equivalent of RX then I'd back the hell out of it. Mostly I think what's needed is a GUI around OSS stuff that already exists, either as plugins or code that can be borrowed.


Maybe you never seriously looked at it because it's free and somewhat underperforms compared to commercial audio editors on other features, but Audacity's Noise Reduction filter is absolutely top-notch.


A long time ago I used to make amateur remixes, and one tricky part was isolating vocals from the remixed track. To do that I used the noise removal tool: select a part of the track without vocals, run a spectral analysis on it, and then subtract the result from the whole track. Most of the time the result was terribly mangled, but sometimes I got something usable.

This demo got me thinking: if I want to remove something very specific from one track instead of learning a generalized filter, can I train this model with a smaller dataset, like a few seconds from that track?


I'm not sure that would work well, as it would need to understand the cycle of the (unwanted) backing. I.e. in a simple backing track there might be a kick drum, then a snare, and the system would need to know where those 'should' be in relation to the backing. I don't think it would work that way, particularly because of how it achieves the removal of the unwanted noise (altering the response of each band of frequencies).

I'd think it would be possible to create something that would do what you're looking for, but it would be much more complex than the above (and -way- beyond what I'm capable of at the moment, maybe in a couple of years I'll be able to do something like it).

I've had more luck with taking the backing and using phasing to remove it from different sections of a song. If you get a track where the backing is simple, sequenced, and built from samples/repeatable synths (so that the sound is identical each time it happens), then it's possible to take a non-vocal section, align it with the vocal section on another track, and reverse its phase to get cancellation. You have to be precise and get lucky in terms of the rest of the track, but it is possible. There is, of course, the old stereo swap-and-reverse-phase trick, which removes everything that's not panned centrally; that can get you a lot of mileage.
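The phase-cancellation trick can be sketched with synthetic signals. This toy example assumes the best case described above: a non-vocal section whose backing is sample-for-sample identical to the backing under the vocals. Inverting it and summing leaves only the vocal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 48000

backing = rng.standard_normal(n)          # repeated, identical backing part
vocal = np.sin(2 * np.pi * 220 * np.arange(n) / 48000)  # stand-in vocal

mix = backing + vocal                     # section with vocals
instrumental = backing.copy()             # sample-aligned non-vocal section

recovered = mix + (-instrumental)         # invert phase and sum
```

In practice even a one-sample misalignment or a slightly different take breaks the cancellation, which is why you "have to be precise and get lucky".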

As mentioned in another comment, though, getting hold of a cappellas/stems can be much better, and having listened to some from classic tracks, you can learn a lot about production in a short time by doing so.


I think a better approach would be to get a dataset of songs along with their acappella versions and fine tune the network to that.


The rule of thumb I've heard is that neural nets only start to work when you have tens of thousands of examples.


The RNNoise suppression is less appealing to my ear than the Speex suppression... But:

- the approach is pretty cool!

- as mentioned in the article, it might be very useful when applied to multiple speakers (conferencing)

- it might be very interesting for speech recognition software

Also, as a sound guy, when I have a noisy signal I sometimes remove the noise a bit too heavily and then mask the artifacts with some background music. I will definitely try that with the RNNoise suppression!


From the article:

    As strange as it may sound, you should
    not be expecting an increase in intelligibility.
I thought one of the reasons hearing aids were so bad was that they pick up noise equally. Wouldn't this method have a direct impact on making hearing aids better?

I also have a real hard time differentiating people talking in Google hangouts, say, especially if they're using silverware on porcelain. Wouldn't this type of noise suppression help in this case as well?

Seems like pretty awesome stuff.


My comment about intelligibility refers to a human (with normal audition) directly listening to the output. When the output is used in a hearing aid, a cochlear implant, or a low bitrate vocoder, then noise suppression may be able to help intelligibility too.


Hearing aids have gotten way more advanced in the past couple decades. It used to be that they were just parametric equalization. They now do crazy fancy things involving not just noise-reduction but also directionality.


From what I can tell, it seems the RNN learned when the speaker was talking. It then just makes sharp cuts to blank out the audio when the speaker is not talking. It does not appear to have learned how to extract just the frequencies of the speaker, but rather just when a speaker is speaking.

I feel this could be taken a step further, such that when the speaker and an overlapping loud sound occur at the same time, it is able to extract just the speaker's voice.

Now obviously this is easier said than done.


A lot of people get the impression it's only cancelling where there's no speech, but it's also cancelling during speech -- just not as much. If you look at the spectrogram at the top of the demo, you can see HF noise being attenuated when there's LF speech and vice versa.


Wearing my Bose QC25 and still hearing my colleagues talking - I would really like to have ANC that filters out speech.


ANC is basically realtime phase reversal of the waveform and needs almost zero latency. It's hard to get a neural network to run fast enough, especially on an embedded chip.
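The latency point can be made concrete with a toy example (illustrative numbers, not a real ANC system): the anti-noise is just the negated waveform, and even a fraction of a millisecond of processing delay leaves an audible residual:

```python
import numpy as np

fs = 48000
t = np.arange(fs) / fs
noise = np.sin(2 * np.pi * 100 * t)        # 100 Hz drone

anti = -noise                              # ideal zero-latency anti-noise
residual_ideal = noise + anti              # perfect cancellation

delay = 24                                 # 0.5 ms of processing latency
anti_late = -np.concatenate([np.zeros(delay), noise[:-delay]])
residual_late = noise + anti_late          # latency leaves residual noise
```

The residual grows with both the delay and the frequency, which is why ANC works best on low-frequency drones and why a slow, high-latency network is a poor fit for it.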


Would be cool to try to train it using the technique from https://blog.openai.com/deep-reinforcement-learning-from-hum...


One application that would be awesome with NNs is instrument separation.

That way one could build much better music visualization programs, and also be a little more creative. I know I have some ideas if I could do it...


There's definitely more work to be done on that topic, but it's been researched at least since 1995 or so. Search for "source separation".


This could be incredibly useful for processing spectra produced by a mass spectrometer (which are notoriously noisy).


I don't see a link to the dataset on that page. Is the data publicly available? I would love to play with it.


For training I've had to use some non-free data, but there's also some free stuff around. The speech from the examples is from SQAM (https://tech.ebu.ch/publications/sqamcd) and I've also used a free speech database from McGill (http://www-mmsp.ece.mcgill.ca/Documents/Data/). Hopefully if a lot of people "donate their noise", I can make a good free noise database.


Thanks!


I used to be interested in such things, but then I found a really nice Funk/Acid Jazz/House/Soul playlist on Youtube (50 videos). Some of them I don't like, but overall - very enjoyable and puts me in a good mood when programming. It helps I'm new to Funk.

I think Funk and related genres are particularly suited for tasks that demand concentration. Funk de-emphasises melody in favor of rhythm. Melody calls for "active" listening. Funk is at the same time predictable and varied. I spend very little time clicking "next track". For me it's very stimulating listening.

So, a problem that could potentially be solved by neuroscience research and programmers (I understand this is interesting in itself) has been solved by good old playlists for me. And experimentation (trying new music).


> Funk de-emphasises melody in favor of rhythm. Melody calls for "active" listening.

I disagree with the assessment that if melody calls for active listening that rhythm somehow does not. It really depends on the listener and what they value.

There was a time when I might have agreed with you. That was before I learned to play drums.


It may be just me, but the tracks that distract me the most are the ones with a solid melody line. Such as Willow's Song, easily the most beautiful song about milking a bull: https://www.youtube.com/watch?v=8UOtscTCJBk

Or "Peter" from Holderlin's Traum: https://www.youtube.com/watch?v=iGn6nLaTfqU&feature=youtu.be...

Out of curiosity, could you recommend some bands with very good drums? One that springs to mind is Guru Guru. When I stop to think about it, it seems it's the drums that seal the deal for me in many tracks. And Embryo, especially on Embryo's Rache.


For good acoustic drums I'm a fan of the tabla/darbuka.

https://www.youtube.com/watch?v=1yVWmeFM3-Y

https://www.youtube.com/watch?v=GIUlh8qweAc

Something more western - Mastodon - https://www.youtube.com/watch?v=hwgqenxNUfs


My view/experience is that any music with lyrics = cognitive load / distraction.


Yes. My belief is that that is due to the "irrelevant sound [or speech] effect".

(What about non-lyrical music? I'm not sure. The research literature on music interfering/non-interfering with cognition is old, large, and highly equivocal in my impression: https://www.gwern.net/Music-distraction )


> I found a really nice Funk/Acid Jazz/House/Soul playlist on Youtube (50 videos)

Link? :)


yes please @b0rsuk! I'd like the link too. I've been digging funk these days.

Here is an instrumental Funk playlist on Spotify that I have been passively listening to: https://open.spotify.com/user/spotify/playlist/37i9dQZF1DX8f...


Okay, I'm posting. I've been afraid to do so because I might violate some unspoken Hacker News rule and be smitten with downvotes.

I think my adventure started with very generic youtube queries for 'afro' or something like that. Then I stumbled upon Black Merda:

https://www.youtube.com/watch?v=LHSFsWZM1Gk&list=RDLHSFsWZM1... ... and I already liked psychedelic / krautrock.

But it turned out Black Merda is on a very long playlist itself, a playlist I'm disturbingly compatible with!

I'm far from even the half of the playlist, but some of my favorites (other than Black Merda which is awesome): https://www.youtube.com/watch?v=q59ZZtiLgYU&list=RDLHSFsWZM1...

https://www.youtube.com/watch?v=ubyOg-K62co&list=RDLHSFsWZM1...

https://www.youtube.com/watch?v=S6lDGgs7jAc&list=RDLHSFsWZM1...

https://www.youtube.com/watch?v=eaRhetAoqEw

Watermelon Man https://www.youtube.com/watch?v=3FzNpto-jnU&t=1164s

The Variations - Saying It Doing It https://www.youtube.com/watch?v=5OQl4WTkWTc&t=2549s

Only this track, really: https://www.youtube.com/watch?v=-gXrS6eKfjk&t=841s

Plus, there's the joy of hearing silly things like "Back when Dinosaurs ruled the Earth there was a disco not many people knew about." I didn't have so much fun since planet Gong.

I also got some nice recommendations from listening to Sisters of Mercy, including Billy Idol. https://www.youtube.com/watch?v=AAZQaYKZMTI And I'm amazed how underrated Max Sedgley is. Must be the last name. https://www.youtube.com/watch?v=ugEgKA24dig Makes me want to hone my Inkscape skills.



