Oddly, I found the RNNoise suppression more distracting than the less sophisticated Speex suppressor. If you'll forgive the figurative language: compared with the noisy tracks, Speex de-emphasises the noise into a quieter but smoother robotic 'bokeh', which allowed me to concentrate on the main speakers.
RNNoise on the other hand seemed to detect silences well, but left artifacts in the speech such that it had a choppy and artificial feel. Lacking the smoothness in the background, I found I was more distracted by the distortions in the words.
Interesting -- and unexpected. I also wrote the Speex suppressor and one of the things that specifically annoyed me about it was the robotic noise and the pseudo-reverberation it adds to the speech, but it seems like some people (like you) like that. Trying to understand exactly what you don't like about RNNoise... is it how the remaining background noise sounds or how sharply it turns on/off?
I did a quick hack to RNNoise to smooth out the attenuation and prevent it from cancelling more than 30 dB. I'd be curious if it improves or makes things worse for you (compared to the samples in the demo):
https://jmvalin.ca/misc_stuff/rnn_hack1/
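Roughly, the idea looks like this (a simplified sketch of the approach, not the actual patch): clamp each band's gain to a -30 dB floor and low-pass it over time so the attenuation can't switch abruptly.

    import numpy as np

    def smooth_gains(gains, floor_db=-30.0, alpha=0.6):
        """Clamp per-band gains to a floor and smooth them over time.

        gains:    (num_frames, num_bands) linear gains in [0, 1]
        floor_db: never attenuate a band by more than this many dB
        alpha:    smoothing factor; closer to 1 = slower gain changes
        """
        floor = 10.0 ** (floor_db / 20.0)            # -30 dB -> ~0.032 linear
        out = np.empty_like(gains)
        prev = np.ones(gains.shape[1])               # start with all bands open
        for t, g in enumerate(gains):
            g = np.maximum(g, floor)                 # cap attenuation at 30 dB
            prev = alpha * prev + (1.0 - alpha) * g  # one-pole low-pass on the gain
            out[t] = prev
        return out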
I also much prefer the Speex version. The robotic noise is consistent and easy to ignore. The NN version has a "choppy" feel to it that catches my attention. It reminds me of H264 vs VP9 video codecs at the same bitrate. VP9 is supposed to be better, but H264 artifacts are more artificial looking. VP9 artifacts look more natural, almost like mold/decay on film, which I find more distracting.
I prefer your hacked RNNoise version to the original, but I still prefer the Speex version. Robotic is predictable, and predictable is good. I don't want denoising artifacts to feel like there's some intelligent agent behind them, just as I don't want any software tool to feel intelligent. The smarter the tool the more jarring it is when it misreads my intentions. It might help average performance but it harms worst-case performance, and worst-case performance is subjectively more important because humans pay attention to outliers.
What you're describing is more or less why noise suppression algorithms in general cannot really improve intelligibility of the speech. Unless they're given extra cues (like with a microphone array), there's nothing they can do in real-time that will beat what the brain is capable of with "delayed decision" (sometimes you'll only understand a word 1-2 seconds after it's spoken). So the goal of noise suppression is really just making the speech less annoying when the SNR is high enough not to affect intelligibility.
That being said, I still have control over the tradeoffs the algorithm makes by changing the loss function, i.e. how different kinds of mistakes are penalized.
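As a toy illustration (this is not the actual loss), a loss over the per-band gains can weight the two failure modes differently, e.g. penalize over-suppression (eating into speech) more than under-suppression (leaving some noise behind):

    import numpy as np

    def asymmetric_gain_loss(g_pred, g_true, over_penalty=4.0):
        """Toy loss over per-band gains: err < 0 means we attenuated
        too much, so those errors get a heavier weight."""
        err = g_pred - g_true
        w = np.where(err < 0.0, over_penalty, 1.0)
        return np.mean(w * err ** 2)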
I found terminal sounds, fricatives, and sibilants to be, at a minimum, distracting with RNN for "car" and "street", and at worst unintelligible. In particular, the terminal sound in "Christmas" was completely lost in the noise for me at 10dB with RNN, but was perfectly fine in Speex.
For "car" RNN sounded as good or better than Speex at all noise levels.
That rnn_hack is significantly better for me. 5dB on that sounds strictly better than 10dB on the original to my ear for "babble" and "street". I also noticed that for the parts that sound the worst to me at 10-15dB in the original RNN, the signal is completely missing in the 0dB RNN version, so perhaps the signal is in the same band as the noise at that part?
Either way it's a tough tradeoff, because I suspect that low-bitrate encodings will love the nearly empty bands the original generates, but the seemingly rectangular cutoff/introduction of the noise was much more jarring to me than the reverberation added by Speex (though I didn't like that in Speex either, it didn't seem to add to my effort to understand the way the RNN artifacts did).
It reminds me of the introduction of line noise into VoIP systems to replicate the natural electrical noise present in POTS systems. Without something, the hard attenuation brings more attention to the noise that still exists.
I think the biggest issue for me was the sharpness of the cutoff -- it actually sounded like a simple noise gate to me. The hack here helps a lot in smoothing out the sharp attack.
FWIW I much prefer the RNNoise version to the Speex version. With the Speex version the noise is much more consistently present/noticeable/distracting.
To me it sounds like the kind of flange-y wafty MP3 glitches you used to get. At 0dB on the babble sample, it's painful to listen to whilst Speex is perfectly fine.
I didn't do any technical analysis, but to me RNN sounded as if it had just cut all noise between the narrator's speech, and played all sound during their words, giving you a sharp amplitude difference and "stabbing" sensation.
At least that's my take on why Speex is subjectively more pleasant, with which I agree.
The Speex suppressor is subjectively much nicer in all babble scenarios, but in some scenarios the RNN is better. E.g. it really excels at 20dB car noise.
I didn't notice any issue with RNN at the lower levels of background noise, but as the background got noisier, I too heard what sounded like speaker distortion which seemed more distracting than the background noise.
I was slightly confused by the text claiming that intelligibility was expected to go down, though, so maybe that's all working as intended.
I was curious how RNNoise would perform on a noisy street scene. I grabbed a section from a random noisy video and ran it through RNNoise, as well as a light, naive pass of the noise removal plugin in Audacity with its profile sampled from the 43rd second. The speaker distortion, as noted by ZeroGravitas in their fancy example, is quite evident, but I'm still pretty impressed.
I'm not sure how the Audacity plugin works exactly, but keep in mind that one goal of RNNoise is real-time operation, so it cannot look ahead when denoising. OTOH, if you're denoising an entire file, then you should look at the whole file at once. This makes it easier to make accurate decisions about what to keep and what to discard.
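For instance, an offline denoiser can smooth its gain decisions with a zero-phase (forward-backward) filter, so every frame effectively "sees" the future. A minimal sketch, assuming the per-frame gains are already computed:

    from scipy.signal import filtfilt

    def offline_smooth_gains(gains, alpha=0.6):
        """Zero-phase smoothing of a (num_frames, num_bands) gain track.
        A real-time suppressor can only smooth causally; running the
        same one-pole filter forward and backward gives free look-ahead."""
        b, a = [1.0 - alpha], [1.0, -alpha]
        return filtfilt(b, a, gains, axis=0)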
To me it sounds like we've got some room for improvement all-around... but color me impressed. I'm also always impressed by Audacity's noise removal when I use it for stupid simple voice-overs. I'd bet this deep learning approach will do nothing but improve, quickly.
Spectral editors are amazing for removing certain sounds while keeping the overall sound intact. This is where we need to move: editing at the spectral level, which will have a much higher CPU overhead.
It would be great to have a good open source noise suppression tool in VST format. The leading software solutions are fairly expensive, e.g. iZotope RX. Only Accusonus Era-N is kind of affordable, but at the price of not being tweakable at all.
By the way, good noise suppression hardware is also comparatively expensive, see for example the Cedar DNS 2 [1]. There could be some business opportunities in that area.
I also really, really want that. Things that go some of the way are Audacity (which has noise reduction and spectral editing but is not remotely on a par with RX) and https://github.com/lucianodato/noise-repellent.
If somebody credible put together a kickstarter for a FOSS equivalent of RX then I'd back the hell out of it. Mostly I think what's needed is a GUI around OSS stuff that already exists, either as plugins or code that can be borrowed.
Maybe you never seriously looked at it because it's free and somewhat underperforms compared to commercial audio editors on other features, but Audacity's Noise Reduction filter is absolutely top-notch.
A long time ago I used to make amateur remixes, and one tricky part was to isolate vocals from the remixed track. To do that I was using the noise removal tool: select a part of the track without vocals, run a spectral analysis on it, and then subtract the result from the whole track. Most of the time the result was terribly mangled, but sometimes I got something usable.
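In modern terms that's basically spectral subtraction. A minimal numpy/scipy sketch of the same idea (function and parameter names are mine):

    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtract(x, noise_clip, fs, nperseg=1024, floor=0.05):
        """Estimate an average spectrum from a section without the wanted
        signal, then subtract it from the magnitude of the whole track."""
        _, _, N = stft(noise_clip, fs, nperseg=nperseg)
        profile = np.abs(N).mean(axis=1, keepdims=True)           # average "noise" profile
        _, _, X = stft(x, fs, nperseg=nperseg)
        mag = np.maximum(np.abs(X) - profile, floor * np.abs(X))  # spectral floor
        _, y = istft(mag * np.exp(1j * np.angle(X)), fs, nperseg=nperseg)
        return y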
This demo got me thinking: if I want to remove something very specific from one track instead of learning a generalized filter, can I train this model with a smaller dataset, like a few seconds from that track?
I'm not sure that would work well, as it would need to understand the cycle of the (unwanted) backing - i.e. in a simple backing track there might be a kick drum, then a snare, and the system would need to know where those 'should' be in relation to the backing. I don't think it would work that way, particularly given how it achieves the removal of the unwanted noise (altering the response of each band of frequencies).
I'd think it would be possible to create something that would do what you're looking for, but it would be much more complex than the above (and -way- beyond what I'm capable of at the moment, maybe in a couple of years I'll be able to do something like it).
I've had more luck with taking the backing and using phasing to remove it from different sections of a song. If you get a track where the backing is simple, sequenced, and built from samples/repeatable synths (so that the sound is identical each time it happens), then it's possible to take that non-vocal section, align it with the vocal section on another track, and reverse its phase to get cancellation; you have to be precise and get lucky in terms of the rest of the track, but it is possible. There is, of course, the old stereo swap and reverse-phase trick, which removes everything that's panned centrally; that can get you a lot of mileage.
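In code terms both tricks boil down to subtraction. A bare-bones sketch (finding the right offset is the hard, lucky part):

    import numpy as np

    def cancel_backing(mix, backing_section, offset):
        """Subtract a sample-identical backing section from the mix,
        starting at `offset` samples; inverting the phase and summing
        is the same thing as subtracting."""
        out = mix.astype(np.float64)
        n = min(len(backing_section), len(out) - offset)
        out[offset:offset + n] -= backing_section[:n]
        return out

    def remove_center(stereo):
        """Stereo swap/reverse-phase trick: L - R cancels anything
        panned dead center (often the vocal... and the kick)."""
        return stereo[:, 0] - stereo[:, 1]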
As mentioned in another comment, though, getting hold of acappellas/stems can be much better, and having listened to some from classic tracks, you can learn a lot about production in a short time by doing so.
The RNNoise suppression is less appealing to my ear than the Speex suppression... But:
- the approach is pretty cool!
- as mentioned in the article, it might be very useful when applied to multiple speakers (conferencing)
- it might be very interesting for speech recognition software
Also, as a sound guy, when I have a noisy signal I sometimes suppress the noise a bit too heavily -> then I mask the artifacts with some background music. I will definitely try that with the RNNoise suppression!
> As strange as it may sound, you should not be expecting an increase in intelligibility.
I thought one of the reasons hearing aids were so bad was that they pick up noise equally. Wouldn't this method have a direct impact on making hearing aids better?
I also have a real hard time differentiating people talking in Google hangouts, say, especially if they're using silverware on porcelain. Wouldn't this type of noise suppression help in this case as well?
My comment about intelligibility refers to a human (with normal audition) directly listening to the output. When the output is used in a hearing aid, a cochlear implant, or a low bitrate vocoder, then noise suppression may be able to help intelligibility too.
Hearing aids have gotten way more advanced in the past couple decades. It used to be that they were just parametric equalization. They now do crazy fancy things involving not just noise-reduction but also directionality.
From what I can tell, it seems the RNN learned when the speaker was talking. It then just makes sharp cuts to blank out when the speaker is not talking. It does not appear to have learned how to extract just the frequencies of the speaker, but rather just when a speaker is speaking.
I feel this could be taken a step further, so that when the speaker and an overlapping loud sound happen at the same time, it is able to extract just the speaker's voice.
A lot of people get the impression it's only cancelling where there's no speech, but it's also cancelling during speech -- just not as much. If you look at the spectrogram at the top of the demo, you can see HF noise being attenuated when there's LF speech and vice versa.
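Concretely, the output is one gain per band per frame, applied to that frame's spectrum, which is why the HF bands can be attenuated while LF speech passes through. A simplified sketch of the mechanism (not the actual code):

    import numpy as np

    def apply_band_gains(frame_spectrum, gains, band_edges):
        """Scale each frequency band of one STFT frame by its own gain.

        frame_spectrum: complex FFT bins for one frame
        gains:          one linear gain per band, in [0, 1]
        band_edges:     bin indices delimiting the bands (len(gains) + 1)
        """
        out = frame_spectrum.copy()
        for g, lo, hi in zip(gains, band_edges[:-1], band_edges[1:]):
            out[lo:hi] *= g   # e.g. 0.1 on a noisy HF band, ~1.0 on speech bands
        return out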
ANC is basically realtime phase reversal of the waveform and needs to be almost zero latency. It's hard to get a neural network to run fast enough, especially on an embedded chip.
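Rough numbers (my own back-of-the-envelope): with a couple of centimetres between the reference mic and the driver, the anti-noise has to come out within tens of microseconds, i.e. just a few samples at 48 kHz:

    SPEED_OF_SOUND = 343.0  # m/s at room temperature
    mic_to_ear = 0.02       # assumed 2 cm reference-mic-to-driver path
    budget = mic_to_ear / SPEED_OF_SOUND   # ~58 microseconds of latency budget
    print(f"{budget * 1e6:.0f} us -> ~{budget * 48000:.1f} samples at 48 kHz")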
For training I've had to use some non-free data, but there's also some free stuff around. The speech from the examples is from SQAM (https://tech.ebu.ch/publications/sqamcd) and I've also used a free speech database from McGill (http://www-mmsp.ece.mcgill.ca/Documents/Data/). Hopefully if a lot of people "donate their noise", I can make a good free noise database.
I used to be interested in such things, but then I found a really nice Funk/Acid Jazz/House/Soul playlist on Youtube (50 videos). Some of them I don't like, but overall - very enjoyable and puts me in a good mood when programming. It helps that I'm new to Funk.
I think Funk and related genres are particularly suited for tasks that demand concentration. Funk de-emphasises melody in favor of rhythm. Melody calls for "active" listening. Funk is at the same time predictable and varied. I spend very little time clicking "next track". For me it's very stimulating listening.
So, a problem that could potentially be solved by neuroscience research and programmers (I understand this is interesting in itself) has been solved by good old playlists for me. And experimentation (trying new music).
> Funk de-emphasises melody in favor of rhythm. Melody calls for "active" listening.
I disagree with the assessment that if melody calls for active listening that rhythm somehow does not. It really depends on the listener and what they value.
There was a time when I might have agreed with you. That was before I learned to play drums.
It may be just me, but the tracks that distract me the most are the ones with a solid melody line. Such as Willow's Song, easily the most beautiful song about milking a bull:
https://www.youtube.com/watch?v=8UOtscTCJBk
Out of curiosity, could you recommend some bands with very good drums? One that springs to mind is Guru Guru. When I stop to think about it, it seems it's the drums that seal the deal for me in many tracks. And Embryo, especially on Embryo's Rache.
Yes. My belief is that that is due to the "irrelevant sound [or speech] effect".
(What about non-lyrical music? I'm not sure. The research literature on music interfering/non-interfering with cognition is old, large, and highly equivocal in my impression: https://www.gwern.net/Music-distraction )
Plus, there's the joy of hearing silly things like "Back when Dinosaurs ruled the Earth there was a disco not many people knew about." I haven't had so much fun since Planet Gong.