I noticed that RNNoise doesn't appear to be an open model: it can't be re-trained from scratch, because the source data isn't publicly documented (or may not exist at all), even if you had enough hardware.
The documentation is a bit sparse. Some original data is available for download, along with more information about the whole process (most of which is beyond my grasp, as I'm not an ML person), in the demo blog post: https://jmvalin.ca/demo/rnnoise/ (towards the bottom of the page)
Coming back with information from #xiph on freenode:
16:57 <ArsenArsen> where and under what license is the training data used for RNNoise?
18:38 <rillian> ArsenArsen: There's a copy of what I believe is the training data on the xiph server, but afaik it's never been published
18:39 <rillian> the original submission page has an EULA waiving copyright and liability claims, and agreeing that it _may_ be released CC0.
18:40 <rillian> it looks like that didn't actually happen.
18:41 <rillian> there may have been concerns about auditing it for privacy issues, but there's a lot of audio to listen to, 6.5G compressed
18:41 <rillian> jmspeex, TD-Linux: what's the status of publishing the rnnoise training data?
18:43 <jmspeex> Are you talking about the data that was used to train the default RNNoise model or the noise that got collected with the demo?
18:43 <rillian> jmspeex: I think debian just cares about the training data for the default model.
18:44 <jmspeex> There was never plan to release that -- it includes data from databases we cannot release
18:44 <jmspeex> but I don't see what the issue is. Distributing the model is not the same as distributing the data
18:45 <rillian> ah, I see. I didn't realize you'd used proprietary sources as well.
The other data source mentioned in the paper is the NTT Multi-Lingual Speech Database for Telephonometry, which appears to be a commercial product, so presumably it is under a proprietary license.
No, none of the contributed noise data was used for training. The training was done before the demo that asked for noise contributions. The contributions are CC0, but they were never used (i.e. the quality of that dataset is entirely unknown).
There are training instructions in the repository. The training scripts appear to use some fairly standard ML libraries (I see Keras and mentions of TensorFlow), so I imagine the requirements are the same as theirs.
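To give a sense of what that Keras dependency implies, here is a minimal sketch of a GRU-based band-gain model in the general spirit of the RNNoise paper. This is *not* the repository's actual training script: the layer sizes are simplified, and the feature count (42 inputs, 22 Bark-band gains) is taken from the paper; everything else is an illustrative assumption.

```python
# Hedged sketch of an RNNoise-style model in Keras (not the repo's code).
# The network maps per-frame spectral features to per-band gains in [0, 1].
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NB_FEATURES = 42  # input features per frame, as described in the paper
NB_BANDS = 22     # output gains, one per Bark-scale band, as in the paper

def build_model():
    # Variable-length sequence of feature frames.
    inp = keras.Input(shape=(None, NB_FEATURES))
    x = layers.Dense(24, activation="tanh")(inp)
    # A single GRU stands in for the paper's multi-GRU topology.
    x = layers.GRU(96, return_sequences=True)(x)
    # Sigmoid keeps each predicted band gain in [0, 1].
    gains = layers.Dense(NB_BANDS, activation="sigmoid")(x)
    return keras.Model(inp, gains)

model = build_model()
model.compile(optimizer="adam", loss="mse")
# Actual training would need (noisy-feature, ideal-gain) pairs computed from
# clean speech mixed with noise -- exactly the data this thread is about.
```

So the barrier isn't the tooling (Keras/TensorFlow are freely installable); it's that the feature/gain training pairs were derived from databases that cannot be released.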