I'm an MLE; I would probably chop the songs into short segments, add noise (in particular layering in people talking and room noise, and applying frequency-based filtering), and create a dataset that way. Then I would train contrastive embeddings with a hinge loss, using a convnet on the spectrogram.
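Something like this, roughly (a PyTorch sketch; the layer sizes, margin, and names are placeholders I'm making up, not anything Shazam actually does):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FingerprintNet(nn.Module):
        """Tiny convnet mapping a log-mel spectrogram segment to a unit-norm embedding."""
        def __init__(self, emb_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc = nn.Linear(64, emb_dim)

        def forward(self, spec):                      # spec: (batch, 1, n_mels, n_frames)
            h = self.conv(spec).flatten(1)
            return F.normalize(self.fc(h), dim=1)     # unit-norm embedding

    def hinge_contrastive_loss(anchor, positive, negative, margin=0.5):
        # Pull the augmented copy of the same segment close; push a different
        # segment at least `margin` further away (triplet-style hinge).
        d_pos = (anchor - positive).pow(2).sum(dim=1)
        d_neg = (anchor - negative).pow(2).sum(dim=1)
        return F.relu(d_pos - d_neg + margin).mean()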
Ultimately this looks the same, except the "hashes" now come from a convnet. You're still doing some nearest-neighbor search to actually choose the best match.
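The lookup side is then brute-force (or approximate) nearest neighbor over segment embeddings plus a vote across segments, roughly like this (db_embeddings and db_track_ids are assumed names for a precomputed index; at real scale you'd want an ANN index rather than a dense matmul):

    import numpy as np
    from collections import Counter

    def identify(query_embs, db_embeddings, db_track_ids):
        # Cosine similarity reduces to a dot product since embeddings are unit-norm.
        sims = query_embs @ db_embeddings.T          # (n_query_segments, n_db_segments)
        nearest = sims.argmax(axis=1)                # best database segment per query segment
        votes = Counter(db_track_ids[i] for i in nearest)
        return votes.most_common(1)[0][0]            # track with the most segment votes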
I imagine this is what 90% of MLEs would do; I'm not sure if it would work better or worse than what Shazam did. Prior to knowing Shazam works, I might have thought this was a pretty hard problem; knowing Shazam works, I am very confident the approach above would be competitive.
So you want a locality-sensitive hash, or embedding, and you want it to be noise-resistant.
The ML approach is to define a family of data augmentations A and a network N such that, for any augmentation f, we have N(f(x)) ~= N(x). Then we learn the weights of N, and on real data we get N(x') ~= N(x).
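Just to make the notation concrete, the training objective is essentially the following (augmentations and net standing in for A and N; a sketch of the invariance term only, not a full recipe):

    import random

    def invariance_loss(net, x, augmentations):
        # f ~ A: pick one augmentation and penalise the embedding drift it causes.
        # `net` is assumed to return embedding tensors of shape (batch, emb_dim).
        f = random.choice(augmentations)
        return (net(f(x)) - net(x)).pow(2).sum(dim=1).mean()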
The denoising approach is to define a set of denoising algorithms D and a hash function H, so that H(D(x')) ~= H(x). This largely relies on D(x') ~= x, which may have real problems.
So the neural network learns the function we actually need, with the properties we want, whereas the denoiser is designed for a proxy problem.
But that's not all...
Eventually our noise model needs extending (e.g., reverb is a problem): the ML approach adds a new set of augmentations to A. This is fine: it's easy to add new augmentations.
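E.g., adding reverb is just one more callable appended to A, something like the following, where the impulse response is a synthetic stand-in (real ones would be measured or simulated):

    import numpy as np

    def add_reverb(x, impulse_response):
        # Convolve the clean waveform with a room impulse response, trim to length.
        return np.convolve(x, impulse_response)[: len(x)]

    # The family A from the earlier sketch; noise/filtering augmentations
    # would already be in here.
    augmentations = []
    ir = np.random.randn(2000) * np.exp(-np.arange(2000) / 400.0)  # synthetic stand-in IR
    augmentations.append(lambda x: add_reverb(x, ir))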
But the denoiser might need some real algorithmic work, and you have to hope there's no bad interaction with other parts of the pipeline, or too much additional compute overhead. (And de-reverb is notoriously hard.)
That could work, but I think denoising a recording to be a perfect match to the original is probably a very hard problem; so hard that your model will still need to be robust to some deviation from the original track, and therefore you need to do what I said above anyway.
Generally it's much easier to generate noised pairs from clean input than it is to do the reverse, i.e. go record lots of noised inputs from the wild and match them to the original song. So the denoising problem you mention would be tougher still due to covariate shift. I think the features you learn trying to fingerprint the song through noise will probably be a bit more robust, but I don't have a mathematical proof.
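For what it's worth, generating such a pair from a clean segment really is only a few lines (assuming you have some recorded background noise lying around; the SNR target is arbitrary):

    import numpy as np

    def make_noisy_pair(clean, background, snr_db=5.0):
        # Mix a clean segment with recorded background noise at a target SNR.
        noise = background[: len(clean)]
        scale = np.sqrt(clean.var() / (noise.var() * 10 ** (snr_db / 10.0)))
        return clean, clean + scale * noise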
Because then you’re training it on data that is more similar to the operating environment for the application. It’s a better fit for purpose. If the target environment was a clean audio signal, you’d optimise for that instead.
Adding noise is generally helpful for regularization in ML. Most modern deep learning approaches do this in one way or another, most commonly via dropout. It improves the generalization capabilities of the model.
To start from an original song and move it towards something that resembles a real-life recording? IOW: make the NN learn to distinguish between the song's sound and its environment?