Is there any scenario where the discrepancy between the FFT and its analog cousin cannot be resolved by upping the sampling density? Distortion is always there, whether it comes from an imprecise sensor/instrument or from extrapolating sampled data. In the audio space nothing above 20 kHz is audible anyway, so even a bog-standard 44.1 kHz sampling rate should be "good enough" for most DSP operations there.
Transients. The FFT is time-symmetrical, and signals that start because of an event (like a mallet hitting something) contain a lot of energy above 20 kHz precisely because of that abrupt start. Brick-wall filtering such a signal at the Nyquist limit always causes pre-ringing, which breaks causality: you get sound starting before the event happened.
(edit) e.g. a band-limited signal with only a single non-zero sample represents not a rectangular pulse but a sinc.
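A minimal numpy sketch of that pre-ringing (the code and its parameters are illustrative, not from the thread): brick-wall low-passing a single impulse in the frequency domain turns it into a sinc whose ripples extend before the event.

```python
import numpy as np

n = 1024
x = np.zeros(n)
x[512] = 1.0  # an idealized transient: a single impulse at sample 512

# Brick-wall low-pass via FFT: zero all bins above 1/4 of the band.
X = np.fft.rfft(x)
cutoff = len(X) // 4
X[cutoff:] = 0.0
y = np.fft.irfft(X, n)

# The filtered signal is a sinc centered on the impulse, so there is
# now non-zero energy *before* the event: the pre-ring.
print("max |y| before the event:", np.abs(y[:512]).max())
```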
> Distortion is always there, whether it comes from an imprecise sensor/instrument or from extrapolating sampled data.
Specifically regarding the latter part, "extrapolating sampled data", I would highly recommend watching this video: https://xiph.org/video/vid2.shtml. As long as your input signal is low-passed to below 22.05 kHz (the Nyquist frequency), 44.1 kHz sampling is perfect. No information is lost, no distortion.
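A quick numpy sketch of that claim (illustrative, not from the video): sample a 10 kHz tone at 44.1 kHz, then evaluate the Whittaker-Shannon sinc reconstruction between the sample points. The error is small and shrinks as the truncated sinc sum gets longer.

```python
import numpy as np

fs = 44100.0        # sampling rate
f = 10000.0         # tone frequency, well below Nyquist (22.05 kHz)
n = np.arange(200)  # sample indices
samples = np.sin(2 * np.pi * f * n / fs)

# Whittaker-Shannon reconstruction at off-grid times, evaluated away
# from the window edges to limit truncation error from the sinc tails.
t = np.linspace(50 / fs, 150 / fs, 1000)
recon = np.array([np.sum(samples * np.sinc(fs * ti - n)) for ti in t])
truth = np.sin(2 * np.pi * f * t)
print("max reconstruction error:", np.abs(recon - truth).max())
```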
However, I'm not qualified to say how the distortion of the naive FFT filter approach changes as you raise the sampling rate.