If the sample clock edges aren't very (very very very) regular, on a sample-and-hold ADC, the waveform isn't sampled evenly and that manifests as noise that swamps the detail provided by the higher bit depth.
This is called "sample aperture jitter". The jitter you can tolerate is inversely proportional to the signal frequency and shrinks exponentially with bit depth, roughly halving for every extra bit.
Sure enough, for 32 bits sampling a 96kHz signal (the Nyquist frequency at a 192kHz sampling rate), the clock jitter has to stay under about 0.3fs. At 24 bits, it's more like 100fs, which is much more doable, but still not easy.
That's why audio bit depths usually don't go to 32 bits, despite formats like FLAC supporting that.
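For anyone who wants to check those figures without the calculator, here's a rough sketch of the arithmetic. It assumes the usual textbook relations for a full-scale sine wave (ideal N-bit SNR of 6.02N + 1.76 dB, and jitter-limited SNR of -20*log10(2*pi*f*t_j)); the function name is just made up for illustration.

```python
import math

def max_jitter_seconds(bits: int, signal_hz: float) -> float:
    """RMS jitter at which jitter noise reaches the ideal N-bit noise floor.

    Assumes a full-scale sine input and that jitter is the only error source.
    """
    snr_db = 6.02 * bits + 1.76  # ideal SNR of an N-bit converter
    # Invert SNR = -20*log10(2*pi*f*t_j) to solve for t_j.
    return 10 ** (-snr_db / 20) / (2 * math.pi * signal_hz)

for bits in (16, 24, 32):
    femtoseconds = max_jitter_seconds(bits, 96_000) * 1e15
    print(f"{bits} bits at 96 kHz: about {femtoseconds:.3g} fs")
```

That lands at roughly 0.3fs for 32 bits and somewhat under 100fs for 24 bits, in line with the numbers above.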
The practical upshot of this and other noise sources is that higher audiophile-grade bit depths and sampling frequencies are quite likely to have at least some of those bits swamped out by noise on real hardware.
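To put a rough number on "some of those bits", you can run the same relation backwards: given a clock jitter, how many bits are left above the jitter noise? This is only a sketch that considers jitter and ignores every other noise source, and the jitter values in the loop are hypothetical, not measurements of any real hardware.

```python
import math

def jitter_limited_bits(jitter_s: float, signal_hz: float) -> float:
    """Effective bits left when aperture jitter is the only noise source."""
    # Jitter-limited SNR for a full-scale sine wave at signal_hz.
    snr_db = -20 * math.log10(2 * math.pi * signal_hz * jitter_s)
    # Convert SNR to an equivalent ideal bit depth.
    return (snr_db - 1.76) / 6.02

for jitter_ps in (100, 10, 1):  # hypothetical RMS clock jitter, in picoseconds
    bits = jitter_limited_bits(jitter_ps * 1e-12, 96_000)
    print(f"{jitter_ps} ps jitter at 96 kHz: about {bits:.1f} usable bits")
```

Under these assumptions, even 1ps of jitter leaves only around 20 bits at 96kHz.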
This is just getting the audio recorded. Playing it back as physical sound waves adds something between quite a bit and radically more noise to the signal, even if there's never any lossy compression.
It seems like you are arguing with someone that 32 bits per sample is too much resolution, and I agree, but I'm not sure who you are arguing with or who is saying that.
It was an interesting (to me) footnote to the point that even if you massively overspec your audio stream to the point of physically being unable to record the audio at that quality (the footnote being why it's unfeasible), you can still easily fit many such streams down a single modern-ish digital link.
The point isn't that you shouldn't record audio at 32-bit depth (which you probably shouldn't if you expect it to bring much benefit, but that's by-the-by), it's that even if you did, and you have a 7.1 system with 8 uncompressed streams, you still won't be anywhere near the point where USB 3 cable grades will start to matter.
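A quick back-of-the-envelope version of that, assuming the 192kHz sample rate from earlier in the thread and USB 3.0's nominal 5 Gbit/s signalling rate (before any encoding or protocol overhead):

```python
channels = 8                  # 7.1 surround
bits_per_sample = 32
sample_rate_hz = 192_000      # assumption: the rate discussed earlier in the thread
usb3_bps = 5_000_000_000      # USB 3.0 nominal rate, ignoring overhead

# Raw PCM payload for all channels, uncompressed.
stream_bps = channels * bits_per_sample * sample_rate_hz

print(f"Audio payload: {stream_bps / 1e6:.3f} Mbit/s")
print(f"Share of link: {100 * stream_bps / usb3_bps:.2f}%")
```

That works out to about 49 Mbit/s, or less than 1% of the nominal link rate.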
You're the one who asked for clarification on the footnote specifically.
There's an aperture jitter calculator here, along with an article on the topic:
https://www.analog.com/en/design-center/interactive-design-t...
https://www.analog.com/en/technical-articles/aperture-jitter...