Can you recommend any good references for getting started with understanding spectrograms? I work in DL-based noise cancellation, and a major part of my work involves analyzing spectrograms. I find it very difficult to do my work without being able to critically analyze these images. Any help would be appreciated.
What do you mean by "understanding the spectrogram"? The graph itself is straightforward: the x axis is time, the y axis is frequency, and the intensity of each pixel represents the intensity of a certain frequency component at a certain point in time.
If you're referring to generating spectrograms with Fourier transforms, you will need some math background to properly do the calculation by hand. It largely just boils down to "find the amount of each frequency over time"
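If it helps to see that "find the amount of each frequency over time" idea concretely, here's a minimal sketch (toy chirp signal, arbitrary sample rate and window/overlap sizes, nothing special about the choice of SciPy) that computes and plots a spectrogram:

```python
# Minimal illustrative sketch: compute and plot a spectrogram of a toy chirp.
# All parameters (sample rate, window length, overlap) are arbitrary choices.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

fs = 16000                                        # assumed sample rate
t = np.arange(2 * fs) / fs                        # 2 seconds of audio
x = np.sin(2 * np.pi * (200 * t + 400 * t**2))    # chirp sweeping upward from 200 Hz

f, times, Sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=384)
plt.pcolormesh(times, f, 10 * np.log10(Sxx + 1e-12), shading="auto")  # power in dB
plt.xlabel("Time [s]")
plt.ylabel("Frequency [Hz]")
plt.colorbar(label="Power [dB]")
plt.show()
```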
Last question: if this is the premise of your work, shouldn't you know about it already?
o The tall vertical lines reflect "plosives" - sudden releases of sound energy, often at the beginning of words, from having the mouth/airway closed and then opened, as in the first letter of "put" or "tea"
o The high frequencies come from "fricatives" like the first letter of "see" or "free" where air is being passed through the teeth or almost closed lips
o The lower frequencies are where most of the recognizable speech content is, corresponding to the way the resonant frequencies of the mouth and throat are being changed (articulation) by moving the tongue, lips and teeth. Specifically, the speech content is in changes to the "formants", which are the changing resonant frequencies showing up as bright, mostly horizontal bands in the lower frequencies
Noise may show up in various ways depending on what the noise source is. A background hum with a fixed frequency spectrum is going to show up as one or more horizontal frequency bands across the entire spectrogram. High-frequency noise is going to show up as much more energy in the higher frequencies, which don't have a lot of energy in clean speech (fricatives only).
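To make the "horizontal band" point concrete, here's a rough sketch of how a constant hum could be flagged programmatically. The 60 Hz tone, the noise level, and the prominence threshold are all made-up illustrative choices, not a recommended recipe:

```python
# Rough sketch: a fixed-frequency hum persists across time, so it survives a
# median over the time axis and shows up as a narrow peak in that median spectrum.
import numpy as np
from scipy.signal import spectrogram, find_peaks

fs = 16000
t = np.arange(2 * fs) / fs
audio = 0.05 * np.random.randn(t.size) + 0.3 * np.sin(2 * np.pi * 60 * t)  # toy noise + mains hum

f, times, Sxx = spectrogram(audio, fs=fs, nperseg=1024, window="hann")
med_db = 10 * np.log10(np.median(Sxx, axis=1) + 1e-12)   # per-bin median over time

peaks, _ = find_peaks(med_db, prominence=15)             # 15 dB threshold is arbitrary
for p in peaks:
    print(f"possible steady tone / hum near {f[p]:.0f} Hz")
```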
Thanks for sharing this! I didn't know about these terms before. Ever consider writing a blog post/tutorial on your knowledge of human speech in spectrograms? This is much more digestible than most of what's out there.
Thanks for your effort in sharing the link. I'm kind of comfortable with most of the theoretical aspects of STFT/FFT/mel scale etc., but when I look at the spectrogram I still feel I'm missing something.
When I look at the spectrogram I want to know how clear the speech in the audio is: is there background noise, is there reverb, is there loss anywhere? I have a feeling these things can be learned from analyzing spectrograms, but I'm not sure how to do it. Hence the question.
Look for clear and distinct frequency bands corresponding to the vocal range of human speech (generally around 100 Hz to 8 kHz).
If the frequency bands are well defined and distinct then the speech is likely clear and intelligible.
If the frequency bands are blurred or fuzzy then the speech may be muffled or distorted.
Note that speech, like any audio source, consists of multiple frequencies: a fundamental frequency and its harmonics.
Background noise can be identified as distinct frequency bands that are not part of the vocal range of human speech.
E.g. if you see lots of bright lines below or above the human vocal range, then there's lots of background noise (a rough band-energy check is sketched at the end of these tips).
Lower frequencies especially can have a big impact on the perceived clarity of a recording, whereas high frequencies come off as more annoying.
Noise within the frequency range of human speech is harder to spot and you should always use your ears to decide whether it's noise or not.
You can also use a spectrogram to check for plosives (e.g. "p", "k", "t" sounds) and sibilants (e.g. "s"), as they can also make a recording sound bad/harsh.
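Here is the rough band-energy check mentioned above - just an illustrative sketch, not a standard metric; the band edges and the synthetic stand-in signal are assumptions:

```python
# Crude sketch: compare spectrogram energy inside a rough speech band
# (~100 Hz to 8 kHz) with energy outside it.
import numpy as np
from scipy.signal import spectrogram

fs = 44100
audio = np.random.randn(2 * fs)                      # stand-in for a loaded recording
f, times, Sxx = spectrogram(audio, fs=fs, nperseg=1024)

in_band = (f >= 100) & (f <= 8000)
speech_energy = Sxx[in_band, :].sum()
other_energy = Sxx[~in_band, :].sum() + 1e-12

print(f"in-band vs out-of-band energy: {10 * np.log10(speech_energy / other_energy):.1f} dB")
# A low ratio suggests lots of rumble/hiss outside the speech band; noise that
# sits inside the band won't be caught this way - you still need to listen.
```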
Unfortunately I think the answer is "we don't know." We have loads of techniques (e.g. band-pass filters) and hypotheses (e.g. harmonic frequencies and timbre), but we haven't been able to implement them perfectly, which seems to be why deep learning has worked so well.
Personally I hypothesize that the reason it's so hard is that the sources are intermixed, sharing frequencies, so isolating certain frequencies doesn't isolate a speaker. We'd need something like beamforming to know how much amplitude of each frequency to extract. I'd also hypothesize that humans, while able to focus on a directional source, cannot "extract" a clean signal either (imagine someone talking while a pan crashes on the floor - it completely drowns out what the person said).
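A tiny synthetic illustration of what I mean by shared frequencies (all the signal parameters here are made up):

```python
# Both "speakers" put energy at 300 Hz, so a band-pass filter around 300 Hz
# returns their mixture, not either one alone.
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 16000
t = np.arange(fs) / fs
speaker_a = 0.7 * np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 900 * t)
speaker_b = 0.5 * np.sin(2 * np.pi * 300 * t + 1.0) + 0.4 * np.sin(2 * np.pi * 1500 * t)
mix = speaker_a + speaker_b

sos = butter(4, [250, 350], btype="bandpass", fs=fs, output="sos")
band = sosfiltfilt(sos, mix)
# `band` still contains the 300 Hz components of *both* speakers, summed -
# picking frequencies alone can't attribute that energy to one source.
```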
Speech is pretty well understood - there are two complementary aspects to it: speech production (synthesis) and speech recognition (via the changing frequency components as they show up in the spectrogram).
When we recognize speech, it's almost as if we're hearing the way the speaker is articulating words, since what we're recognizing is the changing resonant frequencies ("formants") of the vocal tract corresponding to articulation, as well as other articulation cues such as the sudden energy onset of plosives or the high frequencies of fricatives (see my other post in this topic for a bit more info).
High quality (that is, highly intelligible) speech synthesis has been available for a long time based on this understanding of speech production/recognition. One of the earliest speech synthesizers was the DECtalk (from Digital Equipment), introduced in 1984 - a formant-based synthesizer based on the work of linguist Dennis Klatt.
The fact that most of the information in speech comes from the formants can be demonstrated by generating synthetic formant-only speech consisting of nothing but sine waves at the changing formant frequencies. It doesn't sound at all natural, but it is nonetheless very easy to recognize.
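If you want to hear something along these lines yourself, here's a small sketch - the formant values are rough textbook-ish numbers I'm assuming, not measurements - that glides two bare sine waves between /a/-like and /i/-like formant frequencies and writes the result to a WAV file:

```python
# Formant-only "speech": sine waves following approximate formant tracks.
import numpy as np
from scipy.io import wavfile

fs = 16000
t = np.arange(int(fs * 1.0)) / fs             # 1 second

f1 = np.linspace(700, 300, t.size)            # first formant: /a/-like to /i/-like
f2 = np.linspace(1200, 2300, t.size)          # second formant track

# Integrate each instantaneous frequency to get a smooth phase.
phase1 = 2 * np.pi * np.cumsum(f1) / fs
phase2 = 2 * np.pi * np.cumsum(f2) / fs
y = 0.6 * np.sin(phase1) + 0.4 * np.sin(phase2)

wavfile.write("formant_only.wav", fs, (0.5 * y / np.abs(y).max() * 32767).astype(np.int16))
```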
The starting point for human speech recognition is similar to a spectrogram - it's a frequency analysis (cf. the FFT) done by the ear, via the varying-length hair cells in the inner ear vibrating according to the frequencies present, thereby picking up the dominant formant frequencies.
Agreed theoretically; however, if I gave you two spectrograms, would you be able to tell which one is clear speech and which one is garbled? I'd bet we could come up with one that wouldn't pass the sniff test.
If you know of any implementations that can look at a spectrogram and say "hey, there are peaks at 150 Hz, 220 Hz, and 300 Hz with standard deviations of 5 Hz, 7 Hz, and 10 Hz, decreasing in frequency over time, thus this is a deep voice saying 'ay'" and get it right every time, I'd be really interested in seeing it (besides neural networks).
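Not that implementation, but for reference this is the kind of classical starting point I mean - peak picking on a single spectrogram frame. The toy signal, window, and thresholds are all arbitrary; going from "peaks at 150/220/300 Hz" to "a deep voice saying 'ay'" is exactly the part that classical pipelines do unreliably and neural nets learn.

```python
# Sketch: find spectral peaks in one spectrogram frame of a toy harmonic signal.
import numpy as np
from scipy.signal import spectrogram, find_peaks

fs = 16000
t = np.arange(fs) / fs
audio = sum(np.sin(2 * np.pi * f0 * t) for f0 in (150, 220, 300)) + 0.01 * np.random.randn(t.size)

f, times, Sxx = spectrogram(audio, fs=fs, nperseg=2048, window="hann")
frame_db = 10 * np.log10(Sxx[:, Sxx.shape[1] // 2] + 1e-12)      # one time frame, in dB

peaks, props = find_peaks(frame_db, height=frame_db.max() - 20, prominence=10)
for p, prom in zip(peaks, props["prominences"]):
    print(f"peak at {f[p]:.0f} Hz (prominence {prom:.1f} dB)")
```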
Maybe an expert linguist (not me) could do a pretty good job of distinguishing noisy speech in most cases, but a neural net should certainly be able to be super-human at this.
Some sources of noise, like a constant background hum (e.g. a computer fan), are easy to spot though.