A short overview of using HMMs for speech recognition:
Speech is temporal: a speech sample can be represented as a sequence of data points. Thus, a simple way to compare two speech samples is to use an algorithm that compares their corresponding sequences. One such algorithm is Dynamic Time Warping (DTW), an analogue of the Levenshtein distance algorithm for comparing strings.
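A minimal sketch of DTW over scalar sequences, using NumPy. Real systems compare sequences of feature vectors (e.g. MFCCs) rather than scalars, but the recurrence is the same; like Levenshtein, it fills a table of alignment costs, just with the local difference |a[i] - b[j]| in place of 0/1 edit costs:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest neighboring alignment
            cost[i, j] = d + min(cost[i - 1, j - 1],  # both advance
                                 cost[i - 1, j],      # a advances
                                 cost[i, j - 1])      # b advances
    return cost[n, m]
```

Because the warping path can repeat points, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0: the slower sequence aligns to the faster one at no cost, which is exactly why DTW tolerates duration differences.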
So now we have a method of comparing two speech samples. Thus, a simple way to recognize an unknown speech sample would be to keep reference samples of all possible utterances (phonemes, words, sentences) and just return the best match. That is infeasible, since the set of possible words and sentences is far too large. So in speech recognition, only samples of phonemes are kept, and the goal becomes finding the concatenation of them that is most likely relative to the unknown sample. The search space is huge, but luckily dynamic programming algorithms exist to make the search fast (they're called Connected Word Recognition algorithms).
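A toy illustration of that search, assuming a handful of hypothetical phoneme templates stored as scalar sequences. It fills `best[t]`, the cheapest cost of explaining the first `t` points of the sample as a concatenation of templates. This brute-force version re-runs DTW for every split point; real Connected Word Recognition algorithms fold the template matching into a single dynamic-programming pass, but the idea is the same:

```python
import numpy as np

def dtw(a, b):
    # DTW cost table, as in the previous paragraph's description
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = abs(a[i - 1] - b[j - 1]) + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m]

def recognize(sample, templates):
    """Cheapest concatenation of templates explaining `sample`.

    templates: dict mapping a label to a reference sequence.
    Returns (label sequence, total cost).
    """
    T = len(sample)
    best = [np.inf] * (T + 1)   # best[t] = cost of explaining sample[:t]
    back = [None] * (T + 1)     # backpointer: (segment start, label)
    best[0] = 0.0
    for t in range(1, T + 1):
        for s in range(t):
            for name, tpl in templates.items():
                c = best[s] + dtw(tpl, sample[s:t])
                if c < best[t]:
                    best[t], back[t] = c, (s, name)
    # trace back the recognized label sequence
    labels, t = [], T
    while t > 0:
        s, name = back[t]
        labels.append(name)
        t = s
    return labels[::-1], best[T]
```

For example, with templates `{"lo": [1, 1], "hi": [5, 5]}`, the sample `[1, 1, 5, 5, 1]` is recognized as a lo-hi-lo concatenation at zero cost.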
The problem with this approach is that it does not scale: there are countless variations of the same spoken unit (different speakers, styles, and durations), and a single reference sample cannot cover them all. This is where the Hidden Markov Model (HMM) comes in. A single HMM can represent all variations of a unit. For example, it can represent many speech samples of the word "apple", or of the phoneme "æ", and so on. There are standard methods for evaluating how well an HMM matches a speech sample (the forward algorithm) and for training an HMM from multiple samples (the Baum-Welch algorithm).
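The evaluation side can be sketched with the forward algorithm, which computes the likelihood P(observations | HMM). The version below assumes a discrete-output HMM (observations are symbol indices); the specific matrices in the usage example are made up for illustration:

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """P(obs | HMM) via the forward algorithm.

    obs: sequence of symbol indices
    pi:  initial state probabilities, shape (N,)
    A:   state transition matrix, shape (N, N)
    B:   emission probabilities, shape (N, M) over M symbols
    """
    # alpha[i] = P(first observations so far, current state = i)
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate states, then emit
    return alpha.sum()
```

With a deterministic two-state model that must alternate symbols (`pi = [1, 0]`, `A = [[0, 1], [1, 0]]`, `B = [[1, 0], [0, 1]]`), the sequence `[0, 1, 0]` gets likelihood 1 and `[0, 0]` gets likelihood 0, matching the intuition that the HMM scores how plausible a sample is under the unit it models.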
You can also connect phoneme HMMs together to get a big HMM representing words/sentences. The same CWR algorithms as before apply.
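The concatenation step can be sketched as building one big transition matrix from the phoneme HMMs' matrices, with the last state of each sub-HMM linked into the first state of the next. Here each phoneme HMM is assumed to be a pair `(A, B)` of transition and emission matrices, and `bridge` (the probability of advancing into the next phoneme) is an assumed parameter for this sketch, not a standard value:

```python
import numpy as np

def concat_hmms(hmms, bridge=0.5):
    """Chain phoneme HMMs [(A, B), ...] into one word-level HMM."""
    N = sum(a.shape[0] for a, _ in hmms)   # total states
    M = hmms[0][1].shape[1]                # shared symbol alphabet size
    A = np.zeros((N, N))
    B = np.zeros((N, M))
    offset = 0
    for k, (a, b) in enumerate(hmms):
        n = a.shape[0]
        # copy this phoneme's transitions/emissions into its block
        A[offset:offset + n, offset:offset + n] = a
        B[offset:offset + n] = b
        if k + 1 < len(hmms):
            last = offset + n - 1
            A[last] *= (1 - bridge)        # rescale existing mass
            A[last, offset + n] = bridge   # jump into next phoneme
        offset += n
    return A, B
```

The resulting `(A, B)` is itself an HMM, so the same evaluation and search machinery applies to it unchanged, which is what makes the phoneme-to-word composition so convenient.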