On the CallHome dataset humans confuse words 4.1% of the time, but delete 6.5% of words, most commonly deleting the word "I".
Their ASR system confuses 6.5% of words on this dataset but deletes only 3.3%. So if you consider the task to be speech recognition rather than transcription, their claim of being better than humans isn't clearly true.
Also, while the overall "word error rate" is lower than the humans', it's not clear whether that is because the transcription service they used aims for good-enough output rather than perfect output. The errors the transcription service makes may also be less harmful than the errors the ASR system makes, in terms of how well you can recover the original meaning from the transcription.
It's clearly great work, but reaching human parity is marketing fluff.
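For anyone wondering where the substitution/deletion split comes from: here's a minimal sketch of counting the error types from a word-level edit-distance alignment. This is not the scoring tool used in the paper, the function names and the example sentences are made up for illustration, and when several alignments tie on cost the split between error types can differ depending on which one you pick.

```python
def align_counts(ref: str, hyp: str) -> tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for one minimum-cost
    word alignment of a hypothesis against a reference transcript."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = (total_cost, subs, dels, ins) for r[:i] versus h[:j]
    dp = [[(0, 0, 0, 0)] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        dp[i][0] = (i, 0, i, 0)      # everything in the reference was deleted
    for j in range(1, len(h) + 1):
        dp[0][j] = (j, 0, 0, j)      # everything in the hypothesis was inserted
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            c, s, d, n = dp[i - 1][j - 1]
            if r[i - 1] == h[j - 1]:
                diag = (c, s, d, n)              # correct word, no cost
            else:
                diag = (c + 1, s + 1, d, n)      # substitution
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)      # reference word missing
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)     # extra hypothesis word
            dp[i][j] = min(diag, deletion, insertion, key=lambda t: t[0])
    _, subs, dels, ins = dp[len(r)][len(h)]
    return subs, dels, ins


def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    subs, dels, ins = align_counts(ref, hyp)
    return (subs + dels + ins) / len(ref.split())


# Hypothetical example: a dropped "i" plus one similar-sounding substitution.
reference = "i think i will go home"
hypothesis = "think i will go hole"
subs, dels, ins = align_counts(reference, hypothesis)
```

Both kinds of mistake count the same toward the single WER number, which is why two systems with similar WER can still make very different kinds of errors.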
Conversational speech has been very difficult. The Switchboard task has been around for over 20 years. We used the exact same published data and evaluation protocols. The human parity claim on the Switchboard task is as scientific as we can make it, to the best of our knowledge.
I was recently at a bar where they showed a movie with incomprehensible subtitles (English to English). I assume this was because they skimped and bought automatic subtitling.
I think one important aspect is that while humans miss words, they often get the sentence meaning correct. When computers miss words, they tend to substitute words that sound similar. That's readable if you have time, but not necessarily as a stream of text going by...
It might also have been a 'bootleg' DVD from China, or a version downloaded from there. I've had quite a few where they had already done a horrible job translating into Chinese for the subtitles, and then did a literal machine translation back to English.
The character on the screen said "Hello", and the English subtitle said "You good", which would be the literal translation of nihao.
From an ML perspective, recently there's been a lot more work done on language modeling (e.g. billion word corpus). Good language models can usually fix those types of errors. So hopefully you can expect to see that change in the future!
> I was recently at a bar where they showed a movie with incomprehensible subtitles (English to English). I assume this was because they skimped and bought automatic subtitling.
I don't think that is possible. Like, worse subtitling isn't even an option.
I'm not an expert on this at all, but my suspicion is that humans are more tolerant of missing words than we are of mistaken words. Especially a word like "I", we just assume it if it's missing.
So if this speech recognition is for creating a transcript for humans, this (in my uneducated opinion) isn't as good as humans do, at least not yet.
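To put rough numbers on that suspicion, here's a small sketch that re-scores the CallHome substitution and deletion rates quoted elsewhere in this thread with a penalty that treats a missing word as less harmful than a wrong one. The 0.5 deletion weight is purely an assumption for illustration, not something from the paper, and insertion errors are left out because nobody here has quoted them.

```python
def weighted_error(sub_rate: float, del_rate: float,
                   sub_weight: float = 1.0, del_weight: float = 0.5) -> float:
    """Combine error rates (in percent), discounting deletions.

    The default weights are hypothetical: they encode the guess that a
    dropped word hurts a human reader about half as much as a wrong word.
    """
    return sub_weight * sub_rate + del_weight * del_rate


# Rates as quoted in this thread for CallHome (percent of words).
human = weighted_error(sub_rate=4.1, del_rate=6.5)
asr = weighted_error(sub_rate=6.5, del_rate=3.3)

# With equal weights the comparison is the plain sum of the two rates.
human_plain = weighted_error(4.1, 6.5, del_weight=1.0)
asr_plain = weighted_error(6.5, 3.3, del_weight=1.0)

print(f"weighted: human {human:.2f} vs ASR {asr:.2f}")
print(f"unweighted: human {human_plain:.2f} vs ASR {asr_plain:.2f}")
```

Under that assumed weighting the human transcribers come out ahead, while under equal weighting the ASR system does, so the ranking depends on how much you think a deleted word costs the reader.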
If nothing's changed in the past 15 years, anything under 95% accuracy is not accurate enough for automation without human intervention (copy editing).
Looks like they used various neural networks and trained them on data they already have transcribed professionally. So it had to be trained, and it only represents a subset of people who speak and have their words transcribed at Microsoft events.
Being Microsoft, I'm sure the place is diverse, but there's no mention in the paper of accents or dialects that I can see (I might just be missing it).