
The actual paper has a section on error analysis that is particularly enlightening: https://arxiv.org/abs/1610.05256

On the CallHome dataset humans confuse words 4.1% of the time, but delete 6.5% of words, most commonly deleting the word "I".

Their ASR system confuses 6.5% of words on this dataset but deletes only 3.3%, so depending on how you weigh the two error types, their claim of being better than humans isn't definitively true, particularly if you consider the task to be speech recognition rather than transcription.
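For reference, the word error rate being compared here counts substitutions, deletions and insertions against the reference transcript and divides by the number of reference words, so the two error profiles get lumped into one number. A minimal, illustrative Python sketch of that computation (a word-level Levenshtein distance with made-up example strings, not the paper's actual scoring tool):

    # Sketch: WER = (substitutions + deletions + insertions) / reference length,
    # computed here as a word-level Levenshtein distance.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("i think that is fine", "think that is fun"))  # 1 deletion + 1 substitution -> 0.4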

Also, while the overall "word error rate" is lower than the humans', it's not clear whether that's because the transcription service they used was aiming for good-enough rather than perfect output. The errors the transcription service makes may also be less damaging than the ASR system's errors, in terms of how well you can recover the original meaning from the transcription.

It's clearly great work, but reaching human parity is marketing fluff.




Conversational speech has been very difficult. The Switchboard task has been around for over 20 years. We used the exact same published data and evaluation protocols. The human parity claim on the Switchboard task is as scientific as we can make it, to the best of our knowledge.


Ah, you finally made a hackernews account! Congrats, XD, that's a great milestone you guys hit.


I was recently at a bar where they showed a movie with incomprehensible subtitles (English to English). I assume this was because they skimped and bought automatic subtitling.

I think one important aspect is that while humans miss words, they often get the sentence meaning correct. When computers miss words, they tend to substitute words that sound similar. That's readable if you have time, but not necessarily as a stream of text going by...


It might also have been a 'bootleg' DVD from China, or a version downloaded from there. I've had quite a few where they had already done a horrible job translating into Chinese for the subtitles, and then did a literal machine translation back to English.

The character on the screen said "Hello", but the English subtitle said "You good", which would be the literal translation of "ni hao".


See also the infamous Star Wars Episode 3 Chinese bootleg, "Do not want".


It was English subtitling on a movie in English, as they often have at bars.


From an ML perspective, there's recently been a lot more work on language modeling (e.g. the billion-word corpus). Good language models can usually fix those kinds of mistakes. So hopefully you can expect to see that change in the future!
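To make that concrete, one common approach is to rescore the recognizer's n-best hypotheses with a language model, so that acoustically plausible but nonsensical word sequences get penalised. A minimal sketch with a toy bigram LM and made-up acoustic scores (not any specific toolkit's API):

    # Sketch: pick the hypothesis that best balances acoustic score
    # against a (toy) bigram language-model score.
    def lm_log_prob(sentence, bigram_logprobs):
        words = ["<s>"] + sentence.split()
        return sum(bigram_logprobs.get((a, b), -10.0)  # crude unseen-bigram penalty
                   for a, b in zip(words, words[1:]))

    def rescore(nbest, bigram_logprobs, lm_weight=0.5):
        # nbest: list of (hypothesis_text, acoustic_log_score) pairs
        return max(nbest, key=lambda h: h[1] + lm_weight * lm_log_prob(h[0], bigram_logprobs))

    toy_lm = {("<s>", "recognise"): -1.0, ("recognise", "speech"): -0.5,
              ("<s>", "wreck"): -3.0, ("wreck", "a"): -2.0,
              ("a", "nice"): -1.0, ("nice", "beach"): -1.5}
    nbest = [("wreck a nice beach", -4.0), ("recognise speech", -4.2)]
    print(rescore(nbest, toy_lm))  # the LM pulls the choice toward "recognise speech"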


Hilarious that the grandparent notes "I" is the most commonly deleted word, and the child comment starts with that very pronoun.


> I was recently at a bar where they showed a movie with incomprehensible subtitles (English to English). I assume this was because they skimped and bought automatic subtitling.

I don't think that is possible. Like, worse subtitling isn't even an option.


"Trainspotting" would be an interesting challenge: <https://www.theguardian.com/books/2008/may/31/irvinewelsh>



I'm not an expert on this at all, but my suspicion is that humans are more tolerant of missing words than we are of mistaken words. Especially with a word like "I", we just assume it if it's missing.

So if this speech recognition is for creating a transcript for humans, it (in my uneducated opinion) isn't as good as what humans do, at least not yet.


If nothing's changed in the past 15 years, anything under 95% accuracy is not accurate enough for automation without human intervention (copy editing).


From the paper, eight different recognizers combined together approached the results of a single person, reviewed once. Not really parity.


Isn't it substantially simpler + cheaper to combine 8 computer programs than to combine >1 people?
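For what it's worth, combining recognizer outputs is often just alignment plus word-level voting (ROVER-style), which is mechanical in a way that combining multiple human transcribers is not. A very simplified sketch that assumes the hypotheses are already aligned word-for-word, a step the real tools handle with a confusion-network alignment:

    from collections import Counter

    # Sketch: majority vote over already-aligned hypotheses.
    def combine(hypotheses):
        positions = zip(*[h.split() for h in hypotheses])
        return " ".join(Counter(p).most_common(1)[0][0] for p in positions)

    outputs = ["i think it works", "i think at works", "we think it works"]
    print(combine(outputs))  # -> "i think it works"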


Looks like they used various neural networks trained on data they already transcribe professionally. So it had to be trained, and it only represents the subset of people who speak and have their talks transcribed at Microsoft events.

Being Microsoft, I'm sure the place is diverse, but there's no mention in the paper of accents or dialects that I can see (I might just be missing it).



