
A consumer-focused, human-parity ASR service will disrupt so many industries, including mine. I run a human-powered transcription service where we transcribe files with high accuracy. I am just waiting for the day when our transcribers can work off an auto-generated transcript instead of typing it all up manually. I'll pay good money for a service where I can just send a file and get an 80-90% accurate transcript with speaker diarization.



We've chatted - just an update that I'm implementing diarization this weekend :)


I hope you realize your business is about to go under. The only reason you can charge people now is because automatic recognition sucks compared to humans.


We do super-human-parity transcripts. Our transcripts are insanely accurate, even for challenging files. I'm sure computers will be able to do that one day, but the Singularity would have already happened by then, wiping out many businesses. I for one look forward to the Singularity and hope that we will contribute to it in some way.


What's super-human parity? And how do you achieve it using humans?


Presumably more accurate than a single human, which you can achieve with multiple humans reaching a consensus. I remember an anecdote in physics class where an experiment required counting events over time. A single person would occasionally blink and miss an event. But if you had two people, and you counted how many people observed each event, you could solve for super-human accuracy using the estimated error rates of each person.

See also this usage in the context of ML:

https://arxiv.org/pdf/1602.05314v1.pdf
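The two-observer anecdote above can be sketched numerically. Under the assumption that each observer independently misses events at some fixed rate, the overlap between their counts lets you solve for both observers' efficiencies and the true event count (the classic coincidence-counting / capture-recapture estimate; the setup and numbers here are illustrative, not from the thread):

```python
import random

random.seed(0)
true_events = 10_000
eff_a, eff_b = 0.95, 0.90  # assumed per-observer detection probabilities

# Simulate each observer independently detecting (or missing) each event.
seen_a = [random.random() < eff_a for _ in range(true_events)]
seen_b = [random.random() < eff_b for _ in range(true_events)]

n_a = sum(seen_a)                                       # counted by A alone
n_b = sum(seen_b)                                       # counted by B alone
n_both = sum(a and b for a, b in zip(seen_a, seen_b))   # counted by both

# Under independence, n_both / true ≈ eff_a * eff_b, so:
est_true = n_a * n_b / n_both   # estimate of the true event count
est_eff_a = n_both / n_b        # estimate of A's efficiency
est_eff_b = n_both / n_a        # estimate of B's efficiency

print(round(est_true), round(est_eff_a, 3), round(est_eff_b, 3))
```

With 10,000 simulated events, the recovered count and efficiencies land close to the true values, which is the "solve for super-human accuracy" step in the anecdote.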


Ensembles are well known to be more accurate. But this is not an advantage exclusive to humans: an ensemble of NNs will do better than any of the individual NNs.

There's no reason one couldn't train 5 or 10 RNNs for transcription and ensemble them. (Indeed, one cute trick from this ICLR was how to get an ensemble of NNs for free, so you don't have to spend 5 or 10x the time training: simply lower the learning rate during training until it stops improving, save the model, then jack the learning rate way up for a while and start lowering it again until it stops improving, save that model, and when finished you have _n_ models you can ensemble.) And computing hardware is cheaper than humans, so it will be cheaper to have 5 or 10 RNNs process an audio file than to have 2 or 3 humans independently check it, so the ensembling advantage is actually bigger for the NNs in this scenario.
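The "free ensemble" trick described above can be sketched in a few lines: anneal the learning rate within each cycle, save a snapshot at the bottom of each cycle, then average the snapshots' predictions. The training loop here is a toy 1-D least-squares fit, and the schedule and names are illustrative, not any particular paper's recipe:

```python
import math

def cyclic_lr(step, steps_per_cycle, lr_max=0.05):
    """Cosine-annealed learning rate that restarts at the top of every cycle."""
    t = (step % steps_per_cycle) / steps_per_cycle
    return lr_max * 0.5 * (1 + math.cos(math.pi * t))

# Toy problem: fit w in y = w * x by SGD; each cycle's final w is a "snapshot".
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w, snapshots = 0.0, []
steps_per_cycle = 50
for step in range(3 * steps_per_cycle):      # 3 cycles -> 3 snapshot models
    lr = cyclic_lr(step, steps_per_cycle)
    x, y = data[step % len(data)]
    grad = 2 * (w * x - y) * x               # d/dw of (w*x - y)^2
    w -= lr * grad
    if (step + 1) % steps_per_cycle == 0:    # end of a cycle: LR is near zero
        snapshots.append(w)                  # save the converged model

def predict(x):
    """'Ensemble' prediction: average the snapshots' outputs."""
    return sum(s * x for s in snapshots) / len(snapshots)

print(snapshots, predict(2.0))
```

For real networks the snapshots differ more (the restarts kick the weights into different basins), which is what makes averaging them pay off; in this toy fit all snapshots converge near w = 3.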

Humans still have the advantage of more semantic understanding, but RNNs can be trained on much larger corpora and read all related transcripts, so even there the human advantage is not guaranteed.


Yeah, but you don't want to run an ensemble of 10 RNNs on your phone, or in the cloud for that matter, when you have billions of queries. It's too expensive.

In practice the ensemble is compressed into a single compact network: a new network is trained to copy the outputs of the ensemble, exploiting "dark knowledge".

Recurrent Neural Network Training with Dark Knowledge Transfer - https://arxiv.org/abs/1505.04630v5


I'm rooting against you, pal. Cheers.




