
A consumer-focused, human-parity ASR service will disrupt so many industries, including mine. I run a human-powered transcription service where we transcribe files with high accuracy. I am just waiting for the day when our transcribers can work off an auto-generated transcript instead of typing it all up manually. I'll pay good money for a service where I can just send a file and get an 80-90% accurate transcript with speaker diarization.



We've chatted - just an update that I'm implementing diarization this weekend :)


I hope you realize your business is about to go under. The only reason you can charge people now is because automatic recognition sucks compared to humans.


We do super-human-parity transcripts. Our transcripts are insanely accurate, even for challenging files. I'm sure computers will be able to do that one day, but the Singularity would have already happened by then, wiping out many businesses. I for one look forward to the Singularity and hope that we will contribute to it in some way.


What's super-human parity? And how do you achieve it using humans?


Presumably more accurate than a single human, which you can achieve with multiple humans reaching a consensus. I remember an anecdote in physics class where an experiment required counting events over time. A single person would occasionally blink and miss an event. But if you had two people, and you counted how many people observed each event, you could solve for super-human accuracy using the estimated error rates of each person.

See also this usage in the context of ML:

https://arxiv.org/pdf/1602.05314v1.pdf
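The two-observer anecdote above can be sketched numerically. Under the assumption that each observer independently misses events at some fixed rate, the overlap between their counts lets you solve for both observers' efficiencies and the true event count (the classic coincidence-counting / capture-recapture estimate; the setup and numbers here are illustrative, not from the thread):

```python
import random

random.seed(0)
true_events = 10_000
eff_a, eff_b = 0.95, 0.90  # assumed per-observer detection probabilities

# Simulate each observer independently detecting (or missing) each event.
seen_a = [random.random() < eff_a for _ in range(true_events)]
seen_b = [random.random() < eff_b for _ in range(true_events)]

n_a = sum(seen_a)                                       # counted by A alone
n_b = sum(seen_b)                                       # counted by B alone
n_both = sum(a and b for a, b in zip(seen_a, seen_b))   # counted by both

# Under independence, n_both / true ≈ eff_a * eff_b, so:
est_true = n_a * n_b / n_both   # estimate of the true event count
est_eff_a = n_both / n_b        # estimate of A's efficiency
est_eff_b = n_both / n_a        # estimate of B's efficiency

print(round(est_true), round(est_eff_a, 3), round(est_eff_b, 3))
```

With 10,000 simulated events, the recovered count and efficiencies land close to the true values, which is the "solve for super-human accuracy" step in the anecdote.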


Ensembles are well known to be more accurate. But this is not an advantage exclusive to humans: an ensemble of NNs will do better than any of the individual NNs.

There's no reason one couldn't train 5 or 10 RNNs for transcription and ensemble them. (Indeed, one cute trick from this ICLR was how to get an ensemble of NNs for free, so you don't have to spend 5 or 10x the time training: simply lower the learning rate during training until it stops improving, save the model, then jack the learning rate way up for a while and start lowering it again until it stops improving, save that model, and when finished you have _n_ models you can ensemble.) And computing hardware is cheaper than humans, so it will be cheaper to have 5 or 10 RNNs process an audio file than to have 2 or 3 humans independently check it, so the ensembling advantage is actually bigger for the NNs in this scenario.
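The "free ensemble" trick described above can be sketched in a few lines: anneal the learning rate within each cycle, save a snapshot at the bottom of each cycle, then average the snapshots' predictions. The training loop here is a toy 1-D least-squares fit, and the schedule and names are illustrative, not any particular paper's recipe:

```python
import math

def cyclic_lr(step, steps_per_cycle, lr_max=0.05):
    """Cosine-annealed learning rate that restarts at the top of every cycle."""
    t = (step % steps_per_cycle) / steps_per_cycle
    return lr_max * 0.5 * (1 + math.cos(math.pi * t))

# Toy problem: fit w in y = w * x by SGD; each cycle's final w is a "snapshot".
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w, snapshots = 0.0, []
steps_per_cycle = 50
for step in range(3 * steps_per_cycle):      # 3 cycles -> 3 snapshot models
    lr = cyclic_lr(step, steps_per_cycle)
    x, y = data[step % len(data)]
    grad = 2 * (w * x - y) * x               # d/dw of (w*x - y)^2
    w -= lr * grad
    if (step + 1) % steps_per_cycle == 0:    # end of a cycle: LR is near zero
        snapshots.append(w)                  # save the converged model

def predict(x):
    """'Ensemble' prediction: average the snapshots' outputs."""
    return sum(s * x for s in snapshots) / len(snapshots)

print(snapshots, predict(2.0))
```

For real networks the snapshots differ more (the restarts kick the weights into different basins), which is what makes averaging them pay off; in this toy fit all snapshots converge near w = 3.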

Humans still have the advantage of more semantic understanding, but RNNs can be trained on much larger corpora and read all related transcripts, so even there the human advantage is not guaranteed.


Yeah, but you don't want to run an ensemble of 10 RNNs on your phone, or in the cloud for that matter, when you have billions of queries. It's too expensive.

In practice the ensemble is compressed into a single compact network: a new network is trained to copy the outputs of the ensemble, exploiting "dark knowledge".

Recurrent Neural Network Training with Dark Knowledge Transfer - https://arxiv.org/abs/1505.04630v5


I'm rooting against you, pal. Cheers.




