
Yeah, but it'd be like getting a blood test and trying to interpret it yourself :).

There are so, so many variables - and many engines could be optimized for 'your voice and the things you're going to say right now, given CPU/memory, quality of microphone, background noise, etc.'

The game of NLP is inherently about dealing with 'noisy channels' (in the academic sense) in which there is kind of a probabilistic guarantee of imperfection. So then it comes down to creating the best products in a given context, which is almost always 'less than optimized' for any individual.
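To make the 'noisy channel' framing concrete, here's a minimal sketch of the textbook decoding rule: pick the word that best explains the audio, weighted by how likely that word is in the first place. Every word and number below is invented purely for illustration - real engines score lattices of phones and word sequences, not single isolated words.

    # Minimal noisy-channel decoder: choose argmax_w P(audio | w) * P(w).
    # All scores are made-up toy numbers, not output from any real engine.

    acoustic = {         # P(audio | word): how well each candidate explains the signal
        "there": 0.40,
        "their": 0.38,
        "they're": 0.20,
    }

    prior = {            # P(word): language-model estimate of how likely the word is
        "there": 0.050,
        "their": 0.030,
        "they're": 0.012,
    }

    def decode(candidates):
        """Return the candidate with the highest acoustic * prior score."""
        return max(candidates, key=lambda w: acoustic[w] * prior[w])

    print(decode(["there", "their", "they're"]))   # -> "there"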

So there's model size, CPU/RAM, quality of signal (microphone, network), just to start.

Optimizing for standard English probably means reducing quality for people with accents. Maybe in a specific context you could go from 80% accuracy to 85% accuracy for 'most of us' - but then you go from 60% to 40% for anyone with an accent.

And if you reduce the accepted vocabulary, you can get way better results. Of course, we all might want to say words like 'obvolute' and 'abalienate' every so often.

Kind of thing.
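
As a made-up sketch of that trade-off (the word lists and scores below are invented): shrinking the lexicon removes confusable rare words and helps in the common case, but it also means a user who really does say 'obvolute' can never be recognized.

    # Toy illustration of vocabulary restriction; score[w] stands in for
    # P(audio | w) * P(w) from a decoder. All numbers are invented.

    def decode(score, vocab):
        return max(vocab, key=lambda w: score[w])

    full_vocab       = ["absolute", "obvolute"]
    restricted_vocab = ["absolute"]          # rare word removed from the lexicon

    # Case 1: user said "absolute", but noise nudges the rare word slightly ahead.
    noisy = {"absolute": 0.018, "obvolute": 0.020}
    print(decode(noisy, full_vocab))         # -> "obvolute"  (error)
    print(decode(noisy, restricted_vocab))   # -> "absolute"  (restriction helps)

    # Case 2: user really said "obvolute".
    clear = {"absolute": 0.005, "obvolute": 0.030}
    print(decode(clear, full_vocab))         # -> "obvolute"  (correct)
    print(decode(clear, restricted_vocab))   # -> "absolute"  (forced error)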

It's really fun from an R&D perspective, but it's a product manager's nightmare. Consumer expectations with these technologies are really challenging: despite the inherent ambiguity in the system, people kind of want perfection. And there are always corner cases where it seems like things should be easy, but they're not - because the word you're saying is common, and you're saying it 'perfectly clearly' ... but little do you know there are 2 or 3 other very rare words that sound 'just like that', ergo ... problems.
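
A made-up illustration of that corner case (all words and probabilities below are invented): once a couple of other words sound exactly like the one you said, the audio can't tell them apart no matter how clearly you speak, so the language model's prior decides the winner - and it won't always decide in your favour.

    # Toy posterior P(w | audio) proportional to P(audio | w) * P(w), with invented numbers.

    def posterior(acoustic, prior):
        joint = {w: acoustic[w] * prior[w] for w in acoustic}
        total = sum(joint.values())
        return {w: round(joint[w] / total, 3) for w in joint}

    # You dictate "caret", clearly - but "carat" and "karat" sound identical.
    acoustic = {"caret": 0.33, "carat": 0.33, "karat": 0.33}   # audio can't separate them
    prior    = {"caret": 0.002, "carat": 0.004, "karat": 0.001}

    post = posterior(acoustic, prior)
    print(post)                      # -> {'caret': 0.286, 'carat': 0.571, 'karat': 0.143}
    print(max(post, key=post.get))   # -> "carat": the prior, not your pronunciation, decided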

From a product perspective, it basically always feels 'broken' which is such a terrible feeling :).

But it can be fun if you like really hard product challenges which have less to do with tech and more to do with pure user experience, expectations, behaviours etc.




I think this is a great summary of where we're at now, but stronger broad-coverage language models (i.e. better expectations for what people say, better generative models of speakers) look feasible (for a couple of tens to hundreds of millions in R&D) and could bring ASR up to parity with people. It's pretty clear we are getting close to the limits of what acoustics can offer, and the language model is the next frontier, both in terms of accuracy and real-time performance.


Agreed. We are getting close to a nice reality as all of the pieces are getting better.


For speakers of non-standard English, a quick test might be a useful sanity check. Many speech-to-text algorithms fail catastrophically with certain accents.



