
Are there any speech-to-text systems out there that could do this reliably, say with 80% accuracy?



Look at any con video on YouTube and turn on the "English Automatic Captions" - they're generated by some pretty good voice recognition software. But as you'll see, the results are a long way from perfect.

My favourite example of misrecognition is one of Travis Goodspeed's talks, where YouTube's voice recognition produced "Geek women are expensive, but not prohibitively so."

Voice recognition is OK but it falls a long way short of a usable level of accuracy, and even the accuracy it can muster goes right down the toilet if there is any background noise or music, or if the speaker is in any way unclear (accents, rushed or slurred speech, etc). There's a long way to go before you can just get a usable transcript of speech automatically.

Quite a lot of voice recognition engines seem to have been trained on thousands of hours of C-SPAN or Meet The Press or something, because when recognition conditions get challenging, some engines start to degenerate into outputting nonsense like "congress Muslims Kenya capitol great today Cheney".

There is no substitute for a human pair of ears and a lightning-fast means of text entry like a steno keyboard - nice to see Plover getting a mention in the source article too.


This is far beyond what a computer could do. For instance, see this transcript:

https://github.com/hausdorff/bangbangcon.github.io/blob/gh-p...

For example, she'll separate out the individual speakers:

>> Can you move the mic closer to your mouth?
>> Yes. Is this better? Is this better? Okay. I will talk like this, then.
>> You can move the mic.
>> Like...
>> Take it off the stand and hold it up to your face.

She can also figure out when something is an acronym (like LARP), make sure everything is capitalized correctly (Python, Ruby), separate out what's being said into paragraphs when the speaker starts talking about something new, and a ton of other things.


From the FAQ on Mirabai Knight's website: http://stenoknight.com/FAQ.html#cartspeech

"Automatic speech recognition is not currently a substitute for human transcription, because computers are unable to use context or meaning to distinguish between similar words and phrases and are not able to recognize or correct errors, leading to faulty output. The best automatic speech recognition boasts that it's 80% to 90% accurate, but that means that, at best, one out of ten words will be wrong or missing, which results in a semantic accuracy rate that's often far lower than 90%, depending on which word it is."

(this is a subset of the answer to "Will speech recognition make CART obsolete?")
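
To put numbers like "80% to 90% accurate" in context, here's a minimal sketch of how word error rate (the metric usually behind such accuracy figures) is computed; the example sentences are made up:

    # Word error rate: word-level edit distance divided by the number
    # of words in the reference transcript. 90% accuracy ~ WER of 0.10,
    # i.e. roughly one word in ten substituted, dropped, or inserted.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # Levenshtein distance over words, by dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the quick brown fox jumps over the lazy dog",
              "the quick brown box jumps over lazy dog"))  # ~0.22

Note that WER weights every word equally: dropping "not" costs the same as dropping "the", which is why the FAQ's point about semantic accuracy being even lower matters.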


The thing in that category that I have found useful is a speech recognition system trained on the speaker, using the same headset it was trained on. (This was with IBM ViaVoice almost 10 years ago with maybe 30 minutes of training; Dragon is said to be more accurate.) It was substantially worse than human, but good enough to usually get the meaning across, which YouTube's auto-captions mostly don't.


I feel like all college lectures would become immediately more valuable if this could be done accurately. There were so many lectures where I missed a detail, lost track of what the professor was explaining, and then zoned out for the rest of the class.


For simple English (and other major languages), I would assume so. But with all the jargon, nonstandard language, and acronyms that are used at tech conferences (and, to a lesser extent, in other fields)... I would expect significant decreases in accuracy.


80% accuracy is unacceptable for a deaf or hearing-impaired person. It really needs to be 95-98% or so for a good understanding of the whole thing.


If you actually repeat the speech into the microphone, you can go quite a bit higher.


This is used quite a lot in the UK for captioning live news broadcasts - it's called respeaking, and it relies on a trained speaker repeating the speech in a flat monotone so that the voice recognition software can make it out more easily.

It works to some degree, and has the advantage for the subtitling companies that respeakers are easier to train and don't need to be paid as much as a proper stenographer. The disadvantage is that the output is much slower and subject to a rather greater delay. There is still nothing that beats steno - but respeaking is cheaper and people don't complain enough about the inaccuracies.


In Italy it was the other way around - steno was cheaper for some reason, but there were issues with foreign words / names / etc. I used to do this stuff for a while and I saw the most incredible things go on air on both sides...


I'd like to explore whether there are any automated solutions.

What software/SDKs have you used?
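
For anyone who wants to experiment, here's a minimal sketch using the open-source SpeechRecognition Python package with the offline CMU Sphinx engine; "lecture.wav" is a made-up file name, and results on real conference audio will be rough for all the reasons discussed above:

    # pip install SpeechRecognition pocketsphinx
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("lecture.wav") as source:  # hypothetical recording
        audio = recognizer.record(source)        # read the whole file

    try:
        # Local CMU Sphinx engine; cloud engines tend to be more accurate
        # but still degrade on jargon, accents, and background noise.
        print(recognizer.recognize_sphinx(audio))
    except sr.UnknownValueError:
        print("engine could not make out the speech")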


Even the best automated solutions rely on the speaker talking clearly and precisely, with every word well separated and distinctly pronounced.

There are no automated solutions that can universally do a good job of transcribing natural speech from people who aren't specifically "speaking to be recognised", if that makes sense.

Maybe one day, but not yet. It's a problem waiting to be solved, so the reward for the first who can really crack it will be substantial.


Nuance's SDK, but unfortunately "automated" is out of the question with that...



