
Hey, I'm the blog post author (Deepgram cofounder too)! Sorry if the accuracy wasn't clear enough. The numbers quoted are for search accuracy, not for speech-to-text accuracy. It's a subtle thing but makes a lot of difference (and it is the thing that matters most when you are in trying-to-find-something mode).

If you run phone calls or tape recorder audio through a speech-to-text engine, the word accuracy rate is something like 10-50% (i.e. abysmal). When you try to search for keyphrases like "frolicking kitten", the likelihood of a text match with STT is ~20%.
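One rough way to read these figures, assuming each word is transcribed independently with the same accuracy (a simplification that is not claimed in the post): an exact text match on an n-word phrase needs every word to be right, so its probability is roughly the per-word accuracy raised to the n-th power. The function name below is hypothetical.

```python
def phrase_match_probability(word_accuracy: float, n_words: int) -> float:
    """Chance that all n words of a phrase are transcribed correctly,
    assuming independent per-word errors (a simplifying assumption)."""
    return word_accuracy ** n_words


if __name__ == "__main__":
    # Sweep the 10-50% word accuracy range for a two-word keyphrase.
    for acc in (0.10, 0.30, 0.45, 0.50):
        p = phrase_match_probability(acc, 2)
        print(f"{acc:.0%} word accuracy -> {p:.1%} chance of a two-word match")
```

Under that assumption, a per-word accuracy of about 45% lands near the ~20% match rate for a two-word phrase, and longer phrases fall off quickly.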

If you run that same search with Deepgram, you'd find what you are looking for about 80% of the time, because Deepgram doesn't have to guess at what is being said. It takes the inverse approach and matches 'how it sounds' using deep learning voodoo magic™.



Is it intentional that your website has no pricing information or specific details about what one would get after signing up for an API key? Some blog posts reference pricing, but I have no idea whether those are current or what the APIs look like. Not very inviting to play with stuff :/


I've been using Deepgram for building https://www.findlectures.com - I don't know about the marketing site but there is pricing within the app.


If you sign up it's 5 hours for free, then $5 for 6.7 hours. I figured it out just now.


So their old blogposts are inaccurate...


There are a number of papers tackling a similar task [0][1][2] for anyone who is interested. There isn't enough information to tell exactly what is going on with Deepgram, but one way to approach this would be to construct a shared embedding space for words/phrases and speech. These types of embedding spaces are powerful [3][4][5], but not magic.
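To make the shared-embedding idea concrete, here is a minimal sketch. It is not Deepgram's method: the 3-d vectors are made up and stand in for the outputs of a hypothetical text encoder and audio encoder trained to map into the same space. Search then reduces to ranking audio segments by cosine similarity to the query vector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_segments(query_vec, segments):
    """Return segment names ordered from most to least similar to the query."""
    return sorted(
        segments,
        key=lambda name: cosine(query_vec, segments[name]),
        reverse=True,
    )


# Made-up embedding of the text query "frolicking kitten".
query_vec = [0.9, 0.1, 0.2]

# Made-up embeddings of three audio segments in the same space.
audio_segments = {
    "segment_a": [0.8, 0.2, 0.1],
    "segment_b": [0.1, 0.9, 0.3],
    "segment_c": [0.2, 0.1, 0.9],
}

ranking = rank_segments(query_vec, audio_segments)
print(ranking)
```

The hard part in practice is training the two encoders so that a phrase and its spoken form land close together; the lookup itself is this simple.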

Cool demo, looking forward to seeing more detail about what is going on. However, I would quibble with the STT WER quoted above. Maybe in noisy environments with unknown speakers (and no voice normalization) this is accurate, but the kind of clean speech in the demo is handled really well by modern recognition engines (on benchmark data, to be fair; cf. MSR at 6.3% and IBM at ~6.9%).

Most word searches over speech-to-text output use soft matches (or ideally beam search over the most likely partial phoneme/word-part matches) rather than hard matches, so it seems like a bit of a straw man comparison in this case.
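As a rough illustration of the hard vs. soft distinction, here is a sketch using character-level fuzzy matching from Python's standard `difflib`. This is a stand-in, not the phoneme-level beam search mentioned above, and the transcript with its misrecognized word is invented for the example.

```python
from difflib import SequenceMatcher


def hard_match(query: str, transcript: str) -> bool:
    """Exact substring search over the transcript."""
    return query.lower() in transcript.lower()


def soft_match(query: str, transcript: str) -> float:
    """Best similarity (0..1) between the query and any window of the
    transcript containing the same number of words as the query."""
    q_words = query.lower().split()
    t_words = transcript.lower().split()
    n = len(q_words)
    q = " ".join(q_words)
    best = 0.0
    for i in range(max(1, len(t_words) - n + 1)):
        window = " ".join(t_words[i:i + n])
        best = max(best, SequenceMatcher(None, q, window).ratio())
    return best


# Invented transcript in which the recognizer dropped one letter.
transcript = "we saw a frolicing kitten in the yard"
query = "frolicking kitten"

print(hard_match(query, transcript))   # exact search misses the phrase
print(soft_match(query, transcript))   # fuzzy score stays high
```

With a threshold on the soft score, the near-miss still counts as a hit, which is why comparing against exact text matching understates what a tuned STT search can do.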

[0] http://research.google.com/pubs/pub42543.html

[1] https://arxiv.org/abs/1510.01032

[2] https://sigport.org/sites/default/files/gloveNNLM_kaudhkhasi...

[3] https://arxiv.org/abs/1502.03044

[4] http://www-personal.umich.edu/~reedscot/files/icml2016.pdf

[5] https://arxiv.org/abs/1411.2539



