Hey all - I made this and am wondering if anyone here has any experience with pocketsphinx and could lend a hand in making the transcriptions more accurate. Let me know! (Or just make a pull request.)
Well, you already have pocketsphinx set up, so that is a start (I recently did this here: https://github.com/kastnerkyle/ez-phones but it was a little annoying to script). There are a few ways to train/extend pocketsphinx, but ultimately good ASR in arbitrary environments is a research problem!
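One cheap trick that often helps more than retraining: if your app only needs to catch a limited set of phrases, use pocketsphinx's keyword-spotting mode instead of full decoding. It takes a plain-text kws file with one `phrase /threshold/` per line. A minimal sketch that generates such a file (the phrases and thresholds below are made-up examples; you tune thresholds by trial and error on your own audio):

```python
# Sketch: generate a pocketsphinx keyword-spotting (kws) file.
# Format is one "phrase /threshold/" per line; smaller thresholds
# mean the phrase must match more confidently to fire.
# Phrases/thresholds here are placeholders -- use your own.
keywords = {
    "open door": 1e-20,
    "turn on the light": 1e-30,
}

def write_kws(pairs, path="keywords.kws"):
    # Build each line as: phrase, then the threshold wrapped in slashes.
    lines = ["%s /%g/" % (phrase, thresh) for phrase, thresh in pairs.items()]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

print(write_kws(keywords))
```

Then point the decoder at it with `pocketsphinx_continuous -kws keywords.kws ...` (or the `kws` option in the Python bindings). Won't give you general transcription, but for command-style input it's way more robust than open-vocabulary decoding.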
I second the opinion that Kaldi is more advanced, but it is also way, way more complicated to do anything with a custom dataset. There are a few examples of decoding with existing models though, so maybe that is a start. These lectures may help: http://www.danielpovey.com/kaldi-lectures.html
Can't help w/ pocketsphinx, but I do work w/ transcription sync, where accuracy depends on the source. Google/YouTube ASR is above 90%, at least where well-recorded speakers talk at an even pace with minimal accent. Otherwise, where vocals are hard to hear, ASR isn't good enough. Human-corrected transcripts cost about $1/minute today, and will be 10x more affordable soon.