Hey all - I made this and am wondering if anyone here has any experience with pocketsphinx and could lend a hand in making the transcriptions more accurate. Let me know! (Or just make a pull request.)
Well, you already have pocketsphinx set up, so that is a start (I recently did this here: https://github.com/kastnerkyle/ez-phones but it was a little annoying to script). There are a few ways to train/extend pocketsphinx, but ultimately good ASR in arbitrary environments is a research problem!
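One cheap trick that often helps more than retraining: if your app only needs to catch a limited set of phrases, use pocketsphinx's keyword-spotting mode instead of full decoding. It takes a plain-text kws file with one `phrase /threshold/` per line. A minimal sketch that generates such a file (the phrases and thresholds below are made-up examples; you tune thresholds by trial and error on your own audio):

```python
# Sketch: generate a pocketsphinx keyword-spotting (kws) file.
# Format is one "phrase /threshold/" per line; smaller thresholds
# mean the phrase must match more confidently to fire.
# Phrases/thresholds here are placeholders -- use your own.
keywords = {
    "open door": 1e-20,
    "turn on the light": 1e-30,
}

def write_kws(pairs, path="keywords.kws"):
    # Build each line as: phrase, then the threshold wrapped in slashes.
    lines = ["%s /%g/" % (phrase, thresh) for phrase, thresh in pairs.items()]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

print(write_kws(keywords))
```

Then point the decoder at it with `pocketsphinx_continuous -kws keywords.kws ...` (or the `kws` option in the Python bindings). Won't give you general transcription, but for command-style input it's way more robust than open-vocabulary decoding.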
I second the opinion that Kaldi is more advanced, but it is also way, way more complicated to do anything with a custom dataset. There are a few examples of decoding with existing models though, so maybe that is a start. These lectures may help: http://www.danielpovey.com/kaldi-lectures.html
Can't help w/ pocketsphinx, but I do work w/ transcription sync, where accuracy depends on the source. Google/YouTube ASR is above 90%, at least where well-recorded speakers talk at an even pace with minimal accent. Otherwise, where vocals are hard to hear, ASR isn't good enough. Human-corrected transcripts cost about $1/minute today, and will be 10x more affordable soon.