Hey all - I made this and am wondering if anyone here has any experience with pocketsphinx and could lend a hand in making the transcriptions more accurate. Let me know! (Or just make a pull request.)
Well, you've already got pocketsphinx set up, so that's a start (I recently did this here: https://github.com/kastnerkyle/ez-phones but it was a little annoying to script). There are a few ways to train/extend pocketsphinx, but ultimately good ASR in arbitrary environments is a research problem!
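For reference, the basic decoding loop with the pocketsphinx-python bindings looks roughly like this - the model/dictionary paths and the wav filename are placeholders for wherever yours live, so treat it as a sketch rather than drop-in code:

    import wave
    from pocketsphinx.pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', 'model/en-us')                # acoustic model directory (placeholder path)
    config.set_string('-lm', 'model/en-us.lm.bin')          # language model (swap in a domain-specific one)
    config.set_string('-dict', 'model/cmudict-en-us.dict')  # pronunciation dictionary
    decoder = Decoder(config)

    wav = wave.open('utterance.wav', 'rb')  # expects 16 kHz, 16-bit, mono PCM
    decoder.start_utt()
    while True:
        buf = wav.readframes(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
    decoder.end_utt()

    print(decoder.hyp().hypstr if decoder.hyp() else '(no hypothesis)')

Swapping the stock language model for one built on your own domain text (and adding missing words to the dictionary) is usually the cheapest accuracy win before you get into acoustic model adaptation.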
I second the opinion that Kaldi is more advanced, but it is also way, way more complicated to do anything with a custom dataset. There are a few examples of decoding with existing models though, so maybe that is a start. These lectures may help: http://www.danielpovey.com/kaldi-lectures.html
Can't help w/ pocketsphinx, but I do work w/ transcription sync, where accuracy depends on the source. Google/YouTube ASR is above 90%, at least where well-recorded speakers talk at an even pace with minimal accent. Otherwise, where vocals are hard to hear, ASR isn't good enough. Human-corrected transcripts cost $1/minute today, and will be 10x more affordable soon.
Nice! I started making a Python radio that used pydub and transcribed text (just like this) from different sources a few years ago, but abandoned it prematurely. I'll look at your code - hopefully it will kickstart the project again! Thanks!
https://bitbucket.org/jaideepsingh/isodi/
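FWIW, the splitting step I had in mind was roughly this - the filenames and thresholds are made up, and pydub needs ffmpeg on the path for non-wav input:

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    audio = AudioSegment.from_file('show.mp3')
    # cut wherever there's at least 0.5 s of audio quieter than -40 dBFS
    chunks = split_on_silence(audio, min_silence_len=500, silence_thresh=-40)

    for i, chunk in enumerate(chunks):
        chunk.export('chunk_%03d.wav' % i, format='wav')  # one utterance per file

Each chunk then goes to the recognizer on its own, which tends to behave better than decoding one long file.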
It would be handy as a starting point for ADR (automated dialog replacement). In films you don't always get what you want in production sound - maybe a plane was flying overhead as the sun was setting on the last day that your Famous Actress was available, so you make the most of the visual opportunities and accept the inadequate sound. Then you bring the actors back to a recording studio later and have them re-read their lines. On large films, a lot of the dialog you hear in the final version is recorded this way - as much as 50% in action movies, because you have all this noisy equipment going on around the set, and getting good-quality sound recordings always has a lower priority. On indie films there's more location shooting and smaller post-production budgets, so you aim to minimize ADR requirements, to 0% if at all possible.
Actors hate doing ADR and it's time-consuming and annoying for editors. This wouldn't automatically solve the problem because you wouldn't have a good match between dialog recorded in different acoustic environments, but it does have the potential to save a lot of grunt work, especially for background dialog where you can compromise on quality a bit.
Also, in post-production you often find yourself wanting to edit just one or two words in a scene and you'd rather not bring the actor back for such a small problem, so you look for other scenes and other takes of the same scene where the same word or syllable appears, and do a little cut-and-paste and blending, the audio equivalent of photoshop retouching. It would be very useful for that.
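To make that concrete: pocketsphinx exposes word-level timings via decoder.seg() after a decode (at its default 100 frames per second), so a hypothetical sketch of pulling one word out of another take and splicing it in with pydub might look like this - the function names and crossfade length are just illustrative:

    from pydub import AudioSegment

    FRAME_MS = 10  # pocketsphinx default frame rate is 100 frames per second

    def extract_word(wav_path, decoder, wanted):
        # return audio for the first occurrence of `wanted` in a decoded take, or None
        take = AudioSegment.from_wav(wav_path)
        for seg in decoder.seg():  # word segmentation from a finished decode
            if seg.word == wanted:
                return take[seg.start_frame * FRAME_MS:seg.end_frame * FRAME_MS]
        return None

    def patch_word(scene, snippet, cut_start_ms, cut_end_ms, fade_ms=20):
        # replace scene[cut_start_ms:cut_end_ms] with `snippet`, crossfading both joins
        return (scene[:cut_start_ms]
                .append(snippet, crossfade=fade_ms)
                .append(scene[cut_end_ms:], crossfade=fade_ms))

You'd still do the fine blending by ear, but it saves the hunting-through-takes part.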