Hey all - I made this and am wondering if anyone here has any experience with pocketsphinx and could lend a hand in making the transcriptions more accurate. Let me know! (Or just make a pull request.)
Well, you've already got pocketsphinx set up, so that's a start (I recently did this here: https://github.com/kastnerkyle/ez-phones but it was a little annoying to script). There are a few ways to train/extend pocketsphinx, but ultimately good ASR in arbitrary environments is a research problem!
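For reference, the basic decoding loop with the pocketsphinx-python bindings looks roughly like this - the model/dictionary paths and the wav filename are placeholders for wherever yours live, so treat it as a sketch rather than drop-in code:

    import wave
    from pocketsphinx.pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', 'model/en-us')                # acoustic model directory (placeholder path)
    config.set_string('-lm', 'model/en-us.lm.bin')          # language model (swap in a domain-specific one)
    config.set_string('-dict', 'model/cmudict-en-us.dict')  # pronunciation dictionary
    decoder = Decoder(config)

    wav = wave.open('utterance.wav', 'rb')  # expects 16 kHz, 16-bit, mono PCM
    decoder.start_utt()
    while True:
        buf = wav.readframes(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
    decoder.end_utt()

    print(decoder.hyp().hypstr if decoder.hyp() else '(no hypothesis)')

Swapping the stock language model for one built on your own domain text (and adding missing words to the dictionary) is usually the cheapest accuracy win before you get into acoustic model adaptation.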
I second the opinion that Kaldi is more advanced, but it is also way, way more complicated to do anything with a custom dataset. There are a few examples of decoding with existing models though, so maybe that is a start. These lectures may help: http://www.danielpovey.com/kaldi-lectures.html
Can't help w/ pocketsphinx, but I do work w/ transcription sync, where accuracy depends on the source. Google/YouTube ASR is above 90%, at least where well-recorded speakers talk at an even pace with minimal accent. Otherwise, where vocals are hard to hear, ASR isn't good enough. Human-corrected transcripts cost $1/minute today, and will be 10x more affordable soon.
Nice! I started making a Python radio that used pydub and transcribed text (just like this) from different sources a few years ago, but abandoned it prematurely. I'll look at your code - hopefully it will kickstart the project again! Thanks!
https://bitbucket.org/jaideepsingh/isodi/
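FWIW, the splitting step I had in mind was roughly this - the filenames and thresholds are made up, and pydub needs ffmpeg on the path for non-wav input:

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    audio = AudioSegment.from_file('show.mp3')
    # cut wherever there's at least 0.5 s of audio quieter than -40 dBFS
    chunks = split_on_silence(audio, min_silence_len=500, silence_thresh=-40)

    for i, chunk in enumerate(chunks):
        chunk.export('chunk_%03d.wav' % i, format='wav')  # one utterance per file

Each chunk then goes to the recognizer on its own, which tends to behave better than decoding one long file.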
It would be handy as a starting point for ADR (automated dialog replacement). In films you don't always get what you want in production sound - maybe a plane was flying overhead as the sun was setting on the last day that your Famous Actress was available, so you make the most of the visual opportunities and accept the inadequate sound. Then you bring the actors back to a recording studio later and have them re-read their lines. On large films, a lot of the dialog you hear in the final version is recorded this way - as much as 50% in action movies, because you have all this noisy equipment going on around the set, and getting good-quality sound recordings always has a lower priority. On indie films there's more location shooting and smaller post-production budgets, so you aim to minimize ADR requirements, to 0% if at all possible.
Actors hate doing ADR and it's time-consuming and annoying for editors. This wouldn't automatically solve the problem because you wouldn't have a good match between dialog recorded in different acoustic environments, but it does have the potential to save a lot of grunt work, especially for background dialog where you can compromise on quality a bit.
Also, in post-production you often find yourself wanting to edit just one or two words in a scene and you'd rather not bring the actor back for such a small problem, so you look for other scenes and other takes of the same scene where the same word or syllable appears, and do a little cut-and-paste and blending, the audio equivalent of photoshop retouching. It would be very useful for that.
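To make that concrete: pocketsphinx exposes word-level timings via decoder.seg() after a decode (at its default 100 frames per second), so a hypothetical sketch of pulling one word out of another take and splicing it in with pydub might look like this - the function names and crossfade length are just illustrative:

    from pydub import AudioSegment

    FRAME_MS = 10  # pocketsphinx default frame rate is 100 frames per second

    def extract_word(wav_path, decoder, wanted):
        # return audio for the first occurrence of `wanted` in a decoded take, or None
        take = AudioSegment.from_wav(wav_path)
        for seg in decoder.seg():  # word segmentation from a finished decode
            if seg.word == wanted:
                return take[seg.start_frame * FRAME_MS:seg.end_frame * FRAME_MS]
        return None

    def patch_word(scene, snippet, cut_start_ms, cut_end_ms, fade_ms=20):
        # replace scene[cut_start_ms:cut_end_ms] with `snippet`, crossfading both joins
        return (scene[:cut_start_ms]
                .append(snippet, crossfade=fade_ms)
                .append(scene[cut_end_ms:], crossfade=fade_ms))

You'd still do the fine blending by ear, but it saves the hunting-through-takes part.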