
This code is also useful <https://github.com/lowerquality/gentle>


Yes, there are several other open source aligners out there, mostly from academic research or derived from academic projects. On my personal GitHub page I have a repo with an annotated list of forced aligners. (If I add a link to it, the spam detector triggers?! Anyway, google "github forced-alignment-tools" to find it.)

Gentle, which is based on Kaldi, performs well and has a handy setup script.

However, these aligners, which are based on automatic speech recognition (ASR) techniques, have pre-trained models only for English and maybe a handful of other "popular" languages. Some allow you to train your own language model, but very few users have the actual competence/resources to do that.

aeneas is built using an older approach, which has the advantage of requiring weaker language models, which are already available (in the form of TTS voices): this is the reason why it "supports" so many languages. Of course the disadvantage is that aeneas works decently well at (sub)sentence granularity, but worse than ASR-based aligners at word granularity or with noisier audio files.


Do you know of any existing forced alignment tools that work well with live audio (microphone) input? I would like to create a live stream in which the words of a known text are displayed as they are being spoken into a microphone.


aeneas is certainly not suitable, since it requires all the text and all the audio in advance.

ASR-based tools would in theory allow such an operation mode, but I have not seen aligners that read from the mic buffer directly or have a built-in option/CLI for it.

Knowing the text in advance basically means that you can train your own language (textual) model adapted to that exact text, then use the (standard) acoustic model for your language and the usual alignment procedure. Hence, I am quite sure you can tweak e.g. CMU Sphinx or Kaldi to do it. Perhaps gentle (which is based on Kaldi) is worth looking into.
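
To illustrate what "a language model adapted to that exact text" can mean, here is a minimal sketch that estimates bigram probabilities from the known text alone. The function name `bigram_model` and the `<s>`/`</s>` boundary markers are my own choices for illustration. This is not the model format that CMU Sphinx or Kaldi consume; it only shows why the known text constrains the recognizer so strongly.

```python
import re
from collections import Counter, defaultdict


def bigram_model(text):
    """Estimate P(next word | previous word) from a single known text."""
    # Lowercase and keep only word-like tokens; add sentence boundary markers.
    words = ["<s>"] + re.findall(r"[a-z0-9']+", text.lower()) + ["</s>"]

    # Count how often each word follows each other word.
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

    # Turn counts into relative frequencies (no smoothing in this sketch).
    return {
        prev: {word: c / sum(following.values()) for word, c in following.items()}
        for prev, following in counts.items()
    }


model = bigram_model("the cat sat on the mat")
print(model["the"])
```

With a text this short, each word has very few possible successors, which is exactly the restriction you want when the speaker is reading a known script.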


I looked into gentle a few weeks ago and did notice that it seems to use an online algorithm. It doesn’t have built-in support for live audio input unfortunately, but it may be tweakable as you say (such as reimplementing it to use audio streams that work with either static or real-time input). I guess there’s no other way to find out than to just try it myself.


Another possibility is to just run an automatic speech recognition system (e.g. Sphinx or PocketSphinx can read from the mic input), and align its output with the ground truth text.

You need to deal with imperfect matching because the ASR might produce text that differs slightly from the ground truth, but if you want to chunk e.g. at sentence granularity (and then move on to the next sentence), you should be able to do it in real time.
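
A rough sketch of the matching step, assuming the recognizer hands you its words as plain text. It uses `difflib.SequenceMatcher` from the Python standard library to tolerate misrecognized words. The helper names `normalize` and `advance` and the `window` size are hypothetical, and the ASR output is simulated with a string rather than read from a real recognizer.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def advance(reference_words, recognized_words, start=0, window=30):
    """Return the reference index just past the last word matched by the ASR output.

    Only reference_words[start:start + window] is searched, so a repeated
    word far ahead in the text cannot pull the position forward.
    """
    segment = reference_words[start:start + window]
    matcher = SequenceMatcher(a=segment, b=recognized_words, autojunk=False)
    end = 0
    for block in matcher.get_matching_blocks():
        if block.size:  # the final block is a zero-length sentinel
            end = max(end, block.a + block.size)
    return start + end


reference = normalize("The quick brown fox jumps over the lazy dog. Then it sleeps.")
recognized = normalize("the quick brown box jumps over")  # "fox" misheard as "box"
position = advance(reference, recognized)
print(position, reference[:position])
```

Here the misrecognized word is skipped and the position still moves past "over". In a live setup you would call `advance` again with `start=position` whenever the recognizer emits new words, and highlight the next sentence once the position crosses its end.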



