This is great! I’m always excited to see new Common Voice releases. As someone a...

scribu · on July 1, 2020

What if you first trained a classifier that told you whether the utterance is a single word or multiple words? Then, based on that prediction, you would use one of two separate models.

The technique you're thinking of is called oversampling and there are many other general techniques for dealing with imbalanced datasets, as it's a very common situation.
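As a rough illustration of random oversampling, here is a minimal sketch that duplicates examples from the smaller classes until every class matches the largest one. The function name `oversample`, the `label_of` callback, and the toy labels are all made up for this example; real pipelines often weight the sampler instead of copying data.

```python
import random
from collections import Counter


def oversample(examples, label_of, seed=0):
    """Return a shuffled copy of `examples` in which every class
    has as many items as the largest class (random oversampling)."""
    rng = random.Random(seed)

    # Group the examples by their class label.
    by_label = {}
    for ex in examples:
        by_label.setdefault(label_of(ex), []).append(ex)

    # Every class is topped up to the size of the largest class.
    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing items with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Toy data: many multi-word utterances, few single-word ones.
data = [("multi", i) for i in range(8)] + [("single", i) for i in range(2)]
balanced = oversample(data, label_of=lambda ex: ex[0])
counts = Counter(label for label, _ in balanced)
```

After this runs, `counts` holds the same number of "single" and "multi" examples, with the "single" ones repeated.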

lunixbochs · on July 1, 2020

Thanks, the oversampling mention gives me a good reference to start.

The model itself has generalized pretty well to handle both single- and multi-word utterances, I think, without a separate classifier, but I'm definitely not going to rule out multi-model recognition in the long run.

My main issues with single words right now are:

- The model sometimes plays favorites with numbers (ace vs eight)

- Collecting enough word-granularity training data for words-that-are-not-numbers (I've done a decent job of this so far, but it's a slow and painful process. I've considered building a frontend to turn sentence datasets into word datasets with careful alignment)
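To make the sentence-to-word frontend idea concrete, here is a minimal sketch that turns per-word timings into sample ranges for cutting clips. It assumes the timings already exist as `(word, start_sec, end_sec)` tuples from some aligner; the function name `word_slices` and the padding value are hypothetical and not taken from any particular tool.

```python
def word_slices(alignment, sample_rate, total_samples, pad_sec=0.05):
    """Convert (word, start_sec, end_sec) timings into
    (word, start_sample, end_sample) ranges, padded on both sides
    and clamped to the bounds of the recording."""
    slices = []
    for word, start, end in alignment:
        lo = max(0, int(round((start - pad_sec) * sample_rate)))
        hi = min(total_samples, int(round((end + pad_sec) * sample_rate)))
        if hi > lo:  # skip empty or inverted ranges
            slices.append((word, lo, hi))
    return slices


# Hypothetical timings for a 0.7 s recording sampled at 16 kHz.
alignment = [("last", 0.00, 0.30), ("time", 0.30, 0.70)]
slices = word_slices(alignment, sample_rate=16000, total_samples=11200)
```

The padding is a guess at how much context keeps word onsets and endings intact; it makes adjacent clips overlap slightly, so it would need tuning against real data.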

nmstoker · on July 1, 2020

For that last point, forced alignment tools may be useful.

One issue to watch for, though, is elision: a word inside a sentence is often pronounced differently from the same word said in isolation. For example, said separately, "last" and "time" typically keep the final "t" in "last", but said together they commonly come out more like "las time".

lunixbochs · on July 1, 2020

Yeah, I'm familiar with forced alignment. This is slightly nicer than the generic forced alignment, because my model has trained on the alignment of all of my training data already. My character based models already have pretty good guesses for the word alignment.

I think I'd be very cautious about it and use a model with a different architecture than the aligner to validate extracted words, and probably play with training on the data a bit to see if the resulting model makes sense or not. I do have examples of most English words to compare extracted words against.
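A minimal sketch of that validation step, assuming the second model is exposed as a plain `transcribe(audio) -> str` callable (a hypothetical interface, with a stand-in recognizer below): an extracted clip is kept only if the independent model hears the word the aligner claimed.

```python
def validate_clips(clips, transcribe):
    """Split (expected_word, audio) pairs into kept and rejected lists,
    keeping a clip only when an independent recognizer agrees."""
    kept, rejected = [], []
    for expected, audio in clips:
        heard = transcribe(audio).strip().lower()
        if heard == expected.strip().lower():
            kept.append((expected, audio))
        else:
            rejected.append((expected, audio))
    return kept, rejected


# Stand-in recognizer for illustration: "audio" is just a string here,
# and the elided clip is heard as "las" rather than "last".
fake_transcribe = {"clip_a": "time", "clip_b": "las"}.get
kept, rejected = validate_clips(
    [("time", "clip_a"), ("last", "clip_b")], fake_transcribe
)
```

Exact string match is the strictest possible check; a looser comparison, such as a confidence threshold, might keep more usable data.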

nmstoker · on July 1, 2020

Sounds like a great approach