This is great! I’m always excited to see new Common Voice releases.
As someone actively using the data, I wish I could more easily see (and download lists for?) the older releases, as there have been 3-4 dataset updates for English now. If we don’t have access to versioned datasets, there’s no way to reproduce old whitepapers or models that use Common Voice. And at this point I don’t remember the statistics (hours, accent/gender breakdown) for each release. It would be neat to see that over time on the website.
I’m glad they’re working on single word recognition! This is something I’ve put significant effort into. It’s the biggest gap I’ve found in the existing public datasets - listening to someone read an audiobook or recite a sentence doesn’t seem to prepare the model very well for recognizing single words in isolation.
My model and training process have adapted for that, though I’m still not sure of the best way to balance training of that sort of thing. I have maybe 5 examples of each English word in isolation but 5000 examples of each number (Speech Commands), and it seems like the model will prefer e.g. “eight” over “ace”, I guess due to training balance.
Maybe I should be randomly sampling, say, 50 of the 5000 examples of each over-represented word every epoch, so the model still has a chance to learn from them without overtraining?
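For concreteness, here is a rough sketch of that per-epoch subsampling idea. It assumes the training set is just a list of (label, clip) pairs; the function name, the cap of 50, and the toy file names are made up for illustration and not taken from any real pipeline.

```python
import random
from collections import defaultdict


def subsample_epoch(examples, cap=50, seed=None):
    """Build one epoch with at most `cap` examples per label.

    `examples` is a list of (label, item) pairs. Labels with more than
    `cap` examples get a fresh random subset on every call; rarer labels
    are kept whole.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, item in examples:
        by_label[label].append(item)

    epoch = []
    for label, items in by_label.items():
        if len(items) > cap:
            items = rng.sample(items, cap)  # sample without replacement
        epoch.extend((label, item) for item in items)

    rng.shuffle(epoch)  # avoid feeding the model one label at a time
    return epoch


# Toy data mirroring the imbalance described above (hypothetical file names).
examples = [("eight", f"eight_{i}.wav") for i in range(5000)]
examples += [("ace", f"ace_{i}.wav") for i in range(5)]

epoch = subsample_epoch(examples, cap=50, seed=0)
```

Calling it with a different seed (or no seed) each epoch means the model eventually sees all 5000 "eight" clips, but only 50 of them compete with the 5 "ace" clips in any one pass.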
What if you first trained a classifier that told you if the utterance is a single word vs. multiple words? Then, based on that prediction, you would use one of two separate models.
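As a sketch of what that routing could look like, assuming each model is just a callable that takes audio samples: everything here is a stand-in, and in particular the length check below is only a placeholder for a trained single-vs-multi-word classifier.

```python
from typing import Callable, Sequence

Audio = Sequence[float]


def make_router(
    is_single_word: Callable[[Audio], bool],
    single_word_model: Callable[[Audio], str],
    sentence_model: Callable[[Audio], str],
) -> Callable[[Audio], str]:
    """Return a recognizer that picks a model based on the classifier."""

    def recognize(audio: Audio) -> str:
        if is_single_word(audio):
            return single_word_model(audio)
        return sentence_model(audio)

    return recognize


# Stand-ins for illustration only: a real system would plug in trained models.
recognize = make_router(
    is_single_word=lambda audio: len(audio) < 16000,  # placeholder heuristic
    single_word_model=lambda audio: "single",
    sentence_model=lambda audio: "sentence",
)
```

One thing to keep in mind with this design is that classifier mistakes compound: a sentence routed to the single-word model has no way to recover.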
What you're describing is closely related to oversampling (strictly speaking, subsampling the over-represented words is undersampling), and there are many other general techniques for dealing with imbalanced datasets, as it's a very common situation.
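A minimal sketch of one common way to oversample, assuming (label, clip) pairs again: draw examples with probability inversely proportional to how often their label appears, so every label gets roughly the same share of each epoch. The function name and toy data are illustrative.

```python
import random
from collections import Counter


def balanced_draws(examples, n_draws, seed=None):
    """Draw `n_draws` examples with replacement, balanced across labels.

    Each example is weighted by 1 / (count of its label), so every label
    has the same total weight no matter how many examples it has.
    """
    rng = random.Random(seed)
    counts = Counter(label for label, _ in examples)
    weights = [1.0 / counts[label] for label, _ in examples]
    return rng.choices(examples, weights=weights, k=n_draws)


# Toy data: 5000 clips of "eight", 5 clips of "ace" (hypothetical file names).
examples = [("eight", f"eight_{i}.wav") for i in range(5000)]
examples += [("ace", f"ace_{i}.wav") for i in range(5)]

draws = balanced_draws(examples, n_draws=2000, seed=0)
drawn_counts = Counter(label for label, _ in draws)
```

The catch is that the 5 rare clips get repeated hundreds of times, so the model can memorize them; augmenting the repeated clips is a common way to soften that.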
Thanks, the oversampling mention gives me a good reference to start.
The model itself has generalized pretty well to handle both single and multi word utterances I think, without a separate classifier, but I'm definitely not going to rule out multi-model recognition in the long run.
My main issues with single words right now are:
- The model sometimes plays favorites with numbers (ace vs eight)
- Collecting enough word-granularity training data for words-that-are-not-numbers (I've done a decent job of this so far, but it's a slow and painful process. I've considered building a frontend to turn sentence datasets into word datasets with careful alignment)
For that last point, forced alignment tools may be useful.
One issue to watch for, though, is elision: a word in a sentence is often pronounced differently from the same word in isolation. For example, said separately, "last" and "time" typically keep the final t in "last", yet said together it commonly sounds more like "las time".
Yeah, I'm familiar with forced alignment. This is slightly nicer than the generic forced alignment, because my model has trained on the alignment of all of my training data already. My character based models already have pretty good guesses for the word alignment.
I think I'd be very cautious about it and use a model with a different architecture than the aligner to validate extracted words, and probably play with training on the data a bit to see if the resulting model makes sense or not. I do have examples of most English words to compare extracted words against.
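Roughly, the extraction and validation steps might look like the sketch below. It assumes the aligner produces (word, start, end) timings in seconds and that audio is a plain sequence of samples; the padding value, the function names, and the `transcribe` callable (standing in for the second, differently-architected model) are all hypothetical.

```python
def extract_word_clips(samples, sample_rate, alignment, pad_s=0.02):
    """Cut one clip per aligned word out of a sentence recording.

    `alignment` is a list of (word, start_seconds, end_seconds) tuples.
    A little padding is added on both sides, clamped to the recording.
    """
    clips = []
    for word, start_s, end_s in alignment:
        start = max(0, int(round((start_s - pad_s) * sample_rate)))
        end = min(len(samples), int(round((end_s + pad_s) * sample_rate)))
        if end > start:  # skip empty or inverted spans
            clips.append((word, samples[start:end]))
    return clips


def filter_clips(clips, transcribe):
    """Keep only clips that an independent model hears as the same word."""
    return [(word, clip) for word, clip in clips if transcribe(clip) == word]


# Toy example: one second of fake audio at 16 kHz and a made-up alignment.
samples = list(range(16000))
alignment = [("last", 0.10, 0.35), ("time", 0.35, 0.70)]
clips = extract_word_clips(samples, 16000, alignment)

# Stand-in validator that only ever recognizes "time".
kept = filter_clips(clips, transcribe=lambda clip: "time")
```

Elided words are exactly the ones a validator like this should reject, since the clip cut from the sentence will not sound like the word said on its own.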