Data and compute are the largest hurdles. I only have one GPU and training one model takes 3+ days, so I am limited by that. Also, scraping from YouTube takes time and a lot of storage (multiple TBs).
Mozilla Common Voice data is already used for training.
Thanks for sharing this project. What do you think of the data with Mozilla Common Voice? The random sampling I looked at a while back seemed pretty poor -- background noise, stammering, delays in beginning the speaking, etc.
I was hoping to use it as a good training base, but the issues I encountered made me wary that the data quality would adversely affect any outcomes.
Depending on your objective, noisy data might be useful.
I'd like LibreASR to also work in noisy environments, so training on data that is noisy should already help a bit with that.
But yeah - stammering and delays are present not only in Common Voice but also in Tatoeba and YouTube data.
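To illustrate the "train on noisy data for robustness" idea, here is a minimal additive-noise augmentation sketch in plain Python (this is my own toy example, not code from LibreASR; the function name and the use of Python lists instead of real tensors are my assumptions):

```python
import math
import random

def add_noise(samples, snr_db, rng=None):
    """Mix Gaussian noise into a waveform at a target signal-to-noise ratio.

    samples: list of float audio samples; snr_db: desired SNR in decibels.
    (Toy sketch - a real pipeline would operate on arrays/tensors.)
    """
    rng = rng or random.Random(0)
    # Signal power = mean square of the samples
    power = sum(s * s for s in samples) / len(samples)
    # Noise power needed so that 10*log10(power / noise_power) == snr_db
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]

# 0.1 s of a 440 Hz sine at 16 kHz, then augment it at 10 dB SNR
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noisy = add_noise(clean, snr_db=10)
```

In practice you would mix in recorded background noise (cafés, traffic) rather than white noise, but the SNR bookkeeping is the same.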
There is also a lot of data from audiobooks in many languages that is easy to scrape and align using a basic model updated for each language, or by using YouTube videos whose subtitles are roughly aligned: use those for a first version of the model, then realign.
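For the subtitle-based first pass, the starting point is just cutting audio at subtitle timestamps. A minimal sketch of parsing SubRip (.srt) cues into (start, end, text) segments, in plain Python (my own sketch, not project code):

```python
import re

def parse_time(ts):
    """Convert an SRT timestamp like '00:01:02,500' to seconds."""
    h, m, s_ms = ts.split(":")
    s, ms = s_ms.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(text):
    """Return a list of (start_sec, end_sec, text) for each cue."""
    segments = []
    # Cues are separated by blank lines
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2:
            continue
        # lines[0] is the cue index, lines[1] the timing line
        m = re.match(r"(\S+) --> (\S+)", lines[1])
        if not m:
            continue
        segments.append((parse_time(m.group(1)),
                         parse_time(m.group(2)),
                         " ".join(lines[2:])))
    return segments

srt = """1
00:00:01,000 --> 00:00:02,500
hello world

2
00:00:03,000 --> 00:00:04,250
second line
"""
segs = parse_srt(srt)
```

Each segment then gives you a slice of audio plus a rough transcript, which is enough supervision to bootstrap a model before realigning with it.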
For the compute problem: maybe you can use a GPU-powered cloud server such as https://www.paperspace.com/
I don't know the current prices, but I remember it was quite affordable.
I find that helps with the annoyance of downloading things off of YT. This is for music obviously, but there's an option to download subtitles as well.
EDIT: Typed this from memory, there may be errors in the alias.