
Data and compute are the largest hurdles. I only have one GPU and training one model takes 3+ days, so I am limited by that. Also, scraping from YouTube takes time and a lot of storage (multiple TBs).

Mozilla Common Voice data is already used for training.




Thanks for sharing this project. What do you think of the Mozilla Common Voice data? The random sample I looked at a while back seemed pretty poor -- background noise, stammering, delays before the speaker starts, etc.

I was hoping to use it as a good training base, but the issues I encountered made me wary that the data quality would adversely affect any outcomes.


Depending on your objective, noisy data might be useful. I'd like LibreASR to also work in noisy environments, so training on data that is noisy should already help a bit with that. But yeah - stammering and delays are present not only in Common Voice but also Tatoeba and YouTube.


There is also a lot of audiobook data in many languages that is easy to scrape and align using a basic model updated for each language, or by using YouTube videos with subtitles, which are almost aligned, for a first version of the model, then realigning.
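
For the YouTube route, something like this youtube-dl call (just a sketch -- the output template and language choice are illustrative, not from the project) fetches only the audio track plus uploader and auto-generated subtitles, whose timestamped cues give you that "almost aligned" transcript for a first pass:

    # audio only, plus manual and auto-generated English subtitles (VTT cues carry timestamps)
    youtube-dl -f bestaudio \
        --write-sub --write-auto-sub --sub-lang en --sub-format vtt \
        -o '%(id)s.%(ext)s' \
        "$URL"

Auto-captions are only roughly timed, hence the need to realign with the model afterwards.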


For the compute problem: maybe you could use a GPU-powered cloud server such as https://www.paperspace.com/ . I don't know the current prices, but I remember it was quite affordable.


> I remember it was quite affordable.

Relative to what? Paperspace is one of the costlier GPU providers.


Okay, you're right, but it's also really performant, so imho you can get a lot of work done in less time.

For something cheaper, there's this post on reddit:

https://amp.reddit.com/r/devops/comments/dqh09n/cheapest_clo...


performant? It's the same GPU..?


Why does it take a lot of data? Afaik you can select lower quality in youtube-dl, but you don't even need the video, do you?


> Why does it take a lot of data? Afaik you can select lower quality in youtube-dl, but you don't even need the video, do you?

But you need supervised data too.


I know you can scrape only the audio from YouTube with YouTubeDL, but it's somewhat annoying.


I use something akin to

    alias downloadmusic='youtube-dl --extract-audio --audio-quality 0 --add-metadata'
in my .bashrc

I find that helps with the annoyance of downloading things off of YT. This is for music obviously, but there's an option to download subtitles as well.

EDIT: Typed this from memory, there may be errors in the alias.
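
If you want the subtitles too, the same alias can be extended (flags also from memory, so check youtube-dl --help; the alias name is just an example):

    # hypothetical variant of the alias above, also grabbing English subtitles
    alias downloadspeech='youtube-dl --extract-audio --audio-quality 0 --write-sub --write-auto-sub --sub-lang en'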


    youtube-dl -f bestaudio $URL
Dunno when that went in but it works now.


So do you scrape videos from YouTube with subtitles to collect data?



