Data and compute are the largest hurdles. I only have one GPU and training one model takes 3+ days, so I am limited by that. Also, scraping from YouTube takes time and a lot of storage (multiple TBs).
Mozilla Common Voice data is already used for training.
Thanks for sharing this project. What do you think of the data with Mozilla Common Voice? The random sampling I looked at a while back seemed pretty poor -- background noise, stammering, delays in beginning the speaking, etc.
I was hoping to use it as a good training base, but the issues I encountered made me wary that the data quality would adversely affect any outcomes.
Depending on your objective, noisy data might be useful.
I'd like LibreASR to also work in noisy environments, so training on data that is noisy should already help a bit with that.
But yeah - stammering and delays are present not only in Common Voice but also in Tatoeba and YouTube data.
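To illustrate the "train on noisy data for robustness" idea, here is a minimal additive-noise augmentation sketch in plain Python (this is my own toy example, not code from LibreASR; the function name and the use of Python lists instead of real tensors are my assumptions):

```python
import math
import random

def add_noise(samples, snr_db, rng=None):
    """Mix Gaussian noise into a waveform at a target signal-to-noise ratio.

    samples: list of float audio samples; snr_db: desired SNR in decibels.
    (Toy sketch - a real pipeline would operate on arrays/tensors.)
    """
    rng = rng or random.Random(0)
    # Signal power = mean square of the samples
    power = sum(s * s for s in samples) / len(samples)
    # Noise power needed so that 10*log10(power / noise_power) == snr_db
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]

# 0.1 s of a 440 Hz sine at 16 kHz, then augment it at 10 dB SNR
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noisy = add_noise(clean, snr_db=10)
```

In practice you would mix in recorded background noise (cafés, traffic) rather than white noise, but the SNR bookkeeping is the same.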
There is also a lot of data from audiobooks in many languages that is easy to scrape and align using a basic model updated for each language, or by using YouTube videos whose subtitles are roughly aligned: use those for a first version of the model, then realign.
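For the subtitle-based first pass, the starting point is just cutting audio at subtitle timestamps. A minimal sketch of parsing SubRip (.srt) cues into (start, end, text) segments, in plain Python (my own sketch, not project code):

```python
import re

def parse_time(ts):
    """Convert an SRT timestamp like '00:01:02,500' to seconds."""
    h, m, s_ms = ts.split(":")
    s, ms = s_ms.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(text):
    """Return a list of (start_sec, end_sec, text) for each cue."""
    segments = []
    # Cues are separated by blank lines
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2:
            continue
        # lines[0] is the cue index, lines[1] the timing line
        m = re.match(r"(\S+) --> (\S+)", lines[1])
        if not m:
            continue
        segments.append((parse_time(m.group(1)),
                         parse_time(m.group(2)),
                         " ".join(lines[2:])))
    return segments

srt = """1
00:00:01,000 --> 00:00:02,500
hello world

2
00:00:03,000 --> 00:00:04,250
second line
"""
segs = parse_srt(srt)
```

Each segment then gives you a slice of audio plus a rough transcript, which is enough supervision to bootstrap a model before realigning with it.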
For the compute problem: maybe you can use a GPU-powered cloud server such as https://www.paperspace.com/
I don't know the current prices, but I remember it was quite affordable.
I find that helps with the annoyance of downloading things off of YT. This is for music obviously, but there's an option to download subtitles as well.
EDIT: Typed this from memory, there may be errors in the alias.