stergro's comments

stergro · on July 3, 2020

The whole project is very exciting. I hope it really is a game changer that lets private individuals and startups create new neural networks without a big investment in data collection.

I have worked on the Esperanto dataset of Common Voice over the last year, and we have now collected over 80 hours of Esperanto. I hope that in a year or two we'll have collected enough data to create the first usable neural network for a constructed language, and maybe the first voice assistant in Esperanto too. I will train a first experimental model with this release soon.

stergro · on May 2, 2020

The project wants to build up a dataset for training the neural networks behind speech recognition software. The first goal for every language is to collect 1200 hours. They have reached this, but a dataset is only released twice a year, which is why you still see 1100 hours at this download.

Other languages are looking good too. German has almost reached 600 hours and French almost 500 hours, and the website was localized into many other languages, small and big, in the last year.

The main factors in building a good dataset are (a) having a diverse group of donors and (b) having enough sentences. So if you want to help, you can donate your voice, validate other people's recordings, or collect more sentences to record. All three tasks are equally important for creating a good dataset.

https://voice.mozilla.org