After taking some time to dig into the project and forums, I’m more concerned. I have worked on building and validating large-scale ML datasets in tricky domains before and am best friends with an SLP, so I have some context for understanding the challenges involved with creating a dataset like this. I apologize if my tone is too harsh—my intent is to help hold the ML community to a higher standard on this issue and generate productive conversation via my criticisms.
The Common Voice FAQs say the right words about the mission of the project:
>”As voice technologies proliferate beyond niche applications, we believe they must serve all users equally.”
>”We want the Common Voice dataset to reflect the audio quality a speech-to-text engine will hear in the wild, so we’re looking for variety. In addition to a diverse community of speakers, a dataset with varying audio quality will teach the speech-to-text engine to handle various real-world situations, from background talking to car noise. As long as your voice clip is intelligible, it should be good enough for the dataset.”
However, your data validation criteria both implicitly and explicitly exclude entire classes of people from the dataset, and allow validators to impose an arbitrary standard of purity regarding what constitutes “correct” speech. In doing so, you are influencing who is and isn’t understood by the systems built upon this data. Examples from the docs (https://discourse.mozilla.org/t/discussion-of-new-guidelines...):
>”You need to check very carefully that what has been recorded is exactly what has been written - reject if there are even minor errors.”
As currently stated, this criterion categorically excludes people for whom speaking without “even minor errors” is simply not possible (e.g., lalling and other phonological disorders, where certain phonemes can’t be formed), based on the validators’ subjective perception of data cleanliness.
>”Most recordings are of people talking in their natural voice. You can accept the occasional non-standard recording that is shouted, whispered, or obviously delivered in a ‘dramatic’ voice. Please reject sung recordings and those using a computer-synthesized voice.”
Please watch this example of a person you are defining out of your dataset: (https://m.youtube.com/watch?v=5HgD0PXq0E4)
Look at this kid’s face light up and tell me that’s not his new natural voice. An electrolarynx is not a computer-synthesized voice (you manipulate the muscles in your neck to generate vibrations, like an external set of vocal cords), although it would almost certainly be mistaken for one and summarily sent to the “clip graveyard” (https://voice.mozilla.org/en/about).
>”I tend to click ‘no’ and move on for extreme mispronounced words. I’m of the opinion that soon enough, another speaker from their nationality will submit a correct recording.”
Again, the use of the word “correct” here is problematic. Rejecting borderline cases and waiting for “cleaner” samples is a severe trap to fall into, regardless of the domain.
>”I do the same as you. Accept if it’s an elongation; reject if the reader takes two attempts to start the word.”
Again, this almost categorically excludes people with a stutter and other types of speech disorders.
@dabinat gets it right with this comment:
>”There are uses for CV and DeepSpeech beyond someone directly dictating to their computer. In my opinion, CV’s voice archive should contain as many different ways to say something as possible.”
But then...
>”You may well be right. I’d be interested to hear what the programmers’ expectations are.”
>”I will ping @kdavis and @josh_meyer for feedback on the ML expectations (in terms of what’s good/bad for deepspeech).”
Yikes. So the data is being selected to improve the performance benchmarks of the speech recognition model, not to better reflect the nuance and variety of speech in the real world (the stated goal of Common Voice). Cherry-picking data to improve test benchmarks can easily reduce the model’s generalizability in other applications. Narrowing the range of human speech to make the problem easier (i.e., simpler to build a model that works well for most people) is antithetical to your stated mission. We can’t keep measuring AI progress in parameters and petabytes; it has to be about the people it helps.
>”I agree that we don’t want to scare off new contributors off by presenting the guidelines up-front as an off-putting wall of text that they have to read.”
Limiting the documentation and training available to data annotators so as not to scare them off is a surefire way to end up with inconsistently labeled data.
Although I find the above examples dismaying, I do not mean to ascribe any ill intent to your team or the volunteers; I understand the complexities at play here. But the outright dismissal of certain types of voices as out-of-scope or not “correct” causes real harm to real people, because ASR systems simply do not work well for people with various disabilities. I could find no direct mention or acknowledgement of the existence of speech disorders anywhere on the website or forum.
I believe there needs to be a more deliberate effort to construct a more representative dataset in order to meet your stated mission (which I am willing to volunteer my time towards). Just some initial ideas:
- Augment the dataset by folding in samples from external datasets (e.g. https://github.com/talhanai/speech-nlp-datasets). I’m not sure of the exact approach, but if movie scripts can be adapted, presumably other voice datasets can be too.
- Retain samples with speech errors like mispronunciations and stutters (perhaps with a flag indicating the error). In fact, why not retain all samples, flagging those that are unintelligible? At least keep it available, for data provenance purposes (so it is known what was excluded and can be reversed).
- Establish relationships with speech-language pathologists to collect or validate samples (e.g., at universities or the VA, which has many complex/polytrauma voice patients). Sessions with SLPs often involve having patients read sentences aloud, so it’s a familiar task. This is probably the best way to collect data from people with voice disorders without making volunteer annotators responsible for analyzing a complex subset.
- Use inter-annotator agreement measures to characterize uncertainty about sample accuracy, rather than a binary accept/reject decision (a rough sketch of what that could look like follows this list).
- Collect/solicit more samples from people >70yrs old, since they are currently underrepresented in your data. Is there anyone over the age of 80 in your dataset at all?
- Improve your documentation and standards to be explicitly transparent about the ways in which the dataset does not currently represent everyone, and about plans for bridging these gaps.
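To make the inter-annotator agreement idea concrete, here is a minimal sketch of scoring per-clip agreement instead of issuing a hard accept/reject. This is purely illustrative: the clip_id/annotator_id/label columns are hypothetical and not your actual schema.

```python
import pandas as pd

# Hypothetical long-format annotation table: one row per (clip, annotator) judgment.
votes = pd.DataFrame({
    "clip_id":      ["c1", "c1", "c1", "c2", "c2", "c2"],
    "annotator_id": ["a1", "a2", "a3", "a1", "a2", "a3"],
    "label":        ["accept", "accept", "reject", "accept", "accept", "accept"],
})

def per_clip_agreement(df: pd.DataFrame) -> pd.Series:
    """Fraction of annotators who chose the majority label for each clip."""
    counts = df.groupby(["clip_id", "label"]).size().unstack(fill_value=0)
    return counts.max(axis=1) / counts.sum(axis=1)

print(per_clip_agreement(votes))
# c1 -> 0.67  (contested: keep the clip, but flag low confidence)
# c2 -> 1.00  (unanimous)
```

Downstream users could then filter on that score themselves, and a chance-corrected statistic like Fleiss’ kappa would be a natural next step once each clip has more than a couple of judgments.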
Hi! I'm the lead engineer for Common Voice and I wanted to thank you for taking the time to write such a thoughtful comment. Your criticisms on disordered speech and exclusion criteria are absolutely on point, and the issue of speaker diversity in our dataset as well as the lack of nuance in our validation mechanism is something we're very aware of and actively working to address. We are and have historically been a very small team and have thus far concentrated our efforts on language diversity, which is an explanation but not an excuse for some of these gaps.
There are licensing concerns with folding in samples from other datasets, because all of our data (sentences and audio) is CC0, but I definitely hear you on looking for ways to expand our scope. As part of our commitment to open data, all voice samples are released as part of our dataset regardless of their validation status, and we do not filter or discard any contributions from our community. One of the things the team is currently scoping is how to allow contributors to provide reasons for rejecting a particular clip, to enable exactly the kind of post-hoc analysis you're describing.
Please do join us on Discourse or Matrix; we would love for you to be involved in ongoing discussions on how to improve inclusion and accessibility. Again, thank you for taking the time to write this up. I really appreciate it.
It might be a good idea to post this comment directly to Discourse to make sure it gets Mozilla’s attention.
But as I mentioned, this has been discussed, including the ability for users to add flags to their profiles to indicate disordered speech.
IMO it might be better to include disordered speech in a separate dataset with separate validation requirements, which would require new features on the site. But the new “target segments” feature is a step towards achieving such a thing.
You raise good points. For what it's worth, I think all "invalidated" samples are still included in the distribution (invalidated.tsv), with the number of up and down votes for each (but not the reasoning).
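If someone wanted to treat those vote counts as a soft agreement signal rather than a hard filter, a minimal sketch might look like this (assuming the release keeps up_votes/down_votes and path columns; adjust the names if the actual headers differ):

```python
import pandas as pd

# Load the invalidated clips shipped with the Common Voice release.
inv = pd.read_csv("invalidated.tsv", sep="\t")

# Margin of rejection: values close to 0 mean validators disagreed about the clip.
total = inv["up_votes"] + inv["down_votes"]
inv["reject_margin"] = (inv["down_votes"] - inv["up_votes"]) / total

# Narrowly rejected clips are candidates for re-review or for inclusion with a
# "contested" flag, rather than silent exclusion.
contested = inv[inv["reject_margin"] < 0.5].sort_values("reject_margin")
print(contested[["path", "up_votes", "down_votes", "reject_margin"]].head())
```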