After taking some time to dig into the project and forums, I’m more concerned. I have worked on building and validating large-scale ML datasets in tricky domains before and am best friends with an SLP, so I have some context for understanding the challenges involved with creating a dataset like this. I apologize if my tone is too harsh—my intent is to help hold the ML community to a higher standard on this issue and generate productive conversation via my criticisms.
The Common Voice FAQs say the right words about the mission of the project:
>”As voice technologies proliferate beyond niche applications, we believe they must serve all users equally.”
>”We want the Common Voice dataset to reflect the audio quality a speech-to-text engine will hear in the wild, so we’re looking for variety. In addition to a diverse community of speakers, a dataset with varying audio quality will teach the speech-to-text engine to handle various real-world situations, from background talking to car noise. As long as your voice clip is intelligible, it should be good enough for the dataset.”
However, your data validation criteria both implicitly and explicitly exclude entire classes of people from the dataset, and allow validators to impose an arbitrary standard of purity regarding what constitutes “correct” speech. In doing so, you are influencing who is and isn’t understood by the systems built upon this data. Examples from the docs (https://discourse.mozilla.org/t/discussion-of-new-guidelines...):
>”You need to check very carefully that what has been recorded is exactly what has been written - reject if there are even minor errors.”
As currently stated, this criterion categorically excludes people for whom speaking without “even minor errors” is simply not possible (e.g., lalling and other phonological disorders, where certain phonemes can’t be formed), based on the validators’ subjective perception of data cleanliness.
>”Most recordings are of people talking in their natural voice. You can accept the occasional non-standard recording that is shouted, whispered, or obviously delivered in a ‘dramatic’ voice. Please reject sung recordings and those using a computer-synthesized voice.”
Please watch this example of a person you are defining out of your dataset: (https://m.youtube.com/watch?v=5HgD0PXq0E4)
Look at this kid’s face light up and tell me that’s not his new natural voice. An electrolarynx is not a computer-synthesized voice (you manipulate the muscles in your neck to generate vibrations, like an external set of vocal cords), although it would almost certainly be mistaken for one and summarily sent to the “clip graveyard” (https://voice.mozilla.org/en/about).
>”I tend to click ‘no’ and move on for extreme mispronounced words. I’m of the opinion that soon enough, another speaker from their nationality will submit a correct recording.”
Again, the use of the word “correct” here is problematic. Rejecting borderline cases and waiting for “cleaner” samples is a severe trap to fall into, regardless of the domain.
>”I do the same as you. Accept if it’s an elongation; reject if the reader takes two attempts to start the word.”
Again, this almost categorically excludes people with a stutter and other types of speech disorders.
@dabinat gets it right with this comment:
>”There are uses for CV and DeepSpeech beyond someone directly dictating to their computer. In my opinion, CV’s voice archive should contain as many different ways to say something as possible.”
But then...
>”You may well be right. I’d be interested to hear what the programmers’ expectations are.”
>”I will ping @kdavis and @josh_meyer for feedback on the ML expectations (in terms of what’s good/bad for deepspeech).”
Yikes. So the data is being selected to improve the performance benchmarks of the speech recognition model, not to better reflect the nuance and variety of speech in the real world (the stated goal of Common Voice). Cherry-picking data to improve test benchmarks can easily reduce the model’s generalizability in other applications. Narrowing the range of human speech to make the problem easier (i.e., simpler to build a model that works well for most people) is antithetical to your stated mission. We can’t keep measuring AI progress in parameters and petabytes; it has to be about the people it helps.
>”I agree that we don’t want to scare off new contributors off by presenting the guidelines up-front as an off-putting wall of text that they have to read.”
Limiting the documentation and training available to data annotators so as not to scare them off is a surefire way to end up with inconsistently labeled data.
Although I find the above examples dismaying, I do not mean to ascribe any ill intent to your team or the volunteers; I understand the complexities at play here. But the outright dismissal of certain types of voices as out-of-scope or not “correct” causes real harm to real people, because ASR systems simply do not work well for people with various disabilities. I could find no direct mention or acknowledgement of the existence of speech disorders anywhere on the website or forum.
I believe there needs to be a more deliberate effort to construct a more representative dataset in order to meet your stated mission (which I am willing to volunteer my time towards). Just some initial ideas:
- Augment the dataset by folding in samples from external datasets (e.g. https://github.com/talhanai/speech-nlp-datasets). I’m not sure of the exact approach, but if movie scripts can be adapted, presumably other voice datasets can be too.
- Retain samples with speech errors like mispronunciations and stutters (perhaps with a flag indicating the error). In fact, why not retain all samples, flagging those that are unintelligible? At least keep it available, for data provenance purposes (so it is known what was excluded and can be reversed).
- Establish relationships with speech-language pathologists to collect or validate samples (e.g., at universities or the VA, which has many complex/polytrauma voice patients). Sessions with SLPs often involve having patients read sentences aloud, so it’s a familiar task. This is probably the best way to collect data from people with voice disorders without making volunteer annotators responsible for analyzing a complex subset.
- Use inter-annotator agreement measures to characterize uncertainty about sample accuracy, rather than a binary accept/reject decision (a rough sketch of what that could look like follows this list).
- Collect/solicit more samples from people >70yrs old, since they are currently underrepresented in your data. Is there anyone over the age of 80 in your dataset at all?
- Improve your documentation and standards to be explicitly transparent about the ways in which the dataset does not currently represent everyone, and about plans for bridging these gaps.
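To make the inter-annotator agreement idea concrete, here is a minimal sketch of scoring per-clip agreement instead of issuing a hard accept/reject. This is purely illustrative: the clip_id/annotator_id/label columns are hypothetical and not your actual schema.

```python
import pandas as pd

# Hypothetical long-format annotation table: one row per (clip, annotator) judgment.
votes = pd.DataFrame({
    "clip_id":      ["c1", "c1", "c1", "c2", "c2", "c2"],
    "annotator_id": ["a1", "a2", "a3", "a1", "a2", "a3"],
    "label":        ["accept", "accept", "reject", "accept", "accept", "accept"],
})

def per_clip_agreement(df: pd.DataFrame) -> pd.Series:
    """Fraction of annotators who chose the majority label for each clip."""
    counts = df.groupby(["clip_id", "label"]).size().unstack(fill_value=0)
    return counts.max(axis=1) / counts.sum(axis=1)

print(per_clip_agreement(votes))
# c1 -> 0.67  (contested: keep the clip, but flag low confidence)
# c2 -> 1.00  (unanimous)
```

Downstream users could then filter on that score themselves, and a chance-corrected statistic like Fleiss’ kappa would be a natural next step once each clip has more than a couple of judgments.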
Hi! I'm the lead engineer for Common Voice and I wanted to thank you for taking the time to write such a thoughtful comment. Your criticisms on disordered speech and exclusion criteria are absolutely on point, and the issue of speaker diversity in our dataset as well as the lack of nuance in our validation mechanism is something we're very aware of and actively working to address. We are and have historically been a very small team and have thus far concentrated our efforts on language diversity, which is an explanation but not an excuse for some of these gaps.
There are licensing concerns with folding in samples from other datasets, because all of our data (sentences and audio) is CC0, but I definitely hear you on looking for ways to expand our scope. As part of our commitment to open data, all voice samples are released as part of our dataset regardless of their validation status, and we do not filter or discard any contributions from our community. One of the things the team is currently scoping is how to allow contributors to provide reasons for rejecting a particular clip, to enable exactly the kind of post-hoc analysis you're describing.
Please do join us on Discourse or Matrix; we would love for you to be involved in ongoing discussions on how to improve inclusion and accessibility. Again, thank you for taking the time to write this up. I really appreciate it.
It might be a good idea to post this comment directly to Discourse to make sure it gets Mozilla’s attention.
But as I mentioned, this has been discussed, including the ability for users to add flags to their profiles to indicate disordered speech.
IMO it might be better to include disordered speech in a separate dataset with separate validation requirements, which would require new features on the site. But the new “target segments” feature is a step towards achieving such a thing.
You raise good points. For what it's worth, I think all "invalidated" samples are still included in the distribution (invalidated.tsv), with the number of up and down votes for each (but not the reasoning).
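If someone wanted to treat those vote counts as a soft agreement signal rather than a hard filter, a minimal sketch might look like this (assuming the release keeps up_votes/down_votes and path columns; adjust the names if the actual headers differ):

```python
import pandas as pd

# Load the invalidated clips shipped with the Common Voice release.
inv = pd.read_csv("invalidated.tsv", sep="\t")

# Margin of rejection: values close to 0 mean validators disagreed about the clip.
total = inv["up_votes"] + inv["down_votes"]
inv["reject_margin"] = (inv["down_votes"] - inv["up_votes"]) / total

# Narrowly rejected clips are candidates for re-review or for inclusion with a
# "contested" flag, rather than silent exclusion.
contested = inv[inv["reject_margin"] < 0.5].sort_values("reject_margin")
print(contested[["path", "up_votes", "down_votes", "reject_margin"]].head())
```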