Good FLOSS speech recognition and TTS are badly needed. Such interaction should not be left to an oligopoly with a bad history of not respecting users' freedoms and privacy.
Mozilla Common Voice is definitely trying. I always do a few validations and record a few clips if I have a few minutes to spare, and I recommend everyone do the same. They need volunteers to validate and upload speech clips to create a dataset.
I like the idea, and decided to try doing some validation. The first thing I noticed is that it asks me to make a yes-or-no judgment of whether the sentence was spoken "accurately", but nowhere on the site is it explained what "accurate" means, or how strict I should be.
(The first clip I got was spoken more or less correctly, but a couple of words are slurred together and the prosody is awkward. Without having a good idea of the standards and goals of the project, I have no idea whether including this clip would make the overall dataset better or worse. My gut feeling is that it's good for training recognition, and bad for training synthesis.)
This seems to me like a major issue, since it should take a relatively small amount of effort to write up a list of guidelines, and it would be hugely beneficial to establish those guidelines before asking a lot of volunteers to donate their time. I don't find it encouraging that this has been an open issue for four years, with apparently no action except a bunch of bikeshedding: https://github.com/common-voice/common-voice/issues/273
After listening to about 10 clips, your point becomes abundantly clear.
One speaker, who sounded like they were from the midwestern United States, was dropping the S off words in a couple of clips. I wasn't sure whether these were misreads or some accent I'd never heard.
Another speaker, with a thick accent that sounded European, sounded out all the vowels in "circuit". Had I not had the line being read in front of me, I don't think I'd have understood the word.
I heard a speaker with an Indian accent who added a preposition to the sentence; it was inconsequential but incorrect nonetheless.
I frequently hear these random prepositions added as flourishes by some Indian coworkers; does anyone know the reason? It's kind of like how Americans interject "Umm..." or drop prepositions (e.g. "Are you done your meal?"), and I almost didn't pick up on it. For that matter, where did the American habit of dropping prepositions come from? It seems to be primarily people in the Northeast.
I can't quite imagine superfluous prepositions (could you give an example?), but I have found it slightly amusing, while learning Hindi, to come across things where I think: oh, that's why you sometimes hear X from Indian English speakers. It's just a slightly 'too' literal¹ mapping from Hindi, or an attempt to use a grammatical construction that doesn't really exist in English, like 'topic marking'.
[¹] If that's even fair given it's a dialect in its own right - Americans also say things differently than I would as a 'Britisher'
That's not one I've heard. Examples that come to mind are 'even I' (which seems closer to 'I too' than to the 'you'd scarcely believe it, but I' that it naturally sounds like to me), 'he himself' (or similar subject emphasis), and adverb repetition.
I'd say it's mostly subtler things I've noticed, though (I suppose that should be the expected distribution!); they're just harder to recall as a result.
(Just want to emphasise I'm not making fun of anybody or saying anything's wrong, in case it's not clear in text. I'm just enjoying learning Hindi, fairly interested in language generally, and interested/amused to notice these things.)
Just thought of another - '[something is] very less' - which comes, presumably, from कम (kam) being used for both 'little/few' and 'less than'.
Hindi is much more economical: translated literally, one says things like 'than/from/compared to orange, lemon is sour', and 'orange is little/less [with no comparison] sour'.
Which, I believe, is what gives rise to Indian English sentences like 'the salt in this is very less' (it needs more salt, there's very little).
I downloaded the (unofficial) Common Voice app [1] and it provides a link to some guidelines [2], which also aren't official but look sensible and seem like the best there is at the moment.
If you read the docs, voice2json is a layer on top of the actual speech recognition engine, and it supports Mozilla DeepSpeech, PocketSphinx, and a few others as the underlying engine.
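If it helps make the layering concrete, here's a rough Python sketch of shelling out to voice2json; the command name and output format are how I remember the docs describing them, so treat the details as an assumption, and the WAV filename is just a placeholder.

    # Rough sketch: drive voice2json from Python. The underlying engine
    # (DeepSpeech, PocketSphinx, Kaldi, ...) is selected by the profile you
    # install, so this code doesn't change when you swap engines.
    import subprocess

    def transcribe(wav_path: str) -> str:
        # "transcribe-wav" reads WAV files and prints one JSON result per line
        # (command name per the voice2json docs, as I recall them).
        out = subprocess.run(
            ["voice2json", "transcribe-wav", wav_path],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()

    print(transcribe("command.wav"))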
I've used the DeepSpeech project a fair amount and it is good. It's not perfect, certainly, and in my mind it honestly isn't good enough yet for accurate transcription, but it's good: easy to work with, pretty good results, and all the right kinds of free.
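"Easy to work with" in practice: a minimal sketch of the DeepSpeech Python API as I remember it from the 0.9.x releases; the model/scorer filenames are placeholders.

    # Minimal DeepSpeech transcription sketch (0.9.x-era Python API).
    # Audio is assumed to be 16 kHz, 16-bit mono PCM.
    import wave
    import numpy as np
    from deepspeech import Model

    ds = Model("deepspeech-0.9.3-models.pbmm")                  # acoustic model
    ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")   # optional language model

    with wave.open("utterance.wav", "rb") as w:
        assert w.getframerate() == 16000 and w.getnchannels() == 1
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    print(ds.stt(audio))   # best transcript as plain text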
That's fine for training your own model, but I don't think you could distribute the training set. That seems like a clear copyright violation, against one of the groups that cares most about copyright.
Maybe you could convince a couple of indie creators or state-run programs to licence their audio? But I'm not sure if negotiating that is more efficient than just recording a bit more audio, or promoting the project to get more volunteers.
It would likely be a lot easier for someone from within the BBC, CBC, PBS, or another public broadcaster to convince their employer to contribute to the models. These organizations often have accessibility mandates with real teeth, and real costs attached to implementing them. The work of closed captioning, for example, can realistically be improved by excellent open-source speech recognition and TTS models without handing all of the power over to YouTube and the like.
It would still be an uphill battle to convince them to hand over a training set, but the legal department can likely be convinced if the dataset they contribute back is heavily chopped-up audio of the original content, especially if they have the originals from before mixing. I imagine short audio files without any of the music, sound effects, or visual content are pretty much worthless as far as IP goes.
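To illustrate what "heavily chopped up" could mean in practice, here's a sketch using pydub (my choice, not anything a broadcaster actually uses) to cut a dialogue-only stem into short labelled clips around caption timings; the filenames and segment list are made up.

    # Sketch: cut a dialogue-only mix into short clips keyed to caption timings.
    from pydub import AudioSegment

    dialogue = AudioSegment.from_file("episode_dialogue_stem.wav")

    # (start_ms, end_ms, text) triples, e.g. parsed from closed captions
    segments = [
        (12_000, 14_500, "Welcome back to the programme."),
        (15_200, 18_900, "Tonight we look at open-source speech recognition."),
    ]

    for i, (start, end, text) in enumerate(segments):
        clip = dialogue[start:end]                      # pydub slicing is in milliseconds
        clip.export(f"clip_{i:05d}.wav", format="wav")  # short, context-free audio
        with open(f"clip_{i:05d}.txt", "w") as f:       # paired transcript label
            f.write(text)

The point being that each exported clip is a few seconds of bare dialogue plus its text, which is valuable for training but hard to reassemble into the original programme.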
> That's fine for training your own model, but I don't think you could distribute the training set. That seems like a clear copyright violation, against one of the groups that cares most about copyright.
I'm not sure that's a clear copyright violation. Sure, at a glance it seems like a derivative work, but it may be altered enough that it is not. I believe that collages and reference guides like CliffsNotes are both legal.
I think a bigger problem would be that the scripts, and even the closed captioning, rarely match the recorded audio 100%.
And also... it's not like the program actually contains a copy of the training data, right? The training data is a tool which is used to build a model.
How is it different from things like GPT-3, which (unless I'm mistaken) is trained on a giant web scrape? I thought they didn't release the model out of concern for what people would do with a general prose generator, rather than any copyright concerns?
Generally an ML model transforms the copyrighted material to the point where it isn't recognizable, so it should be treated as its own unrelated work that isn't infringing or derivative. But then you have e.g. GPT, which reproduces some (largish) parts of the training set word for word, and that might be infringing.
Also I don't think there have been any major court cases about this, so there's no clear precedent in either direction.
There are some who say that the Google Books court case is precedent for ML model stuff; if you search back through my comment history you will find links.
I am aware; I'm asking whether the model itself is infringing. Surely you can't distribute the works in a dataset, but is training on copyrighted data legal, and can you distribute the resulting model?
All text written by a human in the US is automatically under the author's copyright. So if an engine trained on works under copyright is a derivative work, GPT-3 and friends have serious problems.
I expect that wouldn't be perfect, though. Sometimes the cut that makes it into the final product doesn't exactly match the script. Sometimes it's due to an edit; other times an actor says something similar to, but not exactly, what the script says, and the director decides to just go with it.
What might work better is using closed captions or subtitles, but I've also seen enough cases where those don't exactly match the actual speech either.
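For what it's worth, pulling timings and text out of subtitles is the mechanical part; the hard part is exactly the mismatch you describe. A rough sketch of the parsing step, assuming plain .srt files (the filename is a placeholder):

    # Sketch: parse an .srt subtitle file into (start_seconds, end_seconds, text) tuples.
    # These still need checking against the actual audio, since subtitles often
    # paraphrase or trim what was actually spoken.
    import re

    TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

    def to_seconds(ts: str) -> float:
        h, m, s, ms = map(int, TIME.match(ts.strip()).groups())
        return h * 3600 + m * 60 + s + ms / 1000

    def parse_srt(path: str):
        cues = []
        for block in open(path, encoding="utf-8").read().strip().split("\n\n"):
            lines = block.splitlines()
            if len(lines) < 3 or "-->" not in lines[1]:
                continue
            start, end = (to_seconds(t) for t in lines[1].split("-->"))
            cues.append((start, end, " ".join(lines[2:])))
        return cues

    print(parse_srt("episode.srt")[:3])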
Good speech recognition generally requires massive mountains of training data, both labelled and unlabelled.
Massive mountains of data tend to be incompatible with open-source projects. Even Mozilla collecting user statistics is pretty controversial. Imagine someone like Mozilla trying to collect hundreds of voice clips from each of tens of millions of users!
> Really complicated question, but considering the free world got wikipedia and openstreetmaps, I'd bet we'll find a way.
Both of those involve entering data about external things. Asking people to share their own data is another thing entirely—I suspect most people, me included, are much more suspicious about that.
Then you need a lot of people to listen to those 12B hours of audio, with multiple listeners agreeing, for each chunk of audio, that what is spoken corresponds to the transcript.
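To make the bookkeeping concrete, the validation step boils down to per-clip majority voting, something like this toy sketch (the two-vote threshold is made up for illustration, not Common Voice's actual rule):

    # Toy sketch: a clip is only accepted once enough independent listeners
    # agree the audio matches its transcript. Threshold is illustrative.
    from collections import defaultdict

    votes = defaultdict(lambda: {"yes": 0, "no": 0})

    def record_vote(clip_id: str, matches_transcript: bool) -> None:
        votes[clip_id]["yes" if matches_transcript else "no"] += 1

    def status(clip_id: str) -> str:
        v = votes[clip_id]
        if v["yes"] >= 2 and v["yes"] > v["no"]:
            return "validated"
        if v["no"] >= 2 and v["no"] > v["yes"]:
            return "rejected"
        return "needs more listeners"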
Yes, but then you don't need Mozilla collecting read speech samples. You can just scrape any audio out there, run speech activity detection, and there you go.
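The "run speech activity detection" part is the easy half; here's a rough sketch with the webrtcvad package (one option among many), flagging which 30 ms frames of a 16 kHz, 16-bit mono recording contain speech. The filename and aggressiveness setting are arbitrary.

    # Rough sketch: per-frame voice activity detection with webrtcvad.
    import wave
    import webrtcvad

    vad = webrtcvad.Vad(2)        # aggressiveness 0-3; 2 is a middle-of-the-road guess
    FRAME_MS = 30

    with wave.open("scraped_audio.wav", "rb") as w:
        assert w.getframerate() == 16000 and w.getnchannels() == 1 and w.getsampwidth() == 2
        frame_bytes = int(16000 * FRAME_MS / 1000) * 2   # samples per frame * 2 bytes each
        pcm = w.readframes(w.getnframes())

    speech_frames = [
        i // frame_bytes
        for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)
        if vad.is_speech(pcm[i:i + frame_bytes], 16000)
    ]
    print(f"{len(speech_frames)} speech frames out of {len(pcm) // frame_bytes}")

The harder half is getting transcripts for the speech you keep, which is where the labelling and copyright problems above come back.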
Not an expert on any of this, but wouldn't already published content (public or proprietary) such as YouTube videos, audiobooks, TV interviews, movies, TV programs, radio programs, podcasts, etc. be useful and exempt from privacy concerns?
Do user-collected clips have something so special about them that it's critical to collect them?
Another problem is that the models tend to get very, very large, from what I've seen. A gigabyte to tens of gigabytes is an undesirable requirement on your local machine.
Not sure about others, but DeepSpeech also distributes a "lite" model that's much smaller and suitable for mobile devices. Not sure how its accuracy compares to the full model though.
It's well documented and works basically out of the box. I wish the bundled STT models were closer to the quality of Kaldi, but the ease of use is unmatched.
And maybe with time it will surpass Kaldi in quality too.
There are a bunch of good libraries out there for offline speech recognition -- CMUSphinx[0] has been around a long time, and work seems to have shifted a bit toward Kaldi[1] and Vosk[2] (?). Julius is still going strong as well[3].
CMUSphinx and Julius have been around for ~10+ years at this point.
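Vosk in particular has a very small API surface; a minimal sketch from memory (the model directory name is a placeholder, and the WAV is assumed to be 16-bit mono PCM):

    # Minimal Vosk transcription sketch using its Python bindings.
    import json
    import wave
    from vosk import Model, KaldiRecognizer

    model = Model("vosk-model-small-en-us")        # path to an unpacked model directory
    wf = wave.open("test.wav", "rb")               # 16-bit mono PCM expected
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)                   # feed raw PCM chunks

    print(json.loads(rec.FinalResult())["text"])   # final transcript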
GNU is heavily skewed to developer tools and infrastructure, and gcc is no counterexample.
There are so many classes of software where this does not work. Pretty much anything for heavily regulated industries is not well served by FLOSS. There is little open-source insurance software and there are few open-source medical records systems (the few that exist are highly niche and/or limited), and EDA/CAD is not well served by FLOSS either (I've toyed with FreeCAD, but even hobbyists gravitate to Fusion). Outside of developer tooling and infrastructure, commercial, generally closed-source, closed-development software is king.
* Besides, the hard part of standing up an EMR is not installing prepackaged software.
There's more involved than just raw CPU cycles. It's not something that is easily adapted to BOINC, but offloading some work to BOINC to free up clusters better suited to training models might make sense.
Indeed, and it doesn't have to be as "machine learning" as the big ones.
A FLOSS system would only have my voice to recognise, and I would be willing to spend some time training it. That's a very different use case from a massive cloud service that has to recognise everyone's voice and accent.
pico2wave with the en-GB voice seems not too bad for TTS. I had reasonable luck with limited-domain speech recognition using pocketsphinx, but it does need some custom vocabulary.
Granted, maybe this is "not good enough", but I feel like I got pretty far with pico2wave and pocketsphinx plus 1980s Zork-level "comprehension" technology.
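By Zork-level comprehension I mean nothing fancier than matching the recognised text against a handful of verb/object patterns, roughly like this (the vocabulary is just an example of the limited-domain idea, not what I actually built):

    # Toy "Zork-level" comprehension: map a recognised utterance onto a small
    # set of (device, action) intents using plain pattern matching.
    import re

    RULES = [
        (re.compile(r"\b(turn|switch) on .*\blight", re.I),  ("light", "on")),
        (re.compile(r"\b(turn|switch) off .*\blight", re.I), ("light", "off")),
        (re.compile(r"\bwhat time\b", re.I),                 ("clock", "tell_time")),
    ]

    def understand(transcript: str):
        for pattern, intent in RULES:
            if pattern.search(transcript):
                return intent
        return None  # fall back to "I didn't catch that"

    print(understand("please turn on the kitchen light"))  # ('light', 'on')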
And the open source status of pico2wave is a bit questionable, I'll grant you that.