Initial Release of Mozilla’s Open Source Speech Recognition Model and Voice Data (blog.mozilla.org)
521 points by Vinnl on Nov 29, 2017 | 88 comments



I am very grateful for this release from Mozilla, and more generally for the broad vision of their effort.

As time passes, the quest for openness and freedom in software moves higher up the stack. Thanks to the last ~30 years of effort, we have basically reached a point where we have free OSes, basic infrastructure, build tools, and end-user applications.

In the last ~10 years the paradigm changed: autonomous desktop computing progressively gave way to mobile, with much of the functionality offloaded to "the cloud".

What I feel is needed, going forward, is working towards viable free replacements for these distributed services. DeepSpeech is a step in the right direction.

Edit: I'm just speaking about SW here. HW deserves a separate topic, and probably poses even more challenges.


Thanks for the note about HW. In my field, robotics, openness is not common (ROS being a major exception).

I hold a rather uncommon view that a path to a more free society requires abundance of open source tools. In the software world we are getting closer, but the whole hardware world is still one big unmoving binary blob.

I make open source robots and I’m hoping to help in that area. But we need open source machine tools, open source factories, and open manufacturing processes. There’s lots of work to be done in hardware land.


I would think 3D printing, portable CNC machines[0], and custom PCB fabrication[1] are making headway into opening up the HW space. I think that even before the tools, or maybe alongside them, there needs to be an abundance of cheap, clean energy to power the open source tools for the real revolution in manufacturing to take off.

[0]http://www.goliathcnc.com/

[1]https://oshpark.com/


Yes! Home manufacturing tools are definitely headed in the right direction! My interest is in going much farther though - to the point where a community or society can survive off of a totally open source chain. To me that means the machines that make the bagels at the bagel shop are open source, and so are the vehicles that deliver the wheat that goes into the bagels, the farm equipment, the solar panel manufacturing equipment, etc.

I’m happy that we’re moving to more sustainable energy but for me the revolution I’m most interested in is the one where the people control the manufacturing technology. And I hope we’re caring for the earth while we do it.

Oh and see my latest 3D printed robot in that vein: http://reboot.love/t/rover-a-robot-you-can-make-at-home/94


It’s exciting to see open source RISC-V gaining momentum in the HW world.


Agreed!


This period is only a phase while CNNs offer the best bang for the buck. As algorithms improve, the amount of data needed to train a model will drop 1000x, making the big five's data hoards worth far less. In five years, the power will shift away from petabyte-sized datasets.


My understanding is that the amount of data needed (aka sample complexity) is independent of the algorithm. It depends only on model complexity (roughly, the number of free parameters in the model). AFAIK this is a fundamental principle of ML that has been proven mathematically and is inescapable.

https://en.m.wikipedia.org/wiki/VC_dimension


> AFAIK this is a fundamental principle of ML that has been proven mathematically and which is inescapable.

Well, the human brain manages to master several complex tasks using much smaller data sets than current machine learning algorithms. Natural language acquisition, for example, seems to require fewer than 10 million spoken words per year, and even academically successful 12-year-olds might be reading 1 to 4 million words per year. These are not exactly tiny data sets, but they don't require Google's scale to recreate, either.

Sure, the human genome probably "knows" what set of models to try when building the brain, which gives it an advantage. But I can't think of any reason why machine learning couldn't ultimately try similar techniques with similarly-sized data sets, and get competitive results.


You are right.

I didn't phrase it correctly. What is algorithm independent is the theoretical performance bound for a given sample size.

Sure, not all algorithms are the same, but the best performance that can theoretically be achieved with an iid sample of a given size is algorithm independent.

So yes, a better algo can get the same performance with less data, but the limit stays the same.
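
For reference, one standard form of the VC generalization bound makes this concrete (constants vary by textbook): with probability at least 1 - δ over an iid sample of size n, every hypothesis h from a class of VC dimension d satisfies

    R(h) \le \hat{R}_n(h) + \sqrt{ \frac{ d\left(\ln\frac{2n}{d} + 1\right) + \ln\frac{4}{\delta} }{ n } }

The right-hand side depends only on d, n, and δ, not on the learning algorithm, which is the sense in which the limit is algorithm independent.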


I'm not entirely sure. Yes, you need a mountain of data to train a flexible system well. However, once that training is done, you absolutely do not need 'cloud' resources whatsoever. You could run an already-trained NN of any sort on an embedded processor running on tiny batteries. It takes extremely little processing to actually push some input through the network and get an output. The only reason we continue to send all of our data to cloud companies for speech recognition is that those companies have perverse incentives to snoop, spy, profile, and target aggressively.
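
To put a rough number on "extremely little processing": inference is just a handful of matrix multiplies. A toy sketch in numpy, with made-up layer sizes (not the actual DeepSpeech architecture):

    import numpy as np

    # Toy fully-connected acoustic model with made-up sizes; real models are
    # larger, but a forward pass is still only matrix products plus nonlinearities.
    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((2048, 494)), rng.standard_normal(2048)
    W2, b2 = rng.standard_normal((2048, 2048)), rng.standard_normal(2048)
    W3, b3 = rng.standard_normal((29, 2048)), rng.standard_normal(29)

    def forward(x):
        h = np.maximum(W1 @ x + b1, 0)      # ReLU layer
        h = np.maximum(W2 @ h + b2, 0)
        return W3 @ h + b3                  # per-frame character scores

    x = rng.standard_normal(494)            # one frame of audio features
    scores = forward(x)                     # ~5M multiply-adds: trivial for a phone CPU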

Technically speaking, all of these 'voice assistants' and the like would be far better products if they hosted a local pre-trained network that did all of the recognition. Latency is the biggest challenge for these systems, and it will never be solved so long as the recognition happens on a server miles away from where the user is speaking. The speed of light, at a minimum, comes into play.

What would be very interesting is the development of a continuously learning system that actually performs better by training itself to recognize one or a few users' voices, without the burden of carrying the weights needed to also recognize people with accents from the other side of the planet who will never be within earshot of it. That would be an even more overt and active disincentive to sending all your data away "for speech recognition" (the spying is a fringe benefit!) and sitting around waiting for the cloud to get back to you... and I imagine we'd see it ignored.

The limitations today aren't technical; they're organizational and business-oriented. And those things don't generally change in concert with technical changes.


This is very timely, as today I was thinking of caving in and getting an Echo Dot so I can control my smart home devices by voice.

I would love an open-hardware microphone array that I could use with a Pi or something similar, to write my own Alexa. Not only would I love this, I would store all my commands and send them to Mozilla to help with their speech recognition models.

I don't want to be the guy who wishes someone else would do all the work and then give it away for free, so I'll do what I can to help (which is probably limited to writing a bunch of code and documentation on how people can set this up more easily). Congratulations and thanks to Mozilla for this.


Not affiliated in any way but have you seen the MATRIX Creator or MATRIX Voice? Both contain microphone arrays that can interface with a Raspberry Pi.

I’d love to see one of these meshed together with this new Mozilla voice project for an open source Echo or Google Home. The only missing piece at this point, it seems, is NLP and all of the glue that converts commands to API calls.


Oh, I hadn't; this is fantastic (and it integrates with an ESP32), thanks! I wonder if it includes software to do some live DSP to reduce noise... The fact that it can just connect to the Raspberry Pi's GPIOs and provide great sound is ideal, though. I'm very glad someone has made this; I wish I had known about it before so I could have backed it.


There are already some open source personal assistants, like Mycroft and Jasper. Integrating Mozilla's DeepSpeech into one of them would be fantastic.


This is super cool, but I'd be cautious about the usefulness of this data set.

Both this data set and LibriSpeech are read speech, where the speaker was prompted with a transcription and asked to say it out loud. In practice it's very rare that you're trying to transcribe speech that's already been transcribed. Speech patterns for computer-directed speech (e.g. for voice activated user interfaces) or human-to-human speech (e.g. for meeting transcription) are quite different.


Yup, this is an excellent point. We have explored, and will continue to explore, ways to allow Common Voice users to speak more organically (for instance by answering a question, or responding free-form to some other sort of prompt). The problem with this approach is that it requires an extra step, transcription, which at the scale we are trying to achieve is pretty costly in either money or time (i.e. tedium for our users). Eventually we hope that speech engines can take care of the transcription part, but for now we need people.

That said, we will definitely be exploring ways to build organic speech, and perhaps transcription, into the Common Voice app. This will solve another problem for us too, which is getting public domain material for people to read. Doing this obviously requires a much more complex user experience, and we have more work to do to figure out how to make something that people will want to use and contribute to. Stay tuned for that :)

On the flip side, we hope that these datasets, models, and tools (i.e. DeepSpeech) can get more people (researchers, start-ups, hobbyists) over the hump of building an MVP of something useful in voice. Once you have people using your products, collecting useful in-context voice data becomes much easier.

On that note, another approach we are working on is partnering with universities and socially aware startups like MyCroft, SNIPS, and Mythic. Imagine if voice products on the market allowed their users to opt in to contributing their utterances to an open resource similar to Common Voice. Of course, sharing your voice publicly is not for everyone, or for every product scenario. But it does work for some. And if we pool our resources, our hope is indeed to commoditize speech-to-text so that we can focus on more interesting challenges, like building voice experiences people want to use. (For instance, could voice somehow be a "progressive enhancement" to the web?)


> could voice somehow be a "progressive enhancement" to the web?

I have created my own TamperMonkey plugin that adds TTS to web pages. It finds text, makes it clickable, and when a user clicks a word, it starts reading from there, highlighting text as it reads and skipping menus and chrome. I find this helps me focus better on reading. Unfortunately I can only stand one single voice, and it's been stagnating for years (Alex from Mac OS). Can't wait to hear the WaveNet voice Google has been threatening to give us.


Is it available for the rest of us perchance?


A speech recognition researcher I knew spent some time at Eastern Washington University because they had a lot of transcribed Washington state proceedings, which were open access enough to go into his company’s speech corpus, I guess (I only found out because I mentioned my mom graduated from there). Anyway, these people turn over a lot of rocks to build their huge corpuses (erm, corpora?).


Whether that is “open access” enough for commercial use is an interesting question. I thought that the SCOTUS recordings, for example, cannot be used for commercial applications, but that might be a restriction imposed by the organization that processes and publishes the data, not by the proceedings themselves.


Have you considered getting volunteers to transcribe permissively licensed video or podcasts?


One advantage of having the size and prestige of Mozilla is presumably that organisations are willing to license their content to Mozilla for free for this purpose?


I was thinking earlier that YouTube CC-licensed audio with manually entered subtitles might be a good source.

Though most videos of decent length only contain, say, three or four speakers, which is definitely suboptimal.

https://www.youtube.com/results?sp=EgYYAigBMAE%253D&search_q...
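
If anyone wants to experiment, something along these lines could pull the audio and the manually entered subtitles for a given video, assuming youtube-dl is installed (the CC-only filtering still has to come from a search like the one above):

    import subprocess

    # Hypothetical helper: grab the audio track plus the manually entered English
    # subtitles for one video URL. License filtering is not handled here; the
    # CC + subtitles search query linked above would be the source of the URLs.
    def fetch(url, out_dir="corpus"):
        subprocess.run([
            "youtube-dl",
            "--extract-audio", "--audio-format", "wav",
            "--write-sub", "--sub-lang", "en", "--sub-format", "vtt",
            "-o", f"{out_dir}/%(id)s.%(ext)s",
            url,
        ], check=True)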


The last time I checked, YouTube's terms of service prohibit you from making use of the rights granted by the Creative Commons licenses on the content.


How so?


Just had an idea... what about call center providers? They already collect speech data for training purposes and transcription could most likely help there!


They do, but privacy is a major concern.

The bigger problem is, of course, that you need speech data with (fairly) accurate transcripts for training ASR systems. These typically don't exist for call center calls.


Transcription is not really the issue here; the cost of freelance transcribers is relatively low. It is privacy that makes it so hard: most call center calls involve some kind of user authentication, which means they would need to be anonymized before being transcribed and used as training material.


Awesome, we're so close to having a speech-to-text system I can trust.

I really wish Mozilla would release a keyboard app for Android. It would instantly be the single most trusted keyboard available.


Regarding Android keyboards, it is horrific that the Google keyboard sends all your key presses, except passwords, to them.


Does it? That would certainly be pretty alarming, but I can't seem to find any evidence that it does. The only relevant article I could find was this one on Gboard on iOS: https://www.macworld.com/article/3070767/ios/googles-gboard-... It seems to suggest that at least their iOS keyboard doesn't send any data while typing. It could be different on Android, but I didn't find any articles suggesting that they are doing this. Wouldn't put it past them to quietly change the policy, though.


... it does what? Even if you disable the "Share Snippets" option?


I hope not, but I only noticed the opt-out feature a few weeks ago. It's sickening that Google thinks this is acceptable. Opt-in or opt-out, it never should have been considered a viable "feature" to include.


There is no official statement from Google about this, but there is a generic Android warning saying "This method can collect all of the text that you enter except passwords including personal data and credit card numbers.". I wonder why Google doesn't explicitly say what they are doing. I don't want to spread FUD, but this is a critical component.


That's because any keyboard could, theoretically, be a keylogger.


Everything by Google sends everything to Google.


but most useful keyboards require permissions that are scary...


Have you checked out the Multiling O Keyboard app? (I used it more than a year ago; now I use a phone powered by Sailfish OS.)


No, but I have used Hacker's Keyboard and it doesn't require any permissions. It doesn't do Swype, which I like to use once in a while, but besides that it's good.


For those who don't know, open source speech recognition that doesn't depend on deep learning already exists:

http://cmusphinx.sourceforge.net/

http://julius.osdn.jp/en_index.php

Maybe, with this data set released, all that additional data will eventually be used to improve those tools as well.



The problem with Kaldi is that it's virtually impossible to get a dictation model working unless you have a doctorate in speech recognition. There is no "I know basic programming, but little about speech recognition" documentation for Kaldi.


Between the learning curve and dependency hell, I've never managed to get good results with Kaldi, Simon, or Sphinx. It's unfortunate; hopefully we'll get an easy-to-use option soon.


When was the last time you tried Sphinx? The library has changed a LOT. Their guides, new website, and other resources basically walk you from zero knowledge to a working demo.


Oh, I will have to try again :) Thanks for the tip; I always thought Sphinx should be ideal, it was just too much work to get it working.


There's the "Kaldi for Dummies" tutorial [1], which helped me to the point of creating a speech recognition program that could distinguish digits in recordings of my voice. I guess that's the documentation you're looking for.

My personal problem with Kaldi is that I don't have enough RAM in my cheap laptop to work with any of the big models. When it started swapping just doing the preprocessing for one of the pretrained models [2], I kind of abandoned that project until I get around to buying new hardware.

For that reason, I can't tell how good the pretrained models really are.

[1] http://kaldi-asr.org/doc/kaldi_for_dummies.html

[2] http://kaldi-asr.org/models.html


If you want to help them out you can visit https://voice.mozilla.org/ and record some sentences.


They have an iOS app too: https://itunes.apple.com/us/app/project-common-voice-by-mozi...

It’s a great idea to crowdsource this. I wonder if this project can turn voice recognition into a solved problem.

I just set a daily reminder so I can do 10 minutes a day.


Thank you so much!

I also want to emphasize the importance of listening (validating) as well as recording. Validation is a big part of the puzzle for building data that is viable for machine learning.


One thing that wasn't entirely clear to me is how strict you have to be when validating. For example, I encountered one recording that was completely silent - I figured that had to be marked as invalid. However, another one was barely audible, but by listening intently I could recognise that it pronounced the right words - is that OK?

And should we validate whether recordings match the stated accents as well? E.g. if I hear a clear Dutch accent, I presume you wouldn't want that labelled "native British speaker"?


Can I suggest encouraging users to get recordings from their children as well, as most speech recognition libraries are pretty poor with children's voices? (IMO Alexa Voice Service is by far the best with children's voices.)


Is that okay legally? Maybe parental permission is enough


Do they have languages other than English?


It says "more languages coming soon" right on the page.


Really happy to see someone major looking at open source speech recognition. Due to the lack of a self-hosted or on-device solution, my assistant/automation software is basically designed for speech but not currently doing recognition, because I haven't found a non-cloud option that does what I need it to do.


Wowow. Time to build an open microphone array for improved speech pickup that can be connected to an RPi for voice control that respects privacy.



That is a very nice board! The inclusion of an FPGA and ESP32 makes it very capable, and USD 65 is a good price for such a package. And I found beamforming code for the microphone array (running on the host computer) at https://github.com/matrix-io/matrix-creator-hal/blob/master/...
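
For anyone curious, the core of that beamforming is surprisingly small. A bare-bones delay-and-sum sketch (assuming a linear array with known mic positions and integer-sample delays; nothing like the full MATRIX HAL code):

    import numpy as np

    def delay_and_sum(channels, mic_pos_m, angle_rad, fs, c=343.0):
        """channels: (n_mics, n_samples) array; mic_pos_m: mic positions along the
        array axis in metres; angle_rad: steering angle measured from broadside."""
        n_mics, n_samples = channels.shape
        delays = np.asarray(mic_pos_m) * np.sin(angle_rad) / c   # per-mic arrival delay (s)
        shifts = np.round(delays * fs).astype(int)
        shifts -= shifts.min()                                   # make all shifts non-negative
        out = np.zeros(n_samples)
        for ch, s in zip(channels, shifts):
            out[:n_samples - s] += ch[s:]                        # align each channel to the look direction
        return out / n_mics                                      # sound from angle_rad adds coherently, noise less so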


How does Mozilla's 6.5% error rate on LibriSpeech’s test-clean dataset compare to Google's, Apple's, Amazon's, and others' voice recognition? I couldn't easily find any comparison chart.


In the Baidu Deep Speech 2 paper, the Baidu implementation is able to get 5.33%, and a human 5.83%. https://arxiv.org/pdf/1512.02595v1.pdf
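
For anyone wondering what these percentages measure: word error rate is the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words (so 6.5% WER is roughly 93.5% word accuracy). A quick sketch:

    # Word error rate = (substitutions + deletions + insertions) / reference length,
    # computed with a standard Levenshtein dynamic program over words.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167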


Does this relate or help at all with speaker identification? The Microsoft speech API also provides speaker recognition which is useful for many applications: https://azure.microsoft.com/en-us/services/cognitive-service...


No, this is for speech recognition only.


> the world’s second largest publicly available voice dataset, which was contributed to by nearly 20,000 people globally

Well done, people!


This is tangential, but I wonder if something like this could be (mis)used to break captcha - by feeding in the disabled-friendly audio captcha and passing the results back to the captcha server.

As voice recognition becomes more sophisticated, I think captchas are going to have to evolve to keep up as well.


Captchas are utterly beaten, and what's more, it's not about the technology or difficulty: they are a lost cause. For pretty much any problem you might present in a captcha, machine learning performs better than humans.

So today, failing a captcha is actually an indication that the other end is human.


Pretty sure that's been done with existing voice to text systems. So, yes :)


How does this compare with Snips.co, which can do offline speech recognition on a Raspberry Pi 3?

Could this be used to train a model/engine that can be used that way?


The model we released today is not yet optimized for smaller devices like that, but our plan is to make it usable on targets like the RPi3.


Are you releasing any prebuilt models, so people can go and play with your work without training? I searched but couldn't find any.

Edit: NM found it under releases: https://github.com/mozilla/DeepSpeech/releases.
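
For anyone else landing here, usage with the prebuilt model looks roughly like this, going by the project README at the time (the constructor arguments - feature count, context window, alphabet path, beam width - are version-specific, so check the README of the release you download):

    import scipy.io.wavfile as wav
    from deepspeech.model import Model   # import path per the early bindings; later versions differ

    # File names as shipped in the release archive; adjust paths to where you unpacked it.
    ds = Model('models/output_graph.pb', 26, 9, 'models/alphabet.txt', 500)

    fs, audio = wav.read('my_recording.wav')   # expects 16 kHz, 16-bit mono audio
    print(ds.stt(audio, fs))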


Thank you for finding and linking the prebuilt model. It was eluding me.


OMG, C++??? I thought Mozilla folks would give Rust a spin.


It is based on TensorFlow, which has a C++ API. Though it looks like they provide Rust bindings for the DeepSpeech library itself.


I can't find exact numbers for Snips.ai, but generally there's a linear relation between the size of the inference model in RAM and the accuracy it can obtain.

I'd have to assume DeepSpeech outperforms anything running on a RasPi3, at least for LVCSR (large-vocabulary continuous speech recognition). It hits 93.5% accuracy on LibriSpeech, which I've never seen from any offline recognition model.


Kaldi has 4.14% WER (95.86% accuracy) on the same test dataset (test-clean) [1] using a model that runs faster than real time on CPU. You would have to make the model smaller to run it in real time on a RasPi3, but according to this [2], you can get decent WERs for read speech even then.

[1] https://github.com/kaldi-asr/kaldi/blob/master/egs/librispee...

[2] https://groups.google.com/d/msg/kaldi-help/Pr6jPH1Qshg/kn8df...


They are listed as a contributor to the Common Voice dataset: https://medium.com/mozilla-open-innovation/sharing-our-commo...

I think their product is higher-level - it has things like command and sentence-structure recognition, and hotword detection.


The URL is https://snips.ai/ .


Thanks; I was going by memory, which is apparently faulty.


I don't think Snips is open source, for one.


I don't see why Mozilla would do this kind of thing except to spread its resources thin. I know it's just an anecdote, but I don't know anyone who uses any kind of speech-to-text, in part because they all suck if you don't speak English, and even then...


Personal assistant devices are selling like hotcakes, and there are tons of options for creating smart lighting and such. Maybe I hang around tinkerers too much, but I think it would be nice to have a method of speech control that doesn't rely on someone's remote speech recognition API.


As I stated above, it might be anecdata or a cultural bias, as I live in Montreal.

While I do believe that's interesting, I don't understand why it would be Mozilla's job.


Why not? They build free software and promote open standards; speech recognition is a prominent area that doesn't have any really good open solutions right now.


Really? I know a ton of people who talk to their phones (Google Assistant, Siri, navigation apps, ...) or have something like Alexa in their homes, and I don't live in an English-speaking country.


I live in Montreal and don't know anyone who has bought one or speaks to their phone. Might be cultural bias or just anecdata, as stated above.


Does this foreshadow a day when Firefox starts spying on what I'm talking about?


Wouldn't this be the opposite? By bundling the speech model so that it's on the client device instead of their server, your speech need not leave your device.


If they wanted to do that, there are easier ways to do it than to release a speech recognition system and the speech data that makes it work.



