Facebook open-sources a speech-recognition system and a machine learning library (fb.com)
498 points by runesoerensen on Dec 21, 2018 | 143 comments



Very nice. In my work, as much as I love working with RNNs, convolutional models are faster to train and use. Years ago, FB blew away text modeling speed with fasttext, and it is good to see these projects made publicly available as well.

As much as I sometimes criticize FB and Google over privacy issues, they also do a lot of good by releasing open source systems. Most of my work involves using TensorFlow and Keras, and a little over two years ago I replaced a convolutional text classification model with fasttext, with good results.


I haven’t done any significant projects using neural networks for sequence modeling or analysis, but when I do, I plan to start with a temporal convolutional network, based on [0]. They argue that RNNs being standard is likely an artefact of the history of the field, while they get superior performance from TCNs.

[0] https://github.com/locuslab/TCN
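
For anyone who hasn't seen one, here's a minimal sketch of the kind of causal, dilated convolution block a TCN stacks (my own simplification in PyTorch; see the repo above for the reference implementation):

    import torch
    import torch.nn as nn

    class CausalConvBlock(nn.Module):
        """TCN-style block: a dilated 1-D convolution that never sees future
        timesteps, plus a residual connection."""
        def __init__(self, channels, kernel_size=3, dilation=1):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=self.pad, dilation=dilation)
            self.relu = nn.ReLU()

        def forward(self, x):                        # x: (batch, channels, time)
            out = self.conv(x)[:, :, :-self.pad]     # drop right side -> causal
            return self.relu(out) + x                # residual connection

    # Exponentially growing dilations give a large receptive field cheaply.
    tcn = nn.Sequential(*[CausalConvBlock(64, dilation=2 ** i) for i in range(4)])
    y = tcn(torch.randn(8, 64, 100))                 # -> (8, 64, 100)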


The truth is that the best architecture depends on the problem, and you get a lot out of "graduate student gradient descent", i.e., hyperparameter search and fiddling with fine-grained architecture details. You'll want to experiment with RNN, TCN, and Transformer (t2t) models to find the best one for your problem.

Disclaimer: I wrote the TF RNN API.


Totally agree - it's hard to say which sequence architecture works best given your problem (are there long-term dependencies, do they exist at multiple levels of scale, etc.) and your dataset size.

Convolutions are cheap to compute and can be more efficient on smaller data, but it's also possible that a CNN outperforms when you have a small dataset and an RNN wins once you have a larger one.


For graduate students out there who would rather be doing research than "graduate student gradient descent" (i.e., high-dimensional, non-convex optimization in your head), SigOpt (YC W15) is a SaaS optimization platform that is completely free for academic research [0]. Hundreds of researchers around the world use it for their projects.

Disclaimer: I co-founded SigOpt and wasted way too much of my PhD on "graduate student gradient descent"

[0]: https://sigopt.com/solution/for-academia


I do a lot of sequence work (text).

While CNNs are tempting and fast to train, I've never been able to get the accuracy I can from RNNs. In NLP, accuracy is important because for lots of tasks NLP is right at that inflection point of being good enough to be useful... if it's good enough.

It's worth noting that this TCN paper gets a perplexity of 45.19 on WikiText-103. That was competitive in 2015.

The current state of the art is 29.2 [1] - not the 48.4 they claim (it's unclear where that number came from).

Still, CNNs are nice in an ensemble model if that's your thing. They do tend to pick up different things from RNNs, which can be useful.

Edit: I now understand why their reported metrics are so far off. They compare against generic baseline models. They do list SOTA performance in their supplementary material (though they still get those numbers wrong).

[1] http://nlpprogress.com/english/language_modeling.html


Thanks for digging into how their claims are misleading. As someone in an adjacent subfield, I was wondering if I was missing something. It does at least sound like an interesting potential component.


From my cursory understanding of TCNs, they always output a sequence of the same length as the input. If that is true, then their usage is severely limited compared to recurrent networks.

Please point me to more substantive sources if I am wrong. I am very interested in making TCNs work, as they are much faster.


This is the dangling carrot. Be wary of the stick in the rear.


> As much as I sometimes criticize FB and Google over privacy issues, they also do a lot of good by realeasing open source systems.

You realize that they do this so top researchers still want to work for them?


And is that a bad thing? The work still gets released and can be used by anyone.


In a world with less perverse incentives, these people, whose education was largely funded by taxpayers, would be working on actually valuable projects rather than optimizing how to get people to click more ads for a private corporation.


I agree with the sentiment that ads are such a savage application, but...

* I'm confused by why you bring up tax-payer funded education... do you think these employees don't pay taxes?

* What is your definition of "valuable"?


Ads are a core part of how today's world economy works, like it or not.


What do you think would go awry, if we suddenly abandoned machine learning for advertising?


A lot of businesses would go bankrupt. And I'm not talking about adtech, but advertisers. There are tons of them, from small mom-and-pop shops to big companies that depend on ads to get customers. And no, they aren't scams.

The HN crowd doesn't accept that, but most people here have no idea how business works.


Has the cost of customer acquisition objectively dropped over the last twenty years? If so, where can I read about that? If not, why do you believe that?


Nobody says it's a bad thing (licensing traps aside, if any, which I didn't check). It should just be noted that when big corps/celebrities get hit by bad publicity, they have to react with some good move as a lever to counteract the negative press. I'm not saying they wouldn't have done this otherwise, but it's a very common move among politicians, corporations, celebrities, etc. Their PR office is just doing its work.


Nope. And I’m not joking here, this is:

1. Give free tools to people
2. Drive adoption of the tools
3. ??? (Data-driven pivot)
4. Profit


Does the reasoning behind it matter?

The public benefits from it, and they get to hire researchers that want to continue that work. It sounds like a win-win?


It's a win-win, but with nasty side-effects.

It's like our economy and its effect on the climate. Win-win for consumers and companies, but (without intervention) a downward spiral for our planet.

Also, it puts us in a morally difficult situation because we are benefiting from the ones we criticize, and as such, it is hypocritical.

Of course everyone can do as they please, but in my view it is best to look for moral-issue-free software instead of using BigCorp's candy-ware.


> Also, it puts us in a morally difficult situation because we are benefiting from the ones we criticize, and as such, it is hypocritical.

What moral difficulties do you see? My opinion is that these companies are despicable, but taking advantage of their generosity is not hypocritical. Applications and motives can be immoral; tools without human action simply exist.


You are ignoring that their generosity exists for a reason.

Tools exist because of human action.


Also it’s now old tech for them - pushing it to the public domain allows the public to maintain it and improve it for free; not to mention having more people use their system is like free training before they hire the person.

However, pushing it out surely does enhance public knowledge - in the same way that Carmack released his 3D engines, which were outdated in industry by one generation but still helped the public.

Lastly - they need to push it out as they need to attract top talent, and need to demonstrate they have top tech there (and are willing to let their researchers claim credit for it once it becomes old enough).


People are quick to criticize Facebook here on HN, but this release is awesome. I believe open source speech recognition is still lacking, and any contribution is very welcome. CMU Sphinx and Kaldi are great, but it feels like the most recent advances in the field are still hidden behind paid services.


People have every right to criticize Facebook, and open-sourcing some software won't make the bad stuff go away, just like criminal charges aren't dropped just because you donated something to charity.


You would be amazed at how few comments an article about FB doing "this tech thing" attracts, versus a generic "FB is bad" one. There are terribly few people who can comment on a tech subject, but everyone and their dog has an opinion about how FB destroys humanity.


So what you're saying is that there are more people who can associate with social ethics vs. specialised areas of programming? And this is a revelation?

Perhaps it might help you if you look at it from the perspective that what Facebook has open-sourced here isn't affecting a billion people's privacy and it's not being willfully used as a tool of intimidation and propaganda by governments.

That's just a couple thoughts on why other posts might attract more comments.


People confuse Facebook the org with Facebook the workers. Don't confuse the great work its workers do with what management decides.


That logic doesn’t generalize too well.


What worries me, very much in keeping with existing criticisms, is what Facebook wants to do with this tech. Voice is not a core business for them. Yet.

They are demonstrably very good at using technology and their omnipresence against their users (to monetize them without their knowledge).


So how good is that speech recognition system at, let's say, listening in on a phone conversation and using that information in, let's say, the Facebook news feed? You know, the precise thing many, me included, suspect Facebook is actively doing. And if you say they don't do anything like that, why do they need a speech recognition system in the first place, given that Facebook is a text-based system?


They need speech recognition for their Alexa/Google Home like hardware product: https://portal.facebook.com

(Notice how much they emphasize privacy and security in their marketing...)


Oh my... Some people are actually going to let Facebook listen in on everything. Wow. Not in a million years in my house.


For the record, I do think it's telling that they currently use Amazon's Alexa Voice Service for this (or at least most of it, after the hotword recognition) instead of building on Project M.


Alexa is optional, and Facebook has its own voice recognition for Portal features. You can use the wake word "Hey Portal" for a limited set of Portal commands, mostly around calling and messaging, or "Alexa" for all the Alexa capabilities. When using the Alexa functionality Facebook isn't supposed to "listen" at all. (Though I'm not sure whether those queries still flow through their servers or not.)

https://portal.facebook.com/help/2149102838698668/


Seems that they have an automatic captioning feature for videos/audio on the site


Ok. I guess that is a valid reason.


Where are the pre-trained models? It's worthless without them. Edit: NM, someone hunted down the AWS links

    wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-highdropout.bin

    wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-highdropout-cpu.bin

    wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-lowdropout.bin


Link to the GitHub repo, for the lazy: https://github.com/facebookresearch/wav2letter/


As someone behind a corporate firewall on a Friday, thank you


Does anyone know how this compares with Mozilla's DeepSpeech? https://github.com/mozilla/DeepSpeech


From https://arxiv.org/abs/1812.06864 "On Librispeech, we report state-of-the-art performance among end-to-end models, including Deep Speech 2 trained with 12 times more acoustic data and significantly more linguistic data."

Specifically they claim word error rates that are 1 to 2 percentage points lower, 3.44% on "clean" and 11.24% on "other".
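
(For context: word error rate is word-level edit distance, i.e. substitutions plus deletions plus insertions, divided by the number of reference words. A quick Python sketch of the metric, for anyone who wants to sanity-check reported numbers:)

    def wer(reference, hypothesis):
        """Word error rate: word-level edit distance / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("this is a test", "this is test"))   # 0.25, i.e. 25% WER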


It'll be interesting to see if anyone can reproduce their results; thus far it's been troublesome: https://github.com/facebookresearch/wav2letter/issues/88


Probably favorably. I brought up DeepSpeech in a Docker container the other day and it couldn't understand me at all, using Mozilla's pre-trained model. _Some_ words were right, but the output did not make sense.


Did you use one of the examples (e.g. the Python GUI or CLI client)? WebRTCVad (which is used in most of those) made mincemeat of my audio files, whereas feeding them directly into DeepSpeech got me usable results. You just need VAD for files nearing or over 1 minute in length (unless you have 4+ GB of RAM to spare).
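
For anyone hitting the same issue, a rough sketch of that split-then-transcribe pipeline (hedged: the DeepSpeech Python API has changed across releases, this assumes the Model(path)/stt(int16 array) style interface, and the file names are placeholders):

    import wave
    import numpy as np
    import webrtcvad
    import deepspeech

    SAMPLE_RATE = 16000      # DeepSpeech models expect 16 kHz mono 16-bit PCM
    FRAME_MS = 30            # webrtcvad accepts 10, 20 or 30 ms frames

    def speech_chunks(pcm_bytes, aggressiveness=2):
        """Yield contiguous runs of frames the VAD marks as speech."""
        vad = webrtcvad.Vad(aggressiveness)
        frame_len = int(SAMPLE_RATE * FRAME_MS / 1000) * 2   # bytes per frame
        voiced = bytearray()
        for i in range(0, len(pcm_bytes) - frame_len + 1, frame_len):
            frame = bytes(pcm_bytes[i:i + frame_len])
            if vad.is_speech(frame, SAMPLE_RATE):
                voiced.extend(frame)
            elif voiced:
                yield bytes(voiced)
                voiced = bytearray()
        if voiced:
            yield bytes(voiced)

    model = deepspeech.Model("deepspeech-model.pbmm")    # placeholder paths
    with wave.open("long_recording.wav", "rb") as wav:
        audio = wav.readframes(wav.getnframes())

    for chunk in speech_chunks(audio):
        print(model.stt(np.frombuffer(chunk, dtype=np.int16)))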


I wonder whether all these open-source releases happening this month [1][2][3] are related to improving team morale internally...

[1] https://github.com/facebookresearch/DeepFocus

[2] https://github.com/facebookresearch/nevergrad

[3] https://github.com/facebookresearch/pytext


Probably just an end-of-year push to get things out.


Or they're in a production code freeze, so engineers can spend more time pushing these final projects to finally get them released.


It's Perf Review time, gotta get that Impact without affecting production


Did they release a model to go with this (so that the average dev can actually use this in their app) or is this just a tool for researchers?


https://github.com/facebookresearch/wav2letter/issues/88 mentions a "pre-trained model named librispeech-glu-highdropout.bin", but I couldn't find it anywhere.

https://github.com/facebookresearch/wav2letter/issues/93 also mentions a pre-trained model but without any reference which one or where to find it.

Googling "librispeech-glu-highdropout.bin" still shows the text "luajit ~/wav2letter/test.lua ~/librispeech-glu-highdropout.bin -progress -show -test dev-clean -save -datadir ~/librispeech-proc/ -dictdir ~/librispeech-proc/ -gfsai ..." for https://github.com/facebookresearch/wav2letter/blob/master/R..., but clicking it, it's gone.

But the Google Cache still has the result, including 3 pre-trained models:

    wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-highdropout.bin
    wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-highdropout-cpu.bin
    wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-lowdropout.bin
The cache also includes a much more detailed README on how to use the software.


Thank you! I found that some of the forks also have the more detailed README files for example: https://github.com/19ai/wav2letter


Thanks.

It would be great if anybody could build it all and see whether the out-of-the-box experience with the pretrained model is good.

I've tried Mozilla's DeepSpeech a few times, but so far it hasn't recognised "this is a test" reliably and without mistakes out of the box from a good microphone.


I'm not sure if those models would work with the version in the current master (wav2letter++). The old version of the master branch (wav2letter, written in lua/torch: https://github.com/facebookresearch/wav2letter/tree/wav2lett...) contains links to the models you listed, so I guess they belong to that version.


I was wondering the same. Also, it's weird that the issues on the repository are all older than the first commit.


They probably squashed the history before release because they only got legal's approval on the current state of the repo and not all prior history (which may or may not have had hacks referencing internal systems at some time).


Always excited to see more speech recognition releases. Way too many solutions are "just point your microphone's feed at our cloud service", and a lot of those that aren't have somewhat lagged behind.


Interesting, why does FB need a speech recognition system?


Because of their many products that support voice features? Messenger, whatsapp, etc...


Why would you need voice recognition for them? What is the use-case?


Speech-to-text i.e. dictation? One of the most common use-cases.


That's handled by the system on both Android and iOS. I remain unconvinced.


The built-in offline Android speech recognizer is really bad. Giants like Google and Facebook are blessed with data, so they can train state-of-the-art speech recognition models (much, much better than what you get out of the built-in Android recognizer) and then provide speech recognition as a service. They control the recognition because it happens on their servers and is independent of Android or any other OS.

And so FB for instance can send some voice data to their servers and get a text output. And then FB can use text sentiment analysis to get further context about the message.

Sadly, most people don't have the speech data to train their own recognizers for large-vocabulary systems, and that's even harder for languages other than English. With the exception of Google/Amazon/FB/Microsoft/Baidu/etc., everyone else has to use the APIs offered by those companies to do high-fidelity recognition. Which sucks, because there is a cost to each recognition. You have to pay someone else to do it.

Whereas FB/Amazon/MS/Baidu/etc. can do high-fidelity, large-vocabulary recognition in-house and offer it as a service. THIS is why FB wants to build speech recognition systems.


Labeled data is indeed a problem. The only sizable corpus I know of is TIMIT; it costs $300 and I think it has prohibitions on commercial use. That said, phonetic labeling is becoming less important thanks to designs like this...

I wonder if you could bootstrap a sizable speech dataset by trawling audio off YouTube and then using one of the really good cloud speech recognition services to label it. :)
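
A very rough sketch of what that bootstrap could look like (the youtube-dl flags are its standard audio-extraction options; cloud_transcribe() is a placeholder for whatever commercial STT API you pick, so treat this as pseudocode rather than a working pipeline):

    import subprocess
    from pathlib import Path

    def fetch_audio(url, out_dir="raw_audio"):
        """Download a video's audio track as WAV using youtube-dl."""
        Path(out_dir).mkdir(exist_ok=True)
        subprocess.run(["youtube-dl", "-x", "--audio-format", "wav",
                        "-o", f"{out_dir}/%(id)s.%(ext)s", url], check=True)

    def cloud_transcribe(wav_path):
        """Placeholder: call your cloud STT service of choice here and
        return (transcript, confidence)."""
        raise NotImplementedError

    def build_corpus(urls, min_confidence=0.9):
        for url in urls:
            fetch_audio(url)
        corpus = []
        for wav in sorted(Path("raw_audio").glob("*.wav")):
            text, confidence = cloud_transcribe(wav)
            if confidence >= min_confidence:       # keep only confident labels
                corpus.append((str(wav), text))
        return corpus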


IMHO, the TIMIT corpus should no longer be used in most application-driven speech recognition research, as it's small and completely unrealistic for any real-world application. Furthermore, nobody cares about phone error rates, as recognizing phones is not the ultimate goal.

There have been much better, larger datasets available for a long time, for example the Fisher English conversational telephone speech corpus was released in 2004 and contains ~1950h of transcribed speech. There are tons of other datasets in various languages and for various applications (conversational speech, broadcast transcription, etc.).


Isn't there some value in being able to benchmark acoustic models in isolation, no matter how weak they may be, without downstream language models?


The labeled data is $300? That's basically free, even for somebody who's just a serious hobbyist, much less any funded public or private research group.

Edit: It's even less [1]:

    1993 Member:     $0.00
    Non-Member:      $250.00
    Reduced-License: $125.00
[1]: https://catalog.ldc.upenn.edu/LDC93S1


> The built-in offline Android speech recognizer is really bad. Giants like Google and Facebook are blessed with data, so they can train state-of-the-art speech recognition models (much, much better than what you get out of the built-in Android recognizer) and then provide speech recognition as a service. They control the recognition because it happens on their servers and is independent of Android or any other OS.

Is the implication that offline Android recognition does not train on the owner's voice at all? I imagine a lot of phones these days are at least as powerful as the Pentium 200s used to train (successfully!) Dragon Dictate et al 20+ years ago.


Well, let's forget the offline Android recognizer. That's the one that's built in and doesn't go to Google via the internet to get a more accurate transcription. It's fairly good for what it is, but it doesn't come close to the accuracy you get when you go to the Google recognition servers via their APIs. That's because the models they offer via the recognition services are much larger, more robust and better than what you get straight out of Android. These services offered by companies such as Google do not adapt the acoustic model to individual speakers and are therefore known as speaker-independent.

Secondly, when I say "train", it is in a totally different context than how you seem to be using the term. You are using it in the context of adapting an acoustic model to an individual speaker to improve performance. I am talking about building the initial model. Typical RNN or even convolution-based algorithms require a lot of time and processing power to train. What's even harder to get than the processing power, though, is of course data to train on.


I think you're making a distinction without a difference, since there is (or has been for a long time) an initial model supplied by personal recognition devices/software/etc., too. And sure, if it's not trainable by the user it's going to be using a generalized model. There are tradeoffs there, and the point of my comment was that for a personal speech recognizer it makes sense that it be trainable by the user, especially when the hardware is powerful enough.


This is false. The distinction I was making is very real. The "initial models" you are talking about are small and weak, nowhere near as robust or powerful as the models trained and used by Google/Microsoft/etc. on their own servers. State-of-the-art neural-network-based recognizers need serious hardware for training, orders of magnitude more than what is available in smartphones or personal commodity hardware (unless it's massively clustered and distributed).

Secondly, the trained model itself is very big just to store, and inference against the model is also resource intensive. This is why Android/Google Maps/Search/etc. go out to Google's backend recognition servers for speech-to-text before falling back on the shitty (but relatively good) offline on-device model (which may not even be using state-of-the-art speech recognition techniques and may be using old-school GMM-based recognizers).

Finally, the large models trained on the backend servers using their distributed computing infrastructure are so much more accurate than the shitty fallback model that speaker-dependent adaptations aren't necessary. If you can get very, very good performance from a speaker-independent model, why would you put in the extra effort to make speaker-dependent adaptations if the gain is marginal? Not to mention that speaker-independent models are more useful in more situations and are extremely powerful. Google, for instance, can caption videos automatically using speech recognition, which is amazing. If the models were speaker-dependent they wouldn't be able to do that. That's why the focus has shifted so much towards speaker-independent models.


> The built in offline Android speech recognizer is really bad.

I totally disagree. Compared to Sphinx it is still lightyears better.

To wit, I use it for my Android-based home automation voice recognition, and even from a distance with background noise it still works with >90% accuracy. My original tests with Sphinx in a similar environment garnered about 30%.


I completely agree with you that it's much better than Sphinx/PocketSphinx. It's even much better than Microsoft's built-in speech recognizer that's been around since XP. But it is still very bad compared to the recognition available via Google's voice API, and that was the point. Also, I was trying to explain that, given the types of models used today, models trained and hosted centrally are inherently going to be bigger and more accurate than ones deployed in the field.


Facebook delivers software on many other platforms than just iOS and Android.

There’s also the new Portal hardware.


Yeah, well, Facebook wants to know what you said and not what Android told it you said.


Advertisement based on keywords.


this was never confirmed afaik, but it looks like they are doing just that


data input that doesn't involve keyboards..


Why not?

Facebook has a research arm dedicated to playing Go/StarCraft too; what do you think their reason is for doing that?

They can use a speech recognition system to transcribe videos, just like what Youtube is doing to improve ad targeting and recommendation. Why is this hard to understand?


No need to be aggressive, as I am not being cynical. I am genuinely asking, because I couldn't see any immediate need for it in their main product.


But also remember that accessibility for visually impaired folk is often a regulatory imperative in those product lines.


Facebook has its own Alexa/Echo like offering. Not sure if Messenger also has voice functionality.


It does, you can do video and voice calling from the Messenger app


That would be a bit creepy and opportunistic if FB did that. YouTube videos are by and large uploaded for profit; FB videos are uploaded to store birthdays and family vacations.


You're forgetting about Facebook Watch, which allows pages to monetize their videos through an ads program, just like YouTube.


Could be used for indexing videos for search


Automatically creating subtitles for translation is another use case I have seen.


> YouTube videos are by and large uploaded for profit

Is that true? I don't have the stats, but I'd guess a very small percentage of YouTube videos are uploaded 'for profit'. Like 1% or less? Maybe much less.


Interpret the spoken text in posted videos? E.g. for automatic subtitling?


Speech recognition is becoming fundamental for all kinds of software these days: speech-to-text input, voice navigation, etc.


I think Portal has voice control, but is currently using Alexa for most functionality.


Sending voice messages is a pretty common form of user-to-user messaging in Asia - i.e. record a brief audio snippet instead of tapping out a sentence on a phone keyboard



Potential Oculus control option?


Isn't Portal supposed to have its own voice assistant, Aloha or whatnot?


So their app can eavesdrop on your conversations to target more ads. Facebook does nothing without an ulterior motive.


Totally not because they listen to anyone to profile users or target ads.


Perhaps to listen in on your voice conversations and serve you ads so relevant its creepy?

From a software standpoint, this has never been proven. However, it's super weird when it happens to you.

For example, I traveled to meet a coworker who was playing a mobile game I had never seen before and we talked about it. I never Googled it or anything like that.

Hours later I checked Instagram and the first ad was for the same mobile game. Coincidence?

Perhaps the game was simply advertised more in his city than my own?

Perhaps our phones being near each other prompted a "friend request suggestion", and then ad targeting took that to another level with installed apps?

Or just a coincidence and I am thinking too much about it. lol.


Their mobile app most likely records everything being said and transmits it encrypted to their servers, so that they can improve ad targeting.


That’s a nonsense conspiracy theory. Thousands of people have internal access to the full source code for the mobile app.

“Ah, but there’s an undetectable binary blob that gets linked in and called without being detected by anyone working on the code,” you say. In that case consider the battery life impact. The power consumption of the Facebook app compares favorably to its social media peers. Is everybody else also recording, encoding and encrypting all the time?


> Thousands of people have internal access to the full source code for the mobile app.

So what? Thousands of the same people have access to the full source code of everything underhanded Facebook does, and it hasn't stopped anything.


I am not saying that the grandparent is on to something, but I would be careful about calling anything done by them a "nonsense conspiracy theory". They have proved these "conspiracy theories" true quite a few times.


> They have proved these "conspiracy theories" to be true quite a few times.

Such as?

Facebook has explicitly denied spying on people's conversations. I can't think of any situations where they have flat out lied about something like that, so I'm curious to hear what conspiracy theories they have proven true.


> Such as?

Shadow profiles for example.


They lied to Congress about selling access to users' private messages.


You mean this week’s news that Spotify and Netflix had experimental messaging client integrations years ago?

It’s a complete mischaracterisation that Facebook shared private messages with Spotify. By the same logic, you could say Google is sharing your emails with Apple when you access Gmail using the iOS Mail client.

Spotify was offering an integrated client to FB’s chat service. This UI integration was a market failure and was discontinued years ago. Of all the things wrong with Facebook, this wasn’t worth the noise.


I deleted the Facebook app specifically over its egregious power usage. I don't care that much about being spied on, or getting better ads when I mention a certain keyword, but draining the battery was inexcusable. Mobile uptime on my phone with the app: 3 hrs; without it: back to 9 hrs.


Twitter, Snapchat, Tumblr etc. are also battery hogs. The Facebook app is no worse and generally better (IME).

Scrolling through a social media feed is a heavy activity on a phone (which may be surprising because it seems passive). Constantly fetching more data from servers, decoding incoming images and videos in background threads, shuffling data to GPU-accessible buffers for fast scrolling, etc. — There’s a lot going on all the time when you’re scrolling mindlessly.


Any idea why they developed Flashlight for this, instead of using PyTorch?


> developing in modern C++ is not much slower than in a scripting language.

Is this accurate? I haven't written C++ since freshman year of college, and it was very cumbersome then.


It's improved a lot in terms of expressiveness, but it's still plagued by memory errors and obscure error messages.

Who said that, anyway? I don't see it in the text linked by the OP.


Sorry, I should have mentioned: it's part of the research paper announcing the release. It's in the right column of the first page.

https://arxiv.org/pdf/1812.07625.pdf


Thanks.


Not in my opinion. It's still an unmanaged language at the end of the day and has to be compiled. Secondly, it's still a very technical language.


I like the way they politely skipped Kaldi's WER on test-clean (4.31) in the results table. Their best WER (4.91) would not look so attractive next to it.


Even more so if they compared it to Kaldi TDNN-LSTM with RNNLM lattice rescoring (test-clean 3.22%, apparently): https://github.com/kaldi-asr/kaldi/blob/master/egs/librispee...


And Kaldi gets <8% on test-other while this gets over 11%!


This is excellent. Modern free speech recognition software is hard to come by. Everything except Kaldi has laughable error rates, and Kaldi is a huge pain to set up.

Will be interesting to see what people can do with this and the available data sets.


> available data sets

Reminder that Mozilla's Common Voice project accepts voice donations! https://voice.mozilla.org/


Interesting to see Kabyle as the language with the third most validated hours at https://voice.mozilla.org/en/languages

Kabylia is a region in the north of Algeria mostly inhabited by Berber people who are bilingual in Algerian Arabic and Kabyle. In recent years, an independence movement has developed that emphasizes Kabyle over Arabic for reasons of internal cohesion. To confuse matters, there's also a pan-Berber movement denying the existence of a separate Kabyle language, classifying it as a dialect of Berber/Tamazight instead.

Those heated politics have led to a large number of Kabyles contributing to various linguistic corpus projects to gain visibility for their cause. E.g. trying to overtake Berber on https://tatoeba.org/stats/sentences_by_language (As far as I know, Mozilla's Common Voice shares data with the Tatoeba project.)


Given the value of open source training data (or scarcity of it), has anybody attempted to use LibriVox for training?

https://en.wikipedia.org/wiki/LibriVox

https://librivox.org/

The recordings are public domain audio books of public domain books, so the licensing should be fine. The audio isn't annotated, but given the value involved I think it would be worth attempting to use forced alignment to annotate the recordings with their public domain source texts. Forced alignment using the sort of speech recognizer you're trying to train in the first place may be a bit "chicken and the egg", but from some experiments I've run myself existing open source speech recognizers can do it reasonably well. Humans could manually tune up the alignment to improve the quality if necessary.

As for motivating people to actually do that mundane work... well these are audio books so maybe the work isn't so mundane after all! The LibriVox recording of Tom Sawyer (read by John Greenman: https://librivox.org/tom-sawyer-by-mark-twain/) is pretty great and has been listened to by millions of people. If somebody created a "read along" web app that showed you the text of the book from Project Gutenberg getting highlighted as the audiobook from LibriVox was played, users who have an interest in reading/hearing the book could have their attention held by Mark Twain and with the right UI provide fine tuning for the forced alignment at the same time.
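
To make the forced-alignment bootstrapping concrete, here's a sketch that anchors ASR output against the source text using only the standard library (the recognizer output format, a list of word/start/end tuples, is hypothetical; any recognizer that emits word timings would do):

    import difflib

    def align(recognized, source_text):
        """recognized: list of (word, start_sec, end_sec) tuples from an ASR
        pass over the audiobook. Returns (source_word, start, end) for every
        word the matcher could anchor to the Gutenberg text."""
        hyp_words = [w.lower() for w, _, _ in recognized]
        src_words = source_text.lower().split()
        matcher = difflib.SequenceMatcher(a=src_words, b=hyp_words,
                                          autojunk=False)
        aligned = []
        for src_i, hyp_i, size in matcher.get_matching_blocks():
            for k in range(size):
                _, start, end = recognized[hyp_i + k]
                aligned.append((src_words[src_i + k], start, end))
        return aligned

    # Gaps between matched blocks (ASR errors, mumbled words) can be
    # interpolated or handed to the "read along" UI for human correction.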


LibriVox recordings are what the LibriSpeech corpus [1, 2] is based on. That's what they use in the paper.

[1] Panayotov, V., Chen, G., Povey, D. and Khudanpur, S. (2015). LibriSpeech: an ASR corpus based on public domain audio books. Proc. ICASSP. http://www.danielpovey.com/files/2015_icassp_librispeech.pdf

[2] http://openslr.org/12/


LibriVox is commonly used as a training corpus; however, its main weakness is that read speech differs quite a lot from conversational speech.


I see. I hadn't considered that that might be a weakness, because to be honest the creation of forced alignments between LibriVox and the source texts was my objective. (It's a feature that exists on some Kindles when pairing ebooks with Audible audiobooks. I believe the feature makes literature more accessible, a noble enough cause. Although I can understand why more effort is being spent on recognizing conversational speech.)



> a huge pain to set up.

This project doesn't look like it's particularly easy to build: https://github.com/facebookresearch/wav2letter/blob/master/d...


The problem with Kaldi is that it's not a turnkey solution for a speech recognition system, but a collection of libraries and shell scripts that can be used to build your own system, assuming you're a researcher in speech recognition or are willing to put in the time to become one. A long list of dependencies appears less daunting in comparison.


I like to take recordings of my thoughts on my cellphone similar to Dale Cooper. Unfortunately, I do not have a Diane on the other end to translate my thoughts to text, I have to do that myself.

I've been looking into things like mozilla/deepspeech and other open source libraries for automatically converting my messages to text. I'll have to take a look at this project as well!


Hey, me too! A while ago I was looking at trying to figure out how to hack something like this together myself when I came across what is now one of my top three apps: Otter.

Sounds like a shill and I don’t really care. I’m a premium member with 6,000 minutes of transcript time per month (and sometimes I’ve used almost all of it) and I couldn’t be happier.

You can export everything, support and head of product are kind and responsive, and you can click in the transcription anywhere and it will play the audio at that point.

Exactly what I need.

My main complaint is that it's geared towards corporate environments for conferences, meetings, etc., and so the grouping isn't exactly what I like, but I use my text editor to keep the links more to my liking.

Being able to search by word hundreds of hours of my thoughts is a fantastically empowering experience and I hope you find the same.

Let me know what you think! Shoot me an email if you want to chat about it ever. If you can't tell I'm a pretty big fan.


How good is this library? Is this good and fast enough to see wider adoption in embedded devices or phones? Would be awesome to be able to voice-enable apps without the need for a cloud provider. How would this compare with a C++ pytorch-based approach?


It has dependencies on CUDA and/or Intel MKL, so not really suitable as-is for embedded/phones.


From what I understand, this is less of a machine learning advancement and more of an engineering advancement? Trying to see if any of the bleeding-edge stuff has been implemented. Still waiting for SELUs to be standard!


Is it better than OpenEars? I want something that will work on-device without sending audio to a server.


It would be interesting to see this benchmarked against Mozilla's DeepSpeech.


Is there a demo of a working example app using this library?


Management: "Let's keep open sourcing all of the things so that the informed community will overlook our transgressions!"


Is it ethical to use their code?


I think you should consider it ethical to do ethical things with a tool created by unethical people. Consider, for instance, Fritz Haber, the so-called "father of chemical warfare", who contributed to the development of the Haber–Bosch process for artificial nitrogen fixation, which facilitated the mass production of explosives by Germany during the Great War but also provides as much as 50% of the nitrogen in the body of the average human today due to its role in the production of fertilizer.

Whether it's ethical to contribute back to the project, knowing that the unethical creator might derive unethical utility from your contributions, is perhaps slightly more complicated. However, the same could be said of any open source project: you could create something new wholly from scratch, and if you release it publicly, somebody else could use it for something unethical.

I commend your consideration of ethical concerns, which I think is lacking in the tech industry today. But in this particular case I don't believe there is too much cause for concern.


I think your points generally stand, but I think Facebook open source raises some specific issues at least worth consideration:

If your use of it contributes to its popularity, perhaps making it the standard in its area, does that give Facebook the company more power and possibly enable other unethical actions?

I think it's probably not as much of a worry given the narrowness of the area, but I do think this is something to consider when it comes to React for example.



No


Why not?



