Show HN: LibreASR – An On-Premises, Streaming Speech Recognition System (github.com/iceychris)
233 points by iceychris on Nov 15, 2020 | 71 comments



Hey HN!

I've been working on this for a while now. While there are other on-premise solutions using older models such as DeepSpeech [0], I haven't found a deployable project supporting multiple languages using the recent RNN-T Architecture [1].

Please note that this does not achieve SotA performance. Also, I've only trained it on one GPU so there might be room for improvement.

Edit: Don't expect good performance :D this is still in early-stage development. I am looking for contributors :)

[0] https://github.com/mozilla/DeepSpeech

[1] https://arxiv.org/abs/1811.06621
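
For those unfamiliar with the RNN-T architecture from [1]: an audio encoder and a text-only prediction network are fused by a small joint network that outputs logits over the vocabulary plus a blank symbol. A minimal PyTorch sketch of just the joint network (simplified and illustrative only, not the exact code in this repo; names and dimensions are made up):

    import torch
    import torch.nn as nn

    class RNNTJoint(nn.Module):
        """Joint network of an RNN-Transducer: fuses encoder (audio) frames
        and prediction-network (text) states into logits over vocab + blank."""
        def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
            super().__init__()
            self.enc_proj = nn.Linear(enc_dim, joint_dim)
            self.pred_proj = nn.Linear(pred_dim, joint_dim)
            self.out = nn.Linear(joint_dim, vocab_size + 1)  # +1 for the blank label

        def forward(self, enc, pred):
            # enc: (B, T, enc_dim), pred: (B, U, pred_dim)
            joint = self.enc_proj(enc).unsqueeze(2) + self.pred_proj(pred).unsqueeze(1)
            return self.out(torch.tanh(joint))  # (B, T, U, vocab_size + 1)

The resulting (B, T, U, vocab+1) lattice is what the transducer loss marginalizes over, which is also why RNN-T training is so memory-hungry.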


You can also check out https://github.com/TensorSpeech/TensorFlowASR for inspiration (not my project, not involved). It implements streaming transformers and conformer RNN-T (but in TF2). Deployment on device as TFLite. So far, there aren't many usable pretrained models available (just LibriSpeech), but with some work it could turn out quite nicely.


Hi! What would you need to implement other languages, e.g. Italian or French? I mean: is it a problem of lacking data, or something else?

Another question: could you use, for example, Mozilla voice data to train/test?


Data and compute are the largest hurdles. I only have one GPU and training one model takes 3+ days, so I am limited by that. Also, scraping from YouTube takes time and a lot of storage (multiple TBs).

Mozilla Common Voice data is already used for training.


Thanks for sharing this project. What do you think of the data with Mozilla Common Voice? The random sampling I looked at a while back seemed pretty poor -- background noise, stammering, delays in beginning the speaking, etc.

I was hoping to use it as a good training base, but the issues I encountered made me wary that the data quality would adversely affect any outcomes.


Depending on your objective, noisy data might be useful. I'd like LibreASR to also work in noisy environments, so training on data that is noisy should already help a bit with that. But yeah - stammering and delays are present not only in Common Voice but also Tatoeba and YouTube.
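
One cheap way to get more of that robustness is to mix background noise into clean samples at a target SNR during training. A rough numpy sketch (nothing LibreASR-specific; function and variable names are made up, and both signals are assumed to share the same sample rate):

    import numpy as np

    def mix_noise(clean, noise, snr_db):
        """Mix `noise` into `clean` at a target signal-to-noise ratio (dB)."""
        # tile/crop the noise to match the clean signal length
        if len(noise) < len(clean):
            noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
        noise = noise[: len(clean)]
        clean_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2) + 1e-10
        # scale the noise so that clean_power / scaled_noise_power == 10^(snr_db/10)
        scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
        return clean + scale * noise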


There is also a lot of data from audiobooks in many languages that is easy to scrape and align using a basic model that has been updated for each language, or you can use YouTube videos with subtitles that are almost aligned for a first version of the model, then realign.


For the compute problem: maybe you could use a GPU-powered cloud server such as https://www.paperspace.com/ (I don't know the current prices, but I remember it was quite affordable).


> I remember it was quite affordable.

Relative to what? Paperspace is one of the costlier GPU providers.


Okay, you are right, but it's also really performant, so IMHO you can get a lot of work done in less time.

For something cheaper, I read this post on Reddit:

https://amp.reddit.com/r/devops/comments/dqh09n/cheapest_clo...


performant? It's the same GPU..?


Why does it take a lot of data? Afaik you can select lower quality in youtube-dl but you don't even need video do you?


> Why does it take a lot of data? Afaik you can select lower quality in youtube-dl but you don't even need video do you?

But you need supervised data too.


I know you can scrape only audio from YouTube with YouTubeDL but it’s somewhat annoying


I use something akin to

    alias downloadmusic='youtube-dl --extract-audio --audio-quality 0 --add-metadata'
in my .bashrc

I find that helps with the annoyance of downloading things off of YT. This is for music obviously, but there's an option to download subtitles as well.

EDIT: Typed this from memory, there may be errors in the alias.


    youtube-dl -f bestaudio $URL
Dunno when that went in but it works now.
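
If you drive it from Python instead of a shell alias, youtube-dl's embedded API can grab best-quality audio plus the auto-generated captions in one go, which is roughly what you want for weakly supervised ASR data. A sketch (the output template, subtitle language and video ID are placeholders, and ffmpeg has to be on the PATH for the audio extraction):

    import youtube_dl

    ydl_opts = {
        "format": "bestaudio/best",
        "writeautomaticsub": True,        # auto-generated captions (roughly aligned)
        "subtitleslangs": ["en"],
        "subtitlesformat": "vtt",
        "outtmpl": "data/%(id)s.%(ext)s",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",  # convert the downloaded stream to wav
            "preferredcodec": "wav",
        }],
    }
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])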


So do you scrape videos from YouTube with subtitles to collect data?


Vosk supports both Italian and French. The French model is trained by the Linto project; it's a pretty good one.


Awesome project - I'm also working on a similar idea for an on-premise ASR server! Any reason you decided to go with RNN-T?


[flagged]


That's called setting expectations. If you know your project might interest people but needs work, why pretend it's good when it's not? They seem to be courting contributors more than users anyway.

I found the video to be funny. It nicely highlights both the current limitations and the ambition of the project. Bold choice certainly, but I think it works.


And it's maybe a dig at Macron's accent at the same time :D although the author is a student in Germany. Anyway, you should join the Discord; we discussed this there too...

https://discord.gg/pqTMeP5D3g


Nice project and bold to demo with a French native speaking English.

On a side project, I'm looking at the best interface to facilitate further editing (correction) of the recognized text. The target is local councils and regional parliaments, where sessions are usually recorded but without transcripts. Even if xx% accuracy is enough to identify keywords, manual editing is still required so as not to distort the precise meaning.

Nothing special in the interface, but two features seem interesting: 1. Being able to collaborate in real time, maybe using the Etherpad API to merge multiple edits (rough sketch below). 2. Easily validating text and labeling speakers so as to generate new training data.
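
To make the Etherpad idea concrete, here's a rough sketch of pushing recognized segments into a shared pad through Etherpad's HTTP API so several people can correct the text at once. The base URL, API key and pad name are placeholders for your own instance, and error handling is omitted:

    import requests

    ETHERPAD = "http://localhost:9001/api/1"    # your Etherpad instance
    APIKEY = "your-etherpad-api-key"
    PAD_ID = "council-session-2020-11-15"

    def call(fn, **params):
        params["apikey"] = APIKEY
        r = requests.get(f"{ETHERPAD}/{fn}", params=params)
        r.raise_for_status()
        return r.json()

    call("createPad", padID=PAD_ID)             # harmless if the pad already exists
    # as new ASR segments arrive, read the pad, append, and write it back
    current = call("getText", padID=PAD_ID)["data"]["text"]
    call("setText", padID=PAD_ID, text=current + "\nSpeaker ?: <recognized segment>")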

Pointers to similar existing solutions would be very appreciated.


I’m really interested in this project too. Been thinking about similar solutions for a while now.

I looked into Kaldi and Mozilla DeepSpeech, but the former seems geared toward ASR experts and the latter didn't seem suited for my particular application (longer recorded audio or a real-time stream).


Mozilla DeepSpeech has had streaming audio support for a few releases now, and the word error rate has also improved.

I would recommend looking at Vosk too, it converts speech to text much faster than Mozilla DeepSpeech while having slightly better results: https://alphacephei.com/vosk/
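
For reference, the Vosk Python API is also pleasantly small; you feed audio in chunks and get partial results as you go, which is what makes live captioning possible. A sketch assuming one of their models is unpacked into ./model and the input is a 16 kHz, 16-bit mono WAV:

    import json
    import wave
    from vosk import KaldiRecognizer, Model

    model = Model("model")                      # path to an unpacked Vosk model
    wf = wave.open("test.wav", "rb")            # 16 kHz, 16-bit, mono
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            print(json.loads(rec.Result())["text"])   # finalized segment
        else:
            json.loads(rec.PartialResult())           # running hypothesis
    print(json.loads(rec.FinalResult())["text"])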


Use wav2letter


In case anyone wondered what the title has to do with logic, it's probably just a (common) malapropism.

premise (noun) a previous statement or proposition from which another is inferred or follows as a conclusion.

premises (noun) a house or building, together with its land and outbuildings, occupied by a business or considered in an official context.


Right, fixed it, thank you :D


It's not "mal" anything; many simply prefer to use "on-premise" or "on-prem" for "on-premises".

You didn't have any issue understanding the original title.


> You didn't have any issue understanding the original title.

Humans are very good at live error correction, but that doesn't make it not wrong.


Is there an open-source or paid SDK/API that I can use to create a group voice chat mobile app with "live" transcription? Or something that can plug-in to a system like this?

I looked at Twilio but they seem to only offer a means to do it on their VOIP/SIP product.


> open-source or paid SDK/API that I can use to create a group voice chat mobile app with "live" transcription? Or something that can plug-in to a system like this?

Yes, Google, Amazon, Microsoft all offer streaming solutions (wouldn't recommend Amazon's however, might recommend Microsoft over Google). wav2letter from FB is the only open-source framework worth looking at, deepspeech is not a seriously usable framework.


Check out Kaldi. It's a toolkit rather than a ready-to-deploy service but has some solid pretrained models and recipes for training your own. You can use various existing projects for deployment, e.g. vosk-server (also for on-device) which comes with models for various languages and accents and has an excellent support channel via telegram. Quite frankly, despite not being "end-to-end", you'll get much much better results in practice.
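
To give an idea of what deployment looks like from the client side: vosk-server exposes a WebSocket that you stream raw PCM chunks to and that answers with JSON partial/final results. A rough sketch along the lines of their example client (server address and file name are placeholders):

    import asyncio
    import json
    import wave

    import websockets

    async def transcribe(uri, path):
        async with websockets.connect(uri) as ws:
            wf = wave.open(path, "rb")          # 16-bit mono PCM
            await ws.send(json.dumps({"config": {"sample_rate": wf.getframerate()}}))
            while True:
                data = wf.readframes(4000)
                if len(data) == 0:
                    break
                await ws.send(data)
                print(json.loads(await ws.recv()))   # partial or final result
            await ws.send('{"eof" : 1}')
            print(json.loads(await ws.recv()))       # last final result

    asyncio.run(transcribe("ws://localhost:2700", "test.wav"))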


I collected custom audio and had it transcribed by hand for cash, then evaluated wav2letter and vosk on it. At least for that domain, wav2letter outperforms vosk.


Good for you, it's the only way to know which tool works best in your case. I did the same for my use case and arrived at the opposite conclusion.

What most people don't realize is that it heavily depends on your use case and domain whether any given model/algorithm will work better.


Curious why you would not recommend Amazon... is it cost or something else?


For my use case, the quality was subpar compared to the other cloud providers.


Telnyx has media forking, the ability to clone a media stream in real time without affecting the original call. It allows receiving the stream directly and operating on it without latency.

Not sure if relevant though, it's using their SIP product also. If the original service isn't using Telnyx, you could get creative and have a Telnyx shadow user join the group call to receive the stream, etc.


How real-time do you need it? If you use a streaming API you can even use Google; there isn't too much lag, and it's continuous.

Agora also talks about this, but I haven't used it myself: https://www.agora.io/en/


Google Meet does this


I LOVE that you provided a sample application targeting the ESP32-LyraT! While the ESP8266/ESP32 get plenty of love on HN (and elsewhere), I think the ESP ADF (audio development framework) and the various dev boards (Lyra, Korvo, etc.) are really underappreciated and essentially unknown.

I enjoy a Raspberry Pi, Jetson nano, Arduino, whatever as much as the next person but the seemingly endless stream of projects and resulting blog posts, etc featuring them can get a little old.

Great work!


Audio boards based on the ESP32 are quite under the radar and have lovely features for just a few bucks. Running LibreASR on an RPi should also be feasible soon.

Thank you for your kind words! :)


Cool project! It seems like your model has a similar WER to mine (4th reference in the readme). Do you plan to do any pre-training on the encoder part in the future? Maybe something like this[1]

[1] https://ai.facebook.com/blog/wav2vec-20-learning-the-structu...


Hey blackcat! Your project [0] helped me a lot! Pre-training the encoder sounds great, I'll maybe add it in the future.

[0] https://github.com/theblackcat102/Online-Speech-Recognition
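
For anyone curious what pre-training the encoder could look like in practice: one low-effort variant is to take a self-supervised wav2vec 2.0 checkpoint and use its hidden states as (or to initialize) the transducer encoder. A sketch via the HuggingFace wrapper rather than fairseq; the model name and shapes are illustrative and nothing here is wired into LibreASR yet:

    import torch
    from transformers import Wav2Vec2Model

    w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    w2v.eval()

    waveform = torch.randn(1, 16000)            # 1 s of 16 kHz audio (dummy input)
    with torch.no_grad():
        feats = w2v(waveform).last_hidden_state  # (batch, frames, 768)
    # `feats` could then replace the filterbank features fed to the RNN-T encoder.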


Hey black cat, I have some work in preprint for a NeurIPS workshop, demonstrating negative results of different audio distances on pitch tasks. There is one particular w2v result I'd like your feedback on.

Do you mind emailing me? Lastname at gmail dot com (see my profile for my name)


Having worked on ViaVoice OSX back in the day, we had to have models for different varieties of English. The US model couldn’t understand my northern English (think GoT) accent. It’s why the product came out with a UK localisation.

Wondering if you might get better recognition of the French President if you had a model per dialect of English?


Yes, probably. The data I trained on mostly reflects UK and US accents.


IBM kept their US and UK models apart. May have been for historical reasons or dataset size.

As an FYI, I was told "the money" was in specific "dictionaries" for medical professionals and so forth. Apparently, doctors liked to dictate straight into text. Might be worth trying that $$$€€€£££?


Can you post WER per dataset? Bucketing all of the WER together means you can only directly compare to models that are validated on the exact same combination of datasets. This excludes all other ASR systems from comparison, as well as your own models if you decide to add validation data in the future.
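
For what it's worth, per-dataset WER is cheap to compute once you keep (dataset, reference, hypothesis) triples around. A self-contained sketch of the usual aggregation, total word errors over total reference words per dataset (names are illustrative):

    from collections import defaultdict

    def word_errors(ref, hyp):
        """Levenshtein distance between word sequences, plus reference length."""
        r, h = ref.split(), hyp.split()
        d = list(range(len(h) + 1))
        for i in range(1, len(r) + 1):
            prev, d[0] = d[0], i
            for j in range(1, len(h) + 1):
                cur = d[j]
                d[j] = min(d[j] + 1, d[j - 1] + 1, prev + (r[i - 1] != h[j - 1]))
                prev = cur
        return d[-1], len(r)

    def wer_per_dataset(samples):
        """samples: iterable of (dataset_name, reference, hypothesis) triples."""
        errs, words = defaultdict(int), defaultdict(int)
        for name, ref, hyp in samples:
            e, n = word_errors(ref, hyp)
            errs[name] += e
            words[name] += n
        return {name: errs[name] / max(words[name], 1) for name in errs}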


The secret of RNN-T is that it is __extremely__ hard to train; it is very unstable. On a single GPU you'll spend years training a model of reasonable quality, let alone a streaming one. Streaming training requires a teacher-student setup.

That's why there are dozens of RNN-T projects around, some of them more reasonable, some less, but most of them struggle to demonstrate even a good LibriSpeech WER.

You have a long way to go.


This is somewhat off-topic, but as someone who has worked on various speech processing and ASR projects, I'm curious to learn from people who have specific problems and applications for this technology where it can make a difference.

That is to say, what are areas where you think ASR can enable new products or make common and tedious tasks much more efficient?


My partner and I are raising our child in unique ways with a focus on avoiding certain linguistic patterns. I'm interested in filming/recording our home, automating transcription, and creating a system for voice-driven video editing we can do on the fly to create highlights and capture discussions. I see this as a way to create a dataset they can use in the future to diagnose any trauma caused by our choices.


You're effectively experimenting on your child? Don't you think the repercussions might be severe?


Every parent is, though many will deny it. The repercussions of doing so without intention and without admitting it are already severe. Every parent is likely to traumatize their child in some way, including the trauma of protecting them from trauma to the point that they don't learn to heal through it. I'm ok with intentionally experimenting and normalizing healing within our family.

Also, it's already paying off tremendously. When repercussions can be severe, so can rewards. We have a 2-year old who is incredibly emotionally aware, has a huge vocabulary, enjoys eating anything, explores freely, is learning to play multiple instruments, draws with a pencil grip in both hands, can sit to actively listen to music for 20+ minutes at a time, and learns lyrics incredibly fast.

If you have specific fears, I'm interested in hearing them because that gives us an opportunity to prepare.


That's a good observation! I have no specific fears at all, since I was wondering what sort of "experimentation" it involved.

I wonder if the drawback will be that he/she will be incredibly bored once released to the "normal" world and the slower development pace of contemporaries and may feel out of place and frustrated. It's the curse of the gifted.


Here's a list of some of what we're doing. We haven't been very diligent about keeping track of it all, so listing it here is kind of an exercise for me to start working on that.

We're working hard to keep from using judgmental/subjective words like good/bad, like/dislike, etc. We're also starting to incorporate Nonviolent Communication patterns and concepts. We use they/them pronouns instead of gendering them. If they want to do something, we strive to help them do it as long as they won't be maimed or killed. We ask them for consent before changing their diaper, touching them, picking them up, taking things from them, and performing medical/dental procedures on them. We've named them Uni Verse All. They wear whatever clothes they choose, no matter what gender they may seem created for. I'm genderfluid and do the same. I also shower once every 1-2 weeks, stopped using shampoo about a year ago and am about to stop using soap on my body, too.

We aren't teaching them about property currently and may not ever, choosing to instead describe things as "living with" someone. I'm developing a spirituality with a component I call "radical ignorance," which is essentially a sort of Zen "beginner's mind." It recognizes that ignorance isn't an excuse, but a spiritual reason for doing things, which runs counter to the US's legal reasoning of "ignorance of the law is no excuse for breaking the law."

My partner and I are intentionally staying out of the workforce, instead choosing to serve people in our community alongside Uni, which allows both of us to be available so they can have their choice between us. Anytime one of us is choosing to not let them do something without there being a safety issue, the other ideally defaults to helping Uni do what they want. When they get hurt, we bring their attention to the pain and teach them to mindfully experience it while breathing through it.

As for the drawbacks, we're currently designing a community anti-adultist homeschool model that focuses on collaboratively learning our needs over what schools typically teach and allowing the students (of age 0-200+) to choose the contexts for learning. So they'll probably have an interesting intergenerational peer group to blow past the world with.


Viewing your child as an optimisation problem centred around your subjective view of good and bad could be all the trauma you need to inflict. Of course that's anything but a new pattern, parenting is training and training is optimising. However wielding extensive tracking & data to make your imprinting even more precise & targeted would certainly - for me - eliminate the joy of seeing someone grow into someone I admire, possibly in ways I never expected.


What you're describing is definitely not something we want to or are choosing to do.

The primary purpose of the recordings isn't for the sake of analysis, but for documentation. I think the only time we'd probably "go to the tapes" is for reliving what we call "sacred moments" (like last night when they began playing the pump organ in ways similar to me without me doing more than playing and explaining a little bit of the mechanics of the machine) or for when there's a dispute about things. My partner has an automated process of constructing exaggerated narratives and is still learning to notice when it's happening. The videos are more about capturing when we're carrying our own traumas into the relationship and visiting them upon Uni.

If this turns into something where I'm poring over videos, I'm relapsing in my information addiction hard.

Parenting is, ideally, training a new person to know/identify what their needs are and how to meet their needs, including safety, autonomy, exploration, acceptance, and interdependence. Our goal is to only stop them from doing something when they might die or be maimed. And then to get out of the way. Parenting as it's classically been done in many cultures around the world is incredibly adultist, with all kinds of assumptions about what children can and can't do. Even neuroscience and the medical field use "childhood development" as a reason for ignoring children's pains, consent, and autonomy. We don't play that way. Uni is "behind" on their vaccinations due to them not yet saying yes to the ones we're at. We tried respecting their consent for the first few, but the nurses weren't on board. Even when we had someone set up to receive a flu shot first, the nurse administered it when Uni wasn't looking, keeping them from actually seeing what was happening.

No, if anyone will be analyzing the videos, it won't be us. It'll be other people and algorithms. Any suggestions coming from the analysis will be converted into experiments to conduct with Uni's full informed consent.

Does that make things clearer about what we're trying to do here?


Is the transcription of Macron's speech completely off or am I not understanding what the two shown texts represent?


The upper transcript is YouTube's automatic transcription. Below is the web app transcribing live. And yes, it is actually missing a few words.


Seems off indeed. Plus, the readme says French is not supported yet. Not the best demo IMO.


I have not yet trained a French model. Also, the gif shows Macron speaking to Congress with his English accent [0]

[0] https://www.youtube.com/watch?v=RqUc1h7bZQ4


This is great! Would you be able to integrate it with Live Transcribe to make a great FOSS solution for the deaf and hard of hearing? :)

https://github.com/google/live-transcribe-speech-engine


The README could use a section describing what CPU platform and storage requirements are necessary to run this app.


What datasets are used to train the models?


LibriSpeech, Tatoeba, Common Voice and scraped YouTube videos.


Do you get good results when adding scraped YouTube audio? My model's performance on LibriSpeech dev drops a bit when adding YouTube audio to the training dataset (my guess is it's likely due to poor alignment from auto-generated captions).


I haven't trained on LibriSpeech exclusively, but yes, the perf on LibriSpeech dev is quite bad, around ~60.0 WER. If the poor alignment of yt captions is the issue, maybe concatenating multiple samples helps a bit.
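
Another thing I'm considering is filtering before (or instead of) realignment: transcribe each scraped clip with a baseline model and drop clips where the caption and the hypothesis disagree too much, on the assumption that large disagreement usually means bad alignment. A rough sketch where `transcribe` and the threshold are hypothetical and jiwer is only used for the WER:

    import jiwer

    def filter_youtube_samples(samples, transcribe, max_wer=0.5):
        """samples: iterable of (audio_path, caption) pairs. `transcribe` is
        any baseline ASR function returning a transcript string for a file."""
        kept = []
        for audio_path, caption in samples:
            hyp = transcribe(audio_path)
            if jiwer.wer(caption.lower(), hyp.lower()) <= max_wer:
                kept.append((audio_path, caption))
        return kept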


You should consider realignment; maybe start with something like DSAlign or my wav2train project.


Would it be possible to train on any of the more recent text-to-speech engines out there? Some of them are very realistic.

This would give you absolutely perfect sync down to the word, I assume... I don't know about the cost if you paid the rate card though; perhaps you can do some partnership with them, since yours is a symmetrical product.


(Anyone know how the transcription quality compares to the various cloud offerings from AWS, Google, and IBM?)


As I commented above, very poorly. It's still early days.



