Hacker News
Whisper.cpp example running fully in the browser (ggerganov.com)
199 points by lawrencechen on Jan 23, 2023 | 38 comments



I’ve been following along with Whisper.com’s incredible progress.

It is a high-quality piece of software which performs better on its intended hardware than any other implementation. It can easily be embedded anywhere. This is truly remarkable. A big shoutout to Georgi for this.

We need to remind ourselves that he is choosing to give this away by open sourcing it. And he has gone to a lot of effort to make it easy to use and understand (just look at the documentation). To me, Georgi personifies every open-source author who puts their sweat and toil into something that benefits our entire community.

Thank you Georgi. Salut, my friend!


I don't know what Whisper.com is but it doesn't seem related to this project.


Probably autocorrect (cpp->com)


Thanks - that’s right!


I was recently laid off and am currently trying to build some apps that could generate enough revenue to cover my costs for a few weeks. I built a transcription and dictation app for Mac [0] using whisper.cpp; the small model works really well on a 2019 MBP and on an M1 for streaming (dictation). It was really straightforward to use. However, the streaming algorithm isn't ready for production, so I implemented my own algorithm using VAD (voice activity detection). I believe that, at this pace, this could also be fixed.

[0] https://apple.co/3j2k8E7
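The commenter doesn't describe their VAD-based algorithm. As a rough illustration only, here is a minimal energy-threshold VAD in Python that groups 16 kHz float samples (the rate whisper.cpp works at) into speech segments, each of which could then be transcribed as one utterance. The function names and the frame size, `threshold` and `hangover_frames` values are hypothetical choices, not the app's or whisper.cpp's; a real VAD would need to be more robust to background noise.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0


def speech_segments(samples, sample_rate=16000, frame_ms=30,
                    threshold=0.02, hangover_frames=10):
    """Return (start, end) sample ranges that look like speech.

    A frame counts as speech when its RMS energy exceeds `threshold`.
    A segment is closed only after `hangover_frames` consecutive quiet
    frames, so short pauses between words do not split it.
    """
    frame_len = sample_rate * frame_ms // 1000
    segments = []
    start = None          # sample index where the open segment began
    last_voiced_end = 0   # end index of the most recent voiced frame
    quiet = 0             # consecutive quiet frames since that voiced frame
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if rms(frame) > threshold:
            if start is None:
                start = i
            last_voiced_end = i + len(frame)
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover_frames:
                segments.append((start, last_voiced_end))
                start = None
                quiet = 0
    if start is not None:  # audio ended while speech was still open
        segments.append((start, last_voiced_end))
    return segments
```

A pause shorter than `hangover_frames` quiet frames stays inside one segment; a longer pause closes it, which is a natural point to hand the segment to the model.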


I hope your app is a success. Speaking of which, how is it doing? I keep hearing that the Mac App Store isn't very good, but what has your experience been? Are you getting downloads and revenue?


Thanks for your good wishes. It's not going well, actually: there have been very few downloads after 10 days and no paying customers. The App Store seems like a desert; even though there are not many competitors in transcription, the app's impressions and page views are still very low. From what I can see, people do their marketing outside the App Store.


The current crop of transcription apps are truly awful. Trying yours now!

One suggestion -- the trial is difficult to use if I have a long file, because it won't let me select a file that is longer than 5 min. Maybe let me pick a longer file but only transcribe the first 5 minutes? Most people aren't going to edit their files to accommodate your trial mode.
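A minimal sketch of that suggestion, assuming the app decodes files to 16 kHz mono samples before transcription: accept a file of any length, transcribe only the first five minutes, and report whether anything was skipped. `trial_slice` and the constants are hypothetical names.

```python
TRIAL_SECONDS = 5 * 60  # the 5-minute trial limit mentioned above
SAMPLE_RATE = 16000     # assumed decode rate (whisper.cpp uses 16 kHz)


def trial_slice(samples, sample_rate=SAMPLE_RATE, limit_s=TRIAL_SECONDS):
    """Keep only the first `limit_s` seconds of any-length audio.

    Returns (clip, truncated) so the UI can tell the user that only
    the beginning of a longer file was transcribed.
    """
    max_samples = limit_s * sample_rate
    return samples[:max_samples], len(samples) > max_samples
```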


Hey, thank you so much for trying the app. I will improve the free trial feature; you are especially right about the 5-minute limit. I went with the most basic solution first, but the next release will let you pick longer files. Thanks!


This is a cool app; I was excited to try it. A couple of points of feedback:

- There is a popup every few seconds encouraging me to pay. Let me at least try it for a minute or two? I've seen the popup about 5 times while trying the app for just a couple minutes.

- Use the system light/dark mode instead of defaulting to dark mode

- Show language names instead of codes "EN" etc.


Hey, thank you for trying the app. These suggestions are very valuable to me. I am currently working on the paywall; at the beginning I went with the most basic solution. Now I will make it more useful for free users.


Just cap the number of minutes they can transcribe per day or in total. Then you will only be asking the people who find real value in it to pay.
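A per-day cap could look like the following sketch. `DailyQuota` and the 10-minute default are hypothetical; a lifetime cap is the same idea with a single counter instead of one per date. Persisting the counters between launches is left out.

```python
import datetime as dt


class DailyQuota:
    """Track transcribed minutes per calendar day for a free tier."""

    def __init__(self, minutes_per_day: float = 10.0):
        self.minutes_per_day = minutes_per_day
        self.used: dict[dt.date, float] = {}  # date -> minutes already used

    def remaining(self, day: dt.date) -> float:
        return max(0.0, self.minutes_per_day - self.used.get(day, 0.0))

    def try_consume(self, day: dt.date, minutes: float) -> bool:
        """Record usage if it fits today's budget; False means show the paywall."""
        if minutes > self.remaining(day):
            return False
        self.used[day] = self.used.get(day, 0.0) + minutes
        return True
```

The paywall then appears only when a user actually exhausts the free budget, rather than on a timer.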


Seemed impressive enough to me, but I don't know what the current best-in-class looks like these days. Can anybody working in this area explain if this is a significant milestone and what opportunities it might unlock? The consumer value proposition of basic speech-to-text input seems to be well-handled by most major OS's, but I appreciate that's proprietary tech and only one use case.


Whisper is way better than at least some other transcription solutions: https://blog.lopp.net/open-source-transcription-software-com...


Wow, near-perfect transcription on desktop Firefox! Didn't seem to work on Android Chrome, though.

I wonder if this can be sped up using WebGPU...


Does anyone know of a real-time version of this that can immediately transcribe individual words? It could be very useful for people who are hard of hearing.


There is an attempt for real-time transcription from the microphone at:

https://whisper.ggerganov.com/stream/

It chunks the input in 5 second buffers and processes them independently. Results and performance are not great, but I think good enough for a proof-of-concept.
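The fixed-buffer approach described above amounts to something like this sketch (the class name and the list-of-floats representation are assumptions, not the demo's actual code). Because each chunk is transcribed without the audio on either side, a word that straddles a 5 s boundary is split between two runs, which is one plausible reason the results are rough.

```python
class ChunkBuffer:
    """Accumulate microphone samples and emit fixed-length chunks."""

    def __init__(self, sample_rate=16000, chunk_s=5):
        self.chunk_len = sample_rate * chunk_s
        self.pending = []  # samples not yet part of a complete chunk

    def push(self, samples):
        """Append new samples; return every complete chunk now available.

        Each returned chunk would be transcribed on its own, with no
        context carried over from the previous chunk.
        """
        self.pending.extend(samples)
        ready = []
        while len(self.pending) >= self.chunk_len:
            ready.append(self.pending[:self.chunk_len])
            self.pending = self.pending[self.chunk_len:]
        return ready
```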


Does it have to be open source? If not, the Pixel phones do an incredible job of real-time transcribing.


Yeah, and with local-only processing.


I highly recommend trying out https://whisper.ggerganov.com/talk/. It lets you talk to GPT-2 using your voice, all running locally in your browser. Holy cow.


Very cool; it works for videos too. It parsed a 1-minute video with ~95% transcription accuracy.
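The comment doesn't say how the ~95% figure was measured. One common way to quantify transcription quality is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. "Accuracy" is then often quoted as 1 - WER, so one wrong word in twenty corresponds to roughly 95%. A minimal sketch, assuming plain whitespace tokenization with no case or punctuation normalization:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed as a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev holds the previous DP row: distances from ref[:i-1] to each hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)
```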


This is incredible! Thank you for sharing! Did OpenAI release these pretrained models, or was the training done separately along with this project?

If OpenAI releases the pretrained models, why would we use their service?


They are pre-trained. This project is running a C/C++ port of the original OpenAI release [0].

From the OpenAI paper and release notes: "We are releasing models and inference code to serve as a foundation for further work on robust speech processing." So I guess they are either truly altruistic in this, or they are planning on monetising whatever they build on top of it.

Also, OpenAI is a startup (if we can call it that), so their value right now is more about being impressive and looking a lot like future value, as opposed to showing an immediate route to profit.

[0] https://github.com/openai/whisper


It would be interesting to see this connected to YouTube to improve upon its auto-generated transcripts. There is this command-line version using youtube-dl and OpenAI's API: https://simonwillison.net/2022/Sep/30/action-transcription/


Running in the latest Safari browser on iPhone, I get the error:

failed to asynchronously prepare wasm: CompileError: WebAssembly.Module doesn't parse at byte 5: can't get Function local's type in group 1, in function at index 9 Aborted(CompileError: WebAssembly.Module doesn't parse at byte 5: can't get Function local's type in group 1, in function at index 9)


from the web page: "Important: your browser must support WASM SIMD instructions for this to work."

I don't think Safari on iOS supports this. [0]

[0] https://developer.apple.com/forums/thread/693067


Me too, same browser


My understanding is that each inference run requires 30 seconds of audio. Therefore, any input shorter than 30 seconds is padded out with silence.

To my knowledge, nobody has been able to work around this, and it may not be possible without work upstream.
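For reference, OpenAI's Python release handles this with a `pad_or_trim` helper that fixes the input at 30 s of 16 kHz audio (480,000 samples). The sketch below re-implements the same idea with plain lists so it runs without dependencies; it is an illustration, not the upstream code.

```python
SAMPLE_RATE = 16000                      # Whisper models take 16 kHz audio
CHUNK_SECONDS = 30                       # fixed input window per inference run
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples


def pad_or_trim(samples, length=N_SAMPLES):
    """Cut a mono float signal to `length` samples, or append zeros
    (silence) until it is exactly that long."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))
```

If the understanding above is right, a 5-second clip (80,000 samples) gets 400,000 samples of silence appended, so about 83% of the window is padding and, under that assumption, each short chunk costs about as much as a full 30 s window.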


If someone wants to self host you can also try this decent web interface: https://codeberg.org/pluja/web-whisper

I'm not the creator, just a fan.


This might help out the timestamp guy for very long videos/podcasts.


I think we should make a standard browser API for transcription; otherwise, each website that wants to implement private voice recognition will need to download 500 MB of data.


Perhaps we should call it Web Speech API.


I was trolling, sorry.

This API already exists. It isn't nearly as good as Whisper.cpp (at least on macOS).

Docs: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...

Demo: https://codepen.io/Rumyra/pen/NWLyLe


An important limitation of the Web Speech API is that it only accepts audio from a microphone; you can't transcribe an audio file or a WebRTC call.


Firefox on Windows, small model: Uncaught DOMException: IDBObjectStore.put: The serialized value is too large (size=487614318 bytes, max=267386880 bytes).


The English model is really good. For my native language, Danish, not so much.


Three clicks to find out what it is:

1: “Minimal whisper.cpp example running fully in the browser”

2: “Port of OpenAI's Whisper model in C/C++”

3: “Whisper is a general-purpose speech recognition model.”


I think many people here are familiar with Whisper. For speech recognition, it is the most exciting news of the last year.



