I've been using MacWhisper for a few months now, and it's fantastic.
Sometimes I'll send an mp3 or mp4 video through it and use the resulting transcript directly.
Other times I'll run a second step through https://claude.ai/ (because of its 100,000 token context) to clean it up. My prompt for that at the moment is:
> Reformat this transcript into paragraphs and sentences, fix the capitalization and make very light edits such as removing ums
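If you wanted to script that second step rather than pasting into claude.ai by hand, here's a minimal sketch against the Anthropic API; the client library usage is real, but the model name is a placeholder and the prompt wiring is my assumption, not part of Simon's workflow:

```python
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

PROMPT = ("Reformat this transcript into paragraphs and sentences, fix the "
          "capitalization and make very light edits such as removing ums")

def clean_transcript(transcript: str) -> str:
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; use whichever Claude model you have access to
        max_tokens=4096,
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{transcript}"}],
    )
    return message.content[0].text
```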
This is so good! I studied English, then moved into linguistics, then lived in the UK for almost a decade, and due to my accent none of the STT tools come close to the approach you just mentioned (Whisper + LLM). Thanks Simon!
I have a Python script on my mac that detects when I press-and-hold the right option key, and records audio while it's pressed. On release, it transcribes it with whispercpp and pastes it. Makes it very easy to record quick voice notes. Here it is: https://github.com/corlinp/whisperer/tree/whisper.cpp
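For anyone curious what that looks like, a rough sketch of the same idea in Python follows. This is not the linked repo's code: the library choices (pynput, sounddevice), the whisper.cpp binary path, and the pbcopy "paste" step are all assumptions.

```python
# Sketch: hold right Option to record, release to transcribe with whisper.cpp.
# Note: global key listening on macOS needs Input Monitoring/Accessibility permission.
import subprocess
import tempfile

import numpy as np
import sounddevice as sd
import soundfile as sf
from pynput import keyboard

SAMPLE_RATE = 16000
frames = []
recording = False

def audio_callback(indata, frame_count, time_info, status):
    if recording:
        frames.append(indata.copy().flatten())

def on_press(key):
    global recording, frames
    if key == keyboard.Key.alt_r and not recording:
        frames = []
        recording = True

def on_release(key):
    global recording
    if key == keyboard.Key.alt_r and recording:
        recording = False
        audio = np.concatenate(frames) if frames else np.zeros(1, dtype="float32")
        with tempfile.NamedTemporaryFile(suffix=".wav") as f:
            sf.write(f.name, audio, SAMPLE_RATE)
            # whisper.cpp CLI flags: -m model, -f input file, -nt = no timestamps
            text = subprocess.run(
                ["./main", "-m", "models/ggml-base.en.bin", "-f", f.name, "-nt"],
                capture_output=True, text=True,
            ).stdout.strip()
        # Stand-in for "paste": put the transcript on the macOS clipboard.
        subprocess.run(["pbcopy"], input=text.encode())

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                    callback=audio_callback):
    with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
        listener.join()
```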
I was working on a native version in the form of a taskbar app with a customizable prompt and all. However, I quickly realized that the behaviors I want the app to do require a bunch of accessibility permissions that would block it from the App Store and require more setup steps.
> However I quickly realized that the behaviors I want the app to do require a bunch of accessibility permissions
Which behaviours specifically?
Personally, I wouldn't worry too much about the App Store. I'm distributing Enso (http://enso.sonnet.io) via gumroad.com, and people download/pay for it. I think it's easier than using the App Store Connect route anyway.
Unless it's a trusted recommendation, I'll look on the App Store first. Even then, I'll sometimes still get it from the App Store anyway:
- The store itself is convenient for browsing and discovery.
- I don't need to do any kind of background checking on the developer prior to running their app
- Similarly they don't get access to my credit card details, so I don't have to be concerned about them storing it incorrectly or abusing it later
- It's also easier for me to pay them, as often foreign transactions are blocked despite me specifically approving them.
- I don't need to hand over my email or other personal details; small developers seem really bad at storing personal information and CC details properly. I use custom emails for everyone, and numerous times I've seen my data on-sold, stolen by ex-employees, or simply exfiltrated by hacking groups.
- I don't have to re-vet anything when there is an update; I can just accept it, knowing that the developer is still trusted and that the project hasn't been hijacked, as in the numerous painful cases where popular open source projects have had malware snuck into them.
- If the app doesn't properly do what it claims (or at least what I thought it would), it's a few clicks to get a refund from Apple.
- Apple uses carrot and stick to get developers to keep their apps up to date with the system: first the carrot, and eventually the stick (delisting).
Some may argue that some of these things can still happen with an app store, but they demonstrably happen less often, and there are processes in place to deal with them when they do.
It's not a popular opinion, especially on HN, but there are plenty of developers who, whether through frustration or dealing with pedantic/rude* customer requests, treat their customers like shit; the store prevents that.
The App Store ain't perfect, and there are plenty of functions apps can't have if they're sold through it, but for all its flaws it's trustworthy and helps me utilise a far larger number of utilities than I'd normally be comfortable with personally maintaining.
* I sit on enough Discords to see how breathtakingly rude and demanding some users are without even noticing it themselves, even for totally free software.
Not being able to be on the App Store isn't an issue; it wouldn't put anyone off downloading or using it. The majority of my apps aren't from there, and I imagine most long-time users are the same.
Thanks for actually trying it out! I must admit, I didn't pay much attention to UX for installation here since it was mostly for my own use, but it's great that you got it working! What do you think?
I have not solved the bouncing icon. That's one of the reasons it needs to be rewritten in Swift!
I think it's pretty cool to have a hotkey that types STT text anywhere, locally. It also helps with using LLMs: it opens you up to more runtimes, since most don't have a Whisper plugin, and using a Whisper plugin is usually an awkward UX anyway.
- macOS native dictation, in my experience, is slow to start up (there's an indeterminate delay after pressing the dictation key)
- The accuracy is decent but the vocabulary is very limited. With Whisper, you can customize the prompt to include industry-specific terms and acronyms (see the sketch after the examples below).
See my example from the repo. Apple recognizes:
> Popular Linux distributions include Debby and Fed or Linux, and do Bantu. You can use windowing systems such as X eleven or Weiland with a desktop environment like KD plasma.
Whisper recognizes:
> Popular Linux distributions include Debian, Fedora, Linux, and Ubuntu. You can use windowing systems such as X11 or Wayland with a desktop environment like KDE Plasma.
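With the openai-whisper Python package, that prompt customization is the initial_prompt argument (whisper.cpp exposes a similar --prompt flag). A minimal sketch, with the vocabulary list purely illustrative:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base.en")
result = model.transcribe(
    "linux_talk.wav",
    # Seeding the decoder with domain terms biases it toward these spellings.
    initial_prompt="Debian, Fedora Linux, Ubuntu, X11, Wayland, KDE Plasma",
)
print(result["text"])
```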
Popular Lennox distributions include Debbie and Fedora, Lennox, and Beau you can use windowing system such as X11 or Whelan with a desktop environment like Katy plasma
Just tried the macOS one, here's my nearly worthless result:
Popular Linux distributions include Devion fedora Linux and a bunch to you can we use when doing system such as excellent Wayland with a desktop environment like Katie plasma.
macOS native dictation is not as good as Whisper in terms of accuracy; however, that will probably change with macOS Sonoma, since they're switching to a better speech recognition model (Transformer-based, IIRC).
Was it ever addressed that, even when you had the microphone turned off, it could still detect audio and reflect that in the oscillating sound wave visual? Makes me wonder if it was/is always listening, even with Hey Siri disabled.
That "sounds" to me like the mic was properly cut off electrically, but the rest of the system as active, so you'd get electrical noise coming in. E.g., the mic and amp are powered down, but the ADC is still active.
My old Sun Ultra 40 M2 had a ton of electrical noise on my headphone jack, and I could definitely tell when the CPU was busy from what I was hearing.
I meant that even after toggling it off, when you made noise or spoke there was a visual representation/feedback for it shown in the wavy graph thing. Not just ambient/moving-parts-type noise. Just thought it was weird; never really thought about it too much.
Better than dictation used to be on macOS. I tried some Whisper-based stuff, but it lacks the integration that the built-in dictation has (so I don't have to dictate somewhere else and copy/paste). The built-in dictation seems to be in the same ballpark as Whisper now, but I haven't done a comparison.
Whisper is cool. Back in college, 10-12 years ago, I wanted to do some projects with speech-to-text and text-to-speech as an interface, but at that point the only option was Google APIs that charged by the word or by the second.
On top of that, constantly sending data to google would have chewed a ton of battery compared to the "activation word" style solutions ("ok google/siri") that can be done on-device. The power for on-device processing was obviously going to come down over time, while wireless is much more governed by the laws of physics, and connectivity power budgets haven't gone down nearly as much over time. I am pretty sure there is a fundamental asymptotic limit for this, governed by Shannon entropy limit/channel width and power output. In the presence of a noise floor of X, for a bandwidth of Y, you simply cannot use less than Z total power for moving a given amount of data.
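For concreteness, the limit being gestured at is the Shannon-Hartley capacity and the resulting floor on energy per bit (the symbols below are mine, not from the comment above):

```latex
% Shannon-Hartley: capacity C (bits/s) for bandwidth B (Hz), signal power P (W),
% and noise power spectral density N_0 (W/Hz, the "noise floor"):
C = B \log_2\!\left(1 + \frac{P}{N_0 B}\right)

% Energy spent per bit is E_b = P / C. Letting B \to \infty gives the best case,
% a hard floor on how little energy can move one bit through the channel:
\frac{E_b}{N_0} \ge \ln 2 \approx 0.693 \quad (\text{about } -1.6\ \mathrm{dB})
```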
BTLE is really the first game-changer (especially if you are hooking into a broad network of receivers like Apple does with AirTags), but even then you are not really breaking this rule - you are just transmitting less often and sending less data. It's just a different spot on the curve that happens to be useful for IoT. If you are, say, doing a keyboard over BTLE where the duty cycle is higher, the power will be too. Applications that need a "100% duty cycle"/"interactive" connection (reachable at any time with minimal latency) still have not improved very much.
In hindsight I guess the answer would have been writing a mobile app that ties into Google/Siri keywords and actions, and letting the phone be the UI and only transmit BT/BTLE to the device. But BTLE hadn't hit the scene back then (or at least not nearly to the extent it has now), and I was less experienced/less aware of that solution space.
If you're looking for an alternative that runs on Linux, I just recently discovered Speech Note. It does speech to text, text to speech, and machine translation, all offline, with a GUI:
While whisper.cpp is faster than faster-whisper on macOS due to Apple's Neural Engine [0], if you have a GPU on Windows or Linux, faster-whisper [1] is a lot faster than OpenAI's reference Whisper implementation as well as whisper.cpp. The CLI options are wscribe or whisper-ctranslate2, since faster-whisper itself is only a Python library. It's pretty good.
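A minimal sketch of what the faster-whisper library usage looks like; the model name, device, and options are illustrative, so adjust for your hardware:

```python
from faster_whisper import WhisperModel

# "large-v2" on a CUDA GPU with fp16; use device="cpu" / compute_type="int8" otherwise.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe("interview.mp3", beam_size=5)
print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:7.2f} -> {segment.end:7.2f}] {segment.text}")
```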
I've used this for a few months to transcribe interviews and it works pretty well. The UI for dealing with multiple speakers is a bit cumbersome, and there are occasional crashes, but overall it's definitely a great app and worth the money.
The main problem I have faced with the Whisper model (large) is that when there is silence or a sizable gap without audio, it hallucinates and just puts out random gibberish repeatedly until the transcription ends. How does this app handle this?
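For what it's worth, the usual mitigation is to run voice activity detection and skip the silent stretches before decoding; I don't know what MacWhisper does internally. As a sketch of the idea, faster-whisper ships a built-in VAD filter (parameter values here are just an example):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2")
# vad_filter drops long silent stretches before decoding, which avoids the model
# free-running and emitting repeated gibberish over empty audio.
segments, _ = model.transcribe(
    "recording.wav",
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},
)
print(" ".join(segment.text.strip() for segment in segments))
```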
This project has been alright for transcribing audio with speaker diarization. A bit finicky. The OpenAI model is better than other paid products (Descript, Riverside), so I'm looking forward to trying MacWhisper.
Out of curiosity, does anyone know what the state of the art for transcription is? Is there a possibility it will soon be "better than a person carefully listening and manually transcribing"?
I ask because I asked a friend to record a (for fun) lecture I couldn't attend, and unfortunately the speech audio levels are quite low, so I'm trying to figure out how to extract as much info as possible and make out what was said. If I could add context to the transcriber, like "This is about the Bronze Age collapse and uses terminology commonly used in discussions on that topic", it might be even more useful.
A few weeks ago I found myself wanting a speech-to-text transcriber that directly captures my computer's audio output (i.e. not mic input, not an audio file), but I could not find one. The best alternative I found was to have my computer direct audio output to a virtual audio input device, but I could not do this on my desktop because I do not have a sound card. I found software that did this, but it did not allow me to listen to the audio output while it was redirected to a virtual audio input.
Has anyone else tried to do something similar? How did you achieve it?
Audio Hijack[1] will let you route any audio to multiple virtual or actual outputs while adding the ability to listen to any part of the signal chain. Hope that solves it for you, it’s saved my sanity a number of times!
[1] https://rogueamoeba.com/audiohijack/
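If you'd rather script it for free, the usual route is a virtual loopback device (e.g. BlackHole on macOS, or a PulseAudio/PipeWire monitor source on Linux) that shows up as a recordable input; on macOS you'd pair it with a Multi-Output Device so you can still hear the audio while it's being captured. A rough sketch, with the device name as an assumption:

```python
import sounddevice as sd
import soundfile as sf

# Assumes a loopback device is installed and set as (part of) the system output.
# List available device names with sd.query_devices() to find the right one.
DEVICE = "BlackHole 2ch"   # assumption; substitute your loopback/monitor device
SECONDS, RATE = 60, 16000

audio = sd.rec(int(SECONDS * RATE), samplerate=RATE, channels=1, device=DEVICE)
sd.wait()
sf.write("system_audio.wav", audio, RATE)  # then feed this file to Whisper
```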
Love the idea behind this. High quality transcription + the data not leaving your device is excellent.
Any chance there's an iOS version of this coming down the pike? It would be great to have a voice-based note-taking app that you can use when you're driving or walking and don't want to type into your phone: you just want to save that thought you just had by quickly dictating it, and have it accessible as text later.
I didn't know Whisper could differentiate voices for the per-speaker transcription. Is that new? Is it also available in the command-line Whisper builds?
If you want a quick and free web transcription and editing tool, we've built https://revoldiv.com/ with speaker detection and timestamps. It takes less than a minute to transcribe an hour-long video/audio file.
Good point, but the problem with local hosting is that if you want to use the larger models it will take a long time to transcribe a file. We use multiple GPUs, we do speaker detection and sound detection, and it has a rich audio editor.
Totally agree, having built a similar app I know speaker diarization is a killer feature that's hard to get. My problem is I'll never share these recordings ;).
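For context, Whisper itself doesn't separate speakers; tools typically pair it with a separate diarization model and then align the two sets of timestamps. A rough sketch of the diarization half using pyannote.audio (the checkpoint name and token handling are assumptions about your setup, not how MacWhisper does it):

```python
from pyannote.audio import Pipeline

# Requires accepting the model terms on Hugging Face and a valid access token.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)

diarization = pipeline("interview.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Each turn has a start/end time and a speaker label, which you then merge
    # with Whisper's timestamped segments.
    print(f"{turn.start:7.1f}s  {turn.end:7.1f}s  {speaker}")
```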
I tried to sign up and got a Clerk error: "You have reached your limit of 500 users. You can remove the user limit by upgrading to a paid plan or using a production instance."
Shameless plug: recently launched LLMStack (https://github.com/trypromptly/LLMStack) and I have some custom pipelines built as apps on LLMStack that I use to transcribe and translate.
Granted my use cases are not high volume or frequent, but being able to take output from Whisper and pipe it to other models has been very powerful for me. It is also amazing how good the quality of Whisper is when handling non-English audio.
We added LocalAI (https://localai.io) support to LLMStack in the last release. Will try to use whisper.cpp and see how that compares for my use cases.
Seems shady to me to charge for running larger free models you don't provide, on hardware your users provide. You are charging for OpenAI features, not yours.
Supports Tiny (English Only), Tiny, Base, Small, Medium and Large models
Translate audio file into another language through Whisper (use the Medium or Large models, the results will not be perfect and I'm working on more advanced ways to do this)
It runs locally, using Whisper.cpp[1], a Whisper implementation optimized to run on CPU, especially Apple Silicon.
Whisper itself is open source, and so is that implementation; the OpenAI endpoint is merely a convenience for those who don't wish to host a Whisper server themselves, deal with batching, rent GPUs, etc. If you're making a commercial service based on Whisper, the API might be worth it for the convenience, but if you're running it personally and have a good enough machine (an M1 MacBook Air will do), running it locally is usually better.
FWIW, I will add that most laptops made in the past 10 years are fast enough for real-time transcription. Unless you're trying to transcribe in bulk, running it locally will usually be the best option.
Any insight on how Whisper works on older Intel Macs? I have a 2012 Mac mini with 16GB of RAM doing nothing; if I could use it to (slowly) transcribe media in the background, this becomes a must-buy.
Web browsers are mostly free and don't try to upsell you to a Pro paid version. The MacWhisper author deserves to be compensated for their work, so I'm not objecting to the existence of a paid version. This feels like yet another relatively low value freemium/upsell wrapper in the Mac shareware ecosystem to me.
I'm probably wrong and there's a real population that benefits from this work, clearly some folks perceive it as useful enough to pay for it and I'm just not in that audience to see it.
I think part of what rubs me the wrong way about this is that it feels to me like commercial freeloading due to the thinness of the commercialized wrapper around a free/open core in this case (whisper model + code); it feels ethically questionable unless the author contributes back some portion of the proceeds to research in some way -- I didn't see evidence of that. I'm probably being naive here, happy to have a less snarky discussion about it though.
> Sometimes I'll send an mp3 or mp4 video through it and use the resulting transcript directly.
> Other times I'll run a second step through https://claude.ai/ (because of its 100,000 token context) to clean it up. My prompt for that at the moment is:
> Reformat this transcript into paragraphs and sentences, fix the capitalization and make very light edits such as removing ums
That's often not necessary with Whisper output. It's great when you extract captions directly from YouTube though - I wrote more about that here: https://simonwillison.net/2023/Aug/6/annotated-presentations...