Whisper.api: Open-source, self-hosted speech-to-text with fast transcription (github.com/innovatorved)
216 points by innovatorved on Aug 22, 2023 | 50 comments



This is awesome.

For anyone confused about the project, it is using whisper.cpp, a C-based runner and port of the open Whisper model from OpenAI. It is built by the team behind GGML and llama.cpp. https://github.com/ggerganov

You can fork this code, run it on your own server, and hit the API. The server itself will use FFmpeg to convert the audio file into the required format and run the C port of the Whisper model against the file.

By doing this you can free yourself from paying the fee that OpenAI charges for their Whisper service and fully own your transcriptions. The models that the author has supplied here are rather small but should run decently on a CPU. If you want to go to larger model sizes, you would likely need to change the compilation options and use a server with a GPU.
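To make that concrete, here's a rough Python sketch of hitting a self-hosted instance, based on the transcribe curl example from the project's README. The upload field name and the token header format are my assumptions; check your instance's /docs (Swagger) page for the exact request schema.

    import requests

    # Endpoint and model name come from the README's curl example; the
    # "file" field name and Authorization header format are assumptions.
    URL = "http://localhost:8000/api/v1/transcribe/?model=tiny.en.q5"
    HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

    with open("audio.wav", "rb") as f:
        resp = requests.post(URL, headers=HEADERS, files={"file": f})

    resp.raise_for_status()
    print(resp.json())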

Similar to this project, my product https://superwhisper.com uses these whisper.cpp models to provide really good dictation on macOS.

It runs really fast on the M-series chips. Most of this message was dictated using superwhisper.

Congrats to the author of this project. Seems like a useful implementation of the whisper.cpp project.

I wonder if they would accept it upstream in the examples.


One caveat here is that whisper.cpp does not offer any CUDA support at all; acceleration is only available for Apple Silicon.

If you have Nvidia hardware, the ctranslate2-based faster-whisper is very, very fast: https://github.com/guillaumekln/faster-whisper
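For a sense of the library, here's a minimal sketch; the model size, device, and compute type are illustrative, so pick what fits your hardware:

    from faster_whisper import WhisperModel

    # Illustrative settings; e.g. "large-v2" on a beefier GPU
    model = WhisperModel("medium.en", device="cuda", compute_type="float16")

    # transcribe() yields timestamped segments plus metadata
    segments, info = model.transcribe("audio.wav")
    for seg in segments:
        print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")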


ctranslate2 is amazing; I don’t know why it doesn’t get more attention.

We use it for our Willow Inference Server, which has an API that can be used directly like the OP's project and supports all Whisper models, TTS, etc:

https://github.com/toverainc/willow-inference-server

The benchmarks are pretty incredible (largely thanks to ctranslate2).


Obligatory hooking up of Willow to ChatGPT, for the best virtual assistant currently available:

https://twitter.com/Stavros/status/1693204822042739124


I haven’t used faster-whisper so I can’t compare performance, but whisper.cpp does support CUDA via cuBLAS, and it’s noticeably faster than the CPU version. I used it earlier this year to generate subtitles for 6 seasons of an old TV show I backed up from DVD that didn’t include subtitles on the disc.
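The batch job looks roughly like this. This is a sketch that assumes a cuBLAS-enabled build of whisper.cpp (WHISPER_CUBLAS=1 make) and the stock main example; paths and model choice are illustrative:

    import subprocess
    from pathlib import Path

    # Assumes whisper.cpp's `main` binary was built with cuBLAS enabled
    MODEL = "models/ggml-medium.en.bin"

    for episode in sorted(Path("rips").glob("*.mkv")):
        wav = episode.with_suffix(".wav")
        # whisper.cpp expects 16 kHz mono PCM WAV input
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(episode),
             "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(wav)],
            check=True,
        )
        # -osrt writes an .srt subtitle file next to the input
        subprocess.run(["./main", "-m", MODEL, "-f", str(wav), "-osrt"], check=True)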


Thanks for the Nvidia based implementation!

FWIW, decent acceleration works on any AVX2-compatible chipset. I get realtime speed for everything but the large models with a recent Ryzen system. Apple Silicon is good but not as special as folks think!


Many of you are asking if the project is completely self-hosted and does not rely on any third-party services. Yes, it is completely self-hosted and does not rely on any third-party services. The user accounts are just for authentication, so no one can use the service without authenticating.


Getting an authentication token does rely on a third-party service, if the README instructions are correct. It requires sending an email address to that third party.


Maybe the auth token example is meant to also hit localhost?


Huh?

"This project provides an API with user level access support to transcribe speech to text using a finetuned and processed Whisper ASR model."

Why is this a service at all? Why not just a library? Or a subprocess?


From what I can see, it runs in a docker container and uses an HTTP server to handle interaction.


Whisper API - Speech to Text Transcription

This open source project provides a self-hostable API for speech to text transcription using a finetuned Whisper ASR model. The API allows you to easily convert audio files to text through HTTP requests. Ideal for adding speech recognition capabilities to your applications.

Key features:

- Uses a finetuned Whisper model for accurate speech recognition
- Simple HTTP API for audio file transcription
- User-level access with API keys for managing usage
- Self-hostable code for your own speech transcription service
- Quantized model optimization for fast and efficient inference
- Open source implementation for customization and transparency


What was the fine-tune?

How does this compare to what is possible using https://goodsnooze.gumroad.com/l/macwhisper for example?

Thanks!


Are you able to provide more information on the fine-tuning? Any improvement in WER? What language was it fine-tuned on, and how large was the dataset used?


Any plans to add phrase timestamps, channel separation and other equivalent ASR features to make this API more approachable?


I am working on the timestamp feature. You will be able to see the option for timestamps soon.


Appreciate it! All the best.


This looks great! Does recognition use the GPU? What speed do you get on it?


Not to be confused with

Whisper – open source speech recognition by OpenAI https://news.ycombinator.com/item?id=34985848



I thought that was the same. I still don't see the difference.


It is the same model; this is a self-hosted solution.


Related to whisper: whisperX is a godsend. I can finally watch old or uncommon TV series with subtitles.


Oh dang, diarization? How well does it work?


It is surprisingly good! Not clear if you can skip the manual step where you change the likes of SPEAKER_00 to Bob and SPEAKER_01 to Sarah, but I've not had it mess up on me at all transcribing 2-hour-long conversations between 6 people.
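The relabeling itself is easy to script. A throwaway sketch, where the label-to-name mapping and file names are yours to fill in:

    # Swap the generic speaker labels for real names in a plain-text transcript
    speakers = {"SPEAKER_00": "Bob", "SPEAKER_01": "Sarah"}

    with open("transcript.txt") as f:
        text = f.read()

    for label, name in speakers.items():
        text = text.replace(label, name)

    with open("transcript.named.txt", "w") as f:
        f.write(text)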


This is not fully self-hosted so much as middleware, no?


It is completely self-hosted, but it currently supports only the tiny and base models. You can soon expect support for large models. For any requests, you can create an issue.


Nice! This will be very useful for me. I think I can run this locally and spin up a basic Telegram bot around it for personal use.

One issue I faced with all the whisper-based transcript generators is that there seems to be no good way to edit/correct the generated text with word-level timestamps. I created a small web-based tool[0] for that.

If anyone happens to be looking to edit transcripts generated using whisper, you'd probably find it useful.

[0] https://github.com/geekodour/wscribe-editor


So is "real time" translation a thing yet? I've long wanted to be able to watch non-English television and have the audio translated into English subtitles. It's doable for pre-recorded things, but not for live.

An iPhone app that could do this from the microphone would also be amazing. Google Translate and its various competitors from Microsoft/Apple are nearly there, but they all stop listening in between sentences. Something that just listened constantly, printing translated text onto the screen, would be amazing.


Just wait for a couple of weeks. I am working on speech-to-speech translation. Instead of subtitles, you can listen to it directly. I am also working on subtitles.


But I don't want that. I just want a live stream of translated text.


You can do this with PowerPoint actually. I bumped something once and Japanese subtitles popped up following what I was saying in my confusion.


For long-running stuff, https://developer.apple.com/tutorials/app-dev-training/trans... should be straightforward to translate as well using ported on-device BERT models


I've been using the Microsoft Speech API for an app and so far it's been surprisingly good for realtime speech-to-text.


How is this open source, or self-hosted, when it requires an API key and a login from a third party?


No, it is not a third party. It is just a PostgreSQL database for logging everything. You can simply visit the /docs endpoint. It is just for authentication so that you can work with different users. Once again, it's completely self-hosted.


I'm guessing the installation instructions just need a bit of love. It could seem confusing to see a token request to https://innovatorved-whisper-api.hf.space/api/v1/users/get_t...


https://innovatorved-whisper-api.hf.space/docs

Just visit the Swagger page, create an account, and then use getToken to grab a token.


I agree with freedomben that the reason people are confused is probably because your README says

> curl -X 'POST' 'https://innovatorved-whisper-api.hf.space/api/v1/users/get_t...'

while later saying

> curl -X 'POST' 'http://localhost:8000/api/v1/transcribe/?model=tiny.en.q5'

If you just change the first curl to also use localhost:8000, this will be cleared up.


What's the point of logging everything? I don't understand; isn't it possible to just handle authentication locally?


Many live streamers and platforms would love to have custom real-time transcription elements. I actually looked into this exact project of yours when I thought about creating such a thing.

Even if it meant delaying the broadcast for a second while transcribing, the accessibility value could be immense.


>Get Your token

If it's completely self-hosted why do I need to get a token? Where does the actual model run?


getToken is just an authentication layer for your requests. If you want to self-host it, just clone the repo and check the .env.example file.


I think people are confused by the README showing localhost in some places and your instance URL in others, causing them to assume you were centrally issuing tokens for using this project.


So Whisper is all the rage with speech-to-text, but what about text-to-speech?


I don't understand the excitement here. It's just an HTTP wrapper around a CLI command. You can build it easily on your own with any decent RAD framework.


Does Android OS come with ASR?


Not one that you can feed arbitrary audio files to, at least not without an app for that.


I guess I should have phrased it better: If you were building an app, does Android provide the ability to listen to arbitrary speech and convert it into text for your app?


Thank you. I love it



