Writeout.ai – Transcribe and translate any audio files (writeout.ai)
172 points by mpociot on March 8, 2023 | 92 comments



I don't understand why this doesn't actually do the transcription/translation locally. Sending the data to OpenAI for paid conversion makes no sense. Whisper can legally be run on your computer, for free.

Running it locally makes way more sense for an open-source project, because why would you pay for and depend on a third party if you don't have to?

It also makes way more sense for a service, because then _you_ don't have to hand most of your revenue to OpenAI and skim off what's left.

This is just... bewildering. I really wanted to use it, but I'm not going to pay OpenAI to transcribe podcasts for me when I can literally use the exact same model and do it locally with free, open-source code.

I'm hoping someone will fork this and teach it to run Whisper locally.

[edit: getting exactly the right versions of Python, PyTorch, and the dependencies to make Whisper run was a pain, but now I've got it set up, and it's a trivial command to transcribe every mp3 I feel like transcribing]
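For anyone curious, once it's set up, the Python API is about as simple as this (a minimal sketch; the model name and file path are placeholders):

    # Minimal local-transcription sketch using the open-source whisper
    # package (pip install openai-whisper); ffmpeg must be on your PATH.
    import whisper

    model = whisper.load_model("medium")   # "large" is slower but more accurate
    result = model.transcribe("podcast_episode.mp3")

    print(result["text"])                  # the full transcript
    for seg in result["segments"]:         # per-segment timestamps
        print(f"[{seg['start']:.1f}s -> {seg['end']:.1f}s] {seg['text']}")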



Because that would involve actual work. A box sitting somewhere that passes API calls through to OpenAI is trivial to set up.


To run Whisper transcription for free locally, you can use AirCaption (www.aircaption.com). It's an Electron desktop app running whisper.cpp (https://github.com/ggerganov/whisper.cpp). Just released a few days ago.


Try Revoldiv.com; it uses Whisper. The transcription quality is near perfect, and it's free.


Given your references to money and paying, where exactly does it indicate that this charges anything? As of Sat 3/11, 16:46 CST, I see no mention of that.


How does it handle long files? Let's say worst case: a 2-hour podcast.

What ratio are you getting (podcast length to transcription time), and does it error out memory-wise, as others suggest?


I don't know about OpenAI as a service, but on my M1 Mac I think Whisper took something on the order of 8x real time to process with the "large" model. That is to say, 8 minutes of processing for every 1 minute of audio. It was surprisingly not fast. I assume OpenAI's servers have more GPU at their disposal to make this go faster.


Are you using whisper.cpp? You really want to be using that if you care about speed. You should be able to get better than real-time transcription on an M1.


Well, that'd be why it didn't come with local transcription out of the box, then. People would have called it shite!

I can edit a podcast twice as fast as that, never mind transcribe it! Using API calls seems like it was the best approach for launch.


Seems like a "local API" would help here: something that duplicates the official API while running locally.


Good points.


I agree with you, but the reason is cost and convenience.

Whisper v2 costs $0.006 per minute of transcribed audio: https://openai.com/pricing

If you had meetings every working hour, you'd have up to ~160 hours of audio per month to transcribe. For most people, this is a gross overestimate.

Throwing all of that audio at OpenAI's API would cost $57.60 per month (160 h × 60 min/h × $0.006/min), and it also frees you from having to set up and maintain local inference.


"cost and convenience": cost: $57.60 vs 0 Why would you want to pay nearly $700 a year just to avoid running a program in the background on whatever computer you already have open?

On convenience: yes, it's a nicer interface, but the current state of the "geeky" version is typing one command at the command line with a path to a file. The end. Unless you're really afraid of the command line, it's not that much more convenient.

The text line being highlighted while you listen is nice, but a) we wrote something that did this at the word level (as opposed to the sentence-ish level) nearly 20 years ago, and b) in this context it's not actually that useful. With video, sure: you can click the text and go to the right place in the video. With spoken audio (what this is best at), you click and go to the point... where they're saying what you just read. Unless you really want to hear what you just read, there's not a lot of added value.

Would it be good for podcasts to use an interface like this for playback? Absolutely. It'd be a massive upgrade, but that's not what this is offering.

Maybe someone will extract that code and let us combine an MP3 and a timestamped text file on a website (if that doesn't already exist). That'd be cool.
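The simplest version is just a page that seeks the audio element when you click a line. A toy sketch (my own illustration, taking Whisper-style segments as input):

    # Toy sketch: render Whisper segments as a clickable transcript next
    # to an <audio> element. Illustrative only, not production HTML.
    def transcript_page(mp3_url, segments):
        paras = "".join(
            '<p onclick="document.getElementById(\'player\').currentTime=%s">%s</p>'
            % (seg["start"], seg["text"])
            for seg in segments
        )
        return ('<audio id="player" controls src="%s"></audio>\n%s'
                % (mp3_url, paras))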

But the cost you propose is way too much for most people, especially in countries that aren't rich. In many places, $400 a month is a really good salary. So yeah, if you're rich, $700 a year is not a big deal, but...


First, that $57.60 is a VERY pessimistic upper limit. Remember, it's based on having to transcribe every working hour of every day. The number of hours per month requiring transcription is probably Pareto-distributed across the workforce. I'd bet 90% of people would need to transcribe at most ~4 hours per month (one important 1-hour meeting per week), corresponding to an API cost of $1.44 per month (4 × 60 × $0.006).

Second, don't underestimate the business value of a nice interface. IMO, excellent UI/UX is part of why ChatGPT took off the way it did. The number of people willing to pay a few dollars per month to never see a command line is quite a bit larger than the number willing to host their own `whisper-large` inference.

Speaking of hosting: do you already own hardware that supports sufficiently fast inference? If not, how much would a good-enough cloud instance cost per month? That depends on how fast is fast enough, but it's more than $0, that's for sure.


I love all of these random "companies" popping up that just make either one or a handful of API calls to OpenAI. Come on people, try harder!


People are excited and doing things, that’s wonderful. What are you doing besides complaining?

Everything starts small.

Also, the most important thing about a service is attracting customers, not the tech stack under it. Facebook was made with PHP, and Twitter famously fell over constantly while struggling with user growth.

I’d much rather have tons of users with a tech stack that is a wrapper for a bunch of other stuff, than have super impressive in-house tech and no users.


> People are excited and doing things, that’s wonderful. What are you doing besides complaining?

I don't think people are excited about making API calls. They see a land grab and are clamoring for their piece. As for what I'm doing, I work on my own products that, I hope, push the envelope, at least slightly. And I have seen AI companies that are doing good work using OpenAI's tools, but this isn't one of them.


> As for what I'm doing, I work on my own products that, I hope, push the envelope, at least slightly

That’s awesome. Please post about it on HN, share what you are doing and get people’s feedback.

But please don’t just throw shade at others because they don’t conform to your view of what’s praiseworthy.

We can all support each other and give feedback/advice.


Out of curiosity, could you share those envelope pushing products of yours?


I would, but then I'd sacrifice the anonymity of this HN account that I like to shitpost from.


I am excited. I need this tool professionally and I don't give a damn how it works underneath if it can work for me and give results.


It's a fitting metaphor for a good part of the startup scene: good-looking gift wrapping around tools built by smarter people.


In their defense, there's a lot of value there. As I commented elsewhere here, I'm frustrated that this particular thing isn't running the transcription locally, but it's a _massive_ improvement on what the "tools built by smarter people" built.

Sometimes "good looking gift wrapping" is a huge value unto itself. Also, it isn't fair to good UX and UI developers to imply that that isn't also really hard work to get right. It's just different work using a different form of thinking. Not lesser in any way. And... without the people who could make the "good looking gift wrapping" most apps would suck a lot harder than they already do.


It's a complete grift: everyone and their llamas are suddenly AI companies when all they're doing is hitting the OpenAI API. So when it goes down, their entire business goes down with it.

This hype will eventually subside, with lots of losers and a tiny minority of winners once the price increases come in.

The only winner of this race to the bottom is Stability.ai, which is already open-sourcing everything; OpenAI cannot afford to open-source its flagship AI product(s).


Agreed.

The current AI hype cycle has driven companies to slap AI somewhere into their offering so they can call themselves an AI company, even if it's just an API key and an intern spending half a day on an API wrapper.

Gatekeeping is always risky, but in my mind, if you're not at least touching an ML framework, you're not an "AI company", which is already, IMO, a pretty low bar. That said, it gets really hazy when you look at things like SageMaker and other offerings where you're doing abstracted model development or substantial amounts of fine-tuning/training on a custom dataset.


The low-hanging fruit always comes first. Sure, you and I could whip this together in an afternoon, but for non-tech people these simple tools are very handy, since they put a UI on an API.


Does it have to satisfy your standard of being novel, or be super approachable and adoptable by customers?


This project is meant mainly for educational purposes, which is why it's entirely open source.


Is there any chance you could expose a pathway to use a local instance of Whisper? I ask primarily because OpenAI completely open-sourced Whisper in September 2022[0]. It seems odd to default to, or encourage the use of, a paid service for something that appears to be available for free under the MIT license, models included[1].

My understanding is that the only reason OpenAI even set up the paid API is that Whisper "can also be hard to run [sic]". Personally, I'm skeptical. I'm not knocking them for it, but I could see how this is just brand capitalization.

[0]: https://openai.com/blog/introducing-chatgpt-and-whisper-apis...

[1]: https://github.com/openai/whisper


If you run the large-v2 model they expose via the API (the most accurate one) on your local machine, you'll see that even though it works great, it's slow, and it won't work for long audio files because of memory limitations.

It's fairly easy and quick to run Whisper for free, either locally (in an Anaconda environment, with Python or the command-line interface) or, even better, in a Google Colab notebook.

Here's a sample notebook that builds on a notebook by Pete Warden.

https://colab.research.google.com/drive/1sxsey3n0jd09MjUd9Ky...


On a 1080 Ti (a 6-year-old GPU), the large model runs at 1x real time (transcribing 10 minutes takes 10 minutes), and I've successfully transcribed even 1h+ files.


FWIW, an optimized implementation I've been working on comes in at roughly 70x real time (large-v2, beam size 5) on an RTX 3090.


Nice! Are you going to release it publicly?


Great question!

We're still very early-stage and in stealth, so it's not quite clear to us where our lines are with regard to secret sauce/significant competitive advantage.

As the CTO (and lead dev), I'd lean toward open-sourcing it (because it's awesome, and we're standing on the shoulders of open-source giants already), but it may become clear that it's too differentiating to open-source. As I said, it's just too early to tell.

What I can say is if we open source it HN will be the first to hear about it!


> My understanding is that the only reason OpenAI even set up the paid API is that Whisper "can also be hard to run [sic]". Personally, I'm skeptical. I'm not knocking them for it, but I could see how this is just brand capitalization.

Why is it hard to see that not every organization has the capability to set up its own transcription/translation cluster: provisioning GPUs, frontends, scaling, on-call rotations, regularly updating models...? It's not just "brand capitalization". An API you can call to transcribe/translate a recording with zero extra work is absolutely essential for most.


I have a pipeline set up in https://github.com/cnbeining/Whisper_Notebook/blob/master/Wh...

It does the following (a sketch of the SRT step is below):

- Run voice-activity detection for better timestamp output
- Transcribe with Whisper
- Run forced alignment to get per-word timestamps
- Create a better-segmented SRT
- Translate (multiple APIs implemented: DeepL, Google Translate, Baidu, and a couple more)
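If you only need the Whisper-to-SRT step, a minimal sketch looks like this (my own illustration, not code from that notebook; file names are placeholders):

    # Sketch: write Whisper's segments out as an SRT file.
    import whisper

    def srt_timestamp(seconds):
        ms = int(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    model = whisper.load_model("medium")
    result = model.transcribe("episode.mp3")

    with open("episode.srt", "w", encoding="utf-8") as srt:
        for i, seg in enumerate(result["segments"], start=1):
            srt.write(f"{i}\n")
            srt.write(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
            srt.write(seg["text"].strip() + "\n\n")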


The API is useful because not everyone has fast 10+ GB VRAM GPUs lying around.


You know, this is true. I was a bit too dismissive, because I haven't done much model deployment myself. I assumed it was similar to many other services, but even looking at the pricing for managed GPU instances shows me that's clearly not the case.


Just a note for anyone basing their business on the .ai TLD.

It's technically the domain for Anguilla, a literal British colony in the Caribbean.

It appears to be managed by some random guy; check out the .ai registration FAQ: http://whois.ai/faq.html

If you're going to use .ai, just be aware that the top level of the domain appears to be managed by some dude with a Gmail account. That's not necessarily bad, but it's something to consider if you're planning to host your billion-dollar AI startup on it.


I wonder how a single individual can acquire control of a ccTLD.

I always thought they would need to be vetted by the government of the country the ccTLD represents.


I think they found the one weird software dev in Anguilla willing to do it. The "Offshore Information Services" registry link on the .ai Wikipedia page redirects to the dude's Wikipedia page.

https://en.wikipedia.org/wiki/.ai

https://en.wikipedia.org/wiki/Vince_Cate

A colorful character, to say the least, and exactly the kind of person I'd expect to be running the ccTLD of a small Caribbean island.


> Cate engaged in civil disobedience against U.S. cryptography policy by setting up a webpage inviting readers to "become an international arms trafficker in one click". The page contained an HTML form which, when submitted, would e-mail three lines of Perl code implementing the RSA public-key encryption algorithm to a server in Anguilla; this could have qualified as unlicensed export of munitions under U.S. law at the time.


The definition of "chaotic neutral"!


I want something I can self-host. I'm perfectly OK with a single language and a few mistakes here and there.

Does such a thing exist? I would gladly donate to a Kickstarter for this before trying to build one myself.


Just download Whisper...

If you own a GPU, use this one: https://github.com/openai/whisper

If you don't own a GPU, use this one: https://github.com/ggerganov/whisper.cpp (this one is very, very slow)


whisperx also adds improved timestamping, closed-caption output, and beta diarization (speaker labeling) support. Unfortunately, it doesn't seem to support m4a out of the box, but you can convert to mp3 (upgrade the sound-lib dependency first) or wav with ffmpeg.


whisper.cpp is not universally "very, very slow". With an M1 MacBook and the medium model, it's faster than real time. There may be some accuracy loss, because it uses a different search method, and more if you choose to run a smaller model.


You mean without using the OpenAI API? This project is open source and on GitHub, so you can self-host this if you want!


You (essentially) need a GPU, but here you go:

https://github.com/ahmetoner/whisper-asr-webservice

For your requirements, the medium.en model (at most) should be satisfactory.


https://github.com/ggerganov/whisper.cpp makes it relatively feasible to run on CPU.


Yes, but it doesn't provide an HTTP (or any other) API; it's CLI-only.

OP said "self host" so I assumed they're looking for an implementation that provides an API endpoint.

It would be straightforward enough to create an API utilizing whisper.cpp, but I'm not aware that such a thing exists.
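For what it's worth, here's a minimal sketch of such an endpoint, using the Python whisper package rather than whisper.cpp (which would need a subprocess wrapper instead); the route name and model choice are arbitrary:

    # Sketch of a self-hosted transcription endpoint with FastAPI and the
    # open-source whisper package. Requires:
    # pip install fastapi uvicorn python-multipart openai-whisper
    import tempfile

    import whisper
    from fastapi import FastAPI, File, UploadFile

    app = FastAPI()
    model = whisper.load_model("medium.en")  # loaded once at startup

    @app.post("/transcribe")
    async def transcribe(file: UploadFile = File(...)):
        # whisper wants a file path, so spool the upload to a temp file;
        # ffmpeg sniffs the content, so the suffix is mostly cosmetic.
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            tmp.write(await file.read())
            tmp.flush()
            result = model.transcribe(tmp.name)
        return {"text": result["text"]}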

Additionally, depending on requirements: whisper.cpp is remarkably performant considering it's running on CPU, but it's still nowhere near competitive with GPU implementations (for obvious reasons). Depending on expectations vs. the GPU-powered OpenAI Whisper endpoint, it could be disappointing.

The whisper.cpp benchmarks show 3:24 of audio transcribed with medium.en in 30 seconds (on an M1 Pro!), which is, again, incredible considering the hardware. That's 6.8x real time.

As an example: we've spent quite a bit of time optimizing our self-hosted Whisper API endpoint, and it can do 3 minutes of audio (the max we currently care about) in 2.5 seconds with large-v2 and beam size 5 on an RTX 3090. That's 72x real time with a much larger, more capable, and more accurate model, and we have further work to do.

Our focus is primarily "real-time" dictation tasks with ~10 s sentences. All in, with internet latency of ~70 ms, end-to-end time (from the end of a 10 s audio segment to returned results) is currently roughly 700 ms. With medium.en it's 400 ms all in.

Not a fair comparison, but it's yet another example of the massive performance differences between CPU and GPU for tasks like this.

Additionally, my experience with this project has illustrated to me (yet again) the gulf between "we opened our model" and "actually using it at scale, in production, competitively in the marketplace". It's a HUGE difference, and the resources, knowledge, etc. required are substantial.


> You (essentially) need a GPU, but here you go

Don't most devs already have a powerful GPU? Maybe I'm biased from also being a gamer and having worked in game development, which requires a powerful GPU anyway.


Whisper is extremely simple to use on the command line. Just install it with pip and you're off to the races.


FYI: "Writeout uses the recently released OpenAI Whisper API to transcribe audio files. You can upload any audio file, and the application will send it through the OpenAI Whisper API using Laravel's queued jobs. Translation makes use of the new OpenAI Chat API and chunks the generated VTT file into smaller parts to fit them into the prompt context limit."


If anyone wants transcription locally (on-device) on macOS or iOS, I just released a free app for it: https://sindresorhus.com/aiko It runs Whisper on your device.


Thanks a lot for making this! Just last week I was trying out the transcription APIs of AWS and Google Cloud, and they produced rather bad results for a German interview (wrong punctuation and capitalization, and about one misheard word per sentence).

I didn't know OpenAI had an API for this as well, but now I've tried it, and it's magnitudes better: perfect spelling and only one wrong word (an abbreviation) in 2 minutes of audio, and even that I was able to understand. It even filters out filler words!

You just saved me literally hours of work by showing the power of OpenAI!

(Reading this back it sounds like an ad, but I'm in no way affiliated with any of those services. I'm just very happy.)


Note that you can run OpenAI's Whisper locally. The model and tools are open-sourced. It's finicky to set up if you're not a Python dev, but I just wanted to let you know that it's an option, and it works literally exactly as well. You can even choose to sacrifice quality for conversion speed. The experience of using it is just a lot... geekier: a command-line call that produces a text format with timestamps on every line.


This uses Whisper behind the scenes, and it's true that running Whisper locally is easy, so this may not be needed (it even adds overhead: Whisper gives me the .srt file directly, I can ask for the tiny model to make it faster, etc.).

BUT if my brother (an accountant) needed something like this, he wouldn't be able to install Whisper; he wouldn't even be able to open GitHub. So I think frontend GUIs that run models behind the scenes are always welcome.

I think this would be much better if it ran Whisper on their own server instead of using an external API, but that's their decision.


Btw, your imprint and privacy policy pages need margins, and their references to whatthediff need updating.


I don't see any benefit over using Whisper locally or calling the API directly.


OpenAI isn't running a charity. This "free" service is going to run into that reality sooner or later, so I'd suggest not using it for any real work and instead paying for the Whisper API directly.


Or use this nice UI to trial it for now, and when that reality arrives... transition to the API, if you still need to do what you need to do?

So I'd suggest using this project to try things out, then setting up Whisper locally :)

But what do I know, I'm just in the ether...


When you say free, do we still need to subscribe to other services?


Well, if you want to self-host it, then yes: you'll need an OpenAI account/API key for it to work.


Does anyone know of a complete system or player that automatically generates subtitles based on speech recognition (apart from YouTube)?

There are a lot of older series/movies where the speech is hard to discern but no subtitles are available for download.

I've been thinking about creating an auto-subtitle app for years but haven't had a free day to tackle it. I hope someone else beats me to it.


What languages does it support?


It uses the OpenAI Whisper API, which according to their API documentation supports: Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

See https://platform.openai.com/docs/guides/speech-to-text/quick...
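If you're calling the API directly, you can also pass an explicit language hint. A minimal sketch with the 2023-era openai Python client (the file name is a placeholder):

    # Sketch: transcribe with an ISO-639-1 language hint via the Whisper API.
    import openai

    with open("interview_de.mp3", "rb") as audio:
        transcript = openai.Audio.transcribe(
            model="whisper-1",
            file=audio,
            language="de",   # optional; Whisper auto-detects otherwise
        )
    print(transcript["text"])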


A good tool to translate and decipher the lyrics of ATL rappers like Future.


At the very least, they should host the model themselves and point their service at their own locally hosted Whisper instance...


The title seems quite disingenuous.

A better description would be "a PHP-based web app that calls OpenAI's Whisper API to transcribe speech".


I agree. Kudos to the author for sharing a working example of using OpenAI's PHP Whisper client, though. Digging a bit deeper into the organization that released this provides more context: https://beyondco.de/. It appears to be Laravel-oriented.


The main reason people add the tech stack is marketing.

The title describes what it does; I think you're making a mountain out of an anthill.


Why PHP, though? Couldn't the whole thing run completely in the browser?


Many people on HN infamously called Dropbox just an rsync script, right?

It's usually all in the details and delivery (and ya'know we're lazy and lack time to setup stuff locally)

Though I wouldn't really knock anything free and open source either way.


The objection here is more structural than technical. The famous Dropbox objection was "anyone could do this", even though most people might not have the wherewithal to do so. The objection here is that the open-source project relies on a closed-source paid service to do all the heavy lifting. Someone will need to foot the bill, which means this project will eventually have to answer some tough questions about funding and what it actually delivers.


Whisper is open source.


Where can I download the source from?



This is not open source. The wrapper may be, but it's using a non-open-source cloud service.


This thread is about the wrapper, which is open source.

You can run Whisper locally, and it is open source.

Feel free to fork this open source project and adapt it to a locally run Whisper instance.


[flagged]


Please don't break the site guidelines like this, regardless of how wrong someone is or you feel they are.

Rather, please make your substantive points thoughtfully and without name-calling or swipes.

https://news.ycombinator.com/newsguidelines.html


It's disingenuous because literally none of the code transcribes or translates audio.

This is NOT an app that transcribes, or translates, audio.

This is a front end to another company's service.

In its defense, it's a useful front end, because getting Whisper running locally was a pain in the butt thanks to PyTorch's specific Python requirements (not too old, not too new... juuuuust right).

This app also looks like it does very useful things with what Whisper outputs.

But it is 100% disingenuous, because it does none of the things it markets itself as doing. I was expecting it to run Whisper locally, not call out to a paid service.


Download Whisper and the models and run it in a Docker container as a server; then it's open source.

Honestly, try to see it as a favor that it uses OpenAI's endpoint, since some of us don't think it's feasible to keep a GPU-loaded server running 24/7 just for occasional transcriptions.


This is a really bad comparison. Expedia didn't build its service in a way that makes users think the hotel they're booking belongs to Expedia. No one is going to buy an Air France flight from them and expect the plane to be flown by Expedia employees.


...only on HN


Personally, I would expect "transcribe any audio" to mean music transcription.


I think that's fair, but I also think it's mostly just musicians who would ever read it that way. I don't think the average person (geek or not) would assume that. I'm a musician, and I didn't think it'd write sheet music.

As a geek who's done basic music arrangement, I also know that's an incredibly hard problem once you introduce modern instruments. Even staying with just classical ones: differentiating between a violin part and a viola? Or even a cello playing a high note vs. a viola? Like... wow, that would be SO hard.

We're barely getting words right. I don't think we're anywhere close to transcribing a full band or orchestra in a meaningful way. Extracting the melody? Sure. Chord changes? Sure. Actually producing an accurate and even remotely complete transcription? Incredibly hard.


Orchestration is its own ball of wax.

If there were software I could just dump an MP3 into and get back a basic chart with chords and a melody... that'd be pretty amazing. I've done it by hand, and the results were even published... it ain't easy. Thirty minutes of moderately complex pop rock took a couple of months, off and on.


Why not just use whisper.cpp locally?


Just tried it out and wow!

Insane.

Free and open source.

Thank you!


The number of people on HN b****ing about people shipping MVPs built on OpenAI is hilarious, and it makes me feel like HN has definitely jumped the shark.



