Craig from YC here. Ramon Recuero and I built SpeechBoard.
Here's how it works: you record a podcast and upload it to SpeechBoard. We run it through a few speech-to-text APIs to generate a transcription for you. From there you can delete words from the transcript and we cut them from the audio.
Then you can download three files: your edited audio, your original file with cuts marked in metadata for importing to Audition/Audacity, and labels for importing into Audacity.
Nailing the in/out points of words was the hardest part, which led us to create the Audition/Audacity import feature, and now I think that's the best part :)
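For anyone curious what "delete words, then cut the audio" could look like, here is a minimal sketch. It is not SpeechBoard's actual code. It assumes the speech-to-text step returns a hypothetical list of (word, start, end) tuples in seconds, merges adjacent deleted words into one cut region, and writes those regions in the tab-separated start/end/label layout that Audacity imports as a label track. The function names and the "cut" label text are made up for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical word timing format: (word, start_seconds, end_seconds).
Word = Tuple[str, float, float]
Region = Tuple[float, float]


def cut_regions(words: List[Word], deleted: Set[int]) -> List[Region]:
    """Merge the time spans of deleted words into contiguous cut regions."""
    regions: List[Region] = []
    prev_deleted = False
    for i, (_, start, end) in enumerate(words):
        if i in deleted:
            if prev_deleted:
                # Adjacent deleted words: extend the open region through the gap.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
            prev_deleted = True
        else:
            prev_deleted = False
    return regions


def audacity_labels(regions: List[Region], text: str = "cut") -> str:
    """Render regions as an Audacity label track: start<TAB>end<TAB>label."""
    return "".join(f"{s:.6f}\t{e:.6f}\t{text}\n" for s, e in regions)


if __name__ == "__main__":
    words = [("hello", 0.0, 0.4), ("um", 0.5, 0.7), ("uh", 0.8, 1.0), ("world", 1.1, 1.5)]
    print(audacity_labels(cut_regions(words, {1, 2})), end="")
```

The hard part mentioned above, getting the in/out points right, lives entirely in the timestamps. This sketch trusts them as given, which is why exporting the cuts as labels for manual adjustment is useful.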
This is great! You've nailed one of the major pain-points in creating and producing high quality podcasts/audiobooks.
The other major pain-point that I've identified after spending 10 years in the podcast and audiobook industries is audio mastering. As a mastering engineer, I know how troublesome it can be to master audio so that it is at a competitive dynamic range.
Would you be interested in implementing an "auto-master" feature that I've developed? If so, please reach out! My email is in my HN profile.
Nice! Yeah, we've run into a few interesting use cases already, e.g. sermons, which I didn't even know were recorded.
So it's written in Python and the main API we rely on is IBM's Speech to Text.
One thing that became an issue was deleting half of a word or inserting characters. Now we handle that with JavaScript. In the future we'd like to get some Lyrebird in there :)
Thanks! It's likely obvious to users with a technical background, but others might not realize that the STT APIs used by this service could retain uploaded audio data.
Very cool! I did find that if I remove "three" there's a noticeable jump in the resulting audio. But it's great that I can take the output and further refine it.
This reminds me of a recent Radiolab episode [0] in which they discuss a new technology that would allow you not only to remove words from audio, but also to add new ones in the speaker's voice -- and likeness, in the case of video. They made an example video [1], which was not super-convincing, but a very good demonstration of what it may be able to do in the future.
Of course, the focal point of the episode isn't the technology itself, but the implications it could have for society once it gets good enough that its output is virtually indistinguishable from real video (i.e. fake news in the form of convincing-looking videos).
Very interesting application. I am not sure if you guys have looked into this, but there is a Python library that can detect timestamps on word level if given the audio and transcript. It's pretty accurate for English: https://github.com/readbeyond/aeneas
There is a podcast I listen to every time it comes out (Uhh Yeah Dude), but the creators don't create transcripts. One idea I had was directly targeting podcast creators with a service that produces searchable transcripts for listeners looking for an old episode.
I think the transcript creation service in itself must be worth something for these guys.
I would be quite interested in that, especially if it could integrate with my podcast app.
Since August of 2016 I've listened to 30 days' worth of podcasts and saved another 22 days' worth of time by skipping introductions, listening faster than 1x, etc. The most annoying thing I'm facing is that I've heard hundreds of stories and the audio cannot be indexed easily. If I want to send a friend to a specific episode for a certain story, I have no good way to remember if it was the Freakonomics podcast, This American Life, Story Collider, Planet Money, or one of the 20+ other podcasts I listen to.
I'd love a system which would make available a searchable transcript of every podcast. I couldn't pay for transcribing all of them, but I'd pay 50 cents/podcast. Google tells me $1.75/minute will get a transcription from the top listed service, so an hour of audio costs $105, and 210 people like me paying 50 cents each would cover it.
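The back-of-the-envelope math, as a quick sketch for other episode lengths. The rate and the pledge are the figures quoted above. The function name is made up, and the calculation is done in integer cents to avoid float rounding surprises.

```python
PRICE_PER_MINUTE_CENTS = 175  # $1.75/minute, the quoted transcription rate
PLEDGE_CENTS = 50             # 50 cents per listener per episode


def listeners_needed(minutes: int) -> int:
    """Smallest number of 50-cent pledges that covers the transcription cost."""
    cost_cents = minutes * PRICE_PER_MINUTE_CENTS
    # Ceiling division on integers: round up to a whole listener.
    return -(-cost_cents // PLEDGE_CENTS)


if __name__ == "__main__":
    for minutes in (30, 45, 60):
        print(minutes, "min ->", listeners_needed(minutes), "listeners")
```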
I've said this a few times but happy to say it again: I will very cheerfully pay for a good podcast -> transcript service.
I speed-read. I don't speed-listen. And my lifestyle has zero podcast-compatible travel time. So there's a whole world of great content in a very user-unfriendly format for me currently.
Yeah, I think there are totally people out there who'd pay for it.
YC does :)
For the YC podcast we chose to host with Backtracks because they offer transcription and allow you to link back to exact times in the episode, sort of like YouTube.
This is really cool. There's something magical about just editing text, and then having real spoken audio change to reflect it.
I hit an error when I trimmed the text down to:
> Hey this is a different original.
> The text cuts into
Maybe I was too aggressive?
Some undo support would also be really helpful. Have you considered just having a free-form text field, then using something like wdiff to produce the edits? That might make the UX easier, since you wouldn't have to manually reinvent the text editing tools people expect (although you'd have to handle invalid edits, like people adding new words).
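A rough sketch of the diff idea, using Python's standard `difflib` in place of wdiff. It compares the original and edited transcripts word by word, returns the indices of deleted words, and rejects anything that is not a pure deletion (new words, or a word cut in half). The function name and the example sentence are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def deleted_word_indices(original: str, edited: str) -> List[int]:
    """Return indices of words removed from `original`; reject any other edit."""
    orig_words = original.split()
    new_words = edited.split()
    matcher = SequenceMatcher(a=orig_words, b=new_words, autojunk=False)
    deleted: List[int] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            deleted.extend(range(i1, i2))
        elif tag in ("insert", "replace"):
            # Added or altered words have no audio to back them.
            raise ValueError(f"unsupported edit: {' '.join(new_words[j1:j2])!r}")
    return deleted


if __name__ == "__main__":
    before = "Hey this is a different take on the original"
    after = "Hey this is a different original"
    print(deleted_word_indices(before, after))
```

The returned indices can then be mapped to word timestamps to decide what to cut. A half-deleted word shows up as a "replace" and is rejected, which covers the other invalid-edit case.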
Yeah, without looking at your logs I suspect you cut too much. :)
We were using a free-form text field before and it led to a couple of issues: cutting words in half and inserting new text. Both of those basically break it right now, so we went for a slightly less convenient but mostly functional demo.
I totally agree though, this needs a lot of polish on the UX side.
This looks very interesting. You might be the right person to ask about something related that I am currently working on: do you know of any app that would extract parts of audio based on a keyword or a name? For example, extract only the parts where Elon Musk speaks, given audio input (podcast, YouTube, etc.)? Alternatively, extract only the parts (-30 and +30 seconds) where a specific word is mentioned.
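I don't know of an app, but the second variant (30 seconds either side of a keyword) is simple once you have word-level timestamps. A minimal sketch, assuming a hypothetical list of (word, start, end) tuples in seconds from whatever transcription step you use. It pads each hit and merges overlapping windows so nearby mentions become one clip. The speaker-based variant needs speaker labels and is not covered here.

```python
from typing import List, Tuple

# Hypothetical word timing format: (word, start_seconds, end_seconds).
Word = Tuple[str, float, float]


def keyword_windows(words: List[Word], keyword: str, pad: float = 30.0) -> List[Tuple[float, float]]:
    """Return merged (start, end) windows padded around each keyword mention."""
    target = keyword.lower()
    hits = [
        (max(0.0, start - pad), end + pad)
        for word, start, end in words
        if word.lower().strip(".,!?") == target
    ]
    merged: List[Tuple[float, float]] = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    words = [("Tesla", 10.0, 10.5), ("is", 10.5, 11.0), ("Tesla", 50.0, 50.5), ("tesla.", 200.0, 200.5)]
    print(keyword_windows(words, "tesla"))
```

The windows are only clamped at zero, so you would still need to clip the last one to the length of the recording before cutting.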
> This looks very interesting. You might be the right person to ask about something related
Hi Craig, do you know an app/code that can split the audio/transcript based on persons? Detect different persons in a podcast and group the transcript by person. Thanks!
Hi. Not for me. Just thinking out loud: if the transcripts are automated by your service, it'd be a way for podcasters to provide them along with their audio recordings.
I would like an (automated) “IMDB for Podcasts.” Specifically, I would like to be able to find, and be notified of, every podcast where a given person speaks.
Similarly, I’d like automated data on what ads are run/read.
Essentially, I’d like rich automated metadata, in addition to timestamped transcripts.
Let us know what you think!