Show HN: Cleanvoice – Automated Podcast Editing (cleanvoice.ai)
221 points by autoencoders on Nov 20, 2021 | 117 comments



As someone who has personally edited over a hundred 1-2 hour podcasts with a new guest every time, removing umms, ahhs, dead air and filler words is soul crushing. It has gotten to the point where after 2 years of running my podcast[0] I'm seriously considering stopping the show because I'm getting burnt out from editing and without sponsors it's not feasible to hire an editor. But even with the show making no money, I would happily pay triple your asking price if I could click a button and have the problem solved in a way that matched a human's ability to edit out filler words.

It really is the difference between being able to edit a 1 hour episode in 1 real life hour (editing at 2x speed) vs literally spending 5 hours to edit 1 hour when there's a lot of filler words or ums. That's due to having to stop every few seconds, decide where to cut, and perform the cut. This is using a heavily optimized keyboard-shortcut-focused workflow too.

I hope you don't mind constructive criticism but in my opinion your "after" version doesn't sound natural. This isn't an attack on your service specifically, because the outcome is the same with all of the automated tools I've tried. I haven't tried them all but I did play with a few of them.

For example in your case the pause between "Removing" and "filler" doesn't match the pace of the rest of the sentence and the transition from "very" to "time" has a very hard cut. This is also a 10 word clip that's about 6 seconds. If you listened to a 1 hour podcast episode that was edited like this it would be much more noticeable.

There's so many intricate and subtle details around when and what to cut to remove these things in a way where it's not noticeable. Are there any paths moving forward in AI / ML that can lead to this being indistinguishable from being humanly edited?

I debated deleting this comment before posting it because it's a combination of feedback but also saying the service isn't something I would buy in its current state but I'd like to think it's more beneficial to post this to show there is a real demand for this service if it can be executed flawlessly.

[0]: https://runninginproduction.com/


I use the editing software “descript” and this process (removing ums) takes just a few minutes even for a long show, because you just delete the words in text. They even have a button, remove “ums”. It’s a game changer.


Meta, but your comment was (IMHO) a great example of constructive criticism. Show HN is about that, not just staying silent and letting the user's work die.


Funnily enough I was about to start building this then found descript[1]. It transcribes the text and allows you to edit the transcription then export it as audio.

[1] https://www.descript.com/


revoldiv.com has a similar feature set


I’ve not edited anywhere near as much as you have but I agree, it’s so tedious, and by the end of an editing session you can really start to resent the guest and all their verbal tics. I find I get a good idea of what the waveforms look like for some noises and can see them coming and preemptively split the track at the start with a decent success rate.

Using RiversideFM to get two local recordings is also a big help.

I was sat next to an audio editor and producer at a wedding recently and we got on to this topic and he said “your number one job when editing an interview is to make the host sound good and then just do the minimum on the guest, otherwise you’ll waste too much time”.

Doing that kind of editing 8 hours a day, I can see why he says that.


Yeah it's weird. I have these in-depth technical conversations with every guest where it's great, I love that part. The frequency of verbal tics and filler content really takes an edit from "this isn't too bad" to "what the fuck am I doing with my life?", all based on how many times you need to remove filler content within the first 5 minutes of editing a 90 minute show.

I'm kind of surprised that wedding producer openly said that. My philosophy has always been the opposite. One of my main goals of the show is to make the guest walk away thinking this was the best podcast experience they ever had from start to finish as well as do everything I can to make them come off as good as possible.

I rarely cut content but most episodes have hundreds of manual edits to remove filler content and create a more concise flow by removing long pauses because my 2nd main goal is to optimize for the listener. I keep the edits organic at the same time by leaving in some filler content and subtle things like a deep inhale or a sigh because there's a lot of meaning around that when it comes to sentiment and tone, the same can be said for sometimes leaving in an extra 500ms pause to amplify the meaning behind something. At the same time, sometimes filler content gets left in because it flowed too quickly into the next word so cutting it sounds too unnatural as if it clipped.

This is why I think it's a crazy hard problem to get a machine to be able to make decisions like this.

I do use separate recordings (we each record our track locally), it definitely helps eliminate the few cases where we talk over each other or being able to lower the volume of a laugh so it doesn't overpower what the other person said while still keeping it in because it's a good part of a conversation and a snort or laugh can easily be the difference between a listener wondering if the guest was offended or happily agreeing with something.


The edit on the page is not the best. I agree! Mainly, if your recording is unnatural (like that one), the edit is also unnatural. However, the tool works better on an interview podcast. I would strongly recommend just uploading a sample, and you'll see a big difference.

Regarding whether ML will become indistinguishable from human editing: hard to tell. I think it will be like self-driving cars in the future: 98% good edits, 2% bad edits.


This is a super cool product, congratulations. I especially like the extremely clear value proposition on the homepage. I know what this is and who it is for right away.

My first impression of the unnatural recording was that it must be that way to make it easier to get a good result, but then the result doesn't sound natural either. I think a lot of this is that the drawn-out utterances made the speaker vary their pitch/cadence a lot more than usual. Once edited to remove the gap, the sudden change is very noticeable.

I don't think that's due to your software, but just a fact of the unnatural source audio. I think a different, more realistic source audio could let you have a really awesome example, without it being disingenuous or not representative of real-world results.

Thanks for jumping into the ring and answering questions in here!


Thank you for the suggestion and your impression. I agree, a better example would highlight it better.


Is it possible you’ve set the bar too high for yourself? What if you timebox the editing effort and just focus on the most egregious issues? Certainly you would get more complaints but how much impact would there really be?


It might not make much of a difference with a hobbyist's podcast, but filler/mouth sounds won't fly in professional productions for a variety of reasons (time constraints, professional standards, wanting to make hosts/guests sound good, etc.)


Professional productions also have deadlines and budgets and obsessive grooming won’t fly either.

I’m not suggesting no edits, just relaxing things a bit so the burden doesn’t become an existential threat to the podcast.


> Is it possible you’ve set the bar too high for yourself?

Probably but I have no way to turn this off and be happy with myself.

I try to approach everything I do from the angle of "what needs to be done to make this as good as it can be with my current skill set?". From a listener's perspective if I had to listen to something with a bunch of mouth noises, ums every 3 seconds or long pauses I would end up focusing on that instead of the topics being covered. It would give off a wrong impression that conflicts with my core values.


>Probably but I have no way to turn this off and be happy with myself.

I can appreciate this, but...

>It has gotten to the point where after 2 years of running my podcast I'm seriously considering *stopping the show* because I'm getting burnt out from editing and without sponsors it's not feasible to hire an editor, but even with the show making no money I would happily pay triple your asking price if I could click a button and have the problem solved in a way that matched a human's ability to edit out filler words.

(emphasis mine)

I don't think it's actually the case, but extrapolating a list of priorities from this, I can only arrive at the following:

Priority #1 - no aahs, umms, slurps or smacks

Priority #2 - no ads or obvious sponsors

Priority #3 - surfacing hard-won lessons from experienced folks for the world to learn from

Maybe that resonates, maybe it doesn't, but to me it seems upside down.

I'm only commenting because what you're describing used to be me. I used to do this type of editing for recordings of live audio production and I've gone down the rabbit hole you're describing above. The problem is there's no obvious point of 'done', and chasing perfection in the output can become a pathological obsession. You can get so lost in matching phase angle at each end of a trim or taking an eraser to get rid of a sleeve drag across the desk that you lose sight of the totality of it. Ultimately you end up in a weird uncanny valley, like those folks that keep 'fixing' their face with plastic surgery. Once you get to that point, you can no longer identify specific issues to correct, you just fall into a diffuse unease.

For me podcasts are a way to join a conversation that I wouldn't otherwise have an opportunity to listen to. I don't see them as a show or corporate media product, and the more they start moving in that direction the less inclined I am to listen to them. Julia Child had a quote that I've found oddly applicable in this context: 'It's so beautifully arranged on the plate, you know someone's fingers have been all over it.'

Hope this doesn't come across as negative. Good luck!


Thanks a lot for the reply.

> I don't think it's actually the case

Do you mean the editing process isn't what's making me want to stop the show?

For perspective, phrases like sleeve drag aren't even in my vocabulary. I mainly do my best to quickly get rid of filler content without it sounding like there's hard cuts. It's not chasing absolute perfection where I'm zoomed into the waveform so much it looks like an oscilloscope while I hem and haw about there being a 35ms or 50ms pause between 2 words, or agonizing if I should leave an um in there so things don't sound over processed.

Here's a screenshot while editing an episode where the guest was extremely fluent and I didn't have to edit much filler content: https://i.imgur.com/7CBZ1yc.jpg, for context the episode was 90 minutes long but I zoomed into the point where you can see a ~10 minute chunk (normally I'm zoomed in much more while actively editing). This is a best case scenario where I "only" had to do 305 cuts for a 90 minute show. In the worst case scenario it's gone as high as 1,800 cuts for 90 minutes.

I try to keep things organic while being respectful to listeners. All of the cuts you see there are related to removing filler content (umms, ahhs, mouth noises and long pauses). I also remove their dead air when I talk to avoid any of their mic's background noise overlapping my voice since it's all recorded in an uncontrolled environment.

The before and after is pretty staggering even with a fairly minimal amount of filler editing. To be honest I would feel embarrassed posting the unedited version of most episodes.

It's also very interesting because in a way I think posting a much less edited version where I kept all of the filler content in wouldn't save me much time in the end. Not to sound too overconfident, but I'm really confident in my ability to perform quality assurance on each episode while I'm doing the editing. I haven't listened to a single episode in its final form because I've gone through each sentence and phrase multiple times during the editing process. For example I'll start playing it, hit a cut point, make the cut, rewind a bit and ensure things flow smoothly, then continue onwards.

If I did a much less edited approach I would still need to listen to the show at 2x speed, so no matter what I'm spending 30 minutes listening to 1 raw hour. However I'm also creating timestamped show notes like you see here https://runninginproduction.com/podcast/99-a-custom-electron... along the way while editing so I have to pause to write these down.

Basically I would still be spending quite a lot of time to produce things, and I don't think I can outsource that because it would involve finding someone who is not just an audio editor but also has a ton of domain knowledge around 100 different assorted technologies. A lot of those timestamped notes aren't verbatim quotes. I'm mixing quotes with trying to keep it concise enough to fit on 1 line. I'm also making judgment calls on what to include because not everything is worth making a note over, otherwise there would be one every 30 seconds (I used to do this in earlier episodes).

Personally I would rather have a transcript with timestamped links where each guest is broken up into their own paragraphs but to have them done right costs a lot of money. Every machine generated transcript service I used had really bad grammar issues and mistakes. A human reviewed one would be well over $100 per episode to make which is a lot when the show already has a net loss on every episode (hosting).

That quote you mentioned was really good by the way. I'd like to think my editing style is more on the side of someone occasionally using their hand to make sure the food doesn't slide off the plate while you run the plate over from the kitchen to the customer. That's how I feel during the editing process. I'm trying to get through it as fast as possible but taking great care to ensure a high quality meal arrives to the customer. I'm optimizing for folks wanting to come back to their favorite restaurant on a regular basis, not serve an artificial feeling $10,000 plate to a king.


> > I don't think it's actually the case

> Do you mean the editing process isn't what's making me want to stop the show?

No, this is just confusing language on my part. What I meant was that I don't actually think that's your list of priorities in order, but that is how they could be extrapolated based on which part has to give.

OK so after your description of your workflow I think I was reading too much into where you were at specifically with regards to the content clean-up. I was worried that you were hovering over every sentence trying to optimize it and was just trying to talk you down off the ledge. :) For some reason I tend to gravitate towards jobs where I'm at my best when nobody knows I did anything at all. Editing is probably one of the best examples of this and, as a result, it's hard for anyone that hasn't done it to truly appreciate how much work there is behind it.

(Some of this is selfishly motivated btw, I've been following your podcast since the spring and don't want it to go offline lol. If I have to listen to some CTO's lips smack every time he gets ready to talk, I'll allow it. :) )


> For some reason I tend to gravitate towards jobs where I'm at my best when nobody knows I did anything at all.

Yes, this is perfectly said. It's exactly how I feel and what I strive for. I think most folks would be surprised if they listened to a before / after even if all that was done was occasionally remove filler content and mouth noises. It's like that one business analogy iceberg picture with "success" being the 10% that's above water and the other 90% is buried with all sorts of things you never hear about.

> Some of this is selfishly motivated btw, I've been following your podcast since the spring and don't want it to go offline lol

That really means a lot and I'm happy to hear you like the show but unless a big pile of money falls from the sky to afford hiring a dedicated editor and human reviewed transcripts then I have to pull the plug. I've already been feeling this way for 3-4 months but tried to power through it. I've reached the point of feeling resentment and disgust just thinking about opening my editing tool of choice and it's taking its toll. It sucks because I would love to record the show until the day I die but these are the cards I'm dealt and I have to choose sanity over suffering at this point.

There's no middle ground due to the last half of my previous reply.


Sorry to hear that man. You did it once, quite well in fact, so you could always do it again if the opportunity strikes. Hopefully we'll figure out a way to make it simpler for this kind of thing to sustain itself. Until then I'm glad there's no ambiguity about the number one priority: health and well-being.


It's all good. That feeling was very much compartmentalized to just the editing bits of the podcast. Maybe one day it'll work its way out to being doable.

I posted a new episode today since I still have 6 unedited episodes left, I figured I would release them once a month until they run out. I'll also be posting a "what's happening with the podcast?" video on YouTube tomorrow.


What post-processing do you do already to catch the low-hanging fruit? iZotope? I reckon putting in 100 hours of editing and not being able to get an hour of audio edited in under an hour means there is something which could be optimised away quite quickly.


> What post-processing do you do already to catch the low hanging fruit?

None, everything is manual.

I use DaVinci Resolve to do the editing where both the guest and myself have separate tracks. Then I line up the tracks (only takes a few seconds) and start playing things from the beginning at 2x speed. I stop to make cuts mostly to remove filler content.

Throughout this editing process I'm also creating show notes as I go. An example of the end result is here https://runninginproduction.com/podcast/103-great-question-m.... Basically every few minutes I recap what was said into a 1 sentence bullet point with a timestamp. Along the way I list out techs used as tags and list out reference links / libraries in a Markdown document. Then once I'm done editing the show I write a few paragraphs which is a TL;DR of the episode.

All in all if the guest uses minimal filler words or noises it takes about 1 real life hour per 1 hour of recorded content to do all of the above. For context, the episode I linked has someone who I would bucket into a category of speaking very fluently with minimal filler content. I was able to blaze through that one.

I also have a 2560x1440 display and use the "always on top" feature of most window managers to layer the Markdown document and a preview of the page just above the waveform in DaVinci Resolve so I can quickly make cuts and update the notes with minimal mouse movement. Almost everything is keyboard driven.

What tools can be used to speed up that process?


It sounds like the show notes are the most costly part, I would assume? I imagined you were exhausting yourself slowly scrubbing through manually and editing out little clicks, lip smacks and inhales. The former is much harder to automate away, but the latter is definitely easy with some commercial audio plugins.


I've timed myself going through episodes where the guest spoke very fluently vs guests where I had to stop every few seconds to cut a filler word. The latter takes multiple hours longer which makes me think the time consuming part isn't the show notes, but the mechanical editing. Each note only takes about 30 seconds based on listening to the last few minutes of what was said.

It is mentally taxing though, it means during the whole editing process my brain is constantly identifying and removing filler content, listening for specific tech choices to tag, listening for specific references that could be interesting to link, listening for mentions of libraries to link and also digesting the main takeaway of what's being said to sum it up into a note. All of this happens in 1 pass during the editing process. I tried doing it in 2 passes where I only focused on mechanical editing the first time around and doing the show notes on the 2nd but it took longer in the end.


Hey HN!

I like podcasting, but I hate editing. I tend to stutter and have a lot of filler words in my podcast. That's why I created Cleanvoice: to spend less time editing. Cleanvoice is an ML tool which removes filler words, mouth sounds, stuttering and dead air from your podcast. To use it, just upload your podcast, wait a few minutes, and download the cleaned audio.

It's still not perfect, but it's at a stage where I can blindly use it on every single one of my podcasts.

I would love to hear your feedback!


Neat! I love products that come out of a personal need.

Is it possible for you to do a live, personal demo? No logins or anything. I'm thinking something where you tell people to start up their audio and then give them a quick prompt like "Describe your breakfast yesterday." Record for 30 seconds, and then let them play back the original and cleaned versions. You could limit them to, say, 5 goes, with a different prompt each time.

I suggest it because a) a little personal investment makes it more likely they'll give you their email address for signing up, and b) many potential customers underestimate how much they need something like this.


I like your idea, makes sense.

My biggest fear is that without a login, people will start abusing it in ways that I don't expect. Definitely considering it. Thank you!


That's a good fear to have. That's the kind of thing I would set up some monitoring for and then wait to see. You might get a few jerks. But those same jerks might also be the sort of people who would sign up with a bunch of fake emails, so gating on an email address may not be much better than gating on a fresh-issued cookie.

Thanks for listening, and good luck with your project!


Have you compared this to other commercial options such as Descript? Looks really great at a glance, thanks for sharing!


I tried to use Descript for my podcast, but it has some issues.

1) It doesn't work well if you have a strong accent. As a non-native speaker, the transcriptions were quite bad, which made the editing quite bad.

2) Cleanvoice works with multiple languages; Descript doesn't.

3) Cleanvoice can remove stutters (not always, but it tries) and mouth sounds like lip smacking and teeth clicking. Descript can't. This is not a big deal for most, but since I stutter a lot this was essential.

My approach is different from Descript's. They use a transcription service and then edit the audio based on the text. I work directly at the phonetic level, which gives me more control over the audio.

Depending on the needs, either one is better. I guess you should try it for yourself and compare.


I will try it, thanks! We work with lots of accents and we've found the same with Descript: it fails, for example, with a strong French accent. Translation is really key for us also; looking forward to seeing systems trained on more accents.


I use Descript and it is absolutely lovely. There are a bunch in this space that I would not be surprised to see merged or acquired. Would love to see Descript & GetWelder merge together.

While Cleanvoice has some niche features that Descript doesn't offer I would not be surprised to find them rolling these features out in the next major release they're doing. IMO the founder of Cleanvoice should sell/join Descript.


Without giving away your secret sauce, what are your approaches to the cleaning process? Is it a combination of different passes of algos or is it something more generic and "sausage machine-like" like a neural network?


The audio is edited in several phases, using different algorithms, most of them deep learning based. It is surely overengineered, but as a data scientist, ML is the most fun part for me.
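For illustration only, a multi-phase pipeline like the one described could be sketched as: detect filler frames, merge them into regions, then cut with short crossfades. Everything below is a toy (the `is_filler` callable stands in for a trained model); it is not Cleanvoice's actual algorithm:

```python
import numpy as np

def detect_filler_frames(audio, frame_len, is_filler):
    """Phase 1: label each fixed-size frame via a per-frame classifier.
    `is_filler` is any callable taking a frame and returning True/False."""
    n = len(audio) // frame_len
    return [is_filler(audio[i * frame_len:(i + 1) * frame_len]) for i in range(n)]

def frames_to_regions(flags, frame_len):
    """Phase 2: merge runs of flagged frames into (start, end) sample regions."""
    regions, start = [], None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i * frame_len
        elif not f and start is not None:
            regions.append((start, i * frame_len))
            start = None
    if start is not None:
        regions.append((start, len(flags) * frame_len))
    return regions

def cut_regions(audio, regions, fade=8):
    """Phase 3: remove the regions, crossfading across each cut to avoid clicks."""
    out, pos = [], 0
    for start, end in regions:
        out.append(audio[pos:start])
        pos = end
    out.append(audio[pos:])
    result = out[0]
    ramp = np.linspace(0, 1, fade)
    for piece in out[1:]:
        if len(result) >= fade and len(piece) >= fade:
            blended = result[-fade:] * (1 - ramp) + piece[:fade] * ramp
            result = np.concatenate([result[:-fade], blended, piece[fade:]])
        else:
            result = np.concatenate([result, piece])
    return result

# Toy run: "filler" = low-energy frames in a synthetic signal.
audio = np.concatenate([np.ones(100), np.zeros(50), np.ones(100)])
flags = detect_filler_frames(audio, 25, lambda fr: np.abs(fr).mean() < 0.1)
regions = frames_to_regions(flags, 25)
cleaned = cut_regions(audio, regions)
print(len(audio), len(cleaned))  # prints: 250 192
```

The phase 3 crossfade is the part humans normally do by ear; a real system would also place cuts near zero crossings and tune the fade length per cut, which is where most of the "unnatural edit" complaints in this thread come from.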


How is the latency and, if it's sufficiently low, could this realistically be applied to "nearly live" content?

That scenario seems really appealing for conferences, even if it just quietens down the verbal tics, but I'm guessing if the lag is too great it would feel like a bad lip sync issue.


How does real time make sense in the first place for an algorithm that gets 1 minute of audio and gives you back 50 seconds? You're going to have to fill the gaps with something not meaningful anyway.


An awareness of your point was precisely why I mentioned "quietening down verbal tics" (i.e. 1 minute gives you back 1 minute, but with the tics removed/muffled).

To me this seems like it could be worthwhile even if it results in silence or less prominent umms and other filler. I've sat through enough conference talks by technically gifted people who I very much wanted to hear but who unfortunately make their talks much harder to follow due to the tics. It might even help relax some nervous speakers if they knew that any which crept in were being suppressed.


I understand now. Very niche, but I applaud the effort to give voice to people who have something to say, instead of those who know how to talk in public.


Silence is meaningful, but pretty awkward when not deliberate!


Tools like this are designed to remove awkward silences.

What it sounds like the GP is after is something more like hiss and pop removal (to use an old vinyl analogy), and that's a different, and also simpler, problem to solve. I'd wager there are already tools on the market for that.


Very insightful :). Now I need an AI to tell me when silence is deliberate or not. :)


It would be a huge engineering endeavour, which I wouldn't be capable of doing. That said, things like background noise and some sounds can be removed. See Krisp.ai


Nvidia RTX Voice does something similar. It's like other technology in this space in that it focuses more on removing background noise, and it actually works very well. It would definitely be interesting to see it also filter speech itself, but I feel this would be hard to do without introducing extra latency. If someone says "umm" or some other filler before a word, you kind of need to know what that word will be to determine whether it's filler or not. So it almost can't be done without introducing latency, as it needs some future speech to decide what is filler and what isn't.
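That lookahead trade-off can be sketched with a toy streaming buffer (hypothetical code, not how RTX Voice or any real tool works): output lags input by N frames so the classifier can peek at upcoming audio before deciding whether to mute the oldest frame.

```python
from collections import deque

class LookaheadFilter:
    """Streaming sketch: delay output by `lookahead` frames so a classifier
    can use future context before deciding whether the oldest frame is
    filler. `classify` stands in for a real model: it receives the whole
    buffered window and returns True if the oldest frame should be muted."""
    def __init__(self, lookahead, classify):
        self.buf = deque()
        self.lookahead = lookahead
        self.classify = classify

    def push(self, frame):
        """Feed one frame; returns an output frame once the buffer is full,
        else None. Output therefore lags input by `lookahead` frames."""
        self.buf.append(frame)
        if len(self.buf) <= self.lookahead:
            return None
        window = list(self.buf)
        oldest = self.buf.popleft()
        return "<silence>" if self.classify(window) else oldest

# Toy run over "frames" of text: mute an "um" only when the next frame
# shows it leads into a real word.
classify = lambda w: w[0] == "um" and len(w) > 1 and w[1] != "um"
f = LookaheadFilter(1, classify)
out = [f.push(x) for x in ["um", "hello", "world", "um", "um", "friends"]]
print(out)
```

Note the toy classifier only mutes an "um" whose single lookahead frame is a real word, so the first of two consecutive "um"s slips through — which is exactly the point: better decisions need more lookahead, and more lookahead means more latency.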


To do this, the speaker would have to wear an EEG cap. You're talking about cutting the mic before a verbal tic happens.

With an EEG cap, though, I bet a smart person familiar with the methods could bash something together in a day that would work.


True. You don't even need a full cap, just some channels over the visual cortex (with more advanced AI). So you would just need to wear a headband or one of those EEGs which look more elegant.


iZotope plugins already do some of these things but not all. In particular, their de-clicking algorithm is pretty good, but definitely not automatic or low latency.


Do you do any audio segmentation to remove the filler words and such?


Based on the OP's username, surely one of the deep learning algorithms is a denoising autoencoder, right?


I literally just bought your product, thank you very much, I needed this and wondered why no one had made it yet.


I appreciate it! If you have any issues or need help, feel free to reach out. (You can use the chat in the app.)


This is awesome.

Can I suggest the ability to export as project files for popular editors for your roadmap? It'd cut professional workflows down substantially, which would be worth an (even higher) upcharge.

(It wasn't immediately obvious to me if you already did this)

Edit: https://cleanvoice.ai/integrations seems pretty close. I'd honestly charge more for integrations and provide a base tier for just exporting sound. I imagine most indie users would benefit from finished exports enough to pay, while project files would command a higher fee from editors looking to speed up their workflow to take more clients. That's where I'm coming from on pricing tiers and upcharging for professional features.


ADL support will come around Q2, so you can import it into a lot of audio and video editors. For now, we have the export files which you mentioned.

Regarding Pricing, that's a good point. I will definitely consider it, thank you!


To add to this, it might also be a good feature to output EDL files for video editors.

https://www.rev.com/blog/how-to-import-an-edit-decision-list...

This could help when you have for example multiple camera angles, to switch between/do morph cuts (https://helpx.adobe.com/premiere-pro/using/morph-cut.html) for video interviews.


Thank you for the suggestion. EDL/ADL is definitely on the list.


Sounds similar to Descript https://www.descript.com/


I'm a very happy user of the free tier of Descript right now, and will definitely pay once my transcription limit is reached.

It seems like this particular product might do a better job of the automated editing specifically, but Descript has a ton of other features (speaker identification, transcription, real-time editing based on text edits, asset management, and uploads), and I definitely wouldn't trade them for marginally better auto-removal of noise and filler.

Does anybody who develops Cleanvoice have any commentary here?


Adrian from Cleanvoice here.

Before building Cleanvoice, I tried to use Descript for my podcasts.

1) It doesn't work well if you have a strong accent. As a non-native speaker, the transcriptions were quite bad, which made the editing quite bad.

2) Cleanvoice works with multiple languages; Descript doesn't.

3) Cleanvoice can remove stutters (not always, but it tries) and mouth sounds like lip smacking and teeth clicking. Descript can't. This is not a big deal for most, but since I stutter a lot this was essential.

However, if none of these apply to you, there is no reason to switch from Descript.


I suspected something like this was happening with podcasts. I've noticed lately that some podcasters have unnaturally short pauses between speakers (question and answer) or between sentences. It really annoys me. It makes it almost unlistenable.


I agree, as if they don't breathe!

This is not the case with my app. I keep the edits longer rather than shorter, since I also find that unlistenable.


Yes, the worst is when so much silence is removed that it sounds like someone is laughing over themselves.


Reminds me of https://auphonic.com/

Their pricing is also similar, but Auphonic allows both subscription and prepaid "credits".


Yes, the idea is to also bring prepaid credits soon.

Auphonic and Cleanvoice go well together.

I guess the idea is to have your podcast edited by Cleanvoice and then do the audio post-processing with Auphonic.


Auphonic's volume equalization is almost a must-have for podcasts. I used to spend a lot of time getting volumes right. With Auphonic it's quick and easy.

I definitely prefer pre-paid credits to a subscription given my podcast production varies a lot.


Is the example on the page really made by the computer? In my opinion the pauses where the filler words were are slightly too long. Is it possible to configure this?

Is it possible to keep some filler words? I make something similar (but not professionally), and sometimes I like to keep a few of them.


> Is the example in the page really made by the computer? Yes. >In my opinion the pauses in where the filler words were are slightly too long. Is it possible to configure this? I agree, however, if you use it in an interview. The edits sound better. In an unnatural setting, you get unnatural results.

There is currently no way to configure it, but customization is planned for Q2 next year.

> Is it possible to keep some filler words?

For now, no, but keeping some filler sounds to keep it authentic is something I plan to add.


I agree that the correct length of the pause after the removed word is very tricky. Perhaps your configuration is better than my imaginary, magical edit.

In another comment, eganist posted a link to https://cleanvoice.ai/integrations It looks interesting because I can choose which edits to keep and even use it to sync with video [with some additional work]. I didn't see it on the page the first time.


ADL support is also coming around Q2, so you could just import it into your audio/video editor without issue. Thank you for pointing it out. I'll put Integrations on the homepage as well.


Overcast has features to do some of this on the listener side. I prefer having the AI on the listener side so I can go back to the raw version if the AI messes up for some reason.


That logo is very similar to the Cisco logo:

https://www.cisco.com


Congratulations on launching. How are you finding using termly.io for the legal side of things?


It's not ideal. See the comment talking about the terms. I have a meeting with a lawyer soon. But I guess it's better than no terms.


Can anyone recommend something similar for removing ums etc. in videos? IIRC there is a workflow in some professional software, but being able to train and throw the algorithm right at the video itself (especially locally) would be useful.


> Can anyone recommend similar for removing ums etc. in videos?

For single camera floating head style videos where you're continuously talking about 1 topic it's going to be very jarring if you start cutting out filler words. You'll end up with a bunch of jump cuts where it looks like video frames are dropped.


For now, Descript would be the best option. You can still make it work with the integrations, but it is a lot of effort.

That will change in Q2, when I add support for video.


Yes; Descript.com does this.


Looks cool! Would this also work for "explainer" type videos, showing how to use a software product or similar?

If yes, you might consider a page or callout about that use-case, as it might attract some additional users. Just a thought.


That seems like it would be tricky, as the video and audio would get out of sync. You would have to remove, then "fill" to keep the timing. Though this product does mention it works with multiple speakers on different tracks...so they are already somewhat in that space.


Video is quite tricky. One thing with video is that you don't want to over-edit the audio, since it's then very hard to keep the video synced. That said, for an explainer video it should work OK, but for a video podcast it would be horrible. I have an idea for how to deal with this, but it's not available yet.
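One way to sidestep the sync problem described above can be sketched as follows. This is an illustrative sketch, not how Cleanvoice works: instead of cutting filler spans out (which shortens the audio and drifts it from the video), overwrite them with silence so the track length, and therefore A/V sync, never changes. The sample rate and timestamps are made-up placeholders from a hypothetical filler-detection pass.

```python
# Mute filler spans in place instead of cutting them, so the audio
# track keeps its original length and stays aligned with the video.
SAMPLE_RATE = 16_000  # assumed sample rate, samples per second

def mute_spans(audio, spans, rate=SAMPLE_RATE):
    """Zero out each (start_sec, end_sec) span without changing length."""
    out = list(audio)
    for start, end in spans:
        lo, hi = int(start * rate), min(len(out), int(end * rate))
        out[lo:hi] = [0] * (hi - lo)
    return out

audio = [1] * SAMPLE_RATE                 # one second of dummy samples
muted = mute_spans(audio, [(0.25, 0.5)])  # silence a detected "um"
assert len(muted) == len(audio)           # duration unchanged -> sync kept
assert muted[int(0.3 * SAMPLE_RATE)] == 0 # filler span is now silent
```

The trade-off is that a muted gap can sound as unnatural as a hard cut, which is presumably why cutting plus re-timing the video is the harder but better-sounding route.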


Really cool project, I wish you great success! Could be useful for my (german) podcast agency!

Out of curiosity: Which ai-technology did you use? OpenAI? Google API? Or did you train the models yourself with Python (sth. like Tensorflow)?

Cheers, Mike


Hello Mike, nice to meet you!

I trained my own models. No OpenAI/Google API.

Best regards, Adrian


“The algorithm can also work with accents from other countries, such as Australian ones or Irish.”

Other than which country, though? Presumably an English speaking one - UK? New Zealand? Canada? US?


Hey! Justin (from Transistor.fm) here. This looks really interesting. Two questions:

1. Any plans for an API and bulk pricing?

2. Any plans to add loudness normalization, balancing, etc to the processing?


So my current podcast stack looks like

Zencastr RECORD

Transistor.fm HOST

but looks like I'll be adding

Cleanvoice AI CLEAN UP

Auphonic AI EQ

any other suggestions for optimum output from multiple inputs all over the world?


Hey Justin! Love your podcast.

1) API Access will come end of Q1.

2) Not in the next 6 months. However, Auphonic would be a good fit for you.


This is excellent, well done! I'd be curious to know how it's done, as I don't know much about deep learning and this looks like magic to me.


What's the high level approach required to build something like this yourself?

Does it involve relying on speech to text with timestamps and then a series of cuts based on that?
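Roughly that, yes. A minimal sketch of the pipeline the question describes (every name, timestamp, and the filler list here are illustrative assumptions, not Cleanvoice's actual implementation): run speech-to-text with word-level timestamps, drop the filler words, and merge what remains into keep-segments that get spliced back together.

```python
# Turn a timestamped transcript into a list of audio segments to keep.
FILLERS = {"um", "uh", "umm", "ahh", "er"}

def build_cut_list(words, pad=0.05):
    """Return (start, end) segments to KEEP, merging adjacent words.

    words: (text, start_sec, end_sec) tuples from a timestamped ASR pass.
    pad:   seconds of slack so cuts don't clip phoneme edges.
    """
    segments = []
    for text, start, end in words:
        if text.lower().strip(".,!?") in FILLERS:
            continue  # drop filler words entirely
        start, end = max(0.0, start - pad), end + pad
        if segments and start <= segments[-1][1]:
            # overlaps the previous keep-segment: merge them
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments

words = [("So", 0.0, 0.2), ("um,", 0.3, 0.6), ("removing", 0.9, 1.4),
         ("uh", 1.5, 1.8), ("fillers", 2.1, 2.6)]
segments = build_cut_list(words)  # three keep-segments; "um," and "uh" gone
```

The hard part, as the thread above notes, is not finding the fillers but deciding how much surrounding pause to keep so the splice sounds natural; a fixed `pad` like this is exactly what produces the robotic cadence people complain about.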


Your demos don’t play on iOS Safari.


Oops! Thank you for pointing that out. I'll check it.


Not sure where you are located, but if you are giving access to people protected by the GDPR, your cookie notice does not fulfill the requirements set by European regulations.

Additionally, if you are located in a country that (like Germany for example) has regulations on the necessity of an imprint, this might also be missing.


It should be OK, since I use strictly necessary cookies, which don't require consent (but users need to be informed).

Or do I misunderstand the law?

[1] Strictly necessary cookies — These cookies are essential for you to browse the website and use its features, such as accessing secure areas of the site. Cookies that allow web shops to hold your items in your cart while you are shopping online are an example of strictly necessary cookies. These cookies will generally be first-party session cookies. While it is not required to obtain consent for these cookies, what they do and why they are necessary should be explained to the user.

[1] - https://gdpr.eu/cookies/


The Terms of service seem worrisome.

> By posting your Contributions to any part of the Site or making Contributions accessible to the Site by linking your account from the Site to any of your social networking accounts, you automatically grant, and you represent and warrant that you have the right to grant, to us an unrestricted, unlimited, irrevocable, perpetual, non-exclusive, transferable, royalty-free, fully-paid, worldwide right, and license to host, use, copy, reproduce, disclose, sell, resell, publish, broadcast, retitle, archive, store, cache, publicly perform, publicly display, reformat, translate, transmit, excerpt (in whole or in part), and distribute such Contributions (including, without limitation, your image and voice) for any purpose, commercial, advertising, or otherwise, and to prepare derivative works of, or incorporate into other works, such Contributions, and grant and authorize sublicenses of the foregoing.

It sounds an awful lot like "we are allowed to do anything and everything we want with the content you upload to us". Maybe I'm misunderstanding something, but I'd be extremely hesitant to upload any content I create to a service with those kinds of terms.


I agree. The terms will be changed. For now I used an automated terms generator (termly.io).

I would like to rewrite it.

What I do is just keep your files on the server for a week. If you have an issue, I will look into your file to fix it. And if you want, you can give consent for me to use your audio to further improve the service. (Say you have an accent the AI handles badly; I can use your audio file to understand why it failed.)


With this statement you’ve now shown that your site doesn’t take contracts seriously and opened the door to people arguing future contracts are also invalid. I’d delete this response ASAP.


What? This person made something, we pointed out an improvement and they said they'd change it. You're literally complaining that it wasn't perfect already, and thus they somehow don't "respect stuff".


Sorry, who said “respect stuff?” I don’t see what you’re quoting. And who is “complaining?” I will just restate the warning: admitting that you auto-generated the terms, didn’t know what was in them, and forced everyone to sign them invalidates the current contract and gives grounds for future disputes over completely new contracts. So, I’ll reiterate my recommendation to delete the post. This seems to have struck a nerve with you, so I’m not looking to get into it beyond that.


Why? They can change the policy and ask for a confirmation, as every service out there is already doing.


How and when have you seen it happen that a contract was invalidated by one party indicating that they would prefer a more appropriate contract?


Thanks for the heads up. I’m a little hesitant to upload something now. On the flip side, I think devs just want total protection while they navigate the landscape of machine learning. I agree that they could have worded things better, but whoever wrote this probably didn’t understand the nuances of machine learning or the countries people would be signing up from. Plus, they’ll constantly need datasets for internal training purposes.


Yes, that's exactly the case. As I previously commented, I used a terms generator until I can get a lawyer, who can write down specifically what I do with the data.


Thank you.

I would pay for a piece of software that does that job on my computer with no Internet.

This way? I may even end up in court for saying something “improper”…

Edit: OK, I’ve just read the developer’s reply below.

Honestly: you need to fix this because right now it is more scary than not.

Congratulations for the project but please do fix this.


I agree. More and more AI applications are exploiting our data in negative ways.

I will get proper terms as soon as possible, especially now that people have mentioned it.


"Free 30 Minutes Trial" is not native English. "Free 30 Minute Trial" would be better; but I think the sentence is a little confusing. I presume you mean you can convert 30 minutes of audio for free, not that the trial account is only valid for 30 minutes from creation. I would do "Clean 30 minutes of audio for free. No Credit Card needed." or similar. The sale page which says "Get 30 minutes credit to try the service out." is better, and "30 minutes" does sound correct on that page.

In your FAQ, you say: "Currently we remove lip smacks, saliva crackle, mouth clicks and harsh parts of breathing (not the whole breath). If you want to remove a particular mouth sound (ex. Chewing), write us in the chat as a feature request." I don't think most English speakers would understand what "harsh parts of breathing" are. Typically a parenthetical example in English would be written "(e.g. chewing)" not "(ex. Chewing")".

Your question "What filetype and sizes do you support?" doesn't answer what filetypes you support, and I suspect the singular "filetype" was a grammar error. You also write "We have an audio file size limit of 1.5G per file or in case you are uploading multi-track and a total file size of 2 GB. ". The part that says "or in case you are uploading multi-track and" doesn't make any sense in English. I think you mean "We support file sizes up to 1.5GB per file for single-track files, or 2GB if you are uploading a multi-track file as separate files." but I'm not sure.

In general I don't understand why each selling point has a separate FAQ page but the FAQs are often not related to the selling point. I don't think people think the "Mouth Sound Remover" page is the one that lists file size support, while the "Stutter Remover" page is the one that lists the maximum number of tracks per project.

Your integrations page lowercases "cleanvoice" whereas other pages write it as "Cleanvoice".

Under integrations, you have a section called "Markers Export". This should probably be "Export Markers" or "Marker Export".

Under "How to Export Edits", you probably don't want to capitalize "Results" or "Editor" unless these are supposed to be title cased, in which case you probably want to title case all of them.

Under your pricing FAQ you have "Does my credit expire at end of the month? Your credit will reset every billing month. Unused credit will be lost." This is needlessly confusing. You use the verbs "expire", "reset", and "be lost" to describe the same thing, and you don't actually answer the question. Also you don't want "at end of the month", you want "at month's end" or "at the end of the month". I would rewrite as "Does my credit expire at the end of each month? Yes. Credit resets every month and cannot be carried over to future months. Unused credit will be lost." This is a terrible business model, though, and so I suggest you not do this. Either sell as a subscription or sell as a credit model, not both, this is gross.

In general I think you want to pay someone who is a professional English copywriter to fix your website. Cheers.

Edit: I just noticed your changelog is powered by a service called Headway. I am not sure if you also made Headway, but Headway's website is also in need of English copyediting.


Wow! Thank you so much! You are right, I need to get a copywriter ASAP.

I'm curious why the subscription + one-time credit model is bad, but I agree it is confusing.

My understanding is that not every customer wants or needs a subscription, since they upload podcasts irregularly.

This business model is seen in other AI products:

https://www.remove.bg/pricing

https://auphonic.com/pricing

I am very grateful you took the time to help out. Really appreciate it!


Maybe you can get away with a quick fix using something like deepl.com.

They are great. As a native German speaker, I got a long way using them when I needed valid translations.


I'm going to sound like a negative Nancy, but I wish podcasters/YouTubers would just practice their speaking skills instead of relying on a series of really quick jump cuts. The worst offenders are those that can't get through a sentence without splicing it 2+ times...

Perhaps you could have a mode that detects how much one stutters and flags parts worth redoing, without spending as much time combing through the whole thing.


Some podcasts I listen to are over-edited. I'd always assumed that a) it was done manually and b) it was done to keep the length below some threshold. Now I'm curious if they are using software to automate the editing.

I find the cadence very unnatural when all the spaces between phonemes are removed.


>I find the cadence very unnatural when all the spaces between phonemes are removed.

Any editing can be overdone and, while I do a modicum of editing out umms, you knows, and other verbal tics when I'm putting together a podcast interview, I'm not fanatical about it.

You do occasionally get someone who just speaks quite slowly, and it is sort of annoying to listen to as audio. So I've done some automated gap reduction in a couple of cases.
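Gap reduction of this kind (Audacity ships it as the Truncate Silence effect) can be sketched roughly as follows; the amplitude threshold and gap length here are illustrative placeholders, not Audacity's defaults:

```python
# Shorten any run of near-silent samples longer than max_gap seconds
# down to exactly max_gap, leaving speech untouched.
def truncate_silence(samples, rate, threshold=100, max_gap=0.5):
    """Keep at most `max_gap` seconds of each silent stretch."""
    limit = int(max_gap * rate)
    out, quiet_run = [], 0
    for s in samples:
        if abs(s) < threshold:
            quiet_run += 1
            if quiet_run > limit:
                continue  # drop silence beyond the cap
        else:
            quiet_run = 0
        out.append(s)
    return out

rate = 1000
samples = [500] * 100 + [0] * 2000 + [500] * 100  # a 2-second gap at 1 kHz
shortened = truncate_silence(samples, rate, max_gap=0.5)
assert len(shortened) == 100 + 500 + 100  # gap capped at 0.5 s
```

A real implementation would work on framed RMS energy rather than raw per-sample amplitude, and would fade across the join to avoid clicks.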


What software do you use to automate?


Audacity.


Especially ones who won't set their background LED lights to a stable color. The smooth flowing gradient becomes very distracting when you jump cut the heck out of it.


synesthesia?


Classically, professionals learned their speeches by heart. It stands out when you see it.

I remember fondly a student of mine who seemed unable to express himself properly. I told him to memorize his final project dissertation because otherwise it would be a wreck (OK, I did not say this last part, it was more of a suggestion).

BOY, did he memorize it. He got honors, and I thought, “this guy has really done it, and it sounds like music!”

When you do it well, it shows.


Once I started noticing jumpcuts it ruined every single YouTube video with a person talking into the camera. The worst offender being Phillip DeFranco.


Interesting take. I saw Phillip DeFranco as more of a pioneer of that style. He really leaned into the cuts. At the time it was something no one else was doing so it was very noticeable, and he had a very crisp cadence with them where the jarring cuts were part of the presentation. It was clear his process was: Write a script, mark cuts everywhere it could make sense, go through the script repeating every phrase until you're happy with the sound, and when editing, always make the cuts where they're marked, even if it could be skipped.

The result feels something like pixel art: Clearly not the closest possible imitation of conversational speaking, but something else. A style in its own right with different considerations.

Now that it's par for the course to have jump cuts, I see them used more sloppily everywhere, where it's clear the narrator decided where to do the cuts after the fact. Cutting off the beginning or end of a phoneme, missing or repeating bits of a thought because they liked one phrasing in recording but opted for another one in post, misordered cuts where something which moved in the background moves back to its old place, etc. Phillip's style looked lazy but it can't really be imitated with actual laziness.

These days I look back and really cringe at the substance of his show. But I still see the style as professional.


I find talking into a camera really tough. If you're doing it by yourself you almost need to imagine you're talking to a person. I even know of people who put cutouts or pictures of someone by the camera so they can talk to a person.

I haven't had a lot of luck using teleprompters, but maybe I just haven't hit on the right setup.

Something else someone told me recently was to try to work in short segments that you redo until you get right and then do a cut to the next segment somewhere that it's natural.


The logo is similar to ours https://www.lovo.ai/


While turning it into a heart may be clever branding, you've only slightly modified a ubiquitous icon representing audio, and countless startups used that before you.



