Show HN: Full Text Search on Podcasts

wenbin · on May 1, 2020

If you just want metadata search (plus search over 150k transcripts), try https://www.listennotes.com/

gingerjoos · on May 1, 2020

Hi wenbin,

Listennotes is one of my favourite single-person companies. Congratulations on a wonderful product! I've been really impressed that you were able to build such a good and stable product; I suppose some of that comes from using "boring technology" (https://www.listennotes.com/blog/the-boring-technology-behin...)

What made you add the playlist feature? It seems more b2c than some of your other features. Is it gaining traction? Do you think at some point you would make your own podcast app?

Thanks! Anirudh

krat0sprakhar · on May 1, 2020

Great idea but I think you should expose more search options to make this useful. Options like post, duration, etc would make this very very useful! Something like:

  podcast:thedaily q:coronavirus testing

I've lost count of the times I could remember the podcast where I heard something but couldn't recall the episode.
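
A minimal sketch of how a fielded query like the one above could be parsed, assuming a simple `field:value` syntax. The function name `parse_query` and the fallback to a `q` field for unprefixed text are illustrative assumptions, not anything the tool is known to support.

```python
import re

# Matches a field prefix such as "podcast:" or "q:".
# Only "podcast" and "q" appear in the comment above; any other
# field name is accepted here purely for illustration.
FIELD_PATTERN = re.compile(r"(\w+):")


def parse_query(raw: str) -> dict[str, str]:
    """Split 'field:value field:value ...' into a dict of field -> value."""
    matches = list(FIELD_PATTERN.finditer(raw))
    if not matches:
        # No prefixes at all: treat the whole string as free text.
        return {"q": raw.strip()}

    fields: dict[str, str] = {}
    leading = raw[: matches[0].start()].strip()
    if leading:
        # Text before the first prefix is treated as free text
        # (an explicit q: field later in the query would override it).
        fields["q"] = leading

    for i, match in enumerate(matches):
        # A value runs until the next prefix, or to the end of the string.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        fields[match.group(1)] = raw[match.end():end].strip()
    return fields
```

Calling `parse_query("podcast:thedaily q:coronavirus testing")` yields one entry per field, which a search backend could then map to filters and a text query.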

andrewmatte · on May 1, 2020

Cool idea for an add-on!

I've noticed that I feel better about giving feedback to uncompensated developers using "this is great and" instead of "this is great but."

It could make or break someone's day.

krat0sprakhar · on May 1, 2020

That's great feedback - thank you.. I will keep that in mind for future :)

ehsankia · on May 1, 2020

those would definitely be great but even more basic than that, it should allow for "exact matches". I feel like most of my queries are useless without it.
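
As a rough illustration of what "exact matches" could mean here, the sketch below checks whether the words of a phrase appear consecutively in a transcript. The tokenizer and the function names are assumptions for the example; a real search engine would normally use a positional index rather than scanning text.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def contains_phrase(transcript: str, phrase: str) -> bool:
    """Return True only if the phrase's words appear adjacent and in order."""
    words = tokenize(transcript)
    target = tokenize(phrase)
    if not target:
        return False
    n = len(target)
    # Slide a window of len(target) words across the transcript.
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))
```

With this check, a transcript that mentions both words far apart does not match the quoted phrase, while one that contains them side by side does.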

nmiodice · on May 1, 2020

That is great feedback - and pretty easy to implement. I'll have a go at it when I have a little free time outside of work. Thanks!

nmiodice · on May 1, 2020

I made a tool that enables you to run full text search against audio content and explore the results using an embedded media player.

As of now (mostly for cost reasons) I have ingested a limited set of podcasts.

Please let me know what you think!

Note: It is not yet mobile optimized!

devbas · on May 1, 2020

Great idea! Did you deploy a speech-to-text pipeline to achieve this? I always thought it would be relatively expensive to do podcast-to-text translation at scale (compared to the gains) but maybe I just didn't optimize it well enough :)

jjice · on May 1, 2020

Not OP, but I've looked into AWS Transcribe [1] and at least their solution would begin to rack up quite a bit of a bill. From what I've seen, there isn't a great open source STT solution yet, although there do seem to be quite a few promising ones [2]. STT is one of the technologies I'm looking forward to most in the open source realm.

[1] https://aws.amazon.com/transcribe/pricing/?nc=sn&loc=3 [2] https://github.com/mozilla/DeepSpeech

nmstoker · on May 1, 2020

Seems like it has plenty of potential.

Not sure if I've been unlucky but the terms I searched and found (eg "economic") did not come up in the audio for quite some time after I played the sample. Is it meant to start the audio near the search term occurrence? (which seems like the natural thing you'd expect)

nmiodice · on May 1, 2020

Your intuition about how it should work is indeed correct. Some podcasts don't seem to produce a high quality timestamp in the underlying STT engine.

I'll be experimenting with word-level timestamps. Right now I am just getting the timestamp of larger chunks (1-3 sentences).

Thanks for the feedback!
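
To illustrate the word-level idea, here is a small sketch that picks a playback offset from per-word timestamps. The `Word` shape, the `seek_offset` name, and the two-second lead-in are assumptions for the example and do not reflect the output format of any particular STT engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the audio


def seek_offset(words: list[Word], term: str, lead_in: float = 2.0) -> Optional[float]:
    """Return where playback should start for the first occurrence of term.

    Playback begins lead_in seconds early so the listener hears some context.
    Returns None if the term never occurs.
    """
    needle = term.lower()
    for word in words:
        # Strip surrounding punctuation that a transcript might attach.
        if word.text.lower().strip(".,!?\"'") == needle:
            return max(0.0, word.start - lead_in)
    return None
```

With chunk-level timestamps the player can only jump to the start of a 1-3 sentence block; with per-word entries it can land within a couple of seconds of the match.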

chid · on May 1, 2020

this is awesome, I have previously been looking for something like this - where I wanted to learn more about a topic/discover podcasts on this sort of thing.

when I was looking at the cost to transcribe it just didn't make sense to do it for myself.

hope this works out well!

froindt · on May 1, 2020

I've wanted this product in the past and found the economics to be similarly challenging. [1] Some podcasts already put out professionally created transcriptions which is great, but I'd need to compile them in one place to figure out who said that one direct quote I remember.

Having people transcribe 5 minutes of audio/month in lieu of a subscription cost equal to 3 minutes of professional transcription was one model I had in mind.

I'm also hoping this works out well.

[1] https://news.ycombinator.com/item?id=15826604

kreetx · on May 1, 2020

This is very useful! For UI using algolia or similar would be pretty neat.

nexuist · on May 1, 2020

Are there plans to allow users to pick their own podcast episodes?

pdwittig · on May 1, 2020

Very strong second on this! Use case for me: Like most ppl, I often listen to podcasts while doing some other primary task (cooking, driving, etc.), and am unable to "note" the interesting snippets. When I go back to find those snippets, I am often unsure of which exact episode I heard it on, and if I do remember that, using the audio scrubber to find it is still a disaster. Would love to give it a try when you roll this out.

nmiodice · on May 1, 2020

Yep :)

tiew9Vii · on May 1, 2020

I had a similar idea wanting to play with the AWS / Google speech to text services.

I wanted to pipe in audio of various Youtube tech conference videos then apply some basic taxonomy/tagging and provide full text search so you can find a conference talk which contains some specific technology/subject you want to view.

I ran into difficulty because technology/software conferences use very specific acronyms and words that are not in general use. Also, being international, there are many accents and levels of English. This meant the AWS/Google APIs struggled to transcribe the videos, which was made harder by the compressed audio streams you get from YouTube versus WAVs.

lowdose · on May 1, 2020

Google offers the functionality to add your own acronyms and products on the commercial speech to text. I think there is even a manual quality feedback loop in alpha.

crawdog · on May 1, 2020

Very interesting. With the speech-to-text APIs out there, it would be interesting to expand further to point-in-time queries of the articles, similar to https://podcastsearch.david-smith.org.

Facets would be a nice to have - such as:

Series, Category, Guest, Date

Extra credit:

Speaker diarization and be able to search by individual speakers! If multiple channel feeds were available for the podcasts this would be easier to do... Maybe a search engine for podcasts where you partner with the content creators and give them incentives to tag/provide better feeds?

edit formatting.

bryanrasmussen · on May 1, 2020

It looks like you're wrapping some sort of api from hubhopper, so I guess there is not much you can do about setting up the search engine, but as a general rule you want to give extra weight to specific fields to end up with something useful.

In the case I searched for coronavirus the first hit of some sort of mountain bike podcast ranked higher than a number of podcasts that had coronavirus in the title.
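
A minimal sketch of the field-weighting idea, assuming each result has `title` and `transcript` text fields. The weights and function names are hypothetical; production engines expose this as per-field boosts rather than hand-rolled scoring.

```python
# Hypothetical weights: one title hit counts as much as five transcript hits.
WEIGHTS = {"title": 5.0, "transcript": 1.0}


def score(doc: dict[str, str], term: str, weights: dict[str, float]) -> float:
    """Sum weighted occurrence counts of term across the given fields."""
    term = term.lower()
    return sum(
        weight * doc.get(field, "").lower().split().count(term)
        for field, weight in weights.items()
    )


def rank(docs: list[dict[str, str]], term: str,
         weights: dict[str, float] = WEIGHTS) -> list[dict[str, str]]:
    """Order documents by descending weighted score."""
    return sorted(docs, key=lambda d: score(d, term, weights), reverse=True)
```

Under these weights, an episode with the term in its title outranks one that only mentions it in passing in the transcript, which is the behaviour the comment above asks for.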

mnfn · on May 1, 2020

I've been hoping that someone would do a broader version of David Smith's podcast search[1] for a long time. It's really helpful for answering 'what was that episode where they talked about x?' type questions.

[1]: http://podcastsearch.david-smith.org/

smcleod · on May 1, 2020

It doesn’t seem to handle searching for multiple words, for example someone’s full name. If I search for Doug Stanhope or “Doug Stanhope”, it only returns irrelevant results for the word Doug and ignores Stanhope.

nmiodice · on May 1, 2020

My hunch is that, due to the limited dataset, there are just no results for "Stanhope".

smcleod · on May 1, 2020

What exactly is the dataset? The page has nothing on it other than the search bar.

lihaciudaniel · on May 1, 2020

This may be the best post. You can't believe how useful this is.

nmiodice · on May 1, 2020

Thank you! I always appreciate feedback on how to make it better. Let me know if you have any specific feedback.

wusatiuk · on May 1, 2020

Would be interested in the stack / workflows / frameworks / tools you use.

nmiodice · on May 1, 2020

The front-end is built in React. The service is deployed as a single Spring Boot application written in Kotlin. In terms of workflows, there is a state machine that jobs progress through (asynchronously, with persistent state) within the service itself. It depends on no external workflow engines.

Infrastructure components include CosmosDB, Azure Blob Storage. The service is deployed as an App Service.

The STT and Search tech I'll keep under wraps for now as they may change.
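
For readers curious what an in-service job state machine with persisted state might look like, here is a toy sketch. It is written in Python rather than Kotlin, the state names are invented, and the `store` mapping merely stands in for a durable store; none of this is taken from the actual service, and the asynchronous scheduling is left out.

```python
from enum import Enum
from typing import MutableMapping


class State(str, Enum):
    # Hypothetical pipeline stages for one podcast episode.
    QUEUED = "QUEUED"
    DOWNLOADED = "DOWNLOADED"
    TRANSCRIBED = "TRANSCRIBED"
    INDEXED = "INDEXED"


# Each state has at most one successor; INDEXED is terminal.
TRANSITIONS = {
    State.QUEUED: State.DOWNLOADED,
    State.DOWNLOADED: State.TRANSCRIBED,
    State.TRANSCRIBED: State.INDEXED,
}


def advance(job_id: str, store: MutableMapping[str, str]) -> State:
    """Load a job's state, move it one step forward, and save it again.

    Because the state is re-read from the store on every call, a worker
    that restarts can pick the job up where it left off.
    """
    current = State(store.get(job_id, State.QUEUED.value))
    nxt = TRANSITIONS.get(current, current)  # terminal states stay put
    store[job_id] = nxt.value
    return nxt
```

In this sketch a plain `dict` works as the store for experimentation; swapping in a database-backed mapping is what would make the state survive restarts.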

rohan_shah · on May 1, 2020

Is there any site that ranks podcasts by viewership/listenership numbers?

schlu · on May 1, 2020

Did you handle dynamic ad insertion? If so what approach are you taking?

jdc · on May 1, 2020

I get a 500 error when I put my query in quotation marks.

notRobot · on May 1, 2020

What podcasts does this search?

rapsey · on May 1, 2020

Not many it seems.

nmiodice · on May 1, 2020

Correct. It’s costly to ingest data and the app is in beta so it doesn’t make sense to invest - yet - in indexing broader content

bravura · on May 1, 2020

So charge people to ingest their favorite podcast

KMnO4 · on May 1, 2020

How much are people willing to pay? A quick search shows Google's STT API at $1.44/hour. As an example, the Joe Rogan Experience is ~1500 multi-hour episodes, meaning it would cost >$5000 for just that one show.

Presumably the OP is using an offline speech processing tool, but compute costs would still be expensive.
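
The estimate above can be reproduced with a few lines of arithmetic. The $1.44/hour price and the ~1500 episodes come from the comment; the 2.5-hour average episode length is an assumption standing in for "multi-hour".

```python
PRICE_PER_HOUR = 1.44   # USD, as quoted in the comment above
EPISODES = 1500         # approximate episode count from the comment
AVG_HOURS = 2.5         # assumed average episode length


def transcription_cost(episodes: int, avg_hours: float, price_per_hour: float) -> float:
    """Total cost = number of episodes x hours per episode x price per hour."""
    return episodes * avg_hours * price_per_hour


if __name__ == "__main__":
    total = transcription_cost(EPISODES, AVG_HOURS, PRICE_PER_HOUR)
    print(f"Estimated cost: ${total:,.0f}")
```

Under these assumptions the total comes to about $5,400, consistent with the ">$5000" figure; at 2 hours per episode it would be about $4,320 instead.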

artificial · on May 1, 2020

Instead of Folding at home what about STT@Home and offer credits on the service for pooling resources? I've got compute I'd kick at this.

rapsey · on May 1, 2020

In that case lots of podcasts could afford to do it. Few have as many hours as JRE and 5k is peanuts compared to the tens of millions he is making.

_curious_ · on May 1, 2020

Cool, thanks for sharing this!