Show HN: PDF to Podcast – Convert Any PDF into a Podcast Episode (pdf-to-podcast.com)
118 points by knowsuchagency 3 months ago | 42 comments
Hi HN!

I'm stoked to share a project I've been working on called PDF to Podcast. It's a free, open-source tool that automatically converts PDF documents into engaging, informative podcast-style audio content using large language models and text-to-speech tech.

Inspiration: The idea for this project came from the NotebookLM demo at Google I/O, where they showcased generating audio dialogue from uploaded PDFs and other sources. However, that audio feature hasn't been publicly released yet, and I wanted to challenge myself to build something similar using existing tools and APIs.

How it works:

1. The user uploads a PDF
2. The tool extracts the text and feeds it into Google's Gemini Flash language model
3. Gemini Flash generates a natural, engaging podcast dialogue script based on the key information in the document
4. This script is then converted to audio using OpenAI's text-to-speech API
5. The user can listen to the generated "podcast episode" and read along with the transcript

I chose to use Gemini Flash for the language model because it's good at writing high-quality prose while being fast and cheap. We use OpenAI's TTS API to then bring the dialogue to life.

Under the hood, it's built with Python, FastAPI, Gradio for the web UI, and my own library, promptic, for calling the LLM and getting structured output. The code is open-source and available on GitHub.
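
For anyone who wants the shape of that flow without reading the repo, here is a minimal sketch of the steps listed above. It is not the project's actual code: `DialogueLine`, `pdf_to_podcast`, and the two stand-in functions are hypothetical names, and the real LLM and TTS calls are passed in as plain callables so the example runs without API keys. Joining per-line audio by concatenating bytes is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class DialogueLine:
    speaker: str  # hypothetical role name, e.g. "host" or "guest"
    text: str


def pdf_to_podcast(
    pdf_text: str,
    write_script: Callable[[str], List[DialogueLine]],
    synthesize: Callable[[str, str], bytes],
    voices: Dict[str, str],
) -> Tuple[bytes, str]:
    """Turn extracted text into (audio bytes, transcript)."""
    script = write_script(pdf_text)  # LLM step: text -> dialogue
    audio = b""
    transcript = []
    for line in script:
        voice = voices[line.speaker]  # one TTS voice per role
        audio += synthesize(line.text, voice)  # TTS step, one line at a time
        transcript.append(f"{line.speaker}: {line.text}")
    return audio, "\n".join(transcript)


# Stand-ins for the LLM and TTS calls (assumptions, not real API clients).
def fake_write_script(text: str) -> List[DialogueLine]:
    return [DialogueLine("host", "Welcome!"), DialogueLine("guest", text[:20])]


def fake_synthesize(text: str, voice: str) -> bytes:
    return f"[{voice}:{text}]".encode()


audio, transcript = pdf_to_podcast(
    "Attention is all you need.",
    fake_write_script,
    fake_synthesize,
    voices={"host": "alloy", "guest": "onyx"},
)
print(transcript)
```

Keeping the LLM and TTS steps behind callables makes it easy to swap providers or test the glue code offline.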

Apart from the tool's practical utility, I'm hoping this project can serve as a helpful example for others looking to build applications on top of large language models. It demonstrates an end-to-end flow from document intake to language model usage to audio output, with a simple web interface on top.

I would love to hear any feedback or ideas from the HN community! I think there's a lot of potential to expand on this concept and make all sorts of written content more accessible and engaging through audio conversion. Let me know what you think :)




I always go straight for the prompt with this kind of thing - it's here: https://github.com/knowsuchagency/pdf-to-podcast/blob/512bfb...

It starts like this:

    Your task is to take the input text provided and turn it into
    an engaging, informative podcast dialogue. The input text may
    be messy or unstructured, as it could come from a variety of
    sources like PDFs or web pages. Don't worry about the
    formatting issues or any irrelevant information; your goal is
    to extract the key points and interesting facts that could be
    discussed in a podcast.

The way this uses different OpenAI TTS voices for the different roles is really neat!


I wonder what (if anything) is the impact of the leading spaces on each line of the multiline string, which are an artifact of wanting to keep the prompt pretty within code.

Hopefully not much, but I've heard horror stories about trailing spaces...
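
If the indentation is a worry, one way to keep the prompt pretty in code and still send clean text is to dedent it before use. A small sketch using the standard library; `build_prompt` is a hypothetical name and the prompt text is shortened from the one quoted above.

```python
import textwrap


def build_prompt() -> str:
    prompt = """
        Your task is to take the input text provided and turn it into
        an engaging, informative podcast dialogue.
    """
    # dedent() removes the whitespace common to every line;
    # strip() drops the leading and trailing blank lines.
    return textwrap.dedent(prompt).strip()


print(build_prompt())
```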


As far as I can tell that only really affects the smaller models - GPT-4 / Claude / Gemini all seem pretty much impervious to weird whitespace in my experience.


I imagine you could force this even further by specifying the names of the researcher and interviewer, and giving details of the structure of the episode


It might be a good idea to toss some kind of audio disclaimer at the beginning of the podcast that cites the source and that the audio is completely fabricated. Reason being, the "Attention is All You Need" example on your site has Anya Sharma (an actual AI researcher who is unrelated to the Attention paper) on as a guest. Not sure if this is intentional or a hallucination, but it seems like a huge liability
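
A disclaimer like that could be added before the TTS step rather than left to the model. A rough sketch, assuming the script is a list of (speaker, text) pairs; `with_disclaimer` and the wording of the notice are made up for illustration.

```python
from typing import List, Tuple


def with_disclaimer(
    script: List[Tuple[str, str]], source_title: str, narrator: str = "host"
) -> List[Tuple[str, str]]:
    """Prepend a spoken notice that cites the source and flags the audio as synthetic."""
    notice = (
        f"This episode was generated by AI from the document '{source_title}'. "
        "The voices are synthetic and the speakers are fictional."
    )
    return [(narrator, notice)] + list(script)


script = with_disclaimer([("guest", "Thanks for having me.")], "Attention Is All You Need")
print(script[0][1])
```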


I tried the example physics article and it just made up a physicist to 'interview' that wasn't mentioned in the article.


Human podcasters hallucinate too.


Yeah definitely. I specifically only have dead presidents on my webdev podcast. When I interviewed George Bush Sr. last week, he said the housing crisis is overblown and we should all be focusing on moving our React apps to Vue


Awesome project!

However, I find that when I realize a podcast is generated using AI and synthetic audio, I immediately lose interest. For me, the value of podcasts lies in authentic human conversations, and AI-generated content just doesn’t have the same appeal.

Probably it's just me being obsessed with old-school podcasts, though. I do believe there are listeners (not sure if many or few) who don't mind if a podcast is AI-generated.


Funny, I've been using even primitive text-to-speech on PDFs for years and while nothing compares to an excellent human reader, I find TTS often better than a mediocre human reading. This is mainly because I don't get upset at (and then have to forgive) a machine when it says "Loovree" instead of the Louvre or, in an economic history book, pronounces John Maynard Keynes as "Keens" (it sounds like "Kaynes"). Also the dead neutrality of a machine's reading can jar me less than a numbskull and/or phony human rendition. I must say though that excellent voice actors are to me heaven.


What extensions/apps would you recommend?

I have tried to set up something similar with a text-to-speech browser extension, but I lose my place if I have to close and reopen.


On Mac with a pdf I just select say a chapter and let it read. Footnotes can be a problem. I usually use iOS now and I wrote PDF before but realize I use .epub files mostly. You can set up iOS to read entire pages. I use the local iOS Books app and have it so a two-finger swipe from the top of a page starts reading. It will usually turn pages by itself but can be a bit janky. I choose a good quality voice and have spent ten or twenty minutes rigging it up in Settings.

It's all far from perfect.


That's interesting. For me, podcasts are just news articles or books that I don't have the time to sit down and read. The only time I listen to podcasts and audiobooks is when I am walking around or doing chores. Yes, many podcasts have a human element to them that is nice, but just as many are still useful without a human. For those, I'm primarily there for the information itself, not who conveys it.


It’s almost certainly the case that the most profitable and popular podcasts are ones built around the personality of their host(s) and not because the content is merely in audio form. So while this tool is useful for listening to information instead of reading it, the likelihood of a major podcast being entirely AI-generated is pretty low.


Just a tangent: fans are obsessed with certain artists, say, TSwift, because of their personality rather than pure voice and lyrics. That's why concerts are so fucking popular.


I tried the same thing for my kids:

Take some article or book written for adults. Maybe some archaeological discovery, interesting stuff from HN. Or science books from the 1960s.

Then have it turned into a conversation between the father and a curious, seven year old daughter. And convert it to audio with two different speakers.

While it’s been fun to build this, I never ended up letting my kids use it. It just feels wrong. The educational equivalent of Harlow’s Monkeys.


Why does this feel wrong? It seems like, supervised, it could be very beneficial.


Looks good. As other people said, it's risky to give you my OpenAI key, so I'd make the app run locally with React maybe. Moreover, it'd be good to give an approximation of the price. It's kinda scary to click "Submit" and later on see that I was charged $3 by OpenAI.


The page has a link to the code, so I guess you can self-host it: https://github.com/knowsuchagency/pdf-to-podcast


Looks like a fun project!

Do you have any samples of the audio? It would be great to hear what it's like before trying it out.

Also, have you considered doing this all in client side JS? Would be a good way to protect the API key (at least in this demo case).


At the bottom of the page, there are examples.


I think it would probably help to take the PDF up front, do a combination of checking the DPI and page count to get an estimated word count (as OCRing to get an exact word count might be costly on your end), and then return back a “price preview” at which point the customer just pays the price to get their podcast.

Like others have mentioned, I’d be scared to accidentally upload a 100 page PDF only for it to cost me $100 without me really knowing up front.
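
A back-of-the-envelope version of that preview could look like the sketch below. Everything numeric here is an assumption: the words-per-page and characters-per-word defaults are guesses, the rate is whatever your TTS provider currently charges (the 15.0 in the usage line is only a placeholder), and LLM cost is ignored. It treats the whole document as if it were read aloud, so it is a rough ceiling as long as the script is no longer than the source. `estimate_tts_cost_usd` is a hypothetical name.

```python
def estimate_tts_cost_usd(
    page_count: int,
    usd_per_million_chars: float,
    words_per_page: int = 500,  # assumed average for a text-heavy page
    chars_per_word: int = 6,    # assumed, including the trailing space
) -> float:
    """Estimate TTS cost from page count alone, without running OCR."""
    estimated_chars = page_count * words_per_page * chars_per_word
    return estimated_chars / 1_000_000 * usd_per_million_chars


# Placeholder rate; check the provider's current pricing before trusting this.
preview = estimate_tts_cost_usd(page_count=100, usd_per_million_chars=15.0)
print(f"Estimated TTS cost: ${preview:.2f}")
```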


Sounds exactly like the way the Simply News podcast is put together. That is 100% AI for each topic (AI, tech, business, science, etc.) and combines multiple recent papers/stories for hundreds of daily podcasts.

https://simply-ai.podbean.com

https://www.simplynews.ai


Love the idea, as I never find enough time to sit down and read but could listen to it while running or commuting. However, I'm hesitant to hand over my OpenAI API key to a website that's not under my control. No idea though how the trust problem can be solved.


All I can think of is some form of middle man.

An escrow agent.


Cool! But the really cool thing would be a service that converts the contents of a text RSS/Atom feed to a podcast with a podcast feed. Imagine your favorite blogs being podcasts that you could listen to on the go.
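
The feed half of that is mostly plumbing. Here is a rough sketch of it, assuming an RSS 2.0 input (Atom is not handled) and a hypothetical `audio_url_for` callback that returns where the generated MP3 for each post lives. Generating the audio itself is out of scope, and the enclosure `length` is left at "0" as a placeholder.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def to_podcast_feed(rss_xml: str, audio_url_for: Callable[[str], str]) -> str:
    """Copy each item of an RSS 2.0 feed and attach an audio <enclosure>."""
    src_channel = ET.fromstring(rss_xml).find("channel")
    out = ET.Element("rss", version="2.0")
    channel = ET.SubElement(out, "channel")
    title = src_channel.findtext("title") or ""
    ET.SubElement(channel, "title").text = f"{title} (audio)"
    for item in src_channel.findall("item"):
        link = item.findtext("link") or ""
        new_item = ET.SubElement(channel, "item")
        ET.SubElement(new_item, "title").text = item.findtext("title")
        ET.SubElement(new_item, "link").text = link
        # Podcast clients look for the enclosure to find the audio file.
        ET.SubElement(
            new_item, "enclosure",
            url=audio_url_for(link), type="audio/mpeg", length="0",
        )
    return ET.tostring(out, encoding="unicode")


sample = (
    "<rss version='2.0'><channel><title>My Blog</title>"
    "<item><title>Post 1</title><link>https://example.com/p1</link></item>"
    "</channel></rss>"
)
print(to_podcast_feed(sample, lambda link: link + ".mp3"))
```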


I’ve been poking at this on and off! Got to the point where you can use a CLI command to turn any URL into a podcast episode. Even describes images and embedded code snippets so you can get the full experience.

Got distracted by other priorities so I haven’t done the RSS bits yet. In large part because that’s just boring old engineering stuff instead of playing with new toys. But I intend to get back and finish this thing by the time I start training for my next marathon. Need lots of listening material when that happens :)

Until then, hope this helps: https://github.com/Swizec/rss-to-podcast


How do you tell it where to put the Athletic Greens advertisements?


I never listen to a podcast on less than 1.5x because there is already too much crap conversation, and I only want the nuggets of value; so I would only use tts for listening to text.


This is cool! It works nice. Too bad the audio is only in English, even if you submit a PDF in another language


Should add an option to get Swedish Chef output, Bork Bork!


I want this too


Can I download as an mp3 for later playback or archiving?


If the examples are anything to go by, then yes, they are providing a link to an mp3 to download.


Can you imagine this, with an RVC pass to do voice transfer... what a time to be alive.

Just wondering why the choice of OpenAI TTS instead of elevenlabs?


Congrats on launch! Brilliant.


Nice work but I gotta provide my own OAI key? Why not just run one of the API demos at this point.


Can someone make something going the other way?

I don't like podcasts. I tune out after about 30 seconds of chit chat and intros and blah blah blah and end up missing stuff and can't search for it or copy and paste it.


Agreed. It's so annoying having good content buried in audio. For interviews of note on YouTube, last week I cracked and spent 2 hours writing a yt-dl based ripper that converts the whole thing into an HTML linkified webpage via intermediate VTT, opening the resulting subtitle-based transcript file in the browser so you can easily scan and click anywhere you want to see the video, and it will open in a new window at exactly that point. Not perfect but saves AGES.
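
The VTT-to-webpage step can be sketched roughly as below. This is not the ripper described above, just an illustration of the idea: it assumes cue timings in the `hh:mm:ss.mmm --> ...` form (the short `mm:ss.mmm` form is not handled), a known YouTube video id, and that the subtitles were already downloaded. `vtt_to_html` is a hypothetical name.

```python
import html
import re

# Matches the start time of a cue line such as "00:01:15.000 --> 00:01:18.000".
CUE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.\d{3} --> ")


def vtt_to_html(vtt: str, video_id: str) -> str:
    """Turn each subtitle cue into a paragraph linking to that moment in the video."""
    lines = vtt.splitlines()
    rows = []
    for i, line in enumerate(lines):
        match = CUE.match(line)
        if not match:
            continue
        hours, minutes, seconds = (int(part) for part in match.groups())
        offset = hours * 3600 + minutes * 60 + seconds
        # The cue text is every following line up to the next blank line.
        text_lines = []
        for following in lines[i + 1:]:
            if not following.strip():
                break
            text_lines.append(following.strip())
        text = html.escape(" ".join(text_lines))
        url = html.escape(f"https://www.youtube.com/watch?v={video_id}&t={offset}s")
        stamp = line[:8]  # "hh:mm:ss"
        rows.append(f'<p><a href="{url}" target="_blank">{stamp}</a> {text}</p>')
    return "\n".join(rows)


sample_vtt = "WEBVTT\n\n00:01:15.000 --> 00:01:18.000\nHello <world>\n"
print(vtt_to_html(sample_vtt, "abc"))
```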


I run MacWhisper on my laptop, and often dump podcast MP3s into it, extract the Whisper transcript and then feed that through a long context model like Claude 3 Haiku/Opus or Gemini Pro 1.5/Gemini Flash using my https://llm.datasette.io/ tool to answer questions against that transcript.


I don't like podcasters because they usually muddle through stuff and approach things in a kind of non-productive superficial way that drives easy engagement rather than hard work results.

That said, if it's a topic that I'm really, really ignorant about, a little podcast/YouTube can be helpful. For example, Yannic Kilcher's YouTube videos, especially how he annotates and breaks down the math equations, can be very useful if the paper's domain is new to me.

I think about it as pre-reading the paper.

A more focused first and second reading mode, may I propose, would add even more value. In these modes, the paper would be read more faithfully.

A problem that text to speech has when you feed it a regular PDF is that it will choke on titles, headings, footers, inline citations, page numbers, acronyms, abbreviations, numerical tables, charts, and diagrams.

So I would like to build or see something that conversationally reads the PDF as if it were a peer reading to me, unpacking abbreviations, mentioning titles and authors and years of citations (when I want that), describing charts, and perhaps even letting me interrupt to discuss specific misunderstandings I'm having.

There's obviously a challenge that reading a paper is an active engagement depending on your own knowledge state. We might gloss over formulas, footnotes, and citations on a first read, for example.

Still, a low hanging fruit would be a converter mode that accurately strips out page numbers and headers. There is little in this world more aggravating than listening to a 30 page paper, and having to hear that paper title and authors repeated an additional 15 times because it's reading the header.
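
A crude version of that converter mode might look like the sketch below. It assumes the PDF text has already been extracted one string per page, treats any line that shows up on a large share of pages as a running header or footer, and drops lines that look like bare page numbers. The name `strip_running_headers` and the 40% threshold are made up, and a real paper would need more care (a legitimate line that is just a number gets dropped too).

```python
import re
from collections import Counter
from typing import List

PAGE_NUMBER = re.compile(r"^(page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)


def strip_running_headers(pages: List[str], min_fraction: float = 0.4) -> List[str]:
    """Remove lines repeated across many pages, plus bare page numbers."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip()
            and line.strip() not in repeated
            and not PAGE_NUMBER.match(line.strip())
        ]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Title by A\nIntro text\n1",
    "Body text two\n2",
    "Title by A\nMore text\n3",
    "Other\n4",
    "Title by A\nEnd\n5",
]
print(strip_running_headers(pages))
```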


Stop downvoting me you delirious sheeple!

I accidentally wrote 'podcasters' instead of 'podcasts'.

I mean, I'll grant that podcasters are the scum of the Earth, but I didn't intentionally mean to insult them there. [Here I'm just doing it for fun, lol.]

And I swear to God and warn you!

You all are going to make me start a podcast if you end up downvoting this comment too! Is that what the world needs!?? For me to start an AI generated podcast!?!! Don't make me do it!!!



