
Question for the author: Why Markdown? It seems to me the hard part of this tool is parsing PDFs with high accuracy, not whatever you do with them afterwards. As such, I would love it if this tool let the user choose the output format. I know that I would use a high-accuracy PDF parser to render into epub.



You would want to have some kind of markup that preserves as much of the document structure as possible. I manage ebooks for a university press, and we have a deep backlist waiting for conversion, a lot of which only exists as page scans of old print volumes. I want to be able to offer them as epubs, which means I need to know where there are chapter breaks, heads, tables, charts, math, blockquotes, and so on. I have vendors that can do this for me, but it costs more than we'd get for some of these books in sales. I'd love to be able to do some of this myself.


I agree, the intermediate format should be plain text that could optionally be converted to any other format. I suppose, however, that Markdown is being used as the intermediate format here. It is close to plain text while still preserving simple layout information.

In practice, I would use the Markdown output and plug it into any tool that converts that into the desired final output format.
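For what it's worth, pandoc is one tool that can take Markdown and write epub. A minimal sketch of wiring that up, assuming pandoc is installed and on your PATH; the function names and file paths are made up for illustration, and I haven't checked how well this tool's particular Markdown output survives the trip:

```python
import shutil
import subprocess


def build_pandoc_cmd(md_path, epub_path):
    # pandoc infers the output format from the ".epub" extension of the -o target.
    return ["pandoc", md_path, "-o", epub_path]


def markdown_to_epub(md_path, epub_path):
    # Fail early with a clear message if pandoc is not available.
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(build_pandoc_cmd(md_path, epub_path), check=True)
```

Usage would be something like markdown_to_epub("book.md", "book.epub"), where both paths are hypothetical.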


That sounds reasonable. I might explore pdf -> markdown -> epub.

I wonder if this could somehow be used directly by calibre. I think calibre's pdf->epub conversion isn't amazing. In particular, tables often end up broken.


I chose markdown because I wanted to preserve equations (fenced by $/$$), tables, bold/italic information, and headers. I haven't looked into epub output, but this ruled out plain text.
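To illustrate why the $/$$ fencing is handy downstream: a consumer can pull the equations back out with a small regex. This is only a sketch of the idea, not how the tool itself works; split_math is a hypothetical helper, and it does not handle escaped dollar signs or dollar amounts in ordinary prose.

```python
import re

# Try display math ($$...$$) before inline math ($...$) so "$$" is not
# misread as two empty inline delimiters.
MATH_RE = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)


def split_math(markdown):
    """Split text into (kind, content) pairs; kind is 'text', 'inline', or 'display'."""
    parts = []
    pos = 0
    for m in MATH_RE.finditer(markdown):
        if m.start() > pos:
            parts.append(("text", markdown[pos:m.start()]))
        if m.group(1) is not None:
            parts.append(("display", m.group(1).strip()))
        else:
            parts.append(("inline", m.group(2).strip()))
        pos = m.end()
    if pos < len(markdown):
        parts.append(("text", markdown[pos:]))
    return parts


sample = "Energy is $E = mc^2$ and\n$$a^2 + b^2 = c^2$$"
segments = split_math(sample)
```

Each math segment could then be handed to whatever renderer the target format needs, while the text segments go through the normal Markdown path.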


Why not choose an unambiguously parseable output format such as JSON, and then convert the JSON to Markdown, HTML, etc. when needed?
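A rough sketch of what that could look like, assuming a made-up schema (a "blocks" list with "type", "text", and "level" fields). Nothing here reflects the tool's actual output; it only shows that one structured document can feed several renderers:

```python
import html
import json

# Hypothetical intermediate document; the schema is invented for illustration.
doc_json = """
{"blocks": [
  {"type": "heading", "level": 1, "text": "Results"},
  {"type": "paragraph", "text": "Accuracy improved."},
  {"type": "math", "text": "a^2 + b^2 = c^2"}
]}
"""


def to_markdown(doc):
    out = []
    for block in doc["blocks"]:
        if block["type"] == "heading":
            out.append("#" * block["level"] + " " + block["text"])
        elif block["type"] == "math":
            out.append("$$" + block["text"] + "$$")
        else:
            out.append(block["text"])
    return "\n\n".join(out)


def to_html(doc):
    out = []
    for block in doc["blocks"]:
        text = html.escape(block["text"])  # escape &, <, > and quotes
        if block["type"] == "heading":
            out.append("<h{0}>{1}</h{0}>".format(block["level"], text))
        elif block["type"] == "math":
            out.append('<div class="math">{}</div>'.format(text))
        else:
            out.append("<p>{}</p>".format(text))
    return "\n".join(out)


doc = json.loads(doc_json)
```

The trade-off is that someone has to define and maintain the schema, whereas Markdown already has tooling around it.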



