My current workflow (for getting a magazine onto a website) is Calibre's HTMLZ e...

samuell · on Dec 1, 2023

Interesting! I tried it, but it seems to struggle with multi-column layouts (lines get intermingled). Is that something you tried?

afandian · on Dec 1, 2023

No, only standard paragraphs.

My workflow still takes manual tweaking. When I find floated figures with captions, the lines get intertwingled and need to be unintertwingled. So I'm not surprised it didn't work for you.

Good luck, and report back if you find what you're looking for. I'm always on the lookout for a better way.

samuell · on Dec 2, 2023

I can report that the closest I've come so far is with PDFMiner (https://pypi.org/project/pdfminer/) for Python. Its benefit is that it retains styling information, so italics and the like can be preserved, at least with some post-processing (I think one might need to convert certain CSS classes to actual <i> or <em> tags).
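For what it's worth, here is a rough sketch of the kind of post-processing I mean. It assumes the extracted HTML marks each styled run as a flat, non-nested <span> whose style or class attribute mentions "italic" or "oblique" (for example via a font name like Times-Italic); that markup and the helper name spans_to_em are my own assumptions for illustration, not something PDFMiner guarantees.

```python
import re

# Assumption: italic runs are flat (non-nested) <span> elements whose style or
# class attribute value contains "italic" or "oblique", e.g. "Times-Italic".
ITALIC_SPAN = re.compile(
    r'<span\b(?=[^>]*\b(?:style|class)="[^"]*(?:italic|oblique)[^"]*")[^>]*>'
    r'(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)


def spans_to_em(html: str) -> str:
    """Replace italic-looking <span> runs with <em>; leave other markup alone."""
    return ITALIC_SPAN.sub(r"<em>\1</em>", html)


# Hypothetical input resembling per-font spans in extracted HTML.
sample = (
    '<span style="font-family: Times-Roman">A </span>'
    '<span style="font-family: Times-Italic">very</span>'
)
print(spans_to_em(sample))
```

A regex is enough here only because the spans are assumed not to nest; for messier output a real HTML parser would be the safer choice.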

The other option I have started looking into is the PDFCPU library for Go. It is a bit more low-level than PDFMiner, but it gives you very well-structured info that seems like it could be post-processed quite well for one's particular use case and PDF layouts: https://github.com/pdfcpu/pdfcpu

I have now also tried the Marker tool in the OT, and it seems to do a reasonable job. It did intermingle some columns though, at least in some tricky cases, such as when there was a round image between the two columns. One note is that Marker doesn't seem to retain styling like italics.