Show HN: Generate pdf with gitbook or mdbook url (github.com/lufengd3)
96 points by lufeng on Nov 11, 2023 | 41 comments



Just recently I was tasked with converting some huge HTML pages (with lots of small entries) into a PDF file. The requirements were a "fully automated solution" and "the PDF must look the same as the page when viewed in a browser". Probably takes less than five minutes, right? I thought the same.

Wrong.

Chrome/Chromium crashes due to a hard-coded memory limit in V8.

Firefox has no command-line option for printing PDFs.

No other libraries render the PDFs correctly, because they are not full-fledged web engines.

So what were my options?

1. Read/understand Chromium source, recompile to lift the memory limit.

2. Read/understand Firefox source, recompile to add a command line option.

3. Use some UI testing framework to automate pdf printing in Firefox.

Eventually I went with option 4: split the HTML files into smaller chunks, convert each, and re-combine. Of course the problem is knowing where to split the HTML so that each split lands on a page boundary. The solution is to binary-search the number of entries to put into each chunk, watching for when the number of generated PDF pages changes. What a pain.
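The chunk-size search described above can be sketched in a few lines of Python. Here `render_page_count` is a hypothetical stand-in for "convert this chunk to PDF and count its pages" (e.g. via a headless browser); the only requirement is that adding entries never reduces the page count, so the search space is monotonic.

```python
def largest_chunk(entries, page_budget, render_page_count):
    """Binary-search the largest number of entries that still
    renders within `page_budget` PDF pages."""
    lo, hi = 1, len(entries)   # candidate chunk sizes
    best = 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if render_page_count(entries[:mid]) <= page_budget:
            best = mid         # fits: try a bigger chunk
            lo = mid + 1
        else:
            hi = mid - 1       # too many pages: shrink the chunk
    return best
```

With a fake renderer where every two entries fill one page, `largest_chunk(list(range(100)), 10, render)` finds a chunk size of 20 entries; the real version would pay one PDF conversion per probe, O(log n) conversions per chunk.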


In 2014 we used wkhtmltopdf[0] to generate PDF copies of the Cloud Foundry docs for every version of every release, and maybe that's what I'd reach for now. Not sure whether Qt WebKit has limits similar to Chromium's.

Not that you asked, but I am sitting here silently judging whoever let those pages get that large. Enough html to cap out RAM? Chesterton's Fence dictates that I presume your upstream's hands were tied, but wowee!

0. https://wkhtmltopdf.org/


Thanks for the suggestion. Yes I've tried wkhtmltopdf. It works, but unfortunately it doesn't interpret the CSS correctly. So the end result looks very different from the actual webpage.


Just because Firefox doesn't have a command-line option to print to PDF doesn't mean it's not automatable; you could (have) automate(d) Firefox's UI instead.


Have you considered Playwright? It offers very straightforward APIs to do exactly this.


This is the way. Firefox may not have a CLI arg for saving PDFs, but it exposes one through the DevTools API, and that can be automated with Playwright, Selenium, etc. (or even plain HTTP requests).
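As a rough illustration of the Selenium route (not the thread author's code; assumes Selenium 4 plus a local Firefox and geckodriver, and the URL is a placeholder): the WebDriver print endpoint returns the rendered page as a base64-encoded PDF.

```python
from base64 import b64decode

from selenium import webdriver
from selenium.webdriver.common.print_page_options import PrintOptions

opts = webdriver.FirefoxOptions()
opts.add_argument("-headless")          # no UI needed for printing
driver = webdriver.Firefox(options=opts)
try:
    driver.get("https://example.com")   # placeholder URL
    print_opts = PrintOptions()
    print_opts.background = True        # keep CSS backgrounds
    # print_page returns the PDF as a base64 string
    with open("page.pdf", "wb") as f:
        f.write(b64decode(driver.print_page(print_opts)))
finally:
    driver.quit()
```

Note that Playwright's equivalent, `page.pdf()`, is Chromium-only, so for the Firefox path Selenium's `print_page` (or raw DevTools/WebDriver calls) is the more direct option.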


I use PrinceXML for converting long HTML into PDFs and haven’t had trouble with large documents, though I don’t know if we’re in the same ballpark in terms of file size or element count. It’s expensive but is a one-time purchase (and I think it’s free to use personally and to evaluate). You can also use it indirectly through DocRaptor (basically a PrinceXML SaaS with an API), though I’ve never tried it.


The overall size is not that big. The problem is that the html contains lots of small <div>s. But yeah I didn't bother trying any paid services. I probably should have.


Run the browser’s native print to pdf and save the result.


This destroys the CSS, forces paper sizes, etc.


For anybody else having the same problem: Orion Browser for macOS is, AFAIK, the only application on any platform able to save HTML pages as PDF with perfect results. It is WebKit-based and not automated, but maybe it can be scripted.


Pandoc generally works just fine: https://pandoc.org/


I developed KeenWrite[0] with similar ideas to mdbook: typeset Markdown documents into PDF. Technically, this happens in three stages. First, the Markdown is converted to XHTML. Second, the XHTML is converted to TeX commands. Third, the ConTeXt typesetting system produces a PDF file. Both the GUI and CLI can export to PDF.[1] (This means that XHTML also can be converted to PDF.)

Like mdbook, the themes are isolated. Instead of CSS, KeenWrite themes are written in ConTeXt. There are several example starter themes.[2] A "thesis" theme would be a nice addition, but there's a problem.

Markdown lacks a standard for cross-references and citations. An open KeenWrite issue animates a possible UX solution.[3] The topic of references/citations has been discussed on CommonMark[4] without much movement. Parsing cross-references and citations would likely benefit all flexmark-java[5] integrations. KeenWrite uses flexmark-java, but I'm otherwise unaffiliated. If anyone is interested in helping, reach out (see profile).

[0]: https://keenwrite.com/

[1]: https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/cmd...

[2]: https://gitlab.com/DaveJarvis/keenwrite-themes/

[3]: https://gitlab.com/DaveJarvis/KeenWrite/-/issues/145

[4]: https://talk.commonmark.org/t/cross-references-and-citations...

[5]: https://github.com/vsch/flexmark-java


It's my fear that markdown will eventually become as convoluted as SGML, just with simpler syntax.


and here all i want is to turn my useless pdf files into markdown


Baaaaaaaaased

ISO standardization of Markdown when so that browsers and stuff can natively render it? T_T


Sometimes I feel we're never going to be rid of PDFs. Everything can be converted into one, but you can't reliably convert from it.

They're going to be here in 200 years.


They have become a reliable fixture in office and legal processes around the world thanks to fixed layouts and (from a layman's view) content immutability. Yes, Acrobat Pro and PDF editors exist, but I'd argue that in the majority of cases editing a PDF with the intention to change or forge it is not as trivial as modifying a text file or markup source.

Correct me if I'm wrong, but Word/LibreOffice layouts can change depending on the machine and version number, whereas with PDF you get what you intended to show. I think that has always been PDF's winning proposition.


https://en.m.wikipedia.org/wiki/PDF/A

PDF/A is made especially for archival and long term preservation


It's just a useless label on the cover. PDF/A is nothing but a subset of PDF without proprietary extensibility, limited to what is considered to "work everywhere" when dealing with common printed matter. It adds nothing to the non-existent error handling rules or parsing strategies. There are five different ways for an object to come out undefined/nil, but the specification is silent on whether the meaning or handling differs depending on where that happens. So libraries and tools do whatever they find most suitable, and anything generated by the numerous easy-to-use sites is potentially not quite the same as what was originally uploaded.

PDF resembles the state of HTML years after HTML4: it barely says what should happen even in the best case.


Word offers content immutability with its read-only mode. Not sure how much layout change there is, but then how much pixel perfection do you need in legal processes that are mostly pure text?


Ever observed how legal documents, product documentation, books, etc. are typeset, with exact placement of comment boxes, US code references, legal disclaimers, barcodes, and precise footnotes or margin notes? Many of those are also meant to be machine readable when printed out. Exactness is a very pressing need.

I would recommend taking a good look again. It might explain why, in some situations, it is preferable to typeset in PDF rather than in a format where text can reflow.

About immutability in Word: it seems optional and not something by design. You can edit any .docx and "Save As" it back. That doesn't establish immutability as a principal feature.


If you have a 300-page legal document and other sources reference passages in it, e.g. a paragraph on page 234, it would be unreliable if, over the years or depending on the viewer, that paragraph moved to another page.


That's why legal documents number their paragraphs; otherwise references break within days of editing the document, no need to wait years for the app to change its layout.


Isn't the solution to that not to freeze your document's pixels, but rather to have auto-updating references (à la LaTeX)?


What version of Word? PDF has many free implementations for both reading and writing.


Doesn't read-only mode depend on some global setting? Also, what happens when the reading machine doesn't have all the fonts used in the document?


What global setting?

Nothing happens; embedded fonts aren't a feature exclusive to PDFs.


Sometimes vital images disappear and the layout gets messed up. I haven't had that happen to me with a pdf.


Though keep in mind that a given PDF might not embed the fonts it uses (the PDF's creator might not even have the legal right to embed them), so opening it on another machine can produce a messed-up rendering if that machine doesn't have the correct fonts installed.

If you want literally pixel-perfect layout, you need to use a raster image format like PNG.


PDF is a container: you can put TIFF images, one per page, into the file.
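For instance, Pillow (a common Python imaging library; assumed installed here, and the filename is made up) will happily pack a list of images into a PDF, one image per page:

```python
from PIL import Image

# Two tiny in-memory images standing in for scanned pages.
pages = [Image.new("RGB", (100, 140), "white") for _ in range(2)]

# Pillow writes a multi-page PDF: the first image becomes page 1,
# and append_images supplies the remaining pages.
pages[0].save("scanned.pdf", save_all=True, append_images=pages[1:])
```

Pillow converts the pixel data on save, so the same call works whether the source images started life as TIFF, PNG, or anything else it can open.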


I look at PDF like a virtual sheet of paper. Once you "print" to it, abandon any hope of getting things back out.

This sounds like a bad thing, but we need some kind of way to say "this is the final presentation no matter what".


The conversion tools in Acrobat recover most content pretty well. It's not a one-to-one conversion, but it's plenty usable for pasting into new documents or whatever.


Which is bothersome when it turns out you do need to extract or modify the data after all.


PDF probably has some feature that lets you embed machine-readable data (or the original file in its entirety).

Reminds me of how Adobe Illustrator files were also PDF files: you could view them as PDFs by renaming the extension to .pdf. (This might also apply to Photoshop .psd files.)


I feel that there's a lot of value in how a finished PDF document is visually inflexible. This is how the document looks, and this is how the document will look in a next-generation PDF viewer on a completely different computing platform or type of device. If it works on my machine, it works on yours too. (This is ignoring dynamic PDFs with JavaScript in them.)

It doesn't adapt to the device, which means you might need to zoom and pan, but it also means the layout probably won't be completely bungled.

I've bought an EPUB that only displays properly on an iPad. If the screen is sized differently, or a different font is used, etc., the text is all messed up. That simply wouldn't happen if the book had been distributed as a PDF.


> Everything can be converted into one, but you can't reliably convert from it.

My favorite "workaround" for turning a PDF into HTML is to render the PDF with something like pdf.js: create a canvas, render the contents, scale responsively, done. It works well enough for e.g. displaying book previews. Demo: https://merely.xyz/seven-photo-challenges/ (photo exercise ebook).


indeed, such a sad state of affairs when one of the most popular digital document formats isn't really a proper digital document, but a hack to resemble the bad old paper days


Or you can just run the target through Sphinx and get better-organized output with a real table of contents and indexes; it supports a huge number of output formats: https://www.sphinx-doc.org/en/master/usage/builders/index.ht...

If you must use markdown, there's always https://mystmd.org which integrates directly into Sphinx, modulo minor bits of weirdness due to markdown being a mishmash of extensions.


I suspect this is for creating PDFs for ChatGPT, not for generating project documentation from source. That's the use case that immediately came to mind, since a lot of Rust content (for example) is in mdbook format.


But gitbook and mdbook use markdown, which ChatGPT supports, too.



