Wkhtmltopdf, shell utility to convert html to pdf using webkit rendering engine

cletus · on April 16, 2012

I spent a lot of time 2-3 years ago assessing different tools to convert HTML+CSS to PDF [1]. At the time, this was to convert HTML plus custom tags into well-formatted legal documents.

At the time the hands down winner was Prince XML [2]. It's relatively expensive ($3800 for a single server license) but it just works, works from many languages and produces beautiful results quickly (look at their samples). It doesn't take a lot of developer time to make up that purchase cost.

I haven't checked out this particular project, but the others I tried tended to work for smaller samples and would die, take forever or produce unpredictable results on even moderately large documents (~150k).

For any commercial project, honestly I'd just fork over the $3800 for Prince without hesitation. It's simply that good.

EDIT: actually, looking over the SO question I think I did check out an early version of this project and didn't have much success with it. The one thing that concerns me about this project now is the last news item is over a year old. Is it still being actively maintained?

[1]: http://stackoverflow.com/questions/391005/convert-html-css-t...

[2]: http://princexml.com/

drothlis · on April 16, 2012

Not to mention that Prince XML has excellent support for CSS paged media (margins, page breaks, headers & footers, etc). Contrast with the printed output of any major browser -- they're all quite disappointing.

It would be nice if that $3800 included free upgrades to subsequent releases, though...

solutionyogi · on April 16, 2012

+1 for Prince XML.

Ryan Tomayko (a githubber) is extremely impressed. And when Ryan (or any other community respected hacker) is impressed, I can use that product without any further thinking.

http://tomayko.com/writings/princexml

If you are on .NET, I also recommend Essential Objects PDF library (http://www.essentialobjects.com/Products/EOPdf/Default.aspx). I have been using it for a production project and it is rock solid. At $549/developer, it is much more affordable.

mike-cardwell · on April 16, 2012

I used wkhtmltopdf in a previous project and found it to be extremely reliable and easy to use. I was extracting the HTML mime parts from incoming email, converting them to PDFs with wkhtmltopdf, then converting that to a PNG with ImageMagick and displaying the PNG to the user in a web browser.
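
A minimal sketch of that two-step pipeline, assuming `wkhtmltopdf` and ImageMagick's `convert` are both on the PATH (ImageMagick also needs Ghostscript to read PDFs). The function and file names are hypothetical, and only the first PDF page is rasterised here.

```python
import subprocess

def pdf_command(html_path, pdf_path):
    # wkhtmltopdf takes the input page followed by the output file.
    return ["wkhtmltopdf", html_path, pdf_path]

def png_command(pdf_path, png_path, page=0):
    # ImageMagick selects a single PDF page with a "[index]" suffix;
    # index 0 is the first page.
    return ["convert", f"{pdf_path}[{page}]", png_path]

def html_to_pdf_and_png(html_path, pdf_path, png_path, run=subprocess.run):
    # Run both steps in order and keep both outputs: the PDF retains
    # clickable links and selectable text, the PNG is easy to embed.
    for cmd in (pdf_command(html_path, pdf_path),
                png_command(pdf_path, png_path)):
        run(cmd, check=True)
```

Calling `html_to_pdf_and_png("mail.html", "mail.pdf", "mail.png")` would leave both files on disk; the `run` parameter exists only so the commands can be inspected without executing them.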

pbhjpbhj · on April 16, 2012

Why bother with the intermediate stage and not go direct html-to-png?

mike-cardwell · on April 16, 2012

Originally because I didn't find a free app which would do that. Then I decided to keep the PDF as it was quite useful. Unlike with the PNG, the HTML links were retained in the PDF. I.e., HTML anchor tags are still clickable in PDFs generated by wkhtmltopdf.

EDIT: A PDF will also let you select text, unlike an image. However, an image is nicer to embed in a webpage. So I utilised both in order to get the best of both worlds.

blakeeb · on April 16, 2012

Oops just noticed this comment. See my other comment (wkhtmltoimage is part of the package, allows you to render HTML+CSS into PNG, compile into x64 and place binary in your git repo directory to use on heroku)

pilif · on April 16, 2012

There are two issues I have with wkhtmltopdf that still have me fall back to the printing code I wrote in 2004 to create PDFs for our users:

* WebKit's support for printing is a bit behind the times and stuff like "display: table-header-group;" isn't quite supported, so whenever you have to print big lists across multiple pages, you are practically forced to do your own page breaking.

* Due to an issue somewhere between qt and webkit, it's not possible to hyphenate text. Well. It is possible, but it causes the hyphen not to be painted in most cases.

Not having to deal with manual page breaks and being able to hyphenate (and thus do real justification) would have been my two reasons for moving off my home-grown solution, but as those are exactly the things missing in wkhtmltopdf, I'm staying with my own solution.

Aside from that: If you can live with these shortcomings and with the fact that you are for all intents and purposes forced into using their static build (patched qt, kerning issues for everybody trying a build without the same qt patches), then this might be the perfect solution for PDF generation.

It feels great to use CSS with mm measurements and getting exactly what you need. Or creating barcodes by just embedding SVG or being able to use the full capabilities of HTML, CSS and even JS when building your page.

jcr · on April 16, 2012

It's worth noting this log entry:

> Aug 11 2009: Development moved to git http://github.com/antialize/wkhtmltopdf

potyl · on April 16, 2012

WebKit is quite powerful and can quite easily be used to generate a PDF, SVG, PostScript or PNG with almost no effort.

I wrote a simple Deck.JS [1] and S5 [2] PDF converter using a few lines of scripting. These programs take a slide presentation written in HTML5 and convert them into a portable PDF document. This is very handy since you can then share a single file that includes all graphical elements (fonts, images, layout) intact.

I have a GitHub toy repo [3] where I made a few tests with WebKit. One of the programs there (screenshot.pl) even lets you use XPath to find the subnode to grab.

[1] https://github.com/potyl/perl-App-deckjs2pdf

[2] https://github.com/potyl/perl-App-s5pdf

[3] https://github.com/potyl/Webkit

imurray · on April 16, 2012

Every time I see a utility like this, I think maybe I could switch to producing some materials in HTML as the primary, or main intermediary, source format. Then I try the utility and realize that that would be silly.

For example, I currently make PDF slides for talks. In theory I'd like to make HTML slides, but would still like the ability to render a PDF for a robust record. However, neither this utility nor PhantomJS (which I just tried) immediately does a good job of converting something like: http://bit-player.org/deck.js/limits-to-growth-Harvard-2012-...

EDIT: also just tried cutycapt, with similar results to wkhtmltopdf (got all slides rather than just visible one, with bad page breaks, and no TeX maths).

imurray · on April 16, 2012

I take some of it back. Getting the latest version of wkhtmltopdf and telling it to wait (probably longer than necessary) to process javascript, works pretty well.

    wkhtmltopdf --javascript-delay 10000 --no-stop-slow-scripts 'http://bit-player.org/deck.js/limits-to-growth-Harvard-2012-03-30/ltg-talk.html#Lotka-Volterra' slide.pdf

It's a bit slow, and a bit too hacky for me. But this tool does the best job of those I've seen.

And I've just received an email pointing me to: http://search.cpan.org/perldoc?deckjs2pdf https://github.com/potyl/perl-App-deckjs2pdf that will specifically deal with Deck.JS slides.

kelvin0 · on April 16, 2012

Well, I am looking for some feedback on a project that converts XML to PDF. Give it a try: https://github.com/kelvin0/PyXML2PDF

imurray · on April 16, 2012

I am looking for a command-line utility that could do:

    webpage2pdf 'http://bit-player.org/deck.js/limits-to-growth-Harvard-2012-03-30/ltg-talk.html#Lotka-Volterra' slide.pdf

and actually work (create a sensible PDF representation of what I can see in a browser). So my feedback wouldn't be useful, as my use case is out of scope for your project: "PyXML2PDF is NOT compatible with any XHTML/HTML/CSS. It uses a small set of tags to quickly allow generation of PDFs."

potyl · on April 16, 2012

It seems that what you want is deckjs2pdf, get it from CPAN [1] or GitHub [2]

[1] http://search.cpan.org/perldoc?deckjs2pdf

[2] https://github.com/potyl/perl-App-deckjs2pdf

pbhjpbhj · on April 16, 2012

Would it be sufficient to create PNGs of the web pages and extract the text of the webpage to place in the background of a PDF file (for search, screenreading)?

imurray · on April 16, 2012

Not for me. Personally I'll stick to ways of making decent PDFs that don't go via HTML.

driverdan · on April 16, 2012

Over the years I've tried various HTML to PDF utilities and have yet to find one that works correctly.

Previously I was using htmldoc, a project that was abandoned years ago and doesn't work with CSS. It worked for what I was doing but without CSS it's very inflexible and hard to maintain.

I recently moved to wkhtmltopdf but it has plenty of its own issues. The biggest problem I've found is that it doesn't wrap text between pages correctly. If you have a multi-page document it's likely the last line of text on a page will be split over 2 pages. IMO this is a show stopping bug. It has been known for a while but it seems no one is working on it.

The OS X version is broken. It was creating 5MB+ PDFs that should be about 50k. The Linux version doesn't have this bug.

hendrik-xdest · on April 16, 2012

It's practically unusable when not in an environment with X11. I had to use it on a Windows system and all text would have incorrect letter spacing. Every letter would bleed into the next; it's a typographic nightmare. You could use Arial Unicode MS to get a somewhat acceptable result but that won't support bold or italic text cleanly.

I'm not quite sure, but I think the fix isn't even part of the 0.11 release. One has to compile wkhtmltopdf oneself to get it working.

When this issue is resolved this will be perfect, though. It has great capabilities to render footers and headers and JS-based output (in my case Highcharts). For the time being you can't even switch to commercial systems - ActivePDF, for one, has the same issue in the latest release.

TimMontague · on April 16, 2012

You can work around the text kerning issue by using custom web-fonts. But I agree, they definitely need to fix this issue.

http://code.google.com/p/wkhtmltopdf/issues/detail?id=72

leeoniya · on April 16, 2012

you can use xvfb as a lightweight alternative to full X, works well.
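
A minimal sketch of that approach, assuming the `xvfb-run` wrapper script that ships with Xvfb on many Linux distributions is installed; its `-a` option asks it to pick a free display number. The helper names are hypothetical.

```python
import subprocess

def with_xvfb(command):
    # Prefix any command with xvfb-run so it gets a virtual X display.
    # "-a" lets xvfb-run choose an unused display number automatically.
    return ["xvfb-run", "-a", *command]

def render_headless(url, pdf_path, run=subprocess.run):
    # Run wkhtmltopdf under the virtual display; raises if it fails.
    run(with_xvfb(["wkhtmltopdf", url, pdf_path]), check=True)
```

Keeping the wrapper separate from the render call means the same prefix can be reused for wkhtmltoimage or any other tool that expects an X server.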

blakeeb · on April 16, 2012

Tip: wkhtmltoimage is part of the package, allows you to render HTML+CSS into PNG.

I used this for a project which needed a CSS powered image builder, which created sharable images:

Builder: http://circlek-flugtag.heroku.com/entries/shipomatic Thumbnails: http://circlek-flugtag.heroku.com/entries

thejosh · on April 16, 2012

I used to use this, but it has started segfaulting on a large range of websites.

I've since switched to cutycaps which handles all my needed features out of the box.

mdaniel · on April 16, 2012

I think you mean http://cutycapt.sourceforge.net/ which I only found by switching back to Google; DDG was not able to decipher your type-o. :-)

hieronymusN · on April 16, 2012

Wkhtmltopdf does have its share of warts, but if you need to do a quick and dirty PDF dump of an entire site, it can help.

I used it with wget to scrape a site for conversion: http://darrenknewton.com/blog/2011/10/30/mirror-site-and-con...
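
A rough sketch of the conversion half of such a workflow, assuming the site has already been mirrored into a local directory (for example with `wget --mirror --convert-links`). This is not necessarily what the linked post does; the function name and the flattened output naming are assumptions made for illustration.

```python
from pathlib import Path

def conversion_commands(mirror_dir, out_dir):
    # Build one wkhtmltopdf command per mirrored HTML page.
    mirror, out = Path(mirror_dir), Path(out_dir)
    commands = []
    for page in sorted(mirror.rglob("*.html")):
        # Flatten the relative path into a single file name so pages
        # from different folders do not overwrite each other.
        relative = page.relative_to(mirror).with_suffix(".pdf")
        name = "_".join(relative.parts)
        commands.append(["wkhtmltopdf", str(page), str(out / name)])
    return commands
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`; building the commands first makes it easy to review what will be converted before running anything.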

beggi · on April 16, 2012

I used this for a project the other day, but after discovering PhantomJS I feel like it has more traction.

snowmaker · on April 16, 2012

How specifically did you use PhantomJS for PDF generation?

imurray · on April 16, 2012

There's an example here: http://code.google.com/p/phantomjs/wiki/QuickStart#Rendering (gives examples of .png and .pdf generation)

ashconnor · on April 16, 2012

Thank you for posting. I recently did a bit of research on producing thumbnails with Ruby using this project: https://github.com/csquared/IMGKit.

The only problem was getting the screenshot to work with Flash. It seems as though the javascript delay option on wkhtmltopdf didn't delay at all.

Can PhantomJS handle pages with Flash?

imurray · on April 16, 2012

Not any more, as a web search immediately reveals: http://code.google.com/p/phantomjs/issues/detail?id=418

qznc · on April 16, 2012

PhantomJS has a "render(fileName)" API which supports PDF. Essentially, it is trivial to implement something like Wkhtmltopdf. However, Wkhtmltopdf is a shrinkwrap solution, which does not require you to install nodejs.

chrisbroadfoot · on April 16, 2012

PhantomJS doesn't run on node.js

dantiberian · on April 16, 2012

Does anyone know of any other engines like this, either paid or free? We are using this to produce catalogs of 200+ complex pages and it is not handling PDFs of this size very well. It will often become unresponsive and leak memory.

hendrik-xdest · on April 16, 2012

That seems to be a qt problem, as far as I know. I think I saw somewhere how to recompile qt to get a more robust solution. The issue tracker of wkhtml is quite helpful here.

An easy solution could be to just use extremely short URLs as these seem to affect the space used by wkhtml as well. But that was just my solution for a 200 page output. In addition, if you are using HTML footers or headers, try not to give them any parameters, if possible.

Edit: I can't find the best entry at StackOverflow (I remember there must be a Python based solution as well), but this might be a good overview:

http://stackoverflow.com/questions/633780/converting-html-fi...

Some of those are commercial.

dochtman · on April 16, 2012

I think wkhtmltopdf and PhantomJS are the most active open source solutions.

There's also http://pypi.python.org/pypi/xhtml2pdf/, written in Python and using ReportLab (which certainly has some nice properties).

kelvin0 · on April 16, 2012

For XHTML to PDF: http://www.xhtml2pdf.com/

For XML to PDF: https://github.com/kelvin0/PyXML2PDF

syjer · on April 16, 2012

If you are not allergic to java, flying-saucer is quite good http://code.google.com/p/flying-saucer/ .

AshleysBrain · on April 16, 2012

Catchy name.

alok-g · on April 16, 2012

While this maintains hyperlinks as such, it breaks those using relative URLs.