I spent a lot of time 2-3 years ago assessing different tools to convert HTML+CSS to PDF [1]. At the time, this was to convert HTML plus custom tags into well-formatted legal documents.
At the time the hands-down winner was Prince XML [2]. It's relatively expensive ($3800 for a single server license) but it just works, works from many languages and produces beautiful results quickly (look at their samples). It doesn't take a lot of developer time to make up that purchase cost.
I haven't checked out this particular project, but the others I tried tended to work for smaller samples and would die, take forever or produce unpredictable results on even moderately large documents (~150k).
For any commercial project, honestly I'd just fork over the $3800 for Prince without hesitation. It's simply that good.
EDIT: actually, looking over the SO question I think I did check out an early version of this project and didn't have much success with it. The one thing that concerns me about this project now is the last news item is over a year old. Is it still being actively maintained?
Not to mention that Prince XML has excellent support for CSS paged media (margins, page breaks, headers & footers, etc). Contrast with the printed output of any major browser -- they're all quite disappointing.
It would be nice if that $3800 included free upgrades to subsequent releases, though...
Ryan Tomayko (a githubber) is extremely impressed. And when Ryan (or any other community respected hacker) is impressed, I can use that product without any further thinking.
If you are on .NET, I also recommend the Essential Objects PDF library (http://www.essentialobjects.com/Products/EOPdf/Default.aspx). I have been using it for a production project and it is rock solid. At $549/developer, it is much more affordable.
I used wkhtmltopdf in a previous project and found it to be extremely reliable and easy to use. I was extracting the HTML mime parts from incoming email, converting them to PDFs with wkhtmltopdf, then converting that to a PNG with ImageMagick and displaying the PNG to the user in a web browser.
Originally because I didn't find a free app which would do that. Then I decided to keep the PDF as it was quite useful. Unlike with the PNG, the HTML links were retained in the PDF. I.e., HTML anchor tags are still clickable in PDFs generated by wkhtmltopdf.
EDIT: A PDF will also let you select text, unlike an image. However, an image is nicer to embed in a webpage. So I utilised both in order to get the best of both worlds.
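The email-to-PDF-to-PNG pipeline described above could be sketched roughly like this in Python. This is only an illustration under assumptions, not the code I actually ran: the function names are made up, the ImageMagick step assumes the classic "convert" binary (which needs Ghostscript to read PDFs), and the 150 dpi density is an arbitrary choice.

```python
import email
from email import policy


def extract_html_part(raw_message: bytes):
    """Return the text/html body of a raw email as a string, or None."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    body = msg.get_body(preferencelist=("html",))
    return body.get_content() if body is not None else None


def pdf_command(html_path, pdf_path):
    # Basic wkhtmltopdf invocation: input file first, output file second.
    return ["wkhtmltopdf", html_path, pdf_path]


def png_command(pdf_path, png_path):
    # ImageMagick's convert; "[0]" selects only the first page of the PDF.
    # The density value is an assumption, tune it to taste.
    return ["convert", "-density", "150", pdf_path + "[0]", png_path]
```

You would write the extracted HTML to a temporary file and then pass each command list to subprocess.run(..., check=True), keeping both the PDF (clickable links, selectable text) and the PNG (easy to embed).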
Oops just noticed this comment. See my other comment (wkhtmltoimage is part of the package, allows you to render HTML+CSS into PNG, compile into x64 and place binary in your git repo directory to use on heroku)
There are two issues I have with wkhtmltopdf that still have me falling back to the printing code I wrote in 2004 in order to create PDFs for our users:
* WebKit's support for printing is a bit behind the times and stuff like "display: table-header-group;" isn't quite supported, so whenever you have to print big lists across multiple pages, you are practically forced to do your own page breaking.
* Due to an issue somewhere between qt and webkit, it's not possible to hyphenate text. Well. It is possible, but it causes the hyphen not to be painted in most cases.
Not having to deal with manual page breaks and being able to hyphenate (and thus do real justification) would have been the two reasons for me to move off my home-grown solution, but as those two are the things missing in wkhtmltopdf, I'm staying with my own solution.
Aside from that: if you can live with these shortcomings and with the fact that you are for all intents and purposes forced into using their static build (patched qt, kerning issues for everybody trying a build without the same qt patches), then this might be the perfect solution for PDF generation.
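To give an idea of what "doing your own page breaking" means for a long list, here is a minimal Python sketch, not my actual 2004 code. It assumes every row has roughly the same height, so a fixed rows_per_page works; paginate_table and its parameters are made-up names for illustration.

```python
from html import escape


def paginate_table(header, rows, rows_per_page):
    """Split rows into one HTML table per page, repeating the header row."""
    if rows_per_page < 1:
        raise ValueError("rows_per_page must be at least 1")
    head = "<tr>" + "".join(f"<th>{escape(str(h))}</th>" for h in header) + "</tr>"
    pages = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
            for row in chunk
        )
        pages.append(f"<table>{head}{body}</table>")
    # Force a page break between tables, but not after the last one.
    return '<div style="page-break-after: always"></div>'.join(pages)
```

With rows of varying height this breaks down quickly, which is exactly why proper "display: table-header-group;" support would be so much nicer.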
It feels great to use CSS with mm measurements and get exactly what you need, or to create barcodes by just embedding SVG, or to use the full capabilities of HTML, CSS and even JS when building your page.
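As a rough illustration of the SVG idea, something like the following Python sketch draws a row of bars with widths given in millimetres. It is not a real barcode symbology (no encoding, no check digits), and bars_to_svg is a made-up helper name; it only shows that SVG lengths can carry mm units directly.

```python
def bars_to_svg(widths_mm, height_mm=20.0):
    """Render alternating bars and gaps (starting with a bar) as inline SVG."""
    rects = []
    x = 0.0
    for i, w in enumerate(widths_mm):
        if i % 2 == 0:  # even positions are black bars, odd positions are gaps
            rects.append(
                f'<rect x="{x:g}mm" y="0" width="{w:g}mm" '
                f'height="{height_mm:g}mm" fill="black"/>'
            )
        x += w
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{x:g}mm" height="{height_mm:g}mm">' + "".join(rects) + "</svg>"
    )
```

The resulting string can be dropped straight into the HTML page that gets converted.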
WebKit is quite powerful and can easily be used to generate a PDF, SVG, PostScript or PNG with almost no effort.
I wrote a simple Deck.JS [1] and S5 [2] PDF converter using a few lines of scripting. These programs take a slide presentation written in HTML5 and convert it into a portable PDF document. This is very handy since you can then share a single file that includes all graphical elements (fonts, images, layout) intact.
I have a GitHub toy repo [3] where I made a few tests with WebKit. One of the programs there (screenshot.pl) even lets you use XPath to find the subnode to grab.
Every time I see a utility like this, I think maybe I could switch to producing some materials in HTML as the primary, or main intermediary, source format. Then I try the utility and realize that that would be silly.
For example, I currently make PDF slides for talks. In theory I'd like to make HTML slides, but would still like the ability to render a PDF for a robust record. However, neither this utility nor PhantomJS (which I just tried) immediately does a good job of converting something like: http://bit-player.org/deck.js/limits-to-growth-Harvard-2012-...
EDIT: also just tried cutycapt, with similar results to wkhtmltopdf (got all slides rather than just visible one, with bad page breaks, and no TeX maths).
I take some of it back. Getting the latest version of wkhtmltopdf and telling it to wait (probably longer than necessary) for the JavaScript to run works pretty well.
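For anyone wanting to try the same thing, a minimal Python sketch of such an invocation follows. It assumes the wait is done with wkhtmltopdf's --javascript-delay option (in milliseconds); the 5000 ms default here is an arbitrary guess, not the value I used, and the function names are made up.

```python
import subprocess


def build_command(url, pdf_path, delay_ms=5000):
    # --javascript-delay makes wkhtmltopdf wait this many milliseconds
    # for scripts to finish before rendering the page.
    return ["wkhtmltopdf", "--javascript-delay", str(delay_ms), url, pdf_path]


def render(url, pdf_path, delay_ms=5000):
    # Requires the wkhtmltopdf binary on PATH; raises if it exits non-zero.
    subprocess.run(build_command(url, pdf_path, delay_ms), check=True)
```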
and actually work (create a sensible PDF representation of what I can see in a browser). So my feedback wouldn't be useful, as my use case is out of scope for your project: "PyXML2PDF is NOT compatible with any XHTML/HTML/CSS. It uses a small set of tags to quickly allow generation of PDFs."
Would it be sufficient to create PNGs of the web pages and extract the text of the webpage to place in the background of a PDF file (for search, screenreading)?
Over the years I've tried various HTML to PDF utilities and have yet to find one that works correctly.
Previously I was using htmldoc, a project that was abandoned years ago and doesn't work with CSS. It worked for what I was doing but without CSS it's very inflexible and hard to maintain.
I recently moved to wkhtmltopdf but it has plenty of its own issues. The biggest problem I've found is that it doesn't wrap text between pages correctly. If you have a multi-page document, it's likely the last line of text on a page will be split over 2 pages. IMO this is a show-stopping bug. It has been known for a while but it seems no one is working on it.
The OS X version is broken. It was creating 5MB+ PDFs that should be about 50k. The Linux version doesn't have this bug.
It's practically unusable when not in an environment with X11. I had to use it on a Windows system and any text would have incorrect letter spacing. Every letter would bleed into the next; it's a typographic nightmare. You could use Arial Unicode MS to get a somewhat acceptable result but that won't support bold or italic text cleanly.
I'm not quite sure but I think the fix isn't even part of the 0.11 release. You have to compile wkhtmltopdf yourself to get it working.
When this issue is resolved this will be perfect, though. It has great capabilities to render footers and headers and JS-based output (in my case Highcharts). For the time being you can't even switch to commercial systems - ActivePDF, for one, has the same issue in the latest release.
PhantomJS has a "render(fileName)" API which supports PDF. Essentially, it is trivial to implement something like Wkhtmltopdf. However, Wkhtmltopdf is a shrinkwrap solution, which does not require you to install nodejs.
Does anyone know of any other engines like this, either paid or free? We are using this to produce catalogs of 200+ complex pages and it does not handle generating PDFs of this size very well. It will often become unresponsive and leak memory.
That seems to be a qt problem, as far as I know. I think I saw somewhere how to recompile qt to get a more robust solution. The issue tracker of wkhtml is quite helpful here.
An easy solution could be to just use extremely short URLs as these seem to affect the space used by wkhtml as well. But that was just my solution for a 200 page output. In addition, if you are using HTML footers or headers, try not to give them any parameters, if possible.
Edit: I can't find the best entry at StackOverflow (I remember there must be a Python based solution as well), but this might be a good overview:
[1]: http://stackoverflow.com/questions/391005/convert-html-css-t...
[2]: http://princexml.com/