Have you thought about the reverse, i.e., a tool that could convert pdfs to html faithfully?
I would be willing to pay money for a reliable tool that didn't need much manual editing after processing.
Unfortunately, the pdftohtml project (http://pdftohtml.sourceforge.net/) has been inactive, and the current version has trouble with even moderately complex layouts.
That's a non-trivial task. PDF has no such objects as tables, styles, lists, or paragraphs, so you would need to reconstruct this kind of information. Also, text and vector graphics are positioned absolutely. Tagged PDFs contain some meta-information about the document structure, which could help, but it is still a lot of work.
The fundamental problem is that PDF stores the document's presentation, while HTML defines the document and leaves the presentation to the browser. And obviously, restoring a document definition from its presentation is hard, as a lot of information is missing.
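To make the reconstruction problem concrete, here is a minimal sketch of the kind of heuristic involved: grouping absolutely positioned text fragments into lines and then paragraphs. The `Fragment` type, the tolerance values, and the assumption that `y` is measured from the top of the page are all hypothetical, not taken from any particular PDF library.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A run of text at an absolute page position (hypothetical extractor output)."""
    x: float  # left edge, in points
    y: float  # baseline, in points, measured from the top of the page (assumption)
    text: str


def group_into_lines(fragments, y_tolerance=2.0):
    """Cluster fragments whose baselines lie within y_tolerance of each other."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Re-sort each line left to right, since baselines may differ slightly.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def group_into_paragraphs(lines, max_gap=16.0):
    """Start a new paragraph whenever the vertical gap between lines exceeds max_gap."""
    paragraphs = []
    prev_y = None
    for line in lines:
        y = line[0].y
        text = " ".join(f.text for f in line)
        if prev_y is not None and y - prev_y <= max_gap:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        prev_y = y
    return paragraphs


fragments = [
    Fragment(72, 100, "Hello"),
    Fragment(110, 100.5, "world"),
    Fragment(72, 114, "again"),
    Fragment(72, 150, "New"),
]
paragraphs = group_into_paragraphs(group_into_lines(fragments))
```

Even this toy version depends on guessed thresholds, and it knows nothing about columns, tables, or lists, which is where the real work starts.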
I only bring it up b/c if your goal is to turn pdfcrowd into an app that people would pay money for (and I would be one of them), solving that problem would go a long way towards achieving it.
Solving it perfectly is non-trivial (I've known entire PhDs to be spent working on a small subset of the problem). There are a number of products/projects that solve it to some extent (techniques include absolute positioning & making sweeping assumptions about what constitutes a paragraph) - would this be enough for you to consider paying for, given that their assumptions/workarounds might produce HTML files that aren't quite to your 'taste'?
There are already many apps and pieces of software that charge for the feature he already has, so I don't see why it is a requirement for him to monetize. It definitely would be an easy feature to charge for, but I think what he already has has potential.
NitroPDF does a remarkably good job translating PDF to Doc and RTF. I think the application (windows :() is better/has more output options, but they have a free web service: http://www.pdftoword.com/
I can easily see a use for this. I'm doing a pro bono project for a small nonprofit, and part of the project requires generating simple PDF reports. They don't have any money, so we need to keep it low cost.
One of the ways of doing this is to host it on a simple shared server (it's not a heavily used app).
The downside of this is that it's unlikely we'll be able to use any of the PDF tools I've used in the past (since they need to be installed). This service should work fine for our purposes.
Thanks, I was wondering how I'd get around this.
To all those who were dissing this because they couldn't immediately see a use for it, try to have a more open mind.
I'm also developing an HTML->PDF feature and jumped when I saw this! I tried smashingmag.com - which is funny because I meant smashingmagazine.com, but smashingmag.com is actually some Japanese site. Either way, I got back a totally blank PDF - maybe the Japanese character set is to blame?
One other caveat is that having the ability to view Flash would be awesome as well. The main function of PDF, as I understand it, is to create a document that PRINTS completely identically on every setup, so frequently people are going to want to print Flash, which is already a huge pain in the ass. Unfortunately, it looks like the output blanks out completely if there is Flash on the page (2advanced.net).
If you could solve that, I would start paying tomorrow.
The quality of http://www.princexml.com is amazing. It's not open source and there is a cost to use it commercially (<1K, if I recall). I used it to convert my HTML documentation for Sleep into a camera-ready PDF.
Nice execution - as per the comment below, something like this would've saved me lots of manual fiddling back when I was doing lots of PDF stuff.
Given the focus on APIs I guess you're aiming it at those wanting to programmatically generate PDFs using a familiar markup, rather than conversion of existing (static) content into PDF? If so, maybe investigate the ability to overlay rendering onto an existing PDF template at some point - in my experience it's been a common requirement (think form letters, account statements, etc).
Interesting that it appears to execute Javascript; guess it's a sign of the times that you need to in order to render many sites correctly nowadays. I haven't poked it too hard, but suspect there might be one or two security challenges there...
Well, your default HTML generates one screwy PDF. When viewed in Mac OS X Preview, I get the text "T pe our HTML here..." Then, when I select the text, certain letters get partially removed or overwritten and I end up with gibberish.
I've just spent weeks working on HTML -> PDF conversion code, so I know it's not just my viewer. I've put all kinds of crazy stuff through there.
There is no doubt that many developers will use wkhtmltopdf.
I think that Pdfcrowd's selling points could be 1) wide availability - only HTTP is needed, so it can theoretically be used on any platform, 2) no need to install any 3rd-party software, which makes applications more portable, and 3) API bindings.
We used an html->pdf conversion service (I believe it was http://www.htm2pdf.co.uk/ but I'm not positive) for a while to do billing, and our biggest problem was that it went down all the time. We ended up purchasing a (pretty cheap) license to a Java library that does PDF generation for us and is pretty easy to use. This is definitely a service that people will pay money to use - best of luck to you!
I don't know the exact status of how WebKit handles these properties. I know that at least "page-break-after: always" works since that is what I use when the user clicks the 'Insert Page Break' button in the editor (http://pdfcrowd.com/editor/).
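For anyone generating the HTML programmatically, here is a minimal sketch of how that property might be applied between sections. The `join_pages` helper is hypothetical and not part of any Pdfcrowd API; it only relies on the standard CSS `page-break-after: always` declaration mentioned above.

```python
def join_pages(sections):
    """Wrap each HTML fragment in a div and force a page break after all but the last."""
    parts = []
    for i, html in enumerate(sections):
        is_last = i == len(sections) - 1
        # Only non-final sections get the break, to avoid a trailing blank page.
        style = "" if is_last else ' style="page-break-after: always"'
        parts.append(f"<div{style}>{html}</div>")
    return "\n".join(parts)


document = join_pages(["<p>Page one</p>", "<p>Page two</p>"])
```

Whether other page-break properties behave the same way would need to be checked against the renderer.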
NICE!
You have beaten me (and, I am sure, a dozen other hackers) to the realization of this idea...
Here is an idea for an extra feature: make a print bookmarklet -- clicking on it, you get a nice PDF version of the page you are viewing right now. I can't stand Firefox's print renditions of some pages... terrible...
(Also, you might want to set the page size to Letter or A4 depending on the geolocation of your visitor's IP address.)
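A minimal sketch of the page-size idea, assuming the IP address has already been resolved to an ISO 3166-1 alpha-2 country code by some geolocation lookup (not shown). The country set below is illustrative and incomplete, and `default_page_size` is a hypothetical helper.

```python
# Illustrative, incomplete set of countries where US Letter is the common size.
LETTER_COUNTRIES = {"US", "CA", "MX"}


def default_page_size(country_code):
    """Return 'Letter' or 'A4' for a two-letter country code; fall back to A4."""
    if country_code and country_code.upper() in LETTER_COUNTRIES:
        return "Letter"
    return "A4"
```

Falling back to A4 when the lookup fails seems the safer default, since most of the world uses it, but the visitor should still be able to override the guess.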
I notice there are some questions about how to make money. One may be to position yourself as a way to get PDF reports generated from phone apps, in which case you may want to do per app licensing and provide facilities for email delivery of PDFs.
I could see this being useful for porting apps from iPhone (which can easily generate PDFs) to Android (which does not appear to support PDF output).
I used this for a major company's site-edit auditing system. (No, they didn't want HTML snapshots of each revision. It had to be a screenshot of the browser...)
It works really well. The only quirk is that it needs a fake X server (for font loading), but Xvfb works just fine for that.
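For anyone wondering what the Xvfb workaround looks like in practice, here is a minimal sketch. It assumes the renderer is a command-line tool such as wkhtmltopdf and that the `xvfb-run` wrapper script is installed; the helper names are hypothetical.

```python
import shutil
import subprocess


def xvfb_command(renderer_args):
    """Prefix a renderer command with xvfb-run so it gets a virtual X display."""
    # -a asks xvfb-run to pick a free display number automatically.
    return ["xvfb-run", "-a", *renderer_args]


def render(renderer_args):
    """Run the renderer under Xvfb; raises CalledProcessError if it fails."""
    if shutil.which("xvfb-run") is None:
        raise RuntimeError("xvfb-run not found on PATH")
    subprocess.run(xvfb_command(renderer_args), check=True)


command = xvfb_command(["wkhtmltopdf", "http://example.com/", "page.pdf"])
```

Calling `render(...)` with the same arguments would actually launch the process; building the command separately keeps it easy to log or test.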
I did not know about this project at the time I started with pdfcrowd. But anyway, I just took my existing PDF library and integrated it with WebKit, which was not as hard as one might think.
First of all, I don't know how well wkhtmltopdf works, but there are many, many solutions to the HTML-to-PDF problem, and most of them suck. It's not surprising the creator decided to put together a library from scratch; it's the special sauce for his business.
Also, the "value add" comes from the fact that wkhtmltopdf is a library, and PDFcrowd is an API.
The PDF conversion is awesome! I just tried printing http://times.com/ to a PDF in Firefox and it ended up putting the main content of the site on page 2, whereas yours seemed to render it perfectly.
Looks good. I'm keen to use (and pay for) a service like this - if it's reliable and quick. With a Ruby gem it's particularly attractive, as all other Rails-to-PDF solutions are incomplete, require a PDF-specific DSL, or are very expensive.
This is awesome. I'm at once excited about using this in the future, and dismayed thinking of the time I've spent manually generating PDFs because none of the HTML -> PDF options worked.
I fed it my homepage, and it nailed it. I'm impressed.
Haven't tested this, but great idea. I've used a couple of the PDF creation tools and it seems so tedious to build out even a simple table view on a PDF. Good luck with this!
That's a known problem on my todo list. The colors are dulled only in Acrobat but other PDF readers render the colors correctly. Please, could you post the link to that page if possible? Thanks.
Thanks for getting this done. But come on, CURL is documented pretty well. There are even examples. What's there to know? Init a connection, set the flags, pass in whatever you like, submit, and check response. Pretty straightforward to me.
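The same sequence of steps (init, set options, pass in data, submit, check the response) can be sketched with Python's standard library instead of PHP's cURL functions. The endpoint URL and the `src` field name are made up for illustration and are not Pdfcrowd's actual API.

```python
import urllib.parse
import urllib.request


def build_request(url, fields):
    """Init the request, set its options, and attach form fields as the POST body."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


def submit(request, timeout=30):
    """Send the request and return (status, body); urlopen raises HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()


# Hypothetical endpoint and field name, used only to show the shape of the call.
request = build_request("http://example.invalid/convert", {"src": "<b>hi</b>"})
```

Passing `request` to `submit` would perform the actual HTTP call and surface errors as exceptions.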
The CURL stuff seems so oddly unlike the rest of the PHP commands; it's more or less a direct port of the C library (libcurl), names and all.
The place where it reallllly irritates is the cookie management, but thankfully I didn't have to deal with that in this case. (I did for a client at a newspaper - nightmarish.)