Ask HN: please review my app - html to pdf API

dpapathanasiou · on March 31, 2010

Have you thought about the reverse, i.e., a tool that could convert pdfs to html faithfully?

I would be willing to pay money for a reliable tool that didn't need much manual editing after processing.

Unfortunately, the pdftohtml project (http://pdftohtml.sourceforge.net/) has been inactive, and the current version has trouble with even moderately complex layouts.

jgresula · on March 31, 2010

That's a non-trivial task. There are no objects such as tables, styles, lists or paragraphs in PDF, so you would need to reconstruct this kind of information. Also, text and vector graphics are positioned absolutely. Tagged PDFs contain some meta-information about the document structure, which could help, but it is still a lot of work.

The fundamental problem is that PDF stores the document's presentation, while HTML defines the document and the presentation is created by the browser. And obviously, restoring a document definition from its presentation is hard, as a lot of information is missing.
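To make the reconstruction problem concrete, here is a minimal sketch of the kind of heuristic a converter might apply to absolutely positioned text. The `Fragment` type, the tolerance values, and both function names are assumptions made for illustration; they are not taken from pdftohtml or any other tool mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with an absolute position (hypothetical type)."""
    x: float   # left edge, in points
    y: float   # baseline, measured downward from the top of the page
    text: str


def group_into_lines(fragments, y_tolerance=2.0):
    """Cluster fragments with nearly equal baselines into visual lines."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Order each line left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def group_into_paragraphs(lines, max_gap=18.0):
    """Merge consecutive lines into one paragraph while the vertical gap is small."""
    paragraphs = []
    prev_y = None
    for line in lines:
        text = " ".join(f.text for f in line)
        y = line[0].y
        if prev_y is not None and y - prev_y <= max_gap:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        prev_y = y
    return paragraphs
```

The guesswork is all in `y_tolerance` and `max_gap`: a table, a multi-column layout, or a heading with unusual spacing would break these assumptions, which is exactly the missing information described above.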

dpapathanasiou · on March 31, 2010

"That's a non-trivial task."

Yes, that's true.

I only bring it up b/c if your goal is to turn pdfcrowd into an app that people would pay money for (and I would be one of them), solving that problem would go a long way towards achieving it.

thepsi · on March 31, 2010

Solving it perfectly is non-trivial (I've known entire PhDs to be spent working on a small subset of the problem). There are a number of products/projects that solve it to some extent (techniques include absolute positioning & making sweeping assumptions about what constitutes a paragraph) - would this be enough for you to consider paying for, given that their assumptions/workarounds might produce HTML files that aren't quite to your 'taste'?

latortuga · on March 31, 2010

There are already many apps and pieces of software that charge for the feature he already has, so I don't see why it is a requirement for him to monetize. It definitely would be an easy feature to charge for, but I think what he has already has potential.

brandnewlow · on March 31, 2010

Total noob question, couldn't you programmatically capture a browsershot and then convert that into a PDF?

HTML -> png seems to have been figured out. Is .png -> pdf that hard to do?

vibhavs · on March 31, 2010

No, .png to .pdf is not difficult.

I believe dpapathanasiou's suggestion is not to blindly convert a pdf into an html file containing one giant image of the pdf.

Instead, he wants to create an html document that maintains the same content and layout from the pdf.

brandnewlow · on March 31, 2010

D'Oh! Got myself mixed up there a bit.

dmv · on March 31, 2010

NitroPDF does a remarkably good job translating PDF to Doc and RTF. I think the application (windows :() is better/has more output options, but they have a free web service: http://www.pdftoword.com/

petesalty · on March 31, 2010

I can easily see a use for this. I'm doing a pro bono project for a small non profit, and part of the project requires generating simple PDF reports. They don't have any money so we need to keep it low cost.

One of the ways of doing this is to host it on a simple shared server (it's not a heavily used app).

Downside of this is that it's unlikely we'll be able to use any of the PDF tools I've used in the past (since they need to be installed). This should work fine for our purposes.

Thanks, I was wondering how I'd get around this.

To all those who were dissing this because they couldn't immediately see a use for it, try to have a more open mind.

wdewind · on March 31, 2010

I'm also developing an HTML->PDF feature and jumped when I saw this! I tried smashingmag.com, which is funny because I meant smashingmagazine.com but smashingmag.com is actually some Japanese site. Either way I got back a totally blank PDF. Maybe the Japanese character set is to blame?

One other note: the ability to render Flash would be awesome as well. The main function of PDF, as I understand it, is to create a document that PRINTS completely identically on every setup, so people are frequently going to want to print Flash, which is already a huge pain in the ass. Unfortunately it looks like the output blanks out completely if there is Flash on the page (2advanced.net).

If you could solve that I would start paying tomorrow.

sjs382 · on March 31, 2010

mpdf is really nice if you care about page breaks and the like...

raffi · on March 31, 2010

The quality of http://www.princexml.com is amazing. It's not open source and there is a cost to use it commercially (<1K, if I recall). I used it to convert my HTML documentation for Sleep into a camera-ready PDF.

http://sleep.dashnine.org/manual/ - original docs
http://sleep.dashnine.org/download/sleep21manual.pdf - result

jonallanharper · on March 31, 2010

I have exhausted myself trying to persuade prince xml to not blur my images. That's the biggest hurdle for me.

If PDFCrowd can effectively handle images, I'll brand their logo into my bicep.

corruption · on March 31, 2010

Have you tried flying saucer? I found it to be excellent for my purposes.

jonallanharper · on March 31, 2010

Have not. I will definitely look into it. Thanks!

thepsi · on March 31, 2010

Nice execution - as per the comment below, something like this would've saved me lots of manual fiddling back when I was doing lots of PDF stuff.

Given the focus on APIs I guess you're aiming it at those wanting to programmatically generate PDFs using a familiar markup, rather than conversion of existing (static) content into PDF? If so, maybe investigate the ability to overlay rendering onto an existing PDF template at some point - in my experience it's been a common requirement (think form letters, account statements, etc).

Interesting that it appears to execute Javascript; guess it's a sign of the times that you need to in order to render many sites correctly nowadays. I haven't poked it too hard, but suspect there might be one or two security challenges there...

DanHulton · on March 31, 2010

Well, your default HTML generates one screwy PDF. When viewed in Mac OS X Preview, I get the text "T pe our HTML here..." Then, when I select the text, certain letters get partially removed or overwritten and I end up with gibberish.

I've just spent weeks working on HTML -> PDF conversion code, so I know it's not just my viewer. I've put all kinds of crazy stuff through there.

thepsi · on March 31, 2010

Exact same thing works perfectly for me (OS X Preview, version 5.0.1). I'd be interested to know what this turns out to be.

jgresula · on March 31, 2010

Thanks for the report. I will look into it as there should be no default HTML in the editor.

karanbhangui · on March 31, 2010

Slick design, but out of curiosity, why wouldn't developers just use http://code.google.com/p/wkhtmltopdf/ ?

jgresula · on March 31, 2010

There is no doubt that many developers will use wkhtmltopdf.

I think that Pdfcrowd's selling points could be 1) wide availability - only HTTP is needed, so in theory it can be used on any platform 2) no need to install any 3rd party software, which makes applications more portable 3) API bindings

latortuga · on March 31, 2010

We used an html->pdf conversion service (I believe it was http://www.htm2pdf.co.uk/ but I'm not positive) for a while to do billing, and our biggest problem was that it went down all the time. We ended up purchasing a (pretty cheap) license to a Java library that does pdf generation for us and is pretty easy to use. This is definitely a service that people will pay money to use - best of luck to you!

sjs382 · on March 31, 2010

A lot of html to pdf conversion is useless if page-break-* properties are not followed. Shame, too. I've been building something like this all week.

jgresula · on March 31, 2010

I don't know the exact status of how WebKit handles these properties. I know that at least "page-break-after: always" works since that is what I use when the user clicks the 'Insert Page Break' button in the editor (http://pdfcrowd.com/editor/).
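As a rough illustration of the property mentioned above, the sketch below assembles an HTML document whose sections are separated by an element carrying `page-break-after: always`. The `<div>` wrapper and the `build_document` helper are assumptions for the example; the exact markup the editor inserts is not stated here.

```python
from html import escape

# Assumed marker element; only the CSS property itself comes from the comment above.
PAGE_BREAK = '<div style="page-break-after: always"></div>'


def build_document(pages):
    """Return an HTML string with a forced page break between consecutive pages."""
    body = ("\n" + PAGE_BREAK + "\n").join(
        f"<p>{escape(text)}</p>" for text in pages
    )
    return f"<!DOCTYPE html>\n<html><body>\n{body}\n</body></html>"
```

Feeding `build_document(["Invoice", "Terms"])` to a converter that honours the property should yield one page per entry; whether other `page-break-*` values behave the same way would need testing.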

ivan_ah · on March 31, 2010

NICE! You have beaten me (and I am sure a dozen other hackers) to the realization of this idea...

Here is an idea for an extra feature: make a print bookmarklet -- clicking on it you get a nice PDF version of the page you are viewing right now. I can't stand firefox's print renditions of some pages... terrible...

(also you might want to set the page size to letter or A4 depending on the geolocation of your visitor's ip address)
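The page-size suggestion could be sketched roughly as follows, assuming the visitor's IP address has already been resolved to an ISO country code by some geolocation lookup (not shown). The country set is a deliberately small, non-exhaustive assumption and `default_page_size` is a hypothetical helper.

```python
# Assumption: a non-exhaustive set of countries where US Letter is the norm.
LETTER_COUNTRIES = {"US", "CA", "MX"}


def default_page_size(country_code):
    """Pick a default page size from a two-letter country code (or None)."""
    if country_code and country_code.upper() in LETTER_COUNTRIES:
        return "Letter"
    # Fall back to A4 for every other country and for unknown locations.
    return "A4"
```

It would probably still make sense to let the user override the guess, since geolocation by IP address is only approximate.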

watmough · on March 31, 2010

Excellent stuff.

I notice there are some questions about how to make money. One may be to position yourself as a way to get PDF reports generated from phone apps, in which case you may want to do per app licensing and provide facilities for email delivery of PDFs.

I could see this being useful for porting apps from iPhone (which can easily generate PDFs) to Android (which does not appear to support PDF output).

ricmo · on March 31, 2010

have you seen this? http://code.google.com/p/wkhtmltopdf/

jrockway · on April 1, 2010

I used this for a major company's site-edit auditing system. (No, they didn't want HTML snapshots of each revision. It had to be a screenshot of the browser...)

It works really well. The only quirk is that it needs a fake X server (for font loading), but Xvfb works just fine for that.
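A minimal sketch of that setup, assuming both `wkhtmltopdf` and the `xvfb-run` wrapper script are installed and on the PATH. Whether the original deployment used `xvfb-run` or started Xvfb by hand is not stated; the function names are made up for the example.

```python
import subprocess


def build_command(url, output_path):
    """Build the argument list for running wkhtmltopdf under a virtual X server."""
    # "-a" asks xvfb-run to pick a free display number automatically.
    return ["xvfb-run", "-a", "wkhtmltopdf", url, output_path]


def convert(url, output_path):
    """Run the conversion; raises CalledProcessError if the command fails."""
    subprocess.run(build_command(url, output_path), check=True)
```

Calling `convert("http://example.com/", "example.pdf")` would then write the PDF without needing a real display attached to the server.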

jgresula · on March 31, 2010

Yes, I have seen it but have not tried it yet.

gridspy · on March 31, 2010

Woah. A 3rd party does your entire value add and yet you hack up your own?

jgresula · on March 31, 2010

I did not know about this project at the time I started with pdfcrowd. But anyway, I just took my existing pdf library and integrated it with WebKit, which was not as hard as one might think.

dandelany · on April 1, 2010

First of all, I don't know how well wkhtmltopdf works, but there are many, many solutions to the HTML-to-PDF problem, and most of them suck. It's not surprising the creator decided to put together a library from scratch, it's the special sauce for his business.

Also, the "value add" comes from the fact that wkhtmltopdf is a library, and PDFcrowd is an API.

anfractuosity · on March 31, 2010

The pdf conversion is awesome! I just tried printing http://times.com/ to a pdf in firefox and it ended up putting the main content of the site on page 2, whereas yours seemed to render it perfectly.

juliancox · on April 1, 2010

Looks good. I'm keen to use (and pay for) a service like this - if it's reliable and quick. With a ruby gem it's particularly attractive, as all other rails to pdf solutions are incomplete, require a pdf specific dsl or are very expensive.

qeorge · on March 31, 2010

This is awesome. I'm at once excited about using this in the future, and dismayed thinking of the time I've spent manually generating PDFs because none of the HTML -> PDF options worked.

I fed it my homepage, and it nailed it. I'm impressed.

pstinnett · on March 31, 2010

Haven't tested this, but great idea. I've used a couple of the PDF creation tools and it seems so tedious to build out even a simple table view on a PDF. Good luck with this!

carbocation · on March 31, 2010

This is great! The only downside that I saw after converting one of my pages is that the colors dulled substantially.

jgresula · on March 31, 2010

That's a known problem on my todo list. The colors are dulled only in Acrobat; other PDF readers render the colors correctly. Could you please post the link to that page if possible? Thanks.

gridspy · on March 31, 2010

http://your.gridspy.co.nz/powertech dulled substantially.

Also, you don't support the CSS3 styling of the header text.

The fonts look super aliased.

Finally, you don't snap the rendered HTML to the nearest page, leading to a page containing only the footer.

oskee80 · on March 31, 2010

Worked great for me, good job. I'd be interested in a PHP binding too, and knowing what the eventual cost will be.

washingtondc · on March 31, 2010

I like it, but my site didn't come out correctly (www.convertyourcds.com). Perhaps my html is screwed up?

jgresula · on March 31, 2010

Sorry, don't know why. Your site does not validate, but it could be a problem on my side as well.

va_coder · on March 31, 2010

I tried a relatively complex site - CNN - expecting the results to look bad, but it looks great

mleonhard · on March 31, 2010

How much will it cost?

jgresula · on March 31, 2010

My current plan is to charge for conversion tokens, but I haven't decided how much yet.

juliancox · on April 1, 2010

Check out: http://www.htm2pdf.co.uk/htm2pdf-web-service.aspx Their pricing indicates 40,000 conversions for $90. I'd pay that.

Tomazaz · on April 14, 2010

Similar, and it also supports PDF by e-mail: http://www.web2pdfconvert.com

jrockway · on April 1, 2010

Uh, "a2ps file.html"? Doesn't even need an API key...

asnyder · on March 31, 2010

Why no PHP API binding?

jbm · on April 1, 2010

I've gone ahead and built one.

Try it:

http://www.tokyomuslim.com/2010/04/php-class-to-run-pdfcrowd...

I don't blame anyone for not wanting to use PHP's poorly documented CURL classes.

asnyder · on April 2, 2010

Thanks for getting this done. But come on, CURL is documented pretty well. There are even examples. What's there to know? Init a connection, set the flags, pass in whatever you like, submit, and check response. Pretty straightforward to me.
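For comparison, the same init / set options / submit / check response flow can be sketched with Python's standard `urllib` (swapped in here for PHP's cURL functions). The endpoint URL and the `src` form field are placeholders, not Pdfcrowd's actual API.

```python
import urllib.parse
import urllib.request


def build_request(endpoint, fields):
    """Prepare a form-encoded POST request without sending it."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(endpoint, data=data, method="POST")


def submit(request, timeout=30):
    """Send the request and return the response body as bytes."""
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.status != 200:
            raise RuntimeError(f"unexpected status {response.status}")
        return response.read()


# Hypothetical usage; api.example.com and "src" are illustrative only:
# pdf = submit(build_request("https://api.example.com/convert", {"src": "<b>hi</b>"}))
```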

jbm · on April 3, 2010

The CURL stuff seems so oddly unlike the rest of the PHP commands; it's more or less a direct port of the C library (libcurl), names and all.

The place where it reallllly irritates is the cookie management, but thankfully I didn't have to deal with that in this case. (I did for a client at a newspaper - nightmarish.)

jgresula · on March 31, 2010

No special reason - I just don't know PHP. But it is on the todo list.

alilja · on March 31, 2010

Why do I need this?