Pdf2htmlEX – Convert PDF to HTML without losing text or format (github.com/coolwanglu)
161 points by coolwanglu on May 5, 2013 | 48 comments



I've actually been using this to convert large PDF files to HTML to be displayed in-browser. It's for my work, so I don't feel comfortable posting a link to the demo instance here.

It is definitely the best solution I've found so far. The outputted HTML / CSS / images look almost identical to the source PDF. That being said, there are a few issues still:

* One gigantic (600 KB) CSS file from a single PDF

* Hundreds of individual fonts

* HTML semantics are non-existent

These are all relatively easy to fix, I believe. I have found my own solutions to most of the issues in post-processing.

Kudos to you, coolwanglu. Also, I'd like to get in touch with you about lending a hand to fix some of the issues I've encountered.

Thanks for a cool piece of software!
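
The comment above does not say what its post-processing looks like, so the following is only a guess at one small step for the "hundreds of individual fonts" issue: finding font files that are byte-for-byte identical so they could be merged. The function name, the directory layout, and the `*.woff` pattern are assumptions for illustration, and byte-identical matching will not catch fonts that are merely different subsets of the same typeface.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_duplicate_fonts(font_dir, pattern="*.woff"):
    """Return lists of file names in font_dir whose contents are identical.

    Hypothetical helper: the "*.woff" pattern is an assumption about how
    the converter names its font output.
    """
    groups = defaultdict(list)
    for path in sorted(Path(font_dir).glob(pattern)):
        # Hash the raw bytes; identical files share a digest.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    # Keep only digests shared by more than one file.
    return [names for names in groups.values() if len(names) > 1]
```

Each returned group could then be collapsed to one file, with the CSS `@font-face` rules rewritten to point at the survivor; that rewrite is not shown here.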


" * HTML semantics are non-existent

These are all relatively easy to fix, I believe. "

How? For example, how would you tell which <span>s (or whatever this converter uses) are headers, page headers/footers, a ToC, or a preface? IMO this is an AI-hard problem, for which even the 'simple' approximation (statistics) is very hard due to the wide variety of inputs (a model trained on a corpus of multi-column journal articles will most likely not work at all for books, although I haven't tried and would love to be proven wrong).

Use case: a working (i.e., preserving semantics) pdf-to-epub converter. This would, imho, be a killer product / service.
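
To make the "simple approximation (statistics)" concrete, here is a minimal sketch of the most naive version: treat the font size that covers the most characters as body text, and flag anything noticeably larger as a heading. The input format (a list of `(text, font_size)` pairs), the function name, and the 1.2 ratio are all assumptions for illustration, not anything pdf2htmlEX provides. It says nothing about page headers, footers, or a ToC, which is the harder part of the objection.

```python
from collections import Counter


def guess_headings(spans, ratio=1.2):
    """Return the texts whose font size is at least ratio * the body size.

    spans: list of (text, font_size) pairs (a hypothetical input format).
    """
    if not spans:
        return []
    # Weight each size by character count so the body font dominates,
    # even when there are many short spans in other sizes.
    weight = Counter()
    for text, size in spans:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]
    return [text for text, size in spans if size >= ratio * body_size]
```

A fixed ratio like this is exactly the kind of rule that breaks across document types: a journal with small-caps section titles at body size would yield no headings at all.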


Hey thanks for the info!

The 2nd and 3rd are in the future plan, as I'm still working on accuracy and speed. Issue #115 (https://github.com/coolwanglu/pdf2htmlEX/issues/115) is about the 2nd one.

About the first one, I don't have an elegant solution yet; maybe a CSS file per page?

Please file new issues at GitHub if you think it's necessary :)


I love this! Kudos for this awesome app.


Can anyone recommend an equally good opposite (HTML to PDF)?

wkhtmltopdf [0] is probably the most popular, but it's also ridiculously buggy.

0: https://code.google.com/p/wkhtmltopdf/


http://phantomjs.org/ is the best so far in my experience since it handles all the client side javascript properly.

The PDFs it outputs are full vector, not just rasters; from my understanding, it's the same engine used in Chrome to view PDFs and print web pages.


We've tried everything, including PrinceXML, and PhantomJS has been the best for us so far.


Not open source, but you might want to check out PrinceXML. It is really good.


Flying Saucer worked great for me: http://code.google.com/p/flying-saucer/


I've had good results with htmldoc (http://www.msweet.org/projects.php?Z1).


Print the HTML document to a postscript printer, but have it print to file.

Then use ps2pdf from ghostscript.

You can automate this with a small amount of work.
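
A minimal sketch of the automation step, assuming the browser has already printed the page to a PostScript file and that Ghostscript's `ps2pdf` is on the PATH. The function names are hypothetical; the only external call is `ps2pdf input.ps output.pdf`.

```python
import subprocess
from pathlib import Path


def ps2pdf_command(ps_path, pdf_path=None):
    """Build the ps2pdf argument list; default output is input with .pdf."""
    ps_path = Path(ps_path)
    if pdf_path is None:
        pdf_path = ps_path.with_suffix(".pdf")
    return ["ps2pdf", str(ps_path), str(pdf_path)]


def convert(ps_path, pdf_path=None):
    """Run ps2pdf and return the output path; raises if ps2pdf fails."""
    cmd = ps2pdf_command(ps_path, pdf_path)
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Calling `convert("page.ps")` would write `page.pdf` next to the input. The print-to-file step itself depends on the browser and OS and is not covered here.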


In OS X, you can print to PDF from every application.


This works and displays correctly, but it is unbearably slow on an iPad 2, whereas the PDF loads instantly. What is the point, then? Or does it work a lot better in desktop browsers?


The OS X Quartz graphics layer (also used in iOS) uses PDF internally as its graphics object model.

It is no surprise that iOS renders PDFs so quickly and so well, without the need for a third-party app; it always has, since the release of the first iPhone. This is also why print-to-PDF is built into OS X.


I thought it was postscript internally?


NeXT and NeWS were Display Postscript. OS X / Quartz is PDF.


It is more accurate to say that both Quartz and PDF (and Postscript) use the same primitives (cubic Beziers, color models, graphic state, etc)

PDF the file format adds many, many things to that (forms, encryption, DRM, notes, a JavaScript engine, reflow information, etc)


I heard that careful optimization on the server side plus some clever JS may solve this. So far the default UI just demonstrates the ability to read while downloading.

The idea is that the document now becomes more controllable and accessible: say, you can put Google Analytics in your resume written in LaTeX, or build a social reading service where you can comment, annotate and share.

Unlike PDF viewers, web browsers were never optimized for this kind of messy input. The next version of pdf2htmlEX will focus on optimizations, e.g. smaller background images; hopefully that will help.


> social reading service

I truly wish there was at least one ground that hadn't been touched by "social" crap.


Usually I don't use social services, at least not "socially" (e.g. Twitter as public text messaging). IMHO whether a service is crap or not depends on what kind of things it encourages you to do: finding useful information, or playing boring games and paying for higher rankings.

I still like the old Google Reader with its OLD social features.


Porn?


Nope.


Off topic, but is your username missing a "ke"?


um? why?


I believe he's wondering if your username is a reference to Cool Hand Luke.

http://www.imdb.com/title/tt0061512/


I was indeed, thanks.


Scrolling is laggy (rMBP 15 default spec) but usable.


Slow as well on a dual-core P8700 with 8 GB RAM.


Interesting. So it converts all vector graphics to a background image per page, but keeps all text as browser-rendered on top of it.

I guess I don't really see much practical purpose for it -- most browsers these days seem perfectly fine opening PDF files natively, after all. But it's a very cool technological demonstration.

Maybe this could be some kind of bridge tool for generating sites with fancy typographical layout? You could use Adobe Illustrator etc. to do fancy column work, drop caps, hyphenation, all that jazz -- and then "render" into HTML. It would certainly be as anti-"responsive" as you can get, but it would certainly have the ability to generate more advanced typography much faster than you can produce with HTML/CSS by hand.


It definitely has a few practical purposes. I have used this for a website for a small magazine. Their issue was that they didn't have resources to design for the web. This was a good solution, wherein they just needed to upload a PDF once an issue was out. And this provides a bit more flexibility from other PDF viewers - organize by articles, add social sharing, commenting per page/article etc. etc.


As a practical purpose, how about being able to edit a PDF document? I understand that it can be done through some other tools, but this is one more - and would be free and easy.

Convert to HTML -> Edit -> Print back to PDF (if needed)


I'm not sure the html will be clean enough to edit, sadly...


It's for embedding, when you want to control the document or access the content.

Say you have a resume written in LaTeX and you want to insert Google Analytics inside?
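
As a rough illustration of that kind of control, here is a sketch that splices an arbitrary snippet into converted HTML just before the last closing body tag. The function name is hypothetical, and the snippet in the usage note is a placeholder, not a real Google Analytics tag.

```python
import re

# Matches a closing body tag in any letter case, e.g. </body> or </BODY >.
_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def inject_before_body_close(html, snippet):
    """Insert snippet before the last </body>; append it if none is found."""
    matches = list(_BODY_CLOSE.finditer(html))
    if not matches:
        return html + snippet
    index = matches[-1].start()
    return html[:index] + snippet + html[index:]
```

For example, `inject_before_body_close(page, '<script src="analytics.js"></script>')` would add a script tag to the converted resume; the file name `analytics.js` is made up for the example.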


I do this almost daily. I use a PDF converter driver found on the internet. Install it and it becomes a selectable converter option. Then you can convert PDFs to many formats from any program, including Adobe Acrobat. Just open a PDF, select convert, and choose the format you want; the task finishes in a few seconds. If you haven't found a good option yet, you can give it a try. Best wishes. http://www.rasteredge.com/how-to/csharp-imaging/pdf-convert-...


Question, does your public folder periodically delete files? I accidentally uploaded something confidential and it seems to be gone. I was wondering if this was a manual deletion or just expired since I still see files that were uploaded around the same time still there.


Can't Mozilla's pdf.js be used to get the same result? Great results anyway!


You don't want to rely on the computing power at the client side, do you? :)


I guess one possible setup would be pdf.js running on server-side and having its output captured. One advantage of this, from what I can see, is that there would probably be fewer external dependencies than this setup.


Yes, actually they had this kind of plan, but I am not sure how it has been going.

It would definitely be interesting that way, but in that case it may not be worth rewriting everything in JS.


Promising start. Hopefully performance improves with each release.


Right, that is on the schedule; I've heard enough complaints, in a good way.


I didn't see any mention of tables in the doc. Does this mean it's outside of the "good enough" range? Table extraction would be a great feature.


It's still an early-stage project, so for now it's focused on accurate rendering and speed (which hasn't been achieved yet).

Recognition features are planned for the future; PDF viewers usually don't recognize many things either, do they? :)


Have you tried using tabula? https://github.com/jazzido/tabula


How did you manage to get Mediafire to host your demo?


MF uses pdf2htmlEX :) It also provides a public folder and a public dropbox, which I really like.

This means that you can create one of your own.


Just passing by to pay my respects to the expert.


Just passing by to pay my respects to the expert. +1



