I've actually been using this to convert large PDF files to HTML to be displayed in-browser. It's for my work, so I don't feel comfortable posting a link to the demo instance here.
It is definitely the best solution I've found so far. The outputted HTML / CSS / images look almost identical to the source PDF. That being said, there are a few issues still:
* One Gigantic (600kb) CSS file from a single PDF
* Hundreds of individual fonts
* HTML semantics are non-existent
These are all relatively easy to fix, I believe. I have found my own solutions to most of the issues in post-processing.
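As an illustration of what such post-processing might look like for the oversized CSS file, here is a minimal sketch that merges rules sharing an identical declaration body. It is not the commenter's actual method or anything pdf2htmlEX provides; the function name `merge_duplicate_rules` is hypothetical, and the sketch assumes flat `selector{...}` rules only (no `@media` or `@font-face` blocks, and no `;` inside values such as data URIs).

```python
import re
from collections import OrderedDict

# Matches flat rules of the form "selector{declarations}" (assumption: no nesting).
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def merge_duplicate_rules(css: str) -> str:
    """Group selectors whose declaration bodies are identical after normalization."""
    groups = OrderedDict()  # normalized declarations -> selectors, in first-seen order
    for selector, body in RULE_RE.findall(css):
        # Drop empty declarations and whitespace around ":" so that
        # "font-size: 12px" and "font-size:12px;" compare equal.
        decls = [re.sub(r"\s*:\s*", ":", d.strip()) for d in body.split(";") if d.strip()]
        key = ";".join(decls)
        groups.setdefault(key, []).append(selector.strip())
    return "\n".join(f"{','.join(sels)}{{{key}}}" for key, sels in groups.items())


if __name__ == "__main__":
    sample = ".fs1{font-size:12px;}\n.fs2{font-size: 12px}\n.fc0{color:red}"
    print(merge_duplicate_rules(sample))
```

Merging moves rules relative to each other, so this is only safe when the merged selectors never compete for the same property on the same element; that would have to be checked against the real output.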
Kudos to you, coolwanglu. Also, I'd like to get in touch with you about lending a hand to fix some of the issues I've encountered.
"These are all relatively easy to fix, I believe."
How? For example, how would you tell which <span>s (or whatever elements this converter uses) are headers, page headers/footers, a ToC, or a preface? IMO this is an AI-hard problem, and even the 'simple' approximation (statistics) is very hard because of the wide variety of inputs: a model trained on multi-column journal articles will most likely not work at all for books (although I haven't tried and would love to be proven wrong).
Use case: a working (i.e., preserving semantics) pdf-to-epub converter. This would, imho, be a killer product / service.
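To make the "statistics" approximation concrete, here is a deliberately naive sketch: treat the font size that covers the most characters as body text, and flag anything noticeably larger as a heading candidate. The input format (a list of `(text, font_size)` pairs), the function name `guess_headings`, and the `ratio` threshold are all assumptions for illustration. It would likely fail on exactly the varied inputs described above, and it says nothing about page headers/footers or a ToC.

```python
from collections import Counter
from typing import List, Tuple


def guess_headings(runs: List[Tuple[str, float]], ratio: float = 1.2) -> List[str]:
    """Return texts whose font size is at least `ratio` times the body size."""
    if not runs:
        return []
    # Treat the font size that covers the most characters as the body size.
    chars_per_size = Counter()
    for text, size in runs:
        chars_per_size[size] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]
    return [text for text, size in runs if size >= ratio * body_size]


if __name__ == "__main__":
    sample = [
        ("Introduction", 18.0),
        ("Body text that goes on for a while.", 10.0),
        ("More body text here.", 10.0),
        ("p. 3", 8.0),
    ]
    print(guess_headings(sample))
```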
This works and displays correctly, but it is unbearably slow on an iPad 2, whereas the PDF loads instantly. What is the point, then? Or does it work a lot better in desktop browsers?
The OSX Quartz graphics layer (also used in iOS) uses PDF internally as its graphics object model.
It is no surprise that iOS renders PDFs so quickly and so well, without the need for a third-party app; it always has, since the release of the first iPhone. This is also why print-to-PDF is built into OSX.
I heard that careful optimization on the server side and some clever JS may solve this. So far the default UI just demonstrates the ability to read while downloading.
The idea is that the document now becomes more controllable and accessible: say, you can put Google Analytics in your resume written in LaTeX, or build a social reading service where you can comment, annotate and share.
Unlike PDF viewers, web browsers were never optimized for this kind of messy input. The next version of pdf2htmlEX will focus on optimizations, e.g. smaller background images; hopefully that will help.
Usually I don't use social services, at least not "socially" (e.g. Twitter as public text messaging). IMHO whether a service is crap or not depends on what kind of stuff it encourages you to do: either finding useful information, or playing boring games and paying for higher rankings.
I still like the old Google Reader with its OLD social features.
Interesting. So it converts all vector graphics to a background image per page, but keeps all text as browser-rendered on top of it.
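The layering described here can be sketched roughly as follows: one container per page with the rasterized graphics as a CSS background, and each text run as an absolutely positioned element on top. This is only an illustration of the idea, not pdf2htmlEX's actual markup; `render_page`, the inline styles, and the file name `page1.png` are hypothetical.

```python
from html import escape
from typing import List, Tuple


def render_page(bg_image: str, width: int, height: int,
                texts: List[Tuple[int, int, str]]) -> str:
    """Build one page: a background image with positioned, selectable text on top."""
    # Each text run becomes a real DOM node, so the browser renders it as text.
    spans = "".join(
        f'<span style="position:absolute;left:{x}px;top:{y}px">{escape(t)}</span>'
        for x, y, t in texts
    )
    # The non-text graphics are assumed to be pre-rendered into a single image.
    return (
        f'<div style="position:relative;width:{width}px;height:{height}px;'
        f'background:url({escape(bg_image, quote=True)}) no-repeat">{spans}</div>'
    )


if __name__ == "__main__":
    print(render_page("page1.png", 600, 800, [(10, 20, "Hello & welcome")]))
```

Because the text stays in the DOM, it remains selectable and searchable even though the graphics are a flat image.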
I guess I don't really see much practical purpose for it -- most browsers these days seem perfectly fine opening PDF files natively, after all. But it's a very cool technological demonstration.
Maybe this could be some kind of bridge tool for generating sites with fancy typographical layout? You could use Adobe Illustrator etc. to do fancy column work, drop caps, hyphenation, all that jazz -- and then "render" into HTML. It would be as anti-"responsive" as you can get, but it would let you produce advanced typography much faster than you could with HTML/CSS by hand.
It definitely has a few practical purposes. I have used this for the website of a small magazine. Their issue was that they didn't have the resources to design for the web. This was a good solution: they just needed to upload a PDF once an issue was out. It also provides a bit more flexibility than other PDF viewers: organize by article, add social sharing, commenting per page/article, etc.
As a practical purpose, how about being able to edit a PDF document? I understand that it can be done through some other tools, but this is one more - and would be free and easy.
Convert to HTML -> Edit -> Print back to PDF (if needed)
I do this almost daily. I use a PDF converter driver found on the internet. Install it and it becomes a selectable converter option. Then you can convert PDFs to many formats from any program at all, including Adobe Acrobat. Just open a PDF, select convert, and choose the format you want; the task will be finished in several seconds. If you haven't found a good option, you can give it a try. Best wishes. http://www.rasteredge.com/how-to/csharp-imaging/pdf-convert-...
Question, does your public folder periodically delete files?
I accidentally uploaded something confidential and it seems to be gone. I was wondering whether this was a manual deletion or whether it just expired, since files that were uploaded around the same time are still there.
I guess one possible setup would be pdf.js running on server-side and having its output captured. One advantage of this, from what I can see, is that there would probably be fewer external dependencies than this setup.
Thanks for a cool piece of software!