Hacker News new | past | comments | ask | show | jobs | submit login

What people often miss about PDF is that it's closer to an image format in some ways than to a Word document. Word documents, PDFs and images are in document editing what DAW projects, midis and mp3 files are in music and what Java source code, JVM bytecode and pure x86 machine code are in software.

The primary purpose of a PDF file is to tell you what to display (or print), with perfect clarity, in much fewer bytes than an actual image would take. It exploits the fact that the document creator knows about patterns in the document structure that, if expressed properly, make the document much more compressible than anything that an actual image compression algorithm could accomplish. For example, if you have access to the actual font, it's better to say "put these characters at these coordinates with that much spacing between them" than to include every occurrence of every character as a part of the image, hoping that the compression algorithm notices and compresses away the repetitions. Things like what character is part of what word, or even what unicode codepoint is mapped to which font glyph are basically unimportant if all you're after is efficiently transferring the image of a document.

If you have an editable document, you care a lot more about the semantics of the content, not just about its presentation. It matters to you whether a particular break in the text is supposed to be multiple spaces, the next column in a table or just a weird page layout caused by an image being present. If you have some text at the bottom of each page, you care whether that text was put there by the document author multiple times, or whether it was entered once and set as a footer. If you add a new paragraph and have to change page layout, it matters to you that the last paragraph on this page is a footnote and should not be moved to the next one. If a section heading moves to another page, you care about the fact that the table of contents should update automatically and isn't just some text that the author has manually entered. If you're a printer or screen, you care about none of these things, you just print or display whatever you're told to print or display. For a PDF, footnotes, section headings, footers or tables of contents don't have to be special, they can just be text with some meaningless formatting applied to it. This is why making PDF work for any purpose which isn't displaying or printing is never going to be 100% accurate. Of course, there are efforts to remedy this, and a PDF-creating program is free to include any metadata it sees fit, but it's by no means required to do so.

This isn't necessarily the mental model that the PDF authors had in mind, but it's an useful way to look at PDF and understand why it is the way it is.




The original goal of PDF was to have a portable print fidelity copy; WYSIWYG for real. You could take a PDF file and print it on a laser printer, a linotype, or a screen and it would look the same.

If you printed it on a postscript printer it would look exactly the same (or better, if you used type 1 fonts).


PDF is pretty much a symbolic representation of what needs to be printed out. It's symbolic so it can get rasterized onto whatever device in question, in a way that should be as accurate as possible to a print version.

That's the primary requirement of PDF, and has been since the beginning.

They added a bunch of interactive stuff to it, which are used occasionally (forms). But to understand PDF you need to understand the above first.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: