
Why are there so many exploits in this PDF software?

Where does this complexity come from?





Wow. Is there something other than PDFs that can be used to meet the same purpose? PDFs are looking really old and stanky right now.


If you were to try to meet the full specs of PDF for satisfying the same purpose, the outcome would be 10-20 separate specs, all of the same complexity.

The better idea is to segment out what exactly you want to use it for and use a specific file format for it.

For example: Do you want vector graphics? Do you want document signing? Do you want to just print a text-only document? Do you want to encode bitmap picture information? Do you want to show a document online? Do you care about colour spaces? Unicode? If Unicode, what kinds of Unicode? Font rendering? How do you like your glyphs and ligatures to look?

The spec is so big because it has like 10-20 purposes.


> the better idea is to segment out what exactly you want to use it for and use a specific file format for it

What I want is basically an entirely static (no JavaScript, forms, media elements, etc.) copy of a web page, with a logical deterministic rendering, and a fixed page size (no reflowing). Basically, if you took a web page and printed it in color on pieces of paper, the HTML + CSS that describes the stuff shown on the piece of paper is what I want a "portable document format" to be. (Along with a set of rules that specify exactly how that code should be rendered.)

What I want in the spec is basically dictated by that:

* vector graphics: yes, SVG is supported in all major browsers. https://developer.mozilla.org/en-US/docs/Web/SVG

* bitmap support: yes, let's start with PNG, JPEG, etc, and updates to the spec can introduce new formats

* color management: yes, should be required by the spec

* Unicode: yes, we can probably be UTF-8 only at this point?

* font rendering: deterministic; make it part of the spec. Fonts should be embeddable in the document. Ideally the font rendering for the end users should be as high quality as possible (this is quietly one of the things PDFs are already doing very well).

* glyphs / ligatures: should look exactly as they are determined by the author of the document. The spec should allow for the full use of the capabilities of an OTF font.

I think this probably covers the stuff 95% of people want from 95% of their PDFs, and it's vastly simpler than what's currently in the spec.

Honestly, PDF/A comes pretty darn close to getting there. The most recent version allows embedding arbitrary files, however, and there's lots of annoying cruft from the PDF format. (Renderers have to support displaying embedded XML forms, for example.)
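To make the "entirely static" requirement above a bit more concrete, here is a minimal sketch of a checker that rejects scripting, forms, and media elements in an HTML document. The deny-list and the names `FORBIDDEN_TAGS`, `StaticProfileChecker`, and `check_static` are hypothetical and do not come from any existing spec; a real profile would also need to cover CSS, external resources, and rendering rules.

```python
from html.parser import HTMLParser

# Hypothetical deny-list for a "static document" profile (an assumption,
# not taken from any real specification).
FORBIDDEN_TAGS = {"script", "form", "input", "button", "select", "textarea",
                  "video", "audio", "iframe", "object", "embed"}


class StaticProfileChecker(HTMLParser):
    """Collects tags and attributes that a static profile would reject."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in FORBIDDEN_TAGS:
            self.violations.append(f"<{tag}> is not allowed")
        for name, _value in attrs:
            # Inline event handlers such as onclick would reintroduce scripting.
            if name.startswith("on"):
                self.violations.append(f"{name} attribute on <{tag}> is not allowed")


def check_static(html_text):
    """Return a list of human-readable violations (empty if none found)."""
    checker = StaticProfileChecker()
    checker.feed(html_text)
    checker.close()
    return checker.violations
```

Calling `check_static` on a page that uses only text, images, and SVG should return an empty list, while a page with a `<script>` element or an `onclick` attribute returns one message per problem.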


This comes out to be a pretty good rundown! Interesting discussion!


Hmmm... just for fun, this is what I would like:

* All rendering done by raster chunks that get pieced together. If the PDF has a photo in it, it would be used as its own raster chunk.

* No special font rendering, but an idea of where text is so it can copy paste as though it is selecting text. Really it just outlines parts of the pre-rasterized text. Potentially text could be rasterized per letter for compression, but no dependency on font rendering abilities or local fonts should exist.

* No vector rendering, but the ability to select a rasterized vector image chunk and save as either .svg or .imgType.

* The ability to click html links

* The ability to write (with non-special fonts) into areas so as to fill out a form

* A basic Regex (limiter => error/warn message) for form fields
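The regex idea in the last bullet (a limiter that maps to an error or warning message) could look roughly like this. It is only a sketch: the field names, patterns, and messages in `FIELD_RULES` are made up for illustration, and the email pattern is deliberately loose.

```python
import re

# Hypothetical field rules: name -> (pattern, message shown when it fails).
FIELD_RULES = {
    "zip_code": (r"\d{5}", "ZIP code must be exactly 5 digits"),
    "email": (r"[^@\s]+@[^@\s]+\.[^@\s]+", "Email address looks malformed"),
}


def validate_form(values):
    """Return {field: message} for every field whose value fails its rule."""
    errors = {}
    for field, (pattern, message) in FIELD_RULES.items():
        value = values.get(field, "")
        # fullmatch requires the whole value to match, not just a prefix.
        if re.fullmatch(pattern, value) is None:
            errors[field] = message
    return errors
```

A viewer could run something like `validate_form` when the user leaves a field and show the returned message next to it, without needing a scripting engine in the document.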

-------

I think this would be enough to cover everything I've done with a PDF. Tests to pass:

1. Looks the same everywhere

2. Can click links (great for resumes)

3. Can view photos, and select them for download

4. Can fill out forms

5. Can copy text

6. ???

-------

Obviously size would be an issue here as you get to larger documents, but I suspect compression could be made efficient enough to be just fine in most cases.
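For a rough sense of the size concern, here is the back-of-the-envelope arithmetic for one uncompressed page. The page size (US Letter) and the 300 DPI resolution are assumptions picked for illustration, and `raw_page_bytes` is a hypothetical helper.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one rasterized page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# US Letter (8.5 x 11 in) at 300 DPI, an assumed print-quality resolution.
color = raw_page_bytes(8.5, 11, 300, 24)   # 24-bit RGB
bilevel = raw_page_bytes(8.5, 11, 300, 1)  # 1-bit black and white

print(color)    # 25245000 bytes, about 24 MiB
print(bilevel)  # 1051875 bytes, about 1 MiB
```

So under these assumptions a 100-page color document is roughly 2.5 GB before compression, which is why the compression step carries most of the weight in a raster-only design.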


Looks like DjVu might be what you want?


I'll just add that's version 1.4 (Acrobat 5), which is typically what many digital printing companies will request if possible. After 1.4 it was basically all useless features being added, which bloat the file size (though 1.4 has a bunch too). So later versions of the spec will be longer.

I do like the spec a lot and have actually used it to track down bugs in files before. It's very easy to follow if you're just looking at certain operations.


These days, I'd say the web: it's similarly complicated!


Correct me if I'm wrong, but PDF contains a JS engine within it.

The spec is also partially used for specifying and bootstrapping a publishing and printing system on its own, so it's like JS + CUPS + PostScript + Unicode + font rendering all combined into one mega spec.
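On the JavaScript point: in PDF files, script actions are marked with the name tokens `/JavaScript` and `/JS`, so a very naive first check is to count those tokens in the raw bytes. The sketch below does only that; `count_js_markers` is a hypothetical helper, and it will miss anything inside compressed object streams or names written with `#xx` hex escapes, so a zero count does not prove a file is script-free.

```python
import re

# /JS and /JavaScript are the PDF name tokens used by JavaScript actions.
# The lookahead rejects longer names such as /JSON or /JavaScriptFoo.
JS_NAME = re.compile(rb"/(JavaScript|JS)(?![A-Za-z0-9])")


def count_js_markers(pdf_bytes):
    """Count plain-text /JS and /JavaScript name tokens in raw PDF bytes."""
    counts = {"JS": 0, "JavaScript": 0}
    for match in JS_NAME.finditer(pdf_bytes):
        counts[match.group(1).decode("ascii")] += 1
    return counts
```

For example, `count_js_markers(open(path, "rb").read())` on a file (with `path` supplied by the caller) gives a quick hint about whether it is worth looking closer with a proper PDF parser.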


Don’t forget a complete 3D rendering engine (based on U3D models).



