As much as many of us lament the state of today's software, products from a certain era - IE6, Flash, Java web applets - all shared the same poor code quality. They're mostly a non-issue these days, but not because they suddenly stopped having bugs; it's because they no longer get active use.
I remember rolling out Adobe Reader in those days, and as a product I don't believe its core has changed much. They've certainly managed to bolt on a whole lot of new features, but that can only have made things worse.
As much as this sounds like a call to kill Adobe, something needs to happen before that's feasible. For the average enterprise, Adobe Reader is far more ingrained than those products were. Case in point: in one organisation I asked whether Chrome's PDF viewer would cut it for them. One large department then ordered Adobe Professional for every user. They told me they didn't need it, they just knew I wouldn't propose removing a product they'd actually paid for.
Adobe Reader needs its HTML5 moment - an alternative that's not just "good enough for most people", but one that's actually better.
To be fair, in my experience Chrome's and Firefox's PDF viewers don't cut it. They're good for a quick preview, but especially when printing they occasionally render things slightly wrong, which is unacceptable for a file format whose entire point is to look the same everywhere. And then there are forms.
That doesn't mean there aren't any alternatives. Foxit, for example, is pretty good. But the in-browser alternatives just aren't there yet.
pdf.js has had 26 pull requests merged in the last month.
5,622 additions and 6,991 deletions. That's just in the project directly, not in the dependencies.
Of course, Acrobat also has PDF rendering bugs, and various other bugs apart from the security issues mentioned (their JavaScript implementation, for example).
As for printing... browsers aren't even good at printing HTML. The best browser for printing is based on the old Opera software (PrinceXML), and Safari is probably second. Remember that Apple's display system used to be based on PDF rendering... and they do a lot with CUPS and graphic designers.
However, in many browsers printing a PDF can go directly to the printer or the OS (most of which now support rendering PDFs directly).
We don't find bugs with it; it's just limited. Can't highlight, can't rotate individual pages, can't save rotated pages (you have to "print" it to "Save as PDF" again - awful UX).
Why can't we just burn PNGs[0] or lossless JPEGs and use OCR / other simple machine learning for text selection? Like, I get that there are some unfortunate souls out there who need to edit CAD documents in their PDFs, but for 99.999% of people PDFs do one thing that websites do not:
Print reliably well given a page format like A4.
I shouldn't have to wince every time I open a PDF. They're so insecure that a no-click RCE only fetches $10k.
[0] Or ideally SVG, but there are some problems with fonts and licensing that I'm struggling to remember at the moment.
Because OCR is expensive (to write as software and to process for the end user) and very error prone, especially if your text is anything other than a 12 point black font on a white background with no formatting (italics, underlines, etc.). If my document's information is valuable, I'm not going to be willing to rely on the quality of my recipient's OCR software to get a digitally readable copy of my work. I mean, at the very least, what if they're blind?
The general hatred for PDFs in the tech community is almost completely rooted in Adobe's initial decision to make PDF editing and creation cost $500. You have access to a document that you want to make changes to, but you can't, because it's a PDF and you don't have access to the document source because the owner/publisher didn't provide it. It's a PDF because PDFs make documents that look the same everywhere, even when printed, which is and will remain critical to the purpose of publishing documents. Images don't solve this problem either, because you still can't edit text in an image, and now you lose the ability to be sure about how they'll print (margins, scaling, etc.).
Furthermore, images, even compressed, are significantly larger than a well made PDF. For example, I've got a 6,700 page document of special ed student progress reports that include detailed, full-color charts and graphs of student progress with respect to goals. It's 60 MB. 8.5 KiB per page.
Then again, I imagine it won't be long before someone mentions LaTeX as a viable alternative, even though the one thing LaTeX isn't is portable. But LaTeX is primarily popular in the tech community because it lets programmers pretend to write code while they're actually writing documentation. Nowhere else will you find people telling you to use a set of programs that require a build environment when someone asks about the best home office application to use. (Yes, I know that LaTeX is a typesetting language. My cynicism is that some tech people tell others to use LaTeX when they're asked what word processor someone should use.)
> Then again, I imagine it won't be long before someone mentions LaTeX as a viable alternative, even though the one thing LaTeX isn't is portable. But LaTeX is primarily popular in the tech community because it lets programmers pretend to write code
Rude remarks notwithstanding, LaTeX and its ilk let you make PDFs, which are indeed portable. Setting up LaTeX is the same as setting up any other program, some of which are not portable either. ShareLatex.com [0] also exists for the purpose of using LaTeX anywhere.
People recommend LaTeX because it's in another league when it comes to typesetting and rendering more niche notation. It's also not user-hostile the way binary formats are: LaTeX source files will always be readable decades later; <binary app here> makes no such guarantees.
Whether it's a viable alternative depends on whether the user wants to make a minimal learning investment or not. If they don't, google sheets > export to pdf always exists.
No, the hatred for PDFs is that they're filled with bloat and horribly insecure.
As for OCR, we're able to handle underlines and italics for most fonts, though I take your point on colour. If it's especially bad, they fail. Ideally it wouldn't be PNGs, it would be some stripped-down thing. Maybe even HTML with embedded CSS / images via data URIs would fit the bill, but now we're bringing in XML-esque parsers and those are garbage too. I'm just so frustrated with dealing with PDFs. They serve a billion different purposes and they're good at none of them.
Accessibility, plus print is at ridiculous DPI compared to screen. To achieve compression you want to use the fact that there is a font being repeated across the page. OCR just isn't good enough.
Are you telling me that our compression algorithms can't compress a page of "e"s tighter than a page of random Chinese characters?
Accessibility is a fair point, but for print-to-file applications we're surely at the point where OCR can at least get the text to a readable format, no?
I've never noticed rendering errors, but my problem with the browser built-in PDF viewers is that they can't handle big complex PDFs, especially on older machines. They'll gobble up 4GB of RAM like it's nothing and start swapping on PDFs that Acroread or xpdf display in less than a second.
A counterpoint: for a while I was working at my uni's help desk, and we would ask all clients to print PDFs from Chrome as a matter of course just because it was so much more reliable at producing the correct output on paper, even when compared to Adobe Reader.
My organization uses Adobe extensively and we could never make do with the Chrome viewer. When you're opening 200-page documents with links, highlighted text and bookmarks, a browser plugin just won't be fast and responsive enough.
I would imagine this is the case in most large organizations.
> One large department then ordered Adobe Professional for every user. They told me they didn't need it, they just knew I wouldn't propose removing a product they'd actually paid for.
I'm assuming that they didn't actually need any of the professional features, they just saw it as a way to avoid having Adobe Reader/Acrobat removed from systems in favor of something they like less but admins like more.
Being able to run JS in a PDF sounds scary to a lot of people, but I wouldn't throw that idea out entirely.
If you follow the work by Bret Victor & others on "explorable explanations"[0][1] and interactive scientific papers[2], you probably appreciate the need for a self-contained format for interactive documents. Could PDF be this? I don't know, I hear the spec is too scary. But I'd say we should have something like that.
Distill[1] is another example of interactive scientific papers (with a focus on machine learning).
But is there really a good reason to not just keep these in browser? I don't really know if there's much value in reading these locally. Maybe this would be a good fit for an electron app?
I would like HTML files to mostly replace PDF documents. However, they lack a couple of things:
* A way to save back form data. I believe Google is working on a JS API to access local files (given a few conditions).
* A way to bundle the HTML with every JS script, resource, CSS file, etc., in one file, without making a huge mess.
If you had a tar.gz with an index.html inside, and the browser transparently allowed r/w access to the archive contents from contained JS scripts, this could solve a lot of use cases (heck, even "electron" apps could be replaced by this). One exception is printed documents (PostScript), which PDF is quite good at.
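For illustration only, here is a minimal sketch (in Python, with the standard tarfile module) of what the packing side of such a bundle could look like. The "index.html at the archive root" convention and the directory layout are assumptions; no current browser would know what to do with the result.

```python
# A minimal sketch of the packing side only, assuming the convention that
# index.html sits at the archive root; browsers would still need to agree on
# such a convention (and on r/w semantics) before any of this is useful.
import tarfile
from pathlib import Path

def pack_page(page_dir: str, out_path: str = "bundle.tar.gz") -> None:
    src = Path(page_dir)
    if not (src / "index.html").exists():
        raise FileNotFoundError("the bundle needs an index.html entry point")
    with tarfile.open(out_path, "w:gz") as tar:
        # Add the whole directory, rooted at "." so index.html, CSS,
        # scripts and images all end up at the top of the archive.
        tar.add(src, arcname=".")

# Example: pack_page("my-report")  # my-report/ holds index.html, style.css, js/, img/...
```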
I'm in the same boat with browser-based vs electron apps (see my question [1]). I don't think PDF-based forms are an alternative to reactive web forms though, as they aren't dynamic enough. The sole purpose of PDF is page-oriented print, which html(+js) can't deliver.
<Insert the long list of arguments about Internet (especially _fast_ Internet) not being as ubiquitous as living in SV could make you think.>
That behind us, there's also a matter of reliability and control. Services live much shorter than data they process; given today's trend, I wouldn't expect an online-only paper to be available after 5-10 years. Having a self-contained bundle would let me archive it independently, and would prevent any third parties from being able to interfere with my reading/exploration.
I recall listening to a presentation at RSAC around 2013 or 2014 where an Adobe CISO or CIO or someone pretty much said that they don't give a fuck about product security, e.g. it has zero impact on sales. I suspect it was thrown in as a bit of a trolling attempt in a conversation, but looking at their track record maybe that is the reality.
I've heard a variant of that talk delivered by a non-C-level at an appsec/prodsec-focused conference where the rehashed quote above (though I'm blatantly paraphrasing) was the justification used. Something more closely reflecting the truth might be "we can't realistically tackle the many security defects in Acrobat and Flash, so we sandboxed both applications instead to generally reduce the technical risks posed by any vulnerabilities in code."
Except somehow we still end up with horrendous security vulnerabilities in both. Putting things in a sandbox does not necessarily mean that you did it correctly.
Honest question: why can't Adobe hire product security engineers to do this kind of vulnerability discovery, or even hire 3rd-party consultants, to fix bugs/vulnerabilities before they even get into production?
Every CVE exposed by outside 3rd parties like this is a mark of shame on their software quality and reputation, IMO.
This. I left Adobe in 2008 (involuntarily :-) ), and it boggles my mind that they haven't done this sort of fuzz testing and fixed the issues in the last 10+ years. Sure, putting the code in a sandbox covers a multitude of sins, but I don't think that is sufficient. Many other Adobe products use the same code to read/write PDF files, and AFAIK they don't do it in a sandbox.
This is a great question, and I have thought a bunch about it; the only conclusion I could reach is that they don't care enough. This kind of news does not affect Adobe's stock price or their profits. Their users probably mostly don't care. So why bother paying $$$ for security engineers?
If an important zero-day in Adobe software caused rippling effects, perhaps they would care more? At this quality, and given enough time, that's probably bound to happen.
If you have a PDF document on your web site, please consider putting a link to https://pdfreaders.org/ instead of an unfair advertisement for Adobe Reader.
Which lists (except for pdf.js) more PDF readers written in C, some with a long history of CVEs, and typically not sandboxed by default.
Since many people are using a PDF reader to read PDFs from relatively untrusted sources, do yourself a favor and at least use a reader that does not have full system access.
macOS: Preview.app (uses macOS sandboxing)
Linux: Evince Flatpak on Wayland (Flatpak uses sandboxing. Wayland because X11 apps can read all keystrokes, mouse events, do screengrabs.)
Windows: no clue
All platforms: in-browser PDF reader with a browser that sandboxes.
Applications that can send commands to X.org servers can completely control it. The same isn't true for Wayland.
Flatpak is providing the actual application sandboxing, but being allowed to talk to the X server is a huge amount of privilege that can't really be restricted.
Unfortunately, yes it is. Just yesterday, my wife tried to open a PDF transcript from her college. It would not open on anything other than Adobe Reader on a traditional OS, putting it out of reach for her as an Android/Chromebook user. Neither Chrome nor Google Drive/Docs could open it. And I could only open it in Adobe Reader on my laptop - not Firefox, not Chrome, and not whatever default viewer my laptop has. We've had this problem with PDFs from another organization, too. It is a real problem.
Yeah official transcripts from my undergrad have (or had, haven't needed one in a while) some sort of authentication thing. Fortunately adobe reader for Linux was still supported when I needed one...
It's amazing that browsers have so far decided to just not have an HTML archive format that could replace PDF. The majority of what PDF does can be done better in a webpage. Why not just an extension like .phd that is actually a .tar.gz containing a webpage's assets? Present it the way PDFs are presented, and done.
I don't know how accurate that is for PDFs, but webpages are supposed to look the same, and given known-compatible styling, they should on any modern browser. Browsers are extremely consistent in content presentation; that's why webpages from the early 2000s still look the same.
No. Take for example font-family: sans-serif. That can look like anything, can have different widths on different devices, etc. Browser windows can have any size, devices can have various pixel densities, users can work at different zoom levels, etc. The previous big thing was responsive design.
Good point. I noticed that too. At least web pages are made so that the content is reflowed [that seems like the whole point], so it doesn't look like shit. It seems like [many?] PDFs place each character separately, so if the font actually used differs from the one used during creation, the result will look very messy.
Web page rendering is far from similar on different browsers. I agree that an alternative to PDF would be a good thing, but it probably would be more like a lightweight PDF than what HTML is today.
What? Lots of webpages look different after simply resizing the window! The fact that this is on purpose doesn't mean it doesn't happen (quite the opposite!).
That's because they're designed that way. You can do styling in a way that is not affected by browser window sizing, typically with specified document dimensions or absolute positioning.
> I don't know how accurate that is for PDFs, but webpages are supposed to look the same,
One of Adobe's early talking points for the value of PDF's was that they would "look the same on all systems". Of course some context is necessary. PDF first appeared in 1993. In 1993, while the internet did exist, most individuals who were not associated with a university, research lab, or govt. agency, had no access to 'the internet'.
As well, the computing world was much more diverse. One had DOS, early Windows, and various Mac OS variants all coexisting, plus numerous different variants of Unix on the numerous different RISC workstations in existence. And, here was the big deal, 'documents' created on each of these systems were to a large extent incompatible with each other. In this context, 'document' should be thought of as "a file used to create paper printouts" as opposed to what we think of as a 'document' now in 2018. There was some compatibility, in that Windows systems would, sometimes, read 'documents' produced by DOS-based word processors, and of course the lowest common denominator, the plain text file, was 'almost' compatible (line ending differences were the biggest incompatibility). But for anything more complicated, if person X created a 'document' on DOS, and they wanted person Y, using SunOS, to see a version that "looked the same", their best bet was to print their document to paper and give Y the printer output. Because if they could send the electronic file to Y somehow, chances were that Y would be unable to open it, and even if they could, there was a good chance that it did not 'look the same' (from a 'looks like the same paper printout' level of same).
PDF came about in this world where paper was still king, and Adobe's marketing of "looks the same" was really meant to be "produces the same paper printout for the receiver Y as it does for creator X". That is why, today, in 2018, that viewing a PDF still looks like one is viewing WYSIWYG of a paper printout. PDF is, quite intimately, tied to the concept that there are discrete sheets of paper that it is formatting data onto. Yes some viewers do provide an 'almost' HTML continuous scroll look, but that is done 100% in the viewer, the underlying PDF format is very paper page oriented at its core.
So, when comparing PDF intent to web page intent, the phrase "looks the same" has different meanings. For PDF, it was designed such that "looks the same" means that a paper printout looks identical to the original, and that the designer/creator has full control over the look while the viewer has no control over it. For web pages, "looks the same" is far less strict, and is really not the same meaning, because the web was always intended to allow the viewer much freedom in deciding how to display the HTML content, taking away the designer's ability to strictly determine look and presentation. With the result that HTML data was never meant to "look the same" with the same strictness intended by PDF.
That was really informative, thank you. Given the same rendering on browsers across platforms I imagine you could achieve the same effect as PDF, but it would be a spec on top of html+css, not inherently built for documents like PDF is as you said. There may be some differences in important edge cases, but PDF would still exist for business that relies upon it in that manner. I'm talking more of a replacement that fits the 90% of cases that don't deal with signatures and legally bound documents and such.
You can take a PDF and plot it, print it, or display it on screen and it will always look the same. SVG is closer to PDF than HTML is - and SVG gets a lot of grief for having an overly complicated spec too.
I remember saving .mht files with IE as a kid when working on assignments so I could disconnect the dialup and give my parents their phone line back :)
Sort of, but MHTML isn't a good format. It was a hacky reuse of what emails did: it embeds all content in a single file, not as an archive. Rather, you should be able to open up the HTML archive like an actual archive and see the individual files.
Opera 12 (the original one, before the managers decided that it should be based on Chromium) had .zip file support built in; that means that if the URL was
somepath/archive.zip/index.html
and index.html referred to other files, they would be read from the same zip, even if they existed only inside the zip.
I used it a lot for local archives of bigger content; it is amazingly convenient, and I'm sad that the same approach was not used anywhere else.
It's not trivial to get right on the security side (the zip implementation has to be robust, the URL handling too), but it's doable and it would be very practical to have.
Tangentially, the good thing about the zip format is that it has a so-called "central directory", which means that you don't even have to load the whole archive if not all the data is needed, just the last part of the file, and from there you get the offset and the location of the needed file. So zip files could work beautifully with HTTP range requests when they are huge (1). The small ones are most efficiently downloaded at once, of course.
(1) I've actually done this sequence by hand a few times when I had a slow internet connection and knew that I didn't need the whole zip file, just to see that all the files were inside: I made a range request for the end of the file, sized to cover the estimated number of files inside, and so I had the list of all the files in the archive without needing to download the whole archive. I reconstructed a file of the same size but left the rest of it as zeroes, and some of the zip tools I used read the archive directory exactly as I needed.
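That by-hand trick translates almost directly into code. Here's a rough sketch in Python of the same idea, assuming the server reports Content-Length and honours HTTP Range requests, and that the fetched tail is large enough to contain the whole central directory; the URL and sizes are placeholders.

```python
# List a remote zip's contents without downloading it: fetch only the tail,
# pad the front with zeroes to the original size, and let zipfile read the
# end-of-central-directory record and central directory from the tail.
import io
import urllib.request
import zipfile

def list_remote_zip(url: str, tail_bytes: int = 256 * 1024) -> list:
    # Ask for the total size of the archive.
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        size = int(resp.headers["Content-Length"])

    # Fetch only the last tail_bytes (or the whole file if it's smaller).
    start = max(0, size - tail_bytes)
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-"})
    with urllib.request.urlopen(req) as resp:
        tail = resp.read()

    # Reconstruct a file of the original size, zero-filled except for the tail.
    # zipfile only needs the central directory to list names; reading member
    # data would of course fail, since that part is all zeroes.
    fake = io.BytesIO(b"\x00" * (size - len(tail)) + tail)
    return zipfile.ZipFile(fake).namelist()

if __name__ == "__main__":
    for name in list_remote_zip("https://example.com/archive.zip"):  # placeholder URL
        print(name)
```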
The reason I don't suggest zip is its insecurity, like zip bombing. It'd be better for archival if we just had tar, and then something lightweight on top of it if compression is wanted. That way you could have JS generate the archive client-side.
It is interesting how the older web got some things right, though, and now it's 2018 and those ideas, which one would think should be robust by now, aren't even there.
Zip is not inherently insecure, any more than any other URL parsing and archive handling is. Technically it's an orders-of-magnitude better solution for random access (due to the existence of the central directory, as I've already mentioned) than tar.gz, if it's done right.
A pathological case can be constructed for every archive format, just like it can with relative file names etc., but such attempts can simply be rejected during processing once some thresholds are reached. The original article demonstrates that a JPG reading implementation can be bad enough, and the same can be said for every format, even text-based ones. It simply has to be done right (including fuzzing at the end).
90% of PDFs could be replaced using a background PNG/JPEG file and a visible/invisible text overlay.
Instead of forms embedded in the ".phd", one could just use HTML forms and then use JavaScript to export it as a ".phd" document, covering 99% of PDF use cases.
Acrobat Reader has been the poster boy for poor software for many years and it appears that Adobe have been good at adding new features to make it largely impossible for their competitors to keep up.
What is one to do?
Surely, the obvious answer is to ringfence PDF (or another new format) to the most basic features. These could more easily be handled by 3rd-party apps, both securely and with correct rendering. Let Adobe do whatever they want with their own format, adding loads of stuff people don't want; then the sell is harder for them:
Get a cheaper, safer app for writing portable docs which can do most things, or pay more money for a very insecure format that does stuff you don't need.
I assume that others have attempted at some point to make an OSS alternative to PDF and I'm guessing it hasn't worked yet?
If you're talking about the developer of the software? Potentially. As to third parties, this article goes into painstaking detail on how difficult it is to set up fuzzing for closed source binaries.
You need to understand a certain amount of "rules" around each API call, and while you can duplicate their normal usage, there's a certain amount of thought that has to go into it.
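For a sense of the gap: the crudest way to sidestep the harness problem entirely is to drive the closed-source binary from the outside with mutated files and watch for crashes. This is far weaker than the in-process, coverage-guided setup the article describes, and the target command, flag, and seed file below are placeholders, but it shows the basic loop.

```python
# A blind mutation fuzzer: corrupt a seed file, feed it to the target binary,
# and keep any input that makes the process die with a signal. No coverage
# feedback, so it explores far less than the approach in the article.
import random
import shutil
import subprocess

TARGET = ["./target_reader", "--render"]  # placeholder command line
SEED_FILE = "seed.pdf"                    # placeholder seed input

def mutate(data: bytes, flips: int = 8) -> bytes:
    buf = bytearray(data)
    for _ in range(flips):
        buf[random.randrange(len(buf))] ^= random.randrange(1, 256)
    return bytes(buf)

def main() -> None:
    seed = open(SEED_FILE, "rb").read()
    for i in range(100_000):
        with open("current.pdf", "wb") as f:
            f.write(mutate(seed))
        try:
            proc = subprocess.run(TARGET + ["current.pdf"],
                                  capture_output=True, timeout=10)
        except subprocess.TimeoutExpired:
            continue  # hangs are interesting too, but skipped here
        # On POSIX, a negative return code means the process was killed by a
        # signal (e.g. -11 for SIGSEGV), which is what we are fishing for.
        if proc.returncode < 0:
            shutil.copy("current.pdf", f"crash_{i}_sig{-proc.returncode}.pdf")
            print(f"iteration {i}: killed by signal {-proc.returncode}")

if __name__ == "__main__":
    main()
```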
We've been working on something like this for the past couple of months and we'll be launching in early/mid January![0] We've got experience working on large scale fuzzing infrastructure (Chrome fuzzing team, Coinbase fuzzing), and have modelled it similarly to Google's oss-fuzz[1], but for private projects and clouds.
We're always looking for companies and security researchers that want to fuzz but don't have the time/knowledge on how to do so (we automate a lot of the set up process and integrate nicely into your GitHub workflow) - drop me a line if you're interested - andrei@fuzzbuzz.io
Since the KDE, GNOME, and FSF foundations got significant contributions this year, I wonder why they don't join forces and hire a couple of full-time developers to make Poppler and all Poppler-based PDF viewers (Evince, Okular, etc.) actually useful for PDF forms and for animated and interactive content.
Adobe's software is large enough and deeply ingrained enough that people seem to give it a pass by today's standards for software stability. Lowering the threshold of well-shaped review nearer to git and libgit2 would yield even more value toward stepping through the software stargate.
Do people still use Adobe reader nowadays? Last time that I tried it in my school's library it took half a minute to load and render my document, and after that the whole UI was unresponsive.
I had a much better experience with Sumatra on windows and Zathura on Linux where my documents open almost instantly.
Libpoppler has poor support for PDF forms (especially Unicode[1][2]), embedded animation and the 3D extensions. In my opinion these areas are too important in real-world document exchange to be ignored (as is the case with the PDF FOSS tools).
I have never seen anyone use any of these features in the real world. I presume that embedded animation and 3D extensions are used in art-related fields? If so that would explain my ignorance.
PDF forms are used all over the place from what I can tell -- including a bunch of county government stuff I just had to deal with. No JS was involved though.
Interestingly enough I can't fill it out in Firefox, but I can with Preview.app. Running pdfinfo -js yielded some script, but it basically only looks like it's there as a gatekeeper so that you don't open the file with an older version of Reader. Is there more JS in there that pdfinfo can't extract?
The PDF used to apply for the BSA's Eagle Scout rank needs that stuff. I believe the reason is related to expandable text fields that might need to insert pages into the document. None of the non-Adobe viewers can handle it.
I have to use Reader to fill out my state tax forms because they use some modern JavaScript driven system to auto fill that no other reader can handle.
We had the same situation in the UK until recently, thankfully the new API-based system has opened it up to other platforms (eg Xero) and works very well.
A long time ago, when I used PDF exclusively as the format for my slides, I used features like animations and auto play videos embedded within PDFs. Very helpful if you want to give a presentation.
I was also convinced to install it, although after trying it, it looks more like less for PDFs than vim for PDFs (as many of the commands search or scroll in complex ways, but none of them modify the PDF).
Still, it's interesting to have something like less for PDFs!
Zathura can use mupdf as its PDF renderer (it can also use poppler). I like using the mupdf library through Zathura rather than using the mupdf application, because Zathura has plugins for other file formats too, like PostScript and DJVU, and that way I learn a single set of keystrokes to view all sorts of documents.
What do you use? I also had good experiences with Okular in the past but I have not tried it in ages. Before moving to zathura I used xpdf and evince but I was not satisfied by them (not to mention that they do not support postscript and djvu).
Zathura does have some vim keybinds but other than that I can't see any similarities.
I may have missed something, but it looks to me like this is really a test of just the JPEG 2000 part of Acrobat Reader. It is possible that Adobe built this part of the reader by taking some open source implementation of JPEG 2000 (such as the reference implementation) and modding it - probably by changing memory allocation to be consistent with AR's memory model. So it is possible that some or many of the discovered vulnerabilities are in fact part of the JPEG 2000 library, in which case the problem goes beyond Adobe Acrobat.
If you read the PDF spec from the late 90's, it is Stephen King novel-scary... container format, multiple encodings, encryption, embedded binaries, embedded JavaScript and more.
While working with the PDF format I sometimes get the impression that this complexity is what Adobe wants. As a result, Adobe Reader is the only viewer that implements the entire spec and can handle all (or most) quirks.
This is especially apparent when trying to edit arbitrary PDF files, which is sometimes not so easy or even impossible. Just the definition of fonts and the text layout is already so complicated that this is the logical consequence.
But perhaps the format has simply grown and led to additional requirements such as PDF/A, PDF/X, PDF/E and now PDF 2.0, the next standard, which makes everything even more complex... Will this ever stop?
PDF is an unusual format in the sense that it had a rather specific thing it tried to do and then it achieved that goal, so that it could be considered "done", but the product it was most associated with, Acrobat, tried to expand still.
PDF has the semantics of a digital print that is resolution-independent and supports copypaste and search (mostly by mapping glyphs back to text).
Resolution independence is already something higher-level than strictly "digital print", and being able to capture transparency is another such higher-level feature.
From the above perspective, PDF peaked at 1.4, when it got transparency support. Supporting roughly the PDF 1.4 feature set was what allowed the Mac Preview app to be good enough for Mac users that Apple could stop bundling Acrobat Reader with Macs.
After 1.4, PDF has gained better compression algorithms that don't really change what the format is about. PDF/A and PDF/X fit well with the notion of PDF as "digital print".
But Adobe has been trying to leverage Acrobat/PDF to other areas that don't fit the notion of "digital print". These include pre-Macromedia acquisition attempts to make PDFs a more dynamic platform and later inclusion of 3D models in PDFs. Other PDF viewers still work for users most of the time without this stuff, which is a signal of what PDF really is to users ("digital print").
(Filling in paper-like forms, while not true to the notion that PDF is a final-form format, sort of makes sense from the point of view of digital paper, though.)
> While working with the PDF format I sometimes get the impression that this complexity is what Adobe wants. As a result, Adobe Reader is the only viewer that implements the entire spec and can handle all (or most) quirks.
While that certainly does play in Adobe's favor, the complexity of the spec is also what occurs when, over time, new features, some never even envisioned by the original creators, are bolted on to keep the whole thing "relevant" and/or to add new "features" to keep it from becoming obsolete.
We can certainly argue whether the addition of different features was worth the complexity increase, but simply taking an existing system and bolting on the latest "hotness" to add to the checklist of "why one should upgrade" features also produces similar levels of complexity.
So some of the complexity increase is merely the fact that the PDF spec has been evolved to do things it was likely never designed to do in the first place.
The Office formats are well specified; they are complex because that is the nature of the software, but it is a world away from something like PSD or even PDF.
PDF is actually quite well specified, there are not many holes in the specification itself.[0] As to what Adobe Reader will do when it encounters an out-of-spec file, that is a lot fuzzier.
On the other hand, the Office file formats (especially Word) have many un- or underspecified cases.
[0] The only one I know of is finding the end of compressed inline image data.
I agree, the PDF spec is great, and very easy to understand (if slow to wade through). The hardest parts are when you have to duck out to read another spec for a contained format like TrueType.
Regarding Reader, I work with PDFs a lot, and the majority of issues have a fairly common pattern. The supplier has created a PDF in a 3rd party tool, which is invalid in a subtle way (production printers in particular are very specific about what they want to accept).
But it works fine in Adobe Reader, since it was built to be very tolerant in what it accepts, so it's often hard to convince the non-technical users that the file has an issue. It's great for end users but has meant that a lot of tools out there just didn't have to try too hard to make PDFs that mostly work, so programming workflows can be an issue.
I found quite a few areas that were vague when I was working with it.
The advantage of the Office formats is that they are zip files with a ton of XML, i.e. they are well defined. The application parts are another matter, of course.
The original criticism was that some parts are just binary blobs encoded in XML elements, which wouldn't surprise me at all, with Microsoft being allowed to tick the 'XML file format' checkbox while still getting to keep the binary format's advantages.
I see. I was mostly referring to semantic problems, of which I heard there are a lot (I haven't really worked with Office internals much), and also I was thinking of the pre-XML Office formats.
I remember reading in the past that Microsoft had corrupted the ISO standards body into publishing essentially fake standards that differed from what MS Office actually produced, so software like LibreOffice would output files that didn't work properly in Office, or vice versa. Are you saying that this is no longer the case and they are fully specified? I sometimes tell people about this, so I want to make sure I have my facts straight.
Okay, apparently I totally didn't notice the release of PDF 2.0 a year ago, even though I was working a lot with PDFs at that time. Also, this new version is an ISO standard that costs 198 CHF to download, so I hereby predict that it is basically dead in the water, since few people will bother implementing it. The new features also don't seem very interesting, and from what I gather the spec is still backwards compatible despite the major version number increment.
Like every bit of business software, there’s a load of stuff that shouldn’t be in there. It’s a really flexible container format though, and every one of these features went in because there was a need. Times change, things change and it could do with a tidy up, but it’s probably impossible without breaking everything for a load of businesses.
I was under the impression that PDF was created as a response to PostScript being "too programmable" and not "document enough", but then they decided that PDF was too minimal, and so they ended up bloating it beyond what PostScript ever was by including JavaScript, Flash, and other trash in it.
Honestly, it's possible to make VM windows show up as if they were normal programs, or you could just do things like Chrome-level internal sandboxing. There's no reason this has to be clunky.
I don’t know if this is true, but I’ve been told that the pdf spec at one point did/does contain some MS DOS emulation.
I’m seriously close to banning acrobat the program for my employees, just haven’t found a rock solid alternative that I can trust to not implement the same dumb parts of the spec.
Pretty sure that pdf.js from Firefox is safe. At least it runs as sandboxed javascript in the browser. I believe a standalone client may exist as well.
My issue with pdf.js is that it is really bad at copying: every time I try to copy something from it, every word (and sometimes individual letters within a word) ends up on a different line. I also had issues with rendering; in some (rare) cases it showed squares instead of the actual content.
Not to mention that it is actually horrifyingly slow compared to most of the viewers that I tried.
DJVU is a raster format. It's intended for scans and archiving printed media. It's possible to use it for documents produced digitally, but I don't think it would be a good idea.
PDF "core" is not that bad, but 90s "multimedia" craze turned it into badly designed graphical application runtime.
Thanks for the disambiguation; the raster-vs-vector part really is a major difference. Is PS a viable alternative (even though it is a programming language itself)?
AFAIK, PDF is mostly a container for PS with compression and better handling of fonts (BTW, can fonts be embedded in PS? How are fonts sent to the printer?).
Still, both formats are too printing-oriented. Reading documentation in PDF on a computer screen is not especially pleasant, and it's unbearable on phones.
The fact that companies still have the email -> employee PC -> Acrobat Reader pipeline enabled says a lot about what companies really think about security, posturing aside.
The reason being that there are almost no high-profile breaches I can think of where PDF vulnerabilities have been blamed. Unlike Office macros, Flash etc etc.