Hacker News
Dangerzone: Convert potentially dangerous PDFs, documents, or images to safe PDF (github.com/firstlookmedia)
317 points by panarky on March 8, 2020 | 56 comments



It's nice that PDF security is getting a bit more attention, but there are a number of things that this approach will trash. For instance, I don't have high hopes for the accessibility of the resulting PDF. (edit: needless to say, any software in your pipeline which does full interpretation of an untrusted file will itself become a target for attacks, so this is only a useful tool if it is run in an extremely restricted environment.)

I for one have been looking a lot into PDF/A for security. PDF/A is really meant for archival, but as a side effect has disallowed an awful lot of weird PDF features which are a security nightmare and pdf readers tend to implement badly/buggily. PDF/A-1 for example, the strictest level, disallows JPEG2000, TIFF, JavaScript, PostScript, embedded files... (PDF/A-3, FWIW is essentially useless from this angle, because they decided to allow arbitrary embedded files, so a valid PDF/A-3 could have pretty much anything in it).
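As a rough illustration of the feature classes mentioned above, here is a minimal sketch that counts a few risky name tokens in the raw bytes of a file. The deny-list `RISKY_NAMES` and the function name are my own illustrative choices, not part of any validator, and a plain byte scan will miss names that are hex-escaped or hidden inside compressed object streams, so treat it as a heuristic only.

```python
import re

# Hypothetical deny-list: PDF name tokens tied to features the comment above
# says PDF/A-1 disallows (JavaScript, embedded files, JPEG2000 via JPXDecode).
RISKY_NAMES = (
    b"/JavaScript",
    b"/JS",
    b"/EmbeddedFile",
    b"/EmbeddedFiles",
    b"/JPXDecode",
)


def find_risky_names(pdf_bytes: bytes) -> dict:
    """Count occurrences of each deny-listed name in the raw file bytes."""
    counts = {}
    for name in RISKY_NAMES:
        # Require a delimiter after the name so /JS does not match /JSFoo.
        pattern = re.escape(name) + rb"(?=[\s/<>\[\]()]|$)"
        hits = len(re.findall(pattern, pdf_bytes))
        if hits:
            counts[name.decode("ascii")] = hits
    return counts
```

An empty result does not mean the file is safe; it only means none of these tokens appear in the clear.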

There now exists a good PDF/A validator (https://verapdf.org/) which can be used to ensure PDFs conform to the standard, but of course, won't fix them if they're not.

PDF/A has an interesting implementation detail, however: compliant PDF readers are supposed to automatically "turn off" non-PDF/A features when they encounter a PDF which declares itself as a particular PDF/A variant (even if it then goes on to attempt to use non-compliant features), which would hopefully prevent dangerous sections from being decoded and so avoid exploitation. Another interesting feature of PDF is its appendable nature, which might raise the possibility of being able to "declare" an arbitrary PDF as PDF/A by simply appending an extra section to it, hopefully rendering it less harmful (though possibly at the expense of it appearing to have missing content when rendered).
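To make the "declares itself" part concrete: the PDF/A claim lives in the file's XMP metadata as a `pdfaid:part` value. Below is a minimal sketch that looks for that declaration in the raw bytes and returns the last one found, on the assumption that an appended section would sit later in the file. The function name is hypothetical, and a real reader resolves metadata through the cross-reference chain rather than by byte position, so this is only an approximation.

```python
import re
from typing import Optional

# XMP carries the PDF/A claim as pdfaid:part, written either as an element
# or as an attribute on rdf:Description.
_PDFA_PART = re.compile(
    rb"<pdfaid:part>\s*(\d)\s*</pdfaid:part>|pdfaid:part\s*=\s*[\"'](\d)[\"']"
)


def declared_pdfa_part(pdf_bytes: bytes) -> Optional[int]:
    """Return the last PDF/A part number declared in the file, if any."""
    matches = _PDFA_PART.findall(pdf_bytes)
    if not matches:
        return None
    # Each match is a (element_form, attribute_form) pair; one side is empty.
    element_form, attribute_form = matches[-1]
    return int(element_form or attribute_form)
```

Finding a declaration says nothing about whether the file actually conforms; that is what a validator is for.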


Couldn't you just use a whitelist for the features?

If the reader can open a PDF/A-1 file and ignore the bad parts, can't it open a PDF file as PDF/A-1 and remove the bad parts, before saving it again?

Then you could use the "re-rendering" technique to extract images:

1. Create a stripped PDF/A-1 file from the original PDF.

2. In a VM, render the two PDF files to two high-resolution image sets.

3. Use some CV algorithm to find the differences. For example, gaussian blur, subtract, threshold, find islands.

4. Use this to come up with areas which use complicated PDF features and/or images. Say this returns that there is an image on page 9 in the rectangle ((17, 338), (400, 300)).

5. Crop out page 9 in rectangle ((17, 338), (400, 300)) from the original PDF. Use some CV algorithm to detect the DPI and whether it's best to encode as JPEG or PNG. Encode it and add to list of images.

6. Add sanitized images back from list, mark PDF/A-1 file with images as PDF/A-2.

Of course, you could do this for links or whatever as well. Spit out a list of rectangles and link targets in the PDF, and then put them back in.
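Steps 3 and 4 of the list above could look roughly like the following. This is a minimal pure-Python sketch under my own assumptions: pages are already rendered to equal-sized grayscale grids (lists of rows of 0-255 ints), the blur step is left out for brevity, and `diff_islands` and its `threshold` default are illustrative names and values, not anything from an existing tool.

```python
from collections import deque


def diff_islands(page_a, page_b, threshold=32):
    """Return bounding boxes (left, top, right, bottom) of differing regions."""
    height, width = len(page_a), len(page_a[0])
    # "Subtract" and "threshold": mark pixels that differ noticeably.
    mask = [
        [abs(page_a[y][x] - page_b[y][x]) > threshold for x in range(width)]
        for y in range(height)
    ]
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # "Find islands": flood-fill one 4-connected region.
            left = right = x
            top = bottom = y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes
```

Each returned box is a candidate rectangle to crop from the original render in step 5.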


There are myriad ways that PDFs could be re-written and re-rendered, but they would all be quite complicated and/or throw away a lot of extremely useful "meta" information (bookmarks, signed sections etc.) and almost certainly make files much bigger. The idea of the "appending" trick would be to mutate the original file as little as possible, but convince the reader to open it in a safer mode.


The small issue with the append trick is that it assumes the reader application will now respect the new format and not open insecure parts... which might not be fully implemented in all cases and is reader specific.

Fully sanitizing the PDF yields better guarantees of security at the cost of lost functionality.


> The small issue with the append trick is that it assumes the reader application will now respect the new format and not open insecure parts... which might not be fully implemented in all cases and is reader specific.

Yes, note my original emphasised use of the term "supposed to".

> Fully sanitizing the PDF yields better guarantees of security at the cost of lost functionality.

Indeed it's a tradeoff. But if you're willing to throw away the features which this extreme sanitization would trample across and have any ability to design PDF out of your system, you're probably better off not using PDF at all in favour of some straightforward image format.


I am not sure whether bookmarks or signed sections are "extremely useful". This has some minor advantages, but it's also a large attack surface. At least in this way, you keep most useful features.

As jrowley said, if you trust the reader to sanitize it safely, why not trust it to open normal PDF files safely?


> I am not sure whether bookmarks or signed sections are "extremely useful".

Well, they are. You try using a thousand page manual that doesn't have a section outline, or try sending around PDFs whose authenticity is legally important (to people who aren't capable of using gpg). This is only scratching the surface - there are many features which are important to people with e.g. accessibility needs which aren't directly visible.

> This has some minor advantages, but it's also a large attack surface.

These are not a large attack surface. TIFF, JavaScript/PostScript, 3D content and video (yes - PDFs can contain video!) are a "large attack surface".

> As jrowley said, if you trust the reader to sanitize it safely, why not trust it to open normal PDF files safely?

Well, firstly, I emphasised how "PDF readers are supposed to automatically 'turn off' non-PDF/A features", so I already acknowledge the caveat. And as far as "trusting them" goes, disabling the decoding of certain features involves very few LOC. Completely implementing the whole specification for the more exotic PDF features is an incomparably huge number of LOC, which are proportionately more likely to have flaws.


The trick of retroactively declaring a PDF as PDF/A by appending an incremental update won't work well for signed PDFs, because the PDF reader would recognize the PDF as having been modified since the signature, and when displaying the signed version of the PDF (i.e. removing all incremental updates after the signature) the PDF/A declaration will not be part of it and hence the PDF/A restrictions will not be observed by the PDF reader. Put slightly differently: a PDF signature effectively freezes the non-PDF/A nature of a PDF.


Yes - I think this is only true following https://www.pdfa.org/recently-identified-pdf-digital-signatu...


Those vulnerabilities really have nothing to do with the PDF/A question. PDF signature validators have to check for them regardless of PDF/A, and the issue I raised above is independent of those vulnerabilities.


This is definitely a good brute force strategy

I ... think there’s another technique that relies a bit on trusting the printing drivers to do the right thing, where you can tell Ghostscript to print your document, and target another PDF. This should at least remove interactive components in a PDF
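For reference, a minimal sketch of what "print to another PDF" might look like with Ghostscript's `pdfwrite` device. The helper name is hypothetical and it only builds the argument list; whether the result is actually free of anything interactive is an assumption to verify, and the command should only ever be run inside a sandbox.

```python
def ghostscript_reprint_argv(src: str, dst: str) -> list:
    """Build a Ghostscript command line that re-distills src into dst."""
    return [
        "gs",
        "-dSAFER",            # restrict file access from within PostScript
        "-dBATCH",            # exit when the input has been processed
        "-dNOPAUSE",          # do not prompt between pages
        "-sDEVICE=pdfwrite",  # "print" to a new PDF instead of a printer
        f"-sOutputFile={dst}",
        src,                  # pass an absolute path so it cannot look like an option
    ]
```

The list can be handed to `subprocess.run(argv, check=True)` from within the sandboxed environment.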


I used to deal with PDF at my day job. Among all the tools we use in production, Ghostscript probably has the most 0-days. Thankfully we're paranoid about security and run everything in a sandbox. Still, it's no fun getting nagged by security to upgrade our Ghostscript version.


It's funny how they all seem to have been found by one person (Tavis Ormandy) too. It's like the setup and PostScript standard are so baroque that only one human understands them, and that human takes a week or so every year or two to research and drop another 0day.

https://bugs.chromium.org/p/project-zero/issues/detail?id=16...


Honestly it is starting to feel like most major exploits in most major software/platforms are being found by Tavis Ormandy these days. He's really good at what he does.


Well, Tavis is one of the infosec superstars, the amount of stuff he comes up with is amazing.


It's definitely brute force, in that it's the equivalent of printing a document onto paper and then scanning it back in. This "flattening" is highly effective at sanitising, but also removes all the semantic content in the process; the output should be several times larger than the input (and if it isn't, then it's an indication that something very suspicious was in the input....)


IIRC this is also how Firefox is doing it for its pdf.js print feature: https://github.com/mozilla/pdf.js/blob/master/web/pdf_print_...

Which is why when you print a PDF from Firefox, it doesn't look very nice. But it's safer than sending unsanitized PDFs to printers.


Indeed, I was hoping for something smarter, that would remove only the "risky" bits of PDF, but keep the overall structure (and size).


What about converting PDF to PostScript and back? It should keep most of the semantic information while removing the exploits.


Yeah, perhaps. The "gruntwork" would be to figure out if that is sufficient. Heck, taking your idea further, perhaps convert to something non-derivative like HP's PCL5 and back. Or SVG or...


Be very wary of exposing Ghostscript to untrusted data.


In (1) the author uses as a CV a PDF that is also a bootloader, and in the comments it seems that he has improved the code. I wonder if he could render Dangerzone futile.

(1) https://news.ycombinator.com/item?id=19344146

Edit: in the comments of that post there is a reference to pocorgtfo16.pdf [2], which is valid as a PDF document, a ZIP archive, and a Bash script that runs a Python webserver which hosts Kaitai Struct’s WebIDE, which allows you to view the file’s own annotated bytes. The ZIP archive has further resources, including insane reversing deep dives, code to study and more.

[2] https://www.alchemistowl.org/pocorgtfo/
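A file can be both a PDF and a ZIP because the two formats are located from opposite ends: PDF readers look for the `%PDF-` header near the start, while ZIP readers find the archive through the end-of-central-directory record near the end. Here is a minimal sketch of a check for that overlap. The function name and the exact search windows are my assumptions, and matching a signature only shows the file plausibly parses as that format.

```python
def container_signatures(data: bytes) -> set:
    """Report which container formats the bytes plausibly satisfy."""
    found = set()
    # Lenient readers have historically accepted the header anywhere in
    # roughly the first kilobyte, so leading junk is tolerated.
    if b"%PDF-" in data[:1024]:
        found.add("pdf")
    # The ZIP end-of-central-directory record is 22 bytes plus a comment of
    # up to 65535 bytes, so it sits within the last ~64 KiB of the file.
    if b"PK\x05\x06" in data[-66000:]:
        found.add("zip")
    return found
```

A result containing more than one format is a hint that the file deserves a closer look before anything parses it.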


Good, but I don't think the change from running the sandbox in a VM (in Qubes' case even Xen, not KVM) to Docker is an improvement. Escaping Docker is much easier than escaping Xen.


Useful tool -- it's trivial to get a RAT past chat/email filtering in a .doc/.PDF attachment.

I don't open any files on my PC from people I don't personally know -- use webviewers.


FWIW... you probably shouldn’t even trust your contacts. People get phished all the time.


As far as phishing goes, few things are more effective than popping a medium sized law firm and sending form letters from their (legit) systems as a real person.

Click-through rate for a technically legitimate "you are party to a lawsuit" email must be sky high.


Odd question. Why would a webviewer be safer in this case?

edit: Thank you for both answers. I thought it had to do with sandbox rationale, but couldn't mentally get past the fact that sandbox could potentially be escaped too. Eh, I think it is time for sleep.


If it's rendering locally, at least the browser is sandboxed. And if it's rendering server-side, then at worst someone else's machine gets compromised instead of yours.


Well is it safe? Don't know.

Safer: definitely. Given that the collective amount of PDF attacks is some number, now this particular PDF needs to attack PDF and the webviewer. Assuming that 1% of all PDFs do that, I'd say it's 100 times safer than not using a webviewer.

If you still think that 1% of all potential PDF attacks is too unsafe, then that's a different discussion.

If you think my 1% is off, then that's a different discussion too. All I'm saying is that it's safer.


Well, PDF attacks need to attack the viewer you're using too…


True, but in most cases this is assumed to be a popular PDF reader. If it is specifically targeting a webviewer, I agree. But that still means that there is some JS PDF parser in between, though that provides very little in terms of security: I doubt that such a parser will check for malicious input.


Afaik chromium uses the same pdf renderer as foxit (pdfium)


>a popular PDF reader

What's the most popular, Chrome browser I'd have thought?


On macOS, the PDF reader Preview.app runs sandboxed by default.

The Preview.app sandbox is not quite as secure as what is used by browsers such as Firefox or Chrome on macOS for web content, so there is probably still a benefit to viewing PDFs in the browser, depending on whether the browser or PDF viewer is more hardened against these attacks.


This is similar to how QubesOS converts “untrusted PDFs” in a disposable VM. Quite nifty, particularly if you use OCR afterwards.

See: https://theinvisiblethings.blogspot.com/2013/02/converting-u...


This kind of makes me wonder why PDFs can even act maliciously in the first place. Why does it have the ability to do these things?


PDF derives from PostScript which is a full-blown programming language so it's an "original sin" either way.

Then over time Adobe added a number of interactive (forms), multimedia and rich media (embedded JS) features, leading to even more vectors.


The page description language part of PDF is based on Postscript, but explicitly simplified to be non-Turing-complete and safe (if implemented sanely). The later additions are the main culprit I think.


"if implemented sanely" - oh well. The original idea was nice.


I think it's just that Adobe wanted to add more features, even ones that have no place in the PDF format.


Because computer, a benign feature of pdf can still lead to an exploit in a viewer.

Note the attempts at the link to sanitize image formats that don't have over the top complexity.

If your question is why an electronic document format has support for images and interactivity, I don't know what to tell you.


Very nice, but FYI: most malicious PDFs just contain links to something else, usually shortlinks. Social engineering is hard to mitigate.


If I understand this correctly, a link wouldn't survive this as the pdf is turned into images and then those images back into a pdf. So it's essentially like a scan of very high quality.

What you would end up with is an image that looks like a link but would not be clickable.


Doesn't matter if it is not clickable. There are existing phishing campaigns that use JPEG attachments asking the user to copy and paste the link (or just let the user try to copy, fail, and manually type it out). Perhaps adding text warning users not to open any links in the document would easily prevent that?


> Dangerzone can optionally OCR the safe PDFs it creates, so it will have a text layer again

I'm not completely sure, but wouldn't this parse links and make them accessible again, possibly even clickable?


A link's displayed text and its destination URL are not necessarily the same. Rendering the document to a bitmap and then OCRing that would get the display text rather than the URL. I would think that it would be normal for a malicious URL to be obscured with innocent-looking display text.
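To illustrate the display-text versus destination distinction, here is a minimal sketch that flags a link whose visible text names one host while the real target points at another. The function name is hypothetical, the comparison is deliberately simplistic (exact hostname match only), and it assumes you still have access to both values, which is no longer the case after the document has been flattened to pixels.

```python
from urllib.parse import urlparse


def link_text_mismatch(display_text: str, target_url: str) -> bool:
    """True when the visible text names a host other than the real target."""
    text = display_text.strip()
    # Ordinary wording such as "Click here" makes no claim about a host.
    if len(text.split()) != 1:
        return False
    # Treat bare hostnames like "example.com/login" as scheme-less URLs.
    shown = urlparse(text if "://" in text else "//" + text).hostname
    actual = urlparse(target_url).hostname
    if not shown or "." not in shown or not actual:
        return False
    return shown != actual
```

Innocent-looking wording with no hostname in it passes this check untouched, which is exactly the case described above.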


Given the security focus here I'd be somewhat surprised if they did this - links are one of the main threat vectors associated with pdfs.


Maybe use a pdftotext wrapper to extract the text alongside the image-based PDF.


A PDF smartform can run ActionScript.

Fortunately smart forms require Adobe Viewer, and there's an approval step (similar to agreeing to Excel macros), but after that it can do whatever the hell it likes.


If the OP renamed to “HighwayToTheDangerzone”, would this help?


This wouldn't work for businesses because of PDF forms, legitimate links and even embedded objects. I think this is a great idea if you have users preview the image version before opening the whole thing, maybe? Even as an individual, I have had to fill out PDF forms for various US government applications, job applications, etc.


Obligatory humblebrag/shameless plug for my open source PDF(+ other docs) to Image converter, which runs as a web app. It's self hosted open source (but easiest to run on FreeBSD). Uses Ghostscript/OpenOffice under the hood:

https://github.com/dosyago/p2..git


[flagged]


When one enables functionality, this results in both good and bad behaviour.

Microsoft did not build the guns, it made engineering possible. Some engineers are bad. But a lot are good.


> Download Dangerzone

So instead of opening a PDF from the internet on my machine, I now run code from the internet on my machine?


There's a big difference between opening a random PDF and downloading code which you can view yourself on github, no? Or does your computer (from cpu microcode up) only run code which you have personally written and is not "from the internet"?


You have not yet downloaded “HighwayToTheDangerzone”. You are still ok.



