What’s Hiding in Your PDF?

m3nu · on Nov 5, 2018

Earlier this year I was working on hybrid PDFs[1] that embed a full XML invoice. The format is standardized and promoted by the Germans and the French.[2] One more thing to hide.

1: https://github.com/invoice-x/factur-x-ng

2: http://fnfe-mpe.org/factur-x/factur-x_en/

masa331 · on Nov 5, 2018

This is actually pretty cool. I've been working in an accounting company and I've been thinking about such things a lot lately. Is Factur-X used in practice in Germany and France? Do you know of any other similar things?

m3nu · on Nov 5, 2018

The lower-level library is used in Odoo ERP for example. Possibly others, as there are a few implementations[1] (mostly Java). This is lots of work because you need to create the XML yourself. I tried to make mappings between simple keywords and the XPaths of different XML standards (there are a bunch)[2]

So in theory you could just give a few keywords and get the full XML. Currently this is on hold, but if someone has a use case and wants to contribute, I'd be happy to continue working on it. I also provide this other library[3] that extracts some essential data from PDFs. The plan is to use them together to automatically build the XML from just the PDF.

1: https://www.invoice-x.org/related/

2: https://www.invoice-x.org/standards/

3: https://github.com/invoice-x/invoice2data
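
The keyword-to-XPath mapping idea described above could look roughly like this. This is only a minimal sketch: the schema names, element paths, and the `build_invoice_xml` helper are made up for illustration and are not taken from factur-x-ng or from any real invoice standard, whose paths are much longer and namespaced.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword -> element path mappings for two made-up schemas.
# Real invoice standards use longer, namespaced XPaths.
MAPPINGS = {
    "schema_a": {
        "invoice_number": "Header/ID",
        "date": "Header/IssueDate",
        "amount": "Totals/GrandTotal",
    },
    "schema_b": {
        "invoice_number": "Document/Number",
        "date": "Document/Date",
        "amount": "Summary/Payable",
    },
}


def build_invoice_xml(fields, schema, root_tag="Invoice"):
    """Build an XML tree by placing each keyword's value at its mapped path."""
    root = ET.Element(root_tag)
    for keyword, value in fields.items():
        node = root
        for tag in MAPPINGS[schema][keyword].split("/"):
            child = node.find(tag)
            if child is None:
                # Create intermediate elements only once, then reuse them.
                child = ET.SubElement(node, tag)
            node = child
        node.text = str(value)
    return root


fields = {"invoice_number": "INV-001", "date": "2018-11-05", "amount": "119.00"}
print(ET.tostring(build_invoice_xml(fields, "schema_a"), encoding="unicode"))
```

The same `fields` dictionary can be passed with `"schema_b"` to get the other layout, which is the point of keeping the keywords independent of any one standard.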

masa331 · on Nov 6, 2018

Thank you! I will look into these. I know the problems. When I was working at the accounting firm, I dealt with imports/exports from various accounting software a lot. In fact, I'm currently working on an app which can convert between them as my side project.

Leace · on Nov 6, 2018

As far as I understand this the "standard" is the XML that is being embedded inside PDF, so you need a small tool to extract the XML from the PDF and then you've got the standard XML invoice. The PDF is just for nice presentation.

I took a similar approach, using XML Stylesheets (XSL) to render the standard invoice as HTML when opened. That also looked nice.

IshKebab · on Nov 5, 2018

Is the PDF rendered from the XML? If not that sounds mad.

codetrotter · on Nov 5, 2018

Mad how? PDF is a notoriously bad format to extract data from, but great for visually representing the data to humans. XML is human unfriendly but good for structuring data so that software can read it. Embedding the machine readable XML representation of the data in the PDF ensures that both representations of the data are available always.

btgeekboy · on Nov 5, 2018

I think OP is concerned about conflicting data - the visual representation may not match the XML data.

leovander · on Nov 5, 2018

...the visual representation may not match the XML data

Hopefully OP doesn't learn about how his medical history is possibly passed around.

https://en.wikipedia.org/wiki/Continuity_of_Care_Document

askvictor · on Nov 6, 2018

Given that it's possible to have javascript in a PDF as well, it wouldn't be too hard to have a bit of code that verifies the XML matches the human-readable version. Or, failing that, some sort of crypto signature to check both to see things haven't been tampered with.
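
A rough sketch of the "check both" idea, run by the receiving software rather than inside the PDF. It assumes the XML bytes and the visible page text have already been extracted, and it uses a shared-secret HMAC only because that is in the Python standard library; a real scheme would more likely use a public-key signature. The `seal` and `verify` names are hypothetical.

```python
import hashlib
import hmac


def seal(xml_bytes, page_text, key):
    """Return an HMAC-SHA256 tag covering both representations of the invoice."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    # Length-prefix each part so the boundary between them is unambiguous.
    for part in (xml_bytes, page_text.encode("utf-8")):
        mac.update(len(part).to_bytes(8, "big"))
        mac.update(part)
    return mac.hexdigest()


def verify(xml_bytes, page_text, key, tag):
    """Compare tags in constant time; False if either part was altered."""
    return hmac.compare_digest(seal(xml_bytes, page_text, key), tag)


key = b"shared-secret"  # assumption: sender and receiver share this key
tag = seal(b"<Invoice><Total>119.00</Total></Invoice>", "Total: 119.00", key)
print(verify(b"<Invoice><Total>119.00</Total></Invoice>", "Total: 119.00", key, tag))
print(verify(b"<Invoice><Total>119.00</Total></Invoice>", "Total: 191.00", key, tag))
```

Note that this only shows neither part changed after sealing. It says nothing about whether the XML and the visible text agreed with each other in the first place.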

josefx · on Nov 6, 2018

As far as I can tell the standard prohibits PDF files with any dynamic content, including javascript. Also there is no point in embedding the verification code in the PDF if you don't trust its contents.

IshKebab · on Nov 5, 2018

Because it will be easy to accidentally make the invoice say something different to the XML. Imagine a company accepts and pays out invoices via this PDF format...
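
To make the concern concrete, here is a minimal sketch of a check a receiving system could run before paying out. It assumes the visible values have already been read from the page by some other tool; the element names, the `paths` mapping, and `find_mismatches` are hypothetical and not part of any standard.

```python
import xml.etree.ElementTree as ET


def find_mismatches(xml_text, visible_fields, paths):
    """Compare values in the embedded XML with values read from the visible page.

    Returns {field: (xml_value, visible_value)} for every field that differs.
    """
    root = ET.fromstring(xml_text)
    mismatches = {}
    for field, path in paths.items():
        xml_value = (root.findtext(path) or "").strip()
        visible_value = str(visible_fields.get(field, "")).strip()
        if xml_value != visible_value:
            mismatches[field] = (xml_value, visible_value)
    return mismatches


# Made-up example: the XML says 119.00 but the page shows 191.00.
embedded_xml = "<Invoice><Number>INV-001</Number><Total>119.00</Total></Invoice>"
paths = {"invoice_number": "Number", "amount": "Total"}
visible = {"invoice_number": "INV-001", "amount": "191.00"}
print(find_mismatches(embedded_xml, visible, paths))
```

An invoice with a non-empty result would be held back for manual review instead of being paid automatically.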

josefx · on Nov 5, 2018

It doesn't sound very safe, and the draft I found online only describes various restrictions on the PDF and the expectation that the XML is to be seen as an alternative representation for processing. There is nothing to enforce that the contents of both are identical or tamper-proof. At best you could claim that an invoice where the two don't match is invalid, but that would require manual verification.

Meanwhile I write the documentation of my xml configuration files in xsd and convert them to something readable using xslt. One set of data for processing and display everywhere and one less headache about duplicated and badly maintained data.

m3nu · on Nov 5, 2018

Mostly the other way round. There are 2 use cases I know of:

1. You generate invoices in your ERP and have all the info. To make your client's life easier, you embed the info as XML, so he doesn't need to type it from the PDF.

2. You get invoices without XML and use e.g. invoice2data[1] to extract key fields and then add them as XML for later.

1: https://github.com/invoice-x/invoice2data
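
The second use case can be sketched as regex-based field extraction from text that has already been pulled out of the PDF. This is a simplified illustration of the idea only, not invoice2data's actual API; the `TEMPLATE` patterns and the `extract_fields` helper are assumptions made up for the example.

```python
import re

# Hypothetical per-supplier template: field name -> regex with one capture group.
TEMPLATE = {
    "invoice_number": r"Invoice\s+No\.?\s*:?\s*(\S+)",
    "date": r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
    "amount": r"Total\s*:?\s*(\d+\.\d{2})",
}


def extract_fields(text, template):
    """Return the first match for each field; fields without a match are omitted."""
    result = {}
    for name, pattern in template.items():
        match = re.search(pattern, text)
        if match:
            result[name] = match.group(1)
    return result


# Made-up text standing in for what a PDF-to-text step would produce.
sample_text = "ACME GmbH\nInvoice No: INV-001\nDate: 2018-11-05\nTotal: 119.00 EUR\n"
print(extract_fields(sample_text, TEMPLATE))
```

The resulting dictionary of key fields is what would then be turned into XML and embedded in the PDF.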

jamescostian · on Nov 5, 2018

No, but it does include validation. I agree that it's mad for a person to use this, but having a machine use the API can help ensure that the PDF and XML always match

tonyedgecombe · on Nov 6, 2018

I seem to remember JDEdwards marked up their PDF output in a way that made it easy to parse out the data.

lower · on Nov 5, 2018

The vector drawing program Ipe stores all its data in a PDF. So, you create a drawing, save it as a PDF and can later open the PDF again for editing. Text can be written in LaTeX format and the PDF will contain the LaTeX source for later editing.

https://en.wikipedia.org/wiki/Ipe_(software)

jacobolus · on Nov 5, 2018

Illustrator has been doing this for a long time. You can choose to “preserve editing capabilities” at save time. I assume it is similar for other vector drawing programs.

Kenji · on Nov 5, 2018

Ipe is fantastic. I used it for technical images in my thesis and papers and the graphics turn out beautifully despite the program being so simple.

0xmohit · on Nov 5, 2018

PDFs can also have file attachments.

https://helpx.adobe.com/acrobat/using/links-attachments-pdfs...

The official documentation also seems to recognize those as a security risk :)

https://helpx.adobe.com/acrobat/using/attachments-security-r...

alexis_fr · on Nov 5, 2018

This French startup claims it is able to run some Javascript in the PDF, therefore notify you when a customer reads your offer, tell you which part he read, where he stayed the longest, etc. Does PDF support JS?

https://www.tilkee.com/

favorited · on Nov 5, 2018

Yup. If you open this PDF in Chrome, you can even play Breakout:

https://rawgit.com/osnr/horrifying-pdf-experiments/master/br...

nothis · on Nov 5, 2018

That's creepy and makes me glad I don't use Chrome (doesn't work in Firefox).

Kalium · on Nov 5, 2018

You can run JS in a PDF. You can even embed Flash in a PDF. We tend to think of PDFs as an innocuous document format, but there's a lot more than that baked in.

Adobe actually offers "features" like readership tracking in PDFs as part of their commercial offerings.

ndnxhs · on Nov 6, 2018

Thankfully, about half the "features" of PDF, like JS tracking and restrictions on editing or printing, are ignored by almost every PDF reader not made by Adobe.

SlowRobotAhead · on Nov 6, 2018

Strongly considering pulling Acrobat company wide and replacing with something like FoxIt.

Leace · on Nov 6, 2018

Last time I opened PDF with JS even Acrobat warned about that and disabled it by default (but that was some time ago).

mikeleeorg · on Nov 5, 2018

Wow, yikes, you CAN run Javascript in a PDF:

https://www.adobe.com/devnet/acrobat.html

SlowRobotAhead · on Nov 6, 2018

Horrible but not terribly surprising.

I’ve been told the PDF spec has some low-level functionality to support an MS-DOS emulator. Don’t know how true that is.

thedirt0115 · on Nov 5, 2018

You bet it does! https://pspdfkit.com/blog/2018/how-to-program-a-calculator-p...

kanzure · on Nov 5, 2018

Academic publishers also insert IP addresses and other deanonymizing information (about the user) into PDFs of academic papers, which should be removed.

https://github.com/kanzure/pdfparanoia

burtonator2011 · on Nov 5, 2018

I've been doing a TON of work around PDF lately with my Polar project and wanted to get feedback from you guys:

https://getpolarized.io/

I just implemented bulk PDF import this weekend.

It uses pdf.js for rendering the PDFs and extracting the metadata (including the fields discussed in this article).

I have to manage a ton of PDFs for my work / research (mostly textbooks and compsci whitepapers), and before working on Polar I was really struggling to manage all the data.

oever · on Nov 5, 2018

You can create a Hybrid PDF in LibreOffice. It's a PDF with ODF embedded.

https://wiki.documentfoundation.org/Faq/Writer/PDF_Hybrid

DyslexicAtheist · on Nov 5, 2018

There are some great PDFs on the PoC||GTFO site that go deep into the subject :) https://www.alchemistowl.org/pocorgtfo/

FredFS456 · on Nov 5, 2018

Sort of related, the company whose website this blog post is on makes the best Android PDF reader, and it's free: https://play.google.com/store/apps/details?id=com.pspdfkit.v...

(Disclaimer: I am not associated with the company)

everybodyknows · on Nov 5, 2018

From the Permissions disclosure:

  >full network access
  >run at startup
  >prevent device from sleeping

My guess is they're selling users' PDF reading activity data.

MartinMond · on Nov 5, 2018

PSPDFKit CTO here. We're not selling any user data, PDF Viewer exists because we sell an SDK and having a great app in the store makes it 1) easy to showcase the SDK to potential customers and 2) gives us a broad user base that tests the SDK and gives feedback for free.

everybodyknows · on Nov 6, 2018

Misbehavior by other apps makes such permissions a red flag to some of us.

Prime offender: Amazon Music. Latest rev AFAICT has no "Quit". Have to resort to Settings->Apps->Force_Stop to get its clutter off the screen.

doxavore · on Nov 6, 2018

I don't know that this isn't to somehow collect up user data, but this could be to showcase PDFs with embedded video. All my experience with the company has suggested they're on the up-and-up, though.

Disclaimer: I've used the company's SDK fairly extensively in my own product (but I'm not otherwise affiliated with them).

pmoriarty · on Nov 5, 2018

Are there any command-line tools that will let you see and/or edit all this metadata?

murkle · on Nov 5, 2018

This is very good - I used it to decompress the streams in a PDF https://mupdf.com/docs/manual-mutool-clean.html

snazz · on Nov 5, 2018

pdfinfo comes with Xpdf and allows you to view the metadata. See the man page: https://linux.die.net/man/1/pdfinfo

vuln · on Nov 5, 2018

Didier Stevens has some great cli tools https://blog.didierstevens.com/programs/pdf-tools/

sigjuice · on Nov 5, 2018

Exiftool might be able to do this. https://en.wikipedia.org/wiki/ExifTool

phonon · on Nov 5, 2018

You can check out http://pdfedit.cz/en/index.html (no longer updated)

Theodores · on Nov 5, 2018

This is really old and unmaintained, it also has a x-window type UI that takes you back to 1997.

I have found that despite the crude nature of it there are things you can find in documents that other tools gloss over.

Genuine thanks for the link and the nudge - I have to go forensic on a few PDFs and I had forgotten what the tool was, as it has been a while.

Luckily the PDFs I need to edit are old and there should be no problem doing apt-get install rather than compiling a tarball.

phonon · on Nov 5, 2018

Yup...not easy finding a free tool that will "decompile" a pdf. Too bad it's unmaintained :-(

http://pdfedit.cz/screenshots/screenshot1.jpg

http://pdfedit.cz/screenshots/screenshot2.jpg

http://pdfedit.cz/screenshots/screenshot3.jpg

vuln · on Nov 5, 2018

Didier Stevens has some great cli tools

https://blog.didierstevens.com/programs/pdf-tools/

phonon · on Nov 5, 2018

https://github.com/pdfminer/pdfminer.six is also useful

vuln · on Nov 5, 2018

Awesome, thank you!

mixmastamyk · on Nov 5, 2018

> it also has a x-window type UI that takes you back to 1997.

Or '87, by '97 I was using Gimp already.

jayalpha · on Nov 5, 2018

Master PDF Editor can do a lot. (Linux, commercial but the watermark somehow was never inserted into the file in my version)

https://code-industry.net/masterpdfeditor/

Theodores · on Nov 6, 2018

Thanks for that, it is really good if you just need to 'see' what is what and get images out of a PDF.

mpweiher · on Nov 5, 2018

CodeDraw, my live-code-drawing + GraphViz tool stores its source in the PDF. Also found a place to stash it in PNGs.

But you can treat a PDF very much like a big zip with some special purpose features. If you want to.

xte · on Nov 5, 2018

Well. Files can even be attached to PDFs, in a few cases as a "PDF source" in reproducible research. We have pdftk and other simple utilities to manage PDF metadata. It's also common to embed other kinds of "steganographic" information that may be really hard to discover. The simpler ones are like "dot printers" or simple white-on-white content in a plain PDF that can easily be read by a bot; others may encode individual pieces of the PDF's content with a Caesar-like cipher, etc.

PDFs are a vast and not-so-clean world...

anigbrowl · on Nov 6, 2018

Meanwhile most academic authors don't make any effort to add metadata that would actually be useful. I can't count how many pdfs I have called things like '0378final.pdf' with none of the fields like Author or Title filled in.

FetchBen · on Nov 5, 2018

I built a PDF template designer, but overlooked the metadata side of things. Might have to look into it after reading this.

https://fetchpdf.com

nothis · on Nov 5, 2018

I had to deal with PDF metadata for work and it's a deep rabbit hole. Lots of standardization headaches, too, especially when combined with PDF/A and such.

fareesh · on Nov 5, 2018

I assume mobile devices are relatively immune to most of the scary aspects of what can be hiding in PDFs, is that correct?

ndnxhs · on Nov 6, 2018

Mobile PDF tools won't strip all the metadata and comments out, but almost every PDF reader not from Adobe or Chrome ignores all the dumb parts of PDF like running JS, having network access, or restricting you from printing without a password.

tonyedgecombe · on Nov 6, 2018

It's a shame XPS never took off, it is superior to PDF in nearly every way but obviously failed in the market.