Earlier this year I was working on hybrid PDFs[1] that embed a full XML invoice. The format is standardized and promoted by the Germans and the French.[2] One more thing to hide.
This is actually pretty cool. I've been working in an accounting company and I've been thinking about such things a lot lately. Is Factur-X used in practice in Germany and France? Do you know of other similar things?
The lower-level library is used in Odoo ERP, for example. Possibly in others, as there are a few implementations[1] (mostly Java). This is a lot of work because you need to create the XML yourself. I tried to make mappings between simple keywords and the XPaths of different XML standards (there are a bunch).[2]
So in theory you could just give a few keywords and get the full XML. Currently this is on hold, but if someone has a use case and wants to contribute, I'd be happy to continue working on it. I also provide this other library[3] that extracts some essential data from PDFs. The plan is to use them together to automatically build the XML from just the PDF.
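A minimal sketch of the keyword-to-XPath idea described above, using only Python's standard library. The keyword names, the element paths, and the `build_invoice_xml` helper are all made up for illustration; they are not the real Factur-X paths and not the API of the library mentioned.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword -> element path mapping. These paths are illustrative
# only and are NOT the real XPaths of Factur-X or any other invoice standard.
KEYWORD_PATHS = {
    "invoice_number": "Header/ID",
    "issue_date": "Header/IssueDate",
    "seller_name": "Seller/Name",
    "total_amount": "Totals/GrandTotal",
}


def build_invoice_xml(fields, paths=KEYWORD_PATHS, root_tag="Invoice"):
    """Build an XML tree by writing each keyword's value at its mapped path."""
    root = ET.Element(root_tag)
    for keyword, value in fields.items():
        node = root
        # Walk the path, creating intermediate elements that do not exist yet.
        for tag in paths[keyword].split("/"):
            child = node.find(tag)
            if child is None:
                child = ET.SubElement(node, tag)
            node = child
        node.text = str(value)
    return root


invoice = build_invoice_xml(
    {
        "invoice_number": "2024-001",
        "seller_name": "ACME GmbH",
        "total_amount": "119.00",
    }
)
xml_text = ET.tostring(invoice, encoding="unicode")
print(xml_text)
```

Supporting a second XML standard would then, in principle, only mean swapping in a different `paths` dictionary. An unknown keyword raises `KeyError` here, which is probably what you want for invoices.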
Thank you! I will look into these. I know the problems. When I was working at the accounting firm I dealt a lot with imports/exports from various accounting software packages. In fact, I'm currently working on an app that can convert between them as my side project.
As far as I understand this, the "standard" is the XML that is embedded inside the PDF, so you need a small tool to extract the XML from the PDF and then you've got the standard XML invoice. The PDF is just for nice presentation.
I took a similar approach, using XML stylesheets (XSL) to render the standard invoice as HTML when opened; this also looked nice.
Mad how? PDF is a notoriously bad format to extract data from, but great for visually representing the data to humans. XML is human unfriendly but good for structuring data so that software can read it. Embedding the machine readable XML representation of the data in the PDF ensures that both representations of the data are available always.
Given that it's possible to have JavaScript in a PDF as well, it wouldn't be too hard to have a bit of code that verifies the XML matches the human-readable version. Or, failing that, some sort of cryptographic signature covering both, to check that neither has been tampered with.
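To make the "signature over both" idea concrete, here is a rough sketch using Python's standard library. Note the swap: it uses a keyed digest (HMAC-SHA256) with a shared secret instead of a real public-key signature, because the standard library has no asymmetric signing. The function names and the choice to bind the XML bytes to the extracted visible text are assumptions for illustration, not part of any invoice standard.

```python
import hashlib
import hmac


def bind_representations(key: bytes, xml_bytes: bytes, visible_text: str) -> str:
    """Return a hex HMAC-SHA256 tag that covers both representations."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for part in (xml_bytes, visible_text.encode("utf-8")):
        # Length-prefix each part so bytes cannot be shifted from one
        # representation to the other without changing the tag.
        mac.update(len(part).to_bytes(8, "big"))
        mac.update(part)
    return mac.hexdigest()


def verify(key: bytes, xml_bytes: bytes, visible_text: str, tag: str) -> bool:
    """Check the tag in constant time."""
    expected = bind_representations(key, xml_bytes, visible_text)
    return hmac.compare_digest(expected, tag)


key = b"shared-secret-for-the-example"
xml_bytes = b"<Invoice><Total>119.00</Total></Invoice>"
visible_text = "Total due: 119.00 EUR"

tag = bind_representations(key, xml_bytes, visible_text)
print(verify(key, xml_bytes, visible_text, tag))
print(verify(key, xml_bytes, "Total due: 191.00 EUR", tag))
```

This only proves that neither part changed after tagging. It says nothing about whether the two parts agreed with each other in the first place.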
As far as I can tell, the standard prohibits PDF files with any dynamic content, including JavaScript. Also, there is no point in embedding the verification code in the PDF if you don't trust its contents.
Because it will be easy to accidentally make the invoice say something different to the XML. Imagine a company accepts and pays out invoices via this PDF format...
It doesn't sound very safe, and the draft I found online only describes various restrictions on the PDF and the expectation that the XML is to be seen as an alternative representation for processing. There is nothing to enforce that the contents of both are identical or tamper-proof. At best you could claim that an invoice where the two don't match is invalid, but that would require manual verification.
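For what it's worth, part of that verification could be automated outside the PDF. The sketch below assumes you already have the embedded XML and the visible text as strings (say, from some PDF-to-text tool); the element paths, field names, and the `find_mismatches` helper are hypothetical, and the substring check is deliberately naive.

```python
import xml.etree.ElementTree as ET


def find_mismatches(xml_text, visible_text, paths):
    """Return the names of fields whose XML value is absent from the visible text."""
    root = ET.fromstring(xml_text)
    # Collapse whitespace so line breaks in the extracted text do not matter.
    normalized = " ".join(visible_text.split())
    mismatches = []
    for name, path in paths.items():
        value = root.findtext(path)
        if value is None or value.strip() not in normalized:
            mismatches.append(name)
    return mismatches


# Illustrative XML and paths, not a real invoice schema.
SAMPLE_XML = (
    "<Invoice><Header><ID>2024-001</ID></Header>"
    "<Totals><GrandTotal>119.00</GrandTotal></Totals></Invoice>"
)
PATHS = {"invoice_number": "Header/ID", "total_amount": "Totals/GrandTotal"}

matching_text = "Invoice 2024-001\nTotal due: 119.00 EUR"
tampered_text = "Invoice 2024-001\nTotal due: 191.00 EUR"

print(find_mismatches(SAMPLE_XML, matching_text, PATHS))
print(find_mismatches(SAMPLE_XML, tampered_text, PATHS))
```

A plain substring test will miss things such as different number formatting ("119,00") or a value that happens to appear elsewhere on the page, so this is a sanity check rather than proof that both representations agree.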
Meanwhile, I write the documentation of my XML configuration files in XSD and convert it to something readable using XSLT. One set of data for processing and display everywhere, and one less headache about duplicated and badly maintained data.
Mostly the other way round. There are 2 use cases I know of:
1. You generate invoices in your ERP and have all the info. To make your client's life easier, you embed the info as XML, so he doesn't need to type it from the PDF.
2. You get invoices without XML and use e.g. invoice2data[1] to extract key fields and then add them as XML for later.
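A toy sketch of use case 2, assuming the PDF has already been converted to plain text. The regex patterns, field names, and `extract_fields` helper are invented for this example and are not invoice2data's API or template format; real invoices need per-supplier patterns.

```python
import re

# Hypothetical patterns for one imaginary invoice layout.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s+(?:No\.?|Number)[:\s]+(\S+)", re.IGNORECASE
    ),
    "issue_date": re.compile(r"Date[:\s]+(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total_amount": re.compile(r"Total[:\s]+(\d+(?:\.\d{2})?)", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict of the fields whose pattern matched the invoice text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields


sample = "ACME GmbH\nInvoice No: 2024-001\nDate: 2024-03-15\nTotal: 119.00 EUR\n"
fields = extract_fields(sample)
print(fields)
```

The resulting dictionary is the kind of keyword/value input that could then be turned into the embedded XML.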
No, but it does include validation. I agree that it's mad for a person to use this, but having a machine use the API can help ensure that the PDF and XML always match.
The vector drawing program Ipe stores all its data in a PDF. So, you create a drawing, save it as a PDF and can later open the PDF again for editing. Text can be written in LaTeX format and the PDF will contain the LaTeX source for later editing.
Illustrator has been doing this for a long time. You can choose to “preserve editing capabilities” at save time. I assume it is similar for other vector drawing programs.
This French startup claims it is able to run some JavaScript in the PDF and therefore notify you when a customer reads your offer, tell you which part he read, where he stayed the longest, etc. Does PDF support JS?
You can run JS in a PDF. You can even embed Flash in a PDF. We tend to think of PDFs as an innocuous document format, but there's a lot more than that baked in.
Adobe actually offers "features" like readership tracking in PDFs as part of their commercial offerings.
Thankfully, about half the "features" of PDF, like JS tracking and restrictions on editing or printing, are ignored by almost every PDF reader not made by Adobe.
Academic publishers also insert IP addresses and other deanonymizing information (about the user) into PDFs of academic papers, which should be removed.
It uses pdf.js for rendering the PDFs and extracting the metadata (including the fields discussed in this article).
I have to manage a ton of PDFs for my work / research (mostly textbooks and compsci whitepapers), and before working on Polar I was really struggling to manage all the data.
PSPDFKit CTO here. We're not selling any user data, PDF Viewer exists because we sell an SDK and having a great app in the store makes it 1) easy to showcase the SDK to potential customers and 2) gives us a broad user base that tests the SDK and gives feedback for free.
I don't know that this isn't to somehow collect up user data, but this could be to showcase PDFs with embedded video. All my experience with the company has suggested they're on the up-and-up, though.
Disclaimer: I've used the company's SDK fairly extensively in my own product (but I'm not otherwise affiliated with them).
Well, files can even be attached to PDFs, in a few cases as a "PDF source" in reproducible research. But we have pdftk and other simple utilities to manage a PDF's meta-information. It's also common to embed other kinds of "steganographic" information that may be really hard to discover: the simpler ones are like "dot printers", or simple white-on-white content in a plain PDF that a bot can read easily; others may encode a single PDF's content with a Caesar-like cipher, etc.
Meanwhile, most academic authors don't make any effort to add metadata that would actually be useful. I can't count how many PDFs I have with names like '0378final.pdf' and none of the fields like Author or Title filled in.
I had to deal with PDF metadata for work and it's a deep rabbit hole. Lots of standardization headaches, too, especially when combined with PDF/A and such.
Mobile PDF tools won't strip all the metadata and comments out, but almost every PDF reader not from Adobe or Chrome ignores all the dumb parts of PDF, like running JS, having network access, or restricting you from printing without a password.
1: https://github.com/invoice-x/factur-x-ng
2: http://fnfe-mpe.org/factur-x/factur-x_en/