Minimal PDF (brendanzagaeski.appspot.com)
363 points by ingve on July 8, 2017 | 99 comments



I've spent much of the last year down in the internals of pdfs. I recommend looking inside a PDF to see what's going on. PDF gets a hard time but once you've figured out the basics, it's actually pretty readable.

Some top tips: if you decompress the streams first, you'll get something you can read and edit with a text editor:

    mutool clean -d -i in.pdf out.pdf
If you mess with the PDF by hand, you can run it through mutool again to fix up the object positions.
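For example, a plain clean pass (no -d) should rewrite the file and recompute the xref offsets; file names here are placeholders:

    mutool clean edited.pdf fixed.pdf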

Text isn't flowed / laid out like HTML. Every glyph is more or less manually positioned.

Text is generally done with subset fonts. As a result, characters end up being mapped to \1, \2, etc. So you can't normally just search for strings, but you can often (though not always easily) find the characters via the Unicode map.
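To make that concrete, here's a rough sketch of what a decompressed content stream looks like (BT/Tf/Td/Tj/ET are real operators; the font name, glyph codes, and coordinates are made up for the example):

    BT                % begin a text object
    /F1 12 Tf         % select font resource F1 at 12 pt
    72 720 Td         % position the text cursor at (72, 720)
    (\1\2) Tj         % paint glyphs 1 and 2 of the subset font
    ET                % end the text object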


Here you go, a good talk about pdf.

https://media.ccc.de/v/27c3-4221-en-omg_wtf_pdf


I have used qpdf for similar purposes (QDF mode) and it's a great tool too!

A long time ago, when I only had access to an extremely slow 2G network but had to send a large-ish PDF file, I used qpdf to decompress the whole file as much as possible and then used xz -9 to compress it. Way better compression ratio.
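Roughly, the pipeline looks like this (a sketch; file names are placeholders):

    qpdf --stream-data=uncompress in.pdf big.pdf
    xz -9 big.pdf    # produces big.pdf.xz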


That was informative, thank you.

I think a lot of people in the dev/power user community wouldn't mind paying $1 for a Kindle ebook where you note all your findings.

There have been so many instances where I wanted to do stuff with pdfs but ended up deflated.

> subset fonts

So you mean if a font has been embedded with three glyphs, 0x41=A, 0x61=a, 0x62=b, then string Aba would be \1\3\2?


That's correct. As a sibling has said, there are other ways to do it, but most of the PDFs I need to work with are done by simply remapping in order of occurrence (e.g., if an X is the first char in the doc, it's referenced as \1). You can tell subset fonts because they're named RANDPREFIX+fontname (e.g. ABCDEF+Helvetica), so different subset fonts from the same base font won't collide.

You can get a good overview of the state of the fonts in your PDF using:

    pdffonts file.pdf
There's a column which tells you if there's a Unicode map available for the font. That's important. Because PDF is just rendering glyphs at positions, it doesn't even know what the character names are. To allow you to copy and paste, most fonts in most PDFs will have a Unicode map from the glyph id to the Unicode symbol.

If that's not available, in some cases you can rebuild it yourself by looking at the character encodings and substitutions.
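For reference, the ToUnicode map is a small CMap stream attached to the font; the part that matters is the bfchar (or bfrange) section mapping glyph codes to Unicode code points. A hand-simplified fragment, with the specific codes invented for the example:

    begincmap
    1 begincodespacerange
    <00> <FF>
    endcodespacerange
    2 beginbfchar
    <01> <0041>    % glyph 1 -> U+0041 'A'
    <02> <0062>    % glyph 2 -> U+0062 'b'
    endbfchar
    endcmap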

On the book, do you have any examples? I'll probably never get around to writing anything down, but if it looks easy enough it's probably worth having a stab at.

Also, large caveat, I'm not a PDF or font expert. I've probably decimated the terminology here but hopefully it gives you a rough idea.


> a Kindle ebook

I think you mean "a handcrafted pdf"?


The PDF reference is freely available and pretty readable too. I would recommend just reading that.

To answer your question, subsetting a font just means taking a portion of its glyphs, and it doesn't imply remapping. In fact, for most sane PDF files you will find ASCII characters mapped to themselves, making text search within a decompressed PDF possible. My dirty watermark remover script basically uses qpdf to decompress the thing and then uses regular expressions to search for Tj or TJ right after the specified string.
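A minimal sketch of that workflow (the watermark string "DRAFT" and file names are placeholders; fix-qdf ships with qpdf and repairs stream lengths and offsets after hand edits, and sed works here because QDF output is mostly plain text):

    # decompress into editable QDF form, one object per stream
    qpdf --qdf --object-streams=disable in.pdf work.pdf
    # blank out the watermark text drawn with the Tj operator
    sed -i 's/(DRAFT) Tj/() Tj/g' work.pdf
    # recompute stream lengths / offsets and write the result
    fix-qdf work.pdf > out.pdf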


This is a copy of the ISO 32000 PDF specification:

http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/p...

This is a long document, but it is very well written; if you read it on the bus or while you're waiting for your compiler to finish, you will get to understand it.


Thanks for sharing!


Adobe used to publish and distribute the PDF spec on their developer site. You used to be able to read it and hand-code PDFs. Not sure if such a resource is still available.

Wish I still had a copy but it was a while back.


The spec is still available freely: http://www.adobe.com/devnet/pdf/pdf_reference.html

For historical interest, older versions (going back to 1.3) are here: http://www.adobe.com/devnet/pdf/pdf_reference_archive.html

1.2, 1.1, and 1.0 can be found elsewhere on the Internet.


Sweet, thanks. Haven't been on the Adobe devnet in a while


See also on the same site: Hand-coded PDF tutorial | https://brendanzagaeski.appspot.com/0005.html

If you need more, the "free" (trade for your email) e-book from Syncfusion PDF Succinctly demonstrates manipulation barely one level of abstraction higher (not calculating any offsets manually): https://www.syncfusion.com/resources/techportal/details/eboo...

"With the help of a utility program called pdftk[1] from PDF Labs, we’ll build a PDF document from scratch, learning how to position elements, select fonts, draw vector graphics, and create interactive tables of contents along the way."

[1] https://www.pdflabs.com/tools/pdftk-server/


Alas, pdftk is mostly dead. It's written in the ersatz dialect of C that only GCJ could compile, and GCJ is officially gone.

The underlying library is just fine, but a decent front end is lacking.


What is that? Googling ersatz gcj gives your comment as the only relevant result. Also GCJ is/was a Java compiler as far as I am aware.


pdftk is a sort-of-C++ program that contains code like:

  #include "pdftk/com/lowagie/text/pdf/PdfReader.h"
  ...
                  if( input_pdf_p->m_password.empty() ) {
                        reader= new itext::PdfReader( JvNewStringUTF( input_pdf_p->m_filename.c_str() ) );
PdfReader is actually java/pdftk/com/lowagie/text/pdf/PdfReader.java in the pdftk source distribution. Yes, this is a C++ program that's instantiating a Java class. As far as I can tell, what's actually going on is that all the Java code is compiled to C++-ABI-compatible .o files using GCJ and pdftk.cc links against them, giving a native program that is nonetheless mostly written in Java. Yikes!

Perhaps unsurprisingly, GCJ didn't get a huge amount of traction, and it has been deleted from the GCC tree entirely. Good riddance, maybe, but it makes it rather difficult to compile pdftk.


Someone should port all that C++ code (there's not that much of it, really) to Java; that would make it a lot easier to compile, wouldn't it?


> ersatz
> /ˈəːsats, ˈɛːsats/
> adjective
> (of a product) made or used as a substitute, typically an inferior one, for something else.


>If you need more, the "free" (trade for your email)

Just a note that Google lets you get a new email address that isn't spammable, via security through obscurity:

-> You can use your gmail address, add a + after it, and add a keyword. So if you are jsmith@gmail.com you can give out jsmith+syncfusionpdfsuccintly@gmail.com and then later if that starts getting spammed you can redirect it.

NOTE:

This is an incorrect solution (Google, please fix this) because anyone can run a regex removing the + part.

Instead the correct solution is that if you have gmail open, in a single click you should be able to generate a high-entropy gmail address (that does not deplete the namespace) and link it on your end with "syncfusionpdfsuccintly".

If I already have gmail open, it should take 7 seconds to create a new gmail address, as follows:

  1.  Click something to start the process

  2.  Type "syncfusionpdfsuccintly" to tag it on my end

  3.  Click something to copy a resulting high-entropy gmail name into the clipboard.
I should then be able to paste it into a form, get it delivered straight into my inbox (never spam), and redirect it to spam if it starts getting spammed.

This would allow people to contact us without ever getting into spam, while entirely removing their ability to contact us if this email address starts getting spammed. There are no downsides.

I believe Google's engineers are smart enough to move from security through obscurity (relying on the knowledge that no spammer will ever invent and run the exact regex s/\+[^@]+@/@/g to remove the security through obscurity, as this would entirely break this security, exposing the underlying "protected" email addresses) to something that works.
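Concretely, anyone can strip the tag with one line (the address is a placeholder):

    echo 'jsmith+tag@gmail.com' | sed -E 's/\+[^@]+@/@/'
    # prints jsmith@gmail.com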

Until that day comes you can rely on the security through obscurity to give out a secure email address that can't be spammed. Just add a + and a tag!

Please.

Google: I believe you are smart enough to understand this comment and implement this solution, which can be prototyped in 30 minutes and solves the spam problem forever. You can do it! I believe in you. You're 99.999% there and your security through obscurity works very well for me. I use it.

I hope you will go above and beyond and solve the remaining 0.001%. It would just make me feel better to know that a 14-character regex couldn't defeat your solution.


Couldn't you do something like:

1. Register foo@gmail.com
2. Give out your email address to friends and family as foo+bar@gmail.com
3. Give out your email address to services as foo+{service name}@gmail.com
4. Reject anything coming directly to foo@gmail.com


That is a different kind of obscurity, because if that were my protocol, a spammer could make up a keyword and it would be delivered until I realized it wasn't me who made it up.

Maybe there is a way to whitelist keywords and only deliver tags I add one at a time via filters, but it is not the usual interface.


Many spammers have caught onto the foo+bar trick. When they detect such an address they will spam both foo+bar@gmail.com and foo@gmail.com.


Read his message carefully. They don't know about foo+bar@gmail.com


I wrote something in node.js that does exactly this. But I have only used it for personal use. I'm honestly surprised that nobody has done this already.

Right now it just silently drops expired addresses. But it would be so satisfying to bounce them instead (though bouncing behavior is something you have to consider carefully when running a mail service).

I was thinking of turning it into a service, but I'd have to read up on how to scale it. Running an SMTP server takes a lot of care. I've found that just using nearlyfreespeech.net's mail forwarding is the most reliable way to receive email, so I do that for now, since it is on a small scale.

I just got so frustrated with how this problem has such an obvious technical solution. At least for us users, anyway. It's not a solution for the marketers.

I strongly suspect that Google is very reluctant to do anything to make the email landscape unstable. I think that if Google started offering this, it would shake up so much of their business.


Bouncing email doesn't make much sense if you can reject easily (which would be your case).

Running an inbound SMTP server is much easier for laypeople than running an outbound one. The software (postfix, exim, etc.) is rock solid (you have to REALLY mess up to lose emails) and the protocol is very forgiving (all serious senders have good retry policies). I encourage you to try!


Actually, I'm gonna try https://grr.la/ryo/


As I mentioned, Google already offers this. Just add a tag with + to your existing gmail address and you instantly create a tagged new one.


Except as pointed out, a spammer (etc) can remove the tag and get the original address using a regex.


Only until Google fixes it as suggested. They have some smart people there and I believe in them!


> Instead the correct solution is that if you have gmail open, in a single click you should be able to generate a high-entropy gmail address (that does not deplete the namespace) and link it on your end with "syncfusionpdfsuccintly".

This is a reasonably priced paid service that does already exist. It's not from Google, and no, I won't name it, either publicly or privately, because I used it for years and I don't want their domain to be banned by those requiring your email address for everything. If you know how to search you will probably find it.

Basically, you sign up with them using one valid email address, then on their interface you can create as many addresses as you need (IIRC there is a limit, but I used even dozens at a time without problems) and add a keyword to each. All of those addresses will be redirected to the email you signed up with, but the From: field will also contain the keyword you specified, so that if you create an address for each service you sign up for, you instantly recognize who is spamming you when they use that address.

This is very effective and I filtered out a lot of spammers. I'm surprised there are not more services like this one around, or probably there are many but they keep a low profile to avoid being banned. That's why I'm not going to name that service, sorry. But it does exist indeed and is technically easy to implement.


You're afraid of naming the service, but if Google implemented my suggestion they could never be blacklisted. (Unless their high-entropy email tags followed some easily identified pattern.)

I'm not asking for the service to "exist". I'm asking Google to take twenty minutes and fix their solution, which already works but is security through obscurity.


> if Google implemented my suggestion they could never be blacklisted

Google seems to have built their brand intentionally to be the opposite of what you're asking for though; and absolutely they could be blacklisted with a simple "GMAIL ADDRESSES NO LONGER ACCEPTED HERE".

>which already works but is security through obscurity.

I'm not sure which one you are saying is security through obscurity here... blah+real.id@gmail.com... or the high-entropy mkKAjgsdf788hf87hf@gmail.com. Both are obscure, but it's a stretch of the imagination to start labelling this a security issue.


> > if Google implemented my suggestion they could never be blacklisted

> Google seems to have built their brand intentionally to be the opposite of what you're asking for though; and absolutely they could be blacklisted with a simple "GMAIL ADDRESSES NO LONGER ACCEPTED HERE".

I think "you can't block GMail" here is meant in the sense that "you can't block the Google crawler". It's certainly technically trivial to do so, but the opportunity cost from lost users will be, for most businesses, unacceptably high.


>I think "you can't block GMail" here is meant in the sense that "you can't block the Google crawler". It's certainly technically trivial to do so, but the opportunity cost from lost users will be, for most businesses, unacceptably high.

Excellent interpretation. Gmail = Google crawler. I've made a note of this now.

What needs to happen next is a deep discussion between yourself and logicallee, in the context of Google crawler as well as how to make gmail come further out of the dark ages with high entropy and no security obscurity.


it's not blah+real.id@gmail.com - it's real.id+blah@gmail.com which currently gets delivered to real.id@gmail.com with a tag of "blah". However this tag can be removed by spammers, hiding where they got my email address.

mkKAjgsdf788hf87hf is not the only possible high-entropy format; it could be of the type that gfycat uses, such as "uncommongrimyladybug". That is quite hard to blacklist.

Nobody is ever going to stop accepting gmail addresses; that suggestion is pretty ridiculous. Especially since I suggest that these addresses should be delivered straight to your real inbox (unless they start getting spammed). There's no reason people should stop accepting them.


Google mostly solves this with spam filters instead.


The biggest complexity (and security) problem with PDF is that it's also effectively an archive format, in which more or less every display file format conceived of before ~2007 can be embedded.


Yeah, pretty much. There's JBIG2, JPEG2000, CCITT Fax and Flash, to name a few. Oh, and a bunch of TIFF stuff without the wrapper. Some good news though: the PDF/A standards define various archive-safe subsets of PDF for which various verification tools exist.


On the other hand, PDF is probably the only widespread use of formats like JBIG2 and JPEG2000 --- which are rarely encountered as individual files, unlike JPEG, PNG, or GIF.

A lot of the scanned PDF ebooks on archive.org use JPEG2000+JBIG2, and the filesize vs. quality difference compared to more traditional formats like JPEG is quite apparent. They do take a noticeably longer time to render, however...


> They do take a noticeably longer time to render, however...

That's mostly due to a distinct lack of good JPEG2000 decoding libraries. We're building a PDF renderer library and JPEG2000 is a constant pain in the ass because of it - JPEG decompression is hardware accelerated on many platforms and also has a bunch of SIMD-optimized libraries. For JPEG2000 there's practically nothing, and due to the complexity of the format we count decoding times in seconds for some images, even on fast mobile phones.


I've been playing around a bit with JPEG2000 (slowly learning about the format, trying to write a decoder for it) --- whereas JPEG normally uses Huffman compression for the bitstream, which although not really parallelisable is relatively fast (essentially 1 table lookup per output value), AFAIK the bottleneck in JPEG2000 decoding is the arithmetic compression, which can't be parallelised either, and involves quite a few more operations than Huffman's inner loop.


XPS solved many of the problems with PDF, but it was far too late by then; PDF was well established.


Maybe it's time for a PDF-2017 standard that drops support for those older exploitable formats


Yes, it's called PDF/A.


If they’re exploitable, how would a new version help? Attackers would just use the older, exploitable versions. And if PDF viewers only allowed the newer version, you’d break support with every PDF made.


>And if PDF viewers only allowed the newer version, you’d break support with every PDF made.

That is not at all something that would have to be true.


It's called deprecation / forwards compatibility.


See also klange's resume: https://github.com/klange/resume. A resume PDF that's also a valid, bootable ISO 9660 image of toaru OS.


Looks like he did it "the hard way" --- and unfortunately it's not a truly valid PDF since the startxref isn't within the last 1KB of the file and the version number in the header is corrupt. Not all PDF readers will accept that.

On the other hand, it is possible to make a completely valid PDF and bootable ISO. The first 32KB of an ISO is officially "unused", which is probably why GRUB decided to put itself there, but that can be relocated somewhere else --- the El Torito boot descriptor will need to be updated to point to it --- and the PDF signature (which can be a valid one) and as many objects as will fit can be put in that area, with the rest anywhere else. The xref table can be moved to the very end and the offsets updated to point to the objects.


I've encountered .pdf files which internally embed a proprietary Adobe extension called XFA[1]. I think they are created using Adobe's LiveCycle product.

They are a real pain because they render fine in Adobe Acrobat, but most other PDF renderers (including browser built-in ones) can't render them. Instead they render a blob of interstitial "loading..." text that is also embedded in the PDF (which the XFA rendering would then overwrite). It was a pain to me personally because I had to figure out a way to do programmatic form-filling of some fillable form XFAs, and most PDF libraries don't work with them (they expect traditional AcroForms fillable forms).

But in reading the XFA specification I found it interesting it had its own JavaScript interpreter (including supporting XHR requests as part of some internet-integrated form-filling feature) and another proprietary scripting language called FormCalc. I guess it opened my eyes to PDFs being a container format and the kinds of things they allow you to embed.

[1]: https://en.wikipedia.org/wiki/XFA


That's unfortunate. I guess the same thing that happened to HTML pages (turning into less accessible JS-based SPAs) is now happening to PDFs.


Plain text ... but with hard offsets ... encoded as decimal integers. Yikes!


This is good but Postscript is even better. Someday I'll learn it and see what I can do with it.


When you want to learn it, I recommend the "Blue Book", aka "PostScript Language Tutorial and Cookbook" By Adobe Systems Incorporated. It's a very thin book, but a great tutorial and example reference. I enjoyed going through it, and still occasionally generate PostScript for visualizations.


Very much worth learning, if for nothing else than being an extremely cool stack language. I learned it for my first job, where I only had Turbo C 2.0, FoxBASE, and an HP4 printer with the PostScript module to do graphical reporting on a dataset.

chris_st's recommendations are how I learned it.


You lucky guy.

I remember going through HP and Epson printer manuals, writing down their control escape codes into an xBase table so that our Clipper application could talk to the printers and do the respective formatting.

Having access to a PS printer would have been a much more positive experience.


I used to hand-code Postscript files back when the Apple LaserWriter was launched. I had a little kaleidoscope-like thing that did patterns for Xmas decorations, and once I did a text-to-workflow routine to print out diagrams. It's all gone now (I did part of it on a VAX and part on an SE/30), but it was lots of fun at the time.


Postscript is easier to get started with, IMHO. See this example: https://en.wikipedia.org/wiki/PostScript#.22Hello_world.22

When I was actively playing with Sudoku programs, I wrote a bit of code that generated sudoku images in SVG, (E)PS and a few other formats. It was a bit fiddly, but not really complicated.


I'm frustrated by governments using PDF for fill-in forms, and yet open source tools are very weak in this area.

This is not better than paper and pencil in terms of accessibility, and we need to do better somehow.


If you like this you might enjoy this repo: https://github.com/mathiasbynens/small


Would be nice if browsers would support saving pages directly as PDF using their own PDF libraries.


Chrome has great HTML to PDF saving, which is in fact all I use it for.

My home-brew accounting software generates HTML invoices; I need them in PDF also, though.


I had no idea Chrome could do that. Tried it just now and it works perfectly. Thanks!


Save page as? Note, I'm not referring to print-to-PDF here.


What's the difference?


Printing tries to reformat the page to fit on some paper size and often removes details such as the background. Often it would be nice to make a PDF that shows a screenshot of the full current page; extensions exist for this, but nothing does it natively.


Why PDF, then? It could be saved as a PNG, which an ImageMagick one-liner can turn into a PDF (or several of them into a multi-pager).

It would be good if there were a way for any application to take a screenshot of the full page content, rather than just the viewport.
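Something along these lines with ImageMagick (a sketch; file names are placeholders):

    convert shot.png shot.pdf
    convert page1.png page2.png pages.pdf    # several into a multi-pager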


PNG is an image format which is usually significantly larger than the corresponding PDF document and rendered at fixed quality. PDF is (mostly) a vector format which can be resized at will.


But "details such as the background" will probably consist of a whole lot of raster images.


Or just a block of colour, which will be very efficient. If there are images, they are probably JPEGs and can be embedded in the PDF without any loss of quality, while still keeping the benefits of a vector format for the text.
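For instance, the img2pdf tool is built around exactly this: it wraps JPEGs in a PDF container without re-encoding them (a sketch; file names are placeholders):

    img2pdf photo1.jpg photo2.jpg -o photos.pdf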


> Why PDF then

It's nice to have the web page's text searchable without having to OCR the PNG.


Everything on the Mac has Print to PDF natively and it’s so nice

It’s one of the things I miss most when using another OS.


Not just print, but also the ability to natively _manipulate_ PDF, because the Mac still has the display PostScript stack from the NeXT era and PDF is essentially an envelope for it.

It's underused these days, but still available to apps, and they can interchange data in that format. Linux support for PDF isn't anywhere near as integrated.


> the Mac still has the display PostScript stack from the NeXT era and PDF is essentially an envelope for it.

This is a common misconception. Display PostScript was never present in any released version of macOS. It was replaced by the Quartz renderer, which is rather different.

Quartz can display and output to PDFs, but it does not use PDF as an internal format.


Windows has that now.


Got to admit - not used Win10 much. Is its PDF support good out of the box, system-wide now?


Yeah, there's a printer called "Microsoft Print to PDF" now, which works in pretty much everything (even something as crappy as notepad).


Notepad is probably the only program in Windows that I wouldn't call crappy. I think it follows the Unix philosophy of doing only one thing and doing it right. I'm not a Windows user, but it's been useful.


No, Notepad does one thing but it doesn't do it right. Can't open large files without locking up. Still saves UTF-8 files with a BOM. It can't deal with unix-style newlines.


Linux has had that since around 2000.


It’s had the ability to, but even in recent Ubuntu versions you have to install a PDF printer.

The level of integration is not even close to the same & it’s disingenuous to pretend it is currently, let alone from 2000.


Weird, every Ubuntu install I've ever seen had a "Print to file" option in the print dialogue - which just worked.


Indeed. I can't remember installing a PDF-printer any time the past decade. It's just been there.


This SE answer from six years ago corroborates that: https://askubuntu.com/questions/81817/how-to-install-a-pdf-p...

Although, they say it doesn't show up in every application.


Indeed - it does not. I’m looking at it right now not doing so.

So, as I said, not the same level of integration.


Yeah, apparently it requires the application to use GTK and its printing dialog: https://developer.gnome.org/gtk3/stable/GtkPrintUnixDialog.h...


That makes sense. It also highlights why Linux/Unix will likely never have the kind of seamless system wide integration I’m talking about - different design choices for the structure of the OS & GUI.

Nothing wrong with those choices (they give the end user more flexibility & control, for example), but it is a trade-off.


Applications from the KDE side do have a different printing dialog.


> Most PDF files do not look readable in a text editor. Compression, encryption, and embedded images are largely to blame. After removing these three components, one can more easily see that PDF is a human-readable document description language.

Of course, PDF is intentionally so weird: it was a move by Adobe because other companies were getting too good at handling PostScript.

Embedding custom compression inside your format is seldom worth it: .ps.gz is usually smaller than pdf.


Hmm, in my experience vim is actually pretty good at showing PDFs since not a lot of them use compression for text and other streams.


It's hardly custom compression. It's just deflate and ZIP, IIRC.


Yes, true. But it's built into the format, instead of transparently applied afterwards.


The same could be said for PNG and many other common formats that use zlib.


This page was helpful to me a couple years ago while crafting the tiny PDF used for testing in Homebrew. https://github.com/Homebrew/legacy-homebrew/pull/36606


PDF is like C++

it's used everywhere because you can do everything with it.

This also leads to the problem where you can do anything with it.

so each industry is kind of coming up with their own subset of pdf that applies some restrictions in the hopes of making them verifiable.

the downside is that these subsets slowly start bloating until they allow everything anyway.

i'm looking at you, PDF/A. grr.


PDF is literally the worst possible format for document exchange because it has the most unnecessary complexity of all document formats, which makes it the hardest to access. But popularity and merit are two totally different things.


That’s unfair. It’s terrible from a technical perspective (due to cruft, mostly) but nothing comes close from a deliverable standpoint.

It’s simply the most reliable print-ready format. It is a Portable Document Format in every way that matters to the end user

There’s a reason one of the main uses of LaTeX is outputting to PDF


Completely agree; as a dumb user of LaTeX, I only care that I can make a document and that it looks the same on every computer or browser, or printed out.

That solution just happens to be LaTeX, especially since virtually all computers will have some way of viewing and printing PDFs by default.


I really wish there was a better solution to typesetting than LaTeX (well, XeTeX if you’re serious about Unicode & language support I guess, which you should be)

I hate it so much, if it wasn’t for its excellent abilities in specific areas (hyphenation, etc) I’d much prefer CSS.

Yes, TeX makes me prefer CSS for layout. That’s how painful I find it.



