I've spent much of the last year down in the internals of PDFs. I recommend looking inside a PDF to see what's going on. PDF gets a hard time, but once you've figured out the basics it's actually pretty readable.
Some top tips: if you decompress the streams first, you'll get something you can read and edit with a text editor:
mutool clean -d -i in.pdf out.pdf
If you mess with the PDF by hand, you can run it through mutool again to fix up the object offsets.
Text isn't flowed / laid out like HTML. Every glyph is more or less manually positioned.
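To make "manually positioned" concrete, here's roughly what a decompressed content stream looks like (the operators are standard PDF text operators; the font name and coordinates here are made up):

```
BT                % begin text object
  /F1 12 Tf       % select font resource /F1 at 12 pt
  72 720 Td       % move text position to (72, 720); origin is bottom-left
  (Hello) Tj      % paint the string at that position
ET                % end text object
```

Each run of text is placed with its own positioning operator before the Tj; there's no reflow.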
Text is generally set with subset fonts. As a result, characters end up being remapped to \1, \2, etc., so you can't normally just search for strings, but you can often (though not always easily) recover the characters from the Unicode map.
I have used qpdf for similar purposes (QDF mode) and it's a great tool too!
A long time ago, when I only had access to an extremely slow 2G network but had to send a large-ish PDF file, I used qpdf to decompress the whole file as much as possible and then used xz -9 to compress it. Way better compression ratio.
That's correct. As a sibling has said, there are other ways to do it, but most of the PDFs I need to work with are done by simply remapping in order of occurrence (e.g., if an X is the first char in the doc, it's referenced as \1). You can tell subset fonts because they're named RANDPREFIX+fontname, so different subset fonts from the same base font won't collide.
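As a toy illustration of that order-of-occurrence remapping (this is not a real PDF parser, and the map below is made up), decoding a string from such a font is just a lookup through its Unicode map:

```python
# Hypothetical ToUnicode map for a subset font: glyph codes were assigned
# in order of first occurrence, so they carry no meaning on their own.
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}

# The content stream shows the string as raw glyph codes, e.g. (\1\2\3\3\4) Tj
glyph_string = b"\x01\x02\x03\x03\x04"

# Iterating over bytes yields ints, so each code indexes straight into the map
text = "".join(to_unicode[code] for code in glyph_string)
# text == "Hello"
```

Without the map, those bytes are meaningless, which is why search-and-copy breaks when it's missing.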
You can get a good overview of the state of the fonts in your PDF using:
pdffonts file.pdf
There's a column which tells you if there's a Unicode map available for the font. That's important: because PDF is just rendering glyphs at positions, it doesn't even know what the character names are. To allow you to copy and paste, most fonts in most PDFs will have a Unicode map from the glyph ID to the Unicode symbol.
If that's not available, in some cases you can rebuild it yourself by looking at the character encodings and substitutions.
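For reference, a ToUnicode map is itself plain text inside the PDF (a CMap stream); a minimal sketch of pulling glyph-to-Unicode pairs out of its bfchar section (the snippet below is made up, and real CMaps also use bfrange entries, which this ignores):

```python
import re

# Hypothetical bfchar section from a decompressed ToUnicode CMap stream
cmap = """
2 beginbfchar
<0001> <0048>
<0002> <0065>
endbfchar
"""

# Each bfchar line maps a glyph code to a Unicode code point, both in hex
pairs = re.findall(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>", cmap)
to_unicode = {int(glyph, 16): chr(int(uni, 16)) for glyph, uni in pairs}
# to_unicode == {1: "H", 2: "e"}
```

This is the structure you'd be rebuilding by hand if the map is missing.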
On the book, do you have any examples? I'll probably never get around to writing anything down, but if it looks easy enough it's probably worth having a stab at.
Also, large caveat, I'm not a PDF or font expert. I've probably decimated the terminology here but hopefully it gives you a rough idea.
The PDF reference is freely available and pretty readable too. I would recommend just reading that.
To answer your question, subsetting a font just means taking a portion of its glyphs; it doesn't imply remapping. In fact, in most sane PDF files you will find ASCII characters mapped to themselves, making text search within a decompressed PDF possible. My dirty watermark-remover script basically uses qpdf to decompress the thing and then uses regular expressions to search for Tj or TJ right after the specified string.
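The approach described above can be sketched like this (the content-stream snippet and watermark string are made up, and a real script would decompress with qpdf first and handle string escaping):

```python
import re

# Hypothetical decompressed content stream containing a watermark
content = b"BT /F1 48 Tf 100 400 Td (DRAFT) Tj ET BT /F2 12 Tf (Real text) Tj ET"

# Blank out the text-show operator (Tj) whose literal string is the watermark
cleaned = re.sub(rb"\(DRAFT\)\s*Tj", b"() Tj", content)
# "DRAFT" is no longer painted; the rest of the stream is untouched
```

This only works because ASCII is mapped to itself in the stream; with a remapped subset font you'd have to search for the glyph codes instead.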
This is a long document, but it is very well written; if you read it on the bus or while you're waiting for your compiler to finish, you will come to understand it.
Adobe used to publish and distribute the PDF spec on their developer site. You used to be able to read it and hand-code PDFs. Not sure if such a resource is still available.
If you need more, the "free" (trade for your email) e-book from Syncfusion PDF Succinctly demonstrates manipulation barely one level of abstraction higher (not calculating any offsets manually): https://www.syncfusion.com/resources/techportal/details/eboo...
"With the help of a utility program called pdftk[1] from PDF Labs, we’ll build a PDF document from scratch, learning how to position elements, select fonts, draw vector graphics, and create interactive tables of contents along the way."
PdfReader is actually java/pdftk/com/lowagie/text/pdf/PdfReader.java in the pdftk source distribution. Yes, this is a C++ program that's instantiating a Java class. As far as I can tell, what's actually going on is that all the Java code is compiled to C++-ABI-compatible .o files using GCJ and pdftk.cc links against them, giving a native program that is nonetheless mostly written in Java. Yikes!
Perhaps unsurprisingly, GCJ didn't get a huge amount of traction, and it has been deleted from the GCC tree entirely. Good riddance, maybe, but it makes it rather difficult to compile pdftk.
>If you need more, the "free" (trade for your email)
Just a note that Google lets you get a new email address that isn't spammable, through security through obscurity:
-> You can use your gmail address, add a + after it, and add a keyword. So if you are jsmith@gmail.com you can give out jsmith+syncfusionpdfsuccintly@gmail.com and then later if that starts getting spammed you can redirect it.
NOTE:
This is an incorrect solution (Google, please fix this) because anyone can run a regex removing the + part.
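The stripping attack really is a one-liner; a sketch (the address is the made-up example from above):

```python
import re

addr = "jsmith+syncfusionpdfsuccintly@gmail.com"

# Remove the '+tag' part to recover the bare, deliverable address
base = re.sub(r"\+[^@]+(?=@)", "", addr)
# base == "jsmith@gmail.com"
```

Any spammer who runs this over a harvested list defeats the tagging entirely.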
Instead the correct solution is that if you have gmail open, in a single click you should be able to generate a high-entropy gmail address (that does not deplete the namespace) and link it on your end with "syncfusionpdfsuccintly".
If I already have Gmail open, it should take 7 seconds to create a new address, as follows:
1. Click something to start the process
2. Type "syncfusionpdfsuccintly" to tag it on my end
3. Click something to copy a resulting high-entropy gmail name into the clipboard.
I should then be able to paste it into a form, get it delivered straight into my inbox (never spam), and redirect it to spam if it starts getting spammed.
This would allow people to contact us without ever getting into spam, while entirely removing their ability to contact us if this email address starts getting spammed. There are no downsides.
I believe Google's engineers are smart enough to move from security through obscurity (relying on the knowledge that no spammer can ever invent and run the exact regex s/\+[.]+@/@/g to remove the security through obscurity, as this would entirely break this security, exposing the underlying "protected" email addresses) to something that works.
Until that day comes you can rely on the security through obscurity to give out a secure email address that can't be spammed. Just add a + and a tag!
Please.
Google: I believe you are smart enough to understand this comment and implement this solution, which can be prototyped in 30 minutes and solves the spam problem forever. You can do it! I believe in you. You're 99.999% there and your security through obscurity works very well for me. I use it.
I hope you will go above and beyond and solve the remaining 0.001%. It would just make me feel better to know that a 13-character regex couldn't defeat your solution.
1. Register foo@gmail.com
2. Give out your email address to friends and family as foo+bar@gmail.com
3. Give out your email address to services as foo+{service name}@gmail.com
4. Reject anything coming directly to foo@gmail.com
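That protocol can be sketched as a simple accept/reject rule (the address, domain, and tags are hypothetical, and Gmail doesn't expose this as a single switch; you'd approximate it with filters):

```python
# Tags you have actually handed out; mail to the bare address accepts nothing
allowed_tags = {"bar", "somestore"}

def accept(to_addr: str) -> bool:
    """Accept only foo+<known tag>@gmail.com; reject everything else."""
    local, _, domain = to_addr.partition("@")
    base, plus, tag = local.partition("+")
    if base != "foo" or domain != "gmail.com":
        return False
    # Step 4 above: bare 'foo@' has no '+', so it's rejected here too
    return plus == "+" and tag in allowed_tags
```

So accept("foo+bar@gmail.com") passes, while accept("foo@gmail.com") and any made-up tag are rejected.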
That is a different kind of obscurity, as if that were my protocol a spammer could make up a keyword and it would be delivered until I realized that it wasn't myself who made it up.
Maybe there is a way to whitelist keywords and only deliver tags I add one at a time via filters, but it is not the usual interface.
I wrote something in node.js that does exactly this. But I have only used it for personal use. I'm honestly surprised that nobody has done this already.
Right now it just silently drops expired addresses. But it is so satisfying to think about bouncing (but stuff like bouncing behavior is something you have to consider when running a mail service).
I was thinking of turning it into a service, but I'd have to read up on how to scale it. Running an SMTP server takes a lot of care. I've found that just using nearlyfreespeech.net's mail forwarding is most reliable to receive emails. So I do that for now, since it is on a small scale.
I just got so frustrated with how this problem has such an obvious technical solution. At least for us users, anyway. It's not a solution for the marketers.
I strongly suspect that Google is very reluctant to do anything to make the email landscape unstable. I think that if Google started offering this, it would shake up so much of their business.
Bouncing email doesn't make much sense if you can reject easily (which would be your case).
Running an inbound SMTP server is much easier than running outbound smtp for laypeople. The software (postfix, exim, etc.) is rock solid (you have to REALLY mess up to lose emails) and the protocol is very forgiving (all serious senders have good retry policies). I encourage you to try!
> Instead the correct solution is that if you have gmail open, in a single click you should be able to generate a high-entropy gmail address (that does not deplete the namespace) and link it on your end with "syncfusionpdfsuccintly".
This is a reasonably priced paid service that does already exist. It's not from Google, and no, I won't name it either publicly or privately, because I've used it for years and I don't want their domain to be banned by those requiring your email address for everything. If you know how to search you will probably find it.
Basically you sign up with them using one valid email address; then, on their interface, you can create as many addresses as you need (IIRC there is a limit, but I used even dozens at a time without problems) and add a keyword to them.
All of those addresses will be redirected to the email you signed with, but the From: field will also contain the keyword you specified so that if you create an address for each service you sign up for, you instantly recognize who is spamming you when they use that address. This is very effective and I filtered out a lot of spammers.
I'm surprised there are no more services like this one around, or probably there are many but they keep a low profile to avoid being banned. That's why I'm not going to name that service, sorry. But it does exist indeed and is technically easy to implement.
You're afraid of naming the service, but if Google implemented my suggestion they could never be blacklisted. (Unless their high-entropy email tags followed some easily identified pattern.)
I'm not asking for the service to "exist". I'm asking Google to take twenty minutes and fix their solution, which already works but is security through obscurity.
> if Google implemented my suggestion they could never be blacklisted
Google seems to have built their brand intentionally to be the opposite of what you're asking for though; and absolutely they could be blacklisted with a simple "GMAIL ADDRESSES NO LONGER ACCEPTED HERE".
>which already works but is security through obscurity.
I'm not sure which one you are saying is security through obscurity here... blah+real.id@gmail.com... or the high-entropy mkKAjgsdf788hf87hf@gmail.com. Both are obscure, but it's a stretch of the imagination to start labelling this a security issue.
> > if Google implemented my suggestion they could never be blacklisted
> Google seems to have built their brand intentionally to be the opposite of what you're asking for though; and absolutely they could be blacklisted with a simple "GMAIL ADDRESSES NO LONGER ACCEPTED HERE".
I think "you can't block GMail" here is meant in the sense that "you can't block the Google crawler". It's certainly technically trivial to do so, but the opportunity cost from lost users will be, for most businesses, unacceptably high.
>I think "you can't block GMail" here is meant in the sense that "you can't block the Google crawler". It's certainly technically trivial to do so, but the opportunity cost from lost users will be, for most businesses, unacceptably high.
Excellent interpretation. Gmail = Google crawler. I've made a note of this now.
What needs to happen next is a deep discussion between yourself and logicallee, in the context of Google crawler as well as how to make gmail come further out of the dark ages with high entropy and no security obscurity.
it's not blah+real.id@gmail.com - it's real.id+blah@gmail.com which currently gets delivered to real.id@gmail.com with a tag of "blah". However this tag can be removed by spammers, hiding where they got my email address.
mkKAjgsdf788hf87hf is not the only possible high-entropy format; it could be of the type that gfycat uses, such as "uncommongrimyladybug". That is quite hard to blacklist.
Nobody is ever going to stop accepting gmail addresses, that suggestion is pretty ridiculous. Especially since I suggest that these addresses should be delivered straight to your real inbox (unless they start getting spammed). There's no reason people should stop accepting them.
The biggest complexity (and security) problem with PDF is that it's also effectively an archive format, in which more or less every display file format conceived of before ~2007 can be embedded.
Yeah pretty much. There's JBIG2, JPEG2000, CCITT Fax and Flash to name a few. Oh and a bunch of TIFF stuff without the wrapper. Some good news though: the PDF-A standards define various archive-safe subsets of PDF for which various verification tools exist.
On the other hand, PDF is probably the only widespread use of formats like JBIG2 and JPEG2000 --- which are rarely encountered as individual files, unlike JPEG, PNG, or GIF.
A lot of the scanned PDF ebooks on archive.org use JPEG2000+JBIG2, and the filesize vs. quality difference compared to more traditional formats like JPEG is quite apparent. They do take a noticeably longer time to render, however...
> They do take a noticeably longer time to render, however...
That's mostly due to a distinct lack of good JPEG2000 decoding libraries. We're building a PDF renderer library, and JPEG2000 is a constant pain in the ass because of it: JPEG decompression is hardware-accelerated on many platforms and also has a bunch of SIMD-optimized libraries, while for JPEG2000 there's practically nothing, and due to the complexity of the format we count decoding times in seconds for some images even on fast mobile phones.
I've been playing around a bit with JPEG2000 (slowly learning about the format, trying to write a decoder for it). Whereas JPEG normally uses Huffman compression for the bitstream, which although not really parallelisable is relatively fast (essentially one table lookup per output value), AFAIK the bottleneck in JPEG2000 decoding is the arithmetic coding, which can't be parallelised either and involves quite a few more operations than Huffman's inner loop.
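To illustrate the "one table lookup per output value" point, here's a toy prefix-code decoder (the code table and bitstring are made up, and real JPEG decoders use length-indexed tables rather than growing a string, but the per-symbol cost is the same idea):

```python
# Hypothetical Huffman code table: prefix-free, so a match is unambiguous
codes = {"0": "a", "10": "b", "11": "c"}

bits = "010110"
out, buf = [], ""
for bit in bits:
    buf += bit
    if buf in codes:          # at most one successful dict hit per symbol
        out.append(codes[buf])
        buf = ""
# out == ["a", "b", "c", "a"]
```

An arithmetic decoder, by contrast, must update interval bounds and renormalize on essentially every bit, which is where JPEG2000 loses its speed.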
If they’re exploitable, how would a new version help? Attackers would just use the older, exploitable versions. And if PDF viewers only allowed the newer version, you’d break support with every PDF made.
Looks like he did it "the hard way" --- and unfortunately it's not a truly valid PDF since the startxref isn't within the last 1KB of the file and the version number in the header is corrupt. Not all PDF readers will accept that.
On the other hand, it is possible to make a completely valid PDF and bootable ISO. The first 32KB of an ISO is officially "unused", which is probably why GRUB decided to put itself there, but that can be relocated somewhere else --- the El Torito boot descriptor will need to be updated to point to it --- and the PDF signature (which can be a valid one) and as many objects as will fit can be put in that area, with the rest anywhere else. The xref table can be moved to the very end and the offsets updated to point to the objects.
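The startxref rule mentioned above is easy to check mechanically; a minimal sketch (the byte strings are fabricated examples, and the 1024-byte window is the conventional tail that readers scan for the keyword):

```python
def startxref_ok(pdf_bytes: bytes) -> bool:
    # Readers search backwards from EOF for the startxref keyword;
    # many give up after the last 1024 bytes.
    return b"startxref" in pdf_bytes[-1024:]

# A conforming file ends with: startxref <offset> %%EOF near EOF
good = b"%PDF-1.4 ... objects ... startxref\n1234\n%%EOF"
# Here startxref is buried too far from the end to be found
bad = b"%PDF-1.4 startxref\n1234\n" + b"x" * 2048 + b"\n%%EOF"
```

With these examples, startxref_ok(good) holds and startxref_ok(bad) fails, which is exactly the kind of breakage the comment above describes.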
I've encountered .pdf files which internally embed a proprietary Adobe extension called XFA[1]. I think they are created using Adobe's LiveCycle product.
They are a real pain because they render fine in Adobe Acrobat, but most other PDF renderers (including browser built-in ones) can't render them. Instead they render a blob of interstitial "loading..." text that is also embedded in the PDF (which the XFA rendering would then overwrite). It was a pain to me personally because I had to figure out a way to do programmatic form-filling of some fillable form XFAs, and most PDF libraries don't work with them (they expect traditional AcroForms fillable forms).
But in reading the XFA specification I found it interesting it had its own JavaScript interpreter (including supporting XHR requests as part of some internet-integrated form-filling feature) and another proprietary scripting language called FormCalc. I guess it opened my eyes to PDFs being a container format and the kinds of things they allow you to embed.
When you want to learn it, I recommend the "Blue Book", aka "PostScript Language Tutorial and Cookbook" By Adobe Systems Incorporated. It's a very thin book, but a great tutorial and example reference. I enjoyed going through it, and still occasionally generate PostScript for visualizations.
Very much worth learning, if for nothing else than being an extremely cool stack language. I learned it for my first job, where I only had Turbo C 2.0, FoxBASE, and an HP4 printer with the PostScript module to do graphical reporting on a dataset.
I remember going through HP and Epson printer manuals, writing down their control escape codes into an xBase table so that our Clipper application could talk to the printers and do the respective formatting.
Having access to a PS printer would have been a much more positive experience.
I used to hand-code Postscript files back when the Apple LaserWriter was launched. I had a little kaleidoscope-like thing that did patterns for Xmas decorations, and once I did a text-to-workflow routine to print out diagrams. It's all gone now (I did part of it on a VAX and part on an SE/30), but it was lots of fun at the time.
When I was actively playing with Sudoku programs, I wrote a bit of code that generated Sudoku images in SVG, (E)PS, and a few other formats. It was a bit fiddly, but not really complicated.
Printing tries to reformat the page to fit some paper size and often removes details such as the background. Often it would be nice to make a PDF that shows a screenshot of the full current page; extensions exist for this, but it's not supported natively.
PNG is an image format which is usually significantly larger than the corresponding PDF document and rendered at fixed quality. PDF is (mostly) a vector format which can be resized at will.
Or just a block of colour which will be very efficient. If there are images they are probably jpegs and can be embedded in the PDF without any loss of quality while still keeping the quality benefits of a vector format for text.
Not just print, but also the ability to natively _manipulate_ PDF, because the Mac still has the display PostScript stack from the NeXT era, and PDF is essentially an envelope for it.
It's underused these days, but still available to apps, and they can interchange data in that format. Linux support for PDF isn't anywhere near as integrated.
> the Mac still has the display Postscript stack from the NeXT era and PDF essentially an envelope for it.
This is a common misconception. Display PostScript was never present in any released version of macOS. It was replaced by the Quartz renderer, which is rather different.
Quartz can display and output to PDFs, but it does not use PDF as an internal format.
Notepad is probably the only program in Windows that I wouldn't call crappy. I think it follows the Unix philosophy of doing only one thing and doing it right. I'm not a Windows user, but it's been useful.
No, Notepad does one thing but it doesn't do it right. It can't open large files without locking up, it still saves UTF-8 files with a BOM, and it can't deal with Unix-style newlines.
That makes sense. It also highlights why Linux/Unix will likely never have the kind of seamless system wide integration I’m talking about - different design choices for the structure of the OS & GUI.
Nothing wrong with those choices (they give the end user more flexibility & control for example) but it is a trade off
> Most PDF files do not look readable in a text editor. Compression, encryption, and embedded images are largely to blame. After removing these three components, one can more easily see that PDF is a human-readable document description language.
Of course, PDF is intentionally so weird: it was a move by Adobe because other companies were getting too good at handling postscript.
Embedding custom compression inside your format is seldom worth it: .ps.gz is usually smaller than pdf.
PDF is literally the worst possible format for document exchange because it has the most unnecessary complexity of all document formats, which makes it the hardest to access. But popularity and merit are two totally different things.
Completely agree; as a dumb user of LaTeX I only care that I can make a document and that it looks the same on every computer, browser, or printout.
That solution just happens to be LaTeX, especially since virtually all computers have some way of viewing and printing PDFs by default.
I really wish there was a better solution to typesetting than LaTeX (well, XeTeX if you’re serious about Unicode & language support I guess, which you should be)
I hate it so much, if it wasn’t for its excellent abilities in specific areas (hyphenation, etc) I’d much prefer CSS.
Yes, TeX makes me prefer CSS for layout. That’s how painful I find it.