PDF is a fabulous format. I mean, it’s an awful format in so many ways, technically speaking, but the net effect of having a self-contained static file in your custody stands in blissful contrast to the user-hostile dynamic/SaaS website that can be taken away at a moment’s notice. PDF/A is the true PDF - it strips out most of the dangerous cruft.
Anyway, if you like weird PDF hijinks, here’s a polyglot PDF/A CSV file that is also its own original soundtrack as a polyglot Amiga soundtracker mod:
For better or worse, during the years I spent working on Preview (and PDFKit) at Apple, I felt bad that our (Apple's) PDF implementation fell far short of Adobe's.
Radars would show up with PDFs attached, "Preview Does Not Display 3D Image in PDF Like Acrobat" or similar. And I would feel so ... inadequate.
PDFKit could render and capture basic annotations ... and that was about it. We could show you forms and allow editing, but if the PDF had JavaScript that would add two fields and put the sum in a third field, I had to shrug and say, "Oh well." The effort of hosting a JavaScript interpreter/runtime was beyond my skillset anyway.
But then I kind of came to see our subset of PDF support as a kind of feature. It's true, we left out the kitchen sink. Adobe was/is clearly interested in putting everything into PDF.
And I mean, as pointed out here, at least you could open a PDF in Preview and not worry about any JavaScript executing. ;-)
If it makes you feel any better, Preview is by far the best PDF viewer and editor (I use it for signatures and adding text) I've ever used. I like that the PDF previews in Finder are instant and accurate. I like that it shows as much PDF and as little UI/menubar as possible. I like that it never asks me to upgrade or log in. The search tools work well. I can stitch PDFs together (if I google how to, always forget) and pull certain pages out as their own files.
For all of the PDFs I've ever encountered, Preview has been sufficient and capable. Thank you for your hard work!
I thought Acrobat had ugly UI — stacks and stacks of toolbars for example (this, BTW, about a decade ago — I haven't launched Reader in some time so can't speak to the current UI).
I met one of the engineers from Adobe and said as much — as politely as I could. He said, yeah, we're modeling our UI on Office.
I saw in an instant that they wanted to be seen as a peer, a co-tool, to the Microsoft suite and it all made sense to me.
Thank you, thank you, THANK YOU for not having put all that cruft in, and by Apple's sheer size, effectively discouraging many from producing and circulating those abominations.
Adobe has an awful track record of security (how many exploits in the past 25 years were in Acrobat (not the PDF spec, the actual Acrobat software) and in Flash?) but PDF is an amazing gift to the world, and, thanks to people like you, effectively safer than how Adobe designed it :))
Unfortunately I have the full Acrobat on my work computer, mandated by my employer, sigh, but that's another story.
When I ordered an official PDF copy of my college diploma, the order form had an option to enable "tracking" in the PDF file. Sure enough, when the recipient opened the PDF file (and when I tried it myself on a different machine), I got a notification from the company that generated the PDF...
PDFs are roughly on par with web pages feature-wise, including JavaScript or other actions that execute on load. Adobe did this, of course, to stave off the competition from the early web. Nowadays, PDF readers disable most of that by default (if they even support it).
No, they are not executable by the OS (generally).
Formats are on a gradient between "completely code" and "completely data", and PDFs are quite close to the "completely code" extreme; I guess this is what the parent meant.
I would expect so simply because browsers are fairly hardened pieces of software. Adobe Acrobat is decently hardened but it seems to be far behind browsers.
It is worth noting that Chromium and later Firefox both added PDF viewers that live inside the browser sandbox. They are essentially web-apps that render the PDF. When I worked at Google they strongly recommended using Chrome for opening PDF files because they felt much more comfortable about its security and sandboxing than other PDF readers.
Another perspective is that you are likely browsing the internet anyway. In fact, you likely got the PDF by visiting a website. So you have already exposed a huge attack surface (your browser) to a possibly hostile adversary. It is better to expose them to the same attack surface again (plus whatever security the PDF reader itself provides) than to give them a fresh new attack surface.
about pdf/a... until recently there was not even an easy way to figure out if a pdf is really pdf/a; now there is (verapdf) and it's a crazy complex piece of software
and maybe I'm wrong but the only way to convert an arbitrary pdf to pdf/a with open source software is to convert it to postscript and back with ghostscript - which is affero licensed... with all the possible problems that entails. (there is an old version that is just gpl, works on most pdfs but is 15 years old or so.)
i needed to deal with pdf/a in a previous job... was not fun.
I will never forgive the pain PDF caused me when I worked on a project to parse millions of PDF files from various sources. Just reconstructing paragraphs was a huge effort, not to mention parsing tables. I think we should do better for something that’s basically a standard. PDF manuals also suck big time.
PDF is supposed to be a printer format, not a word processing document format. While I too would love to nail down a PDF subset to be a standard (for example, requiring the accessibility tags that make text extraction easy), perhaps trying to create a hybrid format, one that satisfies both printers and resizable windows, is already an impossible goal.
(I've always had to keep my love of PDF a secret from fellow nerds. But here's another secret, I like printing documents out from time to time.)
I really appreciate what PDF can accomplish, but I also really dislike that it turns into a black box. There really ought to be something that can describe a document structure and also describe document layout in a durable and portable manner. In the range of XML/JSON <-> HTML+CSS <-> PDF <-> PS <-> RAW, it really does feel like there's something missing between HTML and PDF.
And it can't be LaTeX, because the document shouldn't be a programming language at all. "The document is a program" has proven itself to be a terrible scheme overall.
I wonder a bit if we wouldn't have an easier time extracting data, resizing pages, etc. if we sent HTML files instead of PDF. Are even half of PDFs printed at all?
I'll analyze PNG for comparison. The largest width and height is 2147483647 (2^31 - 1). Using the pHYs chunk (physical pixel dimensions), the lowest density we can specify is 1 pixel per metre. So, 2 billion metres (2 gigametres) is somewhat bigger than the diameter of the sun at 1.39 Gm. https://en.wikipedia.org/wiki/Orders_of_magnitude_(length)#g...
Using the sCAL chunk (physical scale) would allow extremely large dimensions because it uses ASCII floating-point.
> Using the sCAL chunk (physical scale) would allow extremely large dimensions because it uses ASCII floating-point.
AFAIK sCAL is more about the image's subject, not the image itself. A 1:10,000,000 scale world map would be < 10 m wide according to pHYs, but it will be ~40,000 km wide according to sCAL.
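For anyone who wants to check the numbers in these two comments, here is a quick Python sketch of the arithmetic. The constants are the ones quoted above (2^31 - 1 pixels, 1 pixel per metre, a 1.39 Gm Sun, a ~40,000 km world at 1:10,000,000); the variable names are just mine.

```python
# Largest PNG width/height quoted above: 2^31 - 1 pixels.
MAX_PIXELS = 2**31 - 1

# Lowest density expressible in pHYs, per the comment above: 1 pixel per metre.
MIN_PIXELS_PER_METRE = 1

SUN_DIAMETER_M = 1.39e9  # figure quoted in the comment above

max_side_m = MAX_PIXELS / MIN_PIXELS_PER_METRE
print(f"largest pHYs side: {max_side_m / 1e9:.3f} Gm")
print(f"ratio to the Sun's diameter: {max_side_m / SUN_DIAMETER_M:.2f}")

# pHYs describes the image itself; sCAL describes the subject.
# A 1:10,000,000 map of a ~40,000 km wide world is only a few metres of paper.
WORLD_WIDTH_M = 40_000 * 1000
MAP_SCALE = 10_000_000
printed_width_m = WORLD_WIDTH_M / MAP_SCALE
print(f"printed map width: {printed_width_m:.1f} m")
```

So the pHYs route tops out at roughly one and a half Sun diameters, and the example map really is well under 10 m wide on paper.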
Reminds me of this PDF I created more than a decade ago from a PostScript implementation of the game of life. Seems it still works, but causes macOS Preview to crash. https://andrewcutler.net/docs/joke/life.pdf
It doesn't cause Preview to crash on Sonoma. FWIW, I can't see any animation, just the final state, while Firefox's PDF reader does show some animation. Skim has the same behaviour as Preview but doesn't show the grid.
This post reminds me of Umberto Eco's intellectual divertissements. More specifically, this fantastic piece, "On the Impossibility of Drawing a Map of the Empire on a Scale of 1 to 1."
“But unlike Acrobat, the Preview app doesn’t have an upper limit on what we can put in MediaBox. It’s perfectly happy for me to write a width which is a 1 followed by twelve 0s:
Screenshot of Preview’s Document inspector, showing the page size of 352777777777.78 x 10.59 cm.
If you’re curious, that width is approximately the distance between the Earth and the Moon. I’d have to get my ruler to check, but I’m pretty sure that’s larger than Germany.”
The size of every planet in our solar system, put next to each other, can fit in this doc with room to spare
Fun experiment alexwlchan! Two small mistakes in your post: you write "15,000,000,000.00 in" and "that the size of a page is 15 billion inches", but it should be 15 million.
You said you had difficulty formatting text. Here is a "hello world" pdf that just has these two words on a page: copy and paste this text (stripping leading spaces on each line) and save it in a .pdf file. Basically in order to write text you have to define a font (object 5) and then a stream with a Tf command to use the font, a Td command to position the text, and a Tj command to write it.
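The PDF text that comment refers to is not reproduced in this thread, so here is a rough Python sketch of the same idea, not the commenter's actual file. It follows the description above (a font as object 5, a content stream using Tf, Td and Tj) and computes the xref offsets for you. The function name `build_hello_pdf`, the choice of Helvetica at 24 pt, the text position and the US Letter MediaBox are all my assumptions.

```python
def build_hello_pdf(text: str = "Hello World") -> bytes:
    """Assemble a one-page PDF by hand, recording each object's byte offset."""
    # Assumption: `text` is plain ASCII with no parentheses or backslashes,
    # which would need escaping inside a PDF string.
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(stream), stream),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)

    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the free object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_hello_pdf())
```

Generating the file from a script avoids the usual trap with hand-edited PDFs: change one character and every later offset in the xref table is wrong.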
By the spec, yes. Some PDF readers will parse it anyway, some will not. In my experience, depending on the renderer, the xref table can be malformed to varying degrees before things go wrong. Edge's old PDF reader (the one before Acrobat and after PDFium), for example, seemed to tolerate just about anything, falling back to the latest version of objects if the xref table was broken. There are also other mistakes you can make: for example, the xref table requires carriage returns (each entry in the table is supposed to be an exact number of bytes), but some PDF readers will still interpret the xref table even if the carriage returns are missing.
As I understand it, the xref entries don’t require a carriage return, but they require a fixed line length. If you don’t want to use a CR, you can pad with a space.
So CR/LF, space/LF, and space/CR are all valid endings.
> The byte offset in the decoded stream shall be a 10-digit number, padded with leading zeros if necessary, giving the number of bytes from the beginning of the file to the beginning of the object. It shall be separated from the generation number by a single SPACE. The generation number shall be a 5-digit number, also padded with leading zeros if necessary. Following the generation number shall be a single SPACE, the keyword n, and a 2-character end-of-line sequence consisting of one of the following: SP CR, SP LF, or CR LF. Thus, the overall length of the entry shall always be exactly 20 bytes.
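A small Python sketch of the rule quoted above, in case it helps to see the 20 bytes spelled out. The helper name `xref_entry` and its parameters are made up for illustration; the layout is taken from the quoted text only.

```python
VALID_EOLS = (b" \r", b" \n", b"\r\n")  # SP CR, SP LF, CR LF


def xref_entry(offset: int, generation: int = 0, in_use: bool = True,
               eol: bytes = b" \n") -> bytes:
    """Format one cross-reference entry: 10 digits, SP, 5 digits, SP, n/f, 2-byte EOL."""
    if eol not in VALID_EOLS:
        raise ValueError("end-of-line must be SP CR, SP LF or CR LF")
    keyword = b"n" if in_use else b"f"
    entry = b"%010d %05d %s%s" % (offset, generation, keyword, eol)
    if len(entry) != 20:
        # Happens when the offset or generation needs more digits than allowed.
        raise ValueError("entry does not fit in 20 bytes")
    return entry


for eol in VALID_EOLS:
    print(repr(xref_entry(9, eol=eol)), len(xref_entry(9, eol=eol)))
```

A bare LF is rejected here because it would make the entry 19 bytes, which is exactly the case lenient readers tend to accept anyway.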
This is interesting. Never actually saw anything other than CRLF in practice, even inside of PDF files that otherwise were LF-only.
It is required according to the standard. But in practice most PDF viewers don't care. They may complain the PDF is "damaged" or "no valid xref was found", but they will render it perfectly fine.
>If we crank it all the way up to the maximum of UserUnit 75000, Acrobat now reports the size of our page as 15,000,000,000.00 x 15,000,000,000.00 in – 381 km along both sides, matching the original claim. If you’re curious, you can download the PDF.
15 billion inches is 381,000 km.
The original claim is that the limit is 15 million inches.
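The conversion is easy to check with a few lines of Python. This only uses the figures quoted in this thread (UserUnit 75000, 15 million vs. 15 billion inches) plus the definition of an inch as 0.0254 m; the names are mine.

```python
INCH_IN_METRES = 0.0254


def inches_to_km(inches: float) -> float:
    return inches * INCH_IN_METRES / 1000


MAX_USER_UNIT = 75_000      # maximum UserUnit quoted above
LIMIT_INCHES = 15_000_000   # "15 million inches"

# Side length before UserUnit scaling, implied by the two figures above.
base_inches = LIMIT_INCHES / MAX_USER_UNIT

print(f"unscaled side: {base_inches:.0f} in")                        # 200 in
print(f"15 million in: {inches_to_km(15_000_000):,.0f} km")          # 381 km
print(f"15 billion in: {inches_to_km(15_000_000_000):,.0f} km")      # 381,000 km
```

So 381 km only matches 15 million inches; the "15,000,000,000.00" in the post has three zeros too many.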
I'm somewhat confused by its directions however when I look at the map and want to go somewhere. Is the top-part of the map where I'm moving? Or is the top-part North?
Seems it is not North and that is confusing because maps I've seen before have North at the top always.
If I turn 90 degrees, the map turns around. But I thought it was I who turned around.
And if I stop, the map cannot know where I'm going because I'm not going anywhere. So it is almost like I have to start moving before the map can tell me where to turn.
Or if I hold the smart-phone in front of my eyes the top of the map is towards the sky. Am I supposed to look at the map from above?
What are some good tactics on how to use Google-map on your cell-phone?
There are two modes in Google Maps - one shows the map in a fixed rotation (north on top by default, but you can rotate the map with two fingers), the other mode automatically rotates the map based on what direction you're facing. *Facing*, not moving, so you don't actually have to walk for it to determine the direction.
You can switch between the modes by clicking a compass icon
Part of the confusion might be that it's pointing in the direction the phone is facing. Which is kind of obvious, but notably doesn't work if you put your phone in an upright phone holder, as many people do in their car.
I really hate that too. You are in an intersection and the voice says "Drive north for x miles/km". What is wrong with "turn right and drive for x miles/km"? I normally have zero clue which direction north is, especially when I am in a location I have never been before. I ride a bike and have the phone in my pocket and therefore cannot see any arrow that the app might display. I only have the audio to navigate by.
It will do that if it doesn't already know what direction you're travelling, which is usually because you've just activated navigation and you aren't moving yet. Unless I happen to know which direction north is or which way is towards my destination, I'll just pick a random direction and it will adjust the route if I guessed wrong.
> You are in a intersection and the voice says "Drive north for x miles/km".
Does that really happen? I have never experienced it. How do they tell which way is north?
Highway 101 runs through San Jose pretty much due east/west, but because it also runs up to San Francisco, it is officially a north-south highway. So you check your position on the map and you're traveling due east along an east/west road. Is that "north"? (Of course not. It's "south".)
> What are some good tactics on how to use Google-map on your cell-phone?
For navigation?
1. Don't activate navigation. It's broken six ways to Sunday, and burns through battery like there's no tomorrow. Use route preview instead (i.e. the step after searching, but before activating the voice nav proper).
2. Use your fingers to rotate the map so it always faces the same way you're going.
3. If confused, recenter and press the compass so it rotates to have North at the top, and continue from there.
Now FWIW, I use Google Maps when navigating on foot/scooter, or as a pilot in the car. If I were a driver... I'd probably buy TomTom or whatever nav that's not shit.
I have a map of the Universe. Dunno, it keeps expanding ...........................................,............................................................................................................................
Sounds like a print bomb waiting to happen. Last time I had a printer it was next to impossible to cancel a print job on Windows. Back when people had wifi printers that were open or ill-secured, those were fun times.
> it was next to impossible to cancel a print job on Windows
It's still impossible. The only reliable method I've found consists of turning the printer off and then deleting the print job in the queue. Only way to get Windows to actually delete it. Doesn't work unless the printer is sitting right next to me, of course. I have no idea why this is so hard.
Some ~12 years ago, I was debugging POS integration with a receipt printer and accidentally sent garbage postscript to the receipt printer, which printed it out verbatim.
Stopping it was impossible. Power cycling that printer had absolutely no effect. It wrote the unfinished print job to some kind of persistent memory, and by god it was going to finish it.
It went through something like 2 1/2 rolls of receipt paper (yes it dutifully awaited the new rolls and then just continued) and due to the thermal printing process it smelled very odd, and I had quite a few metres of raw Postscript afterwards to decorate a wall with.
About 30 years ago I interviewed to be a summer intern at Microsoft, and one of the interviewers asked a question very similar to this but regarding Excel. This is the kind of topic that never gets old for understanding a person’s curiosity and ability to dissect the potential issues.
So what is the actual limit, if any? I just had a quick look at ISO 32000-2:2020 [1] and think the answer is none, or implementation-dependent if you like. In the file format a media box is a rectangle, a rectangle is an array of four numbers, and a number is either an integer or a real. Numbers are represented as strings, so there is no a priori limit on their range, and there seem to be no requirements on the minimum or maximum range of values an implementation has to support. The appendix only says that IEEE 754 is a commonly used format to represent reals and that this might impose limits.
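To make that concrete, here is a hedged Python sketch: the number in the file is just a run of digits, so whether it "fits" depends on what the reader parses it into. `media_box` is a hypothetical helper of mine, and it deliberately ignores real-world complications such as indirect references or a MediaBox inherited from a parent Pages node.

```python
import re
from decimal import Decimal

MEDIABOX_RE = re.compile(rb"/MediaBox\s*\[\s*([^\]]+?)\s*\]")


def media_box(page_dict: bytes) -> list[Decimal]:
    """Read the four MediaBox numbers as arbitrary-precision decimals."""
    match = MEDIABOX_RE.search(page_dict)
    if match is None:
        raise ValueError("no literal /MediaBox array found")
    numbers = [Decimal(token.decode("ascii")) for token in match.group(1).split()]
    if len(numbers) != 4:
        raise ValueError("a rectangle needs exactly four numbers")
    return numbers


# A width written as a 1 followed by 400 zeros is perfectly good syntax.
huge = b"1" + b"0" * 400
box = media_box(b"<< /Type /Page /MediaBox [0 0 " + huge + b" 300] >>")

print(box[2] == Decimal(10) ** 400)   # exact when kept as a decimal
print(float(huge.decode("ascii")))    # an IEEE 754 double overflows to inf
```

A reader that stores coordinates as doubles (or as 32-bit floats, or fixed point) hits its own ceiling long before the file format objects.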
And of course you can try and produce this pdf using TeX. In this post https://tex.stackexchange.com/a/27482/963 I created a pdf of 15283 pages (lettersize) filled with lorem ipsum text and without the program running out of memory.
On an only slightly related note: is there any good way to check PDFs for malware/executables?
If I'm stuck with an attempt at it, the best I can think of is opening it in a fresh QEMU VM or Docker container with no Internet access, but that's 1) a fair bit of work to check something, and 2) probably not even that secure. Using some CLI tool, like xxx, bat, or ranger, that does some processing to extract the text and looking at just that feels more secure - but I know it really isn't.
What is a simple tool to "clean" PDFs?
An ML tool that does QEMU/docker/no-net to extract the content, turns that into game, and saves a typst/latex template with it would probably be the best possible outcome - but that's a decent (yet potentially very lucrative) task.
For analysis, I’ve used Didier’s tools. If you just want a safe way to open it, upload it to a cloud storage provider which destructively renders the pdf. Box or Google drive should work.
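If you just want a first look before reaching for proper tools, a crude keyword count over the raw bytes is easy to sketch in Python. To be clear about the assumptions: the list of names below is my own pick of PDF keys commonly associated with active content, `count_active_content` is a made-up helper, and the scan misses anything inside compressed object streams or written with #xx-escaped names. A zero count proves nothing about parser exploits.

```python
import re

# PDF names often associated with scripts, auto-run actions or attachments.
SUSPICIOUS_NAMES = [
    b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch", b"/EmbeddedFile",
]


def count_active_content(pdf_bytes: bytes) -> dict[str, int]:
    """Count occurrences of each suspicious name in the uncompressed bytes."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # A name ends at whitespace or a delimiter, so /JS must not match /JSX.
        pattern = re.compile(re.escape(name) + rb"(?=[\s()<>\[\]{}/%]|$)")
        counts[name.decode("ascii")] = len(pattern.findall(pdf_bytes))
    return counts


sample = b"<< /OpenAction << /S /JavaScript /JS (app.alert(1)) >> /JSX 1 >>"
for name, count in count_active_content(sample).items():
    print(f"{name:14} {count}")
```

Anything non-zero is a reason to open the file in a sandbox or re-render it, not a verdict on its own.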
What do you mean by "PDFs with malware/executables"?
If you're talking about embedded active content within them, then a reader application can just ignore/not run it.
If you're talking about a crafted PDF that exploits, let's say, font rendering bugs inside the reader, then it's near impossible. Keep your applications updated.
Quite possibly perhaps that might be true-ish to some extent, I think, but take that with a grain of salt, I'm not an expert, that's just my wild guess :-p
It's pretty ridiculous to peel that off the following qualifier.
Readers have been aggressively attacked for a long time. It's certainly not impossible that some basic demonstration PDF will cause an issue, but it's probably not reasonable to expect it.
Slightly tangential: if you are hacking on PDFs, manually or otherwise, this is an incredibly useful tool: https://pdfcpu.io/ (not the author, just a user)
I cannot let this opportunity go by without quoting On Exactitude in Science by Borges in its entirety
". . . In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province. In time, those Unconscionable Maps no longer satisfied, and the Cartographers Guilds struck a Map of the Empire whose size was that of the Empire, and which coincided point for point with it. The following Generations, who were not so fond of the Study of Cartography as their Forebears had been, saw that that vast map was Useless, and not without some Pitilessness was it, that they delivered it up to the Inclemencies of Sun and Winters. In the Deserts of the West, still today, there are Tattered Ruins of that Map, inhabited by Animals and Beggars; in all the Land there is no other Relic of the Disciplines of Geography."
Or a portion of one of its inspirations: Lewis Carroll's Sylvie and Bruno Concluded
"We actually made a map of the country, on the scale of a mile to the mile!"
"Have you used it much?" I enquired.
"It has never been spread out, yet," said Mein Herr: "the farmers objected: they said it would cover the whole country, and shut out the sunlight ! So we now use the country itself, as its own map, and I assure you it does nearly as well."
He had bought a large map representing the sea,
Without the least vestige of land
And the crew were much pleased when they found it to be
A map they could all understand.
“What’s the good of Mercator’s North Poles and Equators, Tropics, Zones, and Meridian Lines?”
So the Bellman would cry
and the crew would reply
“They are merely conventional signs!
“Other maps are such shapes, with their islands and capes!
But we’ve got our brave Captain to thank
(So the crew would protest) that he’s bought us the best
A perfect and absolute blank!”
Long story short: the original tweet confuses PDF (the file format) with Adobe Acrobat (the PDF reader): the 381 km × 381 km figure is an Acrobat limit, not a PDF limit
I don’t think the tweet is relevant at all and it’s a disservice to this post to feature it that prominently in a summary. A more interesting conclusion is that PDF files can have dimensions larger than the Universe, and an example is provided.
I open the PDF in Google Chrome on a Mac. When I Ctrl+P, the dialog says it's 1 Page. I don't try to print it, but I think it will not consume more than 1 page?
Also, the PDF preview in Chrome simply shows it like a normal PDF, but Preview seems confused (gray background instead of white)?
I just had the exact same reaction! So I opened a random PDF I had lying around, and yes, it's mostly a text format. Some (most) objects are binary data streams, but some are also text data. Likewise, objects may or may not be compressed; obviously, compressed streams are binary data. But the file structure is text, some objects are XML, and you can figure out quite a lot of stuff just by looking at a PDF in a text editor, and it might not even be that long: the single-page PDF I just looked at is just over 1500 lines long, so I can definitely scroll through it manually (although offsets are in bytes, not lines, which makes them not very useful for manual lookup).
I was surprised that the underlying format doesn't implement compression (though I assume objects can be compressed). Perhaps I shouldn't be surprised since I often get text only PDFs with unreasonably large sizes.
While the Germany PDF actually scrolls pretty quickly at 100% zoom (makes one realize just how much text is read in a day), the Universe one is pretty fun, Firefox's PDF reader at 100% zoom obviously doesn't budge the scrollbar at all.
I take offense to that diagram; Germany should refuse to be covered by a PDF that's not in proper DIN format.
In theory, DIN paper sizes go all the way from subatomic to the size of the universe. It seems like A(-39) is barely too small to cover Germany's land mass, but A(-40) should be more than sufficient. That's 882 x 1247 km if I didn't miscalculate.
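The calculation checks out if you take the idealised definition of the A series (A0 has an area of 1 m² and the sides are in ratio 1:√2) and extend it to negative numbers. A short Python sketch under that assumption; real sheet sizes are rounded to whole millimetres, which this ignores, and `a_series_size_m` is just a name I picked.

```python
def a_series_size_m(n: int) -> tuple[float, float]:
    """Idealised width and height in metres of size A(n), for any integer n."""
    width = 2 ** (-0.25 - n / 2)
    height = 2 ** (0.25 - n / 2)
    return width, height


for n in (0, -39, -40):
    width, height = a_series_size_m(n)
    if n == 0:
        print(f"A({n}): {width * 1000:.0f} x {height * 1000:.0f} mm")
    else:
        print(f"A({n}): {width / 1000:.0f} x {height / 1000:.0f} km")
```

That gives about 623 x 882 km for A(-39) and 882 x 1247 km for A(-40), matching the figure above.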
Anyway, if you like weird PDF hijinks, here’s a polyglot PDF/A CSV file that is also its own original soundtrack as a polyglot Amiga soundtracker mod:
https://www.lab6.com/6