HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.
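To make that concrete, here's a minimal sketch of such a self-contained page (the markup is illustrative and the image payload is truncated; none of it is taken from any site in this thread):

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <title>Self-contained page</title>
      <!-- CSS lives in the page itself; no external stylesheet request -->
      <style>
        body { max-width: 40em; margin: 2em auto; font-family: serif; }
      </style>
    </head>
    <body>
      <h1>One file, no sidecar directory</h1>
      <!-- raster image inlined as a base64 data URI (payload truncated for brevity) -->
      <img alt="a photo" src="data:image/png;base64,iVBORw0KGgo...">
      <!-- vector art inlined as SVG markup; again, no extra request -->
      <svg width="100" height="100" viewBox="0 0 100 100" role="img" aria-label="red circle">
        <circle cx="50" cy="50" r="40" fill="red"/>
      </svg>
      <p>Save this one file and it keeps working offline, like a PDF.</p>
    </body>
    </html>

Save-as on a page like this produces a single HTML file with no dependencies.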
* PDFs are files
HTML is files
* PDFs are decentralised
This should be "PDFs can be decentralised". PDFs aren't inherently any more decentralised than any other kind of file, including HTML.
The store is the thing that becomes decentralised, not the content.
* PDFs are page-oriented
HTML can be page-oriented. Simply build your website with pagination. PDFs can also be abused to have hugely long pages. Bad UX can be encapsulated in any medium.
* PDFs used to be large (bla bla bla Javascript weighs a lot)
Nope, PDFs are still objectively larger than the equivalent HTML. PDFs don't have any dynamic interaction; rip all that out of your HTML, produce the HTML of yesteryear, and it will be tiny in comparison to the PDF.
Edit: I'm sorry, the more I think about this the dumber I feel. The web is useful because it's 2-way. I am excited by the web because I can interact with other people. I come to hacker news to engage with thinkers, not to just read a published article from one single author. I want to read ad-hoc opinions and user submitted content. PDF web, really?
When you find a page - inherently a document-oriented term - like an article, blog post, how-to, or project writeup that's interesting or useful, and you want to make sure it's available to you later, what do you do?
Do you save the HTML, CSS, and Javascript, and hope that it works offline? I used to use the "Save page as..." tool back in the early 2000s, but it's become less and less useful, with too many dysfunctional disappointments.
No, I cut out some junk I don't need with the Printliminator [1] bookmarklet, then I do a *print-to-PDF.* This gives me a file. I can save the file, back it up to my NAS, search for it later, keep it with other files from a project where it was useful, and otherwise hang onto it. This is so common, in fact, that it's gone from being an obscure thing you could do with a Postscript-to-PDF converter or (before the adware/Ask toolbar scandal) by installing the CutePDF virtual printer, to a built-in feature. Modern OSes bundle a PDF printer, and print dialogs understand that you want to "Save as PDF". Google Docs and Office 365 editors allow downloading a document as a PDF.
I totally agree that a dynamic, interactive page or a comment section is not compatible with this model of usage. There's a lot of consumption of endless feeds, and a lot of one-time video views that also don't make sense to save as offline files. However, the web for creators, where people write articles that are worth hanging onto, has a definite place for PDFs.
> When you find a page [...] and you want to make sure it's available to you later, what do you do?
Instead of doing a bad and lossy job of archiving the page myself, I notify† our friendly neighbourhood archivists at the Internet Archive of the page; and they then do the best, most lossless job of preserving the page that they're able, given their cumulative experience.
As a side-benefit, they also then take care of keeping the archive they've made around and available online in perpetuity, with no additional marginal effort on my part. The same can't be said for something in my own "private collection."
This may not be well-known, but archive.org can and does remove pages / sites from the archive. Authors can request this, site owners (separate from the authors) can request this. There may be others who can request this.
Just an FYI. If there are critical sites you want copies of, I'd recommend making your own copy. I've lost access to important pages / sites twice before taking this to heart.
There is value in having a personally curated, offline collection of documents. You can search, annotate or otherwise manipulate it to your heart's content, all without having to be connected.
Of course the Internet Archive serves other purposes for which it is (currently) irreplaceable.
Hopefully it really is around a very long time, but the world is unpredictable and things change. It's great to enhance the Internet Archive, but you can bet I'm keeping my local copy too. Just in case.
That's suboptimal as well. The site could come out with a new robots.txt file which is just
    User-agent: *
    Disallow: /
and everything already indexed by the Internet Archive is now inaccessible to you.
I don't think I've ever had such a thing that only appeared as a web page, without being emailed to me. To me, the email is the primary-source document in that arrangement.
This is still not as powerful as my one, simple trick to handle all bookmarks, ever: Print to PDF.
I've been doing it since last century, and I have tens of thousands of PDFs of every single web page I've ever found interesting, sitting right there in a directory on my computer.
——
Including the suggestion that was brought up to use ripgrep to search the PDFs' text content.
Sometimes, if I'm researching a topic, I'll dig up a large number of newspaper articles and want to print them and read them away from the screen while scribbling notes etc., but on a lot of websites banner ads or footers with copyright statements can really mess that up.
I actually dislike HTML per se, but the only two benefits I see for PDFs in the general case are:
- In my experience, it's a little harder and rarer to make PDFs utterly incompatible with different means of viewing them, and it generally requires more overt (if perhaps slightly unintentional, at times) sadism to make that happen.
- PDFs can do some things HTML can't (easily, at least) with document design -- though those things are generally things that would be disallowed in our new "deurbanized" PDF-based web replacement.
Everything else that comes to mind goes the other way, including the fact that the viewing-mechanism incompatibility thing can be even worse with PDFs, even if it's rarer for that to happen at present; and if PDFs became the new standard for the web, I'm pretty sure that relative rarity would evaporate anyway. Let's also not forget that HTML can do some things PDFs can't (as easily, at least).
> Do you save the HTML, CSS, and Javascript, and hope that it works offline? I used to use the "Save page as..." tool back in the early 2000s, but it's become less and less useful, with too many dysfunctional disappointments.
I'm too lazy, so I just tend to use SingleFile these days...
You got nerd sniped by the HTML vs. PDF format thing and missed the entire point of TA:
> Isn’t it a good thing that we enjoy rapid progress? To the extent that we get to enjoy things like YouTube and sandspiel, yes! But to the extent that we want the internet to be a place where we can work and live and think and communicate free of malware, surveillance, dark patterns and the insidious influence of advertising, the answer is, empirically, sadly, no. The web has become ad-corrupted hand-in-hand with growth in technological capability, and the symbiotic relationship between web and browser means they feed on each others’ churn. Ads demand new sources of novelty to put themselves on, so the web expands continually, the specs grow in complexity, the browsers grow in sophistication, the barrier to entry grows ever higher, the vast cost of it all demands more ad revenue to fund it... and thus the perpetual motion machine is complete.
The author does identify a problem, and so you want to focus on that. That's fine. There is the issue of triviality, however.
The problem described is widely felt, and also widely discussed. We already know this stuff to be a problem. For the piece to be worthwhile, then, it should do something that is not present in the other instances where the topic has been raised. It should articulate (or at the very least exhibit, without necessarily articulating) a solution for us. It doesn't. A bad remedy to a genuine problem does not yield a solved problem.
The article is called "Deurbanising the Web", and its thesis is:
- Publish in static file formats.
- Date and hash your work.
- Stop spying on your users.
HN is a discussion forum, not project planning software. Not everything has to "yield a solved problem". Are you really setting the bar at "design a technology stack for replacing HTML/CSS/JS"? That's way, way too high.
You say that its thesis is (in part) to generally publish in static file formats, but that's not quite accurate. The piece specifically touts PDF/A as the best format and makes several arguments against the use of html/css. I agree that they're making a broader point than just "use pdf," but "use pdf" is definitely a large part of it.
Those points can be trivially met with static HTML and something like IPFS, and you can still download HTML for local storage and viewing. You can even print to PDF if you really want to do so. Meanwhile, PDFs also allow dynamic files, don't require dating and hashing, and can be used to spy on users or deliver malware.
EDIT: Oh, yeah, and static file formats doesn't necessarily have to mean static document formatting when viewing -- unless you're using PDFs, which tends to break useful stuff like reflowing for paginated documents (one of the worst things about even simple PDFs).
No, the entire point of the article is to convince people to use PDF/A. Which I find comical, since you have to go out of your way to check whether a PDF is PDF/A compliant. If the web were run by PDFs, there's no reason big corporations would abide by those rules, and it'd be just as messy as HTML is today.
You've also been nerd sniped. TA goes on and on about surveillance capitalism and the attention economy. Weird, for an article that's supposedly convincing engineers of the merits of one file format over another.
Did you read beyond the "How did it come to this?" section? TA goes on and on about web standards and the need for PDF/A.
Edit: If the article _was_ all about surveillance capitalism, then it wouldn't be worth upvoting as actionable solutions are much more valuable than preaching to the choir.
If you don't think it's clear that the author's advocacy of PDF is a means to an end, subservient to their desire to dismantle surveillance capitalism and the duopoly that Google/Apple have on the web, I don't know where to go from here.
It's when you trick a technically-minded person into jumping down a rabbit hole of a technical problem/controversy. Here it's PDF vs. HTML, but other classic nerd snipes are UTF-8 vs. anything else, "fixing" election tech, etc.
But, again, the premise is not that "as a file format, PDF is better than HTML". The premise is: because HTML is two-way, it enables surveillance capitalism and allows bad actors to monopolize the attention economy. The author wrote it thus:
> Sure, you can write good HTML. I won’t argue with that. And if you’re writing good HTML, good for you. But HTML is a dual-use technology, the bad guys are dual-using it an awful lot, and I feel that the stone age still has a part to play in the progression of the information age.
The part where you engage with this is where you write:
> I'm sorry, the more I think about this the dumber I feel. The web is useful because it's 2-way. I am excited by the web because I can interact with other people. I come to hacker news to engage with thinkers, not to just read a published article from one single author. I want to read ad-hoc opinions and user submitted content. PDF web, really?
Which is interesting! Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?
A key here is that it's easier to write good HTML docs than good PDF docs, and much harder to deal with the harmful aspects of PDF docs given present technology.
> Which is interesting! Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?
I don't know about the other person's ideas, but decentralization plus better anonymization and pseudonymization, with always-on strongest-reasonably-possible encryption, seems like the direction to go.
> A key here is that it's easier to write good HTML docs than good PDF docs, and much harder to deal with the harmful aspects of PDF docs given present technology.
Oh, yeah I'm not on the PDF train. That's wild. I'm more of a Markdown or Gemtext advocate, or even LaTeX.
> I don't know about the other person's ideas, but decentralization plus better anonymization and pseudonymization, with always-on strongest-reasonably-possible encryption, seems like the direction to go.
Yeah, projects like IPFS (which you reference above) are working towards this, but JavaScript still works over IPFS. Plus, fingerprinting techniques are pretty bonkers. Most of it comes down to JS and various state you keep on your local machine (cookies, flash cookies, etc.), but I think you need that. How do you maintain a session with a peer without some kind of token/cookie?
> Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?
Yes, it's called Tor. However, legislation is where we should start. Crippling/abandoning an incredibly useful technology which works very well just because it's often used nefariously seems to be a bit of an overreaction.
Until then, stop using social platforms, use an ad blocker, and use a VPN if you really care about "surveillance capitalism".
Saying HTML can be offlineable is like saying C can be provably terminating. There's a subset of programs where that's true, but it's not inherent to the form. A PDF is inherently self-contained; standard web technologies are not. When you open a page and it's a PDF, it gives you certain guarantees; when you open it and it's HTML, you have to do further investigation.
Firstly, C being provably terminating is a problem dealing with the full body of C programs written in the world. The OP is dealing with their own self-published content. That's a different problem: if your analogy held it would need to be limited to proving that a subset of C programs written by the author terminate.
Secondly, the level of difficulty in making HTML offlineable is many orders of magnitude simpler than your C analogy: there's really no comparison. For the OP we only need to make HTML documents that they have authored themselves offlineable and yet people have written general purpose tools to do this automatically for most webpages. This is not a hard problem.
This is a helpful post because it gets to the heart of the difference. Many people are saying "if you do HTML in a particular way, you get the same benefits." I'm asking "what's inherent to the form?" That's exactly the point about C--you can write it in a way that's provably terminating, but it's not guaranteed. Consider the consumer's perspective.
When I land on a page that's a PDF, I know certain things--I can easily save it and read it later. How do I know that? Not because I have read the PDF spec, or know that much about it, but because of my experience as a consumer of the web.
When I land on an arbitrary web-page, do I know the same thing? No. I don't know what the page is doing, I don't know what my browser will do when I try to save the page. When I save this page, I have the option to save HTML only, or a complete web page. Will the complete page actually work? I go into the source, and there's a link to the javascript (which is saved locally). Does rendering the page rely on that javascript? Does that javascript do xhr or fetch calls? Since it's Hacker News, I suspect the answer is no. However that's not inherent to the medium.
There are better ways to archive the content of even dynamic JS heavy pages, but they are not things that you learn as an average user of the web.
It's possible to write PDFs that don't "work" (for some useful definition of "work" similar to the case with HTML) offline. Please stop pretending that's not true.
The reason offline utility tends to be true more often for PDFs is that PDFs are not generally regarded as the preferred online-default format of choice, which is in turn a matter of social effects rather than technical capacity. Reverse the socially accepted roles of the two document formats and watch the same complaints get made against PDFs as you're making against HTML. I'd bet money the "normal" state of affairs would remain the same in terms of the perceived benefit/detriment allocation between online/offline formats; only which format was considered which would have changed.
. . . but then all the web would be even heavier documents, and even less customizable for local viewing, thanks in part to that pagination and strict formatting situation.
It's possible, but it takes work. I can't remember the last time a pdf did something unreadably weird, usually my only gripe is with something that's a scan of an old document but whoever turned it into PDF didn't do OCR.
I don't really follow. How does this author converting their entire site to PDF help readers/visitors/users?
The original HTML site[0] was printable as PDF, and save-able as both HTML and "Web page, complete", all of which result in a well-formatted & readable offline experience. (It was also responsive: very readable on mobile, but that's an aside).
The new PDF site is not accessible to some, difficult to read on mobile, and interacts poorly with all of the norms web users are accustomed to (back navigation, anchors, etc.)
It's the difference between "this thing has X property" (termination or able to save for offline reading) and "this thing _obviously_ has X property, in a way that you can tell without any expertise, or doing any investigation".
How important this is to users, or whether it is worth it is something I've not commented on, but it is a difference.
hyperpage's analogy would work if the property was "avoids undefined behaviour", rather than "avoids nontermination". When we encounter a webpage, we are being expected to execute potentially complex, well-being threatening code whose behaviour is about as easy to predict as obfuscated C.
> When you open the page and it's a PDF, it gives you certain guarantees ….
I think that this is a lot less true than we're used to thinking. The PDF spec contains a lot more interactive capabilities than I think most people realise. (It supports JavaScript!) We're not used to seeing those capabilities abused, because there's no point; it is so much easier to abuse HTML. But, if people want to abuse PDF—and, if we somehow convinced the world to move to it, then they would—then they easily can.
(I'm not conversant enough in the spec to know, but I do know that Postscript is Turing complete, and I don't know that PDF isn't. At least HTML on its own certainly isn't—no recursion!—although all bets go out the window once you start layering other tech on top of it.)
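For the curious, document-level scripts hang off the PDF catalog object. A hand-written fragment (illustrative only, not from any real document) looks roughly like this:

    1 0 obj                       % the document catalog
    << /Type /Catalog
       /Pages 2 0 R
       /OpenAction << /S /JavaScript
                      /JS (app.alert("Hello from a 'static' document");) >>
    >>
    endobj

Any viewer that implements the JavaScript action will run that on open, which is exactly the kind of capability that would get abused if PDF became the default web format.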
I don't buy that the problem with the web is that HTML is not inherently offlineable. HTML may not be inherently offlineable but it can be. PDF isn't inherently a web friendly format, but it can be. There really isn't any good argument for PDFing the web.
Even that usually sucks nowadays, because web developers don't care anymore. Probably 75% of the time before I do that, I have to go into the dev console to delete overlay elements that obscure content and garbage that will waste 10 pages (e.g. grossly oversized images, related article recommendations, etc.).
There was a time when most websites had a print view that gave you a simplified html page that worked well, but I think most of those are gone now. Now it's all some print "media-type" CSS that no one ever put the time in to do properly or keep up to date.
I agree; I don't see why anyone would call publishing in PDF "dumb". The author of the material gets to choose his medium. If "you" don't like it, then move along or convert it to your preferred format. In other words, "why not both?"
> A PDF is inherently self-contained, standard web technologies are not
What technologies exactly?
You can have absolutely everything you need inside the HTML. You can inline CSS, JS, SVG and images. What technologies can't you inline?
You are correct that you CAN - but who does? That's no longer considered best practice. The argument these days is that it's a lot easier to manage CSS if it's in a separate file, same with JS, etc. So none of the serious web developers actually do anything inline anymore. The time it would take to convert a "best practice" website with separate files for HTML, CSS, JS, etc. is just not worth it. The point he's making is still valid - why not have the option for something static?
But with the same (and even greater) success, you can declare "I'm switching to self-contained HTML! No more external resources!" instead of "I'm switching to PDF, saying farewell to interactivity and mobile devices".
It's just the declaration of ONE person, switching ONE site.
> HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.
You're missing the point. Even a relatively computer-illiterate person can easily save a PDF to their hard drive, and it's significantly more difficult with HTML. At a minimum you're probably going to get an HTML file with a sidecar directory (or, I believe, a sometimes browser-specific archive; it's been a long time since I tried, since it works so poorly), and even that may not have the content you want due to dynamic sites.
As I explained, if the author wants to make HTML easily offlineable, then inline CSS and Base64 images. Or, you know, make your website printable. If authors actually thought about the print-to-PDF "problem", it could be solved with traditional CSS and HTML. As someone else said, we used to do this. It used to be part of my everyday web design job to make sure the page printed nicely.
The idea that the whole web is going to pander to edge case archivers is asinine. This whole conversation is about supporting the needs of the very, very few and romanticizing about the time when only interesting people used the internet. It's kind of elitist and self serving.
I guess I don’t really understand the point being made. Does it matter that much that saving a page creates a single file on your hard drive? If you really want a static rendering of a site, why not just print it to a PDF? Why does that have to dictate the file format you use for distribution? With PDFs you don’t have to worry about conversion, but they are also comparatively larger over the wire.
> even that may not have the content you want due to dynamic sites
But PDFs also don’t give you dynamic content. Nothing is stopping people from using HTML to serve static, JS-less content. In fact that’s what it was originally designed to do. All this web app stuff was bolted on afterwards, and it’s optional.
What do we accomplish by having some people switch over to PDFs? The people who don’t care about bloat will continue to not care about it. It’s not like thin content will become more discoverable or more common. It doesn’t really change incentives. The author says using PDFs makes it so you’re not tempted to add cruft to your sites but that’s not really a compelling argument.
Getting content creators to produce content without bloat is not really a technical problem. It’s a cultural and economic one. I don’t see how a file format addresses that.
> Does it matter that much that the artifact of saving a page be a single file in your hard drive?
Yes, it matters a lot. Word/Excel files are actually a zip archive containing many files and sub-directories. Can you imagine people working with exploded Word files, sending over mail and WhatsApp complete directory trees?
The file format restricts the possibilties. You know what to expect when you see a PDF - static, JS-less content. With HTML on the other hand, it depends on what the author decided.
Or I could just make sure that my page prints reasonably well (we used to do this) and use the print-to-pdf functionality available in modern browsers.
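For anyone who never did this, here's a rough sketch of the kind of print stylesheet we used to ship (the selectors are illustrative placeholders, not from any real site):

    <style>
      @media print {
        /* hide navigation, ads and other chrome that wastes paper */
        nav, footer, aside, .banner-ad, .cookie-popup { display: none; }
        /* plain black-on-white serif paginates nicely */
        body { font: 11pt/1.4 serif; color: #000; background: #fff; }
        /* keep link targets legible on paper */
        a[href]::after { content: " (" attr(href) ")"; }
      }
    </style>

With something like that in place, the browser's print-to-PDF output looks like a document instead of a screenshot of an app.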
But if you want a page in PDF, you can print it to PDF. Sure, non-computer-savvy users might not know how to do it off-the-bat, but browsers make it pretty easy.
Oh, I know that. I just meant that if your goal is for the website to be easily archivable, rather than publishing the website as PDF you could use simple HTML which wouldn't suck when printed to PDF.
Say hello to your new sidecar directory (or broken CSS/images/God knows what else)!
I tried to save an NY Times article, and it 1) needed JS to display anything, 2) even with the sidecar stuff was broken, 3) it was so plastered with ads and other junk I thought it was incomplete (it wasn't, I just had to scroll waaay down past something that looked like a footer and some voids after that).
If you save a PDF, you get that exact PDF on your hard drive, and when you open it (even in 10 years) it will look exactly the same as it did on the site.
This is of course the point of the article - that the web is a giant steaming pile of shit for the most part, plagued by JS and external resource requirements, all of which contribute to massive total page size.
I'll preface by saying I have some expertise in HTML, but none in PDF (the format).
Most commenters who suggest that HTML is still a better alternative to PDF (I agree) are assuming that, if this is an important issue to you, you would craft your page in a simpler style than most of what we see on the web, making Print to PDF or Save As... more viable.
> PDFs and a PDF tool ecosystem exist today. No need for another ghost town GitHub repo with a promising README and v0.1 in progress.
This is news to me. I'm not sure that I buy it. PDFs have always been a pain in the ass to work with in my opinion. Maybe there are tools, but in my experience they aren't very good.
In general, we know that HTML is going to be much more compact (and compressible!) than PDF, and that's the biggest advantage I see on a web where bandwidth still matters. Another downside shows itself when trying to copy and paste the above quote: PDF formatting seems to be weird.
> In general, we know that HTML is going to be much more compact (and compressible!) than PDF, and that's the biggest advantage I see on a web where bandwidth still matters.
PDFs can be tiny if they do not embed fonts. Serving fonts is very much a complex technology in HTML world.
Browsing the web is a pain in the ass if you don't use a browser compliant with up-to-date standards, but the whole "HTML can be lightweight" argument pretty much depends on avoiding much of today's standardisation. As an objection to the original argument, it is not comparing like with like.
> This is news to me. I'm not sure that I buy it. PDFs have always been a pain in the ass to work with in my opinion. Maybe there are tools, but in my experience they aren't very good.
> In general, we know that HTML is going to be much more compact (and compressible!) than PDF, and that's the biggest advantage I see on a web where bandwidth still matters. Another downside shows itself when trying to copy and paste the above quote: PDF formatting seems to be weird.
PDF is a display format. I once worked on a project parallel to a guy who was parsing PDF to extract text content. IIRC, text in PDFs is stored in a way that works fine for printing/rendering but not so well for manipulation (e.g. it's a bunch of commands to render line Z at position X,Y with font W). Those commands don't have to be in reading order, nor do they have the semantic meaning you can get from markup like HTML (e.g. superscript can be nothing more than a different line rendered with a smaller font).
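To illustrate, here is roughly what the text-drawing operators inside a PDF content stream look like (a hand-written sketch, not an excerpt from a real file):

    BT                    % begin text object
    /F1 12 Tf             % select font resource F1 at 12 pt
    72 700 Td             % move to x=72, y=700, measured from the page's bottom-left
    (Hello, wor) Tj       % paint a run of glyphs...
    (ld) Tj               % ...which can be split anywhere, with no semantic hint
    ET                    % end text object

An extractor has to reassemble reading order and structure from coordinates like these, which is why PDF-to-text is so fragile.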
IMHO, PDF is actually less optimal than HTML for what this guy is advocating, except that it's precisely those limitations that have prevented PDF from becoming the mess that Web HTML has. Though that's probably in large part because the bloaters have been too distracted by the easier target that is HTML to bother.
I actually did this pretty recently, in an attempt to get some magazine articles onto my Kobo e-book reader since Pocket couldn’t fetch the paywalled ones (I do pay).
I figured I could just save the page, automate a few edits to get around dynamic stuff, and then use it as, you know, an HTML document.
Even with a nice friendly mostly-text literary magazine, after about five hours I gave up and just copy-pasted the rendered text.
HN is not a good site to illustrate the unpleasantnesses of navigating the modern web. As you'd hope for a hacker news site, it is very friendly to this sort of thing. Most sites aren't.
> You're missing the point. Even a relatively computer-illiterate person can easily save a PDF to their hard drive, and it's significantly more difficult with HTML. At a minimum you're probably going to get an HTML file with a sidecar directory (or, I believe, a sometimes browser-specific archive; it's been a long time since I tried, since it works so poorly), and even that may not have the content you want due to dynamic sites.
Ctrl+P -> Save as PDF
You don't need the page to be a PDF to save it as a PDF.
The guy outlines his whole case based on those exact points which are, as you have observed, technical quibbles and not a basis for abandoning HTML.
Under the hood it seems apparent to me that the real premise is an emotional one, not a technical one.
The internet is plastic not because of HTML, but because of money and people. When you have teens driving content it's going to feel plastic. When Walmart uses the internet to sell you crap it's gonna be plastic. Gossip / social platforms are trash, no matter the medium.
It could be argued that TV is an incredible learning platform ruined by HD. Back in the standard definition days we had proper news, documentaries that were substantial, and no reality TV. We need to go back to black and white standard definition.
Sorry, but the PDF web is not a solution to societal rot.
> The guy outlines his whole case based on those exact points which are, as you have observed, technical quibbles and not a basis for abandoning HTML.
His case is actually more of a social observation: it doesn't matter what the technology can do; what matters is how the developers of that technology actually use it.
People who use PDF almost never use 3D graphics and heavy dynamic JS, so PDFs almost always have many of the qualities he's seeking.
Web developers almost never inline anything, and do all kinds of things that are arguably deal-breakers except for a few lowest-common-denominator use cases.
> Under the hood it seems apparent to me that the real premise is an emotional one, not a technical one.
The premise is that the web has failed in important and clear ways, that it's impossible to fix so we should give up, that many use cases should abandon it for something else, and that PDFs are unexpectedly well suited for that.
On a related note, part of me wishes Java Applets never died. Getting rid of them seems to have caused the Web to turn into them, and maybe if they'd remained some kind of separation could have been maintained.
Turning PDFs into the replacement for HTML would change the incentives around PDF authoring, and PDFs would then acquire the same problems identified with HTML.
The solution to the identified problems is not to switch to PDFs. Stop reshuffling the chairs on the deck of your sinking ship, and start figuring out how to design, implement, and incentivize the use of, some means of conveyance other than iceberg-vulnerable ships.
> On a related note, part of me wishes Java Applets never died. Getting rid of them seems to have caused the Web to turn into them, and maybe if they'd remained some kind of separation could have been maintained.
Not so surprising, really: the PDF standard evolved in parallel with Adobe's Flash between 2005 and 2010, which was then the key technology in Adobe's effort to keep a strategic toehold on the web. If Flash had not been a security clusterfuck, it might still be around. The PDF standard was always meant to be a complementary standard, and Adobe's attempted successor technologies have followed an even closer technological path.
The PDF standard has benefited from the fact that, unlike the W3C and WHATWG, surveillance capitalists have not been in the driving seat of its standardisation effort. Adobe's interests are not identical to those of the public, but they are not as essentially adversarial to them as the web standards bodies have been.
I'm not exactly sure what point you're trying to make here, but I don't think two different formats for encoding formatted text with images constitute different "mediums".
Of course they are, and we run into it constantly in computing. You can encode text with images as a bitmap, as vector graphics, as symbolic content that references bitmaps or vectors, as an algorithm that procedurally generates any of the above...
While you can produce identical outputs from the different methods, it's not hair-splitting to say that the authoring process and hence the nature of the medium to shape expression is affected by choosing one. When you opt towards maximizing generality your production cycle can grow without bound because everything is possible by layering different media, even if all of it is unnecessary. That's how you end up with creative projects that take multiple years to decades to accomplish.
Well, you seem to get the gist of the hot take the author put out. This article is not about PDFs. There is something wrong with the world and we can sense it.
This is close to it:
When you have teens driving content it's going to feel plastic.
Youth is the ultimate quality destroyer. They just fucking suck. I’m quite sick of their drivel honestly, and yet, we let them dictate the world (watch my childish cartoons, even in old age).
And the little shits complicate code bases. All you little rascals under 30, scram, I’m on to you.
And all you little adults acting like children, with your stupid motivational posts on LinkedIn, and your garbage bragging on there, I see you too.
Unless I'm on a paper-sized tablet I would definitely rather have an offline HTML file than a PDF. Nobody likes to pan back and forth on lines of text to read something.
I had the exact opposite reaction. I’m reading this on an iPhone SE2020, and I MUCH appreciate reading this in pdf form. I didn’t have to pan back and forth or even put the phone in landscape orientation. This is one of the smallest smartphones you can still buy, and the experience of PDF is WAY better than the user-hostile auto-flow text forced down mobile users’ throats.
I was skeptical at first, but I think the author made the point fantastically well.
Your browser has a zoom functionality that lets you make the text smaller, essentially replicating the PDF site above. Only the opposite of what you say is correct: I can’t read that PDF’s text without turning my phone to landscape and picking up my glasses.
To get equally small text on my desktop I have to turn the font size all the way down to 7. God forbid you have readers with less than stellar eyesight.
I get what they're going for but the PDF is not exactly an accessible reading experience.
I’m commenting here as a user reading a PDF. The fact that someone else could have laid it out differently doesn’t change the fixed layout of the PDF that I’m trying to read.
There’s a reason responsive design has been a big deal for the last 10+ years and I don’t think the benefits of PDF are worth throwing it out.
> These all seem like technical quibbles that miss the point.
If these all "miss the point", what is the point?
It seems to me that the article's point is that PDF as a format has attributes that satisfy the author's goal, whereas HTML does not. The parent comment says that HTML does have those attributes after all (if you choose to use HTML that way). That is very directly addressing the article's point, as I understand it.
Perhaps I misunderstood, but I believe the author's point was to highlight what a steaming mess the modern web is. The PDF aspect strikes me as illustrating a point, not a seriously proposed solution.
Yes, and many of these things are "in general" not well supported by anything but Adobe's own PDF software.
Even the most simple interactive things can easily fail to work correctly, even in the more widespread PDF readers.
IMHO PDF is in many ways worse than HTML; it's just that those ways are less commonly used. But if you start a PDF-instead-of-HTML trend, it's just a matter of time until these "not so compatible" aspects of PDF become widely used by some people.
Just caveating a technical statement I knew wasn't quite true, not making any sort of assessment either way.
As someone who has had to extract data from large sets of PDFs and modern web presentation formats, I'm not a fan of either, really. Even verifying that a visibly presented string exists in a PDF document programmatically can be a non-trivial task, as with a given website as well. That to me says a lot.
monkeynotes seems to take the line that technical defects in claims others make fatally undermine their case, but technical defects in his/her arguments are irrelevancies.
For what it's worth, the same objection occurred to me. The use of scripting I've seen in PDFs has been use-supporting and consistent with their book-like feel.
Also - how are PDFs exactly "discoverable"? I have petabytes of PDFs and making them easily "discoverable" for any mass use, such as analytics, search, or data analysis is a massive pain. I'd rather have them in a non-PDF format.
Not a single researcher or data analyst I know of would prefer "discoverable" content to be in PDF format, regardless of just how awesome the OCR is (which it often isn't, especially for tabular data). Even for all-text, non-tabular documents, OCR does not provide the metadata needed to make sense of the documents. Why PDF is claimed to have superior "discoverability" in the OP essay is a mystery to me. For the sake of "discoverability", PDF is definitely not the way to go.
Honestly, if you're going to put out a manifesto as a PDF, at least take some time "layouting" your design. The one advantage of that format is that you control the aspect ratio. Every font is permissible, everything is absolutely positioned. Using a generator to create it is cringey. Show the art that's possible. Really sell the format.
FWIW I deliver PDFs daily as an art director; not ideal, but they work in most cases. There's certainly nothing rebellious or non-commercial about them.
> HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.
I built a tool for this exact purpose[0] since the HTML specification and modern browsers have a lot of nice features for creating and reading documents compared to PDF (reflow and responsive page scaling, accessibility, easily sharable, a lot of styling options that are easy to use, ability for the user to easily modify the document or change the style, integration with existing web technologies, etc.). In general I would rather read an HTML document than the PDF document since I like to modify the styling in various ways (dark theme extensions in the browser for example) which may be hard to do with a PDF, but its more of a personal preference. Some people will prefer that the document adjusts to the screen size of the device (many HTML pages), and others will prefer the exact same or similar rendering regardless of the screen size (PDF).
Either way, it's kind of a fun idea, making a website using just PDFs. Not the most practical choice, but fun nonetheless.
This reminds me of the guy who said Dropbox was stupid because he could set up an FTP server. It's the exact same argument.
People understand PDFs, they are extremely common in the academic and business world as “digital paper” standalone documents. Hypothetically, anything in memory can be made into a file but in this scenario what matters is the practical goal of people actually using these files.
I think it makes sense for the web to be made up of discrete primitives, not only so that the web can be browsed in an intuitive and frictionless way, but also because it lends itself to being backed up and easily re-hosted.
This. Also who hates the huge double margins? The slow rendering? The unnatural break-up of text? Meaningless headers and footers? And the whole page-based layout? PDF is not meant for the web. Period.
All true. Incidentally, I do not see pagination as necessary or in most cases even desirable; rather, I see it as a vestige of printing technology, even as the need for printing has shrunk dramatically over the past 20 years.
What I like best about pdf files is that I can just give them to someone and be almost certain that any questions will be about the content rather than the format of the file.
Sure - if the publisher cares. From the user's standpoint, the safe assumption is that they don't. Of course PDF is No Good for many contexts, but for any sort of long-form document that is primarily meant to be read, it's so often better.
Also, if something is available in pdf, I can be moderately sure that someone else took the time to make sure it would be formatted correctly and print out OK.* If it only exists in HTML it's more of a roulette wheel experience.
* Unless some graphic designer thought 'gee this report would look so cool if the cover pages were black or some other highly saturated block of solid color.'
HTML used to be a very nice format in the age of XHTML 1.1: very formally specified, and a tie with the DOM was assured by the very strictly standardised DOM v3. And ACID3 gave you pixel-for-pixel repeatability during rendering.
HTML+JS today... now it's effectively a standard in name only, and Chrome is the new IE6. The standard is now "what has worked in the last stable release"
Now go to http://acid3.acidtests.org/ and see how the latest stable Chrome release can't render a decade-old CSS testcase.
> Base64 your images […], put your CSS in the HTML page
Is there a tool that does those two things (or at least the first one) and that can be used by non-programmers (command line use is fine, a Python library would not be)?
"I come to hacker news to engage with thinkers, not just read a published article from a single author."
And how many websites today are anything like HN in terms of relative simplicity, e.g., no images^1, 3rd-party requests or ads, and only a tiny bit of (gratuitous)^2 JS?
1. I do not participate in the voting scheme but I could vote from the command line if I wanted to. I use a text-only browser so the grey, fading text gimmick is irrelevant. I see all comments and treat them according to the thinking, not the voting.
2. If we exclude the .ico and a .gif
There seems to be a double-standard, for lack of a better term, where many HN commenters and voters appear to work for companies that make websites with tracking and ads and various gimmicks targeted at "non-thinkers" which are nothing at all like HN. Whatever these commenters and voters see and appreciate in HN they are not working to bring it to the rest of the web. I seriously doubt they comment and vote on HN out of fear of so-called "power users" or a belief that the HN type of simplicity could become more popular and threaten their jobs that depend on surveillance, online ads and a non-thinking audience of "powerless" users. Rather, a more rational explanation might be that they see some value in a website that shows no ads and generally uses no gimmicks; that's something to think about.
"PDF web" may not make sense to many folks who have invested heavily in JS and Big Tech web browsers, but Postscript is arguably more elegant than Javascript. "Thinkers" usually like FORTH.
In a sea of cynicism, I gotta say.. bravo. This genuinely put a smile on my face. It has a lot of problems, sure, but it's a creative use of the Web and would surely work for some use cases. It's certainly no worse than using Flash ever was.
It reminds me a bit of a "newsletter" I'm subscribed to called, ironically, "Not a Newsletter" (http://notanewsletter.com/). You get an email from the author each month and it just points to a Google Doc where he puts the actual content. Why's this good? The content can't set off any spam filters, and he can edit the issue after it's "sent" if there are mistakes or broken links...
The content can be censored arbitrarily by Google, and when you click on the mobile web with the Docs app installed, it logs your logged-in Google account identity (maybe your work one?) with the view when it switches to the app.
If the author was concerned about getting censored by Google or feeding their data empire, they could set up a self-hosted Google Docs alternative, like NextCloud.
The readers would still need to trust the author's not doing anything nefarious with their IP addresses, but I guess there's a degree of implicit trust when subscribing to a newsletter.
I would just put it on my own server. Are people really worried about clicking a private link and having their IP address logged? Just opening an email with a tracking pixel triggers that already, and you have to assume clicking a link will log your IP whether with Google or Constant Contact or any other mass email provider.
Google Docs are still files. It's just up to the author (or even the readers) to keep copies outside of Google's servers. Unless Lab6 owns their own servers, whoever is hosting these pdfs can delete them as well. At least, in both cases, static files are much easier to backup and copy than entire three-tier dynamic applications. And readers can keep their own copies separate from the original, which isn't possible with an application at all.
Yup. Another way to say it is Google will release a file format the day offline computing drops dead. It should probably amount to an antitrust case or at least a major class action claim at this point. That said, even with PDF specs it's freakin impossible to read/write that format in an intelligible way, if the person creating the document used even the barest amount of block alignment. Adobe started with an innovative notion about layout, but ended up making content extremely hard to parse, and actually tried to open source the engine. Google started with an idea of trapping everyone's data in a format they'd never make fully available, and then charging for the privilege of storing it.
My eyes are not very good. I have trouble reading the font in the PDF. I am using Firefox. HTML lets me pick a font that I can read easily. I cannot do that with PDF.
> PDFs used to be unreadable on small screens, but now you can reflow them.
I am using Firefox. I cannot do that.
Realistically, how many years will I have to wait until Firefox catches up?
Over twenty years ago, I learnt Web authoring by examining the source, which had a profound effect on my career. That serendipitous opportunity I had with human-readable sources will be lost to the next generation with PDF - they have to learn the technology deliberately.
My understanding is that PDF is a monster of a document format, and it's clearly not (usually and historically) meant to be reflowed. Even copy/pasting from PDFs can be very disconcerting because the viewer may not have a good idea of where blocks of text start and end (or even what the characters really are).
I can empathize with the feeling that the web is incredibly bloated, but that's IMO throwing the baby out with the bathwater. Simple HTML with some optional CSS would do the job much better IMO (and can be easily downloaded, mirrored or offlined with tools like wget).
And if you really don't like writing HTML (I won't blame you) then there's always formats like markdown, org-mode and friends which can easily be converted to pretty much anything.
Dealing with PDFs (as in, coding a system that can import/export/display them) is more obnoxious than dealing with excel spreadsheets.
Unless your system is a PDF library (as in, you make the black-box dependency that other systems use to handle PDF exports), everything you do with PDFs will be through some annoying black-box dependency that is a pain to use.
Even relatively complex HTML is much more fun to work with than PDF.
Brief investigation suggests reflow is a super-clumsy, ultra-coarse-grained view mode that is implemented by few clients, is not easy to access, is not well known, and is vastly inferior to what you can get on the web, especially as it’s basically text-only.
In Adobe Acrobat (and I’m guessing Adobe Reader): Choose View → Zoom → Reflow, and it turns everything into one column of nigh-unformatted text.
(Word looks like it may support it, but that could be more that it’s converted it to a Word document in some way and reflow-like functionality falls out of that naturally, though I imagine the tagging would help with the conversion; and someone in this thread mentions something called “Book Reader” supporting it.)
Source code for websites hasn't been readable for years. Reading a minified JS document that has mauled the DOM is only slightly more readable than the structure of a PDF.
TBH it's a little bit like complaining you can't open a modern binary executable in a hex editor and learn programming from that. Days of doing your regular coding by writing direct machine code or assembly are (mostly) gone, and for the sake of advancing the craft, I'm (mostly) happy with it.
But I too wish the modern web was simpler. It took an evolutionary path of maintaining just enough backwards compatibility to only keep making things worse. Efforts like Gemini[1] bring some hope but I'm afraid the medium won't be flexible enough for much beyond personal blogs. But maybe that's for the better.
Gemini is as "terminal-only" as Markdown. Just because it's a text format first and foremost, does not mean that you can't display it nicely formatted. It's more like EPUB in that regard.
Gemini sites are not terminal-only and the renderer can make it look beautiful (depending upon one's definition of beautiful). One example is Lagrange:
I read this entire document. If you've ever had to write a PDF-to-text parser - and God help you, I have - you will beg for Flash to come back as a web standard.
[edit] Generally though, I'm sympathetic with your point and it's kind of like why zines regained popularity in the 90s (and samizdat in the Soviet Union before that)... controlling your own publishing is a powerful idea. Anyone can do that though, without resorting to obscure formats, unless obfuscation is the point.
Yeah, 10 second load time, tiny text on a mobile device. No thanks. Sucks that people went for over-styling every site making everything painful to publish. I’d be happy with 90’s static HTML, and a few images when needed. I seek information, not “an experience”.
I had no idea what the content of the site was (besides the title from HN) and around the 50% download point, I had already lost interest. I'm clearly not the only one who loses interest this quick [0][1][2].
Also, as others have mentioned in root level comments, the design & layout of the content within is also severely lacking, which makes waiting for the load to occur even less worth it.
Exactly this. It is, by the way, one of the main reasons I initially stuck with HN. The lean UI, text-based simplicity, efficiently conveying information had me instantly. I would sacrifice styling for speed anytime, everywhere.
On the contrary, I much prefer a small text on a mobile device to the reflowed text on a mobile device that we’re always forced to use. The PDF is also the same view as on a desktop, so if I look at it on another device, my spatial memory of where stuff is remains intact.
Might as well just generate a PNG. The text is too small for me on a mobile device. PDF's main goal was print. The fonts are awful for the screen and there's no ability to reflow the text.
I can deal with things moving around, I don't need spatial memory for that. Just give good titles, headers, and indexes. Again, we can do this with simple HTML, embed images and styles. It's all there.
Unfortunately, as I mentioned, people don't really publish information anymore. It's mainly for "experience" and for "looks". Marketing, and advertising, now drive the information era. The "Information Super Highway" is now just a crumbling road plastered with billboards. Most content is useless, and is there for clicks. Heck, I'd rather someone post their site in digests in e-book formats than PDF.
I just ran your PDF through an accessibility checker and it failed magnificently. For this reason alone, suggesting people make more use of PDFs instead of well-formatted HTML is a total non-starter for me (and should be for everyone).
Heck, even PDFs produced by Word (or comparable FOSS editors) are so much better (except if you've done it incorrectly by "printing" it) than this particular one.
I find it quite amusing that the author is railing against HTML at least in part because it's practically impossible to build a new web browser at this point, and then moves to PDF instead.
In my time working with PDFs, I've found that generating them in ways that can be read with the most popular PDF readers is cryptic and difficult, and even parsing the ones made from the most popular creators is hard.
I would definitely not pick PDF over HTML in regards to how easy it is to implement a good reader or writer.
And there's plenty of authoring tools for HTML already, so the "ecosystem already exists for PDF" doesn't track either.
Even the complaint about churn makes no sense to me, because there's no need to upgrade your tools constantly. If you're using something that produces good HTML today, it'll produce good HTML in a decade, too.
OTOH, if you have a problem that could be automated, you're a lot more likely to be able to create that tool for HTML than PDF, and it's quite likely that someone else already has for HTML, but not PDF.
> In my time working with PDFs, I've found that generating them in ways that can be read with the most popular PDF readers is cryptic and difficult, and even parsing the ones made from the most popular creators is hard.
Both pdf readers on my phone can't read the pdf, so this is definitely an issue.
As someone who works with PDFs a lot, please don't. PDFs are awful in every case except those which require a very precise visual layout. From reading the article, I do not see a single case in which PDF is superior to vanilla HTML.
My kids school used to send links to google docs for their announcements, I hated it. I pretty much hate any system like that, it's purely extra steps on the web.
In both email, and the browser I'm already in a program that displays text and images and cool stuff. So then I'm just sent a link to someplace else that does the same thing?
So then what? Is it all just "pdf can do that too", but with extra steps...? I can print to PDF in most browsers if I want, but in this case it isn't a choice.
The idea that I might save and store the school emails or that website and somehow manage those files seems kinda self-important in a way... I don't mean that as a personal attack, just that they imagine me taking the time to do that with their content? When otherwise it could have just been an accessible web page? How many people care to do that?
If I'm visiting a website I'm almost certainly not interested in saving your content / managing it... almost never.
I'm a little lost on the whole 'page-oriented' idea too. That's just a limitation of paper, and it's a pain / disruptive more often than not. Even the 'page oriented' section is broken up by the page and some extra text at the bottom of the page that is irrelevant to the paragraph...
If folks want a 'save to pdf' option might be nice to add, or the user can just print to pdf...
I certainly get the argument, but using something like hugo or gatsby or jekyll when you want to avoid the "churn" also seems like a perfectly valid solution.
The author addresses this pretty well. Because you can embed whatever you want, static site generators aren't really static. In particular, Jekyll blogs and what not still pretty commonly include comment sections.
Of course, pdfs aren't necessarily static, either, but that is why Lab6 is choosing to use pdf/a, an actually static format intended specifically for long-term archiving of immutable files. This way you can sign the file and guarantee it stays the same forever and everyone's copy is identical.
I'm kind of surprised at the response to this. The author seems well aware of how terrible pdf is as a format, and this isn't some treatise on why we should want to use it. It's an unfortunate compromise: given the requirements they're aiming to meet (generating a file that supports rich formatting and hyperlink embedding, but which can guarantee immutability and long-term archiving directly in the spec), pdf/a is all there is. So in spite of being a terrible format with a lot of shortcomings, it's what they're using.
Why don't they just use a static subset of HTML? You don't have to include comments sections, just like you don't have to include 3D CAD models and videos in your PDFs (yes you can do both of those, in theory anyway).
> The author addresses this pretty well. Because you can embed whatever you want, static site generators aren't really static. In particular, Jekyll blogs and what not still pretty commonly include comment sections.
But just like you can choose to use PDF/A, you can also choose to have a completely static and self-contained (e.g. using data URLs for images) HTML page.
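For instance, a minimal sketch of such a page; everything it needs travels inside the one file (the embedded image here is a 1x1 transparent GIF, included purely as an illustrative placeholder):

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <title>Static, self-contained article</title>
      <!-- CSS inlined: no external stylesheet request -->
      <style>body { max-width: 40em; margin: 0 auto; font-family: serif; }</style>
    </head>
    <body>
      <h1>An article</h1>
      <p>Save this file and it keeps working offline, unchanged.</p>
      <!-- Image embedded as a data URL instead of fetched from a server -->
      <img alt="placeholder pixel"
           src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7">
    </body>
    </html>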
Nobody is requiring you to use PDF/A. No mainline browser (that I'm aware of) requires it.
So what is being solved? When I click on a PDF on the web, I don't know if it's using PDF/A, I don't know if it's embedding or linking its fonts. So it's the same situation, nothing has changed.
Telling people to use PDF/A when most clients do not enforce it and when there's no indication to users before they click on a link whether or not the link is following the spec -- it is exactly the same as telling them to use a subset of HTML; the author is doing the same thing they complain about.
You can't just say that PDF/A exists. That's not enough, how will you get people to restrict themselves to that format when 99% of their users will never notice the difference and no client is enforcing it?
The only thing I like about PDF compared to HTML is that with PDF, I know for a fact that no web requests are made in the background. That means no fingerprinting, no analytics etc.
With HTML, I have to trust that some random entity does what they state in their privacy policy, and they regularly don't. Sure, I can disable JS, but then 95% of the web doesn't work anymore.
Other than that, PDF is quite clearly a less accessible format.
That's not the PDF spec, is it? That is a spec for Adobe Acrobat, which is not allowed to make any web requests thanks to my application firewall (Little Snitch).
Pretty sure a PDF opened in the browser can't run any JS, but not completely sure. So you're right: I don't really know it for a fact. Poor choice of words.
The spec is ISO 32000, and it’s expensive and closed, so difficult to reference. But according to Wikipedia at least, JavaScript is normative in it. No idea if SOAP / Web Services is part of it though.
Are you sure? I was under the impression that PDFs can reference web resources, and this is why there are more stringent standards for archiving (PDF/A and friends)
> With HTML, I have to trust that some random entity does what they state in their privacy policy, and they regularly don't. Sure, I can disable JS, but then 95% of the web doesn't work anymore.
If you only allow PDF, then 99.9999% of the web doesn't work anymore.
I'm all for getting sites to be static, but PDF doesn't fix that because the problem has never been the technology used to build the site.
When I click a link you mean? Definitely true, but that way they only have access to my IP and user agent, which is still better than all the WebGL, Font library, display calibration settings, mouse movement etc. that they use otherwise.
I often use Tor, although I'm pretty sure that even then, a good analytics lib can see it's me based on scroll behaviour, mouse movement, time of day, and of course what I browse.
Very surprised to see just a few comments mentioning EPUB, which IMO is much more suitable for a document-centric approach. It's an open standard with a freely available[1] specification, and I've never had any problems with EPUBs on PCs, tablets, and phones.
Not only simple browser plugins per the other reply (and a plethora of non-crashing mobile apps, whereas mobile PDF reading apps crash on me all the time) - the ePub format is just a zip file in disguise with plain text (HTML) inside and maybe some images/etc.
In a manner of speaking, ePub as a design has an inherent built-in fallback mechanism to manually obtain the internal content in case of failure - including ability to try and repair a broken zip format (zip -F/-FF) and grep it in place (zipgrep).
I also enjoyed the sentiment of the article. I used to blog a lot but in the last decade I have preferred more long form writing. Now I use the leanpub.com [1] service so when I write, I get generated PDF/ePub/Kindle formats, and material is readable online as HTML/CSS. For me leanpub is a way to make content free and accessible, but people can pay if they want. The relatively few people who pay for my material have a large effect on what I decide to write about in the future or which writing projects to drop.
I consume the web mostly by following a few very interesting people on social media and following their links. As an author, my goal is to keep producing interesting enough material to be worth people's time reading.
As others have pointed out it's strictly worse than a static HTML site in many, many ways. At the same time though, it's a brilliant criticism of many of the worst aspects of the modern web.
Great article - so much depth and accuracy to this! I see a lot of discussion about the semantics of pdfs but I think those are missing the overarching theme here.
Feels like this is more about the fact that websites have become increasingly dynamic, unstable, unreliable, inconsistent, etc. - pdfs offer something like a book, static, stable, reliable and consistent.
Think about a book you can turn to a specific page no matter how many times you look at it and the print is the same, the information is the same, you can do the same action over and over again and get the same expected result.
Now imagine opening a book and you could have sworn that the chapter you wanted to reference was 11 but now it's 16 and the images are different, the examples are different, in fact the quote that you wanted to use for reference no longer exists in the book.
There's an insanity to this experience, but it's exactly what the web is like: a book that is constantly changing, being upended, even disappearing entirely. I could have sworn I had bought that book on discrete mathematics; how could it be gone? Oh, that's right, the server hosting the site is powered off. The book no longer even exists.
This is true, but I do think a PDF is just conceptually simpler and requires less technical knowledge. Especially in a situation where technical users are scarce.
IMO most people have a mental model of a PDF as being a digital document, whereas a HTML file is somewhat more amorphous.
I use a terminal pager with PDFs quite frequently. It works surprisingly well. Even something you wouldn't expect, like a pay stub, renders fine in the terminal.
> PDFs are universally understood by most people and can be read on phones, desktops, laptops, and eBook readers.
PDFs need a proprietary app to use, most of which are loaded with spyware & trackers. I may be mistaken in this but MacOS/iOS are the only OSes I know of that read them natively? There's absolutely nothing universal about the format.
HTML is truly universal: not only does every OS come with a built in HTML viewer, but it's a plain text file. You can read the source using anything.
> Once you’ve downloaded a local PDF version of the site, there is no risk that it can be changed or removed by the host.
Once you've downloaded a local HTML version of the page, there's no risk that it can be changed or removed by the host. Yes, there are caveats to both: people can create PDFs with remote embeds or HTML sites with AJAX content, but both of these are the fault/responsibility of the individual author. It's as easy to make good downloadable HTML as downloadable PDF.
The so called "churn" is the responsibility of the individual HTML author. If you're making bad HTML, the fix is to start making good HTML. Not to switch to a closed inaccessible format.
PDF is an open format, with multiple FOSS reader implementations. You could argue that a subset of niche features can only be used in Acrobat Reader, but AR is far from the only PDF reader out there.
And the churn is part of the zeitgeist, not really a responsibility of anyone in particular. Individuals are suckered into it, companies are supplying it, and governments are allowing it. We're all part of it. Not new either: I'm hearing it since the 90s how the modern life is rushed, and that's just my limited experience.
I said it wasn't universal, which is somewhat different to the vague idea of being "open", and yes, PDF is technically an "open format" depending on how you define "open". The ISO 32000 spec. costs in the region of ~200 USD/EUR.
What that "openness" translates into in the real world is that there are zero non-Adobe viewers that support all of PDF's features, and even less PDF editors. The standard PDF editor costs ~200 USD/EUR (annual subscription).
This is before we even get into the nightmarish world of PDF parsing. Or PDF accessibility.
PDF is a great format if you're sending a document to someone for them to print immediately. It has no other valid uses imo.
I have done it with a couple of PHP libraries (fpdf and mpdf), but they are primitive, compared to desktop PDF generators. I know that you can use Java (never done that), or even...ugh...XSL (also never done that).
Most desktop operating systems offer print-to-PDF functionality. It was long an add-on on Windows, but that's really a historical accident / deliberate choice of that platform.
PDFs can be trivially created from Markdown or using LaTeX templates if you're looking for a programmatic solution. Pandoc and XeLaTeX are helpful, the Poppler libraries as well. Again, these are generally and widely available at no charge.
I set up my blog so that the page source would consist of the original markdown and as little markup as possible to make that render. You can read it with telnet and the experience isn't so much worse than using a browser.
(The actual part that makes this work is a pile of opaque javascript doing all sorts of nasty things at runtime, but such is the way of web pages in today's browsers, I don't worry too much about it).
In all browsers that I use that is only true if the server sends a Content-Disposition header with its value set to “attachment” (optionally with a file name), or maybe also in the case where the server specifies incorrect or unspecific Content-Type (such as simply “application/octet-stream” instead of “application/pdf”).
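For example, a response that forces the download prompt carries headers along these lines (the file name is illustrative):

    Content-Type: application/pdf
    Content-Disposition: attachment; filename="issue-0.pdf"

Drop the Content-Disposition header (or set it to "inline") and a browser with a built-in viewer will render the PDF in the tab instead.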
Maybe the author doesn't realize how difficult PDF is to work with. In PDF it's ambiguous whether any two spans of text belong together in the same sentence or paragraph. It can even be unclear where the spaces between words are. PDF also allows "optimizing" font usage in ways that make text unreadable without OCR-ing the custom font. The messy hacks go on and on.
OTOH it's totally possible to make a self-contained HTML page without using a JS framework of the day. It's going to be way easier to consume than a PDF.
I do realize how ugly PDFs are to work with (I wrote my own PDF/A generator for issue 2[2]). This is a Tagged PDF though, so you can extract text using standard tools.
To understand the mindset, have a read of the Gemini FAQ[0], specifically the answer to why not use a subset of HTML - and then read Issue 2[2] which is a hybrid Gemini+PDF polyglot, for people who don't like reading PDFs, which is apparently everyone on this thread :)
Issue 1[1] also moves beyond PDF, to try to address some of the accessibility shortcomings by (a) prepending the content as plain text, and (b) recording myself reading the whole thing out and arranging the file as a polyglot MP3 and PDF file that can be played in an audio player as well as viewed in a PDF reader as well as a text editor.
A mini-FAQ to address some points elsewhere in the thread:
* No, it's not going to replace your blog or the web in general.
* Yes, it's an experimental art project / longitudinal CTF forensics tournament / weirdo personal blog.
> The problem is that deciding upon a strictly limited subset of HTTP and HTML, slapping a label on it and calling it a day would do almost nothing to create a clearly demarcated space where people can go to consume only that kind of content in only that kind of way. It's impossible to know in advance whether what's on the other side of a https:// URL will be within the subset or outside it. It's very tedious to verify that a website claiming to use only the subset actually does, as many of the features we want to avoid are invisible (but not harmless!) to the user
But I don't really know that your PDF website doesn't use some evil invisible PDF feature.
And I have to use a special Gemini browser to access Gemini pages. (Since an HTTPS bridge misses the point)
So why not use Dillo as my "Sane subset of HTML"? It is not hard to hand-write HTML that looks great in Lynx, Dillo, and Firefox.
> It is not hard to hand-write HTML that looks great in Lynx, Dillo, and Firefox.
Actually, it is. I love Dillo, but it's very limited. I like to make my images "fluid" using the max-width and max-height CSS properties, and Dillo will not support those in any foreseeable future.
> would do almost nothing to create a clearly demarcated space
How do you create that demarcated space where PDF/A, PDF 2.0, and all other PDF versions can be mingled together, and there's no easy way to distinguish them?
I don't like reading PDFs and probably wouldn't read much of your website like that... but I appreciate the intervention drawing our attention to the advantages of PDFs in the disadvantaged present environment, which I think are real and worth thinking about. It seems almost like an artistic project. I'm not mad at you, and am not sure what makes some people seem to be so mad here (probably means you were successful at something)... but I'm still not gonna read it, PDFs are a mess to read!
I've spent entirely too much time "printing" sites and articles to PDF to save them to read or reference later. Your PDF style was perfect! No need to fuss with anything, just save it!
I think the idea of PDFs opens up many new possibilities, and your work is quite an eye opener. Design is largely missing from websites - it’s the same design over and over when it comes to optimizing for clicks.
Designers would thrive in a PDF environment instead of handing their designs over to implementation as it is now.
Maybe PDF is just the beginning, and maybe a similar format can be thought up that addresses some of the concerns expressed here, so we can move over in time.
Case in point: copy-pasting a paragraph from his PDF website adds line breaks everywhere. It also loses formatting (bold/italics), and the footnote superscript doesn't translate:
PDF is an open standard, which is freely available2, and stable. It has a
version number and many interoperable implementations including
free and open source readers and editors.
I think ease of copy-pasting is one of the coolest things about the document-centric roots of the web (along with the back button and hyperlinks; in other words, hypertext rules), although the modern web does break it (along with the back button and hyperlinks) in many places, so I can see where he is coming from. PDFs aren't the answer, though.
> OTOH it's totally possible to make a self-contained HTML page without using a JS framework of the day.
I'm basically in agreement, but the author has a good point that PDF is obviously self-contained and self-contained HTML pages are not necessarily distinguishable from those that aren't. Perhaps we might have to revisit MHTML or embrace Web bundles as an alternative to PDF.
On the other hand, there's nothing stopping you from using a double-barrelled file extension for denoting this sort of thing, e.g. "memex-opus.pub.html"; so long as it ends with something recognizable, double-clicking should still open it in the browser across all the usual platforms, AFAIK.
(I'm fond of using "xyzzy.app.htm" myself to take advantage of this trick for distributing simple, self-contained programs that are designed to run in the browser.)
Wait, why?!? When does it render? Who's supposed to have a js engine to do that? What version? How does it load dependencies? Is HTML and DOM carried along with it? So many questions.
Why? Because scripting is useful. A big use of PDFs is translating paper forms into digital forms without needing to make a web app out of them. JS is used for client-side validation, the same reason it was put into browsers. Acrobat can handle this along with many other features that most PDF readers can't handle properly.
Basically in the PDF world, Acrobat Reader is Chrome and everything else is, like, Konqueror or something. Don't be fooled into thinking PDF is a small spec. It's not.
Dependencies? Hah, no such luck. You're stuck with ES5 and Adobe's crufty JS library. There is no HTML and DOM, there are however some pretty thorough PDF document bindings.
> it's totally possible to make a self-contained HTML page without using a JS framework of the day. It's going to be way easier to consume than a PDF.
Completely agree. For instance, NASA's APOD site[1] is a good example of something that'd be nontrivial using both an offline PDF and modern lightweight alternatives like Gemini, but works really well even without fancy modern design. Under 300kB including the image (HTML's under 6 kB) before gzipping.
The author addresses this: “We choose to switch to PDF in this decade, not because it easy, but because it is hard”
– John F. Warnock, September 12th 1962
The author is obviously making a statement, exploring ideas... not searching for an actual solution to his use case.
Yeah, it's kinda embarrassing that the one quote that gets pulled out in the HN commentary is the one that contains a typo. It's OK: Issue 1[0] contains a patch to fix the issue.
Please somebody bake an icon into the browser that turns green when websites are lightweight and content-only and make it affect Google rankings.
We don’t need PDF sites, we need incentives for publishing acceptable websites.
Side note: I’d honestly love for the government to step in and outright outlaw some obvious and intentional dark patterns (example: California unsubscribe law)
Is that actually an internal Google goal? If so, dear god, no wonder they are so willing to sacrifice the long term health of the internet in return for short term hypergrowth. No company Google's size can grow that fast without some serious dark patterns and user abuse.
You don't end up with that level of growth year over year for 20 years straight by accident. It is an unwritten assumption that missing 20% growth is a fail. I worked at Google almost 10 years and watched the dog and pony show (aka TGIF) from the inside. The real story is on the quarterly financial reports.
I've been doing something similar for 4 years now. I converted my niche website into a monthly magazine, that is released as a PDF (and also uploaded to Issuu).
It has its good sides and bad sides. People will download the PDF every month when there is a new issue, but you don't know if they read it, how much time they spend on it, etc. You won't appear in Google results as you would if you posted the articles as HTML, etc.
Based on my experience, I just keep doing it as an experiment and because I enjoy saying I run a digital magazine, but the truth is that there are no real advantages to it.
I find this to be a super interesting response. When I settled into my current website design, I ended up basically writing an article for the homepage. I'm not a designer by any stretch, and it was the most attractive homepage I could make, and I still really like it. I used a very similar workflow (and continue to for articles) to the papers I wrote in college, and would really only take one more step to get that to final pdf state.
I'm torn between leaning into the static nature of the site and implementing the wiki I've been thinking about making.
We already have a wildly popular website where all the main content is in the form of PDFs. It’s https://arxiv.org/. PDF is what you use when your document needs to have a predictable layout. This is especially important if it contains math, complex tables, or any elements where meaning is carried by positioning on the page. This can include aesthetic meaning, as in some forms of poetry that need to be laid out in a particular way.
There are several which at least strongly resemble that remark.
Project Gutenberg and the Internet Archive's text archives (along with numerous other document-oriented sites, several of the samizdat variety) offer content in PDF and other document-oriented, offline, downloadable formats.
Wikipedia has a "save to PDF" link on each article (that seems to work through the browser's capabilities, if any, not all browsers support this). The sister Mediawiki site Wikisource offers ePub downloads.
For longer-form content, PDF, DJVU, and a handful of other formats (arguably ePub) are at least reasonably popular.
> PDFs used to be unreadable on small screens, but now you can reflowthem.
(Pasted verbatim, retaining the missing space.)
I don't see this feature in Firefox's viewer, or the default Android one. Can anyone recommend a FOSS PDF viewer that has it? (It must be FOSS, otherwise the point about using PDF to avoid tracking is lost.)
Book Reader can reflow PDFs. It is very simple, which I like. But it adds any PDF you open to the library when you open the app, which I find only slightly annoying for non-books.
I found "PDFs are files" kind of compelling. Perhaps this was a flaw of the original www concept. Web pages were always technically files & documents, but this was always abstracted away from userland. "Save webpage" was never a core feature. This did disempower users.
PDFs are downloaded, saved, emailed around. They can also be linked to. Userland maintains a closer relationship with what's going on. A typical user knows that you can have a copy of a file, which may or may not be identical to the online one. The WWW, from its initial version, was mysterious. The transition from the model of requesting files from a server by clicking a link, to a programmatically generated stream of code executed in your browser, happened below the typical user's perspective.
The web has obviously gained a lot, but has also lost something.
I've definitely used saved webpages a lot. When we had dialup email only, my dad would drive to the library with a flash drive and download Web pages to bring home and read. It was great. Of course, it's even greater now that I can load it fresh even faster.
This was a great read. I'm sympathetic! I've had a website (Wordpress) for almost 10 years, but have stopped adding stuff to it lately, because I'm sick of the formatting changing on pages! I look again at a page that used to look great, now the vertical spacing is wrong, or tables have gone out of shape, or the font has changed to something awful. Maybe it's wordpress, maybe it's my bad css/html skills, maybe something else, not sure. I picked up LaTeX skills about 5 years ago and have just been making lovely PDF books of everything I'm into. And they stay just the way I made them. Kind of a shame though, no-one else gets to see them. Yet.
PDF is not a web format, and you're wasting effort trying to shoehorn print content and a print format into display on the web. Just use HTML and don't update it; it's probably easier.
It's pretty amazing that the basic HTML that I learned 20 years ago still works - it even displays fine on devices like tablets and phones that did not even exist 20 years ago. I understand the author's sentiment but PDF is an overreaction. Just write static boring HTML.
> it even displays fine on devices like tablets and phones that did not even exist 20 years ago
It would display perfectly if mobile browsers didn't have broken defaults (to work around broken websites) that you need to disable using <meta name="viewport" content="width=device-width, initial-scale=1">.
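To make that concrete, here is a minimal sketch of the kind of boring static page being described, with the viewport fix applied; nothing else is required for it to render sensibly on phones:

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <!-- Opt out of the legacy 980px-desktop emulation on mobile -->
      <meta name="viewport" content="width=device-width, initial-scale=1">
      <title>Plain static page</title>
    </head>
    <body>
      <h1>Still readable two decades on</h1>
      <p>No scripts, no frameworks, no build step.</p>
    </body>
    </html>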
Indeed, there's a lot of irony packed into the first page:
Featured is a quote from LWN indicting the "software industry" and its "brittle dependencies". What's ironic about this? It's squarely about the parts of the software industry that deal in things that are _not_ meant to be painted in the browser.
If you want a solution to the (perceived) churn, it's funnily enough right in the quote from Mark Pilgrim: "I've migrated to HTML 4". HTML is almost certainly not going to end up drifting in such a way that DJB's qhasm bibliography page[1] is ever going to break. HTML and the Web standards in general are, with extremely rare exceptions, cumulative. It's pretty frightening how many technical people don't understand this; the Web is intentionally engineered to serve as "the infrastructure for handling humanity's publishing needs indefinitely"[2]. More frightening is that the biggest threat to this are people like the author here who treat the Web as if it's like any other thing that the computing industry puts out—i.e., already perennially broken. This is dangerous because it anachronistically cedes power to folks who'd try to argue at some point in the future that the things about the Web that they'd like to break (and might be in a position to break e.g. due to browser monopoly) are justified and no big deal, really.
The author goes on to call out the Web ("of rubbish") as "user-hostile". Shortly afterward, he or she writes that "PDF makes a stand against the churn". More accurately, PDF makes a stand against the user, by prioritizing authors' creative whims over the reader's needs. This happens again later in their remarks about PDFs being page-oriented: "you are fundamentally not in control of the reading experience." The "you" here is not you, the actual reader. The control they refer to is, once again, the author's.
You get other poor arguments—that PDFs are "offlineable" "files" that can be distributed "decentralized", none of which are accurate criticisms against what HTML lacks—unless those Java documentation zipballs that seemingly every university student enrolled in a CS program in the early 2000s was made to download are a collective hallucination.
And it gets worse from there. Cute stunt to grab attention and all, but the arguments are fundamentally bankrupt.
It's not a browser format (though browsers can render it), but that isn't the same as not being a web format. The web is just the ability to retrieve files from other people's servers, files that may themselves reference other files on yet other people's servers. As long as a file format supports hyperlinks, it's suitable for the web. If you don't care about being able to actually click the hyperlink to activate your desktop system's URI scheme handler, then even plain text works fine.
It would be better if they just used that subset and just published it directly instead of needlessly repackaging it, but if that's what was meant then sure. Maybe we need a better name for simple, semantic HTML and basic CSS.
The point of it is to be a self-contained package. You still need hardware to read it, but not a server. In theory at least, once you have it, it's yours. (of course the commerical ebook vendors are trying to spoil that.)
EPUB is an under-appreciated format that I think can serve as short-to-mid-term storage for human knowledge. It can reasonably re-flow itself when necessary, with no language run-time required, just full Unicode support, at least at the level of the time the file was published.
That's the Internet of knowledge I'd love to see: things organized in EPUB's, searchable and downloadable.
PDF is very far from an ideal format for the today world of different-sized screens. It is a horrible experience on mobile and even worse on eInk pocket books. I would rather advocate making everything available in ePub. Or even better - FB2, it is an easy to grok/implement (designed with manual authoring, simple scripted processing and low-end devices in mind) single-xml structure decoupling the content from the view even more. I often convert ePubs to FB2 (with Pandoc and Calibre) to make PocketBook render them in its native fonts (which always are better) rather than in the font specified in the ePub.
I would also mention that the text within PDFs often is not machine-readable (you copy-paste it and get text without spaces, with additional spaces, or complete garbage), but I believe this is easily avoidable if you bake PDFs the proper way.
I could also suggest publishing everything in Markdown (with images embedded in a Base64 section in the bottom) but this doesn't seem practical because browsers, book-reading apps and eInk devices don't support nice rendering of them directly.
> “But how can I implement shiny whizz-bang features that will engage readers and drive conversions?!” You can’t. PDF is boring
It's not. It supports JavaScript, embedded video and other kinds of active content. Sadly.
One problem I noticed on mobile, is that if I click on a link in the PDF and visit another page, and then try to traverse back, it takes me to the first page in the PDF, rather than the page I linked from.
I honestly can't believe all the praise for HTML and web on HN in the face of this awesome critique. I hugely appreciate the love for actual files.
>• PDFs are decentralised. You may have obtained this PDF from a website, or maybe not! Self-contained static files are liberating! They stand alone and are not dependent on being hosted by any particular web server or under any particular domain. You can publish to one host or to a thousand hosts or to none, and it maintains its identity through content-addressing, not through the blessing of a distributor.
This seems to have gotten lost in the offense everyone has taken over the choice to not use 'simple HTML', despite the document's clear reasoning that to do even that would embed the content deep in the 'urban web'. All of these simple-complex propositions about making some subset language or automating document flows are missing the point entirely.
> You can publish to one host or to a thousand hosts or to none, and it maintains its identity through content-addressing, not through the blessing of a distributor.
It kind of seems like you're describing IPFS, except with worse content-addressing guarantees. The vast majority of your users will never check to see if a PDF's content actually matches its content address.
> All of these simple-complex propositions about making some subset language or automating document flows are missing the point entirely.
Are they? It's really not that hard to build a self-contained HTML file, and to re-emphasize, signed PDFs and signed HTML files are about the same level of accessibility to most users. Web browsers don't really handle either, if you want those guarantees you need to use a protocol/technology with better support right from the start.
Also to be clear, despite the author's argument that PDFs can be self-contained, no browser guarantees that, and there's no way for me to tell if the PDF is self contained when I click on it in Firefox unless I download it and check it myself offline or in a viewer that guarantees it won't make network requests.
Nothing online that I'm aware of forces authors to use PDF/A, so when I download a PDF, I don't know what I'm getting. It's not actually the magical, re-hostable world that the author claims.
I'm not sure that people are missing the author's point so much as they're saying the author is making claims about the portability of PDFs that aren't necessarily accurate. Yes, it would be good to have better self-contained guarantees about some web-content, but I'm not sure PDFs actually supply any of those guarantees.
"But stable standards are incredibly important.They allow software, at least in theory, to be finished. Why is it importantthat software be finished? Because it gives us hope that we might end thechurn and fix all the bugs! I want to use software whose version number is7
1.0. I want to use software whose every line of code has been studied,analysed, optimised and punishingly tested. I want every component andsubcomponent and every interaction and every configuration to beexquisitely documented, and taught in courses, and painstakinglydeconstructed and proven sound"
Sorry, not possible. Never, ever. Software does not work like that. Bugs will never all be fixed (if they could be, the software in question would have become obsolete long ago). By the way, this is what you get when you try to copy-paste text from this "website".
I read it on my phone. I then clicked an external link at the end and then hit my browser back button. I had to wait for the PDF to re-load and was unhappy when I found myself back at the top of the document.
"PDFs are self contained, and can't be broken by an API going down"
Is directly broken by "PDFs are part of the web, and part of the content can be included by reference to a webpage".
If that webpage goes down, that link is broken.
That decentralized bit still needs to conform to broken copyright laws too.
You can't just download a pdf then rehost it on your own without a license to do so
There's also a big difference between a city and the modern web. We own the infrastructure in a city, whereas rich people own it on the web.
Rather than a city, the web is more like a company town. I don't think that's any different for pdfs either. The distribution is still coming from a web server owned by a company -- the real response is self hosting of your stuff, and self hosting by your friends for their stuff. The file format doesn't make it self hosted
PDF-fing everything on your website is one way to go about it...
I personally use the service at printfriendly [1] and Arc90's Readability to make un-crufted and readable PDF files of web content that is worth saving for the coming decades.
Added bonus: by saving these very small files on my system, I can press Command + Spacebar and easily search through my multiple decades of interesting files...
There are good points here, but I think the author slightly undermines his message because the layout and typography of this particular PDF is so poor. Probably because it “was written in the world’s greatest web authoring tool: LibreOffice Writer”.
In other words, one advantage of PDF is that free authoring tools such as the TeX family can create typographically beautiful results that are nearly impossible to achieve with HTML, but he leaves that on the table.
I cannot tell if this is satirical or not. Assuming it is not, every single “pro” of PDFs is just plain incorrect except for the one about being “self-contained” to which I point to https://gwern.net as a good example of self-contained HTML. Gwern archives all the pages he references so that they are always available.
In the case this is satire, I applaud it because I did get a few chuckles.
Useless rant. His choice won't change the rest of the internet and for his site he could easily write lean html without all the stuff he complains about.
* PDFs are files. We must not lose sight of the fact that files are a basic freedom.
This seems like the core belief of the article. And it's at odds with the nature of the web.
In the beginning, the web was a network of devices transmitting files with addressable locations on the device, creating a more or less 1:1 relationship between the devices and the web - the devices WERE the web.
But this inevitably faded as information wants to be... fast and it became easier to whip small data packets around describing state, not files.
I agree with the Unixy belief - files are freedom. But trying to model the entire web on those files is fighting gravity. They're not going anywhere. They just have to travel through the Web Soup sometimes now.
All the technologies enabling a global network of file sharing are still there, the author is just bemoaning today's lingua franca. (json?) And perhaps there is a fear that we will lose sight of "device-based computing" / file ownership.
It has political overtones too... individualism vs collectivism. The web is a very interesting place to hash through those ideas in code before we hash through them in legislation.
The point about the size of the W3C spec is hilarious, but I wonder how much of that hundred million plus words is actually necessary to implement the parts of the spec that people use?
Surely it would be possible to create a spec that captured the most useful subset of HTML and CSS functionality.
In any case if the spec really is that huge the W3C should be written off. Any organization that produces a spec like that is worthless.
Well, sort of. Can't HTML contain script tags with external references (xmlHttpRequest or any async fetch) that a simple crawler/browser may not save to disk?
They could, but if he's the one creating the file, he can choose. And if he's just hosting the file, I'm sure there are tools that will inline all the external resources.
When I click on the submitted link with Chrome on Android, it asks me if I want to redownload "0.pdf". Such a confusing question. If I pick the wrong answer, I end up with some restaurant menu I must have looked at months ago, not what the original poster intended.
So for non-confusing real-world UX I'd recommend extra care with file names if you want to go PDF only.
The whole post boils down to: "HTML is bad because it has scope creep and people use it for bad things, but PDF is good because I made this particular document in a way I like for a use case I prefer."
You do you, man! Some people run Archie servers, some people create a directory full of PDFs.
Thanks. I am starting a self-hosted blog about design fundamentals, best practices, etc. Using only PDF is not a solution for me. Combining minimalistic web-site design with PDF/ePub will suit me well.
I like your approach as a statement against web "pollution".
I don't agree with author's choices (yes, I'm disciplined enough not to add irrelevant elements to my content), but it's really sad that things got to the point where someone actually suggests PDF as an alternative to the web.
We are drowning in churn and noise.
I am fighting by switching this site to PDF
I find the "actual" title unhelpful, unenlightening, uninformative, and uninviting, which I why I originally chose text taken directly from the page, so people would know what it was about before taking the time to click and read.
I know why the HN mods have changed it to "Deurbanising the Web", but I wish they'd keep more informative titles, especially when taken from the article in question.
While I agree with the thesis, I believe it is possible to do things like this with vanilla HTML. For example, I created a search engine that is just a static HTML page: www.locserendipity.com
I'm not old enough to remember Gopher being "the internet" but I have browsed a few retro sites that still run it. I wouldn't mind seeing some slightly upgraded gopher-like protocol that allowed for embedding images and maybe form submissions (without any scripting). Most of what I want to do online is read, and I'd be more than happy for everything to come with a standardized look and feel rather than whatever scroll jacking weirdo design every website feels like having.
Why not just extremely simple, plain HTML? No frameworks, not even CSS. In fact, you could make your life even simpler by using Markdown files and having the browser convert them to HTML in real time with a single JS library (there are a few; I am not promoting any one in particular), so it doesn't even require a "back end"! Plain HTML, while not having all the "portable" attributes of PDF, is still pretty darn robust, and most browsers handle printing (or conversion to PDF) quite well.
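A sketch of that approach, using "marked" purely as one illustrative library choice (the commenter names none), with a hypothetical post.md served from the same directory:

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1">
      <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
    </head>
    <body>
      <main id="content">Loading…</main>
      <script>
        // Fetch the raw Markdown and convert it to HTML in the browser,
        // so the "back end" is nothing but static file hosting.
        fetch('post.md')
          .then(function (r) { return r.text(); })
          .then(function (md) {
            document.getElementById('content').innerHTML = marked.parse(md);
          });
      </script>
    </body>
    </html>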
Some of the listed benefits don't apply. Notably paginated (PDF) vs. scrolled navigation, but also features such as formulae displays and specific typesetting / layout elements, in-page bookmarks, highlighting, and notes.
For shorter documents that's not much of a problem. For anything much over ~chapter length (about 20 pages or 10,000 words), navigation within a single HTML page becomes painful, and it does so well below that level on smaller devices.
This experiment is interesting, but not so bold or novel when you consider the culture around making zines (small, DIY, often quirky magazines). The creativity there is amazing and medium-wise it's often "hybrid" (print-oriented but shared online).
Naming or framing things in a difficult or obtuse way can be a good way to limit your audience. However, if it works others will follow and it will no longer be effective.
I had a similar experience with a Meetup I once hosted which I specifically put in a location that was difficult (but admittedly becoming trendy). It worked for a bit but eventually attracted the crowd I was trying to alienate.
Most people think that PDFs have to be letter or A4 size, but you can make them at A7 or A8 for a phone screen, or for that matter, any size you want.
PDF is size-agnostic. There's nothing to stop you from creating documents the size of a phone screen. So you could put the phone-screen-sized PDF at m.mysite.com, and the small-screen illegibility complaint is solved.
I like the idea of keeping HTML's document-centric original design, but accessing the documents using p2p protocols (instead of the client-server model used on the web).
Why does it seem like almost everyone doesn't realize that PDFs can easily be made to support all the horrors we see in HTML? No, it's fucking well not impossible -- or even notably difficult -- to jam some malicious dynamic code into a PDF. The only reason a period of widespread fear about PDF viruses hasn't developed as it has for websites spreading malicious code is the fact that websites got much more widely adopted. PDFs have been used as malicious code vectors before, and replacing HTML with PDFs would only result in PDFs being the new common vector for the same problem, with at least the same scale and intensity.
This only seems like a solution if you don't know what PDFs can do -- and, by the way, sometimes pagination is bad, especially static (non-reflow) pagination.
EDIT:
Let's make this clearer.
You can actually embed an entire JavaScript application in a PDF. Tell me again how PDFs somehow prevent the problem of dynamic pages on the web. All using PDFs instead of HTML pages would do is wrap the horrors of the web in forms that are generally more hostile to various viewing contexts for the less harmful use cases (e.g. static pages suddenly being harder to read in some contexts with PDFs than with HTML pages).
"Files are a basic human freedom" - that definitely resonates with me.
There's an assortment of trade-offs though. In particular, linking between files breaks if you ever want to move or rename a file. Also, by self-encapsulating every file, you end up using space less efficiently.
I don't consider using PDF for this purpose a good idea. It would be better to have static HTML pages, with a reference to an EPUB with the same content. One can have both generated from the same source with a static site generator.
It also wouldn't be upvoted on HN. I agree that a static page generator would have been a much more fitting technology (for example). But sometimes you gotta sacrifice that for visibility.
The author has a point in that many people want an online presence, but the way they imagine it is more akin to a pamphlet or poster than a hyperlinked website.
If that is the case, then pdf or a resizable image makes sense.
I've always wondered why some sites can serve PDFs that my browser (Firefox) can view inline (my preferred method), rather than forcing me to download the file and open it in a separate application.
This sounds like the Creative Director I worked with, ca. 1998, who bemoaned that he couldn't have pixel-perfect layouts over a wide variety of devices/browsers/operating systems.
Using xelatex, I got only the text, no pushbutton. Using pdflatex, I got a pushbutton, but it was not a hyperlink, just an image. What engine do you use to get this to work?
Comments here are disappointing. The problem with any of this is getting any momentum, so given the level of pushback, pdf might not be it. Having to be a specific version of pdf probably doesn't help. Creating new spec is hopeless as well unless you are someone very famous and can manage to get enough people to adopt. There's text/markdown mediatype which can also serve this purpose but it needs a boost from someone with some street cred. People work in predictable ways and this is a political project.
For my part, I expressed bafflement because the end result seems worse than the starting point in almost every way, including those that the author was complaining about the web for.
This is terrible for accessibility. Please just use semantic HTML and your web will be usable on 10yo devices and unknown devices 10 years in the future.
It's ironic that the author is pitching for PDF, and yet he is using a plethora of hyperlinks.
The big "invention" of the Web was linking pages together. That's what made it great. That's what created "Google" in the first place. Links in a PDF are supposed to take you to a browser or open a different PDF file?
PDF is a step back. If you are angry about the overblown size of JavaScript and resources consumption, use a simple static website. It doesn't get easier than that.
I guess by modern standards this load time is acceptable, but when you argue that PDFs are a way to move forward, you're competing with HTML 4/5. And by that standard:
- Crud this website is so slow. Unacceptably slow. If your technology stack is spending 10 seconds just to fetch and render 13 pages of large-screen text, then either you're doing something wrong or it's a bad technology stack. That load time alone should kill this idea.
- There's no way for me to turn off images. This is the opposite of a client-respecting webpage, the only way you could make it worse is by rendering to Canvas or shipping me a PNG. My mobile browser doesn't fetch fonts by default. You're overriding my choice to do that.
- Mobile? Reflow? Responsive design? Adjustable font sizes? The author kind of offhandedly says that PDFs can do reflow right now, but how many clients actually support that? Does the PDF format handle this by default?
- Saying "you can technically make PDF accessible" is exactly the same as saying "you can technically use just a subset of HTML." It's the same argument. Nobody does it, PDFs are generally hostile to accessibility, and there's no way to signal that a PDF is accessible or enforce it as a community standard.
So, the much bigger question: what's wrong with Gemini[0]? I've been critical of Gemini in the past on multiple fronts, but if you are in this space where you want to burn everything down and make your blog static, Gemini really does seem to solve every problem that the author has, except better. It's also trivial to proxy Gemini documents or statically re-render them to HTML, which makes them accessible to people outside the community. And by default, they're both pretty accessible to screen readers, and much more efficient than what the author is proposing.
The author argues that using static HTML wouldn't be good enough because there's no standard that forces you to exclude Javascript. Then they point to PDF/A, which is not a standard that is enforced by most browser PDF viewers. To me, this argument isn't any different from telling website authors to choose not to use Javascript, what is going to force anyone to use PDF/A? Every web browser PDF reader supports Javascript. NoScript support in Firefox is better than the controls/extensions for disabling PDF scripting.
And Gemini is right there: for the most part it's actually working today. So I just don't get it. Why pick a technology that's tangibly worse than the web on (and I mean this quite literally) almost every single axis and every single metric, when you could instead switch to a markup language that actually does have use-cases, that does simplify deployment and blogging in some instances, that does have a real community, that does have some real advantages over HTML, that does have some real momentum behind it, and that doesn't disrespect my choices about what fonts/images I want to download?
While this may be extreme, I do notice that it is becoming harder and harder to print webpages to PDF/paper. Is there a good approach for this besides the standard print dialog?
For sites without print-specific media queries (so basically all websites) I use dev tools to delete all the DOM nodes I don’t want to appear in print.
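For authors, the fix costs only a few lines. A sketch of a print-specific stylesheet (the class names are hypothetical placeholders for whatever chrome a given site has):

    <style>
      @media print {
        /* Hide navigation chrome and widgets when printing / saving to PDF */
        nav, header, footer, aside, .comments, .share-buttons { display: none; }
        /* Let text use the full page width at a print-friendly size */
        body { max-width: none; font-size: 11pt; }
      }
    </style>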
Definitely less freedom. In HTML, the reader can change the size of the text or even the font, and the text will reflow so you don't need to scroll horizontally to read each line. How do you do that with a PDF?
Long sympathetic with the Jakob Nielsen / PDF-bad camp ... I've had some recent changes of heart. Not a full convert, but PDF is often superior to HTML, especially for longer-form and complex noninteractive content.
Books are an artefact whose design has evolved over the centuries to accommodate human-scale ergonomics: font size, paper and ink colour, words per line, lines per page, pages per volume, overall weight and dimensions. Standard-sized books are all larger than the largest current mobile phones, with diagonal measures of about 9--12 inches. There are smaller and larger books, but those are compromises either to portability (pocketbooks) or to large-format resolution and detail ("coffee table" books, atlases, and the like). Magazines tend to run even larger (about 13"), broadsheet newspapers larger yet. Most criticisms of PDFs are actually criticisms of the devices and displays used to read them.
Poor resolution, incorrect aspect ratios, and small display sizes (especially mobile devices) are the key problems.
Reading PDFs on a tablet, especially a larger e-ink device, is a game-changer. I now actively avoid HTML, or at least launch it in a browser designed with e-ink in mind (EInkBro: https://github.com/plateaukao/browser). Otherwise, my large (13.3") high-DPI (200+) B&W ebook reader is an excellent long-form immersive reading tool.
The key requirement of a mobile phone is that it fit in a pocket, handbag, or purse. They are too small for reading, and aren't designed for that purpose. Current devices feature screen sizes of roughly 5--7 inches (diagonal measurement). At the lower end, that's smaller than a 3x5 index card (6"), and the largest barely the size of a 4x6 card (7").
On desktops, the first display that offered what I felt was a truly comfortable two-pages-up PDF reading experience was the 27" Retina iMac. Its 5K display (itself an oddball size) suits document work well. Even not fully maximised, most books are highly readable (leaving screen space for other tasks), and at full maximisation, details really stand out, especially from scans of historical editions. (Such details aren't always relevant, but often are.)
PDF also provides capabilities HTML either cannot or does not by default (and few seem to be persuaded to offer), especially pagination, formulae, and a spatially-persistent layout (if you have a spatial memory, this is very valuable).
PDFs can, though often do not, include internal navigation (chapters, sections, etc.), search (if full text is included), and most critically, metadata (at a minimum: author, title, date, and publisher; see the full Dublin Core metadata specification for what should be required).
PDFs can also be published directly to device sizes (or to a set of form factors encompassing typical devices), as several others note.
Some of the issues aren't entirely intrinsic, and my feeling is that wider use of PDFs for online content would lead to a proliferation of PDF annoyances to match present-day Web annoyances. In each case, the fundamental problem is that publishers rather than readers have final say over presentation. An alternative, of distributing raw minimum markup and formatting that to user specifications following a set number of templates ... might help.
It's ironic that the article here embodies a number of PDF annoyances:
- The shaded background renders quite poorly on a B&W e-ink reader (though can be eliminated with a watermark-removal setting).
- The filename provides no clues as to contents or provenance, and is likely to collide with other content.
- I'm a fan of serif fonts, not sans serif, for high-DPI reading.
- Internal and external hyperlink support is ... variable. At times utterly missing, at others, inconsistent or inconvenient.
- PDFs are not trivially directly editable, which means neither authors nor readers can correct errors or address issues.
- Many PDFs lack internal structure, even where the documents they encompass have it. The number of books lacking PDF table-of-contents support is ... large.
- Metadata standards and practices are abysmal. See the Dublin Core standards.
- Naming conventions similarly. "Report", "Resume", "Project", or "0.pdf" are names which should never be used. Describe author, content, and date, as a minimum, if possible.
The sad thing is, this is what the web was SUPPOSED to be, more or less: a series of static documents, text and images. The only interactivity (setting aside the occasional CGI forms) was that you could click certain images or text and go to other static documents. Documents linked to documents.
Then everyone lost their minds and decided webpages needed to be PROGRAMS and we've been paying the price ever since.
While I appreciate the sentiment, I don't think PDF is the way, at least in the way you're currently doing it. PDF may be supported by browsers, but they're not intended for it; it's a secondary feature. Same for search engines. Same for mobile.
Most browsers have Print to PDF. If you want people to be able to download an immutable version of your content, then just have a simple static version of your page with a valid print css, better yet, leave everything default.
There are also other lightweight alternatives. The Gopher protocol has a small, but disturbed following : http://gopher.muffinlabs.com/gopher.floodgap.com (you can actually use netcat as your gopher client). Gemini is a more modern gopher-inspired protocol https://gemini.circumlunar.space/. Personally, I'd be pleased to see a text-first approach gain adoption. I don't think anyone looks at the thick-client model browsers have evolved into and sees an optimal solution.
I think evangelistic energy should probably be directed at complaining to organizations that share content through JS-framework monstrosities. Getting rank-and-file web-devs excited about lean websites doesn't hurt, but clients and CTOs have real decision making power.
THANK YOU! HTML semantics are a trap: just enough to make you think something is there, but anemic enough to be a giant exercise in bikeshedding. Ask yourself this: if HTML semantics were adequate, why do we have ARIA and 90 different microformats?
Other than that, I read the article expecting to be annoyed by the PDF presentation but was pleasantly surprised by how it read just like I would want a content page to read. My only complaint is that browsers (at least Brave) do not preserve scroll position in PDFs. If the browsers fix that the author may be onto something here.
Sounds to me like ePub would fit better. It's designed for reflow and it's built out of a subset of HTML. Worst case, the contents of the file can be expanded.